
Reduce allocations + copies due to rebuffering in LocalDataResponse row serialization #4707

Open
cscotta wants to merge 2 commits into apache:trunk from cscotta:CASSANDRA-21285

Conversation


@cscotta cscotta commented Apr 5, 2026

Contributor

netudima commented Apr 5, 2026

the only potential concern with the approach itself: if we have frequent enough heavy selects from one table and light selects from other tables (or even the same table via different queries), then we allocate more for every request as a tradeoff. But that is probably unavoidable... maybe we can cap the value to limit the impact in such cases.

Author

cscotta commented Apr 5, 2026

Thanks, I was thinking about this a bit.

The decay is pretty quick (it's based on the last 1000 LocalDataResponses generated), which probably helps. But a workload that's 99.9% 128-byte responses and 0.1% 10 MB responses would still be degenerate. If the table name/ID were available in this scope, that would make it easier to vary the estimate per table - but still not perfect.

One simple approach might be to track a histogram of generated response sizes and calculate whether the old or new behavior would be preferable. But I don't want to go overboard and introduce stats tracking/comparison in every response generation.

I may look at deploying this for a variety of workloads with additional instrumentation that measures "better or worse" as well. Could also feature-flag it.

Interested in your + others' thoughts on this.
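To make the degenerate-workload concern concrete, here is a minimal sketch of an average over a sliding window of the last 1000 response sizes. The class name and mechanics are illustrative only, not the actual metric implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: an average over a sliding window of the last N
// response sizes, standing in for the real decaying metric.
public final class SlidingWindowAverage
{
    private final long[] window;
    private final AtomicLong count = new AtomicLong();

    public SlidingWindowAverage(int size)
    {
        this.window = new long[size];
    }

    // Record a response size, overwriting the oldest slot once full.
    public void record(long responseSize)
    {
        int slot = (int) (count.getAndIncrement() % window.length);
        window[slot] = responseSize;
    }

    // Average over the samples recorded so far (NaN if none).
    public double average()
    {
        long n = Math.min(count.get(), window.length);
        if (n == 0)
            return Double.NaN;
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += window[i];
        return (double) sum / n;
    }
}
```

With 999 samples of 128 bytes and a single 10 MB sample in the window, the average lands around 10 KiB - roughly 80x the typical response size, which is the degenerate case described above.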

Contributor

netudima commented Apr 5, 2026

actually, table name/ID could be tricky here because you would have to implement cleanup (or forget) logic to avoid leaking memory when a table is dropped.
I see the following options:

  1. the simplest option is to put an upper limit on the value - min(limit, metricValue) - to reduce the impact if the metric value is too high. If the configurable limit is non-positive, we can disable the metric usage and fall back to the old behaviour, just in case.
  2. a slightly different way is to apply the limit to the input of the metric, not its output - estimatedResponseBytes.update(min(limit, currentResponseSize)) - to reduce the impact of very large responses on the average calculation.
  3. use a median rather than an average to avoid the impact of size spikes, by introducing a histogram here; but histogram value retrieval is more expensive, and we would have to cache the values as we do for the speculative retry threshold.
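A minimal sketch of options 1 and 2 above, using a simple exponential moving average to stand in for the real decaying metric. All names here (ResponseSizeEstimator, the limit fields) are illustrative assumptions, not actual Cassandra identifiers:

```java
// Illustrative sketch of options 1 and 2; not actual Cassandra code.
public final class ResponseSizeEstimator
{
    private final long maxTrackedResponseSize; // input clamp (option 2)
    private final long maxInitialBufferSize;   // output clamp (option 1)

    // Simple EWMA standing in for the real decaying metric.
    private volatile double average = Double.NaN;

    public ResponseSizeEstimator(long maxTrackedResponseSize, long maxInitialBufferSize)
    {
        this.maxTrackedResponseSize = maxTrackedResponseSize;
        this.maxInitialBufferSize = maxInitialBufferSize;
    }

    // Option 2: clamp the input before it reaches the average.
    public void update(long responseSize)
    {
        long clamped = Math.min(maxTrackedResponseSize, responseSize);
        average = Double.isNaN(average) ? clamped : 0.99 * average + 0.01 * clamped;
    }

    // Option 1: clamp the output when sizing the initial buffer.
    // A non-positive limit disables the metric and keeps the old behaviour.
    public long initialBufferSize(long fallback)
    {
        double estimate = average;
        if (Double.isNaN(estimate) || maxInitialBufferSize <= 0)
            return fallback;
        return Math.min((long) estimate, maxInitialBufferSize);
    }
}
```

The two clamps are independent: the input clamp keeps a single huge response from skewing the average, while the output clamp bounds the worst-case per-request allocation regardless of what the average says.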

Contributor

@maedhroz maedhroz left a comment


Mostly LGTM

In terms of safety, perhaps a good initial path forward here would just be to have a system property that controls the maximum initial size. Then, this whole optimization could be trivially disabled by setting it to the current minimum, which, actually, could also be a system property. They could default to 128 KiB / 1 MiB, and be loaded at static initialization time in LocalDataResponse from CassandraRelevantProperties. You end up with something like...

double bufferSizeEstimate = Double.isNaN(estimatedResponseSize) ? <min> : Math.min(estimatedResponseSize, <max>);

(I think the default max is the one thing to bike-shed, but ultimately that depends on the workload.)
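The suggestion above might look roughly like the following; the property names and the 128 KiB / 1 MiB defaults are assumptions for illustration, not actual CassandraRelevantProperties entries:

```java
// Illustrative sketch of the system-property clamp suggested above.
// Property names and defaults are assumptions, not real Cassandra properties.
public final class LocalDataResponseSketch
{
    // Loaded once at static initialization time, as suggested.
    static final long MIN_INITIAL_BUFFER_SIZE =
        Long.getLong("cassandra.local_data_response.min_initial_buffer_size", 128 * 1024);
    static final long MAX_INITIAL_BUFFER_SIZE =
        Long.getLong("cassandra.local_data_response.max_initial_buffer_size", 1024 * 1024);

    // NaN means no estimate yet, so fall back to the minimum; setting
    // max == min effectively disables the whole optimization.
    static long bufferSizeEstimate(double estimatedResponseSize)
    {
        return Double.isNaN(estimatedResponseSize)
             ? MIN_INITIAL_BUFFER_SIZE
             : Math.min((long) estimatedResponseSize, MAX_INITIAL_BUFFER_SIZE);
    }
}
```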
