Add payload size metrics#1289
Conversation
f0fbbf8 to
d503332
Compare
Record logical workflow and activity payload sizes alongside client-side gRPC message body sizes so users can detect payload growth before hitting transport limits. This adds workflow_payload_size metrics for workflow activation inputs and successful completion results, activity_payload_size metrics for activity inputs and successful results, and rpc_message_size metrics for client gRPC request and response bodies. The metrics use byte units, shared payload-size histogram buckets that cover the 4 MiB boundary, and low-cardinality message_direction labels for request and response series. Payload metric names and message_direction labels live in common so client and core emitters use the same names and cardinality values. The client transport wrapper records streamed DATA frame bytes without changing the body, while workflow and activity metrics count payload data plus metadata key/value bytes. Tests cover payload byte accounting, Prometheus series labels/counts for activity and workflow payload metrics, request and response RPC message-size series, and non-zero activity payload sums.
d503332 to
131d8a2
Compare
|
Let me bring this up with the rest of the team. We already record many, many metrics and emitting new ones can impose costs on users. We can potentially add these behind a flag. Often though our answer is to add these yourself via interceptors if you find you have a need for them. However, these particular metrics have come up once or twice before, and are in line with some of our work around storing payloads externally. I'll get back to you. |
|
Definitely, and thanks! I did assume this one might be a bit more controversial, but thought I would throw it up as a proposal in any case; It was the source of motivation for implementing the activity interceptors first :) My take for why this might be a beneficial one for the collective is hitting the 4mb payload limit is an immediate unrecoverable workflow failure, and devs calculating the in production payload creep is an easy one to go unobserved until failure. Something to consider with storing the payloads externally. I've discussed this with some other folks on the temporal team, but there is a catch to the current implementation, at least in the go SDK. The commands are accumulated in a list before getting sent out, so while the individual payloads might not exceed the 4mb limit or trigger the external storage threshold, the accumulation of all the commands before a yield can result in exceeding 4mb. i.e. 500 workflows triggered in a loop before the first future.Get is called. As you mentioned though, with the interceptors this can be done client side now, so either solution works for me. Thanks for the consideration! |
We have a fix coming for this too, where we'll paginate. |
What was changed
This adds payload-size metrics for three paths:
The metrics use byte units, shared payload-size histogram buckets that cover the 4 MiB boundary, and low-cardinality message_direction labels for request and response series. Payload metric names and message_direction label values live in common so client and core emitters use the same names and cardinality values.
The client transport wrapper records streamed DATA frame bytes without changing the body, while workflow and activity metrics count payload data plus metadata key/value bytes.
Why?
These metrics make payload growth visible before it hits transport limits, and separate logical payload sizing from client-side RPC body sizing.
Testing