Skip to content

Add payload size metrics#1289

Open
jeremy-cook wants to merge 2 commits into
temporalio:mainfrom
jeremy-cook:jecook-native-payload-metrics
Open

Add payload size metrics#1289
jeremy-cook wants to merge 2 commits into
temporalio:mainfrom
jeremy-cook:jecook-native-payload-metrics

Conversation

@jeremy-cook
Copy link
Copy Markdown
Contributor

@jeremy-cook jeremy-cook commented May 21, 2026

What was changed

This adds payload-size metrics for three paths:

  • workflow_payload_size for workflow activation inputs and successful completion results
  • activity_payload_size for activity inputs and successful results
  • rpc_message_size for client gRPC request and response bodies

The metrics use byte units, shared payload-size histogram buckets that cover the 4 MiB boundary, and low-cardinality message_direction labels for request and response series. Payload metric names and message_direction label values live in common so client and core emitters use the same names and cardinality values.

The client transport wrapper records streamed DATA frame bytes without changing the body, while workflow and activity metrics count payload data plus metadata key/value bytes.

Why?

These metrics make payload growth visible before it hits transport limits, and separate logical payload sizing from client-side RPC body sizing.

Testing

  • cargo +nightly fmt --all --check
  • cargo test -p temporalio-client body_size_recorder_counts_data_frames
  • cargo test -p temporalio-sdk-core payload_size_counts_data_and_metadata_bytes

@jeremy-cook jeremy-cook requested a review from a team as a code owner May 21, 2026 19:29
@jeremy-cook jeremy-cook force-pushed the jecook-native-payload-metrics branch from f0fbbf8 to d503332 Compare May 21, 2026 21:43
Record logical workflow and activity payload sizes alongside client-side gRPC message body sizes so users can detect payload growth before hitting transport limits.

This adds workflow_payload_size metrics for workflow activation inputs and successful completion results, activity_payload_size metrics for activity inputs and successful results, and rpc_message_size metrics for client gRPC request and response bodies.

The metrics use byte units, shared payload-size histogram buckets that cover the 4 MiB boundary, and low-cardinality message_direction labels for request and response series. Payload metric names and message_direction labels live in common so client and core emitters use the same names and cardinality values.

The client transport wrapper records streamed DATA frame bytes without changing the body, while workflow and activity metrics count payload data plus metadata key/value bytes.

Tests cover payload byte accounting, Prometheus series labels/counts for activity and workflow payload metrics, request and response RPC message-size series, and non-zero activity payload sums.
@jeremy-cook jeremy-cook force-pushed the jecook-native-payload-metrics branch from d503332 to 131d8a2 Compare May 21, 2026 21:48
@Sushisource
Copy link
Copy Markdown
Member

Let me bring this up with the rest of the team. We already record many, many metrics and emitting new ones can impose costs on users. We can potentially add these behind a flag. Often though our answer is to add these yourself via interceptors if you find you have a need for them.

However, these particular metrics have come up once or twice before, and are in line with some of our work around storing payloads externally.

I'll get back to you.

@jeremy-cook
Copy link
Copy Markdown
Contributor Author

Definitely, and thanks! I did assume this one might be a bit more controversial, but thought I would throw it up as a proposal in any case; It was the source of motivation for implementing the activity interceptors first :)

My take for why this might be a beneficial one for the collective is hitting the 4mb payload limit is an immediate unrecoverable workflow failure, and devs calculating the in production payload creep is an easy one to go unobserved until failure.

Something to consider with storing the payloads externally. I've discussed this with some other folks on the temporal team, but there is a catch to the current implementation, at least in the go SDK. The commands are accumulated in a list before getting sent out, so while the individual payloads might not exceed the 4mb limit or trigger the external storage threshold, the accumulation of all the commands before a yield can result in exceeding 4mb. i.e. 500 workflows triggered in a loop before the first future.Get is called.

As you mentioned though, with the interceptors this can be done client side now, so either solution works for me.

Thanks for the consideration!

@Sushisource
Copy link
Copy Markdown
Member

the accumulation of all the commands before a yield can result in exceeding 4mb. i.e. 500 workflows triggered in a loop before the first future.Get is called.

We have a fix coming for this too, where we'll paginate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants