
Capture important data events verbatim in filter settings#49

Merged
mjcheetham merged 3 commits into git-ecosystem:main from
derrickstolee:trace-data-strings
Apr 27, 2026

Conversation

Collaborator

@derrickstolee derrickstolee commented Apr 20, 2026

In #44, we created a custom model for generating performance summaries based on region timings and printf events.

This change introduces a new customization to allow promoting trace2 string data messages into the output telemetry event.

My motivation is that the microsoft/git fork includes a git-gvfs-helper tool that emits specific string data messages when particular kinds of errors occur. Those messages are not being elevated to the telemetry, so we don't know how often they are happening or whether they will go away when we make certain server-side changes.

  • Adds important_events rules to filter.yml that capture specific Trace2 data event values verbatim into a new trace2.process.important_events OTEL span attribute.
  • Rules match by category (exact) and key_prefix (prefix), and collect all matching values into a named array field.
  • Capture runs regardless of active detail level — values appear even at dl:summary without enabling verbose telemetry.
  • Implemented in filter_settings.go alongside ImportantEventRule type and validation; stored separately from SummaryAccumulator.

This will take events whose category matches exactly and whose key matches by prefix, and collect their values into a JSON dictionary as follows:

  {
    "gvfs_helper_caches": [
      "<redacted>"
    ],
    "gvfs_helper_errors": [
      "(curl:35) SSL connect error [hard_fail]"
    ],
    "gvfs_helper_remotes": [
      "https://dev.azure.com/<redacted>"
    ]
  }
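A filter.yml fragment that could produce the errors entry in the dictionary above might look like the following. This is a sketch built from the rule shape described in this PR (category, key_prefix, field_name); the exact prefix is taken from the earlier review example, while any additional rules for caches and remotes would follow the same pattern:

```yaml
# Hypothetical filter.yml fragment — field names follow the
# important_events rule shape described in this PR.
important_events:
  - category: "gvfs-helper"
    key_prefix: "error/"
    field_name: "gvfs_helper_errors"
```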

This is a real example that I was able to trigger in my own testing, along with other non-error strings, using an internal telemetry service that consumes git-ecosystem/trace2receiver via a fork of git-ecosystem/sample-trace2-otel-collector.

I used agentic coding to produce these results, including the substantial test cases.

Comment thread on Docs/Examples/summary_example.yml (outdated), lines +34 to +42:
# Data pattern rules - capture values from data events matching
# a (category, key prefix) pair. Matched values are promoted into
# the summary regardless of detail level or nesting.
data_patterns:
# Capture curl/http error details from gvfs-helper
- category: "gvfs-helper"
key_prefix: "error/"
field_name: "gvfs_helper_errors"

Contributor

So data_patterns log the actual objects from the event_data events, whereas the message_patterns just log a count.

Something about the difference in summarisation behaviour between two similarly named things makes me feel a little uneasy from a 'permanent API' perspective.

message_patterns aggregates, but data_patterns collects.

Collaborator Author

How do you feel about something like string_patterns to signal "match which strings to escalate up the stack" or do you feel we need to move these patterns out of the "summary" config?

Collaborator Author

(I'm pushing a version that includes that model, just in case.)

Contributor

I guess really it's not a question of the name of the fields, but that message_patterns performs aggregation (counts of matching messages, which makes sense as a summary) vs the new feature that performs matching+elevation (actual data).

Maybe I'd feel better if this was not tied to the summary. I saw summary as a way to define aggregated statistics.

If some types of events are important to capture verbatim, perhaps this is more of an "important_events" feature or something... (placeholder name).

Collaborator Author

Latest push includes a rewrite to put important_events in the filter config instead of summary config.

I did a second full test with the internal telemetry service consuming this branch, getting this important_events value in the telemetry stream:

  {
    "gvfs_helper_caches": [
      "<redacted>"
    ],
    "gvfs_helper_errors": [
      "(curl:35) SSL connect error [hard_fail]"
    ],
    "gvfs_helper_remotes": [
      "https://dev.azure.com/<redacted>"
    ]
  }

derrickstolee and others added 3 commits April 24, 2026 10:46
Context:
Data events with nesting level n are logically attached to the region
at regionStack[n-2]. The prior bounds check only tested whether the
stack was shorter than rWant, missing two cases: a negative rWant (when
mf_nesting is 0 or 1, which should have been caught by the nesting <= 1
guard but might slip through) and the case where rWant equals the stack
length exactly, which would be an out-of-bounds index.

Justification:
Adding `rWant < 0` makes the guard explicit about the lower bound.
Changing `<` to `<=` corrects the upper bound: a valid index into a
slice of length L is 0..L-1, so `len >= rWant+1`, i.e. `rWant < len`,
i.e. reject when `rWant >= len`, which is `len <= rWant`. The fix
ensures the event is silently ignored (with a TODO log warning) rather
than causing an out-of-bounds panic.

Implementation:
Single comparison change in apply__data_generic. No behavior change
for well-formed event streams; only prevents a potential panic on
malformed or unexpectedly nested data events.
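The corrected guard described in this commit can be sketched as follows. This is a minimal illustration, not the receiver's actual code: `attachRegion`, the plain-string stack, and the function signature are simplified stand-ins for the real state kept in apply__data_generic:

```go
package main

import "fmt"

// attachRegion returns the region that a data event with Trace2
// nesting level `nesting` is logically attached to, or false when
// the event is malformed or unexpectedly nested. (Sketch only: the
// real receiver keeps richer region state than plain strings.)
func attachRegion(regionStack []string, nesting int) (string, bool) {
	if nesting <= 1 {
		// Top-level data events have no enclosing region.
		return "", false
	}
	rWant := nesting - 2 // event attaches to regionStack[nesting-2]
	if rWant < 0 || len(regionStack) <= rWant {
		// Explicit lower bound plus corrected upper bound:
		// a valid index into a slice of length L is 0..L-1,
		// so reject when rWant >= len (i.e. len <= rWant)
		// instead of panicking on an out-of-bounds index.
		return "", false
	}
	return regionStack[rWant], true
}

func main() {
	stack := []string{"outer", "inner"}
	r, ok := attachRegion(stack, 3) // attaches to stack[1]
	fmt.Println(r, ok)
	_, ok = attachRegion(stack, 4) // rWant == len(stack): rejected
	fmt.Println(ok)
}
```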
Context:
Operators monitoring Git processes that run gvfs-helper subcommands
need visibility into specific data event values (error strings, curl
codes, etc.) even when verbose telemetry is disabled. At dl:summary
level all data events are currently dropped, so there is no way to
guarantee critical diagnostic values surface in the OTEL process span
without enabling expensive full-verbose collection site-wide.

Justification:
The rules live in filter.yml, not summary.yml, because capture is a
filtering concern: it determines which data events are guaranteed to
appear regardless of detail level, paralleling how filter rules
control verbosity. The captured values are stored in their own OTEL
attribute (trace2.process.important_events) separate from
trace2.process.summary, keeping aggregated metrics distinct from
verbatim strings. The capture call in apply__data_generic runs before
any early-return paths so that orphaned nested events (nesting > 1
with an empty region stack) are never silently dropped.

Implementation:
ImportantEventRule (category exact match, key_prefix prefix match,
field_name output key) is added to FilterSettings and validated in
parseFilterSettingsFromBuffer. TrProcess gains an importantEvents map
that is allocated only when rules exist, so the nil check in
apply__important_events serves as a fast no-op when the feature is
unconfigured. apply__important_events in filter_settings.go matches
each data event against the configured rules and appends matching
values. emitProcessSpan emits trace2.process.important_events as a
JSON object whenever the map is non-empty, at all detail levels.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
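The matching described in this commit can be sketched as below. The names `ImportantEventRule`, `category`, `key_prefix`, and `field_name` come from the commit message; the struct fields, function signature, and `applyImportantEvents` name are illustrative guesses rather than the actual code in filter_settings.go:

```go
package main

import (
	"fmt"
	"strings"
)

// ImportantEventRule mirrors the rule shape described above: exact
// category match, key prefix match, and a named output field.
// (Field names here are illustrative; see filter_settings.go for
// the real definition.)
type ImportantEventRule struct {
	Category  string
	KeyPrefix string
	FieldName string
}

// applyImportantEvents appends a data event's value under the rule's
// field name when any rule matches. A nil map means the feature is
// unconfigured, so the call is a fast no-op, as the commit message
// describes.
func applyImportantEvents(events map[string][]string, rules []ImportantEventRule, category, key, value string) {
	if events == nil {
		return
	}
	for _, r := range rules {
		if category == r.Category && strings.HasPrefix(key, r.KeyPrefix) {
			events[r.FieldName] = append(events[r.FieldName], value)
		}
	}
}

func main() {
	rules := []ImportantEventRule{
		{Category: "gvfs-helper", KeyPrefix: "error/", FieldName: "gvfs_helper_errors"},
	}
	events := map[string][]string{}
	applyImportantEvents(events, rules, "gvfs-helper", "error/curl",
		"(curl:35) SSL connect error [hard_fail]")
	applyImportantEvents(events, rules, "gvfs-helper", "info/cache", "miss") // no rule matches
	fmt.Println(events)
}
```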
Context:
The important_events feature added in the previous commit has no
user-facing documentation. Operators need to know the YAML syntax,
the semantics of each field, the output attribute name, and how the
feature relates to summary.yml so they can distinguish verbatim
capture from aggregated metrics.

Implementation:
config-filter-settings.md gains a new "Important Events" section
with YAML syntax, field descriptions, and a worked example showing
gvfs-helper error capture. The filter syntax summary at the bottom
is extended with the important_events block so the complete schema
appears in one place. configure-custom-collector.md documents the
filter: pathname key (previously undocumented in the receiver config
example) and adds a cross-reference to important_events. The summary
example YAML adds a note directing readers to filter.yml for verbatim
value capture, distinguishing it from the aggregated summary output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@derrickstolee derrickstolee changed the title Allow summaries to include string data values Capture important data events verbatim in filter settings Apr 24, 2026
Contributor

@mjcheetham left a comment

Thank you for persisting with my concerns and delays in getting back to you on this @derrickstolee - you know your contributions are always thoroughly appreciated ❤️

Comment on lines +259 to +263
In addition to controlling verbosity, the `filter.yml` file can
declare a list of data events that should always be captured verbatim,
regardless of the active detail level. This lets operators guarantee
that specific Trace2 data values are always surfaced in the OTEL
process span even when verbose telemetry is disabled.
Contributor

I like this description of the feature - fits well!

@mjcheetham mjcheetham merged commit c720594 into git-ecosystem:main Apr 27, 2026
3 checks passed
@mjcheetham
Contributor

This is now included in the v0.7.0 tag.

