feat(events): add count, timeseries, and field-value-discovery tools #83

Open: ivanlysiuk-sysdig wants to merge 1 commit into sysdiglabs:main from ivanlysiuk-sysdig:feat/event-investigation-tools
@ivanlysiuk-sysdig
Contributor

Summary

Today the MCP server can list runtime events (list_runtime_events) and
fetch single events by id (get_event_info, get_event_process_tree),
but cannot answer aggregate questions efficiently. A few common
investigation questions force the LLM to pull the underlying event bodies
even when the answer is purely numeric:

  • "How many high-severity events fired in cluster X over the last 24h?"
    → today: paginate list_runtime_events, count the array locally. With
    the documented 200-event-per-call cap, this is O(N/200) round trips
    and O(N · payload-size) tokens for a question whose answer is one
    integer per severity bucket.
  • "When did this burst start / stop?" → today: binary-search the
    window with successively narrower list_runtime_events calls.
  • "What clusters / rules / image names are actually producing events
    right now?"
    → today: guess values, filter, see if anything comes
    back, iterate. Most failed filters are typos against names the model
    cannot know up front.
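
To make the pagination cost in the first bullet concrete, here is a minimal sketch of the round-trip arithmetic. It assumes only the 200-event-per-call cap quoted above. The function name roundTrips and the sample event counts are illustrative and not part of this PR.

```go
package main

import "fmt"

// pageCap is the documented per-call cap quoted in the PR description.
const pageCap = 200

// roundTrips returns how many paginated list calls are needed to read
// n matching events when each call returns at most perPage events.
// (Hypothetical helper, for illustration only.)
func roundTrips(n, perPage int) int {
	if n <= 0 || perPage <= 0 {
		return 0
	}
	return (n + perPage - 1) / perPage // ceiling division
}

func main() {
	// Sample sizes are made up; they only show how the call count grows.
	for _, n := range []int{150, 200, 5000, 120001} {
		fmt.Printf("%d events -> %d calls\n", n, roundTrips(n, pageCap))
	}
}
```

Every one of those calls also returns full event bodies, so token cost grows with N even though the answer is a handful of integers.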

This PR adds three tools that answer those three questions in a single
call each, using existing Sysdig public APIs.

New tools

| Tool | Endpoint | Question it answers |
|---|---|---|
| count_runtime_events | GET /api/v1/secureEvents/count | "How many events match <filter> in the last N hours?" Returns a histogram across 16 event categories × 8 severity codes in one call. No pagination, no truncation. |
| runtime_events_timeseries | GET /api/v1/secureEvents/timeseriesBy | "When did this burst start / stop?" Returns per-bucket counts grouped by a categorical field (default severity). Server picks the coarsest bucket size that fits the rows upper bound; minimum bucket is 1 minute. Lets the model find a burst boundary in two calls (coarse pass + zoom). |
| discover_runtime_event_field_values | GET /secure/events/v2/eventFields/{field} | "What clusters / rules / image names are firing in this window?" Returns suggested (values active in the window) and other (values known to the tenant but inactive). Lets the model learn real names before writing a filter instead of guessing. |
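
As a rough illustration of how a caller might reduce the count histogram to the "one integer per severity bucket" answer, the sketch below sums a category × severity grid. The Histogram type, the category keys, and the integer severity codes are assumptions made for this example. The actual response shape of the count endpoint is not shown in this PR description.

```go
package main

import "fmt"

// Histogram is a hypothetical in-memory shape for the count result:
// event category -> severity code -> number of events.
type Histogram map[string]map[int]int64

// totalsBySeverity collapses the category dimension and also returns
// the grand total across all cells.
func totalsBySeverity(h Histogram) (map[int]int64, int64) {
	perSeverity := make(map[int]int64)
	var total int64
	for _, bySeverity := range h {
		for sev, n := range bySeverity {
			perSeverity[sev] += n
			total += n
		}
	}
	return perSeverity, total
}

func main() {
	// Placeholder categories and counts, not real API output.
	h := Histogram{
		"category-a": {0: 3, 4: 10},
		"category-b": {0: 1},
	}
	perSev, total := totalsBySeverity(h)
	fmt.Println(perSev[0], perSev[4], total)
}
```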

All three require policy-events.read — the same permission as
list_runtime_events and get_event_info. They're permission-gated by
the same RequiredPermissionsFromTool helper, so the existing
permission-based filtering keeps working.
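
The gating idea can be sketched as follows. This is not the RequiredPermissionsFromTool helper itself, whose signature is not shown here. The tool struct and visibleTools function are hypothetical stand-ins that only illustrate hiding any tool whose required permissions are not all granted.

```go
package main

import "fmt"

// tool is a hypothetical stand-in for a registered MCP tool.
type tool struct {
	name     string
	required []string // permissions the token must hold
}

// visibleTools returns the names of tools whose required permissions
// are all present in the granted set.
func visibleTools(all []tool, granted map[string]bool) []string {
	var out []string
	for _, t := range all {
		allowed := true
		for _, p := range t.required {
			if !granted[p] {
				allowed = false
				break
			}
		}
		if allowed {
			out = append(out, t.name)
		}
	}
	return out
}

func main() {
	tools := []tool{
		{name: "list_runtime_events", required: []string{"policy-events.read"}},
		{name: "count_runtime_events", required: []string{"policy-events.read"}},
	}
	fmt.Println(visibleTools(tools, map[string]bool{"policy-events.read": true}))
	fmt.Println(visibleTools(tools, map[string]bool{}))
}
```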

Shared baseline + DSL fixes for list_runtime_events

The runtime-events baseline filter (not originator in ("benchmarks","compliance","cloudsec","scanning","hostscanning")) is
extracted into secure_events_common.go and reused by all four
runtime-event tools, so the four tools surface a consistent view of
"runtime activity" regardless of which one the model picks.
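
A minimal sketch of how a shared baseline might be combined with a caller-supplied filter. The baseline string is copied from the description above. The helper name withBaseline, and the assumption that the DSL accepts a parenthesised sub-expression joined with and, are illustrative and may not match secure_events_common.go.

```go
package main

import (
	"fmt"
	"strings"
)

// baselineFilter is the runtime-events baseline quoted in the PR text.
const baselineFilter = `not originator in ("benchmarks","compliance","cloudsec","scanning","hostscanning")`

// withBaseline prefixes the baseline to an optional user expression.
// Assumption: the filter DSL supports `and` with a parenthesised group.
func withBaseline(userExpr string) string {
	userExpr = strings.TrimSpace(userExpr)
	if userExpr == "" {
		return baselineFilter
	}
	return baselineFilter + " and (" + userExpr + ")"
}

func main() {
	fmt.Println(withBaseline(""))
	fmt.Println(withBaseline(`host.hostName starts with "web-"`))
}
```

Wrapping the user expression in parentheses keeps an or inside it from escaping the baseline, which is the main reason to centralise this in one helper.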

The same file holds the filter-expression DSL prose, also shared across
the four tools — keeping the LLM's filter intuition identical between
list / count / timeseries / discover.

While touching list_runtime_events to share the baseline, two
examples in its filter_expr description are fixed:

  • host.hostName startsWith "web-" → host.hostName starts with "web-"
    (startsWith as one word is rejected by the backend with HTTP 400).
  • container.imageName = "nginx:latest" → container.image.repo = "nginx" and container.image.tag = "latest"
    (container.imageName is rejected with HTTP 422 "unsupported metric";
    the descriptors that exist are container.image.repo,
    container.image.tag, container.image.digest, container.image.id).

These examples currently render in the tool description and may have
been propagating into model-generated filters as syntax errors.

Worked example — "Investigate the most recent burst"

With these tools, a typical investigation can be:

  1. discover_runtime_event_field_values(field: "ruleName", scope_hours: 24)
    → lists rule names actively producing events (the suggested bucket).
  2. runtime_events_timeseries(scope_hours: 24, field: "severity", filter_expr: 'ruleName = "<picked-rule>"', rows: 1000) → coarse
    pass; identifies which 15-minute / 1-hour buckets contain the
    activity.
  3. runtime_events_timeseries(scope_hours: <narrowed>, field: "severity", filter_expr: 'ruleName = "<picked-rule>"', rows: 3600)
    → forces 1-minute buckets across the narrowed range; pinpoints the
    start and end of the burst.
  4. count_runtime_events(scope_hours: <narrowed>, filter_expr: 'ruleName = "<picked-rule>"') → exact total.
  5. list_runtime_events(scope_hours: <narrowed>, filter_expr: 'ruleName = "<picked-rule>"', limit: 5) → a few representative
    events to read in detail.

Five calls instead of dozens of paginated reads, and the model never
needs to count event-array lengths to answer "how many".

Test plan

  • go build ./... clean.
  • go vet ./... clean.
  • go test ./internal/infra/mcp/tools/... passes (existing
    list_runtime_events test still green after the shared-baseline
    refactor; three new test files cover happy-path / defaults /
    client-error / non-2xx for each new tool).
  • go generate ./internal/infra/sysdig/ cleanly regenerates
    mocks/client_extension.go with the three new mock methods.
  • No changes to the OpenAPI spec — new endpoints are added as
    hand-written client extensions following the existing
    client_process_tree.go pattern.
  • No breaking changes — additive only. Tool registration in
    cmd/server/main.go keeps the existing tools in place and appends
    the three new ones.
  • Permission gating: all three new tools declare
    policy-events.read so they're filtered out for tokens that lack it.
  • Each new tool's description includes 4–8 filter examples drawn
    from real customer-investigation shapes, and the DSL prose lists ML
    / severity / engine recipes.
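
For readers unfamiliar with the happy-path / client-error / non-2xx split, here is one way those three cases can be exercised without a network, using a stub http.RoundTripper. This is a sketch and not the test code in this PR. fetchCount, stubClient, and failingClient are hypothetical names, the host is a placeholder, and authentication and query parameters are omitted.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// roundTripFunc adapts a function to http.RoundTripper so tests can
// stub responses without opening sockets.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// fetchCount is a hypothetical minimal client for the count endpoint.
// It returns the raw body on 2xx and an error otherwise.
func fetchCount(client *http.Client, baseURL string) ([]byte, error) {
	resp, err := client.Get(baseURL + "/api/v1/secureEvents/count")
	if err != nil {
		return nil, fmt.Errorf("request failed: %w", err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("reading body: %w", err)
	}
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return nil, fmt.Errorf("unexpected status %d: %s", resp.StatusCode, body)
	}
	return body, nil
}

// stubClient always answers with the given status and body.
func stubClient(status int, body string) *http.Client {
	return &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		return &http.Response{
			StatusCode: status,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader(body)),
			Request:    r,
		}, nil
	})}
}

// failingClient simulates a transport-level (client) error.
func failingClient() *http.Client {
	return &http.Client{Transport: roundTripFunc(func(*http.Request) (*http.Response, error) {
		return nil, fmt.Errorf("simulated connection failure")
	})}
}

func main() {
	const base = "https://sysdig.example.invalid" // placeholder, never contacted
	cases := []struct {
		name   string
		client *http.Client
	}{
		{"happy-path", stubClient(http.StatusOK, `{}`)},
		{"non-2xx", stubClient(http.StatusInternalServerError, "boom")},
		{"client-error", failingClient()},
	}
	for _, c := range cases {
		body, err := fetchCount(c.client, base)
		fmt.Printf("%-12s body=%q err=%v\n", c.name, body, err)
	}
}
```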

Notes

  • I considered exposing these capabilities under /secure/events/v1/*
    to match the existing event endpoints, but the count,
    timeseriesBy, and eventFields/* endpoints don't exist on that
    family today — they live under /api/v1/secureEvents* and
    /secure/events/v2/eventFields/*. If the backend later exposes them
    under /secure/events/v1/*, the hand-written clients here are easy
    to migrate.
  • The runtime-events 1-minute bucket floor and the 14-day window cap
    are noted in the tool descriptions so the model can reason about
    them up front.

🤖 Generated with Claude Code

Adds three new MCP tools so that end-to-end runtime-event investigations
can be done in a few tool calls instead of paginating event bodies:

- count_runtime_events: returns a 16-category × 8-severity histogram for
  any filter and time window in a single call. No pagination, no
  truncation. Backed by GET /api/v1/secureEvents/count.

- runtime_events_timeseries: buckets event counts over time, grouped by
  a categorical field (default "severity"). Server picks the coarsest
  bucket size that fits the rows cap; minimum bucket is 1 minute. Lets
  the model find when a burst started/ended in two calls (coarse pass +
  zoom). Backed by GET /api/v1/secureEvents/timeseriesBy.

- discover_runtime_event_field_values: enumerates the distinct values
  of a runtime-events field present in a window, split into "suggested"
  (active in window) and "other" (known but inactive). Lets the model
  learn real cluster/rule/image names before writing a filter instead of
  guessing. Backed by GET /secure/events/v2/eventFields/{field}.

Also:

- Extracts the runtime-events baseline filter ("not originator in
  (benchmarks, compliance, cloudsec, scanning, hostscanning)") into a
  shared helper used by all four runtime-event tools.

- Shares the filter-expression DSL documentation across the four tools
  so the LLM applies identical filter intuition everywhere.

- Fixes two filter-DSL examples in list_runtime_events whose syntax was
  rejected by the live API: 'host.hostName startsWith "web-"' is not
  accepted (correct form: 'host.hostName starts with "web-"'), and
  'container.imageName' is not a valid field (correct forms:
  'container.image.repo' and 'container.image.tag').

All three new tools require policy-events.read, the same permission as
list_runtime_events and get_event_info.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 15, 2026 21:47
@ivanlysiuk-sysdig ivanlysiuk-sysdig requested a review from a team as a code owner May 15, 2026 21:47
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
