Skip to content

fix(telemetry): emit dev command telemetry before blocking#1375

Merged
Hweinstock merged 7 commits into
aws:mainfrom
Hweinstock:telemetry-dynamic-attrs
May 26, 2026
Merged

fix(telemetry): emit dev command telemetry before blocking#1375
Hweinstock merged 7 commits into
aws:mainfrom
Hweinstock:telemetry-dynamic-attrs

Conversation

@Hweinstock
Copy link
Copy Markdown
Contributor

@Hweinstock Hweinstock commented May 26, 2026

Description

The dev command had two telemetry gaps:

  1. Blocking paths never flushed telemetry. dev --logs and browser mode block forever. The process is killed before withCommandRunTelemetry can flush. The workaround was eager client.emit() calls that always reported exit_reason: success, but this is not ideal.

  2. Validation failures were invisible. Errors like "agent not found", "port in use", or "no project" happened outside withCommandRunTelemetry, so no failure metric was emitted.

This PR fixes both by:

  • Adding _telemetryAttrs support to withCommandRunTelemetry (callback can return dynamic attrs that merge with fallback)
  • Converting validation errors from process.exit(1) to thrown errors
  • Wrapping all dev server logic in a single withCommandRunTelemetry where blocking promises are started (not awaited) inside the callback and awaited after telemetry flushes

Related Issue

Closes #

Documentation PR

N/A

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Other (please describe):

Testing

How have you tested the change?

  • I ran npm run test:unit and npm run test:integ
  • I ran npm run typecheck
  • I ran npm run lint
  • If I modified src/assets/, I ran npm run test:update-snapshots and committed the updated snapshots

Additional testing:

  • Added integration tests verifying server startup emits success telemetry and validation failure emits failure telemetry
  • Added unit tests for _telemetryAttrs merge behavior
  • Manually verified all 7 dev command flows with AGENTCORE_TELEMETRY_AUDIT=1

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.

@github-actions github-actions Bot added the size/l PR size: L label May 26, 2026
@github-actions github-actions Bot added the agentcore-harness-reviewing AgentCore Harness review in progress label May 26, 2026
@agentcore-devx-automation agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026
@agentcore-devx-automation
Copy link
Copy Markdown
Contributor

Claude Security Review: no high-confidence findings. (run)

@agentcore-devx-automation agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Package Tarball

aws-agentcore-0.15.0.tgz

How to install

gh release download pr-1375-tarball --repo aws/agentcore-cli --pattern "*.tgz" --dir /tmp/pr-tarball
npm install -g /tmp/pr-tarball/aws-agentcore-0.15.0.tgz

Copy link
Copy Markdown

@agentcore-cli-automation agentcore-cli-automation left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dual-emit telemetry approach for blocking commands is a good idea, and the validation-error coverage is a real improvement. A few things I'd like to see addressed before this merges — most importantly the type-safety regression around _telemetryAttrs / blockingPromise and a couple of correctness concerns about when success is emitted relative to when the server is actually up.

Comment thread src/cli/telemetry/cli-command-run.ts Outdated
command: C,
attrs: CommandAttrs<C>,
fn: () => R | Promise<R>
fn: () => (R & { _telemetryAttrs?: CommandAttrs<C> }) | Promise<R & { _telemetryAttrs?: CommandAttrs<C> }>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signature fn: () => (R & { _telemetryAttrs?: CommandAttrs<C> }) produces unsound return-type narrowing at the call sites. Because the dev command's callbacks only return success-shaped values (e.g. { success: true as const, blockingPromise, _telemetryAttrs }), R gets inferred as the success branch only. After await withCommandRunTelemetry(...) the caller has a result that TS believes is always { success: true, ... }, but the function's catch block on line 112 actually returns { success: false, error } — a value the caller's type system doesn't know about. That's why every call site in command.tsx has to do if (!execResult.success) throw (execResult as unknown as { error: Error }).error; with a TODO comment.

This is a real type safety hole — the cast bypasses TS's exhaustiveness checks, and a future refactor that drops one of those guards will silently swallow errors. Options:

  1. Constrain the callback to () => Result | Promise<Result> (drop the R generic) and have callers narrow on the returned Result directly. Pass _telemetryAttrs and blockingPromise via an explicit second return value, e.g. () => Promise<{ result: Result; telemetryAttrs?: ...; blockingPromise?: Promise<void> }>.
  2. Change the return type to Promise<R | { success: false; error: Error }> so call sites get a proper discriminated union and don't need any cast.
  3. Have the callback return its result and metadata as separate fields (e.g. { outcome: Result, attrs?: CommandAttrs<C>, blocking?: Promise<void> }) so the result type the caller observes is clean.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the follow-up mentioned in the TODO comment. All of these are band aid fixes to the real problem.

Comment thread src/cli/commands/dev/command.tsx Outdated
);
// TODO: Remove cast once withCommandRunTelemetry's return type is narrowed
if (!serverResult.success) throw (serverResult as unknown as { error: Error }).error;
if ('blockingPromise' in serverResult) await serverResult.blockingPromise;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blockingPromise is being smuggled through withCommandRunTelemetry as an extra field on the success result. It works because the wrapper only strips _telemetryAttrs (line 105 of cli-command-run.ts), but:

  1. The wrapper's signature has no knowledge of blockingPromise, so the caller has to use 'blockingPromise' in serverResult — bypassing the type system.
  2. If anyone later updates withCommandRunTelemetry to be stricter about its return shape (e.g. Pick-ing only known keys), this silently breaks dev mode.

Worth making blockingPromise a first-class concept of withCommandRunTelemetry (e.g. callback returns { result, blockingPromise? } and the wrapper awaits it itself after flushing telemetry), or at minimum returning it as a separate value alongside the result rather than mixed in.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real issue is that this check is completely redundant since all paths return the promise. Addressed by removing conditional.

Comment thread src/cli/commands/dev/command.tsx Outdated

return {
success: true as const,
blockingPromise: serverPromise,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In --logs mode, server.start() is called inside the Promise executor (around line 420) and not awaited before this blockingPromise is returned. That means withCommandRunTelemetry records exit_reason: success for the dev_action: server metric before the server has actually started. If server.start() rejects (port in use after our check, container runtime missing for Container builds, image pull failure, etc.), the success metric has already been emitted; the rejection only surfaces later when the outer await serverResult.blockingPromise runs and we exit with code 1.

The whole point of this PR is to fix the case where validation/startup failures were invisible — but startup failures of the server process itself are the most likely thing to fail and are still being reported as success.

Suggestion: await server.start() before constructing/returning the blockingPromise. The promise that's actually blocking is the onExit/SIGINT promise, not start. Roughly:

const server = createDevServer(config, { port: actualPort, envVars: mergedEnvVars, callbacks: devCallbacks });
await server.start();  // throws here propagate as failure telemetry

const serverPromise = new Promise<void>((resolve, reject) => {
  // wire up onExit / SIGINT here, but no server.start() inside
});

The same concern likely applies to runBrowserMode (line 487) — if it has any synchronous "prepare server" phase before its forever-block, that should be awaited before the success metric is emitted. The integ test on line 90 of dev-server.test.ts only verifies the metric is emitted after the server is reachable (happy path); it wouldn't catch the failure case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means withCommandRunTelemetry records exit_reason: success for the dev_action: server metric before the server has actually started.

This is intentional. Once we start the server, its a blocking call with await so we'll never emit. This is the exact issue this PR is aiming to address.

@@ -422,34 +425,30 @@ export const registerDev = (program: Command) => {
server.kill();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior change worth confirming is intentional: the original --logs mode onExit did process.exit(code ?? 0), so a non-zero server exit produced an exit code matching the server's. The new code (line 407–411 above) rejects with new Error('Server exited with code N') and:

  1. Goes through the outer catch at line 509–512 and exits with code 1 (regardless of the server's actual exit code).
  2. Prints Error: Server exited with code N to stderr — new user-facing output.
  3. Telemetry won't classify this as a meaningful category — it's a plain Error, so error_name will be UnknownError.

If this exit-code change is intentional, fine — but if you want telemetry to capture this as a real category, throw a BaseError subclass (a new DevServerExitError or similar) so classifyError recognizes it. Plain Error is the same bucket as any random thrown string.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels acceptable IMO. In the future we can model this error if we see a lot of UnknownError coming from dev.

Comment thread src/cli/telemetry/cli-command-run.ts Outdated
);
const result = await fn();
if (client) {
const merged = result._telemetryAttrs ? { ...attrs, ...result._telemetryAttrs } : attrs;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment on line 83 says "Returned attrs are merged with the upfront attrs (returned takes priority)" but the test cases in client.test.ts only verify override behavior with attrs that share keys. There's no test that actually covers the merge case (returned attrs partially overlapping with upfront, expecting non-overlapping upfront attrs to be preserved).

In practice every call site in command.tsx returns the same shape of attrs as it passed up front (with the same keys), so the merge behavior is never exercised — this is effectively override-only today. Either drop the merge and document this as override-only (simpler — make _telemetryAttrs always be a full CommandAttrs<C>), or add a test for the partial-merge case so the documented behavior is actually verified.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a case that isn't actually supported, but should be. Addressed in follow-up with test verifying behavior.

@github-actions github-actions Bot removed the agentcore-harness-reviewing AgentCore Harness review in progress label May 26, 2026
@github-actions github-actions Bot added size/l PR size: L and removed size/l PR size: L labels May 26, 2026
@agentcore-devx-automation agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026
@agentcore-devx-automation
Copy link
Copy Markdown
Contributor

Claude Security Review: no high-confidence findings. (run)

@agentcore-devx-automation agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026
@github-actions github-actions Bot added size/l PR size: L and removed size/l PR size: L labels May 26, 2026
@agentcore-devx-automation agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026
@agentcore-devx-automation
Copy link
Copy Markdown
Contributor

Claude Security Review: no high-confidence findings. (run)

@agentcore-devx-automation agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026
@github-actions github-actions Bot removed the size/l PR size: L label May 26, 2026
@github-actions github-actions Bot added the size/l PR size: L label May 26, 2026
@agentcore-devx-automation agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026
@agentcore-devx-automation
Copy link
Copy Markdown
Contributor

Claude Security Review: no high-confidence findings. (run)

@agentcore-devx-automation agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026
}
);
if (!execResult.success) throw execResult.error;
// TODO: Remove cast once withCommandRunTelemetry's return type is narrowed
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now withCommandRunTelemetry handles thrown errors and failed results to be compatible in multiple places.

This creates some confusion not only mentally but also with types, so want to address this in a future pr.

Comment thread src/cli/telemetry/cli-command-run.ts Outdated
command: C,
attrs: CommandAttrs<C>,
fn: () => R | Promise<R>
fn: () => (R & { _telemetryAttrs?: CommandAttrs<C> }) | Promise<R & { _telemetryAttrs?: CommandAttrs<C> }>
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the follow-up mentioned in the TODO comment. All of these are band aid fixes to the real problem.

Comment thread src/cli/commands/dev/command.tsx Outdated
);
// TODO: Remove cast once withCommandRunTelemetry's return type is narrowed
if (!serverResult.success) throw (serverResult as unknown as { error: Error }).error;
if ('blockingPromise' in serverResult) await serverResult.blockingPromise;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real issue is that this check is completely redundant since all paths return the promise. Addressed by removing conditional.

Comment thread src/cli/commands/dev/command.tsx Outdated

return {
success: true as const,
blockingPromise: serverPromise,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means withCommandRunTelemetry records exit_reason: success for the dev_action: server metric before the server has actually started.

This is intentional. Once we start the server, its a blocking call with await so we'll never emit. This is the exact issue this PR is aiming to address.

@@ -422,34 +425,30 @@ export const registerDev = (program: Command) => {
server.kill();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels acceptable IMO. In the future we can model this error if we see a lot of UnknownError coming from dev.

Comment thread src/cli/telemetry/cli-command-run.ts Outdated
);
const result = await fn();
if (client) {
const merged = result._telemetryAttrs ? { ...attrs, ...result._telemetryAttrs } : attrs;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a case that isn't actually supported, but should be. Addressed in follow-up with test verifying behavior.

@Hweinstock Hweinstock marked this pull request as ready for review May 26, 2026 13:42
@Hweinstock Hweinstock requested a review from a team May 26, 2026 13:42
Comment thread src/cli/commands/dev/command.tsx Outdated
return { success: true };
return {
success: true as const,
_telemetryAttrs: {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the _telemetryAttrs property confusing, at first glance it looks like you declare the telemetry properties twice and it's not clear what gets overriden vs what gets merged.

What if we instead pass in context from withCommandRunTelemetry so the async function can progressively define the attributes as we learn more about the request?

  withCommandRunTelemetry('dev', async (ctx) => {
    ctx.setAttrs({ agent_protocol: 'unknown', dev_action: 'server' });

    const project = await loadProjectConfig(workingDir);
    if (!project) throw new NoProjectError();

    const protocol = targetAgent.protocol;
    ctx.setAttrs({ agent_protocol: protocol.toLowerCase() });

    return { success: true, blockingPromise: serverPromise };
  });

In this example its very clear to the reader what telemetry is being sent at any given point in the code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was wresting with the same decision since this is the setup we had in toolkits, but I wanted to avoid making the properties mutable to keep it simpler. For example,

  • we set attributes in multiple places that overlap, making it non-obvious what the finals state is.
  • the callback throws an exception, in which case we don't really know which attributes are set yet or not.
  • the ctx must be passed down the call tree which can get pretty gnarly and clutter function interfaces.

I agree the mutable argument to the callback is more powerful, but it also introduces some state management complexity that I wanted to avoid.

I don't think I named these variables quite right to convey the idea. The attributes passed in upfront are fallback (where if it throws, or can't determine an attribute), then the ones are returned are the resolved attributes that take priority. I'm thinking about them as static vs dynamic, i.e. fallback are ones we can determine without running the callback, the returned ones require the callback to run. Maybe I could name them as such to make this more obvious?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After chatting, I think the complexity is worth it and a more familiar pattern to other devs making it more readable. I've swapped the implementation over to use a recorder that allows this.

@github-actions github-actions Bot added size/l PR size: L and removed size/l PR size: L labels May 26, 2026
@agentcore-devx-automation agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026
@agentcore-devx-automation
Copy link
Copy Markdown
Contributor

Claude Security Review: no high-confidence findings. (run)

@agentcore-devx-automation agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026
@Hweinstock Hweinstock merged commit 69b4ce6 into aws:main May 26, 2026
27 checks passed
@Hweinstock Hweinstock deleted the telemetry-dynamic-attrs branch May 26, 2026 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/l PR size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants