fix(telemetry): emit dev command telemetry before blocking by Hweinstock · Pull Request #1375 · aws/agentcore-cli

Hweinstock · 2026-05-26T13:07:11Z

Description

The dev command had two telemetry gaps:

Blocking paths never flushed telemetry. dev --logs and browser mode block forever. The process is killed before withCommandRunTelemetry can flush. The workaround was eager client.emit() calls that always reported exit_reason: success, but this is not ideal.
Validation failures were invisible. Errors like "agent not found", "port in use", or "no project" happened outside withCommandRunTelemetry, so no failure metric was emitted.

This PR fixes both by:

Adding _telemetryAttrs support to withCommandRunTelemetry (callback can return dynamic attrs that merge with fallback)
Converting validation errors from process.exit(1) to thrown errors
Wrapping all dev server logic in a single withCommandRunTelemetry where blocking promises are started (not awaited) inside the callback and awaited after telemetry flushes

Related Issue

Closes #

Documentation PR

N/A

Type of Change

Testing

How have you tested the change?

I ran npm run test:unit and npm run test:integ
I ran npm run typecheck
I ran npm run lint
If I modified src/assets/, I ran npm run test:update-snapshots and committed the updated snapshots

Additional testing:

Added integration tests verifying server startup emits success telemetry and validation failure emits failure telemetry
Added unit tests for _telemetryAttrs merge behavior
Manually verified all 7 dev command flows with AGENTCORE_TELEMETRY_AUDIT=1

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.

agentcore-devx-automation · 2026-05-26T13:07:59Z

Claude Security Review: no high-confidence findings. (run)

github-actions · 2026-05-26T13:08:07Z

Package Tarball

aws-agentcore-0.15.0.tgz

How to install

gh release download pr-1375-tarball --repo aws/agentcore-cli --pattern "*.tgz" --dir /tmp/pr-tarball
npm install -g /tmp/pr-tarball/aws-agentcore-0.15.0.tgz

agentcore-cli-automation

The dual-emit telemetry approach for blocking commands is a good idea, and the validation-error coverage is a real improvement. A few things I'd like to see addressed before this merges — most importantly the type-safety regression around _telemetryAttrs / blockingPromise and a couple of correctness concerns about when success is emitted relative to when the server is actually up.

agentcore-cli-automation · 2026-05-26T13:11:36Z

  command: C,
  attrs: CommandAttrs<C>,
-  fn: () => R | Promise<R>
+  fn: () => (R & { _telemetryAttrs?: CommandAttrs<C> }) | Promise<R & { _telemetryAttrs?: CommandAttrs<C> }>


The signature fn: () => (R & { _telemetryAttrs?: CommandAttrs<C> }) produces unsound return-type narrowing at the call sites. Because the dev command's callbacks only return success-shaped values (e.g. { success: true as const, blockingPromise, _telemetryAttrs }), R gets inferred as the success branch only. After await withCommandRunTelemetry(...) the caller has a result that TS believes is always { success: true, ... }, but the function's catch block on line 112 actually returns { success: false, error } — a value the caller's type system doesn't know about. That's why every call site in command.tsx has to do if (!execResult.success) throw (execResult as unknown as { error: Error }).error; with a TODO comment.

This is a real type safety hole — the cast bypasses TS's exhaustiveness checks, and a future refactor that drops one of those guards will silently swallow errors. Options:

Constrain the callback to () => Result | Promise<Result> (drop the R generic) and have callers narrow on the returned Result directly. Pass _telemetryAttrs and blockingPromise via an explicit second return value, e.g. () => Promise<{ result: Result; telemetryAttrs?: ...; blockingPromise?: Promise<void> }>.

Change the return type to Promise<R | { success: false; error: Error }> so call sites get a proper discriminated union and don't need any cast.

Have the callback return its result and metadata as separate fields (e.g. { outcome: Result, attrs?: CommandAttrs<C>, blocking?: Promise<void> }) so the result type the caller observes is clean.

this is the follow-up mentioned in the TODO comment. All of these are band aid fixes to the real problem.

agentcore-cli-automation · 2026-05-26T13:11:37Z

+        );
+        // TODO: Remove cast once withCommandRunTelemetry's return type is narrowed
+        if (!serverResult.success) throw (serverResult as unknown as { error: Error }).error;
+        if ('blockingPromise' in serverResult) await serverResult.blockingPromise;


blockingPromise is being smuggled through withCommandRunTelemetry as an extra field on the success result. It works because the wrapper only strips _telemetryAttrs (line 105 of cli-command-run.ts), but:

The wrapper's signature has no knowledge of blockingPromise, so the caller has to use 'blockingPromise' in serverResult — bypassing the type system.

If anyone later updates withCommandRunTelemetry to be stricter about its return shape (e.g. Pick-ing only known keys), this silently breaks dev mode.

Worth making blockingPromise a first-class concept of withCommandRunTelemetry (e.g. callback returns { result, blockingPromise? } and the wrapper awaits it itself after flushing telemetry), or at minimum returning it as a separate value alongside the result rather than mixed in.

The real issue is that this check is completely redundant since all paths return the promise. Addressed by removing conditional.

agentcore-cli-automation · 2026-05-26T13:11:37Z

+
+              return {
+                success: true as const,
+                blockingPromise: serverPromise,


In --logs mode, server.start() is called inside the Promise executor (around line 420) and not awaited before this blockingPromise is returned. That means withCommandRunTelemetry records exit_reason: success for the dev_action: server metric before the server has actually started. If server.start() rejects (port in use after our check, container runtime missing for Container builds, image pull failure, etc.), the success metric has already been emitted; the rejection only surfaces later when the outer await serverResult.blockingPromise runs and we exit with code 1.

The whole point of this PR is to fix the case where validation/startup failures were invisible — but startup failures of the server process itself are the most likely thing to fail and are still being reported as success.

Suggestion: await server.start() before constructing/returning the blockingPromise. The promise that's actually blocking is the onExit/SIGINT promise, not start. Roughly:

const server = createDevServer(config, { port: actualPort, envVars: mergedEnvVars, callbacks: devCallbacks }); await server.start(); // throws here propagate as failure telemetry const serverPromise = new Promise<void>((resolve, reject) => { // wire up onExit / SIGINT here, but no server.start() inside });

The same concern likely applies to runBrowserMode (line 487) — if it has any synchronous "prepare server" phase before its forever-block, that should be awaited before the success metric is emitted. The integ test on line 90 of dev-server.test.ts only verifies the metric is emitted after the server is reachable (happy path); it wouldn't catch the failure case.

That means withCommandRunTelemetry records exit_reason: success for the dev_action: server metric before the server has actually started.

This is intentional. Once we start the server, its a blocking call with await so we'll never emit. This is the exact issue this PR is aiming to address.

agentcore-cli-automation · 2026-05-26T13:11:37Z

@@ -422,34 +425,30 @@ export const registerDev = (program: Command) => {
                  server.kill();


Behavior change worth confirming is intentional: the original --logs mode onExit did process.exit(code ?? 0), so a non-zero server exit produced an exit code matching the server's. The new code (line 407–411 above) rejects with new Error('Server exited with code N') and:

Goes through the outer catch at line 509–512 and exits with code 1 (regardless of the server's actual exit code).

Prints Error: Server exited with code N to stderr — new user-facing output.

Telemetry won't classify this as a meaningful category — it's a plain Error, so error_name will be UnknownError.

If this exit-code change is intentional, fine — but if you want telemetry to capture this as a real category, throw a BaseError subclass (a new DevServerExitError or similar) so classifyError recognizes it. Plain Error is the same bucket as any random thrown string.

This feels acceptable IMO. In the future we can model this error if we see a lot of UnknownError coming from dev.

agentcore-cli-automation · 2026-05-26T13:11:37Z

-    );
+    const result = await fn();
+    if (client) {
+      const merged = result._telemetryAttrs ? { ...attrs, ...result._telemetryAttrs } : attrs;


The doc comment on line 83 says "Returned attrs are merged with the upfront attrs (returned takes priority)" but the test cases in client.test.ts only verify override behavior with attrs that share keys. There's no test that actually covers the merge case (returned attrs partially overlapping with upfront, expecting non-overlapping upfront attrs to be preserved).

In practice every call site in command.tsx returns the same shape of attrs as it passed up front (with the same keys), so the merge behavior is never exercised — this is effectively override-only today. Either drop the merge and document this as override-only (simpler — make _telemetryAttrs always be a full CommandAttrs<C>), or add a test for the partial-merge case so the documented behavior is actually verified.

This is a case that isn't actually supported, but should be. Addressed in follow-up with test verifying behavior.

agentcore-devx-automation · 2026-05-26T13:14:39Z

Claude Security Review: no high-confidence findings. (run)

agentcore-devx-automation · 2026-05-26T13:24:00Z

Claude Security Review: no high-confidence findings. (run)

agentcore-devx-automation · 2026-05-26T13:36:51Z

Claude Security Review: no high-confidence findings. (run)

Hweinstock · 2026-05-26T13:13:20Z

            }
          );
-          if (!execResult.success) throw execResult.error;
+          // TODO: Remove cast once withCommandRunTelemetry's return type is narrowed


right now withCommandRunTelemetry handles thrown errors and failed results to be compatible in multiple places.

This creates some confusion not only mentally but also with types, so want to address this in a future pr.

Hweinstock · 2026-05-26T13:16:55Z

  command: C,
  attrs: CommandAttrs<C>,
-  fn: () => R | Promise<R>
+  fn: () => (R & { _telemetryAttrs?: CommandAttrs<C> }) | Promise<R & { _telemetryAttrs?: CommandAttrs<C> }>


this is the follow-up mentioned in the TODO comment. All of these are band aid fixes to the real problem.

Hweinstock · 2026-05-26T13:23:14Z

+        );
+        // TODO: Remove cast once withCommandRunTelemetry's return type is narrowed
+        if (!serverResult.success) throw (serverResult as unknown as { error: Error }).error;
+        if ('blockingPromise' in serverResult) await serverResult.blockingPromise;


The real issue is that this check is completely redundant since all paths return the promise. Addressed by removing conditional.

Hweinstock · 2026-05-26T13:24:21Z

+
+              return {
+                success: true as const,
+                blockingPromise: serverPromise,


That means withCommandRunTelemetry records exit_reason: success for the dev_action: server metric before the server has actually started.

This is intentional. Once we start the server, its a blocking call with await so we'll never emit. This is the exact issue this PR is aiming to address.

Hweinstock · 2026-05-26T13:25:55Z

@@ -422,34 +425,30 @@ export const registerDev = (program: Command) => {
                  server.kill();


This feels acceptable IMO. In the future we can model this error if we see a lot of UnknownError coming from dev.

Hweinstock · 2026-05-26T13:34:05Z

-    );
+    const result = await fn();
+    if (client) {
+      const merged = result._telemetryAttrs ? { ...attrs, ...result._telemetryAttrs } : attrs;


This is a case that isn't actually supported, but should be. Addressed in follow-up with test verifying behavior.

avi-alpert · 2026-05-26T14:21:23Z

-              return { success: true };
+              return {
+                success: true as const,
+                _telemetryAttrs: {


I find the _telemetryAttrs property confusing, at first glance it looks like you declare the telemetry properties twice and it's not clear what gets overriden vs what gets merged.

What if we instead pass in context from withCommandRunTelemetry so the async function can progressively define the attributes as we learn more about the request?

withCommandRunTelemetry('dev', async (ctx) => { ctx.setAttrs({ agent_protocol: 'unknown', dev_action: 'server' }); const project = await loadProjectConfig(workingDir); if (!project) throw new NoProjectError(); const protocol = targetAgent.protocol; ctx.setAttrs({ agent_protocol: protocol.toLowerCase() }); return { success: true, blockingPromise: serverPromise }; });

In this example its very clear to the reader what telemetry is being sent at any given point in the code.

Yeah, I was wresting with the same decision since this is the setup we had in toolkits, but I wanted to avoid making the properties mutable to keep it simpler. For example,

we set attributes in multiple places that overlap, making it non-obvious what the finals state is.

the callback throws an exception, in which case we don't really know which attributes are set yet or not.

the ctx must be passed down the call tree which can get pretty gnarly and clutter function interfaces.

I agree the mutable argument to the callback is more powerful, but it also introduces some state management complexity that I wanted to avoid.

I don't think I named these variables quite right to convey the idea. The attributes passed in upfront are fallback (where if it throws, or can't determine an attribute), then the ones are returned are the resolved attributes that take priority. I'm thinking about them as static vs dynamic, i.e. fallback are ones we can determine without running the callback, the returned ones require the callback to run. Maybe I could name them as such to make this more obvious?

After chatting, I think the complexity is worth it and a more familiar pattern to other devs making it more readable. I've swapped the implementation over to use a recorder that allows this.

agentcore-devx-automation · 2026-05-26T16:46:04Z

Claude Security Review: no high-confidence findings. (run)

Hweinstock added 3 commits May 25, 2026 17:29

feat(telemetry): allow dynamic attribute merging

6681bd8

refactor: avoid killing the process on validation errors

dc2d620

fix(telemetry): capture dev validation failures by wrapping at top-level

5e1bc2d

github-actions Bot added the size/l PR size: L label May 26, 2026

Hweinstock temporarily deployed to e2e-testing May 26, 2026 13:07 — with GitHub Actions Inactive

github-actions Bot added the agentcore-harness-reviewing AgentCore Harness review in progress label May 26, 2026

agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026

agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026

agentcore-cli-automation suggested changes May 26, 2026

View reviewed changes

github-actions Bot removed the agentcore-harness-reviewing AgentCore Harness review in progress label May 26, 2026

fix: avoid defaulting to http protocol on telemetry failures

6f559ab

github-actions Bot added size/l PR size: L and removed size/l PR size: L labels May 26, 2026

Hweinstock temporarily deployed to e2e-testing May 26, 2026 13:14 — with GitHub Actions Inactive

agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026

agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026

fix: remove redundant property check

e41c716

github-actions Bot added size/l PR size: L and removed size/l PR size: L labels May 26, 2026

Hweinstock temporarily deployed to e2e-testing May 26, 2026 13:23 — with GitHub Actions Inactive

agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026

agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026

feat(telemetry): support partial attribute overrides

fd69292

github-actions Bot removed the size/l PR size: L label May 26, 2026

Hweinstock temporarily deployed to e2e-testing May 26, 2026 13:36 — with GitHub Actions Inactive

github-actions Bot added the size/l PR size: L label May 26, 2026

agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026

agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026

Hweinstock commented May 26, 2026

View reviewed changes

Hweinstock marked this pull request as ready for review May 26, 2026 13:42

Hweinstock requested a review from a team May 26, 2026 13:42

avi-alpert reviewed May 26, 2026

View reviewed changes

refactor: leverage mutable recorder in favor of immutable returns

bb2fad0

github-actions Bot added size/l PR size: L and removed size/l PR size: L labels May 26, 2026

Hweinstock temporarily deployed to e2e-testing May 26, 2026 16:45 — with GitHub Actions Inactive

agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026

agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label May 26, 2026

avi-alpert approved these changes May 26, 2026

View reviewed changes

Hweinstock merged commit 69b4ce6 into aws:main May 26, 2026
27 checks passed

Hweinstock deleted the telemetry-dynamic-attrs branch May 26, 2026 17:20

		@@ -422,34 +425,30 @@ export const registerDev = (program: Command) => {
		server.kill();

Conversation

Hweinstock commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue

Documentation PR

Type of Change

Testing

Checklist

Uh oh!

agentcore-devx-automation Bot commented May 26, 2026

Uh oh!

github-actions Bot commented May 26, 2026

Package Tarball

How to install

Uh oh!

agentcore-cli-automation left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

agentcore-devx-automation Bot commented May 26, 2026

Uh oh!

agentcore-devx-automation Bot commented May 26, 2026

Uh oh!

agentcore-devx-automation Bot commented May 26, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

agentcore-devx-automation Bot commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Hweinstock commented May 26, 2026 •

edited

Loading