fix(telemetry): emit dev command telemetry before blocking#1375
Conversation
|
Claude Security Review: no high-confidence findings. (run) |
Package TarballHow to installgh release download pr-1375-tarball --repo aws/agentcore-cli --pattern "*.tgz" --dir /tmp/pr-tarball
npm install -g /tmp/pr-tarball/aws-agentcore-0.15.0.tgz |
agentcore-cli-automation
left a comment
There was a problem hiding this comment.
The dual-emit telemetry approach for blocking commands is a good idea, and the validation-error coverage is a real improvement. A few things I'd like to see addressed before this merges — most importantly the type-safety regression around _telemetryAttrs / blockingPromise and a couple of correctness concerns about when success is emitted relative to when the server is actually up.
| command: C, | ||
| attrs: CommandAttrs<C>, | ||
| fn: () => R | Promise<R> | ||
| fn: () => (R & { _telemetryAttrs?: CommandAttrs<C> }) | Promise<R & { _telemetryAttrs?: CommandAttrs<C> }> |
There was a problem hiding this comment.
The signature fn: () => (R & { _telemetryAttrs?: CommandAttrs<C> }) produces unsound return-type narrowing at the call sites. Because the dev command's callbacks only return success-shaped values (e.g. { success: true as const, blockingPromise, _telemetryAttrs }), R gets inferred as the success branch only. After await withCommandRunTelemetry(...) the caller has a result that TS believes is always { success: true, ... }, but the function's catch block on line 112 actually returns { success: false, error } — a value the caller's type system doesn't know about. That's why every call site in command.tsx has to do if (!execResult.success) throw (execResult as unknown as { error: Error }).error; with a TODO comment.
This is a real type safety hole — the cast bypasses TS's exhaustiveness checks, and a future refactor that drops one of those guards will silently swallow errors. Options:
- Constrain the callback to
() => Result | Promise<Result>(drop theRgeneric) and have callers narrow on the returnedResultdirectly. Pass_telemetryAttrsandblockingPromisevia an explicit second return value, e.g.() => Promise<{ result: Result; telemetryAttrs?: ...; blockingPromise?: Promise<void> }>. - Change the return type to
Promise<R | { success: false; error: Error }>so call sites get a proper discriminated union and don't need any cast. - Have the callback return its result and metadata as separate fields (e.g.
{ outcome: Result, attrs?: CommandAttrs<C>, blocking?: Promise<void> }) so the result type the caller observes is clean.
There was a problem hiding this comment.
this is the follow-up mentioned in the TODO comment. All of these are band aid fixes to the real problem.
| ); | ||
| // TODO: Remove cast once withCommandRunTelemetry's return type is narrowed | ||
| if (!serverResult.success) throw (serverResult as unknown as { error: Error }).error; | ||
| if ('blockingPromise' in serverResult) await serverResult.blockingPromise; |
There was a problem hiding this comment.
blockingPromise is being smuggled through withCommandRunTelemetry as an extra field on the success result. It works because the wrapper only strips _telemetryAttrs (line 105 of cli-command-run.ts), but:
- The wrapper's signature has no knowledge of
blockingPromise, so the caller has to use'blockingPromise' in serverResult— bypassing the type system. - If anyone later updates
withCommandRunTelemetryto be stricter about its return shape (e.g.Pick-ing only known keys), this silently breaks dev mode.
Worth making blockingPromise a first-class concept of withCommandRunTelemetry (e.g. callback returns { result, blockingPromise? } and the wrapper awaits it itself after flushing telemetry), or at minimum returning it as a separate value alongside the result rather than mixed in.
There was a problem hiding this comment.
The real issue is that this check is completely redundant since all paths return the promise. Addressed by removing conditional.
|
|
||
| return { | ||
| success: true as const, | ||
| blockingPromise: serverPromise, |
There was a problem hiding this comment.
In --logs mode, server.start() is called inside the Promise executor (around line 420) and not awaited before this blockingPromise is returned. That means withCommandRunTelemetry records exit_reason: success for the dev_action: server metric before the server has actually started. If server.start() rejects (port in use after our check, container runtime missing for Container builds, image pull failure, etc.), the success metric has already been emitted; the rejection only surfaces later when the outer await serverResult.blockingPromise runs and we exit with code 1.
The whole point of this PR is to fix the case where validation/startup failures were invisible — but startup failures of the server process itself are the most likely thing to fail and are still being reported as success.
Suggestion: await server.start() before constructing/returning the blockingPromise. The promise that's actually blocking is the onExit/SIGINT promise, not start. Roughly:
const server = createDevServer(config, { port: actualPort, envVars: mergedEnvVars, callbacks: devCallbacks });
await server.start(); // throws here propagate as failure telemetry
const serverPromise = new Promise<void>((resolve, reject) => {
// wire up onExit / SIGINT here, but no server.start() inside
});The same concern likely applies to runBrowserMode (line 487) — if it has any synchronous "prepare server" phase before its forever-block, that should be awaited before the success metric is emitted. The integ test on line 90 of dev-server.test.ts only verifies the metric is emitted after the server is reachable (happy path); it wouldn't catch the failure case.
There was a problem hiding this comment.
That means withCommandRunTelemetry records exit_reason: success for the dev_action: server metric before the server has actually started.
This is intentional. Once we start the server, its a blocking call with await so we'll never emit. This is the exact issue this PR is aiming to address.
| @@ -422,34 +425,30 @@ export const registerDev = (program: Command) => { | |||
| server.kill(); | |||
There was a problem hiding this comment.
Behavior change worth confirming is intentional: the original --logs mode onExit did process.exit(code ?? 0), so a non-zero server exit produced an exit code matching the server's. The new code (line 407–411 above) rejects with new Error('Server exited with code N') and:
- Goes through the outer
catchat line 509–512 and exits with code1(regardless of the server's actual exit code). - Prints
Error: Server exited with code Nto stderr — new user-facing output. - Telemetry won't classify this as a meaningful category — it's a plain
Error, soerror_namewill beUnknownError.
If this exit-code change is intentional, fine — but if you want telemetry to capture this as a real category, throw a BaseError subclass (a new DevServerExitError or similar) so classifyError recognizes it. Plain Error is the same bucket as any random thrown string.
There was a problem hiding this comment.
This feels acceptable IMO. In the future we can model this error if we see a lot of UnknownError coming from dev.
| ); | ||
| const result = await fn(); | ||
| if (client) { | ||
| const merged = result._telemetryAttrs ? { ...attrs, ...result._telemetryAttrs } : attrs; |
There was a problem hiding this comment.
The doc comment on line 83 says "Returned attrs are merged with the upfront attrs (returned takes priority)" but the test cases in client.test.ts only verify override behavior with attrs that share keys. There's no test that actually covers the merge case (returned attrs partially overlapping with upfront, expecting non-overlapping upfront attrs to be preserved).
In practice every call site in command.tsx returns the same shape of attrs as it passed up front (with the same keys), so the merge behavior is never exercised — this is effectively override-only today. Either drop the merge and document this as override-only (simpler — make _telemetryAttrs always be a full CommandAttrs<C>), or add a test for the partial-merge case so the documented behavior is actually verified.
There was a problem hiding this comment.
This is a case that isn't actually supported, but should be. Addressed in follow-up with test verifying behavior.
|
Claude Security Review: no high-confidence findings. (run) |
|
Claude Security Review: no high-confidence findings. (run) |
|
Claude Security Review: no high-confidence findings. (run) |
| } | ||
| ); | ||
| if (!execResult.success) throw execResult.error; | ||
| // TODO: Remove cast once withCommandRunTelemetry's return type is narrowed |
There was a problem hiding this comment.
right now withCommandRunTelemetry handles thrown errors and failed results to be compatible in multiple places.
This creates some confusion not only mentally but also with types, so want to address this in a future pr.
| command: C, | ||
| attrs: CommandAttrs<C>, | ||
| fn: () => R | Promise<R> | ||
| fn: () => (R & { _telemetryAttrs?: CommandAttrs<C> }) | Promise<R & { _telemetryAttrs?: CommandAttrs<C> }> |
There was a problem hiding this comment.
this is the follow-up mentioned in the TODO comment. All of these are band aid fixes to the real problem.
| ); | ||
| // TODO: Remove cast once withCommandRunTelemetry's return type is narrowed | ||
| if (!serverResult.success) throw (serverResult as unknown as { error: Error }).error; | ||
| if ('blockingPromise' in serverResult) await serverResult.blockingPromise; |
There was a problem hiding this comment.
The real issue is that this check is completely redundant since all paths return the promise. Addressed by removing conditional.
|
|
||
| return { | ||
| success: true as const, | ||
| blockingPromise: serverPromise, |
There was a problem hiding this comment.
That means withCommandRunTelemetry records exit_reason: success for the dev_action: server metric before the server has actually started.
This is intentional. Once we start the server, its a blocking call with await so we'll never emit. This is the exact issue this PR is aiming to address.
| @@ -422,34 +425,30 @@ export const registerDev = (program: Command) => { | |||
| server.kill(); | |||
There was a problem hiding this comment.
This feels acceptable IMO. In the future we can model this error if we see a lot of UnknownError coming from dev.
| ); | ||
| const result = await fn(); | ||
| if (client) { | ||
| const merged = result._telemetryAttrs ? { ...attrs, ...result._telemetryAttrs } : attrs; |
There was a problem hiding this comment.
This is a case that isn't actually supported, but should be. Addressed in follow-up with test verifying behavior.
| return { success: true }; | ||
| return { | ||
| success: true as const, | ||
| _telemetryAttrs: { |
There was a problem hiding this comment.
I find the _telemetryAttrs property confusing, at first glance it looks like you declare the telemetry properties twice and it's not clear what gets overriden vs what gets merged.
What if we instead pass in context from withCommandRunTelemetry so the async function can progressively define the attributes as we learn more about the request?
withCommandRunTelemetry('dev', async (ctx) => {
ctx.setAttrs({ agent_protocol: 'unknown', dev_action: 'server' });
const project = await loadProjectConfig(workingDir);
if (!project) throw new NoProjectError();
const protocol = targetAgent.protocol;
ctx.setAttrs({ agent_protocol: protocol.toLowerCase() });
return { success: true, blockingPromise: serverPromise };
});
In this example its very clear to the reader what telemetry is being sent at any given point in the code.
There was a problem hiding this comment.
Yeah, I was wresting with the same decision since this is the setup we had in toolkits, but I wanted to avoid making the properties mutable to keep it simpler. For example,
- we set attributes in multiple places that overlap, making it non-obvious what the finals state is.
- the callback throws an exception, in which case we don't really know which attributes are set yet or not.
- the
ctxmust be passed down the call tree which can get pretty gnarly and clutter function interfaces.
I agree the mutable argument to the callback is more powerful, but it also introduces some state management complexity that I wanted to avoid.
I don't think I named these variables quite right to convey the idea. The attributes passed in upfront are fallback (where if it throws, or can't determine an attribute), then the ones are returned are the resolved attributes that take priority. I'm thinking about them as static vs dynamic, i.e. fallback are ones we can determine without running the callback, the returned ones require the callback to run. Maybe I could name them as such to make this more obvious?
There was a problem hiding this comment.
After chatting, I think the complexity is worth it and a more familiar pattern to other devs making it more readable. I've swapped the implementation over to use a recorder that allows this.
|
Claude Security Review: no high-confidence findings. (run) |
Description
The
devcommand had two telemetry gaps:Blocking paths never flushed telemetry.
dev --logsand browser mode block forever. The process is killed beforewithCommandRunTelemetrycan flush. The workaround was eagerclient.emit()calls that always reportedexit_reason: success, but this is not ideal.Validation failures were invisible. Errors like "agent not found", "port in use", or "no project" happened outside
withCommandRunTelemetry, so no failure metric was emitted.This PR fixes both by:
_telemetryAttrssupport towithCommandRunTelemetry(callback can return dynamic attrs that merge with fallback)process.exit(1)to thrown errorswithCommandRunTelemetrywhere blocking promises are started (not awaited) inside the callback and awaited after telemetry flushesRelated Issue
Closes #
Documentation PR
N/A
Type of Change
Testing
How have you tested the change?
npm run test:unitandnpm run test:integnpm run typechecknpm run lintsrc/assets/, I rannpm run test:update-snapshotsand committed the updated snapshotsAdditional testing:
_telemetryAttrsmerge behaviorAGENTCORE_TELEMETRY_AUDIT=1Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.