feat(cost): VS Code Copilot Chat export support + Cost Compare + MCP reachability #103

Open
Jfhelin wants to merge 43 commits into jayparikh:main from Jfhelin:jfhelin/cost-view-pr

Jfhelin commented May 25, 2026

Summary

Adds first-class support for VS Code Copilot Chat exports (copilot_all_prompts_*.json) in the Cost view, plus a Cost tab on the Compare view, a 'Copy for LLM analysis' export, and an MCP server reachability banner.

Since the initial draft, this PR has grown with substantial Cost-view polish driven by real-world use: smarter tool-row headlines, workspace-root path inference, a tool-result mis-attribution bug fix, cached-row collapse in the LLM inspector, system-prompt diffing in Compare, sub-agent surfacing, and synthesized rows for orphan tool calls. Sized for a single PR (the maintainer's recent merges #94 and #101 were similar scale), but happy to split.

What this PR adds

Single-session Cost view for Copilot Chat exports

Auto-detected when a copilot_all_prompts_*.json file is loaded. A 3-column timeline with three lenses:

  • CTX -- token context buildup per call
  • NET -- truly-new vs cache-recommit tokens per call
  • BILLED -- USD / AI Credits cost per call

Plus per-call cache analysis with unexpected-cache-miss diagnosis ("tool defs changed: X, Y, Z" vs "likely TTL expiry"), tool definition shape classifier (built-in vs MCP, router vs direct), image token cost estimation, and an Overhead toggle.

Cost Compare (two-session)

Side-by-side cost analysis when both compared sessions are Copilot Chat exports: headline verdict, A/B/delta cards, per-bucket cost waterfall sorted by absolute delta, behavioral KPI table (path-noise-resistant metrics), cache-pollution warnings, run drift detection (model / tools / per-section system-prompt diff), and rule-driven recommendations.

Copy for LLM analysis

A button on both the single-session Cost view and the Compare Cost tab that copies a self-contained structured-facts + markdown summary of the analysis to the clipboard. Designed for paste-into-chat sharing with another LLM for deeper investigation. The Compare export now includes a per-section system-prompt block diff.

MCP server reachability banner

When a Copilot Chat export carries an mcpServers array, the Cost view shows a banner like: "3 of 8 listed MCP servers produced all 130 mcp_* tools the model saw; the other 5 produced 0". It uses a heuristic label-slug match against mcp_<slug>_* tool names.

New since initial draft

Cost-view UX polish (driven by hands-on use of the view):

  • Per-tool-row smart headlines -- file ops show workspace-relative paths (read_file ./pages/Products.tsx), file writes/edits show size deltas (+12 / -3 lines or 84 lines, 4.6k chars), manage_todo_list shows progress (2/7 done then in-progress title), list_dir strips inferred workspace root, plus an 8-layer generic fallback for unknown tools (URL, path, action+name, command, body size, array preview, etc.).
  • Tool-result chip -- each tool row shows "1.1k tok -> Tool results" (where the bytes will land in the next LLM call), with a colored swatch matching the destination bucket.
  • Cached-row collapse -- in the expanded LLM inspector, the History and Tool-results lists now split cached vs new; cached rows hide behind an "N reused from prior calls" header so the diff is what jumps out.
  • Sub-agent surfacing -- multi-sub-agent threads are now distinct cards with color-coded LLM pills linking each call to its parent. Orphan tool calls (no matching LLM row) get a synthesized virtual row with an honest inspector so they're not silently dropped.
  • System anatomy -- system prompts decomposed into sub-agents, MCP tool-prefix instructions, and other named blocks; per-block char/token attribution.
  • Response section -- collapsible reasoning + tool-call rows, output-token sub-bucket breakdown, visible-reply leads the LLM call header.
  • Image pill on LLM rows when this call introduces new image tokens.

Bug fix: Tool results were being paired to tool calls by ordinal index inside the running prompt history, which meant late-turn tool calls got matched to the OLDEST results in the conversation (e.g., the manage_todo_list row was showing a read_file response). Now paired by toolCallId with a fallback for the __vscode-N host-side suffix.
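
A minimal sketch of the id-based pairing described above. The `ToolCall` and `ToolResult` shapes, the `pairResults` name, and the exact form of the host-side suffix (`__vscode-<digits>` at the end of the id) are assumptions for illustration, not the parser's actual types.

```typescript
// Hypothetical shapes; the real export's field names may differ.
interface ToolCall { toolCallId: string; name: string }
interface ToolResult { toolCallId: string; content: string }

// Assumption: the host appends "__vscode-<digits>" to the id on one side only.
const HOST_SUFFIX = /__vscode-\d+$/;

function stripHostSuffix(id: string): string {
  return id.replace(HOST_SUFFIX, "");
}

// Pair each call with its result by id, never by position in the history.
function pairResults(calls: ToolCall[], results: ToolResult[]): Map<string, ToolResult> {
  const exact = new Map<string, ToolResult>();
  const bare = new Map<string, ToolResult>();
  for (const r of results) {
    exact.set(r.toolCallId, r);
    bare.set(stripHostSuffix(r.toolCallId), r);
  }
  const paired = new Map<string, ToolResult>();
  for (const c of calls) {
    // Exact id first, then the suffix-insensitive fallback.
    const hit = exact.get(c.toolCallId) ?? bare.get(stripHostSuffix(c.toolCallId));
    if (hit) paired.set(c.toolCallId, hit);
  }
  return paired;
}
```

Because the lookup is keyed on the id, a late-turn call can no longer pick up the oldest result in the conversation just because it shares an ordinal position with it.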

Architecture

  • New files only for the Copilot Chat path -- no existing parsers touched.
  • Existing files modified additively (no behavior changes for Claude Code / Copilot CLI flows):
    • parseSession.ts: auto-detect copilot-chat-export.
    • pricing.js: per-model cacheReadRatio / cacheWriteRatio for OpenAI accuracy.
    • CostView.jsx: delegates to CostViewChatExport when a Copilot Chat export is loaded.
    • CompareView.jsx: adds a Cost tab when both sides are Copilot Chat exports.
    • AppLandingState.jsx + SessionProvider.jsx + one line in App.jsx: drag-drop two files at once to enter Compare landing immediately.

V2 placement

Feature still lives in V1 only. AnalyzeShell is the natural home for V2; InlineCompare could grow a Cost tab the same way CompareView did. Happy to do the V2 port in this PR or as a follow-up.

Tests

  • 11+ new test files, ~75 new tests covering parser, cache analysis, tool shape, MCP reachability, LLM export, comparison, snapshot roundtrip, and the tool-result pairing regression.
  • Existing costAnalysis.test.js extended with cache analysis cases.

Verified on latest commit:

  • npx tsc --noEmit clean
  • npm run build clean
  • 1043/1044 tests pass (1 pre-existing skip)

Scope check

Open to splitting into a stack of smaller PRs if preferred -- clean cuts at (parser + cost view) -> (compare + cost compare) -> (LLM export) -> (cost-view UX polish + sub-agent surfacing). Let me know.

Backstory

I've been using AGENTVIZ on a fork (Jfhelin/agentviz) to experiment with VS Code Copilot Chat exports and build my own intuition for what drives per-call cost -- context buildup, cache recommits, tool-definition footprint, MCP server overhead -- so I can speak with more confidence about UBB cost drivers for end customers. The views in this PR (and the polish that came after) are what I ended up wanting while doing that.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Jfhelin and others added 4 commits May 25, 2026 18:35
…reachability

Adds a dedicated Cost view for VS Code Copilot Chat exports
(`copilot_all_prompts_*.json`), a Cost tab for the Compare view, a
'Copy for LLM analysis' export, and an MCP server reachability banner.

## What this adds

### Single-session Cost view (Copilot Chat exports)

Auto-detected when a `copilot_all_prompts_*.json` file is loaded.
A dedicated 3-column timeline view with three lenses:

- **CTX**  -- token context buildup per call
- **NET**  -- truly-new vs cache-recommit tokens per call
- **BILLED** -- USD / AI Credits cost per call

Plus:
- Per-call cache analysis with unexpected cache-miss diagnosis
  ("tool defs changed: X, Y, Z" vs "tool defs unchanged - likely
  TTL expiry")
- Tool definition shape classifier (built-in vs MCP, router vs
  direct, per-server token cost)
- MCP server reachability banner that flags declared MCP servers
  whose tools never appeared in any chat request
- Image token cost estimation for attached screenshots
- Overhead toggle to hide non-user calls (title, summarization, etc.)
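
The cache-miss diagnosis above could be sketched roughly as follows. This assumes each call carries its tool definitions as name plus serialized schema; the `ToolDef` shape and the `diagnoseCacheMiss` name are hypothetical, and the real logic in `cacheAnalysis.ts` may weigh more signals.

```typescript
// Hypothetical shape: one tool definition as sent with a call.
interface ToolDef { name: string; schema: string }

// Compare the tool definitions of the previous and current call and
// return one of the two diagnosis strings quoted in the description.
function diagnoseCacheMiss(prev: ToolDef[], curr: ToolDef[]): string {
  const before = new Map<string, string>();
  prev.forEach((t) => before.set(t.name, t.schema));
  const after = new Map<string, string>();
  curr.forEach((t) => after.set(t.name, t.schema));

  const changed: string[] = [];
  // Added or modified definitions.
  after.forEach((schema, name) => {
    if (before.get(name) !== schema) changed.push(name);
  });
  // Removed definitions.
  before.forEach((_schema, name) => {
    if (!after.has(name)) changed.push(name);
  });

  if (changed.length === 0) return "tool defs unchanged - likely TTL expiry";
  return "tool defs changed: " + changed.sort().join(", ");
}
```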

### Cost Compare (two-session)

Side-by-side cost analysis when both compared sessions are Copilot
Chat exports: headline verdict, A/B/delta cards, per-bucket cost
waterfall (sorted by absolute delta), behavioral KPI table
(path-noise-resistant metrics), cache-pollution warnings, run drift
detection (model / tools / system text hash divergence), and
rule-driven recommendations.

### Copy for LLM analysis

A button in the single-session Cost view and the Compare Cost tab
that copies a self-contained structured-facts + markdown summary of
the analysis to the clipboard. Designed for paste-into-chat sharing
with another LLM for deeper investigation.

### MCP server reachability

When a Copilot Chat export carries an `mcpServers` array (the IDE's
declared servers), the Cost view shows a banner like:
"3 of 8 listed MCP servers produced all 130 mcp_* tools the model
saw; the other 5 produced 0". Uses a heuristic label-slug match
against `mcp_<slug>_*` tool names.
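
As an illustration of the label-slug heuristic, here is a small sketch. The `McpServer` shape, the `slugify` rule (lowercase, non-alphanumerics collapsed to `_`), and the function names are assumptions; the actual matching in `mcpServerReachability.ts` may differ.

```typescript
// Hypothetical shape of one entry in the export's mcpServers array.
interface McpServer { label: string }

interface Reachability {
  perServer: { label: string; toolCount: number }[];
  reachable: number;    // servers with at least one matching tool
  listed: number;       // servers declared in the export
  mcpToolCount: number; // all mcp_* tools the model saw
}

// Assumed slug rule: lowercase, runs of non-alphanumerics become "_".
function slugify(label: string): string {
  return label.toLowerCase().replace(/[^a-z0-9]+/g, "_").replace(/^_+|_+$/g, "");
}

function mcpReachability(servers: McpServer[], toolNames: string[]): Reachability {
  const mcpTools = toolNames.filter((n) => n.indexOf("mcp_") === 0);
  const perServer = servers.map((s) => {
    const prefix = "mcp_" + slugify(s.label) + "_";
    // Heuristic: a prefix match can over-count when one slug is a prefix of another.
    const toolCount = mcpTools.filter((n) => n.indexOf(prefix) === 0).length;
    return { label: s.label, toolCount };
  });
  const reachable = perServer.filter((s) => s.toolCount > 0).length;
  return { perServer, reachable, listed: servers.length, mcpToolCount: mcpTools.length };
}
```

The banner text can then be assembled from `reachable`, `listed`, and `mcpToolCount`.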

## Architecture

- New files only for the Copilot Chat export path:
  `src/lib/copilotChatExportParser.ts`, `src/lib/cacheAnalysis.ts`,
  `src/lib/toolDefinitionShape.ts`, `src/lib/mcpServerReachability.ts`,
  `src/lib/llmAnalysisExport.ts`, `src/lib/compareCost.ts`,
  `src/lib/exportComparison.ts`, `src/lib/runDisplayName.ts`,
  `src/lib/imageTokenEstimate.js`,
  `src/components/CostViewChatExport.jsx`,
  `src/components/CostCompare.jsx`.

- Existing files modified additively:
  - `src/lib/parseSession.ts`: auto-detect `copilot-chat-export`.
  - `src/lib/sessionLibrary.js`, `src/lib/sessionTypes.ts`:
    recognize the new format.
  - `src/lib/pricing.js`: per-model `cacheReadRatio` /
    `cacheWriteRatio` for OpenAI accuracy.
  - `src/lib/theme.js` + `.d.ts`, `docs/color-palette.html`,
    `docs/ui-ux-style-guide.md`: `theme.cost.*` tokens.
  - `src/components/CostView.jsx`: delegates to
    `CostViewChatExport` when a Copilot Chat export is loaded.
  - `src/components/CompareView.jsx`: adds a Cost tab when both
    sides are Copilot Chat exports.
  - `src/components/app/AppLandingState.jsx`: drag-drop two files
    at once to enter Compare landing immediately.
  - `src/contexts/SessionProvider.jsx`: new `handleFilePair`
    callback supporting the drag-drop-pair flow.
  - `src/App.jsx`: one line - pass `handleFilePair` to
    `AppLandingState` as `onLoadPair`.

V2 placement: the new view is currently V1-only. V2's
`AnalyzeShell` would be the natural home for the single-session view
and `InlineCompare` would gain a Cost tab the same way V1's
`CompareView` does -- happy to follow up if maintainer wants the
V2 integration in the same PR or in a separate one.

## Tests

11 new test files, 73 new tests covering parser, cache analysis,
tool shape, MCP reachability, LLM export, comparison, and snapshot
roundtrip. Existing `costAnalysis.test.js` extended with cache
analysis cases.

Verified:
- `npx tsc --noEmit` clean
- `npm run build` clean
- 1018/1019 tests pass (1 pre-existing skip)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s, docs

- Add 23 new theme.cost.* tokens (chipBg*, miss*, ok*, recommit*, switch*,
  pillLlm/Subagent/Tool) to both dark and light themes, plus theme.d.ts
- Add ctxImages to theme.d.ts (was already in runtime theme)
- Replace 4 hardcoded hex literals in CostViewChatExport.jsx event pills
  with the new pill* tokens; pill text uses theme.text.primary
- Fix magic number fontSize: 9 -> theme.fontSize.xs on the ROUTER chip
- Remove unnecessary || '#eab308' defensive fallback (semantic.warning
  is always defined)
- Remove two console.debug/console.warn calls from copilotChatExportParser
- Fix 'Last verified: May 2026' typo -> May 2025 in pricing.js
- Document all new cost tokens in docs/ui-ux-style-guide.md
- Add swatches for all new cost tokens in docs/color-palette.html
  (both dark and light sections)
- Update CLAUDE.md file tree with new lib modules (cacheAnalysis,
  toolDefinitionShape, compareCost, llmAnalysisExport, exportComparison,
  mcpServerReachability, copilotChatExportParser) and new components
  (CostViewChatExport, CostCompare)

Verified: tsc --noEmit clean, npm run build clean, 1018/1019 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previous solid-bg pills with theme.text.primary had poor contrast
(dark text on mid-saturation backgrounds looked flat). Switch to
the established tinted-chip pattern already used elsewhere in the
Cost view: tinted background + saturated foreground + subtle border.

- LLM call: chipBgAssistant bg + accent.primary fg
- Subagent: chipBgExtension bg + kindExtension fg
- Tool: chipBgBuiltin bg + kindBuiltin fg

Removes the now-redundant pillLlm/pillSubagent/pillTool tokens from
theme.js, theme.d.ts, color-palette.html, and the style guide. The
pills now self-document via the existing chip palette.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix remaining 'May 2026' typos -> 'May 2025' in pricing.js (2 inline
  comments), llmAnalysisExport.ts (verification comment), and README.md
  (file tree pricing.js description)
- Document theme.cost.ctxImages in docs/color-palette.html (dark + light
  swatches: #C77BC2 / #9b4f97) and in the Context-bucket colors table in
  docs/ui-ux-style-guide.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Jfhelin Jfhelin marked this pull request as ready for review May 25, 2026 17:18
Jfhelin and others added 25 commits May 26, 2026 08:50
…anel

When two Copilot Chat exports' system prompts differ (hashes don't match),
the drift panel now shows exactly which top-level <tag> blocks differ
between the runs — e.g. an <instruction forToolsWithPrefix="mcp_azure">
block present in one run but not the other.

Surfaces:
  - 'only A' / 'only B' rows for blocks present on one side only
  - 'size changed' rows for same-key blocks with differing char counts
  - per-row char delta sorted by absolute magnitude, descending
  - hover tooltip with the first 200 chars of the block body

Implementation:
  - copilotChatExportParser: new extractSystemBlocks() walks the system
    text and emits every top-level <tag>...</tag> block (depth-aware,
    so <skill> inside <skills> is not double-counted). New SystemBlock
    field added to ClassifiedCall and to the public LLM-call event.
  - compareCost: RunFingerprint carries systemBlocks; new
    buildSystemPromptDiff() emits a per-key diff sorted by |delta|.
    Diff is attached to DriftReport.systemPromptDiff.
  - CostCompare: new SystemPromptDiffDetail renders under the System
    prompt drift row when hasBlockDrift is true; collapses long lists
    to the top 6 with 'show N more'.

Solves the user-reported confusion: 'I have two runs with the same
system prompt but one is reported as smaller' — turned out two
<instruction> blocks for Azure/Bicep MCP servers were present in one
run and absent in the other. The new diff makes that immediately visible.

Tests: 15 new unit tests covering nesting, attributes, malformed input,
preview truncation, identical/only-A/only-B/chars-differ classification,
delta-descending sort, and the exact real-world scenario.
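
For readers unfamiliar with depth-aware extraction, a simplified sketch of the idea follows. It is not the code in `copilotChatExportParser`: the `SystemBlock` fields shown here, the use of "tag name plus attributes" as the block key, and the tag regex are assumptions, and it only tolerates stray closing tags, not unclosed openers.

```typescript
// Hypothetical fields; the real SystemBlock may carry more.
interface SystemBlock { key: string; chars: number; preview: string }

function extractSystemBlocks(text: string): SystemBlock[] {
  // Groups: closing slash, tag name, attributes, self-closing slash.
  const tagRe = /<(\/?)([A-Za-z][\w-]*)((?:\s[^<>]*?)?)(\/?)>/g;
  const blocks: SystemBlock[] = [];
  const stack: string[] = [];
  let start = -1;
  let bodyStart = -1;
  let key = "";
  let m: RegExpExecArray | null;
  while ((m = tagRe.exec(text)) !== null) {
    const closing = m[1] === "/";
    const name = m[2] || "";
    const attrs = (m[3] || "").trim();
    if (m[4] === "/") continue; // self-closing tags never open a block
    if (!closing) {
      if (stack.length === 0) {
        // Depth 0 -> 1: this tag starts a new top-level block.
        start = m.index;
        bodyStart = m.index + m[0].length;
        key = attrs ? name + " " + attrs : name;
      }
      stack.push(name);
    } else if (stack.length > 0 && stack[stack.length - 1] === name) {
      stack.pop();
      if (stack.length === 0) {
        // Back at depth 0: emit the whole block, nested tags included.
        const end = m.index + m[0].length;
        const preview = text.slice(bodyStart, m.index).trim().slice(0, 200);
        blocks.push({ key, chars: end - start, preview });
      }
    }
    // Closing tags that do not match the top of the stack are ignored.
  }
  return blocks;
}
```

Tracking depth is what keeps a `<skill>` nested inside `<skills>` from being reported as its own block.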

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per-section 'tokens' values in the System anatomy panel are pro-rata
estimates of this run's total prompt_tokens, derived as
`chars / sysChars * sysTok`. Because the run-wide tok/char ratio
varies (BPE tokenization is content-dependent), a section with
identical chars between two runs will show different 'tokens' values
in each run — which made users (correctly) suspect the displayed
token numbers.

This commit makes the single-session view honest about that:

  - Each section's always-visible summary row now shows BOTH the
    exact char count and the token estimate, with the latter clearly
    labeled '~N tok est'.
  - A one-line explainer under the header tells the user: chars are
    exact, per-section tokens are pro-rata estimates that shift when
    other sections change.
  - Per-row tooltips on the new columns explain the same.
  - Sub-rows inside expanded sections (each <scaffolding> tag, each
    skill, each instruction file, each file attachment) also now lead
    with chars and label tokens as estimates.

Net effect: when comparing two runs side-by-side, users can rely on
the chars column to spot real content changes. The token estimate is
still shown for cost intuition, but it no longer masquerades as
ground truth.
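
A small worked example of the pro-rata formula `chars / sysChars * sysTok`, to show why identical chars yield different token estimates across runs. The function name and the numbers are made up for illustration.

```typescript
// Pro-rata estimate: a section's share of system chars times the
// run's system token count. Only the chars input is exact.
function estimateSectionTokens(sectionChars: number, sysChars: number, sysTok: number): number {
  if (sysChars <= 0) return 0;
  return Math.round((sectionChars / sysChars) * sysTok);
}

// Same 2,000-char section in two runs with different tok/char ratios.
const estRunA = estimateSectionTokens(2000, 10000, 2500); // ratio 0.25 tok/char
const estRunB = estimateSectionTokens(2000, 12000, 3300); // ratio 0.275 tok/char
console.log(estRunA, estRunB);
```

The section did not change between the two hypothetical runs, yet its estimate moves because other sections shifted the run-wide ratio.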

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 'Copy for LLM analysis' export for Compare now carries the per-section
system-prompt diff data we surface in the UI. Emitted as a 'System prompt
block diff' section right after 'Run drift', whenever the parser surfaced
top-level block data on both sides AND at least one block differs.

Contents (deterministic, no prose):
  - Tagged chars on each side + signed delta
  - Untagged plaintext chars (preamble + interstitial text between tags)
    on each side + signed delta — derived from systemPromptChars minus
    taggedCharsA/B, so the analyst can see whether drift lives in tagged
    blocks or in the prose between them.
  - Markdown table: one row per block, columns = status / block key /
    chars A / chars B / Δ chars. Rows are pre-sorted by |delta| desc by
    the diff builder. 'only-A' / 'only-B' rows show the side's char
    count as a signed value so the table is easy to scan.
  - Footnote calling out that per-section token counts shown elsewhere
    are pro-rata estimates and that chars are the ground truth.

This keeps the analysis prompt itself unchanged — the new content lives
in the data section the prompt instructs the model to reason over.
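
The exported section could be produced along these lines. This is a sketch under assumptions: the type and function names are hypothetical, the delta is taken as B minus A (so an 'only-A' row is negative), and duplicate block keys on one side are summed.

```typescript
type DiffStatus = "only-A" | "only-B" | "chars-differ";
interface BlockSize { key: string; chars: number }
interface DiffRow { status: DiffStatus; key: string; charsA: number; charsB: number; delta: number }

function sizesByKey(blocks: BlockSize[]): Map<string, number> {
  const out = new Map<string, number>();
  blocks.forEach((b) => out.set(b.key, (out.get(b.key) ?? 0) + b.chars));
  return out;
}

// One row per differing block, sorted by |delta| descending.
function buildSystemPromptDiff(a: BlockSize[], b: BlockSize[]): DiffRow[] {
  const mapA = sizesByKey(a);
  const mapB = sizesByKey(b);
  const rows: DiffRow[] = [];
  mapA.forEach((charsA, key) => {
    const charsB = mapB.get(key);
    if (charsB === undefined) {
      rows.push({ status: "only-A", key, charsA, charsB: 0, delta: -charsA });
    } else if (charsB !== charsA) {
      rows.push({ status: "chars-differ", key, charsA, charsB, delta: charsB - charsA });
    }
  });
  mapB.forEach((charsB, key) => {
    if (!mapA.has(key)) rows.push({ status: "only-B", key, charsA: 0, charsB, delta: charsB });
  });
  return rows.sort((x, y) => Math.abs(y.delta) - Math.abs(x.delta));
}

// Untagged plaintext = total system prompt chars minus chars inside tagged blocks.
function untaggedChars(systemPromptChars: number, taggedChars: number): number {
  return systemPromptChars - taggedChars;
}

function renderDiffTable(rows: DiffRow[]): string {
  const signed = (n: number): string => (n > 0 ? "+" + n : String(n));
  const lines = [
    "| status | block key | chars A | chars B | Δ chars |",
    "| --- | --- | ---: | ---: | ---: |",
  ];
  rows.forEach((r) => {
    lines.push(`| ${r.status} | ${r.key} | ${r.charsA} | ${r.charsB} | ${signed(r.delta)} |`);
  });
  return lines.join("\n");
}
```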

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Includes:
  - Cost view + Cost Compare (Copilot Chat export support)
  - MCP reachability column
  - Per-section system-prompt diff in Compare drift panel
  - Single-session anatomy: leads with chars, marks token counts as estimates
  - 'Copy for LLM analysis' export carries the block diff data

Tarball published as GitHub Release v1.1.0-fork.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t breakdowns

Cost-view KPI cards:
- Setup overhead card with per-bucket cost for unused tools and skills
- Output card split into 3 lines (visible / thinking / tool-args) with cr
- Tool calls card: top 3 tools on separate lines + ellipsis row
- Tooltip hierarchy: visible by prose/fenced-code-by-language,
  tool-args by tool name

Per-prompt Cost-by-component (Response bucket):
- Collapsed summary now shows 'visible X cr · thinking Y cr · tool-args Z cr'
- Expanded view: 3 flat rows aligned to bucket header columns
  (label / cr / % / inline detail)
- Proportional allocation eliminates the 'unattributed' residual
- Inline detail: prose+fenced-code for visible, top tools for tool-args
- Drops redundant 'across N calls' sample line

Parser:
- fencedCodeStats returns per-language breakdown
- Per-event toolArgCharsByName and codeCharsByLang
- Session totals: toolArgCharsByName, codeCharsByLang, codeChars

llmAnalysisExport:
- output_composition.tool_args_by_tool: top 10 tools with est_tokens
- output_composition.visible_code_by_language: top 5 languages
- Exports detectUnusedTools, aggregateSkillCarry for KPI reuse
- Fix detectUnusedTools return type: declare unusedTokensPerCall

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The system prompt anatomy panel was missing two block types that the parser
silently rolled into "Other / unclassified system text":

- <agents> — sub-agent declarations the model uses with runSubagent
- <instruction forToolsWithPrefix="..."> — MCP-injected tool usage rules

Both cost prompt tokens on every call but were invisible in the UI.

Parser (copilotChatExportParser.ts):
- Add SubAgent and ToolPrefixInstruction interfaces
- extractSubAgents: parse <agents><agent><name/description/argumentHint>
- extractToolPrefixInstructions: parse <instruction forToolsWithPrefix="X">
- Wire subAgents and toolPrefixInstructions through ClassifiedCall and CostAnalysisCall

UI (CostViewChatExport.jsx renderSystemAnatomy):
- New "Sub-agents (N)" row with name + hover for description/argumentHint
- New "MCP / tool-prefix instructions (N)" row keyed by prefix with body preview
- Both rows feed into classifiedChars so "Other" shrinks accordingly
- Empty-check expanded so the panel still renders when only these new sections exist

Cart fixture impact: surfaces the 1,848-char mcp_azure instruction and the
719-char <agents> block (Explore sub-agent) that were previously hidden.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
VS Code's Copilot Chat export sometimes omits the `request` log entry
for the first LLM round-trip of a user turn, while still logging the
`toolCall` entries it dispatched. The result was a confusing timeline
where runSubagent (or other tool calls) appeared before any LLM row.

We now detect orphan toolCalls before the first request, recover the
response text from the next request's message history (last role=2
assistant content), and synthesize a virtual LLM-call row in the
timeline. The row is labeled 'LLM (synth)' with a muted pill and a
tooltip explaining the reconstruction. It carries zero usage/cost
(VS Code didn't log it) and is excluded from KPI totals, cache
analysis, and projections -- it only appears in the timeline so users
can see which LLM call dispatched the orphan tools.
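
The recovery step can be sketched as below for a single user turn. The log-entry shapes, the `SynthRow` fields, and the function name are assumptions for illustration; only the `request` / `toolCall` entry kinds and the role=2 convention come from the description above.

```typescript
// Hypothetical shapes for one user turn's log entries.
interface Message { role: number; content: string }
interface RequestEntry { kind: "request"; model: string; messages: Message[] }
interface ToolCallEntry { kind: "toolCall"; tool: string }
type LogEntry = RequestEntry | ToolCallEntry;

interface SynthRow {
  label: "LLM (synth)";
  model: string;              // inferred from the next request, if any
  responseText: string;       // recovered from the next request's history
  dispatched: string[];       // orphan tool calls this row accounts for
  promptTokens: 0;
  outputTokens: 0;
  excludedFromTotals: true;   // never counted in KPIs or cache analysis
}

function synthesizeOrphanRow(turn: LogEntry[]): SynthRow | null {
  const orphans: string[] = [];
  let next: RequestEntry | null = null;
  for (const entry of turn) {
    if (entry.kind === "request") { next = entry; break; }
    orphans.push(entry.tool); // toolCall logged before any request
  }
  if (orphans.length === 0) return null;

  let responseText = "";
  let model = "unknown";
  if (next) {
    model = next.model;
    // Last role=2 (assistant) message in the next request's history.
    for (const msg of next.messages.slice().reverse()) {
      if (msg.role === 2) { responseText = msg.content; break; }
    }
  }
  return {
    label: "LLM (synth)", model, responseText, dispatched: orphans,
    promptTokens: 0, outputTokens: 0, excludedFromTotals: true,
  };
}
```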

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
VS Code's Copilot Chat export uses `name = "tool/runSubagent"` for LLM
calls that run inside a spawned subagent (the inner panel surface).
This was confusing: rows looked like the model was emitting a
`runSubagent` tool request, when in fact the LLM was running inside
a previously-dispatched subagent invocation.

Changes:
1. Relabel: `tool/runSubagent` LLM rows now get a distinct purple
   'SUBAGENT LLM' pill and a 'Subagent turn' friendly label. The
   smart turn label (→ tools dispatched) and step counter still
   apply, so each row reads as 'Subagent turn → list_dir, read_file'.
2. Link: the parser matches subagent prompts to their parent
   runSubagent toolCall by comparing args.prompt to the subagent's
   user-message text, populating a new `invokedBy` field. The first
   LLM row of a subagent prompt shows '← invoked by prompt N ·
   runSubagent (description)' as a clickable jump that scrolls the
   parent row into view and briefly highlights it.
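
The prompt-matching link might look like the following sketch. The shapes and names are hypothetical, and normalizing whitespace before comparing is an added assumption; the description only says `args.prompt` is compared to the subagent's user-message text.

```typescript
// Hypothetical shapes for the parent tool call and the subagent prompt.
interface RunSubagentCall { promptIndex: number; prompt: string; description: string }
interface SubagentPrompt { promptIndex: number; userText: string }
interface LinkedSubagent extends SubagentPrompt {
  invokedBy: { promptIndex: number; description: string } | null;
}

// Assumption: compare after trimming and collapsing whitespace runs.
const normalizePrompt = (s: string): string => s.trim().replace(/\s+/g, " ");

function linkSubagents(calls: RunSubagentCall[], subs: SubagentPrompt[]): LinkedSubagent[] {
  return subs.map((sub) => {
    let invokedBy: LinkedSubagent["invokedBy"] = null;
    for (const call of calls) {
      if (normalizePrompt(call.prompt) === normalizePrompt(sub.userText)) {
        invokedBy = { promptIndex: call.promptIndex, description: call.description };
        break; // first matching parent wins
      }
    }
    return { ...sub, invokedBy };
  });
}
```

A subagent whose prompt matches no `runSubagent` call keeps `invokedBy` as `null`, so the UI can simply omit the jump link.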

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the cost-view inspector is opened on a synthesized LLM row, it
previously rendered the normal 3-box layout with zeros for every
numeric field (prompt tok, output tok, cost, cache split, per-bucket
breakdown). That was misleading -- zero looks like 'this call was
free', when the truth is 'VS Code never wrote the request log so we
don't know'.

Synth rows now get a dedicated inspector:
* Top banner explaining that the request log is missing
* Three dashed/striped boxes labeled 'unknown' for Prompt sent,
  Reply written, and Cost (45-degree repeating-linear-gradient is
  the standard 'no data' pattern; symmetric with regular rows so
  layout doesn't collapse)
* A 'Recovered facts' card with the model name (marked '(inferred)'
  since it's pulled from the next request) and the list of tool
  calls this LLM call dispatched
* The recovered response text from the next request's message history

No misleading cache-miss or recommit callouts are emitted for synth
rows, and the per-bucket new-input breakdown is omitted entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each subagent thread already gets a per-letter color in the thread
badge at the top of its prompt card ('SUB A', 'SUB B', ...). Make
the per-LLM-call pill inside that thread use the same color and
label, so they're visually grouped:

  Before: orange 'SUBAGENT LLM' pill on every subagent LLM row
  After:  same-color 'SUB B LLM' pill (matching the 'SUB B' badge)

Falls back to the generic orange 'Subagent LLM' label when the
thread layout doesn't have a letter (e.g. only one subagent across
the whole session, or when buildAgentThreads can't slot it).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The per-bucket drilldown section in the LLM inspector only showed
buckets whose content was NEW on this call. That hid useful detail:
once a subagent's tool results were folded into the context on
step 1, they became cached on step 2+ and disappeared from the
drilldown -- with no way to read what the subagent actually
returned.

Now any bucket with non-zero token content (new OR cached) is
listed. Cached-only buckets render '· cached' in place of the
'X% of new' suffix and use no '+' prefix on the token count.

Renamed the section header from 'What's new in this call' to
'Context buildup' and added a 'N more cached' suffix to the count
line.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Response section in the LLM Detail inspector previously dumped every
reasoning block and tool call inline at full width, which made multi-step
turns scroll forever -- and the model often emits the SAME reasoning
text before each parallel tool_use, doubling or tripling the noise.

Now:

- Each reasoning block is a collapsible row showing 'before <toolName>'
  with the first non-empty line as the preview. Click to expand and read
  the full text untruncated.
- Consecutive identical blocks (same text + same target tool) collapse
  to a single row tagged 'x N identical'. The section header reports
  total blocks and unique count when they differ.
- Each tool call is a collapsible row with the tool-name pill and the
  smart args summary as the preview. Expand to see the pretty-printed
  args (JSON.stringify with 2-space indent; single-key string args
  render as 'key:\n<value>' so long file paths or prompts wrap nicely).
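
Collapsing consecutive identical reasoning blocks is essentially a run-length pass. A sketch, with hypothetical type and function names:

```typescript
interface ReasoningBlock { text: string; beforeTool: string }
interface ReasoningRow extends ReasoningBlock { count: number; preview: string }

function firstNonEmptyLine(text: string): string {
  for (const line of text.split("\n")) {
    if (line.trim().length > 0) return line.trim();
  }
  return "";
}

// Merge runs of consecutive blocks with the same text and target tool.
function collapseReasoning(blocks: ReasoningBlock[]): ReasoningRow[] {
  const rows: ReasoningRow[] = [];
  for (const b of blocks) {
    const last = rows[rows.length - 1];
    if (last && last.text === b.text && last.beforeTool === b.beforeTool) {
      last.count += 1;
    } else {
      rows.push({ ...b, count: 1, preview: firstNonEmptyLine(b.text) });
    }
  }
  return rows;
}

// Row label as described: 'before <toolName>', tagged when collapsed.
function reasoningRowLabel(row: ReasoningRow): string {
  const base = "before " + row.beforeTool;
  return row.count > 1 ? base + " x " + row.count + " identical" : base;
}
```

Only adjacent duplicates merge, so the same text reappearing later in the turn still gets its own row.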

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Response section used a small uppercase header ('RESPONSE (2.0K
OUTPUT TOK)') over a loose dashed text block, while every input box
above it (Current prompt, Tool definitions, System scaffolding, etc.)
used the DetailSection card pattern with colored swatch + label on the
left and token count on the right. The result was a visual gear shift
right at the moment the user crossed from 'context sent in' to 'reply
that came out'.

Now the entire Response section is wrapped in a DetailSection-style
card (bg.surface, 1px border, 4px radius, 10/12 padding) with the
standard header layout: purple cost.output swatch + 'Response' label
on the left, '2.0k output tok' right-aligned. The response text block
is full-width inside the card with a solid (not dashed) subtle border.

Reasoning and tool-call subsections inside the card are de-emphasized
so they read as nested children rather than competing cards: no
background fill, no heavy left stripe, separated by a 1px subtle top
divider, and the colored accent now appears as a small 6x6 swatch in
the subsection header to echo the outer card's swatch motif.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 'Reply written' KPI box already splits output tokens into three
buckets -- visible to user, thinking, tool-call args -- estimated from
char share. The aggregate output bucket in the per-call drilldown uses
the same split. The Response section header was only showing the
total, so the reader had to scroll back up to the KPI strip to see how
the spend broke down for THIS particular call.

Now the right side of the Response header shows total + per-bucket
breakdown with matching swatch colors:

  2.0k output tok   o 1.2k visible   o 0.5k thinking   o 0.3k tool args

Reuses the same color mapping as the existing 3-bucket pattern
(theme.cost.fresh = visible, theme.cost.output = thinking,
theme.cost.ctxToolDefs = tool args). Char-based estimate falls back to
reasoning-token split when no per-block char data is available.
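
A sketch of the char-share split with the fallback. The names are hypothetical, and the exact fallback behavior (thinking takes the reported reasoning tokens, the rest counts as visible) is an assumption about what "reasoning-token split" means.

```typescript
interface OutputChars { visible: number; thinking: number; toolArgs: number }
interface OutputSplit { visible: number; thinking: number; toolArgs: number }

function splitOutputTokens(outputTokens: number, chars: OutputChars, reasoningTokens?: number): OutputSplit {
  const totalChars = chars.visible + chars.thinking + chars.toolArgs;
  if (totalChars === 0) {
    // Assumed fallback: no per-block char data, use the reasoning-token count.
    const thinkingOnly = Math.min(reasoningTokens ?? 0, outputTokens);
    return { visible: outputTokens - thinkingOnly, thinking: thinkingOnly, toolArgs: 0 };
  }
  const thinking = Math.round((outputTokens * chars.thinking) / totalChars);
  // Clamp so rounding can never push the visible remainder below zero.
  const toolArgs = Math.min(
    Math.round((outputTokens * chars.toolArgs) / totalChars),
    outputTokens - thinking,
  );
  // Visible takes the remainder so the three buckets always sum to the total.
  return { visible: outputTokens - thinking - toolArgs, thinking, toolArgs };
}
```

With 2,000 output tokens and chars of 4,800 / 2,000 / 1,200, this reproduces the 1.2k / 0.5k / 0.3k header shown above.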

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The visible reply was rendered as a full-width prose block while
reasoning and tool calls were collapsible rows with first-line
previews. With multi-paragraph replies this dominated the inspector
even when the user only wanted to scan for tool calls below.

Visible text now uses the same CollapsibleRow pattern: a header showing
a green cost.fresh swatch + 'visible' label on the left and the first
non-empty line of the reply as the preview. Click to expand the full
response text. Collapsed by default, matching reasoning + tool-call
rows so the entire Response card has consistent triage ergonomics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If the full response text is the same as the first non-empty line
(common for short replies like 'Done.' or 'Now I have all the context
I need. Let me draft the plan.'), the expand chevron previously
hinted at content that wasn't there. Clicking it revealed the same
single line in a styled panel.

Now we detect that case and render the visible row as a plain
non-collapsible row: still the green swatch + 'visible' label + the
line itself, but no chevron and no click-to-expand. The grid keeps
the 14px chevron slot empty so it aligns vertically with the
reasoning + tool-call rows beneath it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The collapsed header for LLM call rows used the smart turn label as
the primary line: either a quoted first line of the response OR an
arrow list of tool calls ('-> read_file x2'). When both existed the
tool list won, and the actual reply text -- the highest-signal hint
about what the model just did -- was hidden inside the inspector.

Header now leads with the model's visible reply (first non-empty line,
markdown-stripped, truncated to 90 chars) and follows with a smaller
mono '-> tool x N' chip. The tool list keeps the same summarization
(top 3 names + '+N more') and a tooltip with the full list.

When the model emitted no visible text (silent tool turn), the primary
slot reads '(no visible reply)' in muted italic, followed by the same
tool chip. The row keeps the same vertical rhythm regardless.

Pushed to expanded-only (visible when the row is open):
  - The raw VS Code surface name (e.g. 'panel/editAgent')
  - The OS/workspace environment chip
  - The model short name in the metric strip
These are useful for diagnosing rare cross-model or cross-workspace
issues but not relevant to scanning a session top-down.

Kept always visible:
  - LLM CALL / Sub X LLM pill, Step N of M, Plan/mode pill, vision
    pill, overhead pill, unexpected-cache-miss warning
  - The full metric strip (ctx / net new / cached / billed-new / cost)
  - The 'invoked by' link on the first LLM row of a subagent prompt

Added stripLeadingMarkdown + firstVisibleSnippet helpers so the
preview handles '# Plan' style headers, blockquotes, bullets, and
ordered lists by walking forward to the first line with real content.
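
One plausible shape for the two helpers, as a sketch only: the regexes, the "..." truncation marker, and the choice to keep header text (so '# Plan' previews as 'Plan') are assumptions rather than the actual implementation.

```typescript
// Remove leading block-level markdown markers from a single line.
function stripLeadingMarkdown(line: string): string {
  return line
    .replace(/^\s*(?:>\s*)*/, "")          // indentation and blockquote markers
    .replace(/^#{1,6}\s+/, "")             // ATX header markers
    .replace(/^(?:[-*+]|\d+[.)])\s+/, "")  // bullet or ordered-list markers
    .trim();
}

// Walk forward to the first line that still has content after stripping.
function firstVisibleSnippet(text: string, max: number = 90): string {
  for (const raw of text.split("\n")) {
    const line = stripLeadingMarkdown(raw);
    if (line.length > 0) {
      return line.length > max ? line.slice(0, max - 3) + "..." : line;
    }
  }
  return ""; // caller can render '(no visible reply)' for this case
}
```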

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tool-call rows used a generic raw-JSON dump as the args preview (e.g.
'manage_todo_list {"todoList":[{"id":1,"status":"in-progress"...')
and wasted the second line on the literal phrase 'tool call' plus the
result token count. Almost no signal for the reader.

Now the headline reads like the LLM row above it: mono-bold tool name
+ a smart human summary of the most useful field, with the result
token count promoted to a small chip on the right of the title.

Per-tool summarizers (smartToolHeadline):
  read_file/create_file/edit_file/etc. -> file basename
  list_dir/ls                          -> directory path
  grep/semantic_search/file_search     -> search query
  bash/shell (via parsed.command)      -> short command line
  manage_todo_list                     -> 'N todos . X in-progress, Y pending, Z done'
  runSubagent                          -> '<agent> . "<prompt snippet>"'
  unknown                              -> first key:value of args, truncated

The full raw JSON stays available as the tooltip for power users. The
'tool call -> N tok of result' second line is hidden when the row is
collapsed and shown when the row is open (small redundant breadcrumb,
but harmless).
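
The per-tool table above could be sketched as a switch over the tool name. Only smartToolHeadline and the tool names come from this commit; the argument keys (filePath, path, query, command) and the 60-character limit are assumptions, and the todo/subagent cases are left out for brevity.

```typescript
// Hypothetical sketch of a per-tool headline summarizer; argument keys are assumed.
type ToolArgs = Record<string, unknown>;

const str = (v: unknown): string => (typeof v === "string" ? v : "");
const truncate = (s: string, n = 60): string => (s.length > n ? s.slice(0, n - 1) + "…" : s);
const basename = (p: string): string => p.replace(/[\\/]+$/, "").split(/[\\/]/).pop() || p;

function smartToolHeadline(tool: string, args: ToolArgs): string {
  switch (tool) {
    case "read_file":
    case "create_file":
    case "edit_file":
      return basename(str(args["filePath"]) || str(args["path"]));
    case "list_dir":
    case "ls":
      return str(args["path"]);
    case "grep":
    case "semantic_search":
    case "file_search":
      return truncate(str(args["query"]));
    case "bash":
    case "shell":
      // Keep only the first line of a multi-line command.
      return truncate(str(args["command"]).split("\n")[0] ?? "");
    default: {
      // Unknown tool: first key:value pair, truncated.
      const key = Object.keys(args)[0];
      if (key === undefined) return "";
      const value = args[key];
      return truncate(`${key}: ${typeof value === "string" ? value : JSON.stringify(value)}`);
    }
  }
}
```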

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The vision pill rendered on every call that had any image attached,
even when all images were reused from prior calls (cached). On
multi-step turns that meant the same '1 (~1.6k tok)' chip appeared
on Step 2, 3, 4, ... long after the image actually entered the
context. It implied per-call vision cost that wasn't really being
charged for that step.

Pill is now omitted when newImages.length is 0. When at least one
new image is present it reads '+N' to match the '+net new' convention
elsewhere, the token estimate prefers newImageVisionTokens when the
parser supplies it (falling back to visionTokensTotal), and the
tooltip notes how many more were carried in from earlier calls.
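
The pill rule above, as a small sketch. newImages, newImageVisionTokens and visionTokensTotal are named in this commit; the CallImages shape, carriedImageCount, and the exact label and tooltip wording are assumptions.

```typescript
// Hypothetical sketch of the vision-pill decision; the interface shape is assumed.
interface CallImages {
  newImages: unknown[]; // images first seen on this call
  carriedImageCount: number; // assumed field: images reused from earlier calls
  newImageVisionTokens?: number; // preferred estimate when the parser supplies it
  visionTokensTotal: number; // fallback estimate
}

function formatTok(n: number): string {
  return n >= 1000 ? `${(n / 1000).toFixed(1)}k` : String(n);
}

function visionPill(c: CallImages): { label: string; tooltip: string } | null {
  // No new images on this call: omit the pill entirely.
  if (c.newImages.length === 0) return null;
  const tokens = c.newImageVisionTokens ?? c.visionTokensTotal;
  const label = `+${c.newImages.length} (~${formatTok(tokens)} tok)`;
  const tooltip =
    c.carriedImageCount > 0
      ? `${c.carriedImageCount} more carried in from earlier calls`
      : "all images are new on this call";
  return { label, tooltip };
}
```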

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two quick wins on the tool-row headline summarizer.

memory tools: previously fell through to the generic '<first-key>:
<truncated-value>' fallback which showed implementation-level noise.
Now extract action/name/content from common memory tool shapes (the
parser may use {action, name, content}, {command, key, value}, etc.)
and render them as 'create . <name> . "<short body>"' so the row
reads like a sentence about what's being remembered.

manage_todo_list: appended ' . now: "<title>"' showing the title of
the first in-progress task. The reader gets a sense of what work the
agent thinks it's doing right now without expanding the row. Counts
still come first since they answer 'how much is left?' at a glance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The right-side result chip on tool rows just read '1.0k tok' which
hides whether the tool returned a small JSON ack or a large file dump.
Char count is the more intuitive 'how big was this' measurement for
file/listing tools, and reminding the reader where these tokens land
(the NEXT LLM call's input) makes the cost causality obvious from the
collapsed view.

Chip now reads '1.0k tok . 41,283 chars -> next call' when resultChars
is known, falling back to just '1.0k tok' when the parser didn't
capture char count. Tooltip is expanded to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The combined 'tok . chars -> next call' chip varied in width depending
on whether resultChars was known, so the rightmost edge of the token
count danced from row to row and the eye couldn't scan a column.

Split into two pieces: chars text (plain muted mono, no chip) flushes
to the right via marginLeft auto, and the token count chip sits to
its right. When chars are unknown the chip itself takes marginLeft
auto and still anchors the right edge. Result: the 'N tok' chip is
always at the same horizontal position across rows.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The chip said 'chars -> next call' which was correct but vague. The
parser already classifies tool outputs into the dedicated tool_results
bucket on the subsequent LLM call (they arrive as role:'tool'
messages, distinct from the History bucket which holds prior
user/assistant turns). Spelling that out gives the reader the cost
trail at a glance and matches the bucket terminology used everywhere
else in the Cost view.

Chip now reads 'N chars -> [swatch] Tool results' using the same color
swatch as the Tool results bucket in the prompt breakdown. Tooltip
explains the role:'tool' message mechanism and contrasts it with
History.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Old format read '7 todos · 1 in-progress, 2 done · now: "..."' --
five comma-separated pieces of information competing for the same
line. The progress signal (how far through the list) was buried.

New format leads with the fraction:
  manage_todo_list  2/7 done · ▶ "Create CartPage component"

Rules:
- Always show N/M done as the primary progress indicator.
- Append ✓ when the list is fully complete (7/7 done ✓).
- Append 'X blocked' only when non-zero.
- Append ▶ '<title>' for the in-progress item when one exists;
  the play arrow reads as 'currently active' without spelling it
  out. Pending count is implied (M - done - in-progress) so we
  drop it from the headline.
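
The rules above can be sketched as a single formatter. The 'in-progress' status string appears in the export sample quoted earlier; 'completed' and 'blocked' as status values, the Todo shape, and the function name are assumptions.

```typescript
// Hypothetical sketch of the fraction-first todo headline; status strings are assumed.
interface Todo {
  title: string;
  status: string; // e.g. "in-progress", "completed", "blocked"
}

function todoHeadline(todos: Todo[]): string {
  const total = todos.length;
  const done = todos.filter((t) => t.status === "completed").length;
  const blocked = todos.filter((t) => t.status === "blocked").length;
  const active = todos.find((t) => t.status === "in-progress");

  // N/M done always leads; the check mark marks a fully complete list.
  const parts = [`${done}/${total} done${total > 0 && done === total ? " ✓" : ""}`];
  if (blocked > 0) parts.push(`${blocked} blocked`);
  if (active) parts.push(`▶ "${active.title}"`);
  // Pending count is implied, so it is not emitted.
  return parts.join(" · ");
}
```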

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Jfhelin and others added 14 commits May 27, 2026 17:35
Bug: the per-call result chip and expanded preview routinely showed
content that didn't belong to the tool row. Most visible case --
'manage_todo_list' rows showing thousands of chars of unrelated text
like 'Here's a complete breakdown of the Navigation component...'.

Cause: copilotChatExportParser paired pending tool calls with role-3
tool_result messages by ordinal index within the request. But
cls.toolResultMsgs is built from EVERY role-3 message in this call's
prompt history -- i.e., results of every tool call across the entire
conversation up to this point. As the conversation grew, pendingToolCalls
(0..N from this round) got paired with the FIRST N tool_result
messages, which were almost always from the earliest turn.

Fix: every role-3 message carries the originating toolCallId, and
each toolCall log's id is the same toolu_* value. Capture toolCallId
on toolResultMsgs and match by id in the pairing loop, with an
ordinal fallback only when ids are missing on either side.
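
A rough sketch of id-based pairing with an ordinal fallback. The ToolCall and ToolResultMsg shapes and the pairResults name are illustrative, not the parser's actual types, and ids are compared verbatim here.

```typescript
// Hypothetical sketch; the parser's real types and pairing loop may differ.
interface ToolCall {
  id?: string;
  name: string;
}
interface ToolResultMsg {
  toolCallId?: string;
  content: string;
}

function pairResults(calls: ToolCall[], results: ToolResultMsg[]): Array<ToolResultMsg | undefined> {
  const byId = new Map<string, ToolResultMsg>();
  for (const r of results) {
    if (r.toolCallId) byId.set(r.toolCallId, r);
  }
  // Fall back to positional pairing only when ids are missing on either side.
  const idsEverywhere = calls.every((c) => !!c.id) && results.every((r) => !!r.toolCallId);
  return calls.map((c, i) => (idsEverywhere && c.id ? byId.get(c.id) : results[i]));
}
```

With ids present, a call from the current round picks up its own result even when earlier results sit ahead of it in the prompt history.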

Result: chars/tokens/preview/full now reflect the actual response of
the tool that produced them.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tool-row headlines for list_dir used to show the full absolute path
(e.g. '/Users/jfhelin/Code/GitHub/octodemo/octocat_supply-psychic-d'),
which is mostly noise -- the interesting part is the relative
location inside the workspace.

Infer workspace root as the longest common directory prefix across
every absolute path arg in the session (filePath/path/file/uri/
directory/dir/cwd/...), then strip it from list_dir headlines. Root
itself renders as '.', sub-paths render relative.

Safety floor: only trust the inferred root when it has at least 4
meaningful path segments (e.g. /Users/<name>/Code/<repo>) and at
least 2 absolute paths were observed. Below that we leave paths
untouched so we never strip '/' or '/usr'.
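
A minimal sketch of the strict longest-common-prefix inference with the safety floor described above. The function names are made up, POSIX-style '/' separators are assumed, and the last segment of each path is treated as a leaf name.

```typescript
// Hypothetical sketch of strict longest-common-directory-prefix inference.
function dirSegments(absPath: string): string[] {
  // Drop empty pieces and the final segment (treated as the file or leaf name).
  return absPath.split("/").filter((s) => s.length > 0).slice(0, -1);
}

function inferWorkspaceRootStrict(paths: string[]): string | null {
  const abs = paths.filter((p) => p.startsWith("/"));
  if (abs.length < 2) return null; // floor: at least 2 absolute paths observed

  let common = dirSegments(abs[0] ?? "");
  for (const p of abs.slice(1)) {
    const segs = dirSegments(p);
    let i = 0;
    while (i < common.length && i < segs.length && common[i] === segs[i]) i++;
    common = common.slice(0, i);
  }
  // Floor: at least 4 segments, so '/' or '/usr' is never stripped.
  return common.length >= 4 ? "/" + common.join("/") : null;
}
```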

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previous commit matched tool-call ids exactly against role-3
toolCallId values, but the export carries them in two shapes:

  toolCall log: 'toolu_bdrk_<id>__vscode-<n>'
  role-3 msg:   'toolu_bdrk_<id>'

The '__vscode-<n>' suffix is added host-side. Exact-match therefore
never hit, every tool row lost its resultChars/resultTokens, and the
'N chars -> Tool results' chip disappeared.

Strip everything from the first '__' before matching on both sides.
The bare 'toolu_*' prefix is unique enough to be a reliable key.
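
The normalization step might look like the sketch below; the function name is hypothetical, and it would be applied to both the toolCall log id and the role-3 toolCallId before comparing.

```typescript
// Hypothetical helper: drop everything from the first "__" so both id shapes compare equal.
function normalizeToolCallId(id: string): string {
  const cut = id.indexOf("__");
  return cut === -1 ? id : id.slice(0, cut);
}
```

For example, 'toolu_bdrk_abc__vscode-3' and 'toolu_bdrk_abc' both normalize to 'toolu_bdrk_abc'.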

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two separate spans (chars + token chip) were noisier than needed.
The signal that matters for billing is the token count and where
it lands. Collapse to a single chip:

  1.1k tok -> [swatch] Tool results

Chars moved into the tooltip alongside the role:'tool' explanation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
History and Tool result lists in the expanded LLM call previously
rendered every row, with cached (carried-over) rows dimmed to ~55%
opacity. For long-running sessions this meant scrolling past dozens
of muted rows before reaching what's new this call.

Now cached rows are hidden behind a single dashed header at the top
of each list:

  > 24 cached messages (show)   reused from prior calls

Click to expand. New-this-call rows always render below, unaffected.
Per-list local state -- both lists can be expanded independently.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
File-mutating tool calls now show the size of the change in the
headline, dim-styled after a middle-dot separator:

  create_file              CartContext.tsx · 84 lines, 4.6k chars
  insert_edit_into_file    Navigation.tsx · +12 lines, 380 chars
  replace_string_in_file   Navigation.tsx · +12 / -3 lines
  apply_patch              · +23 / -8 lines
  read_file                Navigation.tsx \u00b7 lines 1-200

Sources:
- create_file / write     -> args.content|text|code length
- insert_edit_into_file / edit_file -> args.code|content length (treated
  as inserted lines, prefixed '+')
- replace_string_in_file  -> +newString lines / -oldString lines
- apply_patch / patch     -> count +/- lines in args.patch|diff|body
- read_file               -> args.startLine-endLine when present

Chars use k-suffix above 1000 to stay compact. Tools without a size
signal fall back to today's behavior (filename only). The right-side
'-> Tool results' token chip is unaffected.
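
A sketch of how the size suffix could be derived from the sources listed above. The tool names and argument keys follow that list; the function name, the trailing-newline handling, and the patch line counting are assumptions.

```typescript
// Hypothetical sketch of the change-size suffix; not the PR's actual code.
type Args = Record<string, unknown>;

const text = (args: Args, keys: string[]): string => {
  for (const k of keys) {
    const v = args[k];
    if (typeof v === "string") return v;
  }
  return "";
};
// One trailing newline does not count as an extra line.
const lineCount = (s: string): number => (s === "" ? 0 : s.replace(/\n$/, "").split("\n").length);
const compact = (n: number): string => (n >= 1000 ? `${(n / 1000).toFixed(1)}k` : String(n));

function changeSize(tool: string, args: Args): string {
  switch (tool) {
    case "create_file":
    case "write": {
      const body = text(args, ["content", "text", "code"]);
      return `${lineCount(body)} lines, ${compact(body.length)} chars`;
    }
    case "insert_edit_into_file":
    case "edit_file": {
      const body = text(args, ["code", "content"]);
      return `+${lineCount(body)} lines, ${compact(body.length)} chars`;
    }
    case "replace_string_in_file":
      return `+${lineCount(text(args, ["newString"]))} / -${lineCount(text(args, ["oldString"]))} lines`;
    case "apply_patch":
    case "patch": {
      const lines = text(args, ["patch", "diff", "body"]).split("\n");
      // Skip the '+++' / '---' file headers of a unified diff.
      const added = lines.filter((l) => l.startsWith("+") && !l.startsWith("+++")).length;
      const removed = lines.filter((l) => l.startsWith("-") && !l.startsWith("---")).length;
      return `+${added} / -${removed} lines`;
    }
    case "read_file": {
      const start = args["startLine"];
      const end = args["endLine"];
      return typeof start === "number" && typeof end === "number" ? `lines ${start}-${end}` : "";
    }
    default:
      return ""; // no size signal: headline stays filename-only
  }
}
```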

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously the fallback for any tool we hadn't special-cased was
just 'firstKey: firstValue', which often produced unreadable noise
('options: {recursive, ...}' on a delete tool, 'arg0: --flag' on a
shell wrapper, etc).

Replace with a layered probe that tries informative shapes in order
so most unknown tools get a reasonable headline for free:

  1. URL value          -> 'github.com/Jfhelin/agentviz'
  2. Path-like key      -> basename for files, relative dir otherwise
                           (stripped against inferred workspace root)
  3. Action + name pair -> 'delete · my-cache-key'
  4. Query/command key  -> truncated single-line text
  5. Body-like key      -> 'content: 42 lines, 1.2k chars'
                           (size signal instead of dumping contents)
  6. Array of strings   -> 'foo.ts +3 more' (basename for path arrays)
  7. Plain name/id      -> truncated string
  8. Final firstKey: value fallback as a last resort

Workspace-root stripping applies to all path-like values so unknown
file-touching tools benefit too. Probes short-circuit, so the
first useful signal wins.
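
The layered probe could be sketched as an ordered list of functions where the first non-null result wins. This sketch covers only probes 1 to 4 plus the final fallback; the key names it checks, the helper names, and the 60-character limit are assumptions, and workspace-root stripping is omitted.

```typescript
// Hypothetical sketch of a short-circuiting probe chain for unknown tools.
type Args = Record<string, unknown>;
type Probe = (args: Args) => string | null;

const clip = (s: string, n = 60): string => (s.length > n ? s.slice(0, n - 1) + "…" : s);
const firstString = (args: Args, keys: string[]): string | null => {
  for (const k of keys) {
    const v = args[k];
    if (typeof v === "string" && v.length > 0) return v;
  }
  return null;
};

const probes: Probe[] = [
  // 1. URL value: show host and path without the scheme.
  (a) => {
    for (const k of Object.keys(a)) {
      const v = a[k];
      if (typeof v === "string" && /^https?:\/\//.test(v)) {
        return v.replace(/^https?:\/\//, "").replace(/\/$/, "");
      }
    }
    return null;
  },
  // 2. Path-like key: basename.
  (a) => {
    const p = firstString(a, ["filePath", "path", "file"]);
    return p ? p.split("/").pop() || p : null;
  },
  // 3. Action + name pair.
  (a) => {
    const action = firstString(a, ["action"]);
    const name = firstString(a, ["name", "key"]);
    return action && name ? `${action} · ${name}` : null;
  },
  // 4. Query/command key: truncated first line.
  (a) => {
    const q = firstString(a, ["query", "command"]);
    return q ? clip(q.split("\n")[0] ?? "") : null;
  },
];

function genericHeadline(args: Args): string {
  for (const probe of probes) {
    const hit = probe(args);
    if (hit !== null) return hit; // first useful signal wins
  }
  // Last resort: firstKey: value.
  const key = Object.keys(args)[0];
  if (key === undefined) return "";
  const v = args[key];
  return clip(`${key}: ${typeof v === "string" ? v : JSON.stringify(v)}`);
}
```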

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previous strict longest-common-prefix returned '' when even a single
divergent path appeared in the session -- and there always is one in
practice (e.g. the 'memory' tool writes to '/memories/session/plan.md'
which shares no segments with the project root). The result: every
list_dir / read_file headline reverted to showing the full absolute
path.

Replace with a frequency-based pick: tally every directory prefix of
every absolute path, then return the longest prefix that:
- covers at least 80% of the paths, AND
- has at least 4 meaningful segments

This survives a handful of outliers while still requiring strong
majority evidence before trimming.
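
A sketch of the frequency-based pick, assuming POSIX-style paths and treating the last segment of each path as a leaf name. The function name and parameter defaults mirror the thresholds above but are otherwise illustrative.

```typescript
// Hypothetical sketch of frequency-based workspace-root inference.
function inferWorkspaceRoot(paths: string[], minCoverage = 0.8, minSegments = 4): string | null {
  const abs = paths.filter((p) => p.startsWith("/"));
  if (abs.length < 2) return null;

  // Tally every directory prefix (with at least minSegments segments) of every path.
  const counts: Record<string, number> = {};
  for (const p of abs) {
    const segs = p.split("/").filter((s) => s.length > 0).slice(0, -1);
    for (let n = minSegments; n <= segs.length; n++) {
      const prefix = "/" + segs.slice(0, n).join("/");
      counts[prefix] = (counts[prefix] ?? 0) + 1;
    }
  }

  // Longest prefix that still covers the required share of paths.
  let best: string | null = null;
  for (const prefix of Object.keys(counts)) {
    const covered = (counts[prefix] ?? 0) / abs.length >= minCoverage;
    if (covered && (best === null || prefix.length > best.length)) best = prefix;
  }
  return best;
}
```

Any two prefixes that each cover more than half of the paths must share a path and are therefore nested, so comparing string length is enough to find the deepest qualifying one.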

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
'list_dir api' was ambiguous -- could read as a parameter rather
than a path. Adding './' anchors the eye and matches shell muscle
memory:

  list_dir ./api
  list_dir ./frontend/src/components
  list_dir .                  (workspace root unchanged)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously read_file / create_file / edit_file etc. collapsed paths
to basename ('Products.tsx'), losing the directory context. Now
they show the workspace-relative path with a './' prefix:

  read_file              ./pages/Products.tsx
  create_file            ./context/CartContext.tsx · 84 lines, 4.6k chars
  replace_string_in_file ./components/Navigation.tsx · +12 / -3 lines
  list_dir               ./api

Paths outside the inferred workspace root still fall back to
basename so we don't show useless absolute prefixes.
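
The display rule reads roughly like the sketch below. displayPath is a made-up name; workspaceRoot is assumed to be an absolute path without a trailing slash, or null when no root was inferred.

```typescript
// Hypothetical sketch: workspace-relative display with a basename fallback.
function displayPath(absPath: string, workspaceRoot: string | null): string {
  const base = absPath.split("/").filter((s) => s.length > 0).pop() ?? absPath;
  if (!workspaceRoot) return base;
  // The root itself renders as '.'.
  if (absPath === workspaceRoot || absPath === workspaceRoot + "/") return ".";
  // Inside the root: './' plus the relative remainder.
  if (absPath.startsWith(workspaceRoot + "/")) {
    return "./" + absPath.slice(workspaceRoot.length + 1);
  }
  // Outside the root: basename only.
  return base;
}
```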

Also threads workspaceRoot into the expanded tool-call inspector
(ToolDetail / ToolArgsPreview) so the 'user-friendly view' shown
when a row is expanded uses the same stripped paths -- no more
mixed absolute/relative renderings between collapsed and expanded
views.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Conflicts in parseSession.ts and sessionTypes.ts were both
simple unions -- our copilot-chat-export entries and upstream's
new codex entries needed to coexist. Both kept.

Verified after merge: typecheck clean, 1059/1060 tests pass
(1 pre-existing skip), build clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…used)

Before: '52 unused tools ~15.3 cr'  -- hides the denominator
After:  '52/52 tools unused ~15.3 cr' -- the ratio jumps out

Same change for skills. Same line length, so layout is unaffected.
The fraction makes the underlying story much clearer:
- '52/52 tools unused' immediately reads as 'nothing was used'
- '3/52 tools unused' immediately reads as 'most paid off'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
aggregateSkillCarry was reading skills from the FIRST non-overhead
LLM call. In sessions where that first call is a lightweight kickoff
request that ships an empty skills array (even though every later
call attaches the full skill set), this returned skillCount: 0 and
the Setup-overhead box claimed there were no unused skills.

On the cart-implementation fixture this hid ~5,119 tok/call of skill
overhead across 36 calls -- roughly 184k tokens of cache-discounted
spend the user had no way to see.

Fix: sample from the LLM call carrying the MOST skills. Skill lists
don't shrink mid-session, so the per-call max is the steady state.
Survives both kinds of false-empty early calls (overhead title-gen
already filtered, plus lightweight kickoff requests).

Verified on 04-plan-implement-cart.json:
  Before: skillCount: 0,  unusedCount: 0
  After:  skillCount: 37, unusedCount: 37, ~5,119 tok/call
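
A sketch of the sampling change, plus the arithmetic behind the '~184k' figure. The LlmCall shape and pickSkillSample name are illustrative; the real aggregateSkillCarry may be structured differently.

```typescript
// Hypothetical sketch: sample skills from the non-overhead call carrying the most skills.
interface LlmCall {
  isOverhead: boolean;
  skills: string[];
}

function pickSkillSample(calls: LlmCall[]): LlmCall | undefined {
  let best: LlmCall | undefined;
  for (const c of calls) {
    if (c.isOverhead) continue; // overhead (e.g. title-gen) calls are filtered out
    if (best === undefined || c.skills.length > best.skills.length) best = c;
  }
  return best;
}

// Figures quoted in the commit message: 5,119 tok/call across 36 calls.
const tokensPerCall = 5119;
const callCount = 36;
const hiddenSkillTokens = tokensPerCall * callCount; // 184,284, i.e. roughly 184k
```

A lightweight kickoff call with an empty skills array no longer wins, because any later call with a non-empty list has a strictly larger count.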

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds up to 3 sub-lines under the LLM calls card showing how the
calls split across the main chat thread and each sub-agent that
was invoked. Falls back to today's '23 primary, 8 overhead'
sub-line when no sub-agents were spawned.

Example (cart fixture, 4 prompts, 2 sub-agents):
  LLM calls
  34
  23 main thread (incl. 8 overhead)
  ↳ 7  Explore frontend components an…
  ↳ 6  Explore frontend structure

If more than 3 sub-agents were spawned, a '… +N more' line
shows the rollup with each remaining thread name in the tooltip.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>