feat(cost): VS Code Copilot Chat export support + Cost Compare + MCP reachability #103
Open
Jfhelin wants to merge 43 commits into
…reachability
Adds a dedicated Cost view for VS Code Copilot Chat exports
(`copilot_all_prompts_*.json`), a Cost tab for the Compare view, a
'Copy for LLM analysis' export, and an MCP server reachability banner.
## What this adds
### Single-session Cost view (Copilot Chat exports)
Auto-detected when a `copilot_all_prompts_*.json` file is loaded.
A dedicated 3-column timeline view with three lenses:
- **CTX** -- token context buildup per call
- **NET** -- truly-new vs cache-recommit tokens per call
- **BILLED** -- USD / AI Credits cost per call
Plus:
- Per-call cache analysis with unexpected cache-miss diagnosis
("tool defs changed: X, Y, Z" vs "tool defs unchanged - likely
TTL expiry")
- Tool definition shape classifier (built-in vs MCP, router vs
direct, per-server token cost)
- MCP server reachability banner that flags declared MCP servers
whose tools never appeared in any chat request
- Image token cost estimation for attached screenshots
- Overhead toggle to hide non-user calls (title, summarization, etc.)
### Cost Compare (two-session)
Side-by-side cost analysis when both compared sessions are Copilot
Chat exports: headline verdict, A/B/delta cards, per-bucket cost
waterfall (sorted by absolute delta), behavioral KPI table
(path-noise-resistant metrics), cache-pollution warnings, run drift
detection (model / tools / system text hash divergence), and
rule-driven recommendations.
### Copy for LLM analysis
A button in the single-session Cost view and the Compare Cost tab
that copies a self-contained structured-facts + markdown summary of
the analysis to the clipboard. Designed for paste-into-chat sharing
with another LLM for deeper investigation.
### MCP server reachability
When a Copilot Chat export carries an `mcpServers` array (the IDE's
declared servers), the Cost view shows a banner like:
"3 of 8 listed MCP servers produced all 130 mcp_* tools the model
saw; the other 5 produced 0". Uses a heuristic label-slug match
against `mcp_<slug>_*` tool names.
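A minimal sketch of the label-slug heuristic described above. The `McpServer` shape, the `slugify` rule, and the function names are assumptions for illustration; the actual logic lives in `src/lib/mcpServerReachability.ts` and may differ.

```typescript
// Hypothetical shape: the real `mcpServers` entries may carry more fields.
interface McpServer {
  label: string;
}

interface ReachabilityReport {
  reachable: string[];   // labels whose slug matched at least one tool
  unreachable: string[]; // labels with zero matching tools
  mcpToolCount: number;  // tools whose name starts with "mcp_"
}

// Assumed slug rule: lowercase, runs of non-alphanumerics collapsed to "_".
function slugify(label: string): string {
  return label
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "");
}

function checkReachability(servers: McpServer[], toolNames: string[]): ReachabilityReport {
  const mcpTools = toolNames.filter((name) => name.startsWith("mcp_"));
  const reachable: string[] = [];
  const unreachable: string[] = [];
  for (const server of servers) {
    const prefix = `mcp_${slugify(server.label)}_`;
    const seen = mcpTools.some((tool) => tool.startsWith(prefix));
    (seen ? reachable : unreachable).push(server.label);
  }
  return { reachable, unreachable, mcpToolCount: mcpTools.length };
}
```

Because the match is by name prefix only, a server whose label does not slugify to its tool prefix would be reported as unreachable, which is why the banner is described as heuristic.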
## Architecture
- New files only for the Copilot Chat export path:
`src/lib/copilotChatExportParser.ts`, `src/lib/cacheAnalysis.ts`,
`src/lib/toolDefinitionShape.ts`, `src/lib/mcpServerReachability.ts`,
`src/lib/llmAnalysisExport.ts`, `src/lib/compareCost.ts`,
`src/lib/exportComparison.ts`, `src/lib/runDisplayName.ts`,
`src/lib/imageTokenEstimate.js`,
`src/components/CostViewChatExport.jsx`,
`src/components/CostCompare.jsx`.
- Existing files modified additively:
- `src/lib/parseSession.ts`: auto-detect `copilot-chat-export`.
- `src/lib/sessionLibrary.js`, `src/lib/sessionTypes.ts`:
recognize the new format.
- `src/lib/pricing.js`: per-model `cacheReadRatio` /
`cacheWriteRatio` for OpenAI accuracy.
- `src/lib/theme.js` + `.d.ts`, `docs/color-palette.html`,
`docs/ui-ux-style-guide.md`: `theme.cost.*` tokens.
- `src/components/CostView.jsx`: delegates to
`CostViewChatExport` when a Copilot Chat export is loaded.
- `src/components/CompareView.jsx`: adds a Cost tab when both
sides are Copilot Chat exports.
- `src/components/app/AppLandingState.jsx`: drag-drop two files
at once to enter Compare landing immediately.
- `src/contexts/SessionProvider.jsx`: new `handleFilePair`
callback supporting the drag-drop-pair flow.
- `src/App.jsx`: one line - pass `handleFilePair` to
`AppLandingState` as `onLoadPair`.
V2 placement: the new view is currently V1-only. V2's
`AnalyzeShell` would be the natural home for the single-session view,
and `InlineCompare` would gain a Cost tab the same way V1's
`CompareView` does -- happy to follow up if the maintainers want the
V2 integration in this PR or in a separate one.
## Tests
11 new test files, 73 new tests covering parser, cache analysis,
tool shape, MCP reachability, LLM export, comparison, and snapshot
roundtrip. Existing `costAnalysis.test.js` extended with cache
analysis cases.
Verified:
- `npx tsc --noEmit` clean
- `npm run build` clean
- 1018/1019 tests pass (1 pre-existing skip)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s, docs
- Add 23 new theme.cost.* tokens (chipBg*, miss*, ok*, recommit*, switch*, pillLlm/Subagent/Tool) to both dark and light themes, plus theme.d.ts
- Add ctxImages to theme.d.ts (was already in runtime theme)
- Replace 4 hardcoded hex literals in CostViewChatExport.jsx event pills with the new pill* tokens; pill text uses theme.text.primary
- Fix magic number fontSize: 9 -> theme.fontSize.xs on the ROUTER chip
- Remove unnecessary || '#eab308' defensive fallback (semantic.warning is always defined)
- Remove two console.debug/console.warn calls from copilotChatExportParser
- Fix 'Last verified: May 2026' typo -> May 2025 in pricing.js
- Document all new cost tokens in docs/ui-ux-style-guide.md
- Add swatches for all new cost tokens in docs/color-palette.html (both dark and light sections)
- Update CLAUDE.md file tree with new lib modules (cacheAnalysis, toolDefinitionShape, compareCost, llmAnalysisExport, exportComparison, mcpServerReachability, copilotChatExportParser) and new components (CostViewChatExport, CostCompare)
Verified: tsc --noEmit clean, npm run build clean, 1018/1019 tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previous solid-bg pills with theme.text.primary had poor contrast (dark text on mid-saturation backgrounds looked flat). Switch to the established tinted-chip pattern already used elsewhere in the Cost view: tinted background + saturated foreground + subtle border.
- LLM call: chipBgAssistant bg + accent.primary fg
- Subagent: chipBgExtension bg + kindExtension fg
- Tool: chipBgBuiltin bg + kindBuiltin fg
Removes the now-redundant pillLlm/pillSubagent/pillTool tokens from theme.js, theme.d.ts, color-palette.html, and the style guide. The pills now self-document via the existing chip palette.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix remaining 'May 2026' typos -> 'May 2025' in pricing.js (2 inline comments), llmAnalysisExport.ts (verification comment), and README.md (file tree pricing.js description)
- Document theme.cost.ctxImages in docs/color-palette.html (dark + light swatches: #C77BC2 / #9b4f97) and in the Context-bucket colors table in docs/ui-ux-style-guide.md
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…anel
When two Copilot Chat exports' system prompts differ (hashes don't match),
the drift panel now shows exactly which top-level <tag> blocks differ
between the runs — e.g. an <instruction forToolsWithPrefix="mcp_azure">
block present in one run but not the other.
Surfaces:
- 'only A' / 'only B' rows for blocks present on one side only
- 'size changed' rows for same-key blocks with differing char counts
- per-row char delta sorted by absolute magnitude, descending
- hover tooltip with the first 200 chars of the block body
Implementation:
- copilotChatExportParser: new extractSystemBlocks() walks the system
text and emits every top-level <tag>...</tag> block (depth-aware,
so <skill> inside <skills> is not double-counted). New SystemBlock
field added to ClassifiedCall and to the public LLM-call event.
- compareCost: RunFingerprint carries systemBlocks; new
buildSystemPromptDiff() emits a per-key diff sorted by |delta|.
Diff is attached to DriftReport.systemPromptDiff.
- CostCompare: new SystemPromptDiffDetail renders under the System
prompt drift row when hasBlockDrift is true; collapses long lists
to the top 6 with 'show N more'.
Solves the user-reported confusion: 'I have two runs with the same
system prompt but one is reported as smaller' — turned out two
<instruction> blocks for Azure/Bicep MCP servers were present in one
run and absent in the other. The new diff makes that immediately visible.
Tests: 15 new unit tests covering nesting, attributes, malformed input,
preview truncation, identical/only-A/only-B/chars-differ classification,
delta-descending sort, and the exact real-world scenario.
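A rough sketch of the depth-aware block walk and the per-key diff described in this commit. The function names, the `key` format (tag name plus raw attributes), and the `delta = charsB - charsA` sign convention are assumptions; the PR's actual `extractSystemBlocks()` and `buildSystemPromptDiff()` may be structured differently.

```typescript
interface SystemBlock {
  key: string;   // assumed: tag name plus raw attribute text
  chars: number; // length of the whole <tag>...</tag> span
}

// Emit only top-level blocks; nested tags just adjust the depth counter.
function extractTopLevelBlocks(text: string): SystemBlock[] {
  const tagRe = /<(\/?)([A-Za-z][\w-]*)([^<>]*)>/g;
  const blocks: SystemBlock[] = [];
  let depth = 0;
  let start = 0;
  let key = "";
  let m: RegExpExecArray | null;
  while ((m = tagRe.exec(text)) !== null) {
    const attrs = m[3].trim();
    if (attrs.endsWith("/")) continue; // self-closing tag: no depth change
    if (m[1] !== "/") {
      if (depth === 0) {
        start = m.index;
        key = attrs ? `${m[2]} ${attrs}` : m[2];
      }
      depth++;
    } else if (depth > 0) {
      // A stray close tag at depth 0 (malformed input) is ignored.
      depth--;
      if (depth === 0) blocks.push({ key, chars: m.index + m[0].length - start });
    }
  }
  return blocks;
}

type DiffStatus = "only-A" | "only-B" | "chars-differ";
interface BlockDiffRow {
  key: string;
  status: DiffStatus;
  charsA: number;
  charsB: number;
  delta: number;
}

function sumByKey(blocks: SystemBlock[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const b of blocks) totals.set(b.key, (totals.get(b.key) ?? 0) + b.chars);
  return totals;
}

// Identical blocks are dropped; remaining rows are sorted by |delta| descending.
function diffBlocks(a: SystemBlock[], b: SystemBlock[]): BlockDiffRow[] {
  const ma = sumByKey(a);
  const mb = sumByKey(b);
  const keys = Array.from(new Set(Array.from(ma.keys()).concat(Array.from(mb.keys()))));
  const rows: BlockDiffRow[] = [];
  for (const k of keys) {
    const charsA = ma.get(k) ?? 0;
    const charsB = mb.get(k) ?? 0;
    if (charsA === charsB) continue;
    const status: DiffStatus = !mb.has(k) ? "only-A" : !ma.has(k) ? "only-B" : "chars-differ";
    rows.push({ key: k, status, charsA, charsB, delta: charsB - charsA });
  }
  return rows.sort((x, y) => Math.abs(y.delta) - Math.abs(x.delta));
}
```

The walk counts depth without checking that close tags match their openers, which keeps the sketch short but is looser than a real parser would want.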
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per-section 'tokens' values in the System anatomy panel are pro-rata
estimates of this run's total prompt_tokens, derived as
`chars / sysChars * sysTok`. Because the run-wide tok/char ratio
varies (BPE tokenization is content-dependent), a section with
identical chars between two runs will show different 'tokens' values
in each run — which made users (correctly) suspect the displayed
token numbers.
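A small worked example of the pro-rata formula above, showing how the same section can get two different estimates. The character and token totals are invented for illustration, and `estimateSectionTokens` is a hypothetical helper name.

```typescript
// Pro-rata estimate: chars / sysChars * sysTok, rounded to whole tokens.
function estimateSectionTokens(chars: number, sysChars: number, sysTok: number): number {
  if (sysChars <= 0) return 0; // guard added for the sketch
  return Math.round((chars * sysTok) / sysChars);
}

// The same 2,000-char section in two runs whose overall tok/char ratio differs.
const runA = estimateSectionTokens(2000, 40000, 10000); // 0.250 tok/char overall -> 500
const runB = estimateSectionTokens(2000, 44000, 12100); // 0.275 tok/char overall -> 550
console.log(runA, runB);
```

The section's content is unchanged, yet its estimate moves by 50 tokens purely because the rest of the system prompt changed.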
This commit makes the single-session view honest about that:
- Each section's always-visible summary row now shows BOTH the
exact char count and the token estimate, with the latter clearly
labeled '~N tok est'.
- A one-line explainer under the header tells the user: chars are
exact, per-section tokens are pro-rata estimates that shift when
other sections change.
- Per-row tooltips on the new columns explain the same.
- Sub-rows inside expanded sections (each <scaffolding> tag, each
skill, each instruction file, each file attachment) also now lead
with chars and label tokens as estimates.
Net effect: when comparing two runs side-by-side, users can rely on
the chars column to spot real content changes. The token estimate is
still shown for cost intuition, but it no longer masquerades as
ground truth.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 'Copy for LLM analysis' export for Compare now carries the per-section
system-prompt diff data we surface in the UI. Emitted as a 'System prompt
block diff' section right after 'Run drift', whenever the parser surfaced
top-level block data on both sides AND at least one block differs.
Contents (deterministic, no prose):
- Tagged chars on each side + signed delta
- Untagged plaintext chars (preamble + interstitial text between tags)
on each side + signed delta — derived from systemPromptChars minus
taggedCharsA/B, so the analyst can see whether drift lives in tagged
blocks or in the prose between them.
- Markdown table: one row per block, columns = status / block key /
chars A / chars B / Δ chars. Rows are pre-sorted by |delta| desc by
the diff builder. 'only-A' / 'only-B' rows show the side's char
count as a signed value so the table is easy to scan.
- Footnote calling out that per-section token counts shown elsewhere
are pro-rata estimates and that chars are the ground truth.
This keeps the analysis prompt itself unchanged — the new content lives
in the data section the prompt instructs the model to reason over.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Includes:
- Cost view + Cost Compare (Copilot Chat export support)
- MCP reachability column
- Per-section system-prompt diff in Compare drift panel
- Single-session anatomy: leads with chars, marks token counts as estimates
- 'Copy for LLM analysis' export carries the block diff data
Tarball published as GitHub Release v1.1.0-fork.0.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t breakdowns
Cost-view KPI cards:
- Setup overhead card with per-bucket cost for unused tools and skills
- Output card split into 3 lines (visible / thinking / tool-args) with cr
- Tool calls card: top 3 tools on separate lines + ellipsis row
- Tooltip hierarchy: visible by prose/fenced-code-by-language, tool-args by tool name
Per-prompt Cost-by-component (Response bucket):
- Collapsed summary now shows 'visible X cr · thinking Y cr · tool-args Z cr'
- Expanded view: 3 flat rows aligned to bucket header columns (label / cr / % / inline detail)
- Proportional allocation eliminates the 'unattributed' residual
- Inline detail: prose+fenced-code for visible, top tools for tool-args
- Drops redundant 'across N calls' sample line
Parser:
- fencedCodeStats returns per-language breakdown
- Per-event toolArgCharsByName and codeCharsByLang
- Session totals: toolArgCharsByName, codeCharsByLang, codeChars
llmAnalysisExport:
- output_composition.tool_args_by_tool: top 10 tools with est_tokens
- output_composition.visible_code_by_language: top 5 languages
- Exports detectUnusedTools, aggregateSkillCarry for KPI reuse
- Fix detectUnusedTools return type: declare unusedTokensPerCall
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The system prompt anatomy panel was missing two block types that the parser silently rolled into "Other / unclassified system text":
- <agents> -- sub-agent declarations the model uses with runSubagent
- <instruction forToolsWithPrefix="..."> -- MCP-injected tool usage rules
Both cost prompt tokens on every call but were invisible in the UI.
Parser (copilotChatExportParser.ts):
- Add SubAgent and ToolPrefixInstruction interfaces
- extractSubAgents: parse <agents><agent><name/description/argumentHint>
- extractToolPrefixInstructions: parse <instruction forToolsWithPrefix="X">
- Wire subAgents and toolPrefixInstructions through ClassifiedCall and CostAnalysisCall
UI (CostViewChatExport.jsx renderSystemAnatomy):
- New "Sub-agents (N)" row with name + hover for description/argumentHint
- New "MCP / tool-prefix instructions (N)" row keyed by prefix with body preview
- Both rows feed into classifiedChars so "Other" shrinks accordingly
- Empty-check expanded so the panel still renders when only these new sections exist
Cart fixture impact: surfaces the 1,848-char mcp_azure instruction and the 719-char <agents> block (Explore sub-agent) that were previously hidden.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
VS Code's Copilot Chat export sometimes omits the `request` log entry for the first LLM round-trip of a user turn, while still logging the `toolCall` entries it dispatched. The result was a confusing timeline where runSubagent (or other tool calls) appeared before any LLM row. We now detect orphan toolCalls before the first request, recover the response text from the next request's message history (last role=2 assistant content), and synthesize a virtual LLM-call row in the timeline. The row is labeled 'LLM (synth)' with a muted pill and a tooltip explaining the reconstruction. It carries zero usage/cost (VS Code didn't log it) and is excluded from KPI totals, cache analysis, and projections -- it only appears in the timeline so users can see which LLM call dispatched the orphan tools. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
VS Code's Copilot Chat export uses `name = "tool/runSubagent"` for LLM calls that run inside a spawned subagent (the inner panel surface). This was confusing: rows looked like the model was emitting a `runSubagent` tool request, when in fact the LLM was running inside a previously-dispatched subagent invocation.
Changes:
1. Relabel: `tool/runSubagent` LLM rows now get a distinct purple 'SUBAGENT LLM' pill and a 'Subagent turn' friendly label. The smart turn label (→ tools dispatched) and step counter still apply, so each row reads as 'Subagent turn → list_dir, read_file'.
2. Link: the parser matches subagent prompts to their parent runSubagent toolCall by comparing args.prompt to the subagent's user-message text, populating a new `invokedBy` field. The first LLM row of a subagent prompt shows '← invoked by prompt N · runSubagent (description)' as a clickable jump that scrolls the parent row into view and briefly highlights it.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the cost-view inspector is opened on a synthesized LLM row, it previously rendered the normal 3-box layout with zeros for every numeric field (prompt tok, output tok, cost, cache split, per-bucket breakdown). That was misleading -- zero looks like 'this call was free', when the truth is 'VS Code never wrote the request log so we don't know'.
Synth rows now get a dedicated inspector:
- Top banner explaining that the request log is missing
- Three dashed/striped boxes labeled 'unknown' for Prompt sent, Reply written, and Cost (45-degree repeating-linear-gradient is the standard 'no data' pattern; symmetric with regular rows so layout doesn't collapse)
- A 'Recovered facts' card with the model name (marked '(inferred)' since it's pulled from the next request) and the list of tool calls this LLM call dispatched
- The recovered response text from the next request's message history
No misleading cache-miss or recommit callouts are emitted for synth rows, and the per-bucket new-input breakdown is omitted entirely.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each subagent thread already gets a per-letter color in the thread
badge at the top of its prompt card ('SUB A', 'SUB B', ...). Make
the per-LLM-call pill inside that thread use the same color and
label, so they're visually grouped:
Before: orange 'SUBAGENT LLM' pill on every subagent LLM row
After: same-color 'SUB B LLM' pill (matching the 'SUB B' badge)
Falls back to the generic orange 'Subagent LLM' label when the
thread layout doesn't have a letter (e.g. only one subagent across
the whole session, or when buildAgentThreads can't slot it).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The per-bucket drilldown section in the LLM inspector only showed buckets whose content was NEW on this call. That hid useful detail: once a subagent's tool results were folded into the context on step 1, they became cached on step 2+ and disappeared from the drilldown -- with no way to read what the subagent actually returned. Now any bucket with non-zero token content (new OR cached) is listed. Cached-only buckets render '· cached' in place of the 'X% of new' suffix and use no '+' prefix on the token count. Renamed the section header from 'What's new in this call' to 'Context buildup' and added a 'N more cached' suffix to the count line. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Response section in the LLM Detail inspector previously dumped every reasoning block and tool call inline at full width, which made multi-step turns scroll forever -- and the model often emits the SAME reasoning text before each parallel tool_use, doubling or tripling the noise.
Now:
- Each reasoning block is a collapsible row showing 'before <toolName>' with the first non-empty line as the preview. Click to expand and read the full text untruncated.
- Consecutive identical blocks (same text + same target tool) collapse to a single row tagged 'x N identical'. The section header reports total blocks and unique count when they differ.
- Each tool call is a collapsible row with the tool-name pill and the smart args summary as the preview. Expand to see the pretty-printed args (JSON.stringify with 2-space indent; single-key string args render as 'key:\n<value>' so long file paths or prompts wrap nicely).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
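A minimal sketch of the two behaviors described in this commit: collapsing consecutive identical reasoning blocks, and pretty-printing tool args. The `ReasoningBlock` shape and the function names are hypothetical and not taken from the PR.

```typescript
interface ReasoningBlock {
  text: string;
  beforeTool: string; // tool the reasoning preceded
}

interface CollapsedRow extends ReasoningBlock {
  count: number; // rendered as 'x N identical' when > 1
}

// Run-length collapse: only adjacent blocks with the same text and target merge.
function collapseIdentical(blocks: ReasoningBlock[]): CollapsedRow[] {
  const rows: CollapsedRow[] = [];
  for (const block of blocks) {
    const last = rows.length > 0 ? rows[rows.length - 1] : undefined;
    if (last && last.text === block.text && last.beforeTool === block.beforeTool) {
      last.count++;
    } else {
      rows.push({ text: block.text, beforeTool: block.beforeTool, count: 1 });
    }
  }
  return rows;
}

// Single-key string args render as 'key:\n<value>'; everything else as indented JSON.
function prettyArgs(args: Record<string, unknown>): string {
  const keys = Object.keys(args);
  if (keys.length === 1) {
    const value = args[keys[0]];
    if (typeof value === "string") return `${keys[0]}:\n${value}`;
  }
  return JSON.stringify(args, null, 2);
}
```

Comparing against only the previous row means identical blocks separated by a different one stay as separate rows, matching the "consecutive" wording above.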
The Response section used a small uppercase header ('RESPONSE (2.0K
OUTPUT TOK)') over a loose dashed text block, while every input box
above it (Current prompt, Tool definitions, System scaffolding, etc.)
used the DetailSection card pattern with colored swatch + label on the
left and token count on the right. The result was a visual gear shift
right at the moment the user crossed from 'context sent in' to 'reply
that came out'.
Now the entire Response section is wrapped in a DetailSection-style
card (bg.surface, 1px border, 4px radius, 10/12 padding) with the
standard header layout: purple cost.output swatch + 'Response' label
on the left, '2.0k output tok' right-aligned. The response text block
is full-width inside the card with a solid (not dashed) subtle border.
Reasoning and tool-call subsections inside the card are de-emphasized
so they read as nested children rather than competing cards: no
background fill, no heavy left stripe, separated by a 1px subtle top
divider, and the colored accent now appears as a small 6x6 swatch in
the subsection header to echo the outer card's swatch motif.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 'Reply written' KPI box already splits output tokens into three buckets -- visible to user, thinking, tool-call args -- estimated from char share. The aggregate output bucket in the per-call drilldown uses the same split. The Response section header was only showing the total, so the reader had to scroll back up to the KPI strip to see how the spend broke down for THIS particular call.
Now the right side of the Response header shows total + per-bucket breakdown with matching swatch colors:
2.0k output tok o 1.2k visible o 0.5k thinking o 0.3k tool args
Reuses the same color mapping as the existing 3-bucket pattern (theme.cost.fresh = visible, theme.cost.output = thinking, theme.cost.ctxToolDefs = tool args). Char-based estimate falls back to reasoning-token split when no per-block char data is available.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
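One way to split an output token total by char share so the three buckets always sum to the total, sketched under assumptions. `splitByShare` is a hypothetical helper using largest-remainder rounding; the PR does not say which rounding scheme it uses, and the reasoning-token fallback mentioned above is not modeled here.

```typescript
// Allocate `total` across buckets in proportion to `weights`, with no residual.
function splitByShare(total: number, weights: number[]): number[] {
  const sum = weights.reduce((acc, w) => acc + w, 0);
  if (sum <= 0) return weights.map(() => 0);
  const exact = weights.map((w) => (w * total) / sum);
  const out = exact.map((x) => Math.floor(x));
  let left = total - out.reduce((acc, n) => acc + n, 0);
  // Hand leftover tokens to the buckets with the largest fractional parts.
  const order = exact
    .map((x, i) => ({ i, frac: x - Math.floor(x) }))
    .sort((a, b) => b.frac - a.frac);
  for (const { i } of order) {
    if (left <= 0) break;
    out[i]++;
    left--;
  }
  return out;
}

// Chars per bucket: visible, thinking, tool args (illustrative numbers).
const [visibleTok, thinkingTok, toolArgTok] = splitByShare(2000, [6000, 2500, 1500]);
console.log(visibleTok, thinkingTok, toolArgTok);
```

With these inputs the split matches the header example above: 1200 visible, 500 thinking, 300 tool args.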
The visible reply was rendered as a full-width prose block while reasoning and tool calls were collapsible rows with first-line previews. With multi-paragraph replies this dominated the inspector even when the user only wanted to scan for tool calls below. Visible text now uses the same CollapsibleRow pattern: a header showing a green cost.fresh swatch + 'visible' label on the left and the first non-empty line of the reply as the preview. Click to expand the full response text. Collapsed by default, matching reasoning + tool-call rows so the entire Response card has consistent triage ergonomics. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If the full response text is the same as the first non-empty line (common for short replies like 'Done.' or 'Now I have all the context I need. Let me draft the plan.'), the expand chevron previously hinted at content that wasn't there. Clicking it revealed the same single line in a styled panel. Now we detect that case and render the visible row as a plain non-collapsible row: still the green swatch + 'visible' label + the line itself, but no chevron and no click-to-expand. The grid keeps the 14px chevron slot empty so it aligns vertically with the reasoning + tool-call rows beneath it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The collapsed header for LLM call rows used the smart turn label as
the primary line: either a quoted first line of the response OR an
arrow list of tool calls ('-> read_file x2'). When both existed the
tool list won, and the actual reply text -- the highest-signal hint
about what the model just did -- was hidden inside the inspector.
Header now leads with the model's visible reply (first non-empty line,
markdown-stripped, truncated to 90 chars) and follows with a smaller
mono '-> tool x N' chip. The tool list keeps the same summarization
(top 3 names + '+N more') and a tooltip with the full list.
When the model emitted no visible text (silent tool turn), the primary
slot reads '(no visible reply)' in muted italic, followed by the same
tool chip. The row keeps the same vertical rhythm regardless.
Pushed to expanded-only (visible when the row is open):
- The raw VS Code surface name (e.g. 'panel/editAgent')
- The OS/workspace environment chip
- The model short name in the metric strip
These are useful for diagnosing rare cross-model or cross-workspace
issues but not relevant to scanning a session top-down.
Kept always visible:
- LLM CALL / Sub X LLM pill, Step N of M, Plan/mode pill, vision
pill, overhead pill, unexpected-cache-miss warning
- The full metric strip (ctx / net new / cached / billed-new / cost)
- The 'invoked by' link on the first LLM row of a subagent prompt
Added stripLeadingMarkdown + firstVisibleSnippet helpers so the
preview handles '# Plan' style headers, blockquotes, bullets, and
ordered lists by walking forward to the first line with real content.
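A sketch of what the two helpers named above might look like. Only the names `stripLeadingMarkdown` and `firstVisibleSnippet` and the 90-char limit come from this commit message; the regular expression and the truncation details are assumptions.

```typescript
// Remove leading markdown markers: headers, blockquotes, bullets, ordered-list numbers.
function stripLeadingMarkdown(line: string): string {
  return line
    .replace(/^\s*(?:#{1,6}\s+|>\s*|[-*+]\s+|\d+[.)]\s+)+/, "")
    .trim();
}

// Walk forward to the first line with real content and truncate it for the row header.
function firstVisibleSnippet(text: string, maxChars: number = 90): string | null {
  for (const raw of text.split(/\r?\n/)) {
    const line = stripLeadingMarkdown(raw);
    if (line.length === 0) continue;
    return line.length > maxChars ? line.slice(0, maxChars - 1) + "…" : line;
  }
  return null; // silent tool turn: caller would render '(no visible reply)'
}
```

For example, `firstVisibleSnippet("\n# Plan\n- step one")` yields `"Plan"`, and a reply made only of blank lines yields `null`.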
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tool-call rows used a generic raw-JSON dump as the args preview (e.g.
'manage_todo_list {"todoList":[{"id":1,"status":"in-progress"...')
and wasted the second line on the literal phrase 'tool call' plus the
result token count. Almost no signal for the reader.
Now the headline reads like the LLM row above it: mono-bold tool name
+ a smart human summary of the most useful field, with the result
token count promoted to a small chip on the right of the title.
Per-tool summarizers (smartToolHeadline):
read_file/create_file/edit_file/etc. -> file basename
list_dir/ls -> directory path
grep/semantic_search/file_search -> search query
bash/shell (via parsed.command) -> short command line
manage_todo_list -> 'N todos . X in-progress, Y pending, Z done'
runSubagent -> '<agent> . "<prompt snippet>"'
unknown -> first key:value of args, truncated
The full raw JSON stays available as the tooltip for power users. The
'tool call -> N tok of result' second line is hidden when the row is
collapsed and shown when the row is open (small redundant breadcrumb,
but harmless).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The vision pill rendered on every call that had any image attached, even when all images were reused from prior calls (cached). On multi-step turns that meant the same '1 (~1.6k tok)' chip appeared on Step 2, 3, 4, ... long after the image actually entered the context. It implied per-call vision cost that wasn't really being charged for that step. Pill is now omitted when newImages.length is 0. When at least one new image is present it reads '+N' to match the '+net new' convention elsewhere, the token estimate prefers newImageVisionTokens when the parser supplies it (falling back to visionTokensTotal), and the tooltip notes how many more were carried in from earlier calls. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two quick wins on the tool-row headline summarizer.
memory tools: previously fell through to the generic '<first-key>:
<truncated-value>' fallback which showed implementation-level noise.
Now extract action/name/content from common memory tool shapes (the
parser may use {action, name, content}, {command, key, value}, etc.)
and render them as 'create . <name> . "<short body>"' so the row
reads like a sentence about what's being remembered.
manage_todo_list: appended ' . now: "<title>"' showing the title of
the first in-progress task. The reader gets a sense of what work the
agent thinks it's doing right now without expanding the row. Counts
still come first since they answer 'how much is left?' at a glance.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The right-side result chip on tool rows just read '1.0k tok' which hides whether the tool returned a small JSON ack or a large file dump. Char count is the more intuitive 'how big was this' measurement for file/listing tools, and reminding the reader where these tokens land (the NEXT LLM call's input) makes the cost causality obvious from the collapsed view. Chip now reads '1.0k tok . 41,283 chars -> next call' when resultChars is known, falling back to just '1.0k tok' when the parser didn't capture char count. Tooltip is expanded to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The combined 'tok . chars -> next call' chip varied in width depending on whether resultChars was known, so the rightmost edge of the token count danced from row to row and the eye couldn't scan a column. Split into two pieces: chars text (plain muted mono, no chip) flushes to the right via marginLeft auto, and the token count chip sits to its right. When chars are unknown the chip itself takes marginLeft auto and still anchors the right edge. Result: the 'N tok' chip is always at the same horizontal position across rows. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The chip said 'chars -> next call' which was correct but vague. The parser already classifies tool outputs into the dedicated tool_results bucket on the subsequent LLM call (they arrive as role:'tool' messages, distinct from the History bucket which holds prior user/assistant turns). Spelling that out gives the reader the cost trail at a glance and matches the bucket terminology used everywhere else in the Cost view. Chip now reads 'N chars -> [swatch] Tool results' using the same color swatch as the Tool results bucket in the prompt breakdown. Tooltip explains the role:'tool' message mechanism and contrasts it with History. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Old format read '7 todos · 1 in-progress, 2 done · now: "..."' -- five comma-separated pieces of information competing for the same line. The progress signal (how far through the list) was buried.
New format leads with the fraction:
manage_todo_list 2/7 done · ▶ "Create CartPage component"
Rules:
- Always show N/M done as the primary progress indicator.
- Append ✓ when the list is fully complete (7/7 done ✓).
- Append 'X blocked' only when non-zero.
- Append ▶ '<title>' for the in-progress item when one exists; the play arrow reads as 'currently active' without spelling it out.
Pending count is implied (M - done - in-progress) so we drop it from the headline.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
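The headline rules above can be sketched as follows. The `Todo` shape and the status strings (`completed`, `blocked`, `in-progress`) are assumptions; the real `manage_todo_list` args may use different values.

```typescript
// Hypothetical todo shape; status values are assumed, not taken from the export.
interface Todo {
  title: string;
  status: string;
}

function todoHeadline(todos: Todo[]): string {
  const done = todos.filter((t) => t.status === "completed").length;
  const blocked = todos.filter((t) => t.status === "blocked").length;
  const active = todos.find((t) => t.status === "in-progress");

  // N/M done always leads; the check mark is appended only when everything is done.
  const complete = todos.length > 0 && done === todos.length;
  const parts: string[] = [`${done}/${todos.length} done${complete ? " \u2713" : ""}`];
  if (blocked > 0) parts.push(`${blocked} blocked`);
  if (active) parts.push(`\u25b6 "${active.title}"`);
  return parts.join(" \u00b7 ");
}
```

The pending count is intentionally never emitted, since it can be derived from the fraction.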
Bug: the per-call result chip and expanded preview routinely showed content that didn't belong to the tool row. Most visible case -- 'manage_todo_list' rows showing thousands of chars of unrelated text like 'Here's a complete breakdown of the Navigation component...'.
Cause: copilotChatExportParser paired pending tool calls with role-3 tool_result messages by ordinal index within the request. But cls.toolResultMsgs is built from EVERY role-3 message in this call's prompt history -- i.e., results of every tool call across the entire conversation up to this point. As the conversation grew, pendingToolCalls (0..N from this round) got paired with the FIRST N tool_result messages, which were almost always from the earliest turn.
Fix: every role-3 message carries the originating toolCallId, and each toolCall log's id is the same toolu_* value. Capture toolCallId on toolResultMsgs and match by id in the pairing loop, with an ordinal fallback only when ids are missing on either side.
Result: chars/tokens/preview/full now reflect the actual response of the tool that produced them.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tool-row headlines for list_dir used to show the full absolute path (e.g. '/Users/jfhelin/Code/GitHub/octodemo/octocat_supply-psychic-d'), which is mostly noise -- the interesting part is the relative location inside the workspace. Infer workspace root as the longest common directory prefix across every absolute path arg in the session (filePath/path/file/uri/ directory/dir/cwd/...), then strip it from list_dir headlines. Root itself renders as '.', sub-paths render relative. Safety floor: only trust the inferred root when it has at least 4 meaningful path segments (e.g. /Users/<name>/Code/<repo>) and at least 2 absolute paths were observed. Below that we leave paths untouched so we never strip '/' or '/usr'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
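A sketch of the workspace-root inference described above, assuming POSIX-style paths. The function names are hypothetical, and the prefix is computed over raw path segments without distinguishing files from directories, which is a simplification.

```typescript
// Longest common segment prefix, trusted only above the safety floor.
function inferWorkspaceRoot(paths: string[]): string | null {
  const abs = paths.filter((p) => p.startsWith("/"));
  if (abs.length < 2) return null; // need at least 2 absolute paths

  const split = abs.map((p) => p.split("/").filter((s) => s.length > 0));
  let common = split[0];
  for (const segs of split.slice(1)) {
    let i = 0;
    while (i < common.length && i < segs.length && common[i] === segs[i]) i++;
    common = common.slice(0, i);
  }
  // Never strip shallow prefixes such as '/' or '/usr'.
  if (common.length < 4) return null;
  return "/" + common.join("/");
}

// Root renders as '.', sub-paths render relative, anything else is left untouched.
function relativize(path: string, root: string | null): string {
  if (root === null) return path;
  if (path === root) return ".";
  return path.startsWith(root + "/") ? path.slice(root.length + 1) : path;
}
```

Returning `null` below the floor keeps the caller's fallback trivial: paths are shown exactly as they appear in the args.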
Previous commit matched tool-call ids exactly against role-3 toolCallId values, but the export carries them in two shapes:
toolCall log: 'toolu_bdrk_<id>__vscode-<n>'
role-3 msg: 'toolu_bdrk_<id>'
The '__vscode-<n>' suffix is added host-side. Exact-match therefore never hit, every tool row lost its resultChars/resultTokens, and the 'N chars -> Tool results' chip disappeared.
Strip everything from the first '__' before matching on both sides. The bare 'toolu_*' prefix is unique enough to be a reliable key.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
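A sketch of id-based pairing with the suffix strip described in the two commits above. The `ToolCallLog` and `ToolResultMsg` shapes are hypothetical, and applying the ordinal fallback to the whole batch when any id is missing is one reading of "ids are missing on either side".

```typescript
// Hypothetical shapes; the real parser types carry more fields.
interface ToolCallLog {
  id?: string;
  name: string;
}
interface ToolResultMsg {
  toolCallId?: string;
  text: string;
}

// 'toolu_bdrk_<id>__vscode-<n>' -> 'toolu_bdrk_<id>'
function normalizeToolCallId(id: string): string {
  const cut = id.indexOf("__");
  return cut === -1 ? id : id.slice(0, cut);
}

// Returns one entry per call: the matched result, or undefined if none was found.
function pairResults(calls: ToolCallLog[], results: ToolResultMsg[]): Array<ToolResultMsg | undefined> {
  const byId = new Map<string, ToolResultMsg>();
  for (const r of results) {
    if (r.toolCallId) byId.set(normalizeToolCallId(r.toolCallId), r);
  }
  const idsEverywhere =
    calls.every((c) => Boolean(c.id)) && results.every((r) => Boolean(r.toolCallId));
  return calls.map((c, i) =>
    idsEverywhere && c.id ? byId.get(normalizeToolCallId(c.id)) : results[i]
  );
}
```

Keying the lookup on the normalized id is what stops a growing history from pairing this round's calls with results from the earliest turn.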
Two separate spans (chars + token chip) were noisier than needed. The signal that matters for billing is the token count and where it lands. Collapse to a single chip: 1.1k tok -> [swatch] Tool results Chars moved into the tooltip alongside the role:'tool' explanation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
History and Tool result lists in the expanded LLM call previously
rendered every row, with cached (carried-over) rows dimmed to ~55%
opacity. For long-running sessions this meant scrolling past dozens of
muted rows before reaching what's new this call.
Now cached rows are hidden behind a single dashed header at the top of
each list:
> 24 cached messages (show) reused from prior calls
Click to expand. New-this-call rows always render below, unaffected.
Per-list local state -- both lists can be expanded independently.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
File-mutating tool calls now show the size of the change in the
headline, dim-styled after a middle-dot separator:
create_file CartContext.tsx · 84 lines, 4.6k chars
insert_edit_into_file Navigation.tsx · +12 lines, 380 chars
replace_string_in_file Navigation.tsx · +12 / -3 lines
apply_patch · +23 / -8 lines
read_file Navigation.tsx · lines 1-200
Sources:
- create_file / write -> args.content|text|code length
- insert_edit_into_file / edit_file -> args.code|content length
  (treated as inserted lines, prefixed '+')
- replace_string_in_file -> +newString lines / -oldString lines
- apply_patch / patch -> count +/- lines in args.patch|diff|body
- read_file -> args.startLine-endLine when present
Chars use k-suffix above 1000 to stay compact. Tools without a size
signal fall back to today's behavior (filename only). The right-side
'-> Tool results' token chip is unaffected.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
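The source table above maps fairly directly onto a per-tool switch. The sketch below is illustrative only: `sizeSignal`, `compactCount`, and `firstString` are hypothetical names, one decimal place for the k-suffix is a guess from the '4.6k' example, and skipping '+++' / '---' file-header lines when counting patch lines is an assumption about unified-diff input rather than something the commit states.

```typescript
type ToolArgs = Record<string, unknown>;

// "Chars use k-suffix above 1000"; one decimal place is an assumption.
function compactCount(n: number): string {
  return n > 1000 ? (n / 1000).toFixed(1) + "k" : String(n);
}

function countLines(text: string): number {
  return text === "" ? 0 : text.split("\n").length;
}

// Return the first string-valued arg among the candidate keys.
function firstString(args: ToolArgs, keys: string[]): string | undefined {
  for (const k of keys) {
    const v = args[k];
    if (typeof v === "string") return v;
  }
  return undefined;
}

// Returns the text shown after the middle-dot, or undefined when no size signal exists.
function sizeSignal(tool: string, args: ToolArgs): string | undefined {
  switch (tool) {
    case "create_file":
    case "write": {
      const body = firstString(args, ["content", "text", "code"]);
      return body === undefined ? undefined
        : `${countLines(body)} lines, ${compactCount(body.length)} chars`;
    }
    case "insert_edit_into_file":
    case "edit_file": {
      const body = firstString(args, ["code", "content"]);
      return body === undefined ? undefined
        : `+${countLines(body)} lines, ${compactCount(body.length)} chars`;
    }
    case "replace_string_in_file": {
      const added = firstString(args, ["newString"]);
      const removed = firstString(args, ["oldString"]);
      if (added === undefined || removed === undefined) return undefined;
      return `+${countLines(added)} / -${countLines(removed)} lines`;
    }
    case "apply_patch":
    case "patch": {
      const patch = firstString(args, ["patch", "diff", "body"]);
      if (patch === undefined) return undefined;
      let plus = 0;
      let minus = 0;
      for (const line of patch.split("\n")) {
        // Assumption: ignore unified-diff file headers.
        if (line.startsWith("+") && !line.startsWith("+++")) plus++;
        else if (line.startsWith("-") && !line.startsWith("---")) minus++;
      }
      return `+${plus} / -${minus} lines`;
    }
    case "read_file": {
      const start = args["startLine"];
      const end = args["endLine"];
      return typeof start === "number" && typeof end === "number"
        ? `lines ${start}-${end}` : undefined;
    }
    default:
      return undefined; // caller keeps the filename-only headline
  }
}
```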
Previously the fallback for any tool we hadn't special-cased was
just 'firstKey: firstValue', which often produced unreadable noise
('options: {recursive, ...}' on a delete tool, 'arg0: --flag' on a
shell wrapper, etc).
Replace with a layered probe that tries informative shapes in order
so most unknown tools get a reasonable headline for free:
1. URL value -> 'github.com/Jfhelin/agentviz'
2. Path-like key -> basename for files, relative dir otherwise
(stripped against inferred workspace root)
3. Action + name pair -> 'delete · my-cache-key'
4. Query/command key -> truncated single-line text
5. Body-like key -> 'content: 42 lines, 1.2k chars'
(size signal instead of dumping contents)
6. Array of strings -> 'foo.ts +3 more' (basename for path arrays)
7. Plain name/id -> truncated string
8. Final firstKey: value fallback as a last resort
Workspace-root stripping applies to all path-like values so unknown
file-touching tools benefit too. Probes are short-circuit so the
first useful signal wins.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
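A short-circuiting probe chain like the one listed above can be sketched as an ordered array of functions where the first non-empty answer wins. The sketch covers only five of the eight layers (URL, action + name, body size, string array, final fallback), and every name in it (`Probe`, `headline`, the specific arg keys such as `action`, `name`, `content`) is a hypothetical choice for illustration, not the view's actual implementation.

```typescript
type Args = Record<string, unknown>;
type Probe = (args: Args) => string | undefined;

const truncate = (s: string, max: number = 60): string =>
  s.length > max ? s.slice(0, max - 1) + "\u2026" : s;

const basename = (p: string): string =>
  p.split("/").filter((s) => s !== "").pop() ?? p;

const compact = (n: number): string =>
  n > 1000 ? (n / 1000).toFixed(1) + "k" : String(n);

// 1. URL value -> host + path without the scheme.
const urlProbe: Probe = (args) => {
  for (const key of Object.keys(args)) {
    const v = args[key];
    if (typeof v === "string" && /^https?:\/\//.test(v)) {
      return v.replace(/^https?:\/\//, "").replace(/\/$/, "");
    }
  }
  return undefined;
};

// 3. Action + name pair (key names are assumptions).
const actionNameProbe: Probe = (args) => {
  const action = args["action"];
  const name = args["name"];
  return typeof action === "string" && typeof name === "string"
    ? `${action} \u00b7 ${truncate(name)}` : undefined;
};

// 5. Body-like key -> size signal instead of dumping contents.
const bodyProbe: Probe = (args) => {
  for (const key of ["content", "body", "text"]) {
    const v = args[key];
    if (typeof v === "string") {
      return `${key}: ${v.split("\n").length} lines, ${compact(v.length)} chars`;
    }
  }
  return undefined;
};

// 6. Array of strings -> first item (basename) plus a count of the rest.
const stringArrayProbe: Probe = (args) => {
  for (const key of Object.keys(args)) {
    const v = args[key];
    if (Array.isArray(v) && v.length > 0 && v.every((x) => typeof x === "string")) {
      const first = basename(v[0] as string);
      return v.length > 1 ? `${first} +${v.length - 1} more` : first;
    }
  }
  return undefined;
};

// 8. Last resort: firstKey: value.
const fallbackProbe: Probe = (args) => {
  const key = Object.keys(args)[0];
  if (key === undefined) return undefined;
  const v = args[key];
  return `${key}: ${truncate(typeof v === "string" ? v : String(JSON.stringify(v)))}`;
};

const probes: Probe[] = [urlProbe, actionNameProbe, bodyProbe, stringArrayProbe, fallbackProbe];

function headline(args: Args): string {
  for (const probe of probes) {
    const hit = probe(args);
    if (hit !== undefined) return hit; // first useful signal wins
  }
  return "";
}
```

Keeping the probes in a plain array makes the priority order explicit and lets a new shape be added by inserting one function at the right position.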
Previous strict longest-common-prefix returned '' when even a single
divergent path appeared in the session -- and there always is one in
practice (e.g. the 'memory' tool writes to '/memories/session/plan.md'
which shares no segments with the project root). The result: every
list_dir / read_file headline reverted to showing the full absolute
path.
Replace with a frequency-based pick: tally every directory prefix of
every absolute path, then return the longest prefix that:
- covers at least 80% of the paths, AND
- has at least 4 meaningful segments
This survives a handful of outliers while still requiring strong
majority evidence before trimming.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
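The frequency-based pick described above can be sketched as a prefix tally followed by a filter on coverage and depth. This is an illustration under assumptions: `inferWorkspaceRoot` is a hypothetical name, the last segment of each path is treated as a leaf and not counted as a directory prefix, and the "at least 2 absolute paths" floor is carried over from the earlier commit on the assumption that it still applies.

```typescript
function inferWorkspaceRoot(
  paths: string[],
  minCoverage: number = 0.8,
  minSegments: number = 4,
): string {
  const abs = paths.filter((p) => p.startsWith("/"));
  if (abs.length < 2) return ""; // assumed floor from the earlier commit

  // Tally every directory prefix of every absolute path.
  const counts = new Map<string, number>();
  for (const p of abs) {
    const segs = p.split("/").filter((s) => s.length > 0);
    // Stop before the last segment: it is treated as a leaf (assumption).
    for (let i = 1; i < segs.length; i++) {
      const prefix = "/" + segs.slice(0, i).join("/");
      counts.set(prefix, (counts.get(prefix) ?? 0) + 1);
    }
  }

  // Longest prefix that covers enough paths and is deep enough.
  let best = "";
  let bestSegs = 0;
  counts.forEach((n, prefix) => {
    const segCount = prefix.split("/").length - 1;
    if (n / abs.length >= minCoverage && segCount >= minSegments && segCount > bestSegs) {
      best = prefix;
      bestSegs = segCount;
    }
  });
  return best; // "" means: leave paths untouched
}
```

In a session with five paths under `/Users/a/Code/repo` and one outlier at `/memories/session/plan.md`, the repo prefix covers 5 of 6 paths (about 83%), so it is still chosen, whereas a strict common prefix would have been empty.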
'list_dir api' was ambiguous -- could read as a parameter rather than
a path. Adding './' anchors the eye and matches shell muscle memory:
list_dir ./api
list_dir ./frontend/src/components
list_dir . (workspace root unchanged)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously read_file / create_file / edit_file etc collapsed paths
to basename ('Products.tsx'), losing the directory context. Now
they show the workspace-relative path with a './' prefix:
read_file ./pages/Products.tsx
create_file ./context/CartContext.tsx · 84 lines, 4.6k chars
replace_string_in_file ./components/Navigation.tsx · +12 / -3 lines
list_dir ./api
Paths outside the inferred workspace root still fall back to
basename so we don't show useless absolute prefixes.
Also threads workspaceRoot into the expanded tool-call inspector
(ToolDetail / ToolArgsPreview) so the 'user-friendly view' shown
when a row is expanded uses the same stripped paths -- no more
mixed absolute/relative renderings between collapsed and expanded
views.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
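The display rule above (root as '.', sub-paths with a './' prefix, basename for anything outside the root) fits in one small function. A minimal sketch, with `displayPath` as a hypothetical name and POSIX-style '/' separators assumed:

```typescript
function displayPath(absPath: string, workspaceRoot: string): string {
  if (workspaceRoot !== "") {
    if (absPath === workspaceRoot) return "."; // the root itself
    if (absPath.startsWith(workspaceRoot + "/")) {
      return "./" + absPath.slice(workspaceRoot.length + 1);
    }
  }
  // Outside the inferred root (or no root inferred): fall back to basename.
  return absPath.split("/").filter((s) => s.length > 0).pop() ?? absPath;
}
```

Checking for `workspaceRoot + "/"` rather than the bare root avoids treating a sibling such as `/Users/a/Code/repo-old` as if it were inside `/Users/a/Code/repo`.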
Conflicts in parseSession.ts and sessionTypes.ts were both simple unions -- our copilot-chat-export entries and upstream's new codex entries needed to coexist. Both kept. Verified after merge: typecheck clean, 1059/1060 tests pass (1 pre-existing skip), build clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…used)
Before: '52 unused tools ~15.3 cr' -- hides the denominator
After: '52/52 tools unused ~15.3 cr' -- the ratio jumps out
Same change for skills. Same line length, so layout is unaffected.
The fraction makes the underlying story much clearer:
- '52/52 tools unused' immediately reads as 'nothing was used'
- '3/52 tools unused' immediately reads as 'most paid off'
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
aggregateSkillCarry was reading skills from the FIRST non-overhead LLM
call. In sessions where that first call is a lightweight kickoff
request that ships an empty skills array (even though every later call
attaches the full skill set), this returned skillCount: 0 and the
Setup-overhead box claimed there were no unused skills. On the
cart-implementation fixture this hid ~5,119 tok/call of skill overhead
across 36 calls -- roughly 184k tokens of cache-discounted spend the
user had no way to see.
Fix: sample from the LLM call carrying the MOST skills. Skill lists
don't shrink mid-session, so the per-call max is the steady state.
Survives both kinds of false-empty early calls (overhead title-gen
already filtered, plus lightweight kickoff requests).
Verified on 04-plan-implement-cart.json:
Before: skillCount: 0, unusedCount: 0
After: skillCount: 37, unusedCount: 37, ~5,119 tok/call
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
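The sampling change above amounts to replacing "first non-overhead call" with "non-overhead call with the longest skill list". A minimal sketch of that selection, where the `LlmCall` shape and the `pickSkillSample` name are hypothetical and not taken from `aggregateSkillCarry` itself:

```typescript
// Hypothetical shape; the real call records carry many more fields.
interface LlmCall {
  isOverhead: boolean;
  skills: string[];
}

// Pick the non-overhead call carrying the most skills (first one wins ties).
function pickSkillSample(calls: LlmCall[]): LlmCall | undefined {
  let best: LlmCall | undefined;
  for (const call of calls) {
    if (call.isOverhead) continue; // title-gen and similar calls are filtered out
    if (best === undefined || call.skills.length > best.skills.length) best = call;
  }
  return best;
}

// Arithmetic behind the "roughly 184k tokens" figure quoted above.
const hiddenTokens = 5119 * 36; // 184,284
```

Because the commit states that skill lists don't shrink mid-session, the maximum is also what every steady-state call carries, so a lightweight kickoff call with an empty list no longer zeroes out the result.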
Adds up to 3 sub-lines under the LLM calls card showing how the calls
split across the main chat thread and each sub-agent that was invoked.
Falls back to today's '23 primary, 8 overhead' sub-line when no
sub-agents were spawned.
Example (cart fixture, 4 prompts, 2 sub-agents):
LLM calls 34
23 main thread (incl. 8 overhead)
↳ 7 Explore frontend components an…
↳ 6 Explore frontend structure
If more than 3 sub-agents were spawned, a '… +N more' line shows the
rollup with each remaining thread name in the tooltip.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds first-class support for VS Code Copilot Chat exports (`copilot_all_prompts_*.json`) in the Cost view, plus a Cost tab on the Compare view, a 'Copy for LLM analysis' export, and an MCP server reachability banner.
Since the initial draft, this PR has grown with substantial Cost-view polish driven by real-world use: smarter tool-row headlines, workspace-root path inference, a tool-result mis-attribution bug fix, cached-row collapse in the LLM inspector, system-prompt diffing in Compare, sub-agent surfacing, and synthesized rows for orphan tool calls. Sized for a single PR (the maintainer's recent merges #94 and #101 were similar scale), but happy to split.
What this PR adds
Single-session Cost view for Copilot Chat exports
Auto-detected when a `copilot_all_prompts_*.json` file is loaded. A 3-column timeline with three lenses:
- CTX -- token context buildup per call
- NET -- truly-new vs cache-recommit tokens per call
- BILLED -- USD / AI Credits cost per call
Plus per-call cache analysis with unexpected-cache-miss diagnosis ("tool defs changed: X, Y, Z" vs "likely TTL expiry"), tool definition shape classifier (built-in vs MCP, router vs direct), image token cost estimation, and an Overhead toggle.
Cost Compare (two-session)
Side-by-side cost analysis when both compared sessions are Copilot Chat exports: headline verdict, A/B/delta cards, per-bucket cost waterfall sorted by absolute delta, behavioral KPI table (path-noise-resistant metrics), cache-pollution warnings, run drift detection (model / tools / per-section system-prompt diff), and rule-driven recommendations.
Copy for LLM analysis
A button on both the single-session Cost view and the Compare Cost tab that copies a self-contained structured-facts + markdown summary of the analysis to the clipboard. Designed for paste-into-chat sharing with another LLM for deeper investigation. The Compare export now includes a per-section system-prompt block diff.
MCP server reachability banner
When a Copilot Chat export carries an `mcpServers` array, the Cost view shows a banner like: "3 of 8 listed MCP servers produced all 130 mcp_* tools the model saw; the other 5 produced 0". Heuristic label-slug match against `mcp_<slug>_*` tool names.
New since initial draft
Cost-view UX polish (driven by hands-on use of the view):
- Tool-row headlines: file reads show the workspace-relative path (`read_file ./pages/Products.tsx`), file writes/edits show size deltas (`+12 / -3 lines` or `84 lines, 4.6k chars`), `manage_todo_list` shows progress (`2/7 done` then in-progress title), `list_dir` strips inferred workspace root, plus an 8-layer generic fallback for unknown tools (URL, path, action+name, command, body size, array preview, etc.).
- Tool-result chip: `1.1k tok -> Tool results` (where the bytes will land in the next LLM call), with a colored swatch matching the destination bucket.
- Cached-row collapse: cached rows in the LLM inspector are hidden behind an `N reused from prior calls` header so the diff is what jumps out.
Bug fix: Tool results were being paired to tool calls by ordinal index inside the running prompt history, which meant late-turn tool calls got matched to the OLDEST results in the conversation (e.g., the `manage_todo_list` row was showing a `read_file` response). Now paired by `toolCallId` with a fallback for the `__vscode-N` host-side suffix.
Architecture
- `parseSession.ts`: auto-detect `copilot-chat-export`.
- `pricing.js`: per-model `cacheReadRatio` / `cacheWriteRatio` for OpenAI accuracy.
- `CostView.jsx`: delegates to `CostViewChatExport` when a Copilot Chat export is loaded.
- `CompareView.jsx`: adds a Cost tab when both sides are Copilot Chat exports.
- `AppLandingState.jsx` + `SessionProvider.jsx` + one line in `App.jsx`: drag-drop two files at once to enter Compare landing immediately.
V2 placement
Feature still lives in V1 only. `AnalyzeShell` is the natural home for V2; `InlineCompare` could grow a Cost tab the same way `CompareView` did. Happy to do the V2 port in this PR or as a follow-up.
Tests
- `costAnalysis.test.js` extended with cache analysis cases.
Verified on latest commit:
- `npx tsc --noEmit` clean
- `npm run build` clean
Scope check
Open to splitting into a stack of smaller PRs if preferred -- clean cuts at (parser + cost view) -> (compare + cost compare) -> (LLM export) -> (cost-view UX polish + sub-agent surfacing). Let me know.
Backstory
I've been using AGENTVIZ on a fork (`Jfhelin/agentviz`) to experiment with VS Code Copilot Chat exports and build my own intuition for what drives per-call cost -- context buildup, cache recommits, tool-definition footprint, MCP server overhead -- so I can speak with more confidence about UBB cost drivers for end customers. The views in this PR (and the polish that came after) are what I ended up wanting while doing that.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>