A Rust CLI for improving prompt cache hit rate in coding-agent workflows.
Phase 1 is Codex-first. The longer-term goal is to make the same cache-hit discipline useful for Claude Code, Cursor, custom agent runners, and multi-agent routers.
The idea is simple:
Do not remove context. Make repeated context cacheable.
Many researchers with training resources work on the model side: training, architecture, distillation, and serving. Many others work on the inference side: caching, KV cache, batching, kernels, and lower-latency serving. Both are still mostly model-layer or serving-layer directions.
This project did not start from "let's make agents cheaper." It started from trying to understand why DeepSeek v4-style systems make tokens cheaper, what cache hit really means, and how this differs from ordinary model use.
At the same time, while building claude-trace and codex-trace style tooling for agent visualization and explainability, we saw the concrete request payload. A coding-agent request is not just the user's latest message. It is a harness-assembled bundle of stable instructions, tool schemas, repo rules, session state, transport choices, and dynamic task data.
That made the key question practical:
In the agent era, what can an individual builder do outside the model?
The answer we explore here is small but useful: make the repeated prefix stable and measurable, so prompt cache hit rate can improve without removing context.
Model-side work such as cheaper training, sparse inference, distillation, batching, and serving optimization makes each model call cheaper by improving the model or inference stack.
make-agents-cheaper works outside the model, at the agent harness layer:
model-side efficiency:
make the model cheaper
harness layer:
make repeated agent context cheaper to reuse
The tool does not train a model and does not change model weights. It also is not ordinary prompt shortening. It focuses on the agent harness layer: the configuration, request envelope, transport, session route, and stable prompt prefix that wrap every coding-agent call.
The practical rule is:
Structure your prompt so stable components come first
and dynamic components come later.
For this project, "prompt" means the full agent harness payload, not just the user's natural-language instruction.
This is why it can later be packaged as a reusable skill. The same cache-hit discipline can be applied by different agents even if their model providers and UI surfaces differ.
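The stable-first rule can be illustrated with a minimal sketch. The `Layer` type, the `stable` flag, and the function names are assumptions made for this example; they are not the tool's actual data model.

```rust
/// One layer of the harness payload. `stable` marks content that should not
/// change between turns (instructions, tool schemas, repo rules).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Layer {
    name: &'static str,
    stable: bool,
    text: String,
}

/// Order layers so every stable layer precedes every dynamic one.
/// `sort_by_key` is a stable sort, so the relative order inside each group is kept.
fn order_layers(mut layers: Vec<Layer>) -> Vec<Layer> {
    layers.sort_by_key(|l| !l.stable); // false (stable) sorts before true (dynamic)
    layers
}

/// Join the layer texts in order to form the request prefix plus dynamic tail.
fn assemble(layers: &[Layer]) -> String {
    layers
        .iter()
        .map(|l| l.text.as_str())
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let layers = vec![
        Layer { name: "task", stable: false, text: "Fix the failing test".to_string() },
        Layer { name: "system", stable: true, text: "You are a coding agent.".to_string() },
        Layer { name: "tools", stable: true, text: "{\"tools\":[]}".to_string() },
    ];
    let ordered = order_layers(layers);
    for l in &ordered {
        println!("{} stable={}", l.name, l.stable);
    }
    println!("{}", assemble(&ordered));
}
```

In a real harness the agent usually owns the layout, so this ordering only applies where you control prompt assembly.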
There are three related layers, but they should not be mixed:
- make-agents-cheaper: the Rust audit/eval tool. This is the experiment and measurement engine. It fingerprints prompt layers, checks tool schema stability, analyzes cache breakpoints, records token usage, and compares baseline vs cache-friendly runs.
- make-agents-cheaper-skill: the reusable skill packaging layer. A skill turns the method into instructions and runbooks that another agent can apply, but the skill itself is not the primary measurement instrument.
- cheapcode or a future cheaper agent: a possible full agent harness that would own prompt assembly, tools, memory, and routing directly. This is a later product direction, not the current experimental object.
In current experiments, Codex is the development environment used to build the tooling and write the reports. The studied harness is Claude Code, and the backend model/provider in the current setup is MiMo, such as mimo-v2.5-pro. The paper should therefore describe the object of study as a Claude Code harness running on a MiMo-compatible model route, with make-agents-cheaper used as the audit/eval instrumentation.
Experiments therefore use the audit/eval layer, not the skill layer, as evidence. The skill layer is for reuse and deployment of the same cache-friendly discipline after the method has been made explicit and measurable.
Coding agents are expensive in long sessions because every turn can resend a large repeated prefix:
- system and developer instructions
- tool definitions and JSON schemas
- repo rules such as AGENTS.md
- stable project context
- previous session and conversation identifiers
Prompt caching can make that repeated prefix cheaper, but only when the provider sees the same beginning of the request again. The cache is strict: similar text is not enough; the prefix has to stay stable enough to match.
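To see why "similar" is not enough, it helps to measure how much of two requests is byte-identical from the start. The sketch below is an illustration only: real providers match on tokens and provider-specific block sizes, not raw bytes, and the request strings are made up.

```rust
/// Length in bytes of the longest shared prefix of two serialized requests.
fn common_prefix_len(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).take_while(|(x, y)| x == y).count()
}

fn main() {
    // Hypothetical stable prefix: instructions plus tool list.
    let stable = "SYSTEM: rules v1\nTOOLS: read,write\n";

    // Dynamic text placed after the stable prefix: the whole prefix still matches.
    let turn1 = format!("{}USER: fix bug A", stable);
    let turn2 = format!("{}USER: fix bug B", stable);
    println!("stable-first shared bytes: {}", common_prefix_len(&turn1, &turn2));

    // A changing timestamp injected before the stable text breaks the match early,
    // even though the two requests are still almost identical overall.
    let drift1 = format!("TIME: 12:00\n{}USER: fix bug A", stable);
    let drift2 = format!("TIME: 12:01\n{}USER: fix bug B", stable);
    println!("timestamp-first shared bytes: {}", common_prefix_len(&drift1, &drift2));
}
```

The second pair differs by one character near the front, which is enough to leave almost none of the request reusable as a prefix.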
make-agents-cheaper helps with the parts a user can control:
- Stable provider: do not bounce the same task between providers or upstream keys.
- Stable transport: prefer one agent path for the task, especially Responses API for Codex.
- Stable session: WebSocket mode and session-aware routing make it easier for later turns to land near existing cache.
- Stable model settings: model and reasoning effort changes can create different request buckets.
- Stable static context: keep repeated rules and tool context stable; avoid injecting changing bridge text before it.
The savings come from the provider charging or processing cached input more cheaply than uncached input. This project does not hide context from the agent, truncate important instructions, or rewrite the model's task. It makes the official cache path easier to hit.
The rough mental model is:
same long prefix + same session route + compatible transport
-> higher prompt-cache hit probability
-> less repeated prefill work
-> lower repeated-input cost and latency
In other words:
It reduces paid uncached input, not necessarily total input.
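A small arithmetic sketch makes the distinction concrete. The token counts and per-million-token prices below are placeholders chosen for round numbers, not measurements or any provider's real prices, and the `Usage` and `Prices` types are assumptions for this example.

```rust
/// Token usage for one run. Hypothetical shape, not the tool's eval schema.
struct Usage {
    uncached_input: u64,
    cached_input: u64,
    output: u64,
}

/// Prices in USD per million tokens. Placeholder values only.
struct Prices {
    uncached_per_mtok: f64,
    cached_per_mtok: f64,
    output_per_mtok: f64,
}

/// Total input tokens sent, regardless of whether they hit the cache.
fn total_input(u: &Usage) -> u64 {
    u.uncached_input + u.cached_input
}

/// Estimated cost: each token class is billed at its own rate.
fn estimated_cost_usd(u: &Usage, p: &Prices) -> f64 {
    (u.uncached_input as f64 * p.uncached_per_mtok
        + u.cached_input as f64 * p.cached_per_mtok
        + u.output as f64 * p.output_per_mtok)
        / 1_000_000.0
}

fn main() {
    let prices = Prices { uncached_per_mtok: 1.0, cached_per_mtok: 0.1, output_per_mtok: 4.0 };
    let baseline = Usage { uncached_input: 100_000, cached_input: 0, output: 1_000 };
    let friendly = Usage { uncached_input: 20_000, cached_input: 80_000, output: 1_000 };

    // Same total input in both runs; only the cached share differs.
    println!("total input: {} vs {}", total_input(&baseline), total_input(&friendly));
    println!("baseline cost: ${:.3}", estimated_cost_usd(&baseline, &prices));
    println!("cache-friendly cost: ${:.3}", estimated_cost_usd(&friendly, &prices));
}
```

Both runs send the same amount of context; the estimate drops only because more of it is billed at the cached rate.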
The latest V2 direct-json pilot is intentionally published as mixed/negative evidence, not as a savings claim.
Summary:
- control-steady: cache-friendly uncached input was 7,305 vs 8,162 baseline (0.895x), with 3/3 task success on both sides.
- dynamic-drift: cache-friendly uncached input was 13,930 vs 6,470 baseline (2.153x), with 3/3 task success on both sides.
- Aggregate: cache-friendly uncached input was 21,235 vs 14,632 baseline (1.451x).
So the current V2 pilot does not support the primary savings claim. The useful result is methodological: the toolchain preserves the evidence, separates slices, and prevents overclaiming before a larger matrix is run.
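The ratios in the summary are cache-friendly uncached input divided by baseline uncached input, so a value above 1.0 means the cache-friendly condition paid for more uncached input. The following sketch recomputes them from the token counts quoted above; the function name is illustrative and this is not the tool's `eval` code.

```rust
/// Ratio of candidate to baseline uncached input. Below 1.0 is an improvement.
fn ratio(candidate: u64, baseline: u64) -> f64 {
    candidate as f64 / baseline as f64
}

fn main() {
    // (slice, cache-friendly uncached input, baseline uncached input)
    let slices = [
        ("control-steady", 7_305u64, 8_162u64),
        ("dynamic-drift", 13_930, 6_470),
    ];

    let mut candidate_total = 0;
    let mut baseline_total = 0;
    for (name, candidate, baseline) in slices {
        println!("{}: {:.3}x", name, ratio(candidate, baseline));
        candidate_total += candidate;
        baseline_total += baseline;
    }

    // The aggregate is the ratio of the summed counts, not the mean of the ratios.
    println!(
        "aggregate: {} vs {} ({:.3}x)",
        candidate_total,
        baseline_total,
        ratio(candidate_total, baseline_total)
    );
}
```

Reporting the slices separately matters here: the aggregate alone would hide that control-steady improved slightly while dynamic-drift regressed.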
See docs/v2-direct-json-pilot.md and docs/data/v2-direct-json-pilot-summary.csv for the derived, commit-safe data. Raw run logs stay ignored under runs/.
Cache-aware routers can improve cache hits in the routing layer. This repository focuses on the missing client-side step:
Before blaming the router or model, verify that your local Codex config is actually cache-hit friendly.
The bundled Rust CLI is read-only by default. It inspects a Codex config.toml and reports:
- whether the configured provider has a stable base_url
- whether wire_api = "responses" is set
- whether WebSocket mode is enabled when you expect long sessions
- whether env_key is configured and present in the current shell
- whether model and reasoning settings are stable enough for repeat sessions
- whether the config looks likely to drift between providers or transport modes
It also prints HTTP and WebSocket configuration templates with placeholder router settings. Set MAKE_AGENTS_CHEAPER_EXPECTED_BASE_URL when you want the audit to verify a private endpoint without putting that endpoint in source control.
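As a rough illustration of what such read-only checks look like, here is a standard-library-only sketch. It is not the CLI's implementation: it scans lines instead of parsing TOML, ignores table scoping, and the function names and finding messages are invented for the example.

```rust
/// Return the value of the first `key = "value"` line, if any.
/// Simplification: a real audit would parse TOML and respect table scoping.
fn find_string_value<'a>(config: &'a str, key: &str) -> Option<&'a str> {
    config.lines().find_map(|line| {
        let (k, v) = line.split_once('=')?;
        if k.trim() != key {
            return None;
        }
        Some(v.trim().trim_matches('"'))
    })
}

/// Collect human-readable findings without modifying anything on disk.
fn audit(config: &str) -> Vec<String> {
    let mut findings = Vec::new();

    match find_string_value(config, "wire_api") {
        Some("responses") => {}
        Some(other) => findings.push(format!("wire_api is \"{other}\", expected \"responses\"")),
        None => findings.push("wire_api is not set".to_string()),
    }

    if find_string_value(config, "base_url").is_none() {
        findings.push("base_url is not set".to_string());
    }

    match find_string_value(config, "env_key") {
        None => findings.push("env_key is not configured".to_string()),
        // Report only the variable name, never its value.
        Some(name) if std::env::var_os(name).is_none() => {
            findings.push(format!("env var {name} is not present in this shell"))
        }
        Some(_) => {}
    }

    findings
}

fn main() {
    let config = r#"
[model_providers.cache_router]
base_url = "https://router.example/v1"
wire_api = "responses"
env_key = "CACHE_ROUTER_API_KEY"
"#;
    let findings = audit(config);
    if findings.is_empty() {
        println!("no findings");
    }
    for finding in findings {
        println!("finding: {finding}");
    }
}
```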
Ask Codex:
Use $make-agents-cheaper to inspect my Codex config and tell me whether it is prompt-cache friendly.
Or run the CLI directly:
cargo run --quiet
Run explicit Codex config audit:
cargo run --quiet -- audit --config ~/.codex/config.toml
Print the recommended WebSocket template:
cargo run --quiet -- --print-ws-config
Print the simpler HTTP template:
cargo run --quiet -- --print-http-config
Inspect a custom config path:
cargo run --quiet -- --config /path/to/config.toml
These commands are the first executable pieces of the portable cache-hit layer for existing agents.
Fingerprint prompt or harness layers without printing private prompt text:
cargo run --quiet -- fingerprint --input layers.json
cargo run --quiet -- fingerprint --input current-layers.json --previous previous-layers.json
Inspect tool schema stability:
cargo run --quiet -- tool-schema --input tools.json
cargo run --quiet -- tool-schema --input current-tools.json --previous previous-tools.json
Inspect explicit cache_control breakpoint placement:
cargo run --quiet -- breakpoints --input request.json
Compare baseline and cache-friendly benchmark records:
cargo run --quiet -- eval --baseline baseline.jsonl --candidate cache-friendly.jsonl
Print per-task token usage:
cargo run --quiet -- task-report --baseline baseline.jsonl --candidate cache-friendly.jsonl
Write paper-facing Markdown tables and interpretation guardrails:
cargo run --quiet -- analysis-report \
--baseline baseline.jsonl \
--candidate cache-friendly.jsonl \
--output runs/exp/analysis-report.md
Normalize direct Claude Code JSON output into the eval schema:
cargo run --quiet -- claude-json-import \
--input runs/exp/raw/claude-json/run-1.json \
--run-id run-1 \
--task-id docs-token-accounting \
--condition cache-friendly \
--slice dynamic-drift \
--repeat-id 1 \
--phase measured \
--output runs/exp/cache-friendly.jsonl \
--validation-path runs/exp/validation/run-1.txt \
--validation-passed true
Optional: if a raw claude-trace JSONL file exists, normalize it into the eval schema and request/layer/tool artifacts:
cargo run --quiet -- trace-import \
--input runs/exp/raw/claude-trace/run-1.jsonl \
--run-id run-1 \
--task-id docs-token-accounting \
--condition baseline \
--slice dynamic-drift \
--repeat-id 1 \
--phase measured \
--output runs/exp/baseline.jsonl \
--artifacts-dir runs/exp \
--validation-path runs/exp/validation/run-1.txt \
--validation-passed true
The current roadmap uses direct Claude JSON as the default evidence path. It preserves usage/cost/validation accounting, but it cannot prove request-shape facts such as system/tool/message ordering.
Compare with provider prices, expressed as USD per million tokens:
cargo run --quiet -- eval \
--baseline baseline.jsonl \
--candidate cache-friendly.jsonl \
--uncached-input-per-mtok <USD> \
--cached-input-per-mtok <USD> \
--output-per-mtok <USD>
Print a cache-aware compact / reactivation template:
cargo run --quiet -- compact-template
The expected JSONL benchmark record format is documented in docs/evaluation-metrics.md.
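The idea behind the fingerprint and tool-schema comparisons above is to detect drift between two runs without exposing the underlying text. A minimal sketch of that idea follows. It does not read the tool's JSON input files, the function names are illustrative, and `DefaultHasher` is used only because it ships with the standard library: its output is not guaranteed to stay the same across Rust releases, so a real tool would more likely use a fixed cryptographic hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a layer's text so it can be compared without printing the text itself.
fn fingerprint(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}

/// Index of the first layer whose fingerprint differs, or `None` if all match.
/// A length mismatch counts as drift at the first missing or extra layer.
fn first_drift(previous: &[u64], current: &[u64]) -> Option<usize> {
    let shared = previous.len().min(current.len());
    (0..shared)
        .find(|&i| previous[i] != current[i])
        .or_else(|| (previous.len() != current.len()).then_some(shared))
}

fn main() {
    // Hypothetical layers: system text, tool schema, task text.
    let previous = ["system rules v1", "{\"tools\":[\"read\"]}", "task A"];
    let current = ["system rules v1", "{\"tools\":[\"read\",\"write\"]}", "task B"];

    let prev_fp: Vec<u64> = previous.iter().map(|t| fingerprint(t)).collect();
    let curr_fp: Vec<u64> = current.iter().map(|t| fingerprint(t)).collect();

    match first_drift(&prev_fp, &curr_fp) {
        Some(index) => println!("first drifting layer index: {index}"),
        None => println!("no drift detected"),
    }
}
```

Because prompt caching matches on the prefix, the first drifting layer is the interesting one: everything after it is uncached regardless of whether later layers also changed.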
Initialize a reproducible experiment log directory:
cargo run --quiet -- init-experiment --dir runs/2026-05-09-claude-mimo-cache
Generate a paired pilot command plan from the V2 task manifest:
cargo run --quiet -- pilot-plan \
--manifest docs/task-suites/real-coding-ablation-v2.manifest.json \
--task docs-token-accounting \
--experiment-dir runs/2026-05-09-claude-mimo-real-coding-v2-pilot \
--slice dynamic-drift \
--repeats 1
Generate the full task-matrix command plan:
cargo run --quiet -- matrix-plan \
--manifest docs/task-suites/real-coding-ablation-v2.manifest.json \
--experiment-dir runs/2026-05-09-claude-mimo-real-coding-v2-full \
--repeats 3
Full protocol: docs/evaluation-protocol.md.
Use this when you want stronger long-session continuity:
model_provider = "cache_router"
model = "gpt-5.4"
model_reasoning_effort = "xhigh"
plan_mode_reasoning_effort = "xhigh"
model_reasoning_summary = "none"
model_verbosity = "medium"
approval_policy = "never"
sandbox_mode = "danger-full-access"
suppress_unstable_features_warning = true
[model_providers.cache_router]
name = "OpenAI"
base_url = "https://router.example/v1"
wire_api = "responses"
requires_openai_auth = false
env_key = "CACHE_ROUTER_API_KEY"
supports_websockets = true
[features]
responses_websockets_v2 = true
Use this when you prefer a simpler, broadly compatible setup:
model_provider = "cache_router"
model = "gpt-5.4"
model_reasoning_effort = "xhigh"
plan_mode_reasoning_effort = "xhigh"
model_reasoning_summary = "none"
model_verbosity = "medium"
approval_policy = "never"
sandbox_mode = "danger-full-access"
[model_providers.cache_router]
name = "OpenAI"
base_url = "https://router.example/v1"
wire_api = "responses"
requires_openai_auth = false
env_key = "CACHE_ROUTER_API_KEY"
- Keep static instructions, tool schemas, and repo rules stable.
- Avoid switching providers, models, or transport modes mid-task.
- Prefer Responses API for Codex-style workflows.
- Use WebSocket mode for long interactive sessions when available.
- Keep session and conversation continuity intact.
- Put dynamic task details after stable context when you control prompt layout.
- Do not chase artificial cache metrics by rewriting request semantics.
- It does not make every token cheap.
- It does not train or fine-tune a model.
- It does not cache model outputs or replay old answers.
- It does not share cache across organizations.
- It does not mutate ~/.codex/config.toml unless a future command explicitly implements that and you ask for it.
- It does not print API keys.
- It does not claim support for every agent yet; Codex is the first supported target.
- Phase 1: Codex config audit and cache-aware router-friendly templates.
- Phase 2: prefix fingerprinting, tool-schema drift checks, breakpoint analysis, benchmark comparison, and cache-aware compact templates.
- Phase 3: package reusable agent skills for Codex-first workflows, then Claude Code and Cursor cache-friendliness checks where reliable local signals exist.
- Phase 4: router and multi-agent workflow diagnostics.
- LaTeX report: paper/main.tex
- Evaluation metric spec: docs/evaluation-metrics.md
- Full experiment protocol: docs/evaluation-protocol.md
- Paired ablation runbook: docs/paired-ablation-runbook.md
- First task-suite dataset: docs/task-suites/claude-cache-ablation-v1.md
- Real coding-task suite: docs/task-suites/real-coding-ablation-v1.md
- Phenomena analysis log: docs/phenomena-analysis.md
- MiMo token accounting note: docs/mimo-token-accounting.md
- V2 direct-json pilot snapshot: docs/v2-direct-json-pilot.md
- Long-term task plan: taskplan/roadmap.md
The evaluation goal is not to show fewer total tokens. It is to show:
cached tokens go up
uncached paid input goes down
estimated cost goes down
latency does not regress
task success does not regress
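These five criteria can be read as a checklist over paired run totals. The sketch below shows one way to encode them. The `Totals` struct, its field names, and the strict comparisons with no noise tolerance are all assumptions for illustration, and the numbers in `main` are placeholders, not measured results.

```rust
/// Aggregated totals for one condition. Hypothetical shape for illustration.
#[derive(Debug, Clone, Copy)]
struct Totals {
    cached_input: u64,
    uncached_input: u64,
    cost_usd: f64,
    latency_ms: u64,
    tasks_passed: u32,
}

/// Return the criteria the candidate fails relative to the baseline.
/// An empty result means all five criteria hold.
fn failed_criteria(baseline: &Totals, candidate: &Totals) -> Vec<&'static str> {
    let mut failed = Vec::new();
    if candidate.cached_input <= baseline.cached_input {
        failed.push("cached tokens go up");
    }
    if candidate.uncached_input >= baseline.uncached_input {
        failed.push("uncached paid input goes down");
    }
    if candidate.cost_usd >= baseline.cost_usd {
        failed.push("estimated cost goes down");
    }
    if candidate.latency_ms > baseline.latency_ms {
        failed.push("latency does not regress");
    }
    if candidate.tasks_passed < baseline.tasks_passed {
        failed.push("task success does not regress");
    }
    failed
}

fn main() {
    // Placeholder totals, not measured data.
    let baseline = Totals { cached_input: 10_000, uncached_input: 50_000, cost_usd: 0.060, latency_ms: 9_000, tasks_passed: 3 };
    let candidate = Totals { cached_input: 40_000, uncached_input: 20_000, cost_usd: 0.030, latency_ms: 8_500, tasks_passed: 3 };

    let failed = failed_criteria(&baseline, &candidate);
    if failed.is_empty() {
        println!("all criteria hold");
    } else {
        println!("failed: {:?}", failed);
    }
}
```

A real comparison would also need repeats and some tolerance for run-to-run variance before treating a small difference as a pass or a failure.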
cargo build --release
The binary will be available at:
target/release/make-agents-cheaper
Run validation:
cargo test
Ask Codex:
Install the make-agents-cheaper skill from https://github.com/3873225350/make-agents-cheaper
Or clone/copy this folder into your Codex skills directory as make-agents-cheaper.
Report mode does not write files. It prints only configuration health and hides environment variable values. It never prints API keys.
If you share reports publicly, review local paths and provider names first.