
Add prompt-compression example: BANKING77 65K -> 3.2K (95.1%)#147

Open
ZhengyaoJiang wants to merge 1 commit into main from feature/prompt-compression-example

Conversation

@ZhengyaoJiang
Contributor

Summary

Adds a new examples/prompt-compression/ directory showing the minimize-chars-with-accuracy-floor shape of Weco optimization, complementing the maximize-accuracy shape in examples/prompt/.

The example deliberately bloats a BANKING77 classifier system prompt to 65,887 characters (mimicking real production prompt patterns: operating principles, per-class blocks, FAQ, worked examples), then has Weco compress it while constraining accuracy to stay at or above the baseline minus 2pp.

Headline result

claude-opus-4-7 × 50 steps: 65,887 → 3,229 chars (95.1% reduction) holding accuracy at 0.7500 on a 200-sample BANKING77 test slice (baseline 0.7700, threshold 0.7500). gpt-5.5 found a different plateau at 6,828 chars (89.6%).

See the trajectory: https://weco.ai/share/XSRQdS7vfMdt9beD3KR1tlhg7By-FFIo

What's in the folder

| File | Purpose |
| --- | --- |
| `optimize.py` | Baked baseline + `classify(query, model)`. Weco mutates only the `SYSTEM_PROMPT` string in the marked WECO-MUTABLE REGION. |
| `eval.py` | 200 BANKING77 samples → emits `accuracy:`, `chars:`, `metric:`. |
| `labels.py` | 77 canonical labels + robust `parse_predicted_label()`. |
| `build_bloated_prompt.py` | Deterministic baseline generator. |
| `bake_optimize.py` | One-shot: writes `optimize.py` from the generator. |
| `measure_baseline.py` | Measures baseline accuracy → `baseline_accuracy.json`. |
| `prompt_guide.md` | `--additional-instructions` content for the optimizer. |
| `baseline_accuracy.json` | Pre-measured 0.7700 (gpt-5-mini, seed=0). |
| `README.md` | Setup + run + headline result + structure of the compressed prompt. |
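For readers unfamiliar with the minimize-with-a-floor shape, here is a minimal sketch of the scoring `eval.py` might do before printing its `accuracy:`, `chars:`, `metric:` lines. The gating formula below is an assumption for illustration; the PR does not show the actual metric definition.

```python
# Hypothetical sketch, NOT the PR's implementation: treat SYSTEM_PROMPT
# char count as the cost to minimize, gated by an accuracy floor.

def score(n_correct: int, n_total: int, prompt: str,
          threshold: float = 0.7500) -> dict:
    accuracy = n_correct / n_total
    chars = len(prompt)
    # Candidates below the accuracy floor get a sentinel score so the
    # optimizer discards them; the rest compete on negative char count.
    metric = -chars if accuracy >= threshold else float("-inf")
    return {"accuracy": accuracy, "chars": chars, "metric": metric}

for key, value in score(150, 200, "x" * 3229).items():
    print(f"{key}: {value}")
```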

Also updates examples/README.md to add this example to the table of contents, the at-a-glance table, and the quick-starts section.

Note on the baseline

The 65K-char baseline is synthetic — generated to mimic real production classifier prompts at the same scale as a representative customer prompt. BANKING77 itself is real and public (PolyAI/banking77 on HuggingFace). The compression ratios match what we've seen on real customer prompts of similar size and shape.

Test plan

  • Smoke-tested locally: optimize.py imports, all 77 labels round-trip through parser, eval.py imports and threshold loads from baseline_accuracy.json
  • Independent eval of the 3,229-char winner: accuracy=0.7500, parse_rate=1.0
  • Reviewer: spot-check weco run end-to-end on a fresh checkout


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 395bf80883


```python
"""Convert a snake_case label into a human-readable topic phrase."""
s = label.rstrip("?").replace("_", " ").lower()
# tidy a few label-specific oddities
s = s.replace(" or ", " or ").replace("pin", "PIN").replace("atm", "ATM")
```

P2: Avoid uppercasing `pin` inside unrelated words

The global replacement replace("pin", "PIN") rewrites any pin substring, which corrupts generated prompt text such as topping_up_by_card into topPINg up by card (and the same typo propagates into optimize.py via bake_optimize.py). Because this example’s optimization target is classification accuracy under a strict threshold, injecting malformed intent wording into the baseline prompt can systematically hurt intent recognition for affected classes and distort the measured baseline/threshold used by eval.py.
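One way to address this (a sketch, not necessarily the fix the PR will adopt) is to uppercase the acronyms only at word boundaries with `re.sub`, so substrings inside words like "topping" are left alone:

```python
import re

# Hypothetical fix: apply the acronym uppercasing only to whole words.
# humanize() is an illustrative name, not the PR's actual function.

def humanize(label: str) -> str:
    s = label.rstrip("?").replace("_", " ").lower()
    for acronym in ("pin", "atm"):
        # \b anchors the match to word boundaries, so "topping" is safe.
        s = re.sub(rf"\b{acronym}\b", acronym.upper(), s)
    return s

print(humanize("topping_up_by_card"))  # topping up by card
print(humanize("change_pin"))          # change PIN
```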


Add prompt-compression example: BANKING77 65K -> 3.2K (95.1%)

Demonstrates the minimize-chars-with-accuracy-floor shape, complementing
the maximize-accuracy shape in examples/prompt/. Treats SYSTEM_PROMPT
char count as a cost to minimize and classification accuracy as a
constraint to preserve.

Headline: claude-opus-4-7 × 50 steps compressed 65,887 → 3,229 chars
(95.1% reduction) holding accuracy at 0.7500 (baseline 0.7700, threshold
0.7500). gpt-5.5 found a different plateau at 6,828 chars (89.6%).
See https://weco.ai/share/XSRQdS7vfMdt9beD3KR1tlhg7By-FFIo for one full
run trajectory.

The bloated baseline is synthetic — generated to mimic real production
classifier prompts (operating principles, per-class blocks, FAQ, worked
examples) at the same scale as a representative real customer prompt.
BANKING77 itself is real and public (PolyAI/banking77 on HuggingFace).

Also updates examples/README.md to add this example to the table of
contents, the at-a-glance table, and the quick-starts section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ZhengyaoJiang force-pushed the feature/prompt-compression-example branch from 395bf80 to d559682 on May 1, 2026 18:00