Add prompt-compression example: BANKING77 65K -> 3.2K (95.1%) #147
ZhengyaoJiang wants to merge 1 commit into main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 395bf80883
| """Convert a snake_case label into a human-readable topic phrase.""" | ||
| s = label.rstrip("?").replace("_", " ").lower() | ||
| # tidy a few label-specific oddities | ||
| s = s.replace(" or ", " or ").replace("pin", "PIN").replace("atm", "ATM") |
Avoid uppercasing pin inside unrelated words
The global replacement replace("pin", "PIN") rewrites any pin substring, which corrupts generated prompt text such as topping_up_by_card into topPINg up by card (and the same typo propagates into optimize.py via bake_optimize.py). Because this example’s optimization target is classification accuracy under a strict threshold, injecting malformed intent wording into the baseline prompt can systematically hurt intent recognition for affected classes and distort the measured baseline/threshold used by eval.py.
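One way to address this is to anchor the replacements at word boundaries. A minimal sketch of such a fix (the function body below follows the diff; the `humanize` name and the `re.sub` calls are my illustration, not the PR's actual patch):

```python
import re

def humanize(label: str) -> str:
    """Convert a snake_case label into a human-readable topic phrase."""
    s = label.rstrip("?").replace("_", " ").lower()
    # \b anchors match "pin"/"atm" only as whole words, so
    # "topping_up_by_card" is no longer rewritten to "topPINg up by card"
    s = re.sub(r"\bpin\b", "PIN", s)
    s = re.sub(r"\batm\b", "ATM", s)
    return s
```

With this version, `humanize("topping_up_by_card")` yields `"topping up by card"` while `humanize("change_pin")` still yields `"change PIN"`.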
…K (95.1%)

Demonstrates the minimize-chars-with-accuracy-floor shape, complementing the maximize-accuracy shape in examples/prompt/. Treats SYSTEM_PROMPT char count as a cost to minimize and classification accuracy as a constraint to preserve.

Headline: claude-opus-4-7 × 50 steps compressed 65,887 → 3,229 chars (95.1% reduction) holding accuracy at 0.7500 (baseline 0.7700, threshold 0.7500). gpt-5.5 found a different plateau at 6,828 chars (89.6%). See https://weco.ai/share/XSRQdS7vfMdt9beD3KR1tlhg7By-FFIo for one full run trajectory.

The bloated baseline is synthetic, generated to mimic real production classifier prompts (operating principles, per-class blocks, FAQ, worked examples) at the same scale as a representative real customer prompt. BANKING77 itself is real and public (PolyAI/banking77 on HuggingFace).

Also updates examples/README.md to add this example to the table of contents, the at-a-glance table, and the quick-starts section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed 395bf80 to d559682
Summary

New examples/prompt-compression/ showing the minimize-chars-with-accuracy-floor shape of Weco optimization, complementing the maximize-accuracy shape in examples/prompt/. The example deliberately bloats a BANKING77 classifier system prompt to 65,887 characters (mimicking real production prompt patterns: operating principles, per-class blocks, FAQ, worked examples), then has Weco compress it while constraining accuracy to stay at or above the baseline minus 2pp.
Headline result

claude-opus-4-7 × 50 steps: 65,887 → 3,229 chars (95.1% reduction) holding accuracy at 0.7500 on a 200-sample BANKING77 test slice (baseline 0.7700, threshold 0.7500). gpt-5.5 found a different plateau at 6,828 chars (89.6%).

See the trajectory: https://weco.ai/share/XSRQdS7vfMdt9beD3KR1tlhg7By-FFIo
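The minimize-with-a-floor shape can be sketched as a scoring function. This is my illustration of the idea, not the example's actual eval.py, which may combine the numbers differently (the function name and the infinite penalty are assumptions):

```python
def compression_metric(prompt_chars: int, accuracy: float,
                       accuracy_floor: float = 0.75) -> float:
    """Fewer characters is better, but only while accuracy holds the floor."""
    if accuracy < accuracy_floor:
        # Constraint violated: score the candidate strictly worse than
        # any feasible one so the optimizer discards it.
        return float("-inf")
    # Negate so that "maximize metric" means "minimize characters".
    return -float(prompt_chars)
```

Under this scoring, the baseline prompt scores -65887.0 (65,887 chars at 0.77 accuracy) and the compressed prompt scores -3229.0 (3,229 chars at 0.75 accuracy, exactly on the floor), so the optimizer prefers the compressed one.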
What's in the folder

- optimize.py: exposes classify(query, model). Weco mutates only the SYSTEM_PROMPT string in the marked WECO-MUTABLE REGION.
- eval.py: prints accuracy:, chars:, and metric: lines.
- labels.py: parse_predicted_label().
- build_bloated_prompt.py: the bloated-prompt generator.
- bake_optimize.py: bakes optimize.py from the generator.
- measure_baseline.py: writes baseline_accuracy.json.
- prompt_guide.md: --additional-instructions content for the optimizer.
- baseline_accuracy.json
- README.md

Also updates examples/README.md to add this example to the table of contents, the at-a-glance table, and the quick-starts section.

Note on the baseline
The 65K-char baseline is synthetic — generated to mimic real production classifier prompts at the same scale as a representative customer prompt. BANKING77 itself is real and public (PolyAI/banking77 on HuggingFace). The compression ratios match what we've seen on real customer prompts of similar size and shape.
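The stdout contract eval.py follows (accuracy:, chars:, metric: lines) implies the optimizer parses those lines back out of the evaluation run. A hypothetical parser sketch, not taken from the repo:

```python
def parse_eval_output(stdout: str) -> dict[str, float]:
    """Pull accuracy/chars/metric values out of eval.py-style stdout."""
    wanted = {"accuracy", "chars", "metric"}
    values: dict[str, float] = {}
    for line in stdout.splitlines():
        key, sep, raw = line.partition(":")
        if sep and key.strip() in wanted:
            values[key.strip()] = float(raw)
    return values
```

For example, `parse_eval_output("accuracy: 0.7500\nchars: 3229\nmetric: -3229")` returns `{"accuracy": 0.75, "chars": 3229.0, "metric": -3229.0}`.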
Test plan

- optimize.py imports; all 77 labels round-trip through the parser; eval.py imports and the threshold loads from baseline_accuracy.json
- weco run end-to-end on a fresh checkout