
Add prompt-compression example: BANKING77 65K -> 3.2K (95.1%)#147

Open
ZhengyaoJiang wants to merge 1 commit into main from feature/prompt-compression-example

Conversation

@ZhengyaoJiang
Contributor

Summary

Adds a new examples/prompt-compression/ directory showing the minimize-chars-with-accuracy-floor shape of Weco optimization, complementing the maximize-accuracy shape in examples/prompt/.

The example deliberately bloats a BANKING77 classifier system prompt to 65,887 characters (mimicking real production prompt patterns: operating principles, per-class blocks, FAQ, worked examples), then has Weco compress it while constraining accuracy to stay at or above the baseline minus 2pp.

Headline result

claude-opus-4-7 × 50 steps: 65,887 → 3,229 chars (95.1% reduction) holding accuracy at 0.7500 on a 200-sample BANKING77 test slice (baseline 0.7700, threshold 0.7500). gpt-5.5 found a different plateau at 6,828 chars (89.6%).

See the trajectory: https://weco.ai/share/XSRQdS7vfMdt9beD3KR1tlhg7By-FFIo

What's in the folder

| File | Purpose |
| --- | --- |
| `optimize.py` | Baked baseline + `classify(query, model)`. Weco mutates only the `SYSTEM_PROMPT` string in the marked WECO-MUTABLE REGION. |
| `eval.py` | 200 BANKING77 samples → emits `accuracy:`, `chars:`, `metric:`. |
| `labels.py` | 77 canonical labels + robust `parse_predicted_label()`. |
| `build_bloated_prompt.py` | Deterministic baseline generator. |
| `bake_optimize.py` | One-shot: writes `optimize.py` from the generator. |
| `measure_baseline.py` | Measures baseline accuracy → `baseline_accuracy.json`. |
| `prompt_guide.md` | `--additional-instructions` content for the optimizer. |
| `baseline_accuracy.json` | Pre-measured 0.7700 (gpt-5-mini, seed=0). |
| `README.md` | Setup + run + headline result + structure of the compressed prompt. |
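For readers unfamiliar with the minimize-with-a-floor shape, here is a minimal sketch of the scoring `eval.py` might do before printing its `accuracy:`, `chars:`, `metric:` lines. The gating formula below is an assumption for illustration; the PR does not show the actual metric definition.

```python
# Hypothetical sketch, NOT the PR's implementation: treat SYSTEM_PROMPT
# char count as the cost to minimize, gated by an accuracy floor.

def score(n_correct: int, n_total: int, prompt: str,
          threshold: float = 0.7500) -> dict:
    accuracy = n_correct / n_total
    chars = len(prompt)
    # Candidates below the accuracy floor get a sentinel score so the
    # optimizer discards them; the rest compete on negative char count.
    metric = -chars if accuracy >= threshold else float("-inf")
    return {"accuracy": accuracy, "chars": chars, "metric": metric}

for key, value in score(150, 200, "x" * 3229).items():
    print(f"{key}: {value}")
```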

Also updates examples/README.md to add this example to the table of contents, the at-a-glance table, and the quick-starts section.

Note on the baseline

The 65K-char baseline is synthetic — generated to mimic real production classifier prompts at the same scale as a representative customer prompt. BANKING77 itself is real and public (PolyAI/banking77 on HuggingFace). The compression ratios match what we've seen on real customer prompts of similar size and shape.

Test plan

  • Smoke-tested locally: optimize.py imports, all 77 labels round-trip through parser, eval.py imports and threshold loads from baseline_accuracy.json
  • Independent eval of the 3,229-char winner: accuracy=0.7500, parse_rate=1.0
  • Reviewer: spot-check weco run end-to-end on a fresh checkout


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 395bf80883


```python
"""Convert a snake_case label into a human-readable topic phrase."""
s = label.rstrip("?").replace("_", " ").lower()
# tidy a few label-specific oddities
s = s.replace(" or ", " or ").replace("pin", "PIN").replace("atm", "ATM")
```

P2: Avoid uppercasing `pin` inside unrelated words

The global replacement replace("pin", "PIN") rewrites any pin substring, which corrupts generated prompt text such as topping_up_by_card into topPINg up by card (and the same typo propagates into optimize.py via bake_optimize.py). Because this example’s optimization target is classification accuracy under a strict threshold, injecting malformed intent wording into the baseline prompt can systematically hurt intent recognition for affected classes and distort the measured baseline/threshold used by eval.py.
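One way to address this (a sketch, not necessarily the fix the PR will adopt) is to uppercase the acronyms only at word boundaries with `re.sub`, so substrings inside words like "topping" are left alone:

```python
import re

# Hypothetical fix: apply the acronym uppercasing only to whole words.
# humanize() is an illustrative name, not the PR's actual function.

def humanize(label: str) -> str:
    s = label.rstrip("?").replace("_", " ").lower()
    for acronym in ("pin", "atm"):
        # \b anchors the match to word boundaries, so "topping" is safe.
        s = re.sub(rf"\b{acronym}\b", acronym.upper(), s)
    return s

print(humanize("topping_up_by_card"))  # topping up by card
print(humanize("change_pin"))          # change PIN
```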


Add prompt-compression example: BANKING77 65K -> 3.2K (95.1%)

Demonstrates the minimize-chars-with-accuracy-floor shape, complementing
the maximize-accuracy shape in examples/prompt/. Treats SYSTEM_PROMPT
char count as a cost to minimize and classification accuracy as a
constraint to preserve.

Headline: claude-opus-4-7 × 50 steps compressed 65,887 → 3,229 chars
(95.1% reduction) holding accuracy at 0.7500 (baseline 0.7700, threshold
0.7500). gpt-5.5 found a different plateau at 6,828 chars (89.6%).
See https://weco.ai/share/XSRQdS7vfMdt9beD3KR1tlhg7By-FFIo for one full
run trajectory.

The bloated baseline is synthetic — generated to mimic real production
classifier prompts (operating principles, per-class blocks, FAQ, worked
examples) at the same scale as a representative real customer prompt.
BANKING77 itself is real and public (PolyAI/banking77 on HuggingFace).

Also updates examples/README.md to add this example to the table of
contents, the at-a-glance table, and the quick-starts section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ZhengyaoJiang force-pushed the feature/prompt-compression-example branch from 395bf80 to d559682 on May 1, 2026 18:00