Benchmark harness for comparing code context engines head-to-head.
Tests Delphi, Context7, and Nia across 11 benchmark phases — code retrieval, multi-hop, adversarial, hallucination, validated datasets, LLM-as-judge, position-debiased enhanced judge, SWE-Agent code generation, diff-aware indexing (does the engine correctly refresh its index after a real commit?), and real-session replay of cases where one engine visibly beat another.
100 queries per engine per phase. All validated datasets scored with LLM judge (--match-mode llm). Full methodology in docs/BENCHMARK_REPORT.md.
| Benchmark | Metric | Delphi | Context7 | Nia |
|---|---|---|---|---|
| Retrieval | MRR | 0.962 | 0.790 | 0.728 |
| Code QA | Accuracy | 0.310 | 0.270 | 0.263 |
| Adversarial | Discrimination | 0.530 | 0.429 | 0.435 |
| Hallucination | Rate (lower is better) | 40.0% | 46.0% | 50.0% |
| Dataset | Metric | Delphi | Context7 | Nia |
|---|---|---|---|---|
| CodeSearchNet | MRR | 0.864 | 0.010 | 0.040 |
| CoSQA | MRR | 0.722 | 0.110 | 0.298 |
Each query scored twice with swapped chunk ordering to eliminate positional bias. Scored on Relevance, Completeness, Specificity, and Faithfulness (0-3 each).
| Dataset | Engine | Total (0-3) | Wins |
|---|---|---|---|
| CodeSearchNet | Delphi | 1.705 | 84 |
| CodeSearchNet | Context7 | 0.410 | 3 |
| CodeSearchNet | Nia | 0.345 | 3 |
| CoSQA | Delphi | 1.225 | 51 |
| CoSQA | Nia | 0.875 | 20 |
| CoSQA | Context7 | 0.598 | 12 |
Does feeding context engine results to an LLM actually produce better code? 25 hand-crafted SWE tasks across 3 knowledge tiers.
| Metric | Baseline (no context) | Delphi | Context7 | Nia |
|---|---|---|---|---|
| Judge composite | 0.665 | 0.806 (+21%) | 0.821 (+23%) | 0.802 (+21%) |
| Criteria pass | 90% | 92% | 88% | 89% |
| Context utilization | — | 12% | 16% | 21% |
| Knowledge Tier | Delphi delta | Context7 delta | Nia delta |
|---|---|---|---|
| A (well-known) | +0.108 | +0.164 | +0.083 |
| B (niche/recent) | +0.128 | +0.150 | +0.120 |
| C (version-specific) | +0.215 | +0.247 | +0.247 |
Supplementary: AdvTest (structurally disadvantages library-lookup engines)
AdvTest uses obfuscated queries without library names, which structurally disadvantages engines like Context7 that require a library name to search. Results are reported separately.
| Dataset | Metric | Delphi | Context7 | Nia |
|---|---|---|---|---|
| AdvTest | MRR | 0.970 | 0.000 | 0.030 |
| Dataset | Engine | Total (0-3) | Wins |
|---|---|---|---|
| AdvTest | Delphi | 1.740 | 93 |
| AdvTest | Context7 | 0.393 | 2 |
| AdvTest | Nia | 0.365 | 1 |
This benchmark addresses common fairness concerns in engine comparison:
- LLM-as-Judge — Claude Sonnet evaluates result quality regardless of format (code vs docs), replacing biased file-path matching
- Wall-clock latency — all engines measured identically with
time.perf_counter(). Delphi averages ~2.2-2.5s/query on production deployment. - Position debiasing — enhanced judge shuffles chunk order to eliminate ~10% positional bias
- Equal adapter treatment — no artificial handicaps, equalized timeouts (120s), no fallback libraries
- Consistent sample size — 100 queries per engine per phase
| # | Phase | Tests | Queries |
|---|---|---|---|
| 1 | Retrieval Quality | P@K, Recall@K, NDCG@K, MRR against known ground truth | 100 |
| 2 | Multi-Hop Retrieval | Queries needing context from 2+ files/repos | 100 |
| 3 | Code QA | Definitions, call sites, imports, inheritance, return types | 100 |
| 4 | Adversarial Near-Miss | Decoys: same name/wrong context, test vs prod, version confusion | 100 |
| 5 | Hallucination Rate | Does engine context prevent LLMs from making stuff up? | 100 |
| 6 | CodeSearchNet | Function-level code search (Husain et al. 2019) | 100 |
| 7 | CoSQA | Real web search queries (Huang et al. 2021) | 100 |
| 8 | Enhanced Judge | Position-debiased 4D + faithfulness + RAGAS metrics | 200 |
| 9 | SWE-Agent | Code generation with/without context, no-context baseline | 25 |
| 10 | Diff-Aware Indexing | Does the engine correctly refresh its index after a real commit? Stale / fresh / stability checks. | 15 |
| 11 | Session Replay | Real moments from production research sessions where one engine beat another | 10 |
| — | AdvTest (supplementary) | Adversarial/obfuscated code queries | 100 |
The diagnosis that drove the Phase 10/11 additions: pure code-retrieval benchmarks understate context-engine value-add. Real work needs the index to be fresh (not just accurate) and to handle the messy, mid-session moments where the engine that wins yesterday loses today. Phases 10 and 11 measure those.
15 cases, each a real (commit_A, commit_B) pair on a public Python library (fastapi, httpx, pydantic, sqlalchemy, polars, litestar, msgspec, requests). Every case ships three queries:
| Query class | What it asks | Engine wins by |
|---|---|---|
stale_query |
A symbol that existed in A but was deleted in B | Not returning the deleted symbol |
fresh_query |
A symbol added in B (didn't exist in A) | Returning the new symbol |
stable_query |
A symbol unchanged across A and B | Returning the unchanged symbol (regression guard) |
Composite correctness = mean of [1−stale_hit, fresh_hit, stable_hit]. Three diagnostic rates are also reported: staleness (lower better), freshness, stability.
Real moments from production research sessions, each labeled with the original failure cause (missing_index_coverage, bad_retrieval, bad_ranking, bad_packaging, tool_ergonomics, benchmark_blind_spot). The replay then re-classifies the failure under the live taxonomy so the report can show "10 cases labeled bad_retrieval, now 6 still failing / 2 reclassified as bad_ranking / 2 resolved." See benchmarks/scoring/failure_taxonomy.py.
uv sync
cp benchmarks/.env.local.example benchmarks/.env.local
# fill in your API keys# run everything
uv run python -m benchmarks
# run specific phases
uv run python -m benchmarks --skip-indexing --max-queries 100 --match-mode llm -v
# single engine
uv run python -m benchmarks --engines synsc --skip-indexing --max-queries 50
# validated datasets only
uv run python -m benchmarks --validated-only --match-mode llm --dataset codesearchnet cosqa advtest
# enhanced judge only
uv run python -m benchmarks --enhanced-judge-only| Variable | What it does |
|---|---|
SYNSC_API_URL |
Delphi server URL (default http://localhost:8742) |
SYNSC_API_KEY |
Delphi API key |
NIA_API_KEY |
Nia API key |
CONTEXT7_ENABLED |
Set true for Context7 |
BENCH_LLM_PROVIDER |
anthropic, gemini, or openai |
BENCH_LLM_MODEL |
Model ID for judge benchmarks |
BENCH_LLM_API_KEY |
API key for the judge LLM |
All CLI flags
| Flag | Effect |
|---|---|
--engines synsc nia context7 |
Pick engines |
--engines synsc-mcp |
Use Delphi via the MCP proxy (agent-realistic) |
--synsc-quality-mode {agent,default} |
Pass-through to the Delphi adapter |
--skip-indexing |
Skip repo indexing |
--skip-retrieval |
Skip retrieval phase |
--skip-multihop |
Skip multi-hop phase |
--skip-code-qa |
Skip code QA phase |
--skip-adversarial |
Skip adversarial phase |
--skip-hallucination |
Skip hallucination phase |
--skip-validated |
Skip validated dataset phases |
--skip-diff-aware |
Skip diff-aware indexing (Phase 10) |
--skip-session-replay |
Skip Session Replay (Phase 11) |
--diff-aware-only |
Run only diff-aware indexing |
--session-replay-only |
Run only session replay |
--validated-only |
Run only validated datasets |
--enhanced-judge-only |
Run only enhanced judge |
--match-mode llm |
Use LLM judge for validated scoring |
--judge-top-k N |
Judge top-N (was hard-coded to 3; default now 10) |
--seed N |
RNG seed for query sampling |
--num-seeds N |
Run N seeds and aggregate (3-5 for CIs) |
--no-debiasing |
Disable position debiasing (2x faster) |
--no-significance |
Skip statistical analysis |
--dataset cosqa codesearchnet advtest |
Pick specific datasets |
--swe-agent-only |
Run only SWE-Agent benchmark |
--skip-swe-agent |
Skip SWE-Agent benchmark |
--real-patch |
Enable real-patch SWE evaluation (clone + run tests) |
--with-agent-queries |
Also run AI-generated queries (default: gold only) |
--engines none |
Baseline-only mode (with --swe-agent-only) |
--max-queries N |
Limit query count per phase |
-v |
Verbose logging |
Every pairwise comparison includes paired t-tests, Wilcoxon signed-rank, bootstrap CIs (10K resamples), Cohen's d, Cliff's delta, and Bonferroni correction. The enhanced judge tracks its own reliability with Cohen's kappa and Position Consistency metrics.
| Engine | Adapter | Notes |
|---|---|---|
| Delphi | benchmarks/adapters/synsc.py |
HTTP API, needs indexed repos |
| Nia | benchmarks/adapters/nia.py |
REST API, global knowledge search |
| Context7 | benchmarks/adapters/context7.py |
HTTP API, pre-crawled docs |
Add a new engine by implementing ContextEngineAdapter from benchmarks/adapters/base.py.
synsci-context-bench/
├── README.md this file
├── ARCHITECTURE.md high-level design + diagnosis traceability
├── CHANGELOG.md release notes (1.0 → 1.1 → reorg)
├── CONTRIBUTING.md how to add phases / engines / metrics
├── pyproject.toml
│
├── benchmarks/
│ ├── README.md package overview
│ ├── __main__.py cli entry point
│ ├── runner.py phase orchestrator
│ ├── config.py env + path config (curated_dir, validated_dir, seeds)
│ │
│ ├── adapters/ one file per engine
│ │ ├── synsc.py Delphi HTTP (quality_mode aware)
│ │ ├── synsc_mcp.py Delphi MCP-proxy (build_context_pack)
│ │ ├── nia.py Nia REST (full-latency accounted)
│ │ ├── context7.py Context7 HTTP (full-latency accounted)
│ │ └── base.py ContextEngineAdapter interface
│ │
│ ├── phases/ one module per benchmark phase
│ │ ├── multihop.py Phase 2
│ │ ├── code_qa.py Phase 3
│ │ ├── adversarial.py Phase 4
│ │ ├── hallucination.py Phase 5
│ │ ├── validated_eval.py Phase 6 — CodeSearchNet / CoSQA / AdvTest
│ │ ├── swe_agent.py Phase 9 — code generation benchmark
│ │ ├── swe_real_patch.py Phase 9b — real-patch eval (opt-in)
│ │ ├── diff_aware.py Phase 10 — diff-aware indexing
│ │ └── session_replay.py Phase 11 — production session replay
│ │
│ ├── judges/ LLM-as-judge implementations
│ │ ├── llm_judge.py 3D blind scoring
│ │ └── enhanced_judge.py 4D position-debiased + RAGAS
│ │
│ ├── scoring/ deterministic scoring + analysis
│ │ ├── metrics.py MRR, NDCG, P@K, R@K (dedup-fixed), MAP
│ │ ├── semantic_metrics.py CodeBLEU + AST similarity
│ │ ├── context_grounding.py citation, utilization, hallucination-reduction
│ │ ├── leaderboards.py per-category leaderboards
│ │ ├── failure_taxonomy.py classify failures into actionable buckets
│ │ └── statistical_analysis.py paired tests, bootstrap, effect sizes
│ │
│ ├── infra/ operational glue
│ │ ├── logging_config.py structured logging + per-query traces
│ │ ├── sampling.py seeded + stratified sampling
│ │ ├── latency.py end-to-end latency meter
│ │ └── consistency.py repeat-run consistency checks
│ │
│ ├── utils/ standalone helpers
│ │ ├── dataset_loader.py downloads CodeSearchNet / CoSQA / ...
│ │ └── create_benchmark_repo.py fixture builder
│ │
│ ├── datasets/
│ │ ├── curated/ hand-built cases owned by this repo
│ │ │ ├── retrieval_ground_truth.json
│ │ │ ├── multihop_test_cases.json
│ │ │ ├── code_qa_test_cases.json
│ │ │ ├── adversarial_test_cases.json
│ │ │ ├── hallucination_test_cases.json
│ │ │ ├── swe_agent_test_cases.json
│ │ │ ├── diff_aware_test_cases.json
│ │ │ └── session_replay_cases.json
│ │ └── validated/ downloaded standard datasets
│ │ ├── codesearchnet_benchmark.json
│ │ ├── cosqa_benchmark.json
│ │ ├── advtest_benchmark.json
│ │ └── ...
│ │
│ └── results/ run_<ts>/ directories — traces, manifests, CSVs
│
├── docs/
│ ├── README.md docs index
│ ├── PHASES.md per-phase deep dive
│ ├── METRICS.md per-metric reference
│ └── BENCHMARK_REPORT.md last full-run report
│
├── scripts/
│ └── generate_charts.py regenerate assets/charts/results.png
│
└── assets/
└── charts/
Every subdirectory has its own README.md that describes what's in it
and the local conventions.
python scripts/generate_charts.pyHusain et al. (2019) CodeSearchNet Challenge. arXiv:1909.09436 · Huang et al. (2021) CoSQA. ACL 2021 · Zheng et al. (2023) Judging LLM-as-a-Judge · Shi et al. (2025) Judging the Judges · Es et al. (2024) RAGAS · Ren et al. (2020) CodeBLEU · Thakur et al. (2021) BEIR. NeurIPS
Built by the Synthetic Sciences team
Questions? hello@syntheticsciences.ai