SWE-bench

CodeWhale's SWE-bench adapter writes the prediction file that the official SWE-bench evaluation harness expects. It does not replace the harness; it generates model_patch rows from a local task workspace.

One Instance

Start from a workspace checked out at the SWE-bench instance base commit, with the issue text saved locally:

codewhale swebench run \
  --instance-id django__django-12345 \
  --issue-file issue.md \
  --predictions-path all_preds.jsonl

The run subcommand invokes CodeWhale's tool-backed non-interactive mode (equivalent to codewhale exec --auto) and emits stream-json output by default. When the turn finishes, CodeWhale exports the output of git diff --binary --no-ext-diff as one JSONL prediction row:

{"instance_id":"django__django-12345","model_name_or_path":"codewhale/deepseek-v4-pro","model_patch":"diff --git ..."}

If you already ran CodeWhale, or edited the workspace manually, export the current diff without another model turn:

codewhale swebench export \
  --instance-id django__django-12345 \
  --predictions-path all_preds.jsonl

Both commands update the existing row for a given instance_id instead of appending a duplicate. Before exporting the diff, CodeWhale marks untracked files with git add -N so that newly created files appear in the patch.
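The update-instead-of-append behavior can be illustrated with a short sketch. This is not CodeWhale's implementation; `upsert_prediction` is a hypothetical helper that shows one way to keep a single row per instance_id in a JSONL file.

```python
import json
from pathlib import Path


def upsert_prediction(path, row: dict) -> None:
    """Replace the row with the same instance_id, or append if it is new."""
    path = Path(path)
    rows = []
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():  # skip blank lines
                rows.append(json.loads(line))

    for index, existing in enumerate(rows):
        if existing.get("instance_id") == row["instance_id"]:
            rows[index] = row  # keep the original position in the file
            break
    else:
        rows.append(row)

    path.write_text(
        "".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8"
    )
```

Rewriting the whole file is simple and keeps row order stable, which is usually acceptable at the size of a SWE-bench prediction file.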

Evaluate

Install SWE-bench and Docker using the official SWE-bench setup instructions, then pass the prediction file to the official harness:

python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path all_preds.jsonl \
  --max_workers 1 \
  --run_id codewhale-smoke

On Apple Silicon, the official SWE-bench docs recommend adding --namespace '' so images build locally instead of pulling Linux images.
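When the harness is launched from a script, the command above can be assembled programmatically. The sketch below assumes only the flags shown in this section; `build_harness_command` and its `local_build` parameter are hypothetical names, and the Apple Silicon check is a heuristic rather than anything the harness itself performs.

```python
import platform
import sys


def build_harness_command(
    predictions_path,
    run_id,
    dataset_name="princeton-nlp/SWE-bench_Lite",
    max_workers=1,
    local_build=None,
):
    """Return the argv list for swebench.harness.run_evaluation."""
    if local_build is None:
        # Heuristic: macOS on an arm64 CPU is treated as Apple Silicon.
        local_build = sys.platform == "darwin" and platform.machine() == "arm64"

    cmd = [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_name,
        "--predictions_path", str(predictions_path),
        "--max_workers", str(max_workers),
        "--run_id", run_id,
    ]
    if local_build:
        # An empty namespace asks the harness to build images locally.
        cmd += ["--namespace", ""]
    return cmd
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`; using a list avoids shell quoting problems with the empty `--namespace` value.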

Batch Driver Shape

A simple batch runner should prepare each instance workspace, write the issue body to issue.md, run codewhale swebench run, then call the harness once on the accumulated all_preds.jsonl.
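The loop described above might look like the following sketch. It makes several assumptions: each instance is a dict with `instance_id` and `problem_statement` keys (the field names used by the SWE-bench dataset), `prepare_workspace` is a hypothetical callback you supply that checks out the base commit and returns the workspace path, and codewhale is run with the workspace as its working directory.

```python
import subprocess
from pathlib import Path


def run_batch(
    instances,
    prepare_workspace,
    predictions_path="all_preds.jsonl",
    runner=subprocess.run,
):
    """Run codewhale swebench run per instance; return failed instance ids."""
    # Use an absolute path so every workspace writes to the same file.
    predictions = Path(predictions_path).resolve()
    failed = []
    for inst in instances:
        workspace = Path(prepare_workspace(inst))
        issue_file = workspace / "issue.md"
        issue_file.write_text(inst["problem_statement"], encoding="utf-8")

        cmd = [
            "codewhale", "swebench", "run",
            "--instance-id", inst["instance_id"],
            "--issue-file", "issue.md",
            "--predictions-path", str(predictions),
        ]
        result = runner(cmd, cwd=workspace)
        if result.returncode != 0:
            failed.append(inst["instance_id"])
    return failed
```

The `runner` parameter exists so the loop can be exercised with a stub in tests. After `run_batch` returns, the harness is invoked once on the accumulated prediction file.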

For reproducible runs, pin:

  • CodeWhale version and commit: codewhale --version
  • Model label: --model-name-or-path codewhale/deepseek-v4-pro
  • Dataset and split used by the harness
  • Docker platform and worker count
  • The all_preds.jsonl file and CodeWhale stream logs
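One way to capture the pinned items is a small manifest stored next to the prediction file. This is a sketch under assumptions: `build_manifest` and its field names are hypothetical, and the version string and Docker platform are passed in by the caller (for example from codewhale --version) rather than discovered automatically.

```python
import hashlib
from pathlib import Path


def build_manifest(
    predictions_path,
    codewhale_version,
    model_label,
    dataset_name,
    split,
    docker_platform,
    max_workers,
):
    """Collect the pinned run parameters plus a hash of the predictions."""
    data = Path(predictions_path).read_bytes()
    return {
        "codewhale_version": codewhale_version,
        "model_name_or_path": model_label,
        "dataset_name": dataset_name,
        "split": split,
        "docker_platform": docker_platform,
        "max_workers": max_workers,
        "predictions_file": str(predictions_path),
        # The hash detects later edits to the prediction file.
        "predictions_sha256": hashlib.sha256(data).hexdigest(),
    }
```

Serializing the result with `json.dumps(manifest, indent=2)` and committing it alongside the stream logs makes it easier to tell two runs apart later.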

Official references: