CodeWhale's SWE-bench adapter writes the prediction file that the official
SWE-bench evaluation harness expects. It does not replace the harness; it
generates model_patch rows from a local task workspace.
Start from a workspace checked out at the SWE-bench instance base commit, with the issue text saved locally:
codewhale swebench run \
  --instance-id django__django-12345 \
  --issue-file issue.md \
  --predictions-path all_preds.jsonl
The run subcommand invokes tool-backed non-interactive mode, equivalent to
codewhale exec --auto, with stream-json output by default. When the turn
finishes, CodeWhale exports the output of git diff --binary --no-ext-diff as
one JSONL prediction row:
{"instance_id":"django__django-12345","model_name_or_path":"codewhale/deepseek-v4-pro","model_patch":"diff --git ..."}
If you already ran CodeWhale, or edited the workspace manually, export the current diff without another model turn:
codewhale swebench export \
  --instance-id django__django-12345 \
  --predictions-path all_preds.jsonl
Both commands replace any existing row with the same instance_id instead of
appending a duplicate. Untracked files are marked with git add -N before the
diff is exported, so newly created files appear in the patch.
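The export and update-in-place behavior described above can be sketched in Python. This is a minimal illustration under stated assumptions, not CodeWhale's implementation: the function names export_diff and upsert_prediction are hypothetical, the row keys are taken from the example row above, and the sketch assumes the diff decodes as UTF-8 text.

```python
import json
import subprocess
from pathlib import Path


def export_diff(workspace):
    """Return the workspace diff, including files that are still untracked.

    Hypothetical helper: approximates the export step described in the text.
    """
    # Mark untracked files as intent-to-add so they show up in `git diff`.
    subprocess.run(["git", "add", "-N", "."], cwd=workspace, check=True)
    result = subprocess.run(
        ["git", "diff", "--binary", "--no-ext-diff"],
        cwd=workspace, check=True, capture_output=True, text=True,
    )
    return result.stdout


def upsert_prediction(path, instance_id, model_name_or_path, model_patch):
    """Replace the row for instance_id if it exists, otherwise append it."""
    path = Path(path)
    rows = []
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                rows.append(json.loads(line))

    new_row = {
        "instance_id": instance_id,
        "model_name_or_path": model_name_or_path,
        "model_patch": model_patch,
    }
    for i, row in enumerate(rows):
        if row.get("instance_id") == instance_id:
            rows[i] = new_row  # update in place, keeping the row order
            break
    else:
        rows.append(new_row)  # no existing row for this instance

    # One JSON object per line; json.dumps escapes newlines in the patch.
    path.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    return rows
```

Rewriting the whole file on every update is adequate for prediction files of benchmark size and keeps the one-row-per-instance property easy to check.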
Install SWE-bench and Docker using the official SWE-bench setup instructions, then pass the prediction file to the official harness:
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path all_preds.jsonl \
  --max_workers 1 \
  --run_id codewhale-smoke
On Apple Silicon, the official SWE-bench docs recommend adding
--namespace '' so that images are built locally instead of being pulled as prebuilt images.
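When the harness is launched from a script, the optional --namespace '' argument can be added conditionally. The sketch below only assembles the argument list shown above. The function name harness_command and the local_build parameter are hypothetical, and detecting Apple Silicon through platform.system() and platform.machine() is an assumption of this example, not something the SWE-bench docs prescribe.

```python
import platform
import sys


def harness_command(predictions_path, run_id,
                    dataset_name="princeton-nlp/SWE-bench_Lite",
                    max_workers=1, local_build=None):
    """Build the argv for the official harness; does not run it."""
    if local_build is None:
        # Assumption: macOS on arm64 means Apple Silicon.
        local_build = (platform.system() == "Darwin"
                       and platform.machine() == "arm64")
    cmd = [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_name,
        "--predictions_path", str(predictions_path),
        "--max_workers", str(max_workers),
        "--run_id", run_id,
    ]
    if local_build:
        # An empty namespace asks the harness to build images locally.
        cmd += ["--namespace", ""]
    return cmd
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting problems with the empty-string argument.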
A simple batch runner should prepare each instance workspace, write the issue
body to issue.md, run codewhale swebench run, then call the harness once
on the accumulated all_preds.jsonl.
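The batch loop described above might look like the following sketch. It is an illustration under assumptions: prepare_workspace is a hypothetical callable that you supply to check out each instance at its base commit, the instance dictionaries are assumed to carry the SWE-bench dataset fields instance_id and problem_statement, and the issue file is written outside the workspace as a precaution so that it cannot end up in the exported diff.

```python
import subprocess
from pathlib import Path


def codewhale_run_command(instance_id, issue_file, predictions_path):
    """Argv for `codewhale swebench run`, using the flags shown earlier."""
    return [
        "codewhale", "swebench", "run",
        "--instance-id", instance_id,
        "--issue-file", str(issue_file),
        "--predictions-path", str(predictions_path),
    ]


def run_batch(instances, predictions_path, issue_dir, prepare_workspace,
              runner=subprocess.run):
    """Run CodeWhale once per instance; return the ids whose run failed.

    prepare_workspace(instance) is a hypothetical hook returning the path of
    a workspace checked out at the instance base commit.
    """
    predictions_path = Path(predictions_path).resolve()
    issue_dir = Path(issue_dir).resolve()
    issue_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for inst in instances:
        workspace = Path(prepare_workspace(inst))
        # Keep the issue text outside the workspace (precaution, see lead-in).
        issue_file = issue_dir / (inst["instance_id"] + ".md")
        issue_file.write_text(inst["problem_statement"], encoding="utf-8")
        cmd = codewhale_run_command(inst["instance_id"], issue_file,
                                    predictions_path)
        result = runner(cmd, cwd=workspace)
        if result.returncode != 0:
            failed.append(inst["instance_id"])
    return failed
```

After run_batch returns, the harness is invoked once on the accumulated predictions file. The runner parameter exists only so the loop can be exercised without the CodeWhale binary installed.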
For reproducible runs, pin:
- CodeWhale version and commit: codewhale --version
- Model label: --model-name-or-path codewhale/deepseek-v4-pro
- Dataset and split used by the harness
- Docker platform and worker count
- The all_preds.jsonl file and CodeWhale stream logs
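One way to keep these items together is a small manifest written next to the predictions file. The sketch below is a suggestion rather than a CodeWhale feature: the function name build_manifest and every manifest key are hypothetical, and the version string is expected to be whatever codewhale --version printed for the run.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(predictions_path, codewhale_version, model_label,
                   dataset_name, split, docker_platform, max_workers):
    """Collect the pinned settings plus a checksum of the predictions file."""
    data = Path(predictions_path).read_bytes()
    return {
        "codewhale_version": codewhale_version,
        "model_name_or_path": model_label,
        "dataset_name": dataset_name,
        "split": split,
        "docker_platform": docker_platform,
        "max_workers": max_workers,
        "predictions_file": Path(predictions_path).name,
        # The hash lets a later reader confirm the file was not regenerated.
        "predictions_sha256": hashlib.sha256(data).hexdigest(),
    }


def write_manifest(manifest, path):
    """Write the manifest as sorted, indented JSON for stable diffs."""
    Path(path).write_text(
        json.dumps(manifest, indent=2, sort_keys=True) + "\n",
        encoding="utf-8",
    )
```

Storing the manifest alongside the stream logs makes it possible to tell two runs apart even when they share the same run_id.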
Official references:
- SWE-bench repository: https://github.com/SWE-bench/SWE-bench
- SWE-bench harness docs: https://www.swebench.com/SWE-bench/api/harness/