diff --git a/docs/VENV.md b/docs/VENV.md
index 553500f..da9f53b 100644
--- a/docs/VENV.md
+++ b/docs/VENV.md
@@ -4,7 +4,29 @@
 Instead of using pre-built containers, you can run evaluations with your own
 Python virtual environment by passing `--venv-path`.
 
-## Setup
+## Choosing your venv
+
+Most evaluations share **one general venv**. A handful of framework-level
+suites have hard dependency conflicts and need their own venv:
+
+| Task group(s) | Engine | Venv | Setup |
+|---|---|---|---|
+| `open-sci-*`, `belebele_*_cf`, all text/multilingual tasks | `lm-eval-harness`, `lighteval` | **general** | [Setup](#setup-general-venv) |
+| `image-*`, `video-*`, `audio-*` (modality-prefixed) | `lmms-eval` | **general** | [Setup](#setup-general-venv) |
+| `dclm-core-22` | `lm-eval-harness` (pinned 0.4.9.2) | **dclm** | [DCLM-core-22](#dclm-core-22) |
+| `reasoning` (GPQA/MATH500/AIME/MBPP/etc.) | `evalchemy` + forked lm-eval | **evalchemy** | [Evalchemy](#evalchemy-reasoning) |
+
+Custom contrib benchmarks bring their own dependency stacks and are
+documented in `oellm/contrib/<name>/README.md`:
+
+| Task group(s) | Contrib | README |
+|---|---|---|
+| `audio-audiobench*` | `audiobench` | [`oellm/contrib/audiobench/README.md`](../oellm/contrib/audiobench/README.md) |
+| `regiondial-*` | `regiondial_bench` | [`oellm/contrib/regiondial_bench/README.md`](../oellm/contrib/regiondial_bench/README.md) |
+
+Use `oellm list-tasks` to see which suite a given task group routes to.
+
+## Setup (general venv)
 
 1. Create a venv with Python 3.12:
 
    ```bash
diff --git a/oellm/contrib/audiobench/README.md b/oellm/contrib/audiobench/README.md
new file mode 100644
index 0000000..60ab326
--- /dev/null
+++ b/oellm/contrib/audiobench/README.md
@@ -0,0 +1,179 @@
+# AudioBench
+
+AudioBench (AudioLLMs/AudioBench, [arXiv 2406.16020](https://arxiv.org/abs/2406.16020))
+is a broad audio-understanding benchmark covering ASR, speech translation,
+spoken reasoning, audio scene QA, and paralinguistics. This contrib plugin
+wraps AudioBench as a callable `audiobench` suite inside elliot-cli so WP4
+can produce numbers directly comparable with the AudioBench paper and
+leaderboard, without the scoring-normalisation drift that would come from
+running the same datasets through lmms-eval.
+
+## Scope
+
+The plugin covers **27 judge-free tasks** across ASR (WER), speech
+translation (BLEU), spoken reasoning (accuracy / string_match), and
+AudioCaps (METEOR). Of these:
+
+- **20 tasks are genuinely new** to the platform — not in any of our
+  existing lmms-eval `audio-*` groups. Examples: `earnings21_test`,
+  `earnings22_test`, GigaSpeech2 (Thai / Indonesian / Vietnamese),
+  SEAME code-switch, Spoken-MQA reasoning splits, MMAU mini.
+- **7 tasks are dual-registered** duplicates of benchmarks we already run
+  through lmms-eval (LibriSpeech test-clean/other, Common Voice 15 EN,
+  GigaSpeech, People's Speech, TED-LIUM 3, CoVoST2 en→zh). These use
+  AudioBench's own scorer and normaliser so WP4 can report numbers
+  aligned with the AudioBench paper.
+
+Every AudioBench task is namespaced with an `audiobench_` prefix so the CSV
+`task_path` column unambiguously identifies which scorer produced a number
+(e.g. `audiobench_librispeech_test_clean` is AudioBench-scored;
+`librispeech_test_clean` remains the lmms-eval version).
+
+Judge-dependent tasks (SLUE-SQA5, Spoken-SQuAD, AudioCaps-QA, IEMOCAP /
+MELD / VoxCeleb probes, AudioLLM-InstructionFollowing) are not included;
+they are deferred until a vLLM judge service is provisioned on Leonardo.
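+
+To double-check how these groups are wired up on a given install, filter the
+registry listing for the AudioBench suite. This is a minimal sketch, assuming
+`oellm list-tasks` prints its task-group-to-suite routing as plain text:
+
+```bash
+# List every task group that routes to the audiobench suite
+oellm list-tasks | grep -i audiobench
+```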
+ +## Prerequisites + +AudioBench is not pip-installable (no upstream build backend, bare imports +in `src/main_evaluate.py`); the plugin invokes it as a subprocess from an +on-cluster clone. A dedicated venv is required: the `[audiobench]` extra +pins `transformers<5` and `jiwer<3`, which conflict with the general eval +venv (see [`docs/VENV.md`](../../../docs/VENV.md) for the framework venvs). + +### 1. Clone AudioBench and configure `clusters.yaml` + +```bash +git clone https://github.com/AudioLLMs/AudioBench /path/to/AudioBench +``` + +Add `AUDIOBENCH_DIR` to your cluster block in +`oellm/resources/clusters.yaml`: + +```yaml +leonardo: + ... + AUDIOBENCH_DIR: "/path/to/AudioBench" +``` + +### 2. Create the venv + +```bash +uv venv --python 3.12 audiobench-venv +source audiobench-venv/bin/activate +uv pip install -e ".[audiobench]" +``` + +The `[audiobench]` extra pins `transformers>=4.45,<5`, `jiwer<3`, +`sacrebleu`, `pythainlp`, `evaluate`, `soundfile`, `librosa`. + +### 3. Install AudioBench's runtime dependencies + +```bash +# AudioBench's own requirements (filter vllm; only used by deferred judge tasks) +grep -v -i '^vllm' /path/to/AudioBench/requirements.txt > /tmp/ab-reqs.txt +uv pip install -r /tmp/ab-reqs.txt + +# PyTorch for cluster's CUDA driver — PyPI defaults target a newer runtime +# than most HPC drivers (Leonardo / JURECA report CUDA 12.2) and crash with +# `NVIDIA driver too old`. Use the cu121 index. +uv pip install torch torchvision torchaudio \ + --index-url https://download.pytorch.org/whl/cu121 + +# rapidfuzz C extension — without this, jiwer's WER scoring hits the +# pure-Python fallback and raises NotImplementedError on Levenshtein.editops. +uv pip install --reinstall rapidfuzz +``` + +> Verify the venv works: +> ```bash +> python -c " +> from transformers import Qwen2AudioForConditionalGeneration +> from rapidfuzz.distance import Levenshtein +> Levenshtein.editops('a', 'b') # must not raise +> print('audiobench venv OK') +> " +> ``` + +### Dataset pre-download + +No manual steps required. `schedule-eval` pre-downloads every +`AudioLLMs/*` HF repo referenced by the requested task group on the login +node via `huggingface_hub.snapshot_download(max_workers=2)`, so compute +nodes do not need internet access. + +## Running + +### Available task groups + +| Task group | Leaves | What it covers | +|----------------------------------|--------|-----------------------------------------------------------------| +| `audio-audiobench` | 27 | Full suite (everything below). | +| `audio-audiobench-asr` | 15 | WER tasks — 9 new + 6 dual-registered with lmms-eval. | +| `audio-audiobench-st` | 6 | BLEU speech-translation — 5 new + 1 dual (en→zh). | +| `audio-audiobench-reasoning` | 6 | Spoken-MQA × 4, MMAU mini, AudioCaps METEOR. | + +### Example + +```bash +# Full AudioBench suite on a Qwen2-Audio model: +oellm schedule-eval \ + --models Qwen/Qwen2-Audio-7B-Instruct \ + --task-groups audio-audiobench \ + --venv-path audiobench-venv + +# ASR only: +oellm schedule-eval \ + --models Qwen/Qwen2-Audio-7B-Instruct \ + --task-groups audio-audiobench-asr \ + --venv-path audiobench-venv + +# Smoke test with --limit: +oellm schedule-eval \ + --models Qwen/Qwen2-Audio-7B-Instruct \ + --task-groups audio-audiobench-asr \ + --limit 100 \ + --venv-path audiobench-venv +``` + +`--limit N` is forwarded to AudioBench's `--number_of_samples N`. When +unset, the full test split is evaluated. 
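+
+Under the hood, each leaf task becomes one AudioBench subprocess executed
+with `cwd` set to `$AUDIOBENCH_DIR`. As a sketch reconstructed from
+`suite.py` (you never invoke this yourself), the `librispeech_test_clean`
+leaf of the ASR smoke test above boils down to roughly:
+
+```bash
+cd "$AUDIOBENCH_DIR"
+python src/main_evaluate.py \
+    --dataset_name librispeech_test_clean \
+    --model_name Qwen2-Audio-7B-Instruct \
+    --metrics wer \
+    --overwrite True \
+    --number_of_samples 100
+```
+
+AudioBench writes its score file under `log_for_all_models/` inside the
+clone; the plugin then re-shapes it into the platform's result JSON at the
+job's output path.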
+ +### Collecting results + +```bash +oellm collect-results \ + --eval-output-dir /path/to/evals \ + --output-csv audiobench_results.csv +``` + +The primary metric per task is what's registered in `task_metrics` +(`wer` / `bleu` / `accuracy` / `string_match` / `meteor`). Dual-registered +tasks land in the CSV **alongside** their lmms-eval counterparts, with +different `task_path` values (`audiobench_librispeech_test_clean` vs +`librispeech_test_clean`) and different `eval_suite` values (`audiobench` +vs `lmms_eval`) — no silent averaging. + +## Supported model adapters + +AudioBench dispatches on a fixed list of literal `model_name` strings +(see `$AUDIOBENCH_DIR/src/model.py`); each loader under `model_src/` +fetches its own HF repo. Arbitrary HF checkpoints are not supported — +only the variants below: + +| Model path substring (lowered) | AudioBench `model_name` (literal) | +|------------------------------------------------|-------------------------------------------| +| `qwen2-audio-7b-instruct` / `qwen2_audio_7b_instruct` | `Qwen2-Audio-7B-Instruct` | +| `qwen-audio-chat` / `qwen_audio_chat` | `Qwen-Audio-Chat` | +| `salmonn` | `SALMONN_7B` | +| `meralion-audiollm` / `meralion_audiollm` | `MERaLiON-AudioLLM-Whisper-SEA-LION` | +| `whisper-large-v3` / `whisper_large_v3` | `whisper_large_v3` | +| `whisper-large-v2` / `whisper_large_v2` | `whisper_large_v2` | +| `phi-4-multimodal` / `phi_4_multimodal` | `phi_4_multimodal_instruct` | +| `seallms-audio-7b` / `seallms_audio_7b` | `seallms_audio_7b` | +| `wavllm` | `WavLLM_fairseq` | +| (anything else) | error — no generic loader upstream | + +To override detection, pass the literal AudioBench key as a suffix: +`audiobench:Qwen2-Audio-7B-Instruct`. Case is preserved end-to-end +(AudioBench's match is case-sensitive). + diff --git a/oellm/contrib/audiobench/__init__.py b/oellm/contrib/audiobench/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/oellm/contrib/audiobench/adapter.py b/oellm/contrib/audiobench/adapter.py new file mode 100644 index 0000000..66a9655 --- /dev/null +++ b/oellm/contrib/audiobench/adapter.py @@ -0,0 +1,69 @@ +"""AudioBench model adapter. + +Maps a HuggingFace model path to AudioBench's literal ``--model_name`` value. + +AudioBench's ``Model`` class (in ``$AUDIOBENCH_DIR/src/model.py``) dispatches +on **exact-string** match against a fixed list — there is no family-level +indirection and no fallback. Each supported model has a hardcoded loader +under ``model_src/`` that loads its own HF repo internally; AudioBench +**cannot evaluate arbitrary HF checkpoints**, only the variants it knows +about. If we can't map the user's ``model_path`` to one of those literals, +we return ``None`` and ``suite.run`` raises a clear error. +""" + +from __future__ import annotations + +from oellm.core.base_model_adapter import BaseModelAdapter + +# (audiobench_model_name, substrings_to_match_in_lower(model_path)). +# Order matters — first match wins; put more-specific patterns first. +# Keys MUST be the exact literals AudioBench's model.py dispatch expects. 
+_PATTERNS: list[tuple[str, tuple[str, ...]]] = [ + ("Qwen2-Audio-7B-Instruct", ("qwen2-audio-7b-instruct", "qwen2_audio_7b_instruct")), + ("Qwen-Audio-Chat", ("qwen-audio-chat", "qwen_audio_chat")), + ("SALMONN_7B", ("salmonn",)), + ("MERaLiON-AudioLLM-Whisper-SEA-LION", ("meralion-audiollm", "meralion_audiollm")), + ("whisper_large_v3", ("whisper-large-v3", "whisper_large_v3")), + ("whisper_large_v2", ("whisper-large-v2", "whisper_large_v2")), + ("phi_4_multimodal_instruct", ("phi-4-multimodal", "phi_4_multimodal")), + ("seallms_audio_7b", ("seallms-audio-7b", "seallms_audio_7b")), + ("WavLLM_fairseq", ("wavllm",)), +] + + +class AudioBenchModelAdapter(BaseModelAdapter): + """Adapter resolving the ``--model_name`` value for the AudioBench subprocess.""" + + def __init__(self, model_path: str) -> None: + self._path = model_path + + @property + def model_path(self) -> str: + return self._path + + def to_lm_eval_args(self) -> str: + # Unused — AudioBench doesn't route through lm-eval. Required by + # BaseModelAdapter. + return f"pretrained={self._path},trust_remote_code=True" + + def to_lmms_eval_args(self) -> str: + # Unused — see to_lm_eval_args(). + return f"pretrained={self._path}" + + def to_contrib_flags(self) -> str | None: + """Return AudioBench's ``model_name`` dispatch key, or ``None`` if no match. + + Returning ``None`` is intentional: AudioBench has no generic loader, + so an unmatched model path must fail loudly rather than fall through + to a fictitious ``generic`` key that AudioBench doesn't recognize. + """ + lowered = self._path.lower() + for key, needles in _PATTERNS: + if any(n in lowered for n in needles): + return key + return None + + +def detect_audiobench_model_type(model_path: str) -> str | None: + """Convenience wrapper around :meth:`AudioBenchModelAdapter.to_contrib_flags`.""" + return AudioBenchModelAdapter(model_path).to_contrib_flags() diff --git a/oellm/contrib/audiobench/suite.py b/oellm/contrib/audiobench/suite.py new file mode 100644 index 0000000..7ec28e4 --- /dev/null +++ b/oellm/contrib/audiobench/suite.py @@ -0,0 +1,320 @@ +"""AudioBench contrib suite — plugin protocol implementation. + +AudioBench is not pip-installable (upstream has no build backend and uses +bare imports like ``from dataset import ...``), so :func:`run` invokes its +``src/main_evaluate.py`` entry point as a subprocess with ``cwd`` set to +``$AUDIOBENCH_DIR``. :func:`run` then re-shapes AudioBench's result JSON +into a lmms-eval-compatible payload that :func:`oellm.main.collect_results` +can parse unchanged. +""" + +from __future__ import annotations + +import json +import logging +import os +import subprocess +from pathlib import Path + +from oellm.contrib.audiobench.task import ( + AUDIOBENCH_TASKS, + SUITE_NAME, + AudioBenchTaskSpec, + get_task_spec, +) + +logger = logging.getLogger(__name__) + +CLUSTER_ENV_VARS = ["AUDIOBENCH_DIR"] + +_FAMILY_GROUPS = { + "asr": ( + "audio-audiobench-asr", + "AudioBench ASR tasks (WER).", + ), + "st": ( + "audio-audiobench-st", + "AudioBench speech-translation tasks (BLEU).", + ), + "reasoning": ( + "audio-audiobench-reasoning", + "AudioBench spoken reasoning / captioning (accuracy / string_match / METEOR).", + ), +} + +_TOP_LEVEL_GROUP = "audio-audiobench" +_TOP_LEVEL_DESC = ( + "AudioBench suite — ASR (WER), speech translation (BLEU), spoken " + "reasoning (accuracy/string_match), and AudioCaps captioning (METEOR)." +) + + +def _build_task_groups() -> dict: + """Build ``TASK_GROUPS`` from :data:`AUDIOBENCH_TASKS`. 
+ + Always zero-shot — AudioBench does not support in-context examples. + """ + task_metrics: dict[str, str] = {t.name: t.metric for t in AUDIOBENCH_TASKS} + + def _task_entry(t: AudioBenchTaskSpec) -> dict: + # No ``subset`` — for gigaspeech2 / spoken-mqa the upstream split + # selection is encoded in ``upstream_name`` itself (e.g. + # ``gigaspeech2_thai``). The ``audio-*`` group prefix triggers + # full-repo snapshot_download in :func:`_collect_dataset_specs`. + return {"task": t.name, "dataset": t.hf_repo} + + groups: dict[str, dict] = {} + + tasks_by_family: dict[str, list[AudioBenchTaskSpec]] = { + "asr": [], + "st": [], + "reasoning": [], + } + for t in AUDIOBENCH_TASKS: + tasks_by_family[t.family].append(t) + + for family, (group_name, desc) in _FAMILY_GROUPS.items(): + entries = tasks_by_family[family] + if not entries: + continue + groups[group_name] = { + "suite": SUITE_NAME, + "n_shots": [0], + "description": desc, + "tasks": [_task_entry(t) for t in entries], + } + + groups[_TOP_LEVEL_GROUP] = { + "suite": SUITE_NAME, + "n_shots": [0], + "description": _TOP_LEVEL_DESC, + "tasks": [_task_entry(t) for t in AUDIOBENCH_TASKS], + } + + return {"task_metrics": task_metrics, "task_groups": groups} + + +TASK_GROUPS: dict = _build_task_groups() + + +def detect_model_flags(model_path: str) -> str | None: + """Return AudioBench's literal ``--model_name`` dispatch key for *model_path*. + + Returns ``None`` when *model_path* does not match any AudioBench-supported + model family — :func:`run` then raises a clear error. AudioBench has no + generic loader, so silently falling back to a fictitious key would just + move the error deeper inside the subprocess. + """ + from oellm.contrib.audiobench.adapter import AudioBenchModelAdapter + + return AudioBenchModelAdapter(model_path).to_contrib_flags() + + +def run( + *, + model_path: str, + task: str, + n_shot: int, + output_path: Path, + model_flags: str | None, + env: dict[str, str], +) -> None: + """Execute one AudioBench task and write a lmms-eval-shaped result JSON. + + Raises ``RuntimeError`` if AudioBench exits non-zero or produces no + parseable output, and ``KeyError`` if *task* is not registered. + """ + ab_dir = env.get("AUDIOBENCH_DIR") + if not ab_dir: + raise RuntimeError( + "AUDIOBENCH_DIR must be set. Add it to clusters.yaml — " + "it should point at a local clone of " + "https://github.com/AudioLLMs/AudioBench." + ) + + entrypoint = Path(ab_dir) / "src" / "main_evaluate.py" + if not entrypoint.exists(): + raise FileNotFoundError( + f"AudioBench entry point not found: {entrypoint}\n" + f"Check that AUDIOBENCH_DIR={ab_dir!r} points at a valid " + "AudioBench clone." + ) + + spec = get_task_spec(task) + if not model_flags: + raise RuntimeError( + f"Could not map model_path={model_path!r} to an AudioBench-supported " + f"model. AudioBench dispatches on a fixed list of literal " + f"model_name strings (Qwen2-Audio-7B-Instruct, SALMONN_7B, " + f"whisper_large_v3, …) — see oellm/contrib/audiobench/adapter.py. " + f"AudioBench cannot evaluate arbitrary HF checkpoints; it loads " + f"its own hardcoded HF repos per model family." + ) + model_key = model_flags # AudioBench's dispatch key, e.g. "Qwen2-Audio-7B-Instruct" + + cmd = [ + "python", + "src/main_evaluate.py", + "--dataset_name", + spec.upstream_name, + "--model_name", + model_key, + "--metrics", + spec.upstream_metric, + # Force re-eval — AudioBench skips by default if a stale score file + # already exists under log_for_all_models/. 
+ "--overwrite", + "True", + ] + + limit = env.get("LIMIT", "").strip() + if limit: + cmd.extend(["--number_of_samples", str(limit)]) + + logger.info("AudioBench cmd: %s (cwd=%s)", " ".join(cmd), ab_dir) + completed = subprocess.run( + cmd, + cwd=ab_dir, + env=env, + check=False, + ) + if completed.returncode != 0: + raise RuntimeError( + f"AudioBench exited with code {completed.returncode} for " + f"task={task!r} model={model_path!r} (dispatch key={model_key!r})" + ) + + metrics = _extract_metrics( + audiobench_dir=Path(ab_dir), model_key=model_key, spec=spec + ) + _write_lmms_shaped_json( + output_path=output_path, + model_path=model_path, + task_name=task, + n_shot=n_shot, + metrics=metrics, + ) + logger.info("Results written to %s", output_path) + + +def _extract_metrics( + *, + audiobench_dir: Path, + model_key: str, + spec: AudioBenchTaskSpec, +) -> dict[str, float]: + """Read AudioBench's score file from its hardcoded output path. + + AudioBench writes to ``$cwd/log_for_all_models//__score.json`` + (see ``main_evaluate.py:118``). Path is fixed — there is no ``--log_dir``. + """ + score_file = ( + audiobench_dir + / "log_for_all_models" + / model_key + / f"{spec.upstream_name}_{spec.upstream_metric}_score.json" + ) + if not score_file.exists(): + raise RuntimeError( + f"AudioBench did not write expected score file at {score_file}. " + f"Either AudioBench crashed silently, or the dispatch key " + f"{model_key!r} / dataset_name {spec.upstream_name!r} / metric " + f"{spec.upstream_metric!r} is wrong. Check stdout/stderr." + ) + + try: + with open(score_file) as f: + body = json.load(f) + except (json.JSONDecodeError, OSError) as e: + raise RuntimeError( + f"Could not read AudioBench score file {score_file}: {e}" + ) from e + + value = _find_metric(body, spec.upstream_metric) + if value is None: + raise RuntimeError( + f"Could not locate metric {spec.upstream_metric!r} in AudioBench " + f"score file {score_file}. Body: {body!r}" + ) + # Emit under our canonical key so collect_results' metric resolution + # picks up task_metrics.yaml. + return {spec.metric: float(value)} + + +def _find_metric(body: object, key: str) -> float | None: + """Recursive search for a numeric value keyed by *key*. + + Tolerates both ``{"wer": 0.04}`` and ``{"metrics": {"wer": {"score": + 0.04}}}`` layouts — upstream log shape has drifted across releases. + """ + if isinstance(body, dict): + if key in body: + candidate = body[key] + if isinstance(candidate, int | float): + return float(candidate) + if isinstance(candidate, dict) and "score" in candidate: + score = candidate["score"] + if isinstance(score, int | float): + return float(score) + for v in body.values(): + found = _find_metric(v, key) + if found is not None: + return found + elif isinstance(body, list): + for item in body: + found = _find_metric(item, key) + if found is not None: + return found + return None + + +def _write_lmms_shaped_json( + *, + output_path: Path, + model_path: str, + task_name: str, + n_shot: int, + metrics: dict[str, float], +) -> None: + payload = { + "model_name_or_path": model_path, + "results": {task_name: metrics}, + "configs": {task_name: {"num_fewshot": n_shot}}, + } + output_path.parent.mkdir(parents=True, exist_ok=True) + with open(output_path, "w") as f: + json.dump(payload, f, indent=2) + + +def parse_results(data: dict) -> tuple[str, str, int, dict[str, float]] | None: + """Recognise a JSON dict produced by :func:`run` and return + ``(model_id, task_name, n_shot, metrics)``; ``None`` if it's not ours. 
+ """ + results = data.get("results", {}) + if not isinstance(results, dict): + return None + for task_name, task_results in results.items(): + if not isinstance(task_name, str) or not task_name.startswith("audiobench_"): + continue + if not isinstance(task_results, dict): + continue + model_id = data.get("model_name_or_path") or data.get("model_name") or "unknown" + n_shot = data.get("configs", {}).get(task_name, {}).get("num_fewshot", 0) + coerced: dict[str, float] = {} + for k, v in task_results.items(): + if isinstance(v, int | float): + coerced[k] = float(v) + return model_id, task_name, int(n_shot), coerced + return None + + +__all__ = [ + "CLUSTER_ENV_VARS", + "SUITE_NAME", + "TASK_GROUPS", + "detect_model_flags", + "parse_results", + "run", +] + +_ = os # exported via env dict passed to subprocess.run diff --git a/oellm/contrib/audiobench/task.py b/oellm/contrib/audiobench/task.py new file mode 100644 index 0000000..3209888 --- /dev/null +++ b/oellm/contrib/audiobench/task.py @@ -0,0 +1,192 @@ +"""AudioBench task registry. + +Single source of truth for the task set. Consumed by +:mod:`oellm.contrib.audiobench.suite` to build ``TASK_GROUPS`` and to look up +per-task metadata (HF repo, upstream task name, metric) at dispatch time. + +Every canonical task name is prefixed ``audiobench_`` so the CSV ``task_path`` +column uniquely identifies the scorer and doesn't collide with lmms-eval's +names for the same benchmark. +""" + +from __future__ import annotations + +from dataclasses import dataclass + +SUITE_NAME = "audiobench" +_TASK_NAME_PREFIX = "audiobench_" + + +@dataclass(frozen=True) +class AudioBenchTaskSpec: + """Metadata for a single AudioBench task. + + ``upstream_name`` is the literal string AudioBench's ``--dataset_name`` + expects (matched exactly against ``$AUDIOBENCH_DIR/src/dataset.py``'s + dispatch table). ``upstream_metric`` is what ``--metrics`` expects + (usually identical to our canonical ``metric``). + """ + + name: str + upstream_name: str + hf_repo: str + metric: str + upstream_metric: str + family: str + + @property + def task_group(self) -> str: + return f"audio-audiobench-{self.family}" + + +def _t( + upstream_name: str, + hf_repo: str, + metric: str, + family: str, + *, + upstream_metric: str | None = None, + name: str | None = None, +) -> AudioBenchTaskSpec: + """Build a spec with ``name = audiobench_`` by default.""" + return AudioBenchTaskSpec( + name=name if name is not None else _TASK_NAME_PREFIX + upstream_name, + upstream_name=upstream_name, + hf_repo=hf_repo, + metric=metric, + upstream_metric=upstream_metric or metric, + family=family, + ) + + +# Tasks not covered by our lmms-eval task groups. +_NEW_ASR = [ + _t("aishell_asr_zh_test", "AudioLLMs/aishell_1_zh_test", "wer", "asr"), + _t("earnings21_test", "AudioLLMs/earnings21_test", "wer", "asr"), + _t("earnings22_test", "AudioLLMs/earnings22_test", "wer", "asr"), + _t("tedlium3_long_form_test", "AudioLLMs/tedlium3_long_form_test", "wer", "asr"), + # GigaSpeech2 — 3 languages share one HF repo. AudioBench dispatches via + # the dataset_name string itself (gigaspeech2_thai/indo/viet), not via a + # --data_dir flag (which doesn't exist upstream). 
+ _t( + "gigaspeech2_thai", + "AudioLLMs/gigaspeech2-test", + "wer", + "asr", + name="audiobench_gigaspeech2_thai", + ), + _t( + "gigaspeech2_indo", + "AudioLLMs/gigaspeech2-test", + "wer", + "asr", + name="audiobench_gigaspeech2_indo", + ), + _t( + "gigaspeech2_viet", + "AudioLLMs/gigaspeech2-test", + "wer", + "asr", + name="audiobench_gigaspeech2_viet", + ), + _t("seame_dev_man", "AudioLLMs/seame_dev_man", "wer", "asr"), + _t("seame_dev_sge", "AudioLLMs/seame_dev_sge", "wer", "asr"), +] + +_NEW_ST = [ + _t("covost2_en_id_test", "AudioLLMs/covost2_en_id_test", "bleu", "st"), + _t("covost2_en_ta_test", "AudioLLMs/covost2_en_ta_test", "bleu", "st"), + _t("covost2_id_en_test", "AudioLLMs/covost2_id_en_test", "bleu", "st"), + _t("covost2_zh_en_test", "AudioLLMs/covost2_zh_en_test", "bleu", "st"), + _t("covost2_ta_en_test", "AudioLLMs/covost2_ta_en_test", "bleu", "st"), +] + +_NEW_REASONING = [ + # Spoken-MQA — 4 splits share one HF repo. AudioBench dispatches via + # the hyphen-prefixed dataset_name (spoken-mqa_), not --data_dir. + _t( + "spoken-mqa_short_digit", + "amao0o0/spoken-mqa", + "accuracy", + "reasoning", + upstream_metric="acc", + name="audiobench_spoken_mqa_short_digit", + ), + _t( + "spoken-mqa_long_digit", + "amao0o0/spoken-mqa", + "accuracy", + "reasoning", + upstream_metric="acc", + name="audiobench_spoken_mqa_long_digit", + ), + _t( + "spoken-mqa_single_step_reasoning", + "amao0o0/spoken-mqa", + "accuracy", + "reasoning", + upstream_metric="acc", + name="audiobench_spoken_mqa_single_step_reasoning", + ), + _t( + "spoken-mqa_multi_step_reasoning", + "amao0o0/spoken-mqa", + "accuracy", + "reasoning", + upstream_metric="acc", + name="audiobench_spoken_mqa_multi_step_reasoning", + ), + _t("mmau_mini", "AudioLLMs/MMAU-mini", "string_match", "reasoning"), + _t("audiocaps_test", "AudioLLMs/audiocaps_test", "meteor", "reasoning"), +] + +# Dual-registered duplicates of benchmarks also in lmms-eval. These use +# AudioBench's scorer/normaliser for paper-comparable numbers; the lmms-eval +# versions stay in place. HF repos differ (AudioLLMs/* vs lmms-lab/*) so +# snapshot_download does not collide. 
+_DUAL = [ + _t("librispeech_test_clean", "AudioLLMs/librispeech_test_clean", "wer", "asr"), + _t("librispeech_test_other", "AudioLLMs/librispeech_test_other", "wer", "asr"), + _t("common_voice_15_en_test", "AudioLLMs/common_voice_15_en_test", "wer", "asr"), + _t("gigaspeech_test", "AudioLLMs/gigaspeech_test", "wer", "asr"), + _t("peoples_speech_test", "AudioLLMs/peoples_speech_test", "wer", "asr"), + _t("tedlium3_test", "AudioLLMs/tedlium3_test", "wer", "asr"), + _t("covost2_en_zh_test", "AudioLLMs/covost2_en_zh_test", "bleu", "st"), +] + + +AUDIOBENCH_TASKS: list[AudioBenchTaskSpec] = [ + *_NEW_ASR, + *_NEW_ST, + *_NEW_REASONING, + *_DUAL, +] + + +def _validate() -> None: + seen_names: set[str] = set() + for t in AUDIOBENCH_TASKS: + if t.name in seen_names: + raise RuntimeError(f"Duplicate AudioBench task name {t.name!r} in registry") + seen_names.add(t.name) + if not t.name.startswith(_TASK_NAME_PREFIX): + raise RuntimeError( + f"AudioBench task {t.name!r} missing required prefix " + f"{_TASK_NAME_PREFIX!r}" + ) + if t.family not in {"asr", "st", "reasoning"}: + raise RuntimeError( + f"AudioBench task {t.name!r} has unknown family {t.family!r}" + ) + + +_validate() + + +def get_task_spec(name: str) -> AudioBenchTaskSpec: + """Look up a spec by canonical task name; raises ``KeyError`` if missing.""" + for t in AUDIOBENCH_TASKS: + if t.name == name: + return t + known = sorted(t.name for t in AUDIOBENCH_TASKS) + raise KeyError(f"Unknown AudioBench task {name!r}. Known tasks: {', '.join(known)}") diff --git a/oellm/contrib/regiondial_bench/README.md b/oellm/contrib/regiondial_bench/README.md index b540d6e..57f646e 100644 --- a/oellm/contrib/regiondial_bench/README.md +++ b/oellm/contrib/regiondial_bench/README.md @@ -16,17 +16,17 @@ plus per-round breakdown (R1–R7) for gIoU and bbox_AP. ## Prerequisites -### 1. Clone RegionReasoner +The benchmark calls `test/evaluation/evaluation_multi_segmentation.py` and +the `test/vision_reasoner/` model wrapper from the RegionReasoner +repository as a subprocess, so the repo must be present on the cluster +filesystem. A dedicated venv is required for `flash-attn` (specific +pre-built wheel) and HEIF image support (`pi-heif`); see +[`docs/VENV.md`](../../../docs/VENV.md) for the framework venvs. -The benchmark relies on the inference script -`test/evaluation/evaluation_multi_segmentation.py` and the model wrapper -`test/vision_reasoner/` from the RegionReasoner repository. These are **not -packaged** — the platform calls them directly as a subprocess, so the repo -must be present on the cluster filesystem. +### 1. Clone RegionReasoner ```bash -git clone https://github.com/lmsdss/RegionReasoner \ - /path/to/RegionReasoner +git clone https://github.com/lmsdss/RegionReasoner /path/to/RegionReasoner ``` ### 2. Configure clusters.yaml @@ -38,40 +38,41 @@ my-cluster: ... HF_HOME: "/path/to/large/filesystem/huggingface" # must have ~30 GB free REGION_REASONER_DIR: "/path/to/RegionReasoner" - GPUS_PER_NODE: 4 # controls both SLURM --gres and shard count + GPUS_PER_NODE: 4 # controls SLURM --gres and shard count ``` -> **`HF_HOME`** must point to a filesystem with at least **30 GB** of free -> space. On CINECA Leonardo, use the work filesystem -> (`/leonardo_work//huggingface`), not the home filesystem (50 GB -> quota, fills up quickly). +> `HF_HOME` must point to a filesystem with at least 30 GB free. On +> CINECA Leonardo, use the work filesystem +> (`/leonardo_work//huggingface`), not the home filesystem +> (50 GB quota). -### 3. 
Install dependencies in your venv +### 3. Create a venv and install dependencies ```bash -# PyTorch — match the CUDA version available on your cluster -pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121 +uv venv --python 3.12 regiondial-venv +source regiondial-venv/bin/activate +uv pip install -e . -# Matching torchvision -pip install torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121 +# PyTorch — match the cluster's CUDA driver (cu121 for driver supporting CUDA 12.2) +uv pip install torch==2.5.1 torchvision==0.20.1 \ + --index-url https://download.pytorch.org/whl/cu121 -# flash-attn pre-built wheel (no compilation needed) +# flash-attn pre-built wheel (Python 3.12 / CUDA 12.x / torch 2.5.1) wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl -pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl +uv pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl # HEIF image support -pip install pi-heif +uv pip install pi-heif ``` -> **flash-attn note:** The pre-built wheel above is for Python 3.12, CUDA 12.x, -> torch 2.5.1. If your configuration differs, find the matching wheel at -> https://github.com/Dao-AILab/flash-attention/releases +> If your Python / CUDA / torch combination differs, find the matching +> flash-attn wheel at +> . -### 4. What gets auto-downloaded +### What gets auto-downloaded -When you run `oellm schedule-eval`, the platform automatically pre-downloads -the following on the login node (before SLURM submission, so compute nodes do -not need internet access): +`oellm schedule-eval` pre-downloads the following on the login node so +compute nodes do not need internet access: | Asset | HF repo | Size | |---|---|---| diff --git a/oellm/resources/template.sbatch b/oellm/resources/template.sbatch index 42900f9..03a20ee 100644 --- a/oellm/resources/template.sbatch +++ b/oellm/resources/template.sbatch @@ -13,7 +13,10 @@ CSV_PATH="{csv_path}" NUM_JOBS={num_jobs} TOTAL_EVALS={total_evals} -LIMIT="{limit}" +# Exported so contrib suite plugins (which spawn their own Python subprocesses +# via oellm.contrib.dispatch) can read it from os.environ. Built-in suites +# below still interpolate $LIMIT directly into their CLI flags. +export LIMIT="{limit}" VENV_PATH="{venv_path}" LM_EVAL_INCLUDE_PATH="{lm_eval_include_path}" diff --git a/oellm/scheduler.py b/oellm/scheduler.py index af203c9..61a80dd 100644 --- a/oellm/scheduler.py +++ b/oellm/scheduler.py @@ -306,7 +306,16 @@ def schedule_evals( logging.warning("No evaluation jobs to schedule.") return None - df["eval_suite"] = df["eval_suite"].str.lower() + # Lowercase the suite name only, preserve any ``:model_flags`` suffix + # verbatim — contrib dispatch keys can be case-sensitive (e.g. + # AudioBench's ``Qwen2-Audio-7B-Instruct`` is matched literally). + def _lower_suite_only(s: str) -> str: + if ":" in s: + head, tail = s.split(":", 1) + return f"{head.lower()}:{tail}" + return s.lower() + + df["eval_suite"] = df["eval_suite"].map(_lower_suite_only) # Ensure that all datasets required by the tasks are cached locally to avoid # network access on compute nodes. diff --git a/pyproject.toml b/pyproject.toml index e866264..8dcdb8f 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -33,6 +33,19 @@ audio = [ "librosa", "jiwer", ] +# AudioBench contrib plugin. 
AudioBench itself is not pip-installable +# (no build backend upstream, bare imports), so AUDIOBENCH_DIR in +# clusters.yaml points at a local git clone and suite.py subprocesses into +# ``python src/main_evaluate.py``. These are our post-processing deps. +audiobench = [ + "jiwer<3", # AudioBench uses jiwer.compute_measures, removed in 3.0 + "transformers>=4.45,<5", # AudioBench's Qwen2-Audio loader uses the v4 processor API (`audios=` kwarg); v5 silently drops audio inputs and produces garbage predictions + "sacrebleu", # BLEU verification (covost2) + "pythainlp", # Thai tokenisation for gigaspeech2_thai + "evaluate", # MMAU / METEOR post-processing + "soundfile", + "librosa", +] [project.scripts] oellm = "oellm.main:main" diff --git a/tests/test_audiobench.py b/tests/test_audiobench.py new file mode 100644 index 0000000..d7617eb --- /dev/null +++ b/tests/test_audiobench.py @@ -0,0 +1,946 @@ +"""Tests for the AudioBench contrib benchmark integration. + +The shape of these tests reflects AudioBench's actual upstream API +(``$AUDIOBENCH_DIR/src/main_evaluate.py``), which we discovered while +debugging the first cluster smoke test: + +* ``main()`` accepts only ``dataset_name`` / ``model_name`` / ``metrics`` / + ``overwrite`` / ``number_of_samples`` — no ``--model``, no ``--log_dir``, + no ``--data_dir``. +* ``Model.__init__`` and ``Dataset.load_dataset`` dispatch on **exact** + string match against fixed lists; AudioBench cannot evaluate arbitrary + HF checkpoints (only the variants whose loaders ship under + ``model_src/``), and split selection happens via the dataset_name itself + (``gigaspeech2_thai``, ``spoken-mqa_short_digit``) — there is no + ``--data_dir`` flag. +* AudioBench writes scores to the hardcoded path + ``$cwd/log_for_all_models//__score.json``. +* Without ``--overwrite True`` AudioBench skips evaluation when a stale + score file exists, so :func:`oellm.contrib.audiobench.suite.run` always + passes that flag. +""" + +from __future__ import annotations + +import json +import os +import sys +from pathlib import Path +from unittest.mock import patch + +import pytest + +from oellm.task_groups import ( + _collect_dataset_specs, + _expand_task_groups, + get_all_task_group_names, +) + +SUITE = "audiobench" +TOP_GROUP = "audio-audiobench" +ASR_GROUP = "audio-audiobench-asr" +ST_GROUP = "audio-audiobench-st" +REASONING_GROUP = "audio-audiobench-reasoning" + +# Canonical task names that MUST be in the registry. A silent rename +# breaks the build. 
+NEW_TASKS = { + # ASR (9) + "audiobench_aishell_asr_zh_test", + "audiobench_earnings21_test", + "audiobench_earnings22_test", + "audiobench_tedlium3_long_form_test", + "audiobench_gigaspeech2_thai", + "audiobench_gigaspeech2_indo", + "audiobench_gigaspeech2_viet", + "audiobench_seame_dev_man", + "audiobench_seame_dev_sge", + # ST (5) + "audiobench_covost2_en_id_test", + "audiobench_covost2_en_ta_test", + "audiobench_covost2_id_en_test", + "audiobench_covost2_zh_en_test", + "audiobench_covost2_ta_en_test", + # Reasoning (6) + "audiobench_spoken_mqa_short_digit", + "audiobench_spoken_mqa_long_digit", + "audiobench_spoken_mqa_single_step_reasoning", + "audiobench_spoken_mqa_multi_step_reasoning", + "audiobench_mmau_mini", + "audiobench_audiocaps_test", +} + +DUAL_TASKS = { + "audiobench_librispeech_test_clean", + "audiobench_librispeech_test_other", + "audiobench_common_voice_15_en_test", + "audiobench_gigaspeech_test", + "audiobench_peoples_speech_test", + "audiobench_tedlium3_test", + "audiobench_covost2_en_zh_test", +} + +ALL_PHASE1_TASKS = NEW_TASKS | DUAL_TASKS + + +# --------------------------------------------------------------------------- +# Registry — task.py +# --------------------------------------------------------------------------- + + +class TestTaskRegistry: + def test_registry_has_exactly_27_tasks(self): + from oellm.contrib.audiobench.task import AUDIOBENCH_TASKS + + assert len(AUDIOBENCH_TASKS) == 27 + + def test_registry_covers_all_phase1_task_names(self): + from oellm.contrib.audiobench.task import AUDIOBENCH_TASKS + + names = {t.name for t in AUDIOBENCH_TASKS} + assert names == ALL_PHASE1_TASKS + + def test_every_task_has_audiobench_prefix(self): + from oellm.contrib.audiobench.task import AUDIOBENCH_TASKS + + for t in AUDIOBENCH_TASKS: + assert t.name.startswith("audiobench_"), t.name + + def test_every_task_has_audiollms_or_amao_hf_repo(self): + from oellm.contrib.audiobench.task import AUDIOBENCH_TASKS + + for t in AUDIOBENCH_TASKS: + assert t.hf_repo.startswith(("AudioLLMs/", "amao0o0/")), ( + f"{t.name} has unexpected repo {t.hf_repo}" + ) + + def test_asr_tasks_all_use_wer(self): + from oellm.contrib.audiobench.task import AUDIOBENCH_TASKS + + for t in AUDIOBENCH_TASKS: + if t.family == "asr": + assert t.metric == "wer", f"{t.name}: {t.metric}" + + def test_st_tasks_all_use_bleu(self): + from oellm.contrib.audiobench.task import AUDIOBENCH_TASKS + + for t in AUDIOBENCH_TASKS: + if t.family == "st": + assert t.metric == "bleu", f"{t.name}: {t.metric}" + + def test_gigaspeech2_tasks_use_per_split_upstream_name(self): + """All 3 GigaSpeech2 tasks share one HF repo, but AudioBench's + ``--dataset_name`` dispatch keys are the split-suffixed forms + (``gigaspeech2_thai``/``_indo``/``_viet``) — there is no + ``--data_dir`` flag. + """ + from oellm.contrib.audiobench.task import AUDIOBENCH_TASKS + + gs2 = [t for t in AUDIOBENCH_TASKS if "gigaspeech2" in t.name] + assert len(gs2) == 3 + assert {t.hf_repo for t in gs2} == {"AudioLLMs/gigaspeech2-test"} + assert {t.upstream_name for t in gs2} == { + "gigaspeech2_thai", + "gigaspeech2_indo", + "gigaspeech2_viet", + } + + def test_spoken_mqa_tasks_use_per_split_upstream_name(self): + """All 4 spoken-mqa tasks share one HF repo; AudioBench dispatches + via the hyphen-prefixed split-suffixed dataset_name + (``spoken-mqa_``). 
+ """ + from oellm.contrib.audiobench.task import AUDIOBENCH_TASKS + + smqa = [t for t in AUDIOBENCH_TASKS if "spoken_mqa" in t.name] + assert len(smqa) == 4 + assert {t.hf_repo for t in smqa} == {"amao0o0/spoken-mqa"} + assert {t.upstream_name for t in smqa} == { + "spoken-mqa_short_digit", + "spoken-mqa_long_digit", + "spoken-mqa_single_step_reasoning", + "spoken-mqa_multi_step_reasoning", + } + + def test_spoken_mqa_uses_acc_metric_upstream(self): + """Upstream metric for spoken-mqa is ``acc`` (the canonical key + we expose externally is ``accuracy``). + """ + from oellm.contrib.audiobench.task import AUDIOBENCH_TASKS + + for t in AUDIOBENCH_TASKS: + if "spoken_mqa" in t.name: + assert t.metric == "accuracy" + assert t.upstream_metric == "acc" + + def test_get_task_spec_returns_spec(self): + from oellm.contrib.audiobench.task import get_task_spec + + spec = get_task_spec("audiobench_librispeech_test_clean") + assert spec.upstream_name == "librispeech_test_clean" + assert spec.metric == "wer" + assert spec.family == "asr" + + def test_get_task_spec_unknown_raises(self): + from oellm.contrib.audiobench.task import get_task_spec + + with pytest.raises(KeyError, match="Unknown AudioBench task"): + get_task_spec("audiobench_does_not_exist") + + def test_no_task_spec_carries_data_dir_attribute(self): + """``data_dir`` was removed once we discovered AudioBench has no + such flag; guard against accidental reintroduction. + """ + from oellm.contrib.audiobench.task import AUDIOBENCH_TASKS, AudioBenchTaskSpec + + # Field removed entirely from the dataclass. + assert "data_dir" not in AudioBenchTaskSpec.__dataclass_fields__ + for t in AUDIOBENCH_TASKS: + assert not hasattr(t, "data_dir") + + +# --------------------------------------------------------------------------- +# Adapter — adapter.py +# --------------------------------------------------------------------------- + + +class TestAudioBenchModelAdapter: + """Adapter must return AudioBench's literal ``model_name`` dispatch keys. + + Each pattern check is a regression target — AudioBench's ``model.py`` + does ``if self.model_name == "":`` and raises + NotImplementedError on any other value. + """ + + @pytest.fixture + def adapter_cls(self): + from oellm.contrib.audiobench.adapter import AudioBenchModelAdapter + from oellm.core.base_model_adapter import BaseModelAdapter + + return AudioBenchModelAdapter, BaseModelAdapter + + def test_is_base_model_adapter(self, adapter_cls): + cls, base = adapter_cls + assert issubclass(cls, base) + + def test_qwen2_audio_7b_instruct_returns_literal_key(self, adapter_cls): + cls, _ = adapter_cls + # AudioBench dispatches on the literal "Qwen2-Audio-7B-Instruct". + assert ( + cls("Qwen/Qwen2-Audio-7B-Instruct").to_contrib_flags() + == "Qwen2-Audio-7B-Instruct" + ) + + def test_qwen_audio_chat_returns_literal_key(self, adapter_cls): + cls, _ = adapter_cls + assert cls("Qwen/Qwen-Audio-Chat").to_contrib_flags() == "Qwen-Audio-Chat" + + def test_salmonn_returns_salmonn_7b(self, adapter_cls): + cls, _ = adapter_cls + # AudioBench only ships the 7B variant (model_src/salmonn_7b.py). 
+ assert cls("tsinghua/SALMONN-7B").to_contrib_flags() == "SALMONN_7B" + + def test_whisper_large_v3(self, adapter_cls): + cls, _ = adapter_cls + assert cls("openai/whisper-large-v3").to_contrib_flags() == "whisper_large_v3" + + def test_whisper_large_v2(self, adapter_cls): + cls, _ = adapter_cls + assert cls("openai/whisper-large-v2").to_contrib_flags() == "whisper_large_v2" + + def test_meralion_returns_full_literal_key(self, adapter_cls): + cls, _ = adapter_cls + assert ( + cls("MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION").to_contrib_flags() + == "MERaLiON-AudioLLM-Whisper-SEA-LION" + ) + + def test_phi_4_multimodal(self, adapter_cls): + cls, _ = adapter_cls + assert ( + cls("microsoft/Phi-4-multimodal-instruct").to_contrib_flags() + == "phi_4_multimodal_instruct" + ) + + def test_unknown_returns_none(self, adapter_cls): + """AudioBench has no generic loader. Unmatched paths must return + ``None`` so :func:`suite.run` can raise a clear error rather than + falling through to a fictitious dispatch key. + """ + cls, _ = adapter_cls + assert cls("random/unknown-model").to_contrib_flags() is None + + def test_module_level_detect_function(self): + from oellm.contrib.audiobench.adapter import detect_audiobench_model_type + + assert ( + detect_audiobench_model_type("Qwen/Qwen2-Audio-7B-Instruct") + == "Qwen2-Audio-7B-Instruct" + ) + assert detect_audiobench_model_type("completely/unknown") is None + + +# --------------------------------------------------------------------------- +# Suite plugin protocol — suite.py +# --------------------------------------------------------------------------- + + +class TestSuiteProtocol: + @pytest.fixture + def suite(self): + import oellm.contrib.audiobench.suite as s + + return s + + def test_suite_name(self, suite): + assert suite.SUITE_NAME == "audiobench" + + def test_cluster_env_vars_declared(self, suite): + assert "AUDIOBENCH_DIR" in suite.CLUSTER_ENV_VARS + + def test_task_groups_contains_all_four_groups(self, suite): + groups = suite.TASK_GROUPS["task_groups"] + for g in (TOP_GROUP, ASR_GROUP, ST_GROUP, REASONING_GROUP): + assert g in groups, f"{g} missing from TASK_GROUPS" + + def test_top_level_group_has_all_27_tasks(self, suite): + tasks = suite.TASK_GROUPS["task_groups"][TOP_GROUP]["tasks"] + assert len(tasks) == 27 + + def test_task_metrics_present_for_all_leaves(self, suite): + metrics = suite.TASK_GROUPS["task_metrics"] + assert set(metrics.keys()) == ALL_PHASE1_TASKS + + def test_all_groups_are_zero_shot(self, suite): + for name in (TOP_GROUP, ASR_GROUP, ST_GROUP, REASONING_GROUP): + group = suite.TASK_GROUPS["task_groups"][name] + assert group["n_shots"] == [0] + assert group["suite"] == SUITE + + def test_detect_model_flags_qwen2_audio(self, suite): + assert ( + suite.detect_model_flags("Qwen/Qwen2-Audio-7B-Instruct") + == "Qwen2-Audio-7B-Instruct" + ) + + def test_detect_model_flags_unknown_returns_none(self, suite): + assert suite.detect_model_flags("some/unknown-model") is None + + def test_parse_results_recognises_audiobench_json(self, suite): + data = { + "model_name_or_path": "/path/to/model", + "results": { + "audiobench_librispeech_test_clean": {"wer": 0.047}, + }, + "configs": {"audiobench_librispeech_test_clean": {"num_fewshot": 0}}, + } + result = suite.parse_results(data) + assert result is not None + model_id, task_name, n_shot, metrics = result + assert model_id == "/path/to/model" + assert task_name == "audiobench_librispeech_test_clean" + assert n_shot == 0 + assert metrics["wer"] == pytest.approx(0.047) + + def 
test_parse_results_rejects_non_audiobench_json(self, suite): + # lmms-eval style — no audiobench_ prefix. + data = { + "model_name_or_path": "some/model", + "results": {"librispeech_test_clean": {"wer,none": 0.05}}, + "configs": {"librispeech_test_clean": {"num_fewshot": 0}}, + } + assert suite.parse_results(data) is None + + def test_parse_results_empty_returns_none(self, suite): + assert suite.parse_results({}) is None + + def test_parse_results_malformed_returns_none(self, suite): + assert suite.parse_results({"results": "not a dict"}) is None + + +# --------------------------------------------------------------------------- +# TASK_GROUPS integration with core registry. +# --------------------------------------------------------------------------- + + +class TestTaskGroupsIntegration: + def test_groups_registered_via_registry(self): + all_names = get_all_task_group_names() + for g in (TOP_GROUP, ASR_GROUP, ST_GROUP, REASONING_GROUP): + assert g in all_names + + def test_top_group_expands_to_27_zero_shot_tasks(self): + results = _expand_task_groups([TOP_GROUP]) + assert len(results) == 27 + for r in results: + assert r.n_shot == 0 + assert r.suite == SUITE + + def test_top_group_expands_to_expected_task_names(self): + results = _expand_task_groups([TOP_GROUP]) + assert {r.task for r in results} == ALL_PHASE1_TASKS + + def test_asr_group_has_15_leaves(self): + results = _expand_task_groups([ASR_GROUP]) + # 9 new ASR + 6 dual ASR = 15. + assert len(results) == 15 + for r in results: + assert r.suite == SUITE + + def test_st_group_has_6_leaves(self): + results = _expand_task_groups([ST_GROUP]) + # 5 new ST + 1 dual (en→zh) = 6. + assert len(results) == 6 + + def test_reasoning_group_has_6_leaves(self): + results = _expand_task_groups([REASONING_GROUP]) + # 4 spoken-mqa + mmau_mini + audiocaps = 6. + assert len(results) == 6 + + def test_dataset_specs_flag_snapshot_download(self): + # Auto-derived from the ``audio-*`` group-name prefix in + # _collect_dataset_specs. + specs = _collect_dataset_specs([TOP_GROUP]) + assert specs, "No dataset specs returned" + for s in specs: + assert s.needs_snapshot_download, ( + f"DatasetSpec for {s.repo_id} missing needs_snapshot_download=True" + ) + + def test_dataset_specs_dedupe_shared_repos(self): + # gigaspeech2 (3 tasks) → 1 spec; spoken-mqa (4 tasks) → 1 spec. + specs = _collect_dataset_specs([TOP_GROUP]) + repo_ids = [s.repo_id for s in specs] + assert repo_ids.count("AudioLLMs/gigaspeech2-test") == 1 + assert repo_ids.count("amao0o0/spoken-mqa") == 1 + + def test_dataset_specs_contain_audiollms_repos(self): + specs = _collect_dataset_specs([TOP_GROUP]) + repo_ids = {s.repo_id for s in specs} + assert "AudioLLMs/librispeech_test_clean" in repo_ids + assert "AudioLLMs/earnings21_test" in repo_ids + assert "AudioLLMs/MMAU-mini" in repo_ids + assert "amao0o0/spoken-mqa" in repo_ids + + +# --------------------------------------------------------------------------- +# Registry auto-discovery. +# --------------------------------------------------------------------------- + + +class TestRegistryDiscovery: + def test_audiobench_suite_is_auto_discovered(self): + # Clear the _discover() cache so this test doesn't rely on import + # order from earlier tests. 
+ from oellm import registry + + registry._discover.cache_clear() + mod = registry.get_suite("audiobench") + assert mod.SUITE_NAME == "audiobench" + assert hasattr(mod, "run") + assert hasattr(mod, "parse_results") + assert hasattr(mod, "detect_model_flags") + + def test_task_groups_merged_into_registry(self): + from oellm import registry + + registry._discover.cache_clear() + merged = registry.get_all_task_groups() + assert TOP_GROUP in merged["task_groups"] + assert "audiobench_librispeech_test_clean" in merged["task_metrics"] + + +# --------------------------------------------------------------------------- +# EvalRunner — resolve_suite wires audiobench through the adapter. +# --------------------------------------------------------------------------- + + +class TestRunnerIntegration: + def test_resolve_suite_appends_audiobench_dispatch_key(self): + from oellm.constants import EvaluationJob + from oellm.runner import EvalRunner + + runner = EvalRunner() + job = EvaluationJob( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task_path="audiobench_librispeech_test_clean", + n_shot=0, + eval_suite="audiobench", + ) + result = runner.resolve_suite(job) + # AudioBench's literal dispatch key (case-sensitive) must come + # through verbatim so dispatch.py / suite.run get the exact value + # AudioBench's ``Model`` class compares against. + assert result == "audiobench:Qwen2-Audio-7B-Instruct" + + def test_resolve_suite_unknown_model_passes_through_bare(self): + """When the adapter returns ``None`` (no AudioBench-supported + loader for the model path), ``resolve_suite`` keeps the bare + suite name; :func:`suite.run` then raises a clear error at + dispatch time rather than fabricating a fake key. + """ + from oellm.constants import EvaluationJob + from oellm.runner import EvalRunner + + runner = EvalRunner() + job = EvaluationJob( + model_path="some/unknown-model", + task_path="audiobench_mmau_mini", + n_shot=0, + eval_suite="audiobench", + ) + result = runner.resolve_suite(job) + assert result == "audiobench" # bare, no ``:flags`` suffix + + +# --------------------------------------------------------------------------- +# run() subprocess harness — exercise with a mocked subprocess. +# --------------------------------------------------------------------------- + + +class TestRunHarness: + """Exercise suite.run() with a mocked subprocess, verifying both the + CLI it would invoke (matching AudioBench's actual ``main()`` signature) + and that we read the score file from AudioBench's hardcoded output + location. + """ + + def _fake_audiobench_tree(self, tmp_path: Path) -> Path: + """Create a minimal directory tree that looks like an AudioBench clone.""" + ab_dir = tmp_path / "AudioBench" + (ab_dir / "src").mkdir(parents=True) + (ab_dir / "src" / "main_evaluate.py").write_text("# placeholder\n") + return ab_dir + + @staticmethod + def _score_file_path( + ab_dir: Path, model_name: str, dataset: str, metric: str + ) -> Path: + """Mirror suite._extract_metrics' path construction.""" + return ( + ab_dir / "log_for_all_models" / model_name / f"{dataset}_{metric}_score.json" + ) + + def _fake_run_writing_score( + self, ab_dir: Path, *, score_value: float, body_shape: str = "flat" + ): + """Build a fake_run that writes a score file at AudioBench's + hardcoded path, parameterized by the JSON shape we want to test. 
+ """ + + def fake_run(cmd, cwd, env, check): + model_name = cmd[cmd.index("--model_name") + 1] + dataset = cmd[cmd.index("--dataset_name") + 1] + metric = cmd[cmd.index("--metrics") + 1] + score_file = self._score_file_path(Path(cwd), model_name, dataset, metric) + score_file.parent.mkdir(parents=True, exist_ok=True) + if body_shape == "flat": + score_file.write_text(json.dumps({metric: score_value})) + elif body_shape == "nested": + score_file.write_text( + json.dumps({"metrics": {metric: {"score": score_value, "n": 100}}}) + ) + elif body_shape == "missing_metric": + score_file.write_text(json.dumps({"irrelevant": 1})) + elif body_shape == "no_file": + pass # deliberately don't write + else: + raise ValueError(f"unknown body_shape: {body_shape}") + return _FakeCompletedProcess(0) + + return fake_run + + def test_run_missing_audiobench_dir_raises(self, tmp_path): + from oellm.contrib.audiobench.suite import run + + with pytest.raises(RuntimeError, match="AUDIOBENCH_DIR must be set"): + run( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task="audiobench_librispeech_test_clean", + n_shot=0, + output_path=tmp_path / "out.json", + model_flags="Qwen2-Audio-7B-Instruct", + env={}, # no AUDIOBENCH_DIR + ) + + def test_run_missing_entrypoint_raises(self, tmp_path): + from oellm.contrib.audiobench.suite import run + + bad_dir = tmp_path / "not-audiobench" + bad_dir.mkdir() + with pytest.raises(FileNotFoundError, match="AudioBench entry point"): + run( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task="audiobench_librispeech_test_clean", + n_shot=0, + output_path=tmp_path / "out.json", + model_flags="Qwen2-Audio-7B-Instruct", + env={"AUDIOBENCH_DIR": str(bad_dir)}, + ) + + def test_run_unmapped_model_raises(self, tmp_path): + """AudioBench has no generic loader. When ``model_flags`` is + ``None`` (adapter found no match), :func:`run` must fail loudly + rather than invoking AudioBench with a missing/empty model_name. + """ + from oellm.contrib.audiobench.suite import run + + ab_dir = self._fake_audiobench_tree(tmp_path) + with pytest.raises(RuntimeError, match="Could not map model_path"): + run( + model_path="random/unknown-model", + task="audiobench_librispeech_test_clean", + n_shot=0, + output_path=tmp_path / "out.json", + model_flags=None, + env={"AUDIOBENCH_DIR": str(ab_dir)}, + ) + + def test_run_invokes_subprocess_with_expected_cli(self, tmp_path): + from oellm.contrib.audiobench import suite + + ab_dir = self._fake_audiobench_tree(tmp_path) + output_path = tmp_path / "result.json" + + with patch( + "oellm.contrib.audiobench.suite.subprocess.run", + side_effect=self._fake_run_writing_score(ab_dir, score_value=0.063), + ) as mock_sp: + suite.run( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task="audiobench_librispeech_test_clean", + n_shot=0, + output_path=output_path, + model_flags="Qwen2-Audio-7B-Instruct", + env={"AUDIOBENCH_DIR": str(ab_dir), "LIMIT": "100"}, + ) + + assert mock_sp.call_count == 1 + cmd = mock_sp.call_args.args[0] + assert cmd[:2] == ["python", "src/main_evaluate.py"] + + # AudioBench's actual main() signature: dataset_name / model_name + # / metrics / overwrite / number_of_samples. No --model, no + # --log_dir, no --data_dir. 
+ assert cmd[cmd.index("--dataset_name") + 1] == "librispeech_test_clean" + assert cmd[cmd.index("--model_name") + 1] == "Qwen2-Audio-7B-Instruct" + assert cmd[cmd.index("--metrics") + 1] == "wer" + assert cmd[cmd.index("--overwrite") + 1] == "True" + assert cmd[cmd.index("--number_of_samples") + 1] == "100" + + # Flags AudioBench does NOT accept must not be in the cmd. + assert "--model" not in cmd # only --model_name exists upstream + assert "--log_dir" not in cmd # AudioBench writes to a fixed path + assert "--data_dir" not in cmd # split selection is via dataset_name + + # cwd is AUDIOBENCH_DIR so AudioBench's relative writes + # (log_for_all_models/...) land inside the clone. + assert mock_sp.call_args.kwargs["cwd"] == str(ab_dir) + + # Output JSON is lmms-eval-shaped. + body = json.loads(output_path.read_text()) + assert body["model_name_or_path"] == "Qwen/Qwen2-Audio-7B-Instruct" + assert body["results"]["audiobench_librispeech_test_clean"][ + "wer" + ] == pytest.approx(0.063) + assert body["configs"]["audiobench_librispeech_test_clean"]["num_fewshot"] == 0 + + def test_run_uses_per_split_dataset_name_for_gigaspeech2(self, tmp_path): + """GigaSpeech2 splits are dispatched via the dataset_name itself + (``gigaspeech2_thai``), not a ``--data_dir`` flag. + """ + from oellm.contrib.audiobench import suite + + ab_dir = self._fake_audiobench_tree(tmp_path) + output_path = tmp_path / "result.json" + + with patch( + "oellm.contrib.audiobench.suite.subprocess.run", + side_effect=self._fake_run_writing_score(ab_dir, score_value=0.12), + ) as mock_sp: + suite.run( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task="audiobench_gigaspeech2_thai", + n_shot=0, + output_path=output_path, + model_flags="Qwen2-Audio-7B-Instruct", + env={"AUDIOBENCH_DIR": str(ab_dir)}, + ) + + cmd = mock_sp.call_args.args[0] + assert cmd[cmd.index("--dataset_name") + 1] == "gigaspeech2_thai" + assert "--data_dir" not in cmd + + def test_run_omits_number_of_samples_when_limit_empty(self, tmp_path): + from oellm.contrib.audiobench import suite + + ab_dir = self._fake_audiobench_tree(tmp_path) + output_path = tmp_path / "result.json" + + with patch( + "oellm.contrib.audiobench.suite.subprocess.run", + side_effect=self._fake_run_writing_score(ab_dir, score_value=0.1), + ) as mock_sp: + suite.run( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task="audiobench_librispeech_test_clean", + n_shot=0, + output_path=output_path, + model_flags="Qwen2-Audio-7B-Instruct", + env={"AUDIOBENCH_DIR": str(ab_dir), "LIMIT": ""}, + ) + + cmd = mock_sp.call_args.args[0] + assert "--number_of_samples" not in cmd + + def test_run_always_passes_overwrite_true(self, tmp_path): + """AudioBench skips evaluation when a stale score file already + exists unless ``--overwrite True`` is passed; we always pass it + because we do our own deduplication via output_path. 
+ """ + from oellm.contrib.audiobench import suite + + ab_dir = self._fake_audiobench_tree(tmp_path) + output_path = tmp_path / "result.json" + + with patch( + "oellm.contrib.audiobench.suite.subprocess.run", + side_effect=self._fake_run_writing_score(ab_dir, score_value=0.1), + ) as mock_sp: + suite.run( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task="audiobench_librispeech_test_clean", + n_shot=0, + output_path=output_path, + model_flags="Qwen2-Audio-7B-Instruct", + env={"AUDIOBENCH_DIR": str(ab_dir)}, + ) + + cmd = mock_sp.call_args.args[0] + assert cmd[cmd.index("--overwrite") + 1] == "True" + + def test_run_nonzero_exit_raises(self, tmp_path): + from oellm.contrib.audiobench import suite + + ab_dir = self._fake_audiobench_tree(tmp_path) + output_path = tmp_path / "result.json" + + with patch( + "oellm.contrib.audiobench.suite.subprocess.run", + return_value=_FakeCompletedProcess(1), + ): + with pytest.raises(RuntimeError, match="AudioBench exited with code 1"): + suite.run( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task="audiobench_librispeech_test_clean", + n_shot=0, + output_path=output_path, + model_flags="Qwen2-Audio-7B-Instruct", + env={"AUDIOBENCH_DIR": str(ab_dir)}, + ) + + def test_run_handles_nested_metric_json(self, tmp_path): + """AudioBench's score-file shape has drifted across releases; we + tolerate both ``{"wer": 0.05}`` and + ``{"metrics": {"wer": {"score": 0.05}}}`` layouts. + """ + from oellm.contrib.audiobench import suite + + ab_dir = self._fake_audiobench_tree(tmp_path) + output_path = tmp_path / "result.json" + + with patch( + "oellm.contrib.audiobench.suite.subprocess.run", + side_effect=self._fake_run_writing_score( + ab_dir, score_value=0.051, body_shape="nested" + ), + ): + suite.run( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task="audiobench_librispeech_test_clean", + n_shot=0, + output_path=output_path, + model_flags="Qwen2-Audio-7B-Instruct", + env={"AUDIOBENCH_DIR": str(ab_dir)}, + ) + + body = json.loads(output_path.read_text()) + assert body["results"]["audiobench_librispeech_test_clean"][ + "wer" + ] == pytest.approx(0.051) + + def test_run_missing_score_file_raises(self, tmp_path): + """If AudioBench exits 0 but doesn't write the score file at the + expected path, surface a clear error rather than producing an + empty CSV row downstream. 
+ """ + from oellm.contrib.audiobench import suite + + ab_dir = self._fake_audiobench_tree(tmp_path) + output_path = tmp_path / "result.json" + + with patch( + "oellm.contrib.audiobench.suite.subprocess.run", + side_effect=self._fake_run_writing_score( + ab_dir, score_value=0.0, body_shape="no_file" + ), + ): + with pytest.raises( + RuntimeError, match="AudioBench did not write expected score file" + ): + suite.run( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task="audiobench_librispeech_test_clean", + n_shot=0, + output_path=output_path, + model_flags="Qwen2-Audio-7B-Instruct", + env={"AUDIOBENCH_DIR": str(ab_dir)}, + ) + + def test_run_score_file_without_metric_key_raises(self, tmp_path): + from oellm.contrib.audiobench import suite + + ab_dir = self._fake_audiobench_tree(tmp_path) + output_path = tmp_path / "result.json" + + with patch( + "oellm.contrib.audiobench.suite.subprocess.run", + side_effect=self._fake_run_writing_score( + ab_dir, score_value=0.0, body_shape="missing_metric" + ), + ): + with pytest.raises(RuntimeError, match="Could not locate metric"): + suite.run( + model_path="Qwen/Qwen2-Audio-7B-Instruct", + task="audiobench_librispeech_test_clean", + n_shot=0, + output_path=output_path, + model_flags="Qwen2-Audio-7B-Instruct", + env={"AUDIOBENCH_DIR": str(ab_dir)}, + ) + + +class _FakeCompletedProcess: + """Stand-in for subprocess.CompletedProcess.""" + + def __init__(self, returncode: int) -> None: + self.returncode = returncode + + +# --------------------------------------------------------------------------- +# schedule_evals dry-run — wiring smoke test. +# --------------------------------------------------------------------------- + + +class TestScheduleEvalsDryRun: + def test_dry_run_writes_audiobench_suite_to_csv(self, tmp_path): + import pandas as pd + + from oellm.main import schedule_evals + + with ( + patch("oellm.scheduler._load_cluster_env"), + patch("oellm.scheduler._num_jobs_in_queue", return_value=0), + patch.dict(os.environ, {"EVAL_OUTPUT_DIR": str(tmp_path)}), + ): + schedule_evals( + models="Qwen/Qwen2-Audio-7B-Instruct", + task_groups=ASR_GROUP, + skip_checks=True, + venv_path=str(Path(sys.prefix)), + dry_run=True, + ) + + csv_files = list(tmp_path.glob("**/jobs.csv")) + assert len(csv_files) == 1 + df = pd.read_csv(csv_files[0]) + # All rows route to audiobench (with model-flag suffix). + assert all(s.startswith("audiobench") for s in df["eval_suite"].unique()) + # task_path column contains canonical audiobench_ names. + assert all(t.startswith("audiobench_") for t in df["task_path"].unique()) + + def test_dry_run_preserves_model_flag_capitalization(self, tmp_path): + """Regression: scheduler.py used to lowercase the entire eval_suite + column, breaking AudioBench's case-sensitive dispatch keys + (``Qwen2-Audio-7B-Instruct`` was being mangled to + ``qwen2-audio-7b-instruct``). + """ + import pandas as pd + + from oellm.main import schedule_evals + + with ( + patch("oellm.scheduler._load_cluster_env"), + patch("oellm.scheduler._num_jobs_in_queue", return_value=0), + patch.dict(os.environ, {"EVAL_OUTPUT_DIR": str(tmp_path)}), + ): + schedule_evals( + models="Qwen/Qwen2-Audio-7B-Instruct", + task_groups=ASR_GROUP, + skip_checks=True, + venv_path=str(Path(sys.prefix)), + dry_run=True, + ) + + csv_files = list(tmp_path.glob("**/jobs.csv")) + df = pd.read_csv(csv_files[0]) + suites = set(df["eval_suite"].unique()) + # The exact AudioBench dispatch literal must come through case-intact. 
+ assert "audiobench:Qwen2-Audio-7B-Instruct" in suites + + def test_dry_run_sbatch_contains_contrib_dispatch(self, tmp_path): + from oellm.main import schedule_evals + + with ( + patch("oellm.scheduler._load_cluster_env"), + patch("oellm.scheduler._num_jobs_in_queue", return_value=0), + patch.dict(os.environ, {"EVAL_OUTPUT_DIR": str(tmp_path)}), + ): + schedule_evals( + models="Qwen/Qwen2-Audio-7B-Instruct", + task_groups=TOP_GROUP, + skip_checks=True, + venv_path=str(Path(sys.prefix)), + dry_run=True, + ) + + sbatch_files = list(tmp_path.glob("**/submit_evals.sbatch")) + assert len(sbatch_files) == 1 + content = sbatch_files[0].read_text() + assert "oellm.contrib.dispatch" in content + # LIMIT is exported so contrib plugins can read it. + assert "export LIMIT=" in content + + +# --------------------------------------------------------------------------- +# collect_results compatibility — verify a run() output flows through unchanged. +# --------------------------------------------------------------------------- + + +class TestCollectResultsCompat: + def test_collect_results_parses_audiobench_json(self, tmp_path): + import pandas as pd + + from oellm.main import collect_results + + results_dir = tmp_path / "results" + results_dir.mkdir() + + mock_output = { + "model_name_or_path": "/cluster/models/Qwen2-Audio-7B", + "results": { + "audiobench_librispeech_test_clean": {"wer": 0.052}, + }, + "configs": {"audiobench_librispeech_test_clean": {"num_fewshot": 0}}, + } + (results_dir / "ab123.json").write_text(json.dumps(mock_output)) + + output_csv = str(tmp_path / "results.csv") + collect_results(str(tmp_path), output_csv=output_csv) + + df = pd.read_csv(output_csv) + assert len(df) == 1 + row = df.iloc[0] + assert row["task"] == "audiobench_librispeech_test_clean" + assert float(row["performance"]) == pytest.approx(0.052) + assert row["model_name"] == "/cluster/models/Qwen2-Audio-7B"