examples/mmlu_benchmark/README.md
30 additions & 9 deletions
@@ -1,19 +1,27 @@
 # MMLU Benchmark Example
 
-Evaluate language models on [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300) with optional efficient evaluation via [DISCO](https://arxiv.org/abs/2510.07959).
+Evaluate language models on [MMLU (Massive Multitask Language Understanding)](https://arxiv.org/abs/2009.03300) with optional efficient evaluation via [DISCO (Diversifying Sample Condensation)](https://arxiv.org/abs/2510.07959).
 
 ## Installation
 
-For basic MMLU evaluation:
+Install [uv package manager](https://docs.astral.sh/uv/) as described [here](https://docs.astral.sh/uv/getting-started/installation/).
+
+Create Python environment:
+
+```bash
+uv venv --python 3.11
+```
+
+Install dependencies for basic MMLU evaluation:
 
 ```bash
-uv pip install .[mmlu]
+uv sync --extra mmlu
 ```
 
-For DISCO prediction (includes DISCO dependencies):
+Install dependencies for MMLU evaluation with DISCO:
 
 ```bash
-uv pip install .[disco]
+uv sync --extra disco
 ```
 
 ## Run without DISCO (full evaluation)
@@ -31,9 +39,8 @@ Full evaluation results look like: