This repository was archived by the owner on May 11, 2025. It is now read-only.

lm_eval results are not reproducible across multiple runs #738

@oj9040

Description


Hi.

I am evaluating AWQ on the meta-llama/Llama-3.2-1B-Instruct model.
With the original model (BF16), multiple runs produce identical results, but after AWQ quantization the results are not reproducible on some tasks (truthfulqa, winogrande).
I set the random seed and passed do_sample=False and temperature=0 in lm_eval's model args.
The quantization is performed once up front, and then only the evaluation is run multiple times (a sketch of the setup is below).
Do you have any idea how to resolve this?
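
For context, the setup looks roughly like this. This is a minimal sketch, assuming the AutoAWQ Python API and lm-evaluation-harness's `simple_evaluate`; the quant config values, output path, and seed values are illustrative, not copied verbatim from my scripts:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import lm_eval

MODEL = "meta-llama/Llama-3.2-1B-Instruct"
QUANT_PATH = "llama-3.2-1b-instruct-awq-w4"  # hypothetical output directory

# Step 1: quantize once (AutoAWQ, W4; quant_config values are illustrative).
model = AutoAWQForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model.quantize(
    tokenizer,
    quant_config={"w_bit": 4, "q_group_size": 128,
                  "zero_point": True, "version": "GEMM"},
)
model.save_quantized(QUANT_PATH)
tokenizer.save_pretrained(QUANT_PATH)

# Step 2: run only the evaluation, repeatedly, on the saved checkpoint,
# with fixed seeds and greedy decoding (do_sample=False, temperature=0).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={QUANT_PATH}",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "mmlu",
           "truthfulqa", "winogrande", "gsm8k"],
    gen_kwargs="do_sample=False,temperature=0",
    random_seed=0,
    numpy_random_seed=0,
    torch_random_seed=0,
)
```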

Original (BF16)

{'arc_easy': 0.718, 'arc_challenge': 0.4147, 'hellaswag': 0.5976, 'mmlu': 0.4546, 'truthfulqa': 0.3672, 'winogrande': 0.6164, 'gsm8k': np.float64(0.3328)}

{'arc_easy': 0.718, 'arc_challenge': 0.4147, 'hellaswag': 0.5976, 'mmlu': 0.4546, 'truthfulqa': 0.3672, 'winogrande': 0.6164, 'gsm8k': np.float64(0.3328)}

{'arc_easy': 0.718, 'arc_challenge': 0.4147, 'hellaswag': 0.5976, 'mmlu': 0.4546, 'truthfulqa': 0.3672, 'winogrande': 0.6164, 'gsm8k': np.float64(0.3328)}

AWQ (W4)

{'arc_easy': 0.7029, 'arc_challenge': 0.4121, 'hellaswag': 0.5828, 'mmlu': 0.4394, 'truthfulqa': 0.377, 'winogrande': 0.6062, 'gsm8k': np.float64(0.2676)}

{'arc_easy': 0.7029, 'arc_challenge': 0.4121, 'hellaswag': 0.5828, 'mmlu': 0.4394, 'truthfulqa': 0.3782, 'winogrande': 0.6069, 'gsm8k': np.float64(0.2676)}

{'arc_easy': 0.7029, 'arc_challenge': 0.4121, 'hellaswag': 0.5828, 'mmlu': 0.4394, 'truthfulqa': 0.377, 'winogrande': 0.6077, 'gsm8k': np.float64(0.2676)}
