This repository was archived by the owner on May 11, 2025. It is now read-only.

lm_eval results are not reproducible across multiple runs #738

@oj9040

Description


Hi.

I am evaluating AWQ on the meta-llama/Llama-3.2-1B-Instruct model.
With the original model (BF16), multiple runs produce identical results, but after AWQ quantization the results are not reproducible on some tasks (truthfulqa, winogrande).
I set the random seed and passed do_sample=False and temperature=0 in lm_eval's model args.
The quantization is performed once up front, and then only the evaluation is run multiple times (a sketch of the setup is below).
Do you have any idea how to resolve this?
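
For context, the setup looks roughly like this. This is a minimal sketch, assuming the AutoAWQ Python API and lm-evaluation-harness's `simple_evaluate`; the quant config values, output path, and seed values are illustrative, not copied verbatim from my scripts:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import lm_eval

MODEL = "meta-llama/Llama-3.2-1B-Instruct"
QUANT_PATH = "llama-3.2-1b-instruct-awq-w4"  # hypothetical output directory

# Step 1: quantize once (AutoAWQ, W4; quant_config values are illustrative).
model = AutoAWQForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model.quantize(
    tokenizer,
    quant_config={"w_bit": 4, "q_group_size": 128,
                  "zero_point": True, "version": "GEMM"},
)
model.save_quantized(QUANT_PATH)
tokenizer.save_pretrained(QUANT_PATH)

# Step 2: run only the evaluation, repeatedly, on the saved checkpoint,
# with fixed seeds and greedy decoding (do_sample=False, temperature=0).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={QUANT_PATH}",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "mmlu",
           "truthfulqa", "winogrande", "gsm8k"],
    gen_kwargs="do_sample=False,temperature=0",
    random_seed=0,
    numpy_random_seed=0,
    torch_random_seed=0,
)
```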

Original (BF16)

{'arc_easy': 0.718, 'arc_challenge': 0.4147, 'hellaswag': 0.5976, 'mmlu': 0.4546, 'truthfulqa': 0.3672, 'winogrande': 0.6164, 'gsm8k': np.float64(0.3328)}

{'arc_easy': 0.718, 'arc_challenge': 0.4147, 'hellaswag': 0.5976, 'mmlu': 0.4546, 'truthfulqa': 0.3672, 'winogrande': 0.6164, 'gsm8k': np.float64(0.3328)}

{'arc_easy': 0.718, 'arc_challenge': 0.4147, 'hellaswag': 0.5976, 'mmlu': 0.4546, 'truthfulqa': 0.3672, 'winogrande': 0.6164, 'gsm8k': np.float64(0.3328)}

AWQ (W4)

{'arc_easy': 0.7029, 'arc_challenge': 0.4121, 'hellaswag': 0.5828, 'mmlu': 0.4394, 'truthfulqa': 0.377, 'winogrande': 0.6062, 'gsm8k': np.float64(0.2676)}

{'arc_easy': 0.7029, 'arc_challenge': 0.4121, 'hellaswag': 0.5828, 'mmlu': 0.4394, 'truthfulqa': 0.3782, 'winogrande': 0.6069, 'gsm8k': np.float64(0.2676)}

{'arc_easy': 0.7029, 'arc_challenge': 0.4121, 'hellaswag': 0.5828, 'mmlu': 0.4394, 'truthfulqa': 0.377, 'winogrande': 0.6077, 'gsm8k': np.float64(0.2676)}
