Hi.
I am evaluating AWQ on the meta-llama/Llama-3.2-1B-Instruct model.
With the original model (BF16), multiple runs produce identical results, but after AWQ the results vary across runs on some tasks (truthfulqa, winogrande).
I set the random seed, and set do_sample=False and temperature=0 in lm_eval's model args.
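For reference, the evaluation is invoked roughly like this (a sketch using lm_eval's Python API; the quantized model path is a placeholder):

```python
import lm_eval

# Sketch of the eval call; "llama-3.2-1b-instruct-awq" is a placeholder path.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=llama-3.2-1b-instruct-awq",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "mmlu",
           "truthfulqa", "winogrande", "gsm8k"],
    gen_kwargs="do_sample=False,temperature=0",  # greedy decoding for generative tasks
    random_seed=0,
    numpy_random_seed=0,
    torch_random_seed=0,
    fewshot_random_seed=0,
)
```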
Quantization is performed once up front, and then only the evaluation is run multiple times.
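The quantization step looks roughly like this (an AutoAWQ-style sketch; the quant_config shown is the library's standard W4 setup and the output path is a placeholder):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "llama-3.2-1b-instruct-awq"  # placeholder output path
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration/quantization runs once, here; every evaluation below reuses the saved artifact.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```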
Do you have any idea how to resolve this?
Original (BF16)
{'arc_easy': 0.718, 'arc_challenge': 0.4147, 'hellaswag': 0.5976, 'mmlu': 0.4546, 'truthfulqa': 0.3672, 'winogrande': 0.6164, 'gsm8k': np.float64(0.3328)}
{'arc_easy': 0.718, 'arc_challenge': 0.4147, 'hellaswag': 0.5976, 'mmlu': 0.4546, 'truthfulqa': 0.3672, 'winogrande': 0.6164, 'gsm8k': np.float64(0.3328)}
{'arc_easy': 0.718, 'arc_challenge': 0.4147, 'hellaswag': 0.5976, 'mmlu': 0.4546, 'truthfulqa': 0.3672, 'winogrande': 0.6164, 'gsm8k': np.float64(0.3328)}
AWQ (W4)
{'arc_easy': 0.7029, 'arc_challenge': 0.4121, 'hellaswag': 0.5828, 'mmlu': 0.4394, 'truthfulqa': 0.377, 'winogrande': 0.6062, 'gsm8k': np.float64(0.2676)}
{'arc_easy': 0.7029, 'arc_challenge': 0.4121, 'hellaswag': 0.5828, 'mmlu': 0.4394, 'truthfulqa': 0.3782, 'winogrande': 0.6069, 'gsm8k': np.float64(0.2676)}
{'arc_easy': 0.7029, 'arc_challenge': 0.4121, 'hellaswag': 0.5828, 'mmlu': 0.4394, 'truthfulqa': 0.377, 'winogrande': 0.6077, 'gsm8k': np.float64(0.2676)}
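One check that might narrow this down: winogrande (and the multiple-choice part of truthfulqa) is scored via loglikelihood, so sampling settings should not matter there, and the variance presumably comes from the quantized forward pass itself. Running the same batch twice and diffing the logits would confirm that (a sketch, assuming the checkpoint loads through transformers' AWQ integration; the path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "llama-3.2-1b-instruct-awq"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(quant_path)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
with torch.no_grad():
    a = model(**inputs).logits
    b = model(**inputs).logits

# A nonzero max difference would point at non-deterministic quantized
# CUDA kernels rather than at sampling/seed settings.
print((a - b).abs().max().item())
```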