
LLaMA-3.2-1B-Instruct math results diverge significantly from paper (GSM8k-Aug, GSM8k-Aug-NL) #17

@andrewivan123

Description


Summary

We are reproducing CODI results with the latest code (commit 2c23146). GPT-2 reproduces well across all benchmarks, but LLaMA-3.2-1B-Instruct diverges drastically on the math tasks: we get ~3% accuracy versus the paper's 44–54%.

Results

| Training Data | Benchmark | Paper Accuracy | Our Accuracy |
|---|---|---|---|
| iCoT (GSM8k-Aug) | GSM8k | 44.1 | 3.34 |
| iCoT (GSM8k-Aug) | SVAMP | 42.4 | 2.30 |
| iCoT (GSM8k-Aug) | GSM-Hard | 12.4 | 0.68 |
| iCoT (GSM8k-Aug) | MultiArith | 87.1 | 5.00 |
| iCoT-full (GSM8k-Aug-NL) | GSM8k | 54.2 | 3.56 |

For reference, these reproduce correctly:

  • GPT-2 iCoT: GSM8k 41.09 (paper: 43.7), SVAMP 41.10 (paper: 42.9), GSM-Hard 9.33 (paper: 9.9), MultiArith 94.44 (paper: 92.8)
  • GPT-2 iCoT-full: GSM8k 32.60 (paper: 34.8), SVAMP 35.20 (paper: 32.2), GSM-Hard 7.66 (paper: 7.5), MultiArith 85.56 (paper: 77.7)
  • LLaMA CommonsenseQA: 73.96 (paper: 68.2)

Training Configuration

We used the scripts from the repository unchanged (scripts/train_llama1b_gsm8k-aug.sh and scripts/train_llama1b_gsm8k-aug-nl.sh):

iCoT:

  • model: meta-llama/Llama-3.2-1B-Instruct
  • epochs: 10, lr: 8e-4, batch: 32, grad_accum: 4
  • LoRA r=128, alpha=32
  • num_latent: 6, prj_dim: 2048, distill_loss_factor: 20
  • max_token_num: 200

iCoT-full:

  • Same as above but data_name: icot-full, max_token_num: 256
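As a quick sanity check on the hyperparameters above, here is a stdlib-only sketch (not code from the repo) of two derived quantities worth comparing against the authors' setup: the effective batch size per optimizer step, and the LoRA update scaling implied by r=128, alpha=32 under peft's usual alpha/r convention:

```python
# Sanity-check sketch (not from the CODI repo): quantities implied
# by the training hyperparameters listed above.
batch, grad_accum = 32, 4
effective_batch = batch * grad_accum  # samples per optimizer step

r, alpha = 128, 32
lora_scaling = alpha / r  # peft scales LoRA updates by alpha / r

print(f"effective batch size: {effective_batch}")  # 128
print(f"LoRA scaling (alpha/r): {lora_scaling}")   # 0.25
```

Note the scaling of 0.25 is unusually low (alpha is often set >= r); if the paper's runs used different alpha/r, that alone could change training dynamics.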

Evaluation (scripts/test_llama1b.sh):

  • inf_latent_iterations: 6, greedy: True, batch_size: 128

Observed Behavior

During inference, the LLaMA models produce degenerate, repetitive text instead of valid numerical answers. The model appears to collapse when trained on math, while it works fine on CommonsenseQA (a classification task rather than generative math).
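For triage, a minimal check like the following (a hypothetical helper, not part of the evaluation scripts) can quantify how many generations are degenerate by testing whether the output's trailing n-gram repeats many times:

```python
# Hypothetical helper: flag degenerate repetitive generations by
# checking whether the trailing n-gram repeats at least min_repeats times.
def is_degenerate(text: str, n: int = 4, min_repeats: int = 5) -> bool:
    tokens = text.split()
    if len(tokens) < n * min_repeats:
        return False
    tail = tokens[-n:]
    repeats = 0
    i = len(tokens) - n
    # Walk backwards in n-token strides while the tail keeps repeating.
    while i >= 0 and tokens[i:i + n] == tail:
        repeats += 1
        i -= n
    return repeats >= min_repeats

print(is_degenerate("the answer is " + "7 7 7 7 " * 10))  # True
print(is_degenerate("The final answer is 42"))            # False
```

Running this over the decoded outputs would make "degenerate repetitive text" a concrete rate to report alongside accuracy.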

Environment

  • Python 3.12
  • transformers 4.52.4
  • torch 2.7.1+cu126
  • peft 0.15.2
  • NVIDIA H100 GPUs

Questions

  1. Are there known environment-specific issues (library versions, CUDA) that could cause this divergence for LLaMA on math tasks specifically?
  2. Were the paper results produced with these exact scripts (commit 2c23146), or were there additional changes?
  3. What transformers/peft versions were used for the paper results?
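To make version comparisons concrete, a small stdlib snippet (a hypothetical helper, not part of the repo) can dump the installed versions for anyone comparing environments:

```python
# Hypothetical helper: report installed versions of the packages in question.
import importlib.metadata as md

report = []
for pkg in ("transformers", "peft", "torch"):
    try:
        report.append(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        report.append(f"{pkg}: not installed")

print("\n".join(report))
```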
