Summary
Reproducing CODI results using the latest code (commit 2c23146). GPT-2 results reproduce well across all benchmarks, but LLaMA-3.2-1B-Instruct on math tasks diverges drastically — we get ~3% accuracy vs the paper's 44-54%.
Results
| Training Data | Benchmark | Paper Accuracy | Our Accuracy |
| --- | --- | --- | --- |
| iCoT (GSM8k-Aug) | GSM8k | 44.1 | 3.34 |
| iCoT (GSM8k-Aug) | SVAMP | 42.4 | 2.30 |
| iCoT (GSM8k-Aug) | GSM-Hard | 12.4 | 0.68 |
| iCoT (GSM8k-Aug) | MultiArith | 87.1 | 5.00 |
| iCoT-full (GSM8k-Aug-NL) | GSM8k | 54.2 | 3.56 |
For reference, these reproduce correctly:
- GPT-2 iCoT: GSM8k 41.09 (paper: 43.7), SVAMP 41.10 (paper: 42.9), GSM-Hard 9.33 (paper: 9.9), MultiArith 94.44 (paper: 92.8)
- GPT-2 iCoT-full: GSM8k 32.60 (paper: 34.8), SVAMP 35.20 (paper: 32.2), GSM-Hard 7.66 (paper: 7.5), MultiArith 85.56 (paper: 77.7)
- LLaMA CommonsenseQA: 73.96 (paper: 68.2)
Training Configuration
We used the repository scripts verbatim (`scripts/train_llama1b_gsm8k-aug.sh` and `scripts/train_llama1b_gsm8k-aug-nl.sh`):
iCoT:
- model: `meta-llama/Llama-3.2-1B-Instruct`
- epochs: 10, lr: 8e-4, batch: 32, grad_accum: 4
- LoRA r=128, alpha=32
- num_latent: 6, prj_dim: 2048, distill_loss_factor: 20
- max_token_num: 200
iCoT-full:
- Same as above but `data_name: icot-full`, `max_token_num: 256`
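One knob worth sanity-checking in this config is the implied LoRA scaling: PEFT multiplies the low-rank update by `lora_alpha / r`, so `r=128, alpha=32` gives an unusually small 0.25 factor (common configs set alpha at or above r). A minimal stdlib sketch of the arithmetic:

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Scaling factor applied to the low-rank update: W x + (alpha / r) * B A x."""
    return alpha / r

print(lora_scaling(128, 32))  # 0.25 with this issue's config
print(lora_scaling(8, 16))    # 2.0, a more typical alpha >= r setup
```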
Evaluation (`scripts/test_llama1b.sh`):
- inf_latent_iterations: 6, greedy: True, batch_size: 128
Observed Behavior
The LLaMA models produce degenerate repetitive text during inference instead of valid numerical answers. The model appears to collapse during training for math, while it works perfectly fine for CommonsenseQA (which is classification, not generative math).
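The degeneration can be quantified rather than eyeballed. A small stdlib-only diagnostic that flags heavy n-gram repetition in generations (the threshold and n are arbitrary choices, purely for triage):

```python
from collections import Counter

def repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of n-grams that are repeats: near 0 for fluent text,
    approaching 1 for degenerate loops."""
    tokens = text.split()
    if len(tokens) < n + 1:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)

print(repetition_ratio("the answer is the answer is the answer is"))   # high (4/7 trigrams repeat)
print(repetition_ratio("we bought 3 apples and 4 pears for 11 cents"))  # 0.0
```

Plotting this ratio over training checkpoints would pin down whether the collapse happens at a specific epoch, which could narrow the divergence to an optimizer/LR issue versus an inference-time one.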
Environment
- Python 3.12
- transformers 4.52.4
- torch 2.7.1+cu126
- peft 0.15.2
- NVIDIA H100 GPUs
Questions
- Are there known environment-specific issues (library versions, CUDA) that could cause this divergence for LLaMA on math tasks specifically?
- Were the paper results produced with these exact scripts (commit 2c23146), or were there additional changes?
- What transformers/peft versions were used for the paper results?