
LLaMA-3.2-1B-Instruct math results diverge significantly from paper (GSM8k-Aug, GSM8k-Aug-NL) #17

@andrewivan123

Description


Summary

We are reproducing CODI results with the latest code (commit 2c23146). GPT-2 reproduces well across all benchmarks, but LLaMA-3.2-1B-Instruct diverges drastically on the math tasks: we get ~3% accuracy versus the paper's 44–54%.

Results

| Training Data | Benchmark | Paper Accuracy | Our Accuracy |
|---|---|---|---|
| iCoT (GSM8k-Aug) | GSM8k | 44.1 | 3.34 |
| iCoT (GSM8k-Aug) | SVAMP | 42.4 | 2.30 |
| iCoT (GSM8k-Aug) | GSM-Hard | 12.4 | 0.68 |
| iCoT (GSM8k-Aug) | MultiArith | 87.1 | 5.00 |
| iCoT-full (GSM8k-Aug-NL) | GSM8k | 54.2 | 3.56 |

For reference, these reproduce correctly:

  • GPT-2 iCoT: GSM8k 41.09 (paper: 43.7), SVAMP 41.10 (paper: 42.9), GSM-Hard 9.33 (paper: 9.9), MultiArith 94.44 (paper: 92.8)
  • GPT-2 iCoT-full: GSM8k 32.60 (paper: 34.8), SVAMP 35.20 (paper: 32.2), GSM-Hard 7.66 (paper: 7.5), MultiArith 85.56 (paper: 77.7)
  • LLaMA CommonsenseQA: 73.96 (paper: 68.2)

Training Configuration

We used the scripts from the repository unchanged (scripts/train_llama1b_gsm8k-aug.sh and scripts/train_llama1b_gsm8k-aug-nl.sh):

iCoT:

  • model: meta-llama/Llama-3.2-1B-Instruct
  • epochs: 10, lr: 8e-4, batch: 32, grad_accum: 4
  • LoRA r=128, alpha=32
  • num_latent: 6, prj_dim: 2048, distill_loss_factor: 20
  • max_token_num: 200

iCoT-full:

  • Same as above but data_name: icot-full, max_token_num: 256
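As a quick sanity check on the hyperparameters above, here is a stdlib-only sketch (not code from the repo) of two derived quantities worth comparing against the authors' setup: the effective batch size per optimizer step, and the LoRA update scaling implied by r=128, alpha=32 under peft's usual alpha/r convention:

```python
# Sanity-check sketch (not from the CODI repo): quantities implied
# by the training hyperparameters listed above.
batch, grad_accum = 32, 4
effective_batch = batch * grad_accum  # samples per optimizer step

r, alpha = 128, 32
lora_scaling = alpha / r  # peft scales LoRA updates by alpha / r

print(f"effective batch size: {effective_batch}")  # 128
print(f"LoRA scaling (alpha/r): {lora_scaling}")   # 0.25
```

Note the scaling of 0.25 is unusually low (alpha is often set >= r); if the paper's runs used different alpha/r, that alone could change training dynamics.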

Evaluation (scripts/test_llama1b.sh):

  • inf_latent_iterations: 6, greedy: True, batch_size: 128

Observed Behavior

During inference, the LLaMA models produce degenerate, repetitive text instead of valid numerical answers. The model appears to collapse when trained on math, while it works fine on CommonsenseQA (a classification task rather than generative math).
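For triage, a minimal check like the following (a hypothetical helper, not part of the evaluation scripts) can quantify how many generations are degenerate by testing whether the output's trailing n-gram repeats many times:

```python
# Hypothetical helper: flag degenerate repetitive generations by
# checking whether the trailing n-gram repeats at least min_repeats times.
def is_degenerate(text: str, n: int = 4, min_repeats: int = 5) -> bool:
    tokens = text.split()
    if len(tokens) < n * min_repeats:
        return False
    tail = tokens[-n:]
    repeats = 0
    i = len(tokens) - n
    # Walk backwards in n-token strides while the tail keeps repeating.
    while i >= 0 and tokens[i:i + n] == tail:
        repeats += 1
        i -= n
    return repeats >= min_repeats

print(is_degenerate("the answer is " + "7 7 7 7 " * 10))  # True
print(is_degenerate("The final answer is 42"))            # False
```

Running this over the decoded outputs would make "degenerate repetitive text" a concrete rate to report alongside accuracy.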

Environment

  • Python 3.12
  • transformers 4.52.4
  • torch 2.7.1+cu126
  • peft 0.15.2
  • NVIDIA H100 GPUs

Questions

  1. Are there known environment-specific issues (library versions, CUDA) that could cause this divergence for LLaMA on math tasks specifically?
  2. Were the paper results produced with these exact scripts (commit 2c23146), or were there additional changes?
  3. What transformers/peft versions were used for the paper results?
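To make version comparisons concrete, a small stdlib snippet (a hypothetical helper, not part of the repo) can dump the installed versions for anyone comparing environments:

```python
# Hypothetical helper: report installed versions of the packages in question.
import importlib.metadata as md

report = []
for pkg in ("transformers", "peft", "torch"):
    try:
        report.append(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        report.append(f"{pkg}: not installed")

print("\n".join(report))
```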
