Trains a causal LM continuously from:
- Unlabeled text stream (next-token prediction)
- Labeled prompt/answer stream (instruction supervision, optional)
Uses GaLore-style gradient low-rank projection to reduce optimizer-state memory:
- Periodically builds a projector from the current gradient via SVD
- Projects gradients into a compact low-rank space
- Runs the AdamW moments in that compact space
- Projects the update back to the full parameter space
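The four steps above can be sketched for a single 2D weight. This is a minimal NumPy illustration, not the repo's implementation: the real code hooks into per-layer torch gradients, and details such as what happens to the moments when the projector is rebuilt differ (here they are simply reset):

```python
# GaLore-style low-rank AdamW step for one 2D weight (NumPy sketch).
# `rank` and `update_proj_gap` mirror the config keys below.
import numpy as np

def galore_adamw_step(grad, state, rank=4, update_proj_gap=200,
                      lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One AdamW step with the gradient projected to a rank-r subspace."""
    step = state.get("step", 0)
    # Periodically rebuild the projector from the current gradient's SVD.
    if "P" not in state or step % update_proj_gap == 0:
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]          # (m, r) orthonormal basis
        state["m"] = np.zeros((rank, grad.shape[1]))
        state["v"] = np.zeros((rank, grad.shape[1]))
    P = state["P"]
    g = P.T @ grad                        # project to compact space, (r, n)
    # AdamW moments live in the compact space: 2*r*n floats, not 2*m*n.
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * g
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * g ** 2
    step += 1
    state["step"] = step
    m_hat = state["m"] / (1 - betas[0] ** step)
    v_hat = state["v"] / (1 - betas[1] ** step)
    # Project the update back to the full parameter space, (m, n).
    return P @ (lr * m_hat / (np.sqrt(v_hat) + eps))
```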
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The default config uses allenai/c4 streaming for unlabeled data:
```bash
python -m src.train
```

To add the labeled stream, edit `config.yaml`:
```yaml
labeled_source: hf_stream
semi_supervised:
  enabled: true
  supervised_weight: 0.5
  supervised_every: 2
```

To train from local folders instead, set in `config.yaml`:
```yaml
unlabeled_source: folder_tail
labeled_source: folder_tail
```

Create the directories and drop JSONL files into them:
```bash
mkdir -p data/unlabeled data/labeled
```

Unlabeled format:
{"text": "new text arrives here over time"}Labeled format:
{"prompt": "### Instruction:\n...\n\n### Response:\n", "answer": "..."}In config.yaml:
- `galore.rank`: projection rank (bigger = closer to full AdamW, more memory)
- `galore.update_proj_gap`: recompute the projector every N steps
- `galore.svd_method`: `lowrank` (faster) or `full` (more accurate)
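The memory trade-off of `galore.rank` is easy to estimate: AdamW keeps two moment tensors per weight, so a full m×n Linear weight costs 2·m·n optimizer floats, while the projected version keeps 2·r·n compact moments plus the m×r projector. A rough back-of-the-envelope (illustrative arithmetic only; actual totals depend on the implementation):

```python
# Optimizer-state float counts for one m x n Linear weight.
def adamw_state_floats(m, n):
    return 2 * m * n                  # exp_avg + exp_avg_sq, full size

def galore_state_floats(m, n, rank):
    return 2 * rank * n + m * rank    # compact moments + projector

m, n, rank = 4096, 4096, 128
full = adamw_state_floats(m, n)       # 33,554,432 floats
lowrank = galore_state_floats(m, n, rank)
print(full, lowrank, lowrank / full)  # ratio shrinks as rank shrinks
```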
- Single-device oriented, intentionally simple
- GaLore applied only to eligible 2D matrices (Linear weights)
- For scaling: add activation checkpointing, bf16, fused attention, multi-GPU
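One plausible reading of the `semi_supervised` settings, sketched here as an assumption rather than a description of the repo's exact scheduler: every `supervised_every`-th step draws from the labeled stream, and its loss is scaled by `supervised_weight`; all other steps use next-token prediction on the unlabeled stream.

```python
# Hypothetical interleaving of the two streams per the semi_supervised
# config; the (stream, loss_weight) pairs are illustrative only.
def schedule(supervised_every=2, supervised_weight=0.5, steps=6):
    plan = []
    for step in range(1, steps + 1):
        if supervised_every and step % supervised_every == 0:
            plan.append(("labeled", supervised_weight))
        else:
            plan.append(("unlabeled", 1.0))
    return plan

print(schedule())
# [('unlabeled', 1.0), ('labeled', 0.5), ('unlabeled', 1.0),
#  ('labeled', 0.5), ('unlabeled', 1.0), ('labeled', 0.5)]
```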