Trains a causal LM continuously from:
- Unlabeled text stream (next-token prediction)
- Labeled prompt/answer stream (instruction supervision, optional)
Uses GaLore-style gradient low-rank projection to reduce optimizer-state memory:
- Periodically builds a projector from the current gradient via SVD
- Projects gradients into a compact low-rank space
- Runs the AdamW moments in that compact space
- Projects the update back to the full parameter space
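The four steps above can be sketched for a single 2D weight. This is a minimal NumPy illustration, not the repo's implementation: the real code hooks into per-layer torch gradients, and details such as what happens to the moments when the projector is rebuilt differ (here they are simply reset):

```python
# GaLore-style low-rank AdamW step for one 2D weight (NumPy sketch).
# `rank` and `update_proj_gap` mirror the config keys below.
import numpy as np

def galore_adamw_step(grad, state, rank=4, update_proj_gap=200,
                      lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One AdamW step with the gradient projected to a rank-r subspace."""
    step = state.get("step", 0)
    # Periodically rebuild the projector from the current gradient's SVD.
    if "P" not in state or step % update_proj_gap == 0:
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]          # (m, r) orthonormal basis
        state["m"] = np.zeros((rank, grad.shape[1]))
        state["v"] = np.zeros((rank, grad.shape[1]))
    P = state["P"]
    g = P.T @ grad                        # project to compact space, (r, n)
    # AdamW moments live in the compact space: 2*r*n floats, not 2*m*n.
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * g
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * g ** 2
    step += 1
    state["step"] = step
    m_hat = state["m"] / (1 - betas[0] ** step)
    v_hat = state["v"] / (1 - betas[1] ** step)
    # Project the update back to the full parameter space, (m, n).
    return P @ (lr * m_hat / (np.sqrt(v_hat) + eps))
```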
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The default config uses allenai/c4 streaming for unlabeled data:
```bash
python -m src.train
```

To add the labeled stream, edit `config.yaml`:
```yaml
labeled_source: hf_stream
semi_supervised:
  enabled: true
  supervised_weight: 0.5
  supervised_every: 2
```

To train from local folders instead, set in `config.yaml`:
```yaml
unlabeled_source: folder_tail
labeled_source: folder_tail
```

Create the directories and drop JSONL files into them:
```bash
mkdir -p data/unlabeled data/labeled
```

Unlabeled format:
{"text": "new text arrives here over time"}Labeled format:
{"prompt": "### Instruction:\n...\n\n### Response:\n", "answer": "..."}In config.yaml:
- `galore.rank`: projection rank (bigger = closer to full AdamW, more memory)
- `galore.update_proj_gap`: recompute the projector every N steps
- `galore.svd_method`: `lowrank` (faster) or `full` (more accurate)
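The memory trade-off of `galore.rank` is easy to estimate: AdamW keeps two moment tensors per weight, so a full m×n Linear weight costs 2·m·n optimizer floats, while the projected version keeps 2·r·n compact moments plus the m×r projector. A rough back-of-the-envelope (illustrative arithmetic only; actual totals depend on the implementation):

```python
# Optimizer-state float counts for one m x n Linear weight.
def adamw_state_floats(m, n):
    return 2 * m * n                  # exp_avg + exp_avg_sq, full size

def galore_state_floats(m, n, rank):
    return 2 * rank * n + m * rank    # compact moments + projector

m, n, rank = 4096, 4096, 128
full = adamw_state_floats(m, n)       # 33,554,432 floats
lowrank = galore_state_floats(m, n, rank)
print(full, lowrank, lowrank / full)  # ratio shrinks as rank shrinks
```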
- Single-device oriented, intentionally simple
- GaLore applied only to eligible 2D matrices (Linear weights)
- For scaling: add activation checkpointing, bf16, fused attention, multi-GPU
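One plausible reading of the `semi_supervised` settings, sketched here as an assumption rather than a description of the repo's exact scheduler: every `supervised_every`-th step draws from the labeled stream, and its loss is scaled by `supervised_weight`; all other steps use next-token prediction on the unlabeled stream.

```python
# Hypothetical interleaving of the two streams per the semi_supervised
# config; the (stream, loss_weight) pairs are illustrative only.
def schedule(supervised_every=2, supervised_weight=0.5, steps=6):
    plan = []
    for step in range(1, steps + 1):
        if supervised_every and step % supervised_every == 0:
            plan.append(("labeled", supervised_weight))
        else:
            plan.append(("unlabeled", 1.0))
    return plan

print(schedule())
# [('unlabeled', 1.0), ('labeled', 0.5), ('unlabeled', 1.0),
#  ('labeled', 0.5), ('unlabeled', 1.0), ('labeled', 0.5)]
```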