28 commits
680caf5
pipeline parallel init
tomiock Apr 23, 2026
56c048d
Enhance README with online datapacking and loading info
tomiock Apr 23, 2026
ed1d147
Revise scalability metrics in SCALABILITY.md
tomiock Apr 23, 2026
ba3ed0e
jup config
tomiock Apr 22, 2026
60f2680
jupiter data packing config
tomiock Apr 23, 2026
fe08445
9B PP=4 but error on 27B
tomiock Apr 23, 2026
92b5ccc
[feat] 27B model running w/ PP=6
tomiock Apr 24, 2026
12f3bd2
Merge branch 'pipeline' of github.com:VLR-CVC/vlm-training into pipeline
tomiock Apr 25, 2026
d48c74f
broken loss, models gives `nan` on logits
tomiock Apr 28, 2026
2afd6c7
[fix] corrent loss
tomiock Apr 28, 2026
a4b5083
loss issues w/ pipeline
tomiock Apr 28, 2026
1aff281
we need to load the weights in PP
tomiock Apr 28, 2026
6a7de7c
dev checkpoint
tomiock Apr 28, 2026
c89d450
[fix] PP loading weights
tomiock Apr 28, 2026
6fccd16
[fix] layers well split in PP ranks
tomiock Apr 28, 2026
579efb3
init moe
tomiock Apr 28, 2026
e069634
[feat] moe implemented (bad performance)
tomiock Apr 28, 2026
4ef463d
Merge branch 'pipeline' into moe
tomiock Apr 28, 2026
6b22368
merge `moe` into `pipeline`
tomiock Apr 28, 2026
267c870
Merge branch 'moe' of github.com:VLR-CVC/vlm-training into moe
tomiock Apr 28, 2026
7c40c3d
Merge pull request #17 from VLR-CVC/moe
tomiock Apr 28, 2026
7cf1de7
sync
tomiock Apr 29, 2026
22241cc
[feat] MoE TP & PP (not at the same time)
tomiock Apr 29, 2026
1b0e6c4
test EP
tomiock Apr 29, 2026
f036910
[feat] EP + TP implemented
tomiock Apr 29, 2026
d61392e
[feat] 4D parallelism
tomiock Apr 29, 2026
ae04d25
[feat] properly behaving 4D moe set up
tomiock May 2, 2026
9e5ddfc
[fix] better PP stages
tomiock May 2, 2026
18 changes: 7 additions & 11 deletions README.md
@@ -6,6 +6,8 @@
Massive-scale VLM pre-training and finetuning in HPC environments. It is specifically designed and tested for **Marenostrum 5** and **JUPITER**.
Works similarly to torchtitan, relying only on native torch code for the distributed implementation. Compatible with HF state-dicts; weights are loaded from an HF snapshot directory.

See SCALABILITY.md and USAGE.md for more details.

## Key Features
* **Supported Architectures:** **Qwen3.5**, Qwen3-VL and Qwen3 (text).
* **2D Parallelism:** FSDP/DDP (Single & Multi-node) and Tensor Parallelism (TP) support. Tested scaling up to 256 GPUs.
@@ -32,19 +32,12 @@ Support for ROCm systems (LUMI) is work in progress.
- `transformers=5.6.0`

## Datasets and Dataloading
Datasets are expected to be a CrudeWebdataset. With https://github.com/NVIDIA/Megatron-Energon we handle the raw data and tokenize it on the fly. This is an asynchronous process that has no impact on model performance.

**Online datapacking is not yet supported** (no particular issue related to the HPC system; it's a skill issue on my part). We believe data packing is a must-have for a vision-language training codebase with native resolution, as it lets the varying image sizes across datasets be handled easily.
Datasets are expected to be a CrudeWebdataset. With https://github.com/NVIDIA/Megatron-Energon we handle the raw data and tokenize it on the fly. This is an asynchronous process that has no impact on model performance. **Online datapacking is used by default.** Metadatasets (multiple sources) are supported.
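For orientation, a minimal sketch of the Energon side, assuming Megatron-Energon's documented `get_train_dataset`/`get_loader` API (the task encoder, cookers, and packing in `data/energon_dataloader.py` sit on top of this; the path is a placeholder):

```python
# Minimal Energon loading sketch -- illustrative, not the repo's exact wiring.
from megatron.energon import WorkerConfig, get_loader, get_train_dataset

ds = get_train_dataset(
    "/path/to/crude_webdataset",   # placeholder: a prepared CrudeWebdataset
    batch_size=1,
    shuffle_buffer_size=1000,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
    # the repo additionally passes its task encoder (cookers + packing),
    # see data/energon_dataloader.py
)
for batch in get_loader(ds):       # workers tokenize asynchronously
    ...                            # feed the training step
```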

## Model Weights & Offline Loading
Use `utils/down.py` on a login node to pre-download model weights and tokenizers to a shared filesystem:

```bash
python utils/down.py
```
Go into the file and change the arguments; it does not have CLI support.
Use `utils/down.py` on a login node to pre-download model weights and tokenizers to a shared filesystem. The model's architecture configuration is derived from the downloaded snapshot.
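A hedged sketch of what `utils/down.py` presumably does (it likely wraps `huggingface_hub.snapshot_download`; the repo id and target directory below are placeholders):

```python
# Sketch of offline pre-download on a login node -- paths are placeholders.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3.5-27B",                      # model repo to mirror
    local_dir="/shared/fs/qwen_models/qwen3_5_27b",  # shared-filesystem target
)
```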

**Loading Mechanism:** During training, models are instantiated directly from these local paths. For Native Torch models, the architecture is initialized purely in PyTorch, and the offline weights are mapped and loaded directly into the native state dictionary.
**Loading Mechanism:** During training, models are instantiated directly from these local paths. The architecture is initialized purely in PyTorch, and the offline weights are mapped and loaded directly into the native state dictionary.
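As a rough illustration of that mapping step, a hypothetical sketch (the function name and the key-renaming rule are illustrative, not the repo's actual mapping):

```python
# Hypothetical sketch: map an HF safetensors snapshot onto a native state dict.
from pathlib import Path

import torch
from safetensors.torch import load_file

def load_hf_snapshot(model: torch.nn.Module, snapshot_dir: str) -> None:
    hf_state: dict[str, torch.Tensor] = {}
    for shard in sorted(Path(snapshot_dir).glob("*.safetensors")):
        hf_state.update(load_file(shard))  # gather all weight shards
    # rename HF keys to the native module's naming (rule is a placeholder)
    native_state = {k.removeprefix("model."): v for k, v in hf_state.items()}
    missing, unexpected = model.load_state_dict(native_state, strict=False)
    print(f"missing={len(missing)} unexpected={len(unexpected)}")
```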

## Usage
1. Ensure your datasets are formatted as Nvidia Energon webdatasets.
@@ -67,5 +62,6 @@ The codebase demonstrates linear scaling up to 256 GPUs using FSDP and Tensor Parallelism.
For a detailed breakdown of throughput, GPU efficiency, and scaling characteristics, please refer to [SCALABILITY.md](SCALABILITY.md).

## Known Issues & TODOs
* Online data packing for Energon dataloading is not yet supported.
* The entire workflow `training -> checkpoints -> eval/usage` needs a lot of work.
* Static shape compilation (`torch.compile` with `fullgraph=True`) is pending.
* A better data packing implementation is needed.
10 changes: 8 additions & 2 deletions SCALABILITY.md
@@ -1,6 +1,12 @@
## Qwen3.5-2B @ JUPITER
- 16 nodes (64 H200 96GB) tested
- 10,000 tks/sec/device
- +15,000 tks/sec/gpu
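(Aggregate across the 64 GPUs above: 64 × 15,000 ≈ 1M tokens/sec.)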

## Qwen3.5-9B @ JUPITER
- scaling test from 16 to 256 nodes
- +500 TFLOP/s/gpu
<img width="700" height="599" alt="image" src="https://github.com/user-attachments/assets/9ce70706-a081-4ba4-b018-734503fac241" />


## Qwen3-VL-8B @ JUPITER
- ~380 TFLOPS with 4 nodes (16 GH200 96GB)
@@ -16,4 +22,4 @@

### Results
Scalability throughput with 8B model on Marenostrum 5:
<img width="700" height="600" alt="image" src="https://github.com/user-attachments/assets/186567ce-5a76-4625-9e1c-587d0f44c24c" />
<img width="800" height="600" alt="image" src="https://github.com/user-attachments/assets/186567ce-5a76-4625-9e1c-587d0f44c24c" />
38 changes: 38 additions & 0 deletions configs/cvc/moe.toml
@@ -0,0 +1,38 @@
[model]
model_name = "Qwen/Qwen3.5-8B-A1B"
model_impl = "native"

train_llm = true
train_mlp = true
train_vit = true

[wandb]
run_name = "test"
project_name = "moe"

[training]
model_dir = "/data/151-1/users/tockier/qwen_finetune/cache/qwen35_8b_a1b"
#model_dir = "/data/151-1/users/tockier/qwen_finetune/cache/qwen35_35b_a3b"
output_dir = "/data/151-1/users/shared_cache/qwen_finetune/checkpoints"

save_steps = 10000
total_steps = 100
random_init = true

compile = false

[parallel]
tp_size = 1
pp_size = 1
ep_size = 1
data_parallel = 'fsdp'

ac_mode = "off"

[data]
data_path = "/data/151-1/datasets/synth_test_datasets/imagenet"
seq_len = 8192

packing_buffer_size = 100

batch_size = 0
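The `[parallel]` table above feeds the 4D parallelism landed in this PR. A hypothetical sketch of how such values could compose into a device mesh (dim names, ordering, and sizes are illustrative; the repo's actual wiring may differ):

```python
# Hypothetical sketch: composing [parallel] sizes into a 4D DeviceMesh.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
tp_size, pp_size, ep_size = 2, 2, 2  # placeholders (the config above uses 1s)
dp_size = dist.get_world_size() // (tp_size * pp_size * ep_size)
mesh = init_device_mesh(
    "cuda",
    (pp_size, dp_size, ep_size, tp_size),
    mesh_dim_names=("pp", "dp", "ep", "tp"),
)
dp_mesh = mesh["dp"]  # data_parallel = 'fsdp' would shard over this dim
```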
33 changes: 33 additions & 0 deletions configs/cvc/qwen3_5_27b.toml
@@ -0,0 +1,33 @@
[model]
model_name = "Qwen/Qwen3.5-27B"
model_impl = "native"

train_llm = true
train_mlp = true
train_vit = false

[wandb]
run_name = "test"
project_name = "qwen35_27b"

[training]
model_dir = "/data/151-1/users/tockier/qwen_finetune/cache/qwen35_27b"
output_dir = "/data/151-1/users/shared_cache/qwen_finetune/checkpoints"

save_steps = 10000
total_steps = 10000
random_init = false

tp_size = 1
pp_size = 2
data_parallel = 'fsdp'

ac_mode = "full"
compile = false

[data]
data_path = "/data/151-1/datasets/synth_test_datasets/cap_pretrain"
seq_len = 200

packing_buffer_size = 100
batch_size = 0
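This config runs the 27B model with `pp_size = 2`. A hedged sketch of a manual two-stage split with `torch.distributed.pipelining` (`keep_stage_layers`, `model`, `loss_fn`, and the input tensors are hypothetical stand-ins; the repo's actual stage splitting differs):

```python
# Hedged sketch of a pp_size = 2 pipeline with torch.distributed.pipelining.
import torch
import torch.distributed as dist
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

dist.init_process_group("nccl")
stage_idx, num_stages = dist.get_rank(), 2
device = torch.device("cuda", stage_idx % torch.cuda.device_count())

stage_mod = keep_stage_layers(model, stage_idx, num_stages)  # hypothetical helper
stage = PipelineStage(stage_mod, stage_idx, num_stages, device)
schedule = ScheduleGPipe(stage, n_microbatches=8, loss_fn=loss_fn)

if stage_idx == 0:
    schedule.step(input_ids)                      # first stage feeds inputs
else:
    losses: list[torch.Tensor] = []
    schedule.step(target=labels, losses=losses)   # last stage computes the loss
```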
14 changes: 5 additions & 9 deletions configs/cvc/qwen3_5_2b.toml
@@ -17,24 +17,20 @@ random_init = false

scheduler_type = "cosine"

total_steps: int = 1_000
warmup_steps: int = 100
wsd_decay_ratio: float = 0.1
min_lr_ratio: float = 0.1

tp_size = 1
pp_size = 2
data_parallel = 'ddp'

compile = false

tpi_multiplier = 12
tpi_multiplier = 1

[data]
data_path = "/data/151-1/datasets/llava_recap"
seq_len = 8192
data_path = "/data/151-1/datasets/synth_test_datasets/cap_pretrain"
seq_len = 256

shuffle_buffer_size = 1000
packing_buffer_size = 1000
packing_buffer_size = 100
max_samples_per_sequence = 100

batch_size = 0
12 changes: 7 additions & 5 deletions configs/cvc/qwen3_5_9b.toml
@@ -16,17 +16,19 @@ output_dir = "/data/151-1/users/shared_cache/qwen_finetune/checkpoints"

save_steps = 10000
total_steps = 10000
random_init_mlp = false
random_init = false

tp_size = 4
tp_size = 1
pp_size = 2
data_parallel = 'fsdp'

ac_mode = "full"

compile = false

[data]
data_path = "/data/151-1/datasets/synth_test_datasets/cap_pretrain"
seq_len = 4096
seq_len = 512

packing_buffer_size = 100

batch_size = 32
batch_size = 0
49 changes: 49 additions & 0 deletions configs/jupiter/qwen3_5_27b.toml
@@ -0,0 +1,49 @@
[model]
model_name = "Qwen/Qwen3.5-27B"
model_impl = "native"

train_llm = true
train_mlp = true
train_vit = false

[wandb]
run_name = "test 27b"
project_name = "scaling_27b"
entity_name = "bsc_runs"

[training]
model_dir = "/e/project1/reformo/ockier1/qwen_models/qwen3_5_27b"
output_dir = "/e/scratch/reformo/ockier1/checkpoints/test_35_27b"

tpi_multiplier = 1.0
save_steps = 1000

scheduler_type = "cosine"
total_steps = 18000
warmup_steps = 100
#wsd_decay_ratio = 0.1
min_lr_ratio = 0.1

lr_llm = 0.00002
lr_mlp = 0.0001

data_parallel = 'ddp'

pp_size = 1
tp_size = 1

resume_checkpoint = false
random_init = false

ac_mode = 'off'

compile = false
async_tp = false

[data]
data_path = "/e/project1/jureap59/ockier1/datasets/cap_pretrain"

packing_buffer_size = 100

seq_len = 512
batch_size = 0
22 changes: 14 additions & 8 deletions configs/jupiter/qwen3_5_2b.toml
@@ -17,27 +17,33 @@ model_dir = "/e/project1/reformo/ockier1/qwen_models/qwen3_5_2b"
output_dir = "/e/scratch/reformo/ockier1/checkpoints/test_35_2b"

tpi_multiplier = 1.0
save_steps = 1000
save_steps = 200

scheduler_type = "cosine"
total_steps = 18000
warmup_steps = 100
total_steps = 500
warmup_steps = 10
#wsd_decay_ratio = 0.1
min_lr_ratio = 0.1

lr_llm = 0.00002
lr_mlp = 0.0001

data_parallel = 'fsdp'
data_parallel = 'ddp'
tp_size = 1

resume_checkpoint = false
random_init_mlp = false
random_init = true

compile = true
compile = false

[data]
data_path = "/e/project1/jureap59/ockier1/datasets/cap_pretrain"
data_path = "/e/data1/datasets/products/llava_onevision_mid_training_85m/imagenet/EN"

seq_len = 8192
batch_size = 66

shuffle_buffer_size = 1000
packing_buffer_size = 1000
max_samples_per_sequence = 100

batch_size = 0

14 changes: 9 additions & 5 deletions configs/jupiter/qwen3_5_9b.toml
@@ -28,18 +28,22 @@ lr_llm = 0.00002
lr_mlp = 0.0001

data_parallel = 'fsdp'
tp_size = 4

pp_size = 4
tp_size = 1

resume_checkpoint = false
random_init_mlp = false
random_init = false

ac_mode = 'off'

compile = true
compile = false
async_tp = false

[data]
data_path = "/e/project1/jureap59/ockier1/datasets/cap_pretrain"

seq_len = 10240
batch_size = 64
packing_buffer_size = 100

seq_len = 8192
batch_size = 0
32 changes: 31 additions & 1 deletion data/energon_dataloader.py
@@ -93,6 +93,28 @@ class EnergonSample(Sample):
    image: torch.Tensor
    messages: list

@stateless
def cooker_llava_imagenet(sample: dict, add_system_prompt: bool = True) -> EnergonSample:
    # single-turn chat: the user sends the image, the assistant answers
    # with the caption stored under the sample's 'txt' key
    messages = [
        {'role': 'user', 'content': [
            {"type": "image"}
        ]},
        {'role': 'assistant', 'content': [
            {"type": "text", "text": sample['txt']}
        ]},
    ]

    if not add_system_prompt:
        # append an empty system message (presumably to keep the chat
        # template from injecting its default system prompt)
        messages.append({"role": "system", "content": [{"type": "text", "text": ""}]})

    image = sample['jpg']

    return EnergonSample(
        **basic_sample_keys(sample),  # keys Energon needs to track the sample
        image=image,
        messages=messages,
    )

@stateless
def cooker_captioning(sample: dict, add_system_prompt: bool = True) -> EnergonSample:
    role_map = {'human': 'user', 'gpt': 'assistant', 'user': 'user', 'assistant': 'assistant'}
@@ -254,11 +276,19 @@ def __init__(self, processor, max_seq_len):
        self.assistant_token = self.tokenizer.encode("assistant")[0]
        self.EOS_token = self.tokenizer.eos_token_id

"""
cookers = [
# subflavors can be used to distinguish datasets when using a Metadataset
Cooker(cooker_captioning),
Cooker(cooker_captioning, has_subflavors={"type_dataset": "synth"}),
Cooker(cooker_llava_imagenet, has_subflavors={"type_dataset": "llava_onevision_midtraining"}),
]
"""

cookers = [
# subflavors can be used to distinguish datasets when using a Metadataset
Cooker(cooker_captioning),
Cooker(cooker_llava_imagenet, has_subflavors={"type_dataset": "llava_onevision_midtraining"}),
]
    # transform the RAW data, tokenize a single sample
    @stateless(restore_seeds=True)
    def encode_sample(self, sample: EnergonSample) -> EncodedSample:
Expand Down
Loading