28 commits
680caf5
pipeline parallel init
tomiock Apr 23, 2026
56c048d
Enhance README with online datapacking and loading info
tomiock Apr 23, 2026
ed1d147
Revise scalability metrics in SCALABILITY.md
tomiock Apr 23, 2026
ba3ed0e
jup config
tomiock Apr 22, 2026
60f2680
jupiter data packing config
tomiock Apr 23, 2026
fe08445
9B PP=4 but error on 27B
tomiock Apr 23, 2026
92b5ccc
[feat] 27B model running w/ PP=6
tomiock Apr 24, 2026
12f3bd2
Merge branch 'pipeline' of github.com:VLR-CVC/vlm-training into pipeline
tomiock Apr 25, 2026
d48c74f
broken loss, models gives `nan` on logits
tomiock Apr 28, 2026
2afd6c7
[fix] corrent loss
tomiock Apr 28, 2026
a4b5083
loss issues w/ pipeline
tomiock Apr 28, 2026
1aff281
we need to load the weights in PP
tomiock Apr 28, 2026
6a7de7c
dev checkpoint
tomiock Apr 28, 2026
c89d450
[fix] PP loading weights
tomiock Apr 28, 2026
6fccd16
[fix] layers well split in PP ranks
tomiock Apr 28, 2026
579efb3
init moe
tomiock Apr 28, 2026
e069634
[feat] moe implemented (bad performance)
tomiock Apr 28, 2026
4ef463d
Merge branch 'pipeline' into moe
tomiock Apr 28, 2026
6b22368
merge `moe` into `pipeline`
tomiock Apr 28, 2026
267c870
Merge branch 'moe' of github.com:VLR-CVC/vlm-training into moe
tomiock Apr 28, 2026
7c40c3d
Merge pull request #17 from VLR-CVC/moe
tomiock Apr 28, 2026
7cf1de7
sync
tomiock Apr 29, 2026
22241cc
[feat] MoE TP & PP (not at the same time)
tomiock Apr 29, 2026
1b0e6c4
test EP
tomiock Apr 29, 2026
f036910
[feat] EP + TP implemented
tomiock Apr 29, 2026
d61392e
[feat] 4D parallelism
tomiock Apr 29, 2026
ae04d25
[feat] properly behaving 4D moe set up
tomiock May 2, 2026
9e5ddfc
[fix] better PP stages
tomiock May 2, 2026
18 changes: 7 additions & 11 deletions README.md
@@ -6,6 +6,8 @@
Massive-scale VLM pre-training and finetuning in HPC environments. It is specifically designed and tested for **Marenostrum 5** and **JUPITER**.
Works similarly to torchtitan, relying only on native torch code for the distributed implementation. Compatible with HF state-dicts; weights are loaded from an HF snapshot directory.

See SCALABILITY.md and USAGE.md for more details.

## Key Features
* **Supported Architectures:** **Qwen3.5**, Qwen3-VL and Qwen3 (text).
* **2D Parallelism:** FSDP/DDP (Single & Multi-node) and Tensor Parallelism (TP) support. Tested scaling up to 256 GPUs.
@@ -32,19 +32,12 @@ Support for ROCm systems (LUMI) is work in progress.
- `transformers=5.6.0`

## Datasets and Dataloading
Datasets are expected to be a CrudeWebdataset. With https://github.com/NVIDIA/Megatron-Energon we handle the raw data and tokenize it on the fly. This is an asynchronous process that has no impact on model performance.

**Online datapacking is not yet supported** (no particular issue related to the HPC system; it's a skill issue on my part). We believe data packing is a must-have for a vision-language training codebase with native resolution, as it lets the varying image sizes across datasets be handled easily.
Datasets are expected to be a CrudeWebdataset. With https://github.com/NVIDIA/Megatron-Energon we handle the raw data and tokenize it on the fly. This is an asynchronous process that has no impact on model performance. **Online datapacking is used by default.** Metadatasets (multiple sources) are supported.
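For orientation, a minimal sketch of the Energon side, assuming Megatron-Energon's documented `get_train_dataset`/`get_loader` API (the task encoder, cookers, and packing in `data/energon_dataloader.py` sit on top of this; the path is a placeholder):

```python
# Minimal Energon loading sketch -- illustrative, not the repo's exact wiring.
from megatron.energon import WorkerConfig, get_loader, get_train_dataset

ds = get_train_dataset(
    "/path/to/crude_webdataset",   # placeholder: a prepared CrudeWebdataset
    batch_size=1,
    shuffle_buffer_size=1000,
    max_samples_per_sequence=100,
    worker_config=WorkerConfig.default_worker_config(),
    # the repo additionally passes its task encoder (cookers + packing),
    # see data/energon_dataloader.py
)
for batch in get_loader(ds):       # workers tokenize asynchronously
    ...                            # feed the training step
```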

## Model Weights & Offline Loading
Use `utils/down.py` on a login node to pre-download model weights and tokenizers to a shared filesystem:

```bash
python utils/down.py
```
Go into the file and change the arguments; it does not have CLI support.
Use `utils/down.py` on a login node to pre-download model weights and tokenizers to a shared filesystem. The model's architecture configuration is derived from the downloaded snapshot.
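A hedged sketch of what `utils/down.py` presumably does (it likely wraps `huggingface_hub.snapshot_download`; the repo id and target directory below are placeholders):

```python
# Sketch of offline pre-download on a login node -- paths are placeholders.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3.5-27B",                      # model repo to mirror
    local_dir="/shared/fs/qwen_models/qwen3_5_27b",  # shared-filesystem target
)
```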

**Loading Mechanism:** During training, models are instantiated directly from these local paths. For Native Torch models, the architecture is initialized purely in PyTorch, and the offline weights are mapped and loaded directly into the native state dictionary.
**Loading Mechanism:** During training, models are instantiated directly from these local paths. The architecture is initialized purely in PyTorch, and the offline weights are mapped and loaded directly into the native state dictionary.
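As a rough illustration of that mapping step, a hypothetical sketch (the function name and the key-renaming rule are illustrative, not the repo's actual mapping):

```python
# Hypothetical sketch: map an HF safetensors snapshot onto a native state dict.
from pathlib import Path

import torch
from safetensors.torch import load_file

def load_hf_snapshot(model: torch.nn.Module, snapshot_dir: str) -> None:
    hf_state: dict[str, torch.Tensor] = {}
    for shard in sorted(Path(snapshot_dir).glob("*.safetensors")):
        hf_state.update(load_file(shard))  # gather all weight shards
    # rename HF keys to the native module's naming (rule is a placeholder)
    native_state = {k.removeprefix("model."): v for k, v in hf_state.items()}
    missing, unexpected = model.load_state_dict(native_state, strict=False)
    print(f"missing={len(missing)} unexpected={len(unexpected)}")
```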

## Usage
1. Ensure your datasets are formatted as Nvidia Energon webdatasets.
@@ -67,5 +62,6 @@ The codebase demonstrates linear scaling up to 256 GPUs using FSDP and Tensor Parallelism.
For a detailed breakdown of throughput, GPU efficiency, and scaling characteristics, please refer to [SCALABILITY.md](SCALABILITY.md).

## Known Issues & TODOs
* Online data packing for Energon dataloading is not yet supported.
* The entire workflow `training -> checkpoints -> eval/usage` needs a lot of work.
* Static shape compilation (`torch.compile` with `fullgraph=True`) is pending.
* A better data packing implementation is needed.
10 changes: 8 additions & 2 deletions SCALABILITY.md
@@ -1,6 +1,12 @@
## Qwen3.5-2B @ JUPITER
- 16 nodes (64 H200 96GB) tested
- 10,000 tks/sec/device
- +15,000 tks/sec/gpu
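(Aggregate across the 64 GPUs above: 64 × 15,000 ≈ 1M tokens/sec.)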

## Qwen3.5-9B @ JUPITER
- scaling test from 16 to 256 nodes
- +500 TFLOP/s/gpu
<img width="700" height="599" alt="image" src="https://github.com/user-attachments/assets/9ce70706-a081-4ba4-b018-734503fac241" />


## Qwen3-VL-8B @ JUPITER
- ~380 TFLOPS with 4 nodes (16 GH200 96GB)
@@ -16,4 +22,4 @@

### Results
Scalability throughput with 8B model on Marenostrum 5:
<img width="700" height="600" alt="image" src="https://github.com/user-attachments/assets/186567ce-5a76-4625-9e1c-587d0f44c24c" />
<img width="800" height="600" alt="image" src="https://github.com/user-attachments/assets/186567ce-5a76-4625-9e1c-587d0f44c24c" />
38 changes: 38 additions & 0 deletions configs/cvc/moe.toml
@@ -0,0 +1,38 @@
[model]
model_name = "Qwen/Qwen3.5-8B-A1B"
model_impl = "native"

train_llm = true
train_mlp = true
train_vit = true

[wandb]
run_name = "test"
project_name = "moe"

[training]
model_dir = "/data/151-1/users/tockier/qwen_finetune/cache/qwen35_8b_a1b"
#model_dir = "/data/151-1/users/tockier/qwen_finetune/cache/qwen35_35b_a3b"
output_dir = "/data/151-1/users/shared_cache/qwen_finetune/checkpoints"

save_steps = 10000
total_steps = 100
random_init = true

compile = false

[parallel]
tp_size = 1
pp_size = 1
ep_size = 1
data_parallel = 'fsdp'

ac_mode = "off"

[data]
data_path = "/data/151-1/datasets/synth_test_datasets/imagenet"
seq_len = 8192

packing_buffer_size = 100

batch_size = 0
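The `[parallel]` table above feeds the 4D parallelism landed in this PR. A hypothetical sketch of how such values could compose into a device mesh (dim names, ordering, and sizes are illustrative; the repo's actual wiring may differ):

```python
# Hypothetical sketch: composing [parallel] sizes into a 4D DeviceMesh.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
tp_size, pp_size, ep_size = 2, 2, 2  # placeholders (the config above uses 1s)
dp_size = dist.get_world_size() // (tp_size * pp_size * ep_size)
mesh = init_device_mesh(
    "cuda",
    (pp_size, dp_size, ep_size, tp_size),
    mesh_dim_names=("pp", "dp", "ep", "tp"),
)
dp_mesh = mesh["dp"]  # data_parallel = 'fsdp' would shard over this dim
```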
33 changes: 33 additions & 0 deletions configs/cvc/qwen3_5_27b.toml
@@ -0,0 +1,33 @@
[model]
model_name = "Qwen/Qwen3.5-27B"
model_impl = "native"

train_llm = true
train_mlp = true
train_vit = false

[wandb]
run_name = "test"
project_name = "qwen35_27b"

[training]
model_dir = "/data/151-1/users/tockier/qwen_finetune/cache/qwen35_27b"
output_dir = "/data/151-1/users/shared_cache/qwen_finetune/checkpoints"

save_steps = 10000
total_steps = 10000
random_init = false

tp_size = 1
pp_size = 2
data_parallel = 'fsdp'

ac_mode = "full"
compile = false

[data]
data_path = "/data/151-1/datasets/synth_test_datasets/cap_pretrain"
seq_len = 200

packing_buffer_size = 100
batch_size = 0
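This config runs the 27B model with `pp_size = 2`. A hedged sketch of a manual two-stage split with `torch.distributed.pipelining` (`keep_stage_layers`, `model`, `loss_fn`, and the input tensors are hypothetical stand-ins; the repo's actual stage splitting differs):

```python
# Hedged sketch of a pp_size = 2 pipeline with torch.distributed.pipelining.
import torch
import torch.distributed as dist
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

dist.init_process_group("nccl")
stage_idx, num_stages = dist.get_rank(), 2
device = torch.device("cuda", stage_idx % torch.cuda.device_count())

stage_mod = keep_stage_layers(model, stage_idx, num_stages)  # hypothetical helper
stage = PipelineStage(stage_mod, stage_idx, num_stages, device)
schedule = ScheduleGPipe(stage, n_microbatches=8, loss_fn=loss_fn)

if stage_idx == 0:
    schedule.step(input_ids)                      # first stage feeds inputs
else:
    losses: list[torch.Tensor] = []
    schedule.step(target=labels, losses=losses)   # last stage computes the loss
```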
14 changes: 5 additions & 9 deletions configs/cvc/qwen3_5_2b.toml
@@ -17,24 +17,20 @@ random_init = false

scheduler_type = "cosine"

total_steps: int = 1_000
warmup_steps: int = 100
wsd_decay_ratio: float = 0.1
min_lr_ratio: float = 0.1

tp_size = 1
pp_size = 2
data_parallel = 'ddp'

compile = false

tpi_multiplier = 12
tpi_multiplier = 1

[data]
data_path = "/data/151-1/datasets/llava_recap"
seq_len = 8192
data_path = "/data/151-1/datasets/synth_test_datasets/cap_pretrain"
seq_len = 256

shuffle_buffer_size = 1000
packing_buffer_size = 1000
packing_buffer_size = 100
max_samples_per_sequence = 100

batch_size = 0
12 changes: 7 additions & 5 deletions configs/cvc/qwen3_5_9b.toml
@@ -16,17 +16,19 @@ output_dir = "/data/151-1/users/shared_cache/qwen_finetune/checkpoints"

save_steps = 10000
total_steps = 10000
random_init_mlp = false
random_init = false

tp_size = 4
tp_size = 1
pp_size = 2
data_parallel = 'fsdp'

ac_mode = "full"

compile = false

[data]
data_path = "/data/151-1/datasets/synth_test_datasets/cap_pretrain"
seq_len = 4096
seq_len = 512

packing_buffer_size = 100

batch_size = 32
batch_size = 0
49 changes: 49 additions & 0 deletions configs/jupiter/qwen3_5_27b.toml
@@ -0,0 +1,49 @@
[model]
model_name = "Qwen/Qwen3.5-27B"
model_impl = "native"

train_llm = true
train_mlp = true
train_vit = false

[wandb]
run_name = "test 27b"
project_name = "scaling_27b"
entity_name = "bsc_runs"

[training]
model_dir = "/e/project1/reformo/ockier1/qwen_models/qwen3_5_27b"
output_dir = "/e/scratch/reformo/ockier1/checkpoints/test_35_27b"

tpi_multiplier = 1.0
save_steps = 1000

scheduler_type = "cosine"
total_steps = 18000
warmup_steps = 100
#wsd_decay_ratio = 0.1
min_lr_ratio = 0.1

lr_llm = 0.00002
lr_mlp = 0.0001

data_parallel = 'ddp'

pp_size = 1
tp_size = 1

resume_checkpoint = false
random_init = false

ac_mode = 'off'

compile = false
async_tp = false

[data]
data_path = "/e/project1/jureap59/ockier1/datasets/cap_pretrain"

packing_buffer_size = 100

seq_len = 512
batch_size = 0
22 changes: 14 additions & 8 deletions configs/jupiter/qwen3_5_2b.toml
@@ -17,27 +17,33 @@ model_dir = "/e/project1/reformo/ockier1/qwen_models/qwen3_5_2b"
output_dir = "/e/scratch/reformo/ockier1/checkpoints/test_35_2b"

tpi_multiplier = 1.0
save_steps = 1000
save_steps = 200

scheduler_type = "cosine"
total_steps = 18000
warmup_steps = 100
total_steps = 500
warmup_steps = 10
#wsd_decay_ratio = 0.1
min_lr_ratio = 0.1

lr_llm = 0.00002
lr_mlp = 0.0001

data_parallel = 'fsdp'
data_parallel = 'ddp'
tp_size = 1

resume_checkpoint = false
random_init_mlp = false
random_init = true

compile = true
compile = false

[data]
data_path = "/e/project1/jureap59/ockier1/datasets/cap_pretrain"
data_path = "/e/data1/datasets/products/llava_onevision_mid_training_85m/imagenet/EN"

seq_len = 8192
batch_size = 66

shuffle_buffer_size = 1000
packing_buffer_size = 1000
max_samples_per_sequence = 100

batch_size = 0

14 changes: 9 additions & 5 deletions configs/jupiter/qwen3_5_9b.toml
@@ -28,18 +28,22 @@ lr_llm = 0.00002
lr_mlp = 0.0001

data_parallel = 'fsdp'
tp_size = 4

pp_size = 4
tp_size = 1

resume_checkpoint = false
random_init_mlp = false
random_init = false

ac_mode = 'off'

compile = true
compile = false
async_tp = false

[data]
data_path = "/e/project1/jureap59/ockier1/datasets/cap_pretrain"

seq_len = 10240
batch_size = 64
packing_buffer_size = 100

seq_len = 8192
batch_size = 0
32 changes: 31 additions & 1 deletion data/energon_dataloader.py
@@ -93,6 +93,28 @@ class EnergonSample(Sample):
    image: torch.Tensor
    messages: list

@stateless
def cooker_llava_imagenet(sample: dict, add_system_prompt: bool = True) -> EnergonSample:
    # single-turn chat: the user sends the image, the assistant answers
    # with the caption stored under the sample's 'txt' key
    messages = [
        {'role': 'user', 'content': [
            {"type": "image"}
        ]},
        {'role': 'assistant', 'content': [
            {"type": "text", "text": sample['txt']}
        ]},
    ]

    if not add_system_prompt:
        # append an empty system message (presumably to keep the chat
        # template from injecting its default system prompt)
        messages.append({"role": "system", "content": [{"type": "text", "text": ""}]})

    image = sample['jpg']

    return EnergonSample(
        **basic_sample_keys(sample),  # keys Energon needs to track the sample
        image=image,
        messages=messages,
    )

@stateless
def cooker_captioning(sample: dict, add_system_prompt: bool = True) -> EnergonSample:
    role_map = {'human': 'user', 'gpt': 'assistant', 'user': 'user', 'assistant': 'assistant'}
@@ -254,11 +276,19 @@ def __init__(self, processor, max_seq_len):
        self.assistant_token = self.tokenizer.encode("assistant")[0]
        self.EOS_token = self.tokenizer.eos_token_id

"""
cookers = [
# subflavors can be used to distinguish datasets when using a Metadataset
Cooker(cooker_captioning),
Cooker(cooker_captioning, has_subflavors={"type_dataset": "synth"}),
Cooker(cooker_llava_imagenet, has_subflavors={"type_dataset": "llava_onevision_midtraining"}),
]
"""

cookers = [
# subflavors can be used to distinguish datasets when using a Metadataset
Cooker(cooker_captioning),
Cooker(cooker_llava_imagenet, has_subflavors={"type_dataset": "llava_onevision_midtraining"}),
]
    # transform the RAW data, tokenize a single sample
    @stateless(restore_seeds=True)
    def encode_sample(self, sample: EnergonSample) -> EncodedSample:
Expand Down
Loading