The basic steps for conducting the data mixture experiment are as follows:
-
Ensure you have installed the necessary tools, environment, and datasets by following the steps in Getting Started.
-
Modify the data mixture experiment configuration by referring to the content in Configuration.
-
Launch the experiment using the following script.
bash openseek/algorithm/run_exp.sh start <config-path>- View the experiment results and logs by referring to the content here.
You can modify the configuration by referring to the baseline experiment configuration which is located in the configs/OpenSeek-Small-v1-Baseline directory, including:
-
config_deepseek_v3_1_4b.yaml: This is the experiment configuration file, defining the experiment directory, backend engine, task type, and environment settings. -
train/train_deepseek_v3_1_4b.yaml: This is the job configuration (job config) file, specifying model parameters, dataset configurations, and training-specific settings.
To customize the experiment configuration, you can modify the baseline configuration directly (or copy it and modify):
- Modify the
experiment.exp_namefield in the experiment configuration file. The experiment output path will be under a directory with this new name.
# ...
experiment:
exp_name: OpenSeek-Small-v1-Baseline
# ...- Modify the
data.data_pathfield in the job configuration file to use your data path and the corresponding data mixing ratios. During the experiment, each individual ratio will be divided by the sum of all specified ratios to determine the actual proportion of data to be used from that path.
# ...
data:
# exp: baseline
data_path:
- 1.1068
- ${experiment.dataset_base_dir}/Nemotron-CC-high-actual-actual-high/part_142_text_document
- 0.3577
- ${experiment.dataset_base_dir}/Nemotron-CC-high-actual-actual-low/part_62_text_document
- 0.7775
- ${experiment.dataset_base_dir}/Nemotron-CC-high-actual-actual-mid/part_189_text_document
- 0.2859
- ${experiment.dataset_base_dir}/Nemotron-CC-high-synthetic-distill-high/part_76_text_document
- 0.1672
- ${experiment.dataset_base_dir}/Nemotron-CC-high-synthetic-distill-low/part_124_text_document
- 0.2339
- ${experiment.dataset_base_dir}/Nemotron-CC-high-synthetic-distill-mid/part_29_text_document
- 0.5397
- ${experiment.dataset_base_dir}/Nemotron-CC-high-synthetic-diverse_qa_pairs-high/part_244_text_document
- 0.4064
- ${experiment.dataset_base_dir}/Nemotron-CC-high-synthetic-diverse_qa_pairs-low/part_150_text_document
- 0.5005
- ...
# ...Modify the run_exp.sh script to specify the directory of your configuration and the path to your experiment configuration file.
For more information, see Megatron doc.
Experiment results will be located in a folder with the same name as the experiment (OpenSeek-Small-v1-Baseline for the baseline example).
Logs will be located in the log subdirectory within that folder. For example, when training with two machines, each having eight GPUs:
- The experiment startup log is located in the path corresponding to the first GPU on the first host. For example:
OpenSeek-Small-v1-Baseline/logs/details/host_0_xxx.xxx.xxx.xxx/20250423_185352.022338/default_atongk86/attempt_0/0
- The training loss log is located in the path corresponding to the last GPU on the last host. For example:
OpenSeek-Small-v1-Baseline/logs/details/host_1_yyy.yyy.yyy.yyy/20250423_185352.918040/default_zcuhq1c7/attempt_0/7