Skip to content

Latest commit

 

History

History
85 lines (61 loc) · 3.73 KB

File metadata and controls

85 lines (61 loc) · 3.73 KB

Data Mixture Experiment

Pipeline

The basic steps for conducting the data mixture experiment are as follows:

  1. Ensure you have installed the necessary tools, environment, and datasets by following the steps in Getting Started.

  2. Modify the data mixture experiment configuration by referring to the content in Configuration.

  3. Launch the experiment using the following script.

bash openseek/algorithm/run_exp.sh start <config-path>
  1. View the experiment results and logs by referring to the content here.

Configuration

You can modify the configuration by referring to the baseline experiment configuration which is located in the configs/OpenSeek-Small-v1-Baseline directory, including:

  • config_deepseek_v3_1_4b.yaml: This is the experiment configuration file, defining the experiment directory, backend engine, task type, and environment settings.

  • train/train_deepseek_v3_1_4b.yaml: This is the job configuration (job config) file, specifying model parameters, dataset configurations, and training-specific settings.

To customize the experiment configuration, you can modify the baseline configuration directly (or copy it and modify):

  • Modify the experiment.exp_name field in the experiment configuration file. The experiment output path will be under a directory with this new name.
# ...
experiment:
  exp_name: OpenSeek-Small-v1-Baseline
# ...
  • Modify the data.data_path field in the job configuration file to use your data path and the corresponding data mixing ratios. During the experiment, each individual ratio will be divided by the sum of all specified ratios to determine the actual proportion of data to be used from that path.
# ...
data:
  # exp: baseline
  data_path:
    - 1.1068
    - ${experiment.dataset_base_dir}/Nemotron-CC-high-actual-actual-high/part_142_text_document
    - 0.3577
    - ${experiment.dataset_base_dir}/Nemotron-CC-high-actual-actual-low/part_62_text_document
    - 0.7775
    - ${experiment.dataset_base_dir}/Nemotron-CC-high-actual-actual-mid/part_189_text_document
    - 0.2859
    - ${experiment.dataset_base_dir}/Nemotron-CC-high-synthetic-distill-high/part_76_text_document
    - 0.1672
    - ${experiment.dataset_base_dir}/Nemotron-CC-high-synthetic-distill-low/part_124_text_document
    - 0.2339
    - ${experiment.dataset_base_dir}/Nemotron-CC-high-synthetic-distill-mid/part_29_text_document
    - 0.5397
    - ${experiment.dataset_base_dir}/Nemotron-CC-high-synthetic-diverse_qa_pairs-high/part_244_text_document
    - 0.4064
    - ${experiment.dataset_base_dir}/Nemotron-CC-high-synthetic-diverse_qa_pairs-low/part_150_text_document
    - 0.5005
    - ...
# ...

Modify the run_exp.sh script to specify the directory of your configuration and the path to your experiment configuration file.

For more information, see Megatron doc.

Results and Log

Experiment results will be located in a folder with the same name as the experiment (OpenSeek-Small-v1-Baseline for the baseline example).

Logs will be located in the log subdirectory within that folder. For example, when training with two machines, each having eight GPUs:

  • The experiment startup log is located in the path corresponding to the first GPU on the first host. For example:
OpenSeek-Small-v1-Baseline/logs/details/host_0_xxx.xxx.xxx.xxx/20250423_185352.022338/default_atongk86/attempt_0/0
  • The training loss log is located in the path corresponding to the last GPU on the last host. For example:
OpenSeek-Small-v1-Baseline/logs/details/host_1_yyy.yyy.yyy.yyy/20250423_185352.918040/default_zcuhq1c7/attempt_0/7