OpenSeek/docs/FlagScale_Usage.md at main · FlagAI-Open/OpenSeek

Setup

We recommend using the latest release of NGC's PyTorch container for setup.

Clone the repository:

git clone https://github.com/FlagOpen/FlagScale.git

Install the requirements:

cd FlagScale/install
./install-requirements.sh --env train
./install-requirements.sh --env inference

The above instructions create two conda environments: flagscale-train and flagscale-inference, which contain the dependency environments for training and inference, respectively.

Install the packages with customized extensions:

cd vllm
pip install .

pip install -e ./megatron-energon
cp -r megatron-energon/src/megatron/energon megatron/megatron

Run a task

FlagScale provides a unified runner for various tasks, including training, inference, and serving. Simply specify the configuration file to run the task with a single command.

Start the distributed training job

python run.py --config-path=examples/deepseek_v3/conf --config-name=config_deepseek_v3.yaml action=run

Stop the distributed training job

python run.py --config-path=examples/deepseek_v3/conf --config-name=config_deepseek_v3.yaml action=stop
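
For scripted launches, the two commands above can be assembled programmatically. The following is a minimal sketch, not part of FlagScale: `build_runner_command` is a hypothetical helper, and its check only knows the two actions shown above.

```python
import shlex
import sys


def build_runner_command(action,
                         config_path="examples/deepseek_v3/conf",
                         config_name="config_deepseek_v3.yaml"):
    """Build the argument list for run.py (hypothetical helper, not a FlagScale API)."""
    # Only the two actions shown in this document are accepted here.
    if action not in ("run", "stop"):
        raise ValueError(f"unsupported action: {action!r}")
    return [
        sys.executable,
        "run.py",
        f"--config-path={config_path}",
        f"--config-name={config_name}",
        f"action={action}",
    ]


# Print the equivalent shell command for stopping the job.
print("python " + shlex.join(build_runner_command("stop")[1:]))
```

To actually launch the job, the returned list could be passed to `subprocess.run(cmd, check=True)` from the root of the FlagScale checkout.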

YAML Configuration

FlagScale leverages Hydra for configuration management, which is organized into two levels: an outer experiment-level YAML file and an inner task-level YAML file.

In the OpenSeek project, we have open-sourced a DeepSeek model with a total parameter count of 16B and an activated parameter count of 2.4B. This model has been thoroughly validated on real-world datasets, and the loss curve will be released shortly.

Experiment-level YAML: defines the experiment directory, backend engine, task type, and other related environmental configurations. Example: config_deepseek_v3.yaml
Task-level YAML: specifies the model, dataset, and parameters for specific tasks such as training or inference. Example: train_deepseek_v3.yaml
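
The two-level layout can be pictured as a small outer configuration that nests a larger task configuration. The sketch below uses plain dictionaries rather than Hydra, and every key name in it is hypothetical, chosen only to mirror the description above; it is not FlagScale's actual schema.

```python
from copy import deepcopy


def compose(experiment_cfg, task_name, task_cfg):
    """Return a new config with the task-level settings nested under task_name."""
    merged = deepcopy(experiment_cfg)
    merged[task_name] = deepcopy(task_cfg)
    return merged


# Outer level: where the experiment lives, which backend runs it, what it does.
# (Hypothetical keys, for illustration only.)
experiment_cfg = {
    "experiment": {
        "exp_dir": "outputs/demo",
        "backend": "megatron",
        "task_type": "train",
    },
    "action": "run",
}

# Inner level: model, data, and system parameters for the training task.
task_cfg = {
    "system": {"tensor_model_parallel_size": 2},
    "model": {},
    "data": {},
}

full_cfg = compose(experiment_cfg, "train", task_cfg)
print(full_cfg["train"]["system"]["tensor_model_parallel_size"])
```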

Model Checkpoint conversion

1. HuggingFace --> Megatron

FlagScale supports converting open-source models and checkpoints (CKPT) from the HuggingFace format to the Megatron format. Once the conversion is complete, the CKPT can be loaded and distributed training can be started with FlagScale.
For instance, the DeepSeek-V2-Lite 16B model, which is openly available on HuggingFace, can be converted into a CKPT format supported by FlagScale. The model can then be warm-started directly by configuring the load option in config_deepseek_v3.yaml.
During conversion, FlagScale supports tensor model parallelism, expert model parallelism, and pipeline model parallelism, with either even or uneven partitioning of the pipeline stages.

CKPT conversion script

cd FlagScale/tools
python convert.py \
    --model-type deepseek_v3 \
    --loader transformers \
    --saver mcore \
    --load-dir DeepSeek-V2-Lite \
    --save-dir converted_mcore_bf16_model \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 2 \
    --target-decoder-first-pipeline-num-layers 13 \
    --target-expert-parallel-size 4 \
    --target-params-dtype bf16 \
    --true-vocab-size 151851
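
The script above uses uneven pipeline partitioning: with a pipeline-parallel size of 2, the first stage is given 13 decoder layers. The sketch below shows how the per-stage layer counts could work out. It assumes the model has 27 decoder layers (an assumption; check `num_hidden_layers` in the model's `config.json`) and that layers not assigned to the first stage are spread evenly over the remaining stages. `split_decoder_layers` is a hypothetical helper, not part of `convert.py`.

```python
def split_decoder_layers(total_layers, pipeline_size, first_stage_layers=None):
    """Return the number of decoder layers held by each pipeline stage."""
    if first_stage_layers is None:
        # Even partitioning: every stage holds the same number of layers.
        if total_layers % pipeline_size != 0:
            raise ValueError("layers do not divide evenly across stages")
        return [total_layers // pipeline_size] * pipeline_size

    # Uneven partitioning: the first stage is fixed, the rest is shared evenly.
    remaining_layers = total_layers - first_stage_layers
    remaining_stages = pipeline_size - 1
    if remaining_stages == 0 or remaining_layers <= 0:
        raise ValueError("first stage leaves nothing for the other stages")
    if remaining_layers % remaining_stages != 0:
        raise ValueError("remaining layers do not divide evenly")
    return [first_stage_layers] + [remaining_layers // remaining_stages] * remaining_stages


# Assumed: 27 decoder layers, pipeline size 2, first stage holds 13 layers.
print(split_decoder_layers(27, 2, first_stage_layers=13))
```

Under these assumptions the second stage holds the remaining 14 layers, which is why an explicit first-stage size is needed when the layer count is odd.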

Modify task yaml

Set the load field in the task-level YAML file to the path of the converted checkpoint (the --save-dir used during conversion).

system:
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 2
  expert_model_parallel_size: 2
  context_parallel_size: 1
  sequence_parallel: true
  use_distributed_optimizer: true
  ...
  checkpoint:
    save_interval: 10000
    load: converted_mcore_bf16_model # the save_dir after conversion
    ckpt_format: torch
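
Before launching, it can help to check that the parallel sizes fit the number of GPUs. In Megatron-style training, the data-parallel size is the world size divided by the product of the tensor, pipeline, and context parallel sizes. The sketch below applies that rule to the values above; the world size of 8 is an assumption for illustration, `data_parallel_size` is a hypothetical helper, and constraints specific to expert parallelism are not covered.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel, context_parallel=1):
    """Compute the data-parallel size implied by the other parallel sizes."""
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"TP x PP x CP = {model_parallel}"
        )
    return world_size // model_parallel


# Values from the YAML above, with an assumed world size of 8 GPUs.
print(data_parallel_size(8, tensor_parallel=2, pipeline_parallel=2, context_parallel=1))
```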

Start training

python run.py --config-path=examples/deepseek_v3/conf --config-name=config_deepseek_v3.yaml action=run

2. Megatron --> HuggingFace

FlagScale also supports the conversion of model checkpoints (CKPT) trained on FlagScale into the HuggingFace format, facilitating model release and evaluation.

CKPT conversion script

python convert.py \
    --model-type deepseek_v3 \
    --loader mcore \
    --saver transformers \
    --load-dir bf16_model \
    --save-dir converted_huggingface_model \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1 \
    --target-expert-parallel-size 1 \
    --target-params-dtype bf16 \
    --true-vocab-size 151851
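
After conversion, a quick sanity check on the output directory can catch an incomplete run before the model is released or evaluated. The sketch below only checks for the files a HuggingFace-format checkpoint normally contains (a `config.json` plus at least one weight file); `looks_like_hf_checkpoint` is a hypothetical helper and does not validate the weights themselves.

```python
from pathlib import Path


def looks_like_hf_checkpoint(model_dir):
    """Return True if model_dir has a config.json and at least one weight file."""
    model_dir = Path(model_dir)
    has_config = (model_dir / "config.json").is_file()
    # Weights are usually stored as *.safetensors or, in older exports, *.bin.
    has_weights = any(model_dir.glob("*.safetensors")) or any(model_dir.glob("*.bin"))
    return has_config and has_weights


# The directory name matches --save-dir in the conversion command above.
print(looks_like_hf_checkpoint("converted_huggingface_model"))
```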

Contribute to FlagScale

Currently, we have preliminarily reproduced the DeepSeek V3 pre-training code with the following features:

Support for MLA and MTP structures: shared embedding.
Support for DeepSeekMoE structure: shared expert, loss-free, etc.
Support for checkpoint conversion to and from the Hugging Face format.
Support for hybrid parallelism of TP (Tensor Parallelism), PP (Pipeline Parallelism), DP (Data Parallelism), and EP (Expert Parallelism).

On the framework and systems side, the tasks listed below are still open, and everyone is welcome to participate and contribute. See the FlagAI-Open OpenSeek repository for a full list of proposed features.

Roadmap

Basic

Enhance the distributed training documentation
Improve the installation and usage
Convert checkpoints between the FlagScale and Hugging Face formats
Research and design a solution that can be easily implemented in FlagScale

Intermediate

Advanced

Support for DualPipe pipeline parallelism
Achieve more efficient pipeline parallelism
Improve communication algorithms to achieve more efficient MoE parallelism optimization
Collaborate with algorithm teams to achieve more efficient long sequence optimization
Implement customized parallel and optimization strategies for specific chips
Implement more innovative FP8 training solutions

How to contribute

We warmly welcome contributions to the FlagScale project! If you would like to contribute, please follow these steps:

Fork FlagScale to your own GitHub account
Create a copy of the FlagScale repo under your account, with a URL like https://github.com/your-own-id/FlagScale

Clone the forked repository to your local machine and navigate into the local FlagScale directory

git clone https://github.com/your-own-id/FlagScale.git
cd FlagScale
git config --global user.name XXX
git config --global user.email XXX
pre-commit install

Add the upstream repository to keep your fork updated with changes from the original FlagScale repository
```
git remote add upstream https://github.com/FlagOpen/FlagScale.git
```
Sync updates from the upstream FlagScale repository
```
git pull upstream main:main
```
Create a new branch and start your development
```
git checkout -b feature/my-new-feature
```

Commit your changes

git add .
git commit -m "Add my new feature"

Push your new branch to your GitHub repository
```
git push origin feature/my-new-feature
```
Create a pull request (PR) for FlagScale

Open your GitHub repository page (https://github.com/your-own-id/FlagScale)
You will see a prompt with a Compare & pull request button for your newly pushed branch
Click this button to proceed to the Pull Request page
Provide a title and a description that succinctly describe the changes you have made

Wait for review and merge
