Unified masked-diffusion modeling across textual reasoning, image generation, image editing, multi-modal understanding, text to speech, and speech to text.
Dynin-Omni: Omnimodal Unified Large Diffusion Language Model is an 8B-scale masked-diffusion foundation model that unifies text, image, video, and speech understanding and generation within a single architecture.
Unlike autoregressive (AR) unified models that serialize heterogeneous modalities into a left-to-right sequence, Dynin-Omni models all modalities as discrete tokens in a shared vocabulary and performs generation via iterative masked denoising. This enables bidirectional context modeling, parallel multi-token prediction, and globally conditioned any-to-any inference without modality-specific expert decoders.
Training proceeds in three stages: (1) modality adaptation, (2) omni-modal supervised fine-tuning with model merging, and (3) continual capability scaling.
Integration with dInfer is currently in progress. Integration with sglang is planned next.
Dynin-Omni support has been merged into vLLM-Omni through PR #1759 and is scheduled to be included in version 0.19.0. Once 0.19.0 is released, this section will be updated with the official setup and usage instructions.
Direct local-machine inference and training with Dynin-Omni are supported. Follow the instructions below.
Clone this repository:
```bash
git clone https://github.com/AIDASLab/Dynin-Omni.git
cd Dynin-Omni
```

Create and activate a conda environment:

```bash
conda create -n dynin_omni python=3.10
conda activate dynin_omni
```

Initialize the environment (installs and builds Python packages):

```bash
bash scripts/init_env.sh --overwrite
```

`--overwrite` forces the Hugging Face cache root to `datasets/huggingface` under the project root. Without `--overwrite`, the cache root is resolved as `HF_CACHE_DIR` > `HF_HOME` > project default.
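For reference, the sketch below mirrors that resolution order in plain Python; the function name and the way `--overwrite` is modeled here are illustrations of the rule above, not the actual logic inside `scripts/init_env.sh`.

```python
import os
from pathlib import Path

def resolve_hf_cache_root(project_root: str, overwrite: bool = False) -> Path:
    """Illustrative sketch of the precedence described above (not the real script)."""
    project_default = Path(project_root) / "datasets" / "huggingface"
    if overwrite:
        # --overwrite pins the cache root to the project-local default.
        return project_default
    # Otherwise resolve HF_CACHE_DIR first, then HF_HOME, then the project default.
    for var in ("HF_CACHE_DIR", "HF_HOME"):
        value = os.environ.get(var)
        if value:
            return Path(value)
    return project_default

print(resolve_hf_cache_root(".", overwrite=False))
```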
Dynin-Omni performs multimodal inference through iterative masked denoising. Target tokens are initialized as masks and refined over diffusion steps.
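As a mental model of that procedure, the toy loop below starts from an all-mask sequence and commits the highest-confidence predictions over a fixed number of steps. The scoring `model`, `mask_id`, and unmasking schedule are stand-ins for illustration, not the repository's actual decoder.

```python
import torch

def masked_denoise(model, length, mask_id, steps=8):
    """Toy iterative masked denoising: start fully masked, then progressively
    commit the highest-confidence token predictions until nothing is masked."""
    seq = torch.full((length,), mask_id, dtype=torch.long)
    for step in range(steps):
        still_masked = seq == mask_id
        if not still_masked.any():
            break
        probs = model(seq).softmax(dim=-1)      # (length, vocab_size)
        conf, pred = probs.max(dim=-1)
        # Unmask a growing fraction of the remaining masked positions each step.
        n_unmask = max(1, int(still_masked.sum() * (step + 1) / steps))
        conf = conf.masked_fill(~still_masked, -1.0)  # never re-select fixed tokens
        idx = conf.topk(n_unmask).indices
        seq[idx] = pred[idx]
    return seq

# Stand-in "model": random logits over a 32-token vocabulary (mask id sits outside it).
model = lambda seq: torch.randn(seq.shape[0], 32)
print(masked_denoise(model, length=16, mask_id=32))
```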
Entrypoint script:
```bash
bash scripts/inference.sh [--text|--i2i|--mmu|--speech|--t2i] [options]
```

The default configuration is `configs/dynin_omni_demo.yaml`.
`--result` defaults to `results/<mode>`.
Masked-diffusion text generation with block-wise decoding.
Validation script: validation/generate.py.
```bash
bash scripts/inference.sh --text
```

- Input questions (default): `validation/data/text/lm_questions.jsonl`. `jsonl` format: one sample per line (e.g., `{"question": "..."}`).
- Optional override: `--questions-file`.
- Fallback behavior: a built-in demo question is used when the file is missing or empty.
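The block-wise decoding used for text can be pictured as filling fixed-size blocks left to right, with parallel masked refinement inside each block conditioned on everything decoded so far. The sketch below is a simplified illustration with a dummy refinement step, not the logic in `validation/generate.py`.

```python
import torch

MASK_ID = 32  # toy mask token id, outside the 32-token dummy vocabulary

def decode_blockwise(refine_block, total_len, block_len):
    """Simplified block-wise decoding: blocks are produced left to right, and
    each block is refined with the already-decoded prefix as fixed context."""
    seq = torch.full((total_len,), MASK_ID, dtype=torch.long)
    for start in range(0, total_len, block_len):
        end = min(start + block_len, total_len)
        seq[start:end] = refine_block(seq.clone(), start, end)
    return seq

def toy_refine(seq, start, end):
    # Dummy stand-in for iterative refinement: fill masked slots with random tokens.
    block = seq[start:end]
    n_masked = int((block == MASK_ID).sum())
    block[block == MASK_ID] = torch.randint(0, 32, (n_masked,))
    return block

print(decode_blockwise(toy_refine, total_len=24, block_len=8))
```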
Validation script: validation/mmu_generate.py.
```bash
bash scripts/inference.sh --mmu
```

- Image directory (`.jpg`/`.jpeg`/`.png`/`.webp`): default `validation/data/image` (override with `--mmu-image-root`).
- Video directory (`.mp4`/`.mov`/`.avi`/`.mkv`/`.webm`): default `validation/data/video` (create if absent, or override with `--video-image-root`).
Discrete image tokens are generated via parallel masked refinement, followed by deterministic detokenization.
Validation script: validation/t2i_generate.py.
```bash
bash scripts/inference.sh --t2i
```

- Input data (default): `validation/data/text/t2i_metadata.jsonl`. `jsonl` format: one sample per line, e.g. `{"id": "t2i-00000", "prompt": "..."}` (`prompt` is required).
- Optional alternative: `--validation-prompts-file` with a plain-text file (one prompt per line) instead of `jsonl`.
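If you want to run the T2I mode on your own prompts, a metadata file in the format above can be generated with a few lines of Python; the prompts and output path below are placeholders.

```python
import json

# One JSON object per line; "prompt" is required, "id" names the output sample.
samples = [
    {"id": "t2i-00000", "prompt": "a watercolor painting of a lighthouse at dawn"},
    {"id": "t2i-00001", "prompt": "a macro photo of a dew-covered spider web"},
]
with open("validation/data/text/t2i_metadata.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```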
Validation script: validation/i2i_generate.py.
```bash
bash scripts/inference.sh --i2i
```

- Input `json` (default): `validation/data/text/i2i_edits.json`. `json` format: each item includes `id` (source image filename) and `prompt`.
- Source image directory (default): `validation/data/image` (override with `--origin-img-root`).
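Since each entry's `id` has to name an existing file under the source image directory, a quick check like the one below can catch missing images before launching an edit run; it assumes the top-level `json` is a list of entries, as described above.

```python
import json
from pathlib import Path

edits = json.loads(Path("validation/data/text/i2i_edits.json").read_text(encoding="utf-8"))
image_root = Path("validation/data/image")

# Report entries whose source image (named by "id") is missing from the image root.
for item in edits:
    if not (image_root / item["id"]).exists():
        print(f"missing source image for edit: {item['id']}")
```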
Speech recognition and synthesis are performed within the same token-level diffusion backbone without a modality-specific decoder.
Validation script: validation/speech.py.
```bash
bash scripts/inference.sh --speech
```

- Default source: LibriSpeech ASR test split from Hugging Face (`openslr/librispeech_asr`).
- Optional local audio root: `--librispeech-root` (directory containing LibriSpeech `.flac` files).
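For a quick look at the default source, the split can be streamed with the `datasets` library; the config name `"clean"` used here is an assumption and may differ from the subset that `validation/speech.py` actually loads.

```python
from datasets import load_dataset

# Stream the LibriSpeech ASR test split from the Hub (config name assumed to be "clean").
ds = load_dataset("openslr/librispeech_asr", "clean", split="test", streaming=True)
sample = next(iter(ds))
print(sample["text"])                      # reference transcript
print(sample["audio"]["sampling_rate"])    # raw audio metadata
```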
Training configurations (datasets, hyperparameters, etc.) are defined in configs/*.yaml.
scripts/train.sh path variables (CONFIG_FILE, TRAIN_SCRIPT, EXPERIMENT_CFG, LOG_DIR) must be specified as project-root-relative paths.
The examples below assume a single-node setup; host/runtime variables should be adapted to the target environment.
Accelerate configuration can be prepared by running:
```bash
python -m accelerate config
```

Predefined configurations are also available in `accelerate_configs/`:

```
accelerate_configs/
├── 1_gpu.yaml
├── 1_node_8_gpus_deepspeed_zero2.yaml
├── 1_node_8_gpus_deepspeed_zero3.yaml
└── 8_node_8_gpus_deepspeed_zero2.yaml
```

Stage 1 adapts newly introduced modalities (video and speech) to the masked-diffusion backbone. The following modality directions are activated:
- Video → Text (Video Captioning)
- Speech → Text (ASR)
- Text → Speech (TTS)
This stage anchors video and speech tokens into the shared semantic token space under text supervision.
```bash
CONFIG_FILE=accelerate_configs/1_node_8_gpus_deepspeed_zero2.yaml \
EXPERIMENT_CFG=configs/dynin_omni_stage1_llada_instruct.yaml \
TRAIN_SCRIPT=training/train_dynin_omni_stage1.py \
./scripts/train.sh
```

Stage 1 starts from the MMaDA-8B-MixCoT backbone checkpoint and extends it to support video and speech modalities through vocabulary expansion and text-centric alignment.
Stage 2 continues from the Stage 1 checkpoint and performs full omni-modal supervised fine-tuning.
Activated modality directions:
- Text → Text (Chat & Reasoning)
- Image → Text, Video → Text (Multi-Modal Understanding)
- Text → Image (Image Generation)
- Image → Image (Image Editing)
- Speech → Text (ASR)
- Text → Speech (TTS)
Before training, model merging is applied between the original backbone and the Stage 1 checkpoint to mitigate catastrophic forgetting. Explicit `<EOS>` supervision enables stable variable-length generation across modalities.
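One simple way to picture that merging step is element-wise interpolation of the two checkpoints in weight space; the sketch below uses plain linear interpolation with an assumed 50/50 ratio and a naive rule for parameters added by vocabulary expansion, not the exact Stage 2 recipe.

```python
import torch

def merge_state_dicts(backbone_sd, stage1_sd, alpha=0.5):
    """Toy weight-space merge: interpolate shared parameters, keep Stage 1
    parameters whose shape changed (e.g. rows added by vocabulary expansion)."""
    merged = {}
    for name, stage1_param in stage1_sd.items():
        base_param = backbone_sd.get(name)
        if base_param is not None and base_param.shape == stage1_param.shape:
            merged[name] = (1 - alpha) * base_param + alpha * stage1_param
        else:
            merged[name] = stage1_param
    return merged

# Tiny demo with random tensors standing in for the real checkpoints.
backbone = {"embed": torch.randn(4, 8), "mlp.weight": torch.randn(8, 8)}
stage1 = {"embed": torch.randn(6, 8), "mlp.weight": torch.randn(8, 8)}  # vocab expanded 4 -> 6
print({k: tuple(v.shape) for k, v in merge_state_dicts(backbone, stage1).items()})
```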
```bash
CONFIG_FILE=accelerate_configs/1_node_8_gpus_deepspeed_zero2.yaml \
EXPERIMENT_CFG=configs/dynin_omni_stage2_llada_instruct.yaml \
TRAIN_SCRIPT=training/train_dynin_omni_stage2.py \
./scripts/train.sh
```

Stage 3 continues from the Stage 2 checkpoint, retaining all modality directions while further scaling model capabilities.
Key enhancements include:
- Extended context length
- Higher-resolution image modeling
- Long-form speech generation (up to 21 seconds)
- Thinking-mode control (`\think` / `\no_think`)
- Chain-of-thought supervision
- Increased synthetic data for reasoning and generation
This stage improves reasoning depth, perception granularity, and long-form generation while preserving the unified masked-diffusion objective.
```bash
CONFIG_FILE=accelerate_configs/1_node_8_gpus_deepspeed_zero2.yaml \
EXPERIMENT_CFG=configs/dynin_omni_stage3_llada_instruct.yaml \
TRAIN_SCRIPT=training/train_dynin_omni_stage3.py \
./scripts/train.sh
```

Stage 3 starts from the Stage 2 checkpoint specified in `configs/dynin_omni_stage3_llada_instruct.yaml` and performs continual capability scaling under the same unified diffusion objective.
Evaluation details are provided in evaluation/README.md.
```bibtex
@article{aidaslab2026dyninomni,
  title={Dynin-Omni: Omnimodal Unified Large Diffusion Language Model},
  author={Kim, Jaeik and Kim, Woojin and Hong, Jihwan and Lee, Yejoon and Hyeon, Sieun and Lim, Mintaek and Han, Yunseok and Kim, Dogeun and Lee, Hoeun and Kim, Hyunggeun and Do, Jaeyoung},
  journal={arXiv preprint arXiv:2604.00007},
  year={2026}
}
```
