This is the official implementation of GoDiff, a novel diffusion-driven framework for Single-Domain Generalized Object Detection (SDGOD) in urban autonomous driving scenarios. The code is built upon InstanceDiffusion.
GoDiff addresses the challenge of domain generalization in object detection by employing dual-level augmentation:
- Image Level: PTDG (Pseudo Target Data Generation) module generates diverse pseudo-domain images while preserving precise annotations using a dual-prompt strategy
- Feature Level: CSN (Cross-Style Normalization) technique enhances domain-invariant learning through cross-domain style interchange
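The feature-level idea can be sketched as an AdaIN-style statistic swap: whiten one sample's per-channel feature statistics, then re-color them with another domain's. The snippet below is an illustrative NumPy sketch of cross-domain style interchange, not the paper's exact CSN implementation (function name and tensor layout are assumptions):

```python
import numpy as np

def cross_style_normalize(content, style, eps=1e-5):
    """Re-normalize the per-channel statistics of `content` (C, H, W)
    to match those of `style`. Illustrative sketch only; the actual
    CSN operates inside the detector's feature extractor."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    # Whiten the content features, then re-color with the style stats.
    return (content - c_mean) / c_std * s_std + s_mean
```

Swapping statistics between source-domain and pseudo-target-domain samples pushes the detector to rely on style-invariant content.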
- Novel dual-level (image & feature) augmentation for generalized object detection
- PTDG module generates diverse styled pseudo-domains with consistent annotations
- Achieves state-of-the-art performance on autonomous driving benchmarks
- GoDiff enhances existing SDG methods and object detectors as a general-purpose module
```
Instance-SDGOD-master/
├── configs/                    # Configuration files
│   ├── train_sd15.yaml         # Training configuration
│   ├── test_box.yaml           # Box condition testing
│   ├── test_mask.yaml          # Mask condition testing
│   ├── test_point.yaml         # Point condition testing
│   └── test_scribble.yaml      # Scribble condition testing
├── dataset-generation/         # Dataset generation tools
│   ├── generate.py             # Main generation script (Grounding DINO + SAM + RAM + BLIP)
│   ├── create_img_caption.py   # Caption creation
│   └── ram/                    # Recognize Anything Model
├── ldm/                        # Latent Diffusion Model core
│   ├── models/                 # Diffusion models (DDPM, DDIM, PLMS)
│   ├── modules/                # UNet, attention, encoders
│   └── data/                   # Dataset utilities
├── tools/                      # Utility tools
│   ├── data_generate.py        # Data generation
│   ├── ann_filter.py           # Annotation filtering
│   └── cmmd/                   # CMMD distance calculation
├── inference.py                # Inference script
├── finetune.py                 # Fine-tuning script
├── run_with_submitit.py        # Distributed training launcher
├── trainer.py                  # Training logic
└── requirements.txt            # Dependencies
```
- Linux or macOS with Python ≥ 3.8
- PyTorch ≥ 2.0 and torchvision
- CUDA-capable GPU (recommended)
- OpenCV ≥ 4.6
- Clone the repository

```shell
git clone https://github.com/fantasioly/Instance-SDGOD.git
cd Instance-SDGOD
```

- Create conda environment

```shell
conda create --name godiff python=3.8 -y
conda activate godiff
```

- Install dependencies

```shell
pip install -r requirements.txt
```

- Download pretrained models
Download the following models and place them in the pretrained/ folder:
| Model | Source | Path |
|---|---|---|
| Stable Diffusion 1.5 | Hugging Face | pretrained/v1-5-pruned-emaonly.ckpt |
| InstanceDiffusion | Hugging Face | pretrained/instancediffusion_sd15.pth |
| Grounding DINO | GitHub | Config + Checkpoint |
| SAM | GitHub | pretrained/sam_vit_h_4b8939.pth |
| RAM | Hugging Face | pretrained/ram_swin_large_14m.pth |
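After downloading, a quick sanity check that every checkpoint landed in `pretrained/` can save a failed run later. This helper is not part of the repository; the filenames are taken from the table above:

```python
from pathlib import Path

# Checkpoint filenames from the table above; adjust if you store them elsewhere.
REQUIRED_CHECKPOINTS = [
    "v1-5-pruned-emaonly.ckpt",
    "instancediffusion_sd15.pth",
    "sam_vit_h_4b8939.pth",
    "ram_swin_large_14m.pth",
]

def missing_checkpoints(pretrained_dir="pretrained"):
    """Return the expected checkpoint files that are not yet present."""
    root = Path(pretrained_dir)
    return [name for name in REQUIRED_CHECKPOINTS if not (root / name).exists()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All checkpoints found.")
```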
The project uses JSON format for training data. Each JSON file contains:
```json
{
  "caption": "Global image caption",
  "width": 512,
  "height": 512,
  "file_name": "image_001.jpg",
  "image": "base64_encoded_image",
  "annos": [
    {
      "bbox": [x, y, width, height],
      "caption": "Instance caption from BLIP",
      "category_name": "car",
      "mask": {"counts": "RLE_encoded_mask", "size": [512, 512]},
      "text_embedding_before": "base64_encoded_CLIP_embedding",
      "blip_clip_embeddings": "base64_encoded_BLIP_CLIP_embedding"
    }
  ]
}
```

Use the provided script to generate annotated training data:
```shell
python dataset-generation/generate.py \
    --config path/to/grounding_dino_config.py \
    --ram_checkpoint path/to/ram_checkpoint.pth \
    --grounded_checkpoint path/to/grounding_dino_checkpoint.pth \
    --sam_checkpoint path/to/sam_checkpoint.pth \
    --train_data_path path/to/train_data.json \
    --output_dir outputs/training_data \
    --box_threshold 0.25 \
    --text_threshold 0.2
```

Generate images with instance-level control:
```shell
python inference.py \
    --num_images 8 \
    --output OUTPUT/ \
    --input_json demos/demo_example.json \
    --ckpt pretrained/instancediffusion_sd15.pth \
    --test_config configs/test_box.yaml \
    --guidance_scale 7.5 \
    --alpha 0.8 \
    --seed 0 \
    --mis 0.36 \
    --cascade_strength 0.4
```

| Parameter | Description | Default |
|---|---|---|
| `--num_images` | Number of images to generate | 8 |
| `--guidance_scale` | CFG scale for generation | 7.5 |
| `--alpha` | Percentage of timesteps using grounding inputs | 0.75 |
| `--mis` | Multi-instance sampler ratio | 0.36 |
| `--cascade_strength` | SDXL Refiner strength (0 to disable) | 0.35 |
| `--test_config` | Condition type config | test_mask.yaml |
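The file passed via `--input_json` follows the instance-annotation format described above. The hypothetical helper below composes a minimal box-conditioned record; field names mirror the data-format section, so check demos/demo_example.json for the exact schema inference.py expects:

```python
import json

def make_input_record(caption, instances, width=512, height=512):
    """Build a minimal instance-conditioned input record.
    `instances` is a list of (bbox, category_name, instance_caption)
    tuples with bbox as [x, y, width, height]."""
    return {
        "caption": caption,
        "width": width,
        "height": height,
        "annos": [
            {
                "bbox": list(bbox),  # [x, y, width, height]
                "caption": inst_caption,
                "category_name": category,
                "mask": [],
            }
            for bbox, category, inst_caption in instances
        ],
    }

record = make_input_record(
    "A street scene with cars and pedestrians",
    [([100, 150, 200, 100], "car", "a red car driving on the road")],
)
print(json.dumps(record, indent=2))
```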
- Box: `configs/test_box.yaml` - Bounding box conditions
- Mask: `configs/test_mask.yaml` - Segmentation mask conditions
- Point: `configs/test_point.yaml` - Single point conditions
- Scribble: `configs/test_scribble.yaml` - Scribble/curve conditions
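Under the point condition, each instance is reduced to a single point. One natural choice, shown here as an illustrative helper (not part of the repo), is the box center; the demo JSONs may place points differently:

```python
def bbox_to_point(bbox):
    """Derive a single point condition from an [x, y, w, h] box.
    Using the box center is one reasonable choice."""
    x, y, w, h = bbox
    return [x + w / 2.0, y + h / 2.0]
```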
```shell
python finetune.py \
    --yaml_file configs/train_sd15.yaml \
    --official_ckpt_name pretrained/v1-5-pruned-emaonly.ckpt \
    --train_file dataset/your_train_data.txt \
    --batch_size 2 \
    --base_learning_rate 5e-5 \
    --total_iters 500000
```

For distributed training, use the Submitit launcher:

```shell
python run_with_submitit.py \
    --workers 8 \
    --ngpus 4 \
    --nodes 1 \
    --batch_size 2 \
    --base_learning_rate 5e-5 \
    --yaml_file configs/train_sd15.yaml \
    --official_ckpt_name pretrained/v1-5-pruned-emaonly.ckpt \
    --train_file dataset/your_train_data.txt
```

GoDiff can be integrated with existing object detectors as a data augmentation module:
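In practice this usually means training the detector on the union of the original source images and the generated pseudo-target images. A minimal sketch of merging the two record sets (the one-JSON-per-image directory layout is an assumption; adapt it to how your records are stored):

```python
import json
from pathlib import Path

def merge_training_records(source_dir, pseudo_dir, out_file):
    """Concatenate per-image JSON records from the source domain and
    the generated pseudo-target domains into one training list."""
    records = []
    for root in (source_dir, pseudo_dir):
        for path in sorted(Path(root).glob("*.json")):
            with open(path) as f:
                records.append(json.load(f))
    with open(out_file, "w") as f:
        json.dump(records, f)
    return len(records)
```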
The PTDG module generates pseudo-target domain data:
```shell
# Generate styled images with consistent annotations
python tools/data_generate.py \
    --source_domain path/to/source_data \
    --target_style weather_conditions.json \
    --output_dir outputs/pseudo_target
```

Example record with per-instance boxes and points:

```json
{
  "caption": "A street scene with cars and pedestrians",
  "width": 512,
  "height": 512,
  "annos": [
    {
      "bbox": [100, 150, 200, 100],
      "caption": "a red car driving on the road",
      "category_name": "car",
      "mask": [],
      "point": [150, 200]
    },
    {
      "bbox": [300, 200, 80, 150],
      "caption": "a person walking on the sidewalk",
      "category_name": "person",
      "mask": [],
      "point": [340, 275]
    }
  ]
}
```

Filter low-quality generated samples:
```shell
python tools/ann_filter.py \
    --input_dir outputs/generated_data \
    --output_dir outputs/filtered_data \
    --clip_threshold 0.25
```

Analyze dataset distribution:

```shell
python tools/dataset_statistics.py --data_path dataset/train.json
```

If you find this work useful, please cite:
```bibtex
@article{li2026godiff,
  title     = {Object Style Diffusion for Generalized Object Detection in Urban Scene},
  author    = {Hao Li and Xiangyuan Yang and Mengzhu Wang and Long Lan and Ke Liang and Xinwang Liu and Kenli Li},
  journal   = {Pattern Recognition},
  year      = {2026},
  publisher = {Elsevier}
}
```

- InstanceDiffusion - Base framework
- Stable Diffusion - Diffusion model
- Grounding DINO - Object detection
- Segment Anything - Segmentation
- RAM - Image tagging
- BLIP-2 - Image captioning
This project is licensed under the Apache License 2.0. Portions of this project are available under separate license terms (CLIP, BLIP, Stable Diffusion, GLIGEN).
For questions and issues, please open an issue on GitHub or contact the authors.
Note: This repository is released for academic and research purposes.