GoDiff: Object Style Diffusion for Generalized Object Detection in Urban Scene

This is the official implementation of GoDiff, a novel diffusion-driven framework for Single-Domain Generalized Object Detection (SDGOD) in urban autonomous driving scenarios. The code is built upon InstanceDiffusion.

📋 Overview

GoDiff addresses the challenge of domain generalization in object detection by employing dual-level augmentation:

Image Level: PTDG (Pseudo Target Data Generation) module generates diverse pseudo-domain images while preserving precise annotations using a dual-prompt strategy
Feature Level: CSN (Cross-Style Normalization) technique enhances domain-invariant learning through cross-domain style interchange

Key Features

🔥 Novel dual-level (image & feature) augmentation for generalized object detection
🎯 PTDG module generates diverse styled pseudo-domains with consistent annotations
📈 Achieves state-of-the-art performance on autonomous driving benchmarks
🔧 GoDiff enhances existing SDG methods and object detectors as a general-purpose module

🏗️ Project Structure

Instance-SDGOD-master/
├── configs/                    # Configuration files
│   ├── train_sd15.yaml        # Training configuration
│   ├── test_box.yaml          # Box condition testing
│   ├── test_mask.yaml         # Mask condition testing
│   ├── test_point.yaml        # Point condition testing
│   └── test_scribble.yaml     # Scribble condition testing
├── dataset-generation/         # Dataset generation tools
│   ├── generate.py            # Main generation script (Grounding DINO + SAM + RAM + BLIP)
│   ├── create_img_caption.py  # Caption creation
│   └── ram/                   # Recognize Anything Model
├── ldm/                       # Latent Diffusion Model core
│   ├── models/                # Diffusion models (DDPM, DDIM, PLMS)
│   ├── modules/               # UNet, attention, encoders
│   └── data/                  # Dataset utilities
├── tools/                     # Utility tools
│   ├── data_generate.py       # Data generation
│   ├── ann_filter.py          # Annotation filtering
│   └── cmmd/                  # CMMD distance calculation
├── inference.py               # Inference script
├── finetune.py                # Fine-tuning script
├── run_with_submitit.py       # Distributed training launcher
├── trainer.py                 # Training logic
└── requirements.txt           # Dependencies

🚀 Installation

Requirements

Linux or macOS with Python ≥ 3.8
PyTorch ≥ 2.0 and torchvision
CUDA-capable GPU (recommended)
OpenCV ≥ 4.6

Setup

Clone the repository

git clone https://github.com/fantasioly/Instance-SDGOD.git
cd Instance-SDGOD

Create conda environment

conda create --name godiff python=3.8 -y
conda activate godiff

Install dependencies

pip install -r requirements.txt

Download pretrained models

Download the following models and place them in the pretrained/ folder:

Model	Source	Path
Stable Diffusion 1.5	Hugging Face	`pretrained/v1-5-pruned-emaonly.ckpt`
InstanceDiffusion	Hugging Face	`pretrained/instancediffusion_sd15.pth`
Grounding DINO	GitHub	Config + Checkpoint
SAM	Facebook	`sam_vit_h_4b8939.pth`
RAM	Hugging Face	`ram_swin_large_14m.pth`

📊 Dataset Preparation

Data Format

The project uses JSON format for training data. Each JSON file contains:

{
    "caption": "Global image caption",
    "width": 512,
    "height": 512,
    "file_name": "image_001.jpg",
    "image": "base64_encoded_image",
    "annos": [
        {
            "bbox": [x, y, width, height],
            "caption": "Instance caption from BLIP",
            "category_name": "car",
            "mask": {"counts": "RLE_encoded_mask", "size": [512, 512]},
            "text_embedding_before": "base64_encoded_CLIP_embedding",
            "blip_clip_embeddings": "base64_encoded_BLIP_CLIP_embedding"
        }
    ]
}

Generate Training Data

Use the provided script to generate annotated training data:

python dataset-generation/generate.py \
    --config path/to/grounding_dino_config.py \
    --ram_checkpoint path/to/ram_checkpoint.pth \
    --grounded_checkpoint path/to/grounding_dino_checkpoint.pth \
    --sam_checkpoint path/to/sam_checkpoint.pth \
    --train_data_path path/to/train_data.json \
    --output_dir outputs/training_data \
    --box_threshold 0.25 \
    --text_threshold 0.2

🎯 Usage

Inference

Generate images with instance-level control:

python inference.py \
    --num_images 8 \
    --output OUTPUT/ \
    --input_json demos/demo_example.json \
    --ckpt pretrained/instancediffusion_sd15.pth \
    --test_config configs/test_box.yaml \
    --guidance_scale 7.5 \
    --alpha 0.8 \
    --seed 0 \
    --mis 0.36 \
    --cascade_strength 0.4

Key Parameters

Parameter	Description	Default
`--num_images`	Number of images to generate	8
`--guidance_scale`	CFG scale for generation	7.5
`--alpha`	Percentage of timesteps using grounding inputs	0.75
`--mis`	Multi-instance sampler ratio	0.36
`--cascade_strength`	SDXL Refiner strength (0 to disable)	0.35
`--test_config`	Condition type config	`test_mask.yaml`

Condition Types

Box: configs/test_box.yaml - Bounding box conditions
Mask: configs/test_mask.yaml - Segmentation mask conditions
Point: configs/test_point.yaml - Single point conditions
Scribble: configs/test_scribble.yaml - Scribble/curve conditions

Training

Single GPU Training

python finetune.py \
    --yaml_file configs/train_sd15.yaml \
    --official_ckpt_name pretrained/v1-5-pruned-emaonly.ckpt \
    --train_file dataset/your_train_data.txt \
    --batch_size 2 \
    --base_learning_rate 5e-5 \
    --total_iters 500000

Multi-GPU Distributed Training

python run_with_submitit.py \
    --workers 8 \
    --ngpus 4 \
    --nodes 1 \
    --batch_size 2 \
    --base_learning_rate 5e-5 \
    --yaml_file configs/train_sd15.yaml \
    --official_ckpt_name pretrained/v1-5-pruned-emaonly.ckpt \
    --train_file dataset/your_train_data.txt

🔧 GoDiff Integration

GoDiff can be integrated with existing object detectors as a data augmentation module:

PTDG Module

The PTDG module generates pseudo-target domain data:

# Generate styled images with consistent annotations
python tools/data_generate.py \
    --source_domain path/to/source_data \
    --target_style weather_conditions.json \
    --output_dir outputs/pseudo_target

📁 Input JSON Format for Inference

{
    "caption": "A street scene with cars and pedestrians",
    "width": 512,
    "height": 512,
    "annos": [
        {
            "bbox": [100, 150, 200, 100],
            "caption": "a red car driving on the road",
            "category_name": "car",
            "mask": [],
            "point": [150, 200]
        },
        {
            "bbox": [300, 200, 80, 150],
            "caption": "a person walking on the sidewalk",
            "category_name": "person",
            "mask": [],
            "point": [340, 275]
        }
    ]
}

🛠️ Tools

Annotation Filtering

Filter low-quality generated samples:

python tools/ann_filter.py \
    --input_dir outputs/generated_data \
    --output_dir outputs/filtered_data \
    --clip_threshold 0.25

Dataset Statistics

Analyze dataset distribution:

python tools/dataset_statistics.py --data_path dataset/train.json

📚 Citation

If you find this work useful, please cite:

@article{li2026godiff,
    title = {Object Style Diffusion for Generalized Object Detection in Urban Scene},
    author = {Hao Li and Xiangyuan Yang and Mengzhu Wang and Long Lan and Ke Liang and Xinwang Liu and Kenli Li},
    journal = {Pattern Recognition},
    year = {2026},
    publisher = {Elsevier}
}

🙏 Acknowledgments

InstanceDiffusion - Base framework
Stable Diffusion - Diffusion model
Grounding DINO - Object detection
Segment Anything - Segmentation
RAM - Image tagging
BLIP-2 - Image captioning

📄 License

This project is licensed under the Apache License 2.0. Portions of this project are available under separate license terms (CLIP, BLIP, Stable Diffusion, GLIGEN).

📧 Contact

For questions and issues, please open an issue on GitHub or contact the authors.

Note: This repository is released for academic and research purposes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GoDiff: Object Style Diffusion for Generalized Object Detection in Urban Scene

📋 Overview

Key Features

🏗️ Project Structure

🚀 Installation

Requirements

Setup

📊 Dataset Preparation

Data Format

Generate Training Data

🎯 Usage

Inference

Key Parameters

Condition Types

Training

Single GPU Training

Multi-GPU Distributed Training

🔧 GoDiff Integration

PTDG Module

📁 Input JSON Format for Inference

🛠️ Tools

Annotation Filtering

Dataset Statistics

📚 Citation

🙏 Acknowledgments

📄 License

📧 Contact

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

GoDiff: Object Style Diffusion for Generalized Object Detection in Urban Scene

📋 Overview

Key Features

🏗️ Project Structure

🚀 Installation

Requirements

Setup

📊 Dataset Preparation

Data Format

Generate Training Data

🎯 Usage

Inference

Key Parameters

Condition Types

Training

Single GPU Training

Multi-GPU Distributed Training

🔧 GoDiff Integration

PTDG Module

📁 Input JSON Format for Inference

🛠️ Tools

Annotation Filtering

Dataset Statistics

📚 Citation

🙏 Acknowledgments

📄 License

📧 Contact