
GoDiff: Object Style Diffusion for Generalized Object Detection in Urban Scene


This is the official implementation of GoDiff, a novel diffusion-driven framework for Single-Domain Generalized Object Detection (SDGOD) in urban autonomous driving scenarios. The code is built upon InstanceDiffusion.

📋 Overview

GoDiff addresses the challenge of domain generalization in object detection by employing dual-level augmentation:

  • Image Level: PTDG (Pseudo Target Data Generation) module generates diverse pseudo-domain images while preserving precise annotations using a dual-prompt strategy
  • Feature Level: CSN (Cross-Style Normalization) technique enhances domain-invariant learning through cross-domain style interchange

Key Features

  • πŸ”₯ Novel dual-level (image & feature) augmentation for generalized object detection
  • 🎯 PTDG module generates diverse styled pseudo-domains with consistent annotations
  • πŸ“ˆ Achieves state-of-the-art performance on autonomous driving benchmarks
  • πŸ”§ GoDiff enhances existing SDG methods and object detectors as a general-purpose module

πŸ—οΈ Project Structure

Instance-SDGOD-master/
├── configs/                    # Configuration files
│   ├── train_sd15.yaml        # Training configuration
│   ├── test_box.yaml          # Box condition testing
│   ├── test_mask.yaml         # Mask condition testing
│   ├── test_point.yaml        # Point condition testing
│   └── test_scribble.yaml     # Scribble condition testing
├── dataset-generation/         # Dataset generation tools
│   ├── generate.py            # Main generation script (Grounding DINO + SAM + RAM + BLIP)
│   ├── create_img_caption.py  # Caption creation
│   └── ram/                   # Recognize Anything Model
├── ldm/                       # Latent Diffusion Model core
│   ├── models/                # Diffusion models (DDPM, DDIM, PLMS)
│   ├── modules/               # UNet, attention, encoders
│   └── data/                  # Dataset utilities
├── tools/                     # Utility tools
│   ├── data_generate.py       # Data generation
│   ├── ann_filter.py          # Annotation filtering
│   └── cmmd/                  # CMMD distance calculation
├── inference.py               # Inference script
├── finetune.py                # Fine-tuning script
├── run_with_submitit.py       # Distributed training launcher
├── trainer.py                 # Training logic
└── requirements.txt           # Dependencies

🚀 Installation

Requirements

  • Linux or macOS with Python β‰₯ 3.8
  • PyTorch β‰₯ 2.0 and torchvision
  • CUDA-capable GPU (recommended)
  • OpenCV β‰₯ 4.6

Setup

  1. Clone the repository
git clone https://github.com/fantasioly/Instance-SDGOD.git
cd Instance-SDGOD
  2. Create conda environment
conda create --name godiff python=3.8 -y
conda activate godiff
  3. Install dependencies
pip install -r requirements.txt
  4. Download pretrained models

Download the following models and place them in the pretrained/ folder:

| Model | Source | Path |
|---|---|---|
| Stable Diffusion 1.5 | Hugging Face | pretrained/v1-5-pruned-emaonly.ckpt |
| InstanceDiffusion | Hugging Face | pretrained/instancediffusion_sd15.pth |
| Grounding DINO | GitHub | Config + Checkpoint |
| SAM | Facebook | sam_vit_h_4b8939.pth |
| RAM | Hugging Face | ram_swin_large_14m.pth |

📊 Dataset Preparation

Data Format

The project uses JSON format for training data. Each JSON file contains:

{
    "caption": "Global image caption",
    "width": 512,
    "height": 512,
    "file_name": "image_001.jpg",
    "image": "base64_encoded_image",
    "annos": [
        {
            "bbox": [x, y, width, height],
            "caption": "Instance caption from BLIP",
            "category_name": "car",
            "mask": {"counts": "RLE_encoded_mask", "size": [512, 512]},
            "text_embedding_before": "base64_encoded_CLIP_embedding",
            "blip_clip_embeddings": "base64_encoded_BLIP_CLIP_embedding"
        }
    ]
}
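As a sanity check on this schema, a record can be assembled and round-tripped with the standard library alone. The sketch below uses placeholder bytes for the image and embedding payloads (the real files hold JPEG data and CLIP/BLIP vectors; the helper name is mine):

```python
import base64
import json

def make_record(file_name, width, height, caption, annos):
    """Assemble one training record in the JSON format described above.
    The image payload here is placeholder bytes, base64-encoded the same
    way real JPEG data would be (assumption for illustration)."""
    return {
        "caption": caption,
        "width": width,
        "height": height,
        "file_name": file_name,
        "image": base64.b64encode(b"<jpeg bytes>").decode("ascii"),
        "annos": annos,
    }

anno = {
    "bbox": [100, 150, 200, 100],  # [x, y, width, height]
    "caption": "a red car",
    "category_name": "car",
    "mask": {"counts": "RLE_encoded_mask", "size": [512, 512]},
    # Placeholder embedding bytes, base64-encoded like the real vectors:
    "text_embedding_before": base64.b64encode(b"\x00" * 8).decode("ascii"),
    "blip_clip_embeddings": base64.b64encode(b"\x00" * 8).decode("ascii"),
}
record = make_record("image_001.jpg", 512, 512, "A street scene", [anno])
payload = json.dumps(record)  # serializes cleanly, one record per file
```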

Generate Training Data

Use the provided script to generate annotated training data:

python dataset-generation/generate.py \
    --config path/to/grounding_dino_config.py \
    --ram_checkpoint path/to/ram_checkpoint.pth \
    --grounded_checkpoint path/to/grounding_dino_checkpoint.pth \
    --sam_checkpoint path/to/sam_checkpoint.pth \
    --train_data_path path/to/train_data.json \
    --output_dir outputs/training_data \
    --box_threshold 0.25 \
    --text_threshold 0.2

🎯 Usage

Inference

Generate images with instance-level control:

python inference.py \
    --num_images 8 \
    --output OUTPUT/ \
    --input_json demos/demo_example.json \
    --ckpt pretrained/instancediffusion_sd15.pth \
    --test_config configs/test_box.yaml \
    --guidance_scale 7.5 \
    --alpha 0.8 \
    --seed 0 \
    --mis 0.36 \
    --cascade_strength 0.4

Key Parameters

| Parameter | Description | Default |
|---|---|---|
| --num_images | Number of images to generate | 8 |
| --guidance_scale | CFG scale for generation | 7.5 |
| --alpha | Percentage of timesteps using grounding inputs | 0.75 |
| --mis | Multi-instance sampler ratio | 0.36 |
| --cascade_strength | SDXL Refiner strength (0 to disable) | 0.35 |
| --test_config | Condition type config | test_mask.yaml |

Condition Types

  • Box: configs/test_box.yaml - Bounding box conditions
  • Mask: configs/test_mask.yaml - Segmentation mask conditions
  • Point: configs/test_point.yaml - Single point conditions
  • Scribble: configs/test_scribble.yaml - Scribble/curve conditions

Training

Single GPU Training

python finetune.py \
    --yaml_file configs/train_sd15.yaml \
    --official_ckpt_name pretrained/v1-5-pruned-emaonly.ckpt \
    --train_file dataset/your_train_data.txt \
    --batch_size 2 \
    --base_learning_rate 5e-5 \
    --total_iters 500000

Multi-GPU Distributed Training

python run_with_submitit.py \
    --workers 8 \
    --ngpus 4 \
    --nodes 1 \
    --batch_size 2 \
    --base_learning_rate 5e-5 \
    --yaml_file configs/train_sd15.yaml \
    --official_ckpt_name pretrained/v1-5-pruned-emaonly.ckpt \
    --train_file dataset/your_train_data.txt

🔧 GoDiff Integration

GoDiff can be integrated with existing object detectors as a data augmentation module:

PTDG Module

The PTDG module generates pseudo-target domain data:

# Generate styled images with consistent annotations
python tools/data_generate.py \
    --source_domain path/to/source_data \
    --target_style weather_conditions.json \
    --output_dir outputs/pseudo_target
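The exact dual-prompt logic lives in tools/data_generate.py; conceptually, PTDG re-synthesizes each source image under a target-style global prompt while the instance annotations are carried over unchanged. A minimal sketch of that bookkeeping (the style list and function name are illustrative, not the repository's API):

```python
import copy

STYLES = ["foggy", "rainy", "night", "snowy"]  # example pseudo-target styles

def pseudo_target_records(source_record, styles=STYLES):
    """For each target style, pair a style-conditioned global prompt with
    the untouched instance annotations (a sketch of PTDG's dual-prompt
    idea, not the repository's exact implementation)."""
    out = []
    for style in styles:
        rec = copy.deepcopy(source_record)
        # Global prompt carries the style; instance prompts stay as-is.
        rec["caption"] = f"{source_record['caption']}, {style} scene"
        # rec["annos"] is deliberately unchanged: boxes, masks, and
        # instance captions remain valid for the re-synthesized image.
        out.append(rec)
    return out
```

Because the layout conditions are reused verbatim, the generated pseudo-target images inherit precise annotations for free, which is what lets the augmented data train a detector directly.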

πŸ“ Input JSON Format for Inference

{
    "caption": "A street scene with cars and pedestrians",
    "width": 512,
    "height": 512,
    "annos": [
        {
            "bbox": [100, 150, 200, 100],
            "caption": "a red car driving on the road",
            "category_name": "car",
            "mask": [],
            "point": [150, 200]
        },
        {
            "bbox": [300, 200, 80, 150],
            "caption": "a person walking on the sidewalk",
            "category_name": "person",
            "mask": [],
            "point": [340, 275]
        }
    ]
}
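Such a request can be built programmatically. In the sample above, the person's point sits at its bbox center, so the helper below (names are mine, not the repository's) falls back to the center of the [x, y, w, h] box when no point is given:

```python
def center_point(bbox):
    """Default point prompt at the bbox center ([x, y, w, h] format)."""
    x, y, w, h = bbox
    return [x + w / 2, y + h / 2]

def make_anno(bbox, caption, category, point=None):
    """One entry of the "annos" list in the inference JSON above."""
    return {
        "bbox": bbox,
        "caption": caption,
        "category_name": category,
        "mask": [],  # empty unless a mask condition is supplied
        "point": point if point is not None else center_point(bbox),
    }

request = {
    "caption": "A street scene with cars and pedestrians",
    "width": 512,
    "height": 512,
    "annos": [
        make_anno([300, 200, 80, 150],
                  "a person walking on the sidewalk", "person"),
    ],
}
# → the "person" point defaults to the bbox center [340.0, 275.0]
```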

πŸ› οΈ Tools

Annotation Filtering

Filter low-quality generated samples:

python tools/ann_filter.py \
    --input_dir outputs/generated_data \
    --output_dir outputs/filtered_data \
    --clip_threshold 0.25
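The filtering criterion suggested by --clip_threshold is an image-text similarity cutoff. A stripped-down sketch of that pattern is below; the per-instance score field name is hypothetical (ann_filter.py computes the actual CLIP scores itself):

```python
def filter_annotations(records, threshold=0.25, score_key="clip_score"):
    """Keep instances whose precomputed image-text similarity clears the
    threshold; drop records left with no instances. `score_key` is a
    hypothetical field name, not one from ann_filter.py."""
    kept = []
    for rec in records:
        annos = [a for a in rec.get("annos", [])
                 if a.get(score_key, 0.0) >= threshold]
        if annos:
            kept.append({**rec, "annos": annos})
    return kept
```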

Dataset Statistics

Analyze dataset distribution:

python tools/dataset_statistics.py --data_path dataset/train.json
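For the JSON schema described earlier, the core of such an analysis is a per-category histogram over the "annos" lists. A minimal stand-alone version (tools/dataset_statistics.py is the repository's own, fuller script):

```python
from collections import Counter

def category_histogram(records):
    """Count category_name occurrences across training records in the
    JSON format described above (illustrative helper, name is mine)."""
    counts = Counter()
    for rec in records:
        counts.update(a["category_name"] for a in rec.get("annos", []))
    return counts
```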

📚 Citation

If you find this work useful, please cite:

@article{li2026godiff,
    title = {Object Style Diffusion for Generalized Object Detection in Urban Scene},
    author = {Hao Li and Xiangyuan Yang and Mengzhu Wang and Long Lan and Ke Liang and Xinwang Liu and Kenli Li},
    journal = {Pattern Recognition},
    year = {2026},
    publisher = {Elsevier}
}

πŸ™ Acknowledgments

📄 License

This project is licensed under the Apache License 2.0. Portions of this project are available under separate license terms (CLIP, BLIP, Stable Diffusion, GLIGEN).

📧 Contact

For questions and issues, please open an issue on GitHub or contact the authors.


Note: This repository is released for academic and research purposes.
