
🧠 DDIM Face Generation

A Denoising Diffusion Implicit Model trained from scratch on 30,000 faces — no pretrained weights, no diffusers library. Pure PyTorch.



🖼️ Results — 100 Epochs on CelebA-HQ 64×64

Generated faces at 100 epochs

Faces generated from pure Gaussian noise — no post-processing


🚀 Live Demo

Demo features:

  • ✨ Generate — sample new faces from pure noise with adjustable DDIM steps
  • 🎞️ Trajectory — animated GIF showing the full denoising path (noise → face)
  • 🔀 Interpolate — spherical linear interpolation (slerp) between two faces
  • 📖 How it works — full architecture & training breakdown at the bottom of the page
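The interpolation mode blends two noise latents with slerp rather than a straight line, which keeps the interpolated latent on the Gaussian shell the model was trained on. A minimal sketch (illustrative; the repo's own helper in `visualize.py` may differ):

```python
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation between two noise tensors, t in [0, 1]."""
    z0_flat, z1_flat = z0.flatten(), z1.flatten()
    # Angle between the two latents on the unit sphere
    omega = torch.acos(torch.clamp(
        torch.dot(z0_flat / z0_flat.norm(), z1_flat / z1_flat.norm()),
        -1.0 + 1e-7, 1.0 - 1e-7))
    so = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) / so) * z0 \
         + (torch.sin(t * omega) / so) * z1
```

Each intermediate latent is then denoised with the same DDIM loop as a normal sample, yielding a smooth morph between the two endpoint faces.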

⚙️ Technical Details

| Component | Details |
| --- | --- |
| Architecture | U-Net with sinusoidal time embeddings + multi-head self-attention |
| Channels | [64, 128, 256, 256] |
| Parameters | 25.6M |
| Dataset | CelebA-HQ — 30,000 aligned faces at 64×64 |
| Training | 100 epochs, ~40 hours, Apple Silicon MPS (no cloud GPU) |
| Sampler | DDIM — 20 steps vs. DDPM's 1000 (50× speedup) |
| Noise schedule | Linear β: 1×10⁻⁴ → 0.02, T = 1000 |
| Inference weights | EMA (exponential moving average of training weights) |
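The schedule and sampler rows above can be sketched in a few lines. This is a hedged illustration of the standard linear-β schedule and the deterministic (η = 0) DDIM update, not necessarily the exact code in `diffusion.py`:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear β schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # ᾱ_t = ∏_{s≤t} α_s

def ddim_step(x_t: torch.Tensor, eps: torch.Tensor,
              t: int, t_prev: int) -> torch.Tensor:
    """One deterministic DDIM update; t and t_prev are Python ints."""
    ab_t = alpha_bar[t]
    ab_prev = alpha_bar[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    # Predicted clean image from the ε-prediction
    x0_pred = (x_t - (1.0 - ab_t).sqrt() * eps) / ab_t.sqrt()
    # Jump directly to the previous (possibly much earlier) timestep
    return ab_prev.sqrt() * x0_pred + (1.0 - ab_prev).sqrt() * eps
```

Because each step re-derives `x0_pred` and jumps deterministically, the sampler can stride through a 20-step subsequence of the 1000 training timesteps instead of visiting every one.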

🏗️ Architecture

Input x_t (noisy image) + timestep t
            │
    ┌───────▼────────┐
    │  Time Embedding │  Sinusoidal → MLP → injected at every ResBlock
    └───────┬────────┘
            │
    ┌───────▼────────┐
    │    U-Net       │  4 resolution levels
    │                │  Self-attention at 8×8 and 16×16
    │  Down → Mid    │  GroupNorm + SiLU throughout
    │       → Up     │  Zero-init output conv (identity at init)
    └───────┬────────┘
            │
    predicted ε (noise)

Training objective: L = ||ε − ε_θ(√ᾱₜ x₀ + √(1−ᾱₜ) ε, t)||²
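The objective above translates directly into a training step: sample a random timestep, noise the clean image with the closed-form forward process, and regress the model's ε-prediction onto the true noise. A minimal sketch, assuming an ε-predicting `model(x_t, t)` (names are illustrative):

```python
import torch
import torch.nn.functional as F

T = 1000
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

def diffusion_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """L = ||ε − ε_θ(√ᾱ_t x0 + √(1−ᾱ_t) ε, t)||²"""
    t = torch.randint(0, T, (x0.shape[0],))           # random timestep per image
    eps = torch.randn_like(x0)                        # target noise
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps    # forward process q(x_t | x0)
    return F.mse_loss(model(x_t, t), eps)
```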


📁 Project Structure

minidiffusion/
├── models/
│   ├── attention.py     # Multi-head self-attention (2D spatial)
│   ├── unet.py          # Full U-Net with time embeddings
│   └── diffusion.py     # DDPM training + DDIM sampling + EMA + AdamW
├── utils/
│   ├── dataset.py       # CelebA-HQ dataloader
│   └── visualize.py     # Trajectory GIF, interpolation grid
├── train.py             # Training loop — W&B logging, auto-resume
├── sample.py            # Inference — grid, trajectory, interpolation, compare
├── app.py               # Gradio demo UI
└── config.py            # All hyperparameters

🔧 Built From Scratch

Every component is hand-written — no diffusers, no guided-diffusion, no pretrained encoders:

attention.py · unet.py · diffusion.py · dataset.py · train.py

Notable engineering decisions:

  • Custom CPU-resident AdamW — works around an MPS NaN bug in PyTorch 2.3.1 where zero-grad parameters corrupt optimizer state, and saves ~2 GB of GPU memory as a side effect
  • EMA shadow on CPU — keeps a smoothed copy of the weights off the GPU, saving another ~1 GB
  • MPS-safe DDIM indexing — tensor indexing with MPS buffers returns garbage in some PyTorch builds; fixed by using Python ints throughout the sampling loop
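The CPU-resident EMA shadow can be sketched as follows. This is an illustrative pattern, not the repo's exact class: it averages everything in `state_dict()` (buffers included) and keeps the shadow on the CPU so it costs no GPU memory.

```python
import torch

class EMA:
    """CPU-resident exponential moving average of model weights (sketch)."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # Shadow copy lives on the CPU, off the training device.
        self.shadow = {k: v.detach().float().cpu().clone()
                       for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        for k, v in model.state_dict().items():
            # shadow ← decay·shadow + (1 − decay)·current
            self.shadow[k].mul_(self.decay).add_(
                v.detach().float().cpu(), alpha=1.0 - self.decay)

    def copy_to(self, model: torch.nn.Module) -> None:
        """Load the smoothed weights into a model for inference."""
        model.load_state_dict(self.shadow)
```

Calling `update()` once per optimizer step and `copy_to()` on a fresh model at inference time gives the smoothed weights the README's "Inference weights: EMA" row refers to.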

🏃 Run Locally

git clone https://github.com/Gh-Novel/DDIM_Image_Generation.git
cd DDIM_Image_Generation
pip install -r requirements.txt

# Run the Gradio demo (uses bundled checkpoint)
python app.py

# Or generate samples directly
python sample.py --ckpt checkpoints/stage-64_best.pt --num 16 --steps 50

# Train from scratch on your own data
python train.py --image-size 64 --epochs 100 --run-name my-run
