This project implements a Stable Diffusion model from scratch using PyTorch. Stable Diffusion is a latent diffusion model that generates high-quality images from text prompts. The implementation includes all core components: VAE (Variational Autoencoder), CLIP (Contrastive Language-Image Pretraining), U-Net with attention, and the diffusion process.
stable diffusion model/
├── README.txt # This file: Project overview and file descriptions
├── stable diffusion model notes.docx # Detailed notes on the implementation
├── data/ # Model weights and tokenizer data
│ ├── v1-5-pruned-emaonly.ckpt # Stable Diffusion v1.5 model weights (main checkpoint)
│ ├── hollie-mengert.ckpt # Alternative model checkpoint (possibly fine-tuned)
│ ├── tokenizer_vocab.json # CLIP tokenizer vocabulary file
│ └── tokenizer_merges.txt # CLIP tokenizer byte-pair encoding merges
├── image/ # Sample images for testing
│ ├── 1.png # Input sample image
│ └── generated_image_1.png # Generated output image
├── images/ # Directory for additional generated images (currently empty)
└── sd/ # Main implementation directory
├── encoder.py # VAE Encoder: Compresses images into latent space
├── decoder.py # VAE Decoder: Reconstructs images from latent space
├── attention.py # Attention mechanisms: Self-attention and cross-attention
├── clip.py # CLIP Text Encoder: Processes text prompts into embeddings
├── diffusion.py # Diffusion process: Forward (noise addition) and reverse (denoising)
├── ddpm.py # DDPM Sampler: Denoising Diffusion Probabilistic Model sampling algorithm
├── model_converter.py # Model Converter: Utilities for loading and converting model weights
├── model_loader.py # Model Loader: Functions to preload models from checkpoints
├── pipeline.py # Pipeline: High-level interface for text-to-image and image-to-image generation
├── demo.ipynb # Demo Notebook: Jupyter notebook demonstrating model usage
├── add_noise.ipynb # Noise Addition Notebook: Experiments with noise addition
└── __pycache__/ # Python bytecode cache (auto-generated)
VAE (Variational Autoencoder):
- Encoder (encoder.py): Compresses 512x512 RGB images into 64x64 latent representations
- Decoder (decoder.py): Reconstructs images from the latent space
- Uses residual blocks with group normalization and attention
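The 8x spatial compression described above (512x512 RGB down to a 64x64 latent with 4 channels, the layout Stable Diffusion v1.5 uses) can be illustrated with a toy stack of strided convolutions. This is only a shape sketch; the real encoder.py is far deeper, with residual blocks, group norm, and attention:

```python
import torch
import torch.nn as nn

# Toy sketch of the VAE encoder's downsampling path (NOT the real encoder.py):
# three stride-2 convolutions take 512x512 down to 64x64, ending in 4 channels.
toy_encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),   # 512 -> 256
    nn.SiLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 256 -> 128
    nn.SiLU(),
    nn.Conv2d(64, 4, kernel_size=3, stride=2, padding=1),   # 128 -> 64
)

image = torch.randn(1, 3, 512, 512)   # dummy RGB image batch
latent = toy_encoder(image)
print(latent.shape)                   # torch.Size([1, 4, 64, 64])
```

Working in this 4x64x64 latent space (48x fewer values than the 3x512x512 pixel space) is what makes diffusion tractable on consumer hardware.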
CLIP (Contrastive Language-Image Pretraining):
- Text Encoder (clip.py): Converts text prompts into embeddings
- Tokenizer (data files): Splits text input into tokens using the vocabulary and byte-pair encoding merges
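As a rough sketch of what the text encoder produces: the vocabulary size (49408), context length (77 tokens), and embedding width (768) below are the CLIP ViT-L/14 values that Stable Diffusion v1.5 uses; the actual classes in clip.py add transformer layers on top of these embeddings:

```python
import torch
import torch.nn as nn

# Sketch of the CLIP text encoder's input stage (not the project's clip.py API):
# token ids are mapped to 768-dim vectors and given learned position offsets.
vocab_size, context_len, embed_dim = 49408, 77, 768
token_embedding = nn.Embedding(vocab_size, embed_dim)
position_embedding = nn.Parameter(torch.zeros(context_len, embed_dim))

token_ids = torch.randint(0, vocab_size, (1, context_len))  # dummy prompt tokens
hidden = token_embedding(token_ids) + position_embedding    # (1, 77, 768)
print(hidden.shape)
```

The (1, 77, 768) tensor that comes out of the full text encoder is what the U-Net's cross-attention layers consume.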
U-Net with Attention:
- Attention (attention.py): Self-attention for spatial relationships, cross-attention for text-image fusion
- Integrated into the encoder and decoder for feature refinement
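A minimal single-head sketch of the cross-attention that fuses text into image features (attention.py's real classes add multi-head projections and output layers; the shapes assume a 64x64 latent flattened to 4096 spatial tokens and text features already projected to the same width):

```python
import torch
import torch.nn.functional as F

# Cross-attention sketch: image latents supply the queries, text embeddings
# the keys and values, so each spatial position attends over the prompt.
d = 64
image_tokens = torch.randn(1, 4096, d)   # 64x64 positions, flattened
text_tokens = torch.randn(1, 77, d)      # text features (projected to width d)

scores = image_tokens @ text_tokens.transpose(-1, -2) / d ** 0.5  # (1, 4096, 77)
weights = F.softmax(scores, dim=-1)      # each row sums to 1 over the 77 tokens
fused = weights @ text_tokens            # (1, 4096, d) text-conditioned features
print(fused.shape)
```

Self-attention is the same computation with queries, keys, and values all drawn from the image tokens.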
Diffusion Process:
- Diffusion (diffusion.py): Implements the forward process (adding noise) and the reverse process (removing noise)
- DDPM (ddpm.py): Sampling algorithm for generating images from noise
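The forward process admits a closed form: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with abar_t the cumulative product of (1 - beta). A sketch using the "scaled linear" beta schedule commonly associated with Stable Diffusion v1.5 (the exact schedule in ddpm.py may differ):

```python
import torch

# DDPM forward-process sketch: noise a clean latent x0 directly to timestep t.
# Schedule values (0.00085 -> 0.0120 over 1000 steps, squared-linspace) are
# the commonly cited SD v1.5 training settings, assumed here.
T = 1000
betas = torch.linspace(0.00085 ** 0.5, 0.0120 ** 0.5, T) ** 2
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # decreases toward 0 as t grows

x0 = torch.randn(1, 4, 64, 64)    # clean latent
eps = torch.randn_like(x0)        # Gaussian noise
t = 500
xt = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps
print(xt.shape)
```

The reverse process learned by the U-Net undoes this one step at a time: it predicts eps from x_t, and the sampler in ddpm.py uses that prediction to step toward x_0.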
Pipeline:
- Pipeline (pipeline.py): Orchestrates text-to-image and image-to-image generation
- Combines all components for end-to-end image generation
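A toy version of the loop the pipeline orchestrates: encode the prompt, iteratively denoise a random latent, then hand the result to the VAE decoder. All names here are stand-ins, not pipeline.py's actual API:

```python
import torch

# Stand-in for the U-Net: the real model predicts the noise present in
# `latent` at timestep t, conditioned on the text embeddings.
def toy_denoiser(latent, text_emb, t):
    return 0.1 * latent

text_emb = torch.randn(1, 77, 768)        # from the CLIP text encoder
latent = torch.randn(1, 4, 64, 64)        # start from pure Gaussian noise
for t in reversed(range(0, 1000, 100)):   # a coarse 10-step schedule
    eps = toy_denoiser(latent, text_emb, t)
    latent = latent - eps                 # stand-in for the DDPM update rule
# The VAE decoder would then map (1, 4, 64, 64) -> a (1, 3, 512, 512) image.
print(latent.shape)
```

Image-to-image follows the same loop, except the starting latent is an encoded input image with partial noise added rather than pure noise.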
Getting Started:
- Install required dependencies (PyTorch, transformers, etc.)
- Ensure model weights are in the data/ directory
- Run demo.ipynb to see the model in action
- Use pipeline.py for programmatic generation
Notes:
- This is an educational implementation, not optimized for production
- Model weights are from the official Stable Diffusion v1.5 release
- The implementation follows the original paper and the Hugging Face diffusers library structure