MIC-DKFZ/semantic_segmentation

TL;DR: An easy-to-use, Hydra-based framework for reproducible semantic segmentation with state-of-the-art models and benchmark-ready training, evaluation, and inference pipelines.

This repository provides a modular and reproducible framework for semantic segmentation, with an emphasis on benchmark-quality training, evaluation, and comparison across multiple datasets. It is designed to be usable out of the box, while remaining fully transparent and configurable, enabling experiments to be reproduced, extended, and adapted with minimal effort. The framework follows a configuration-driven design, allowing seamless integration of additional models, datasets, optimizers, learning rate schedulers, loss functions, metrics, and data augmentation pipelines.

Model architectures such as the High-Resolution Network (HRNet), Object-Contextual Representations (OCR), FCN, and DeepLabv3 are supported. Several widely used semantic segmentation benchmarks, including Cityscapes, PASCAL-Context, ADE20K, COCO-Stuff, and LIP, are fully integrated, with preprocessing scripts, reference configurations, and reproducible baseline results provided. New datasets and experiments can be added easily via configuration templates.

This repository builds on a set of widely used and well-maintained libraries for deep learning and experiment management:

  • PyTorch: The core deep learning framework used for model implementation and training.
  • Lightning: A high-level framework that structures training code and enables scalable, device-agnostic, and distributed training with minimal boilerplate.
  • Hydra: A flexible configuration framework for composing, overriding, and managing experiments via YAML files and the command line.
  • Albumentations: A fast and expressive library for image-based data augmentation, used here for semantic segmentation pipelines.
  • Torchmetrics: A collection of efficient and distributed-ready evaluation metrics with native integration into PyTorch Lightning.

Installation

git clone https://github.com/MIC-DKFZ/semantic_segmentation.git
cd semantic_segmentation
pip install -e ./

Note: PyTorch may require manual installation depending on your OS/GPU. Follow the official instructions at pytorch.org.
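For example, a CPU-only build can usually be installed like this (use the selector on the PyTorch website to get the command matching your CUDA version):

pip install torch torchvision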

Quick Start

1. Download model weights

  • Download the pretrained weights for hrnet/ocr here.
  • Extract the weights into your pretrained_weights/ directory.

2. (Optional) Define output directory

  • Set output_dir in config/environment/local.yaml. The default is the logs/ folder (see the sketch below).
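A minimal sketch of the relevant entry (any additional keys depend on the template):

# config/environment/local.yaml
output_dir: /path/to/logs  # where runs, checkpoints, and logs are stored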

3. Setup Data

3.1 Benchmark Dataset

  • Download the dataset and run the preprocessing scripts (see here).
  • Adapt the corresponding path in config/environment/local.yaml.

3.2 Custom Dataset

  • Copy config/dataset/_template_.yaml to config/dataset/my_dataset.yaml.
  • Fill out the template and define how to find, load, and split the data.
  • Copy config/experiment/_template_.yaml to config/experiment/my_experiment.yaml.
  • Fill out the template to override hyperparameters and define the augmentation policy (see the commands below).
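For example, from the repository root:

cp config/dataset/_template_.yaml config/dataset/my_dataset.yaml
cp config/experiment/_template_.yaml config/experiment/my_experiment.yaml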

4. Train

# Benchmark Experiment
python src/semantic_segmentation/train.py experiment=cityscapes/baseline
python src/semantic_segmentation/train.py experiment=ade20k/baseline
python src/semantic_segmentation/train.py experiment=coco_stuff/baseline
python src/semantic_segmentation/train.py experiment=lip/baseline
python src/semantic_segmentation/train.py experiment=pascal_context/baseline
# Custom Experiment
python src/semantic_segmentation/train.py experiment=my_experiment

5. Evaluate

  • The checkpoint directory will look like this: output_dir/Dataset_Name/Model_Name/Experiment_ID/Run_ID
# Evaluate a trained run
python src/semantic_segmentation/eval.py ckpt_dir=...

6. Inference

# Predict segmentation masks for a folder of images
python src/semantic_segmentation/predict.py ckpt_dir=... input_dir=... output_dir=...

Benchmark Results

Cityscapes

| Model | Experiment | mIoU (w/o TTA) | mIoU (w/ TTA) | Experiment Config |
|-------|------------|----------------|---------------|-------------------|
| HRNet | baseline | 82.06 | 82.88 | cityscapes/baseline |
| HRNet | sampling | 82.37 | 82.70 | cityscapes/sampling |
| HRNet | RMI | 81.97 | 82.96 | cityscapes/RMI |
| HRNet | RMI+sampling | 82.61 | 83.38 | cityscapes/sampling_RMI |
| OCR | baseline | 82.14 | 83.30 | cityscapes/baseline_ocr |
| OCR | sampling | 82.57 | 83.70 | cityscapes/sampling_ocr |
| OCR | RMI | 82.94 | 83.83 | cityscapes/RMI_ocr |
| OCR | RMI+sampling | 82.95 | 84.14 | cityscapes/sampling_RMI_ocr |

PASCAL-Context

| Model | Experiment | mIoU (w/o TTA) | mIoU (w/ TTA) | Experiment Config |
|-------|------------|----------------|---------------|-------------------|
| HRNet | baseline | 55.04 | 55.97 | pascal_context/baseline |
| HRNet | sampling | 55.04 | 55.90 | pascal_context/sampling |
| HRNet | RMI | 54.82 | 55.75 | pascal_context/RMI |
| HRNet | RMI+sampling | 54.54 | 55.67 | pascal_context/sampling_RMI |
| OCR | baseline | 57.24 | 58.38 | pascal_context/baseline_ocr |
| OCR | sampling | 57.62 | 58.83 | pascal_context/sampling_ocr |
| OCR | RMI | 57.77 | 59.01 | pascal_context/RMI_ocr |
| OCR | RMI+sampling | 57.94 | 59.14 | pascal_context/sampling_RMI_ocr |

ADE20K

| Model | Experiment | mIoU (w/o TTA) | mIoU (w/ TTA) | Experiment Config |
|-------|------------|----------------|---------------|-------------------|
| HRNet | baseline | 45.61 | 47.52 | ade20k/baseline |
| HRNet | sampling | 46.62 | 48.53 | ade20k/sampling |
| HRNet | RMI | 47.28 | 48.53 | ade20k/RMI |
| HRNet | RMI+sampling | 48.22 | 50.17 | ade20k/sampling_RMI |
| OCR | baseline | 47.81 | 48.76 | ade20k/baseline_ocr |
| OCR | sampling | 49.28 | 50.99 | ade20k/sampling_ocr |
| OCR | RMI | 49.14 | 50.45 | ade20k/RMI_ocr |
| OCR | RMI+sampling | 49.86 | 51.69 | ade20k/sampling_RMI_ocr |

COCO Stuff

| Model | Experiment | mIoU (w/o TTA) | mIoU (w/ TTA) | Experiment Config |
|-------|------------|----------------|---------------|-------------------|
| HRNet | baseline | 42.38 | 43.22 | coco_stuff/baseline |
| HRNet | sampling | 43.42 | 44.31 | coco_stuff/sampling |
| HRNet | RMI | 42.22 | 42.86 | coco_stuff/RMI |
| HRNet | RMI+sampling | 42.94 | 43.62 | coco_stuff/sampling_RMI |
| OCR | baseline | 44.24 | 45.06 | coco_stuff/baseline_ocr |
| OCR | sampling | 45.39 | 46.11 | coco_stuff/sampling_ocr |
| OCR | RMI | 44.89 | 45.47 | coco_stuff/RMI_ocr |
| OCR | RMI+sampling | 45.85 | 46.51 | coco_stuff/sampling_RMI_ocr |

LIP

| Model | Experiment | mIoU (w/o TTA) | mIoU (w/ TTA) | Experiment Config |
|-------|------------|----------------|---------------|-------------------|
| HRNet | baseline | 56.05 | --- | lip/baseline |
| HRNet | sampling | 56.05 | --- | lip/sampling |
| HRNet | RMI | 58.29 | --- | lip/RMI |
| HRNet | RMI+sampling | 57.95 | --- | lip/sampling_RMI |
| OCR | baseline | 56.35 | --- | lip/baseline_ocr |
| OCR | sampling | 56.25 | --- | lip/sampling_ocr |
| OCR | RMI | 58.53 | --- | lip/RMI_ocr |
| OCR | RMI+sampling | 58.35 | --- | lip/sampling_RMI_ocr |

All models are initialized with PaddleClas-pretrained HRNet weights. Apart from the pretrained weights and the designated training split of the corresponding dataset, no additional data is used. Training is performed using stochastic gradient descent (SGD) with momentum and a polynomial learning rate scheduler. All experiments are repeated three times, and the best result is reported. Test-time augmentation (TTA) is performed using multi-scale evaluation with scales [0.5,0.75,1.0,1.25,1.5,1.75,2.0] and image flipping.

  • Cityscapes: The initial learning rate is set to 0.01, the weight decay to 0.0005, the patch size to 512 × 1024, and the batch size to 12, with training conducted for 100K iterations using random cropping.
  • PASCAL-Context: The initial learning rate is set to 0.004, the weight decay to 0.0001, the patch size to 512 × 512, and the batch size to 16, with training conducted for 50K iterations using direct resizing.
  • ADE20K: The initial learning rate is set to 0.004, the weight decay to 0.0001, the patch size to 512 × 512, and the batch size to 16, with training conducted for 150K iterations using short-side resizing followed by random cropping.
  • COCO Stuff: The initial learning rate is set to 0.001, the weight decay to 0.0001, the patch size to 520 × 520, and the batch size to 16, with training conducted for 150K iterations using direct resizing.
  • LIP: The initial learning rate is set to 0.007, the weight decay to 0.0005, the patch size to 473 × 473, and the batch size to 40, with training conducted for 110K iterations using direct resizing.
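All of these settings live in the Hydra experiment configs, so they can also be overridden on the command line. The key names below are assumptions, inferred from the run-directory naming shown under Logging and Checkpointing (e.g. __epochs_300__lr_0.04); check the experiment configs before relying on them:

python src/semantic_segmentation/train.py experiment=cityscapes/baseline epochs=300 lr=0.04  # assumed key names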
Full Results

Each cell reports the three repeated training runs (run 1 / run 2 / run 3).

| Model | Dataset | Experiment | mIoU (best, w/o TTA) | mIoU (best, w/ TTA) | mIoU (last, w/o TTA) | mIoU (last, w/ TTA) |
|---|---|---|---|---|---|---|
| HRNet | Cityscapes | baseline | 81.24 / 82.06 / 81.86 | 82.03 / 82.88 / 82.77 | 82.01 / 82.06 / 81.86 | 82.58 / 82.88 / 82.77 |
| HRNet | Cityscapes | sampling | 81.96 / 82.37 / 82.00 | 82.70 / 82.66 / 82.50 | 82.16 / 82.37 / 81.90 | 82.61 / 82.66 / 82.23 |
| HRNet | Cityscapes | RMI | 81.57 / 81.92 / 81.67 | 82.86 / 82.85 / 82.56 | 81.74 / 81.97 / 81.68 | 82.96 / 82.88 / 82.60 |
| HRNet | Cityscapes | RMI+sampling | 81.84 / 80.73 / 81.89 | 83.15 / 81.71 / 82.52 | 82.01 / 81.63 / 82.61 | 83.38 / 82.92 / 83.20 |
| OCR | Cityscapes | baseline | 81.65 / 81.88 / 81.75 | 83.13 / 83.07 / 82.98 | 81.65 / 82.14 / 82.09 | 83.13 / 83.30 / 83.18 |
| OCR | Cityscapes | sampling | 82.37 / 81.63 / 82.42 | 83.39 / 82.96 / 83.70 | 82.25 / 82.43 / 82.57 | 83.48 / 83.49 / 83.63 |
| OCR | Cityscapes | RMI | 82.56 / 82.27 / 82.46 | 83.78 / 83.52 / 83.50 | 82.84 / 82.94 / 82.55 | 83.80 / 83.83 / 83.70 |
| OCR | Cityscapes | RMI+sampling | 82.79 / 82.95 / 82.52 | 84.07 / 83.95 / 83.89 | 82.95 / 82.95 / 82.52 | 84.14 / 83.87 / 83.89 |
| HRNet | PASCAL-Context | baseline | 54.60 / 54.57 / 54.96 | 55.52 / 55.38 / 55.93 | 54.68 / 54.62 / 55.04 | 55.56 / 55.52 / 55.97 |
| HRNet | PASCAL-Context | sampling | 54.59 / 55.04 / 54.68 | 55.61 / 55.90 / 55.45 | 54.67 / 54.92 / 54.64 | 55.72 / 55.79 / 55.42 |
| HRNet | PASCAL-Context | RMI | 54.46 / 54.40 / 54.33 | 55.27 / 55.49 / 55.21 | 54.73 / 54.50 / 54.82 | 55.75 / 55.61 / 55.65 |
| HRNet | PASCAL-Context | RMI+sampling | 54.54 / 53.89 / 54.52 | 55.28 / 54.97 / 55.67 | 54.50 / 54.23 / 54.49 | 55.24 / 55.18 / 55.30 |
| OCR | PASCAL-Context | baseline | 57.21 / 56.27 / 57.21 | 58.33 / 57.53 / 58.35 | 57.21 / 56.63 / 57.24 | 58.33 / 57.91 / 58.38 |
| OCR | PASCAL-Context | sampling | 57.27 / 57.62 / 57.38 | 58.21 / 58.77 / 58.37 | 57.41 / 57.54 / 57.41 | 58.35 / 58.39 / 58.83 |
| OCR | PASCAL-Context | RMI | 57.41 / 57.57 / 57.25 | 58.68 / 58.88 / 57.86 | 57.77 / 57.54 / 57.66 | 59.01 / 58.96 / 58.70 |
| OCR | PASCAL-Context | RMI+sampling | 57.94 / 57.63 / 57.35 | 59.14 / 58.62 / 58.78 | 57.71 / 57.64 / 56.74 | 59.10 / 58.79 / 58.53 |
| HRNet | ADE20K | baseline | 45.04 / 45.45 / 45.20 | 47.17 / 47.18 / 47.50 | 45.24 / 45.61 / 44.97 | 47.44 / 47.37 / 47.52 |
| HRNet | ADE20K | sampling | 46.30 / 46.12 / 46.36 | 48.34 / 48.53 / 48.01 | 46.24 / 46.04 / 46.62 | 48.16 / 48.43 / 48.17 |
| HRNet | ADE20K | RMI | 46.36 / 46.80 / 46.86 | 47.95 / 48.21 / 48.07 | 46.54 / 46.96 / 47.28 | 48.53 / 48.06 / 48.37 |
| HRNet | ADE20K | RMI+sampling | 47.86 / 48.06 / 47.46 | 49.87 / 50.20 / 49.55 | 48.00 / 48.22 / 47.39 | 50.17 / 50.06 / 49.35 |
| OCR | ADE20K | baseline | 47.22 / 47.15 / 46.44 | 48.00 / 48.41 / 47.27 | 47.56 / 47.81 / 47.28 | 48.76 / 48.72 / 48.48 |
| OCR | ADE20K | sampling | 48.88 / 47.81 / 49.14 | 50.39 / 49.94 / 50.99 | 48.92 / 48.32 / 49.28 | 50.29 / 50.27 / 50.98 |
| OCR | ADE20K | RMI | 49.14 / 47.60 / 47.56 | 50.12 / 48.63 / 49.18 | 48.99 / 48.24 / 47.91 | 49.74 / 50.45 / 49.83 |
| OCR | ADE20K | RMI+sampling | 49.07 / 49.86 / 49.37 | 50.26 / 51.22 / 51.69 | 49.48 / 48.68 / 49.26 | 51.16 / 50.37 / 51.66 |
| HRNet | COCO Stuff | baseline | 41.94 / 42.38 / 42.36 | 42.79 / 43.22 / 43.22 | 42.20 / 42.27 / 42.28 | 43.19 / 43.18 / 43.11 |
| HRNet | COCO Stuff | sampling | 43.13 / 43.30 / 43.21 | 44.19 / 44.22 / 44.08 | 43.17 / 43.42 / 43.26 | 44.17 / 44.31 / 44.11 |
| HRNet | COCO Stuff | RMI | 41.96 / 42.11 / 42.15 | 42.69 / 42.70 / 42.83 | 41.96 / 42.22 / 42.22 | 42.67 / 42.85 / 42.86 |
| HRNet | COCO Stuff | RMI+sampling | 42.49 / 42.62 | 42.98 / 43.21 | 42.94 / 42.91 | 43.62 / 43.62 |
| OCR | COCO Stuff | baseline | 43.97 / 44.15 / 43.93 | 44.63 / 45.01 / 44.67 | 44.24 / 44.12 / 44.02 | 45.06 / 45.03 / 44.78 |
| OCR | COCO Stuff | sampling | 45.11 / 45.00 / 44.84 | 45.91 / 45.96 / 45.30 | 45.02 / 44.91 / 45.39 | 45.94 / 45.90 / 46.11 |
| OCR | COCO Stuff | RMI | 43.97 / 44.70 / 44.79 | 44.51 / 45.21 / 45.47 | 44.58 / 44.89 / 44.84 | 45.24 / 45.28 / 45.44 |
| OCR | COCO Stuff | RMI+sampling | 45.59 / 45.78 | 46.02 / 46.50 | 45.85 / 45.67 | 46.41 / 46.51 |
| HRNet | LIP | baseline | 55.39 / 56.05 / 55.72 | --- | 55.37 / 55.81 / 55.59 | --- |
| HRNet | LIP | sampling | 55.94 / 55.40 / 55.74 | --- | 56.05 / 55.54 / 55.74 | --- |
| HRNet | LIP | RMI | 58.25 / 57.85 / 58.05 | --- | 58.17 / 58.06 / 58.29 | --- |
| HRNet | LIP | RMI+sampling | 57.62 / 57.33 / 57.79 | --- | 57.85 / 57.55 / 57.95 | --- |
| OCR | LIP | baseline | 56.05 / 56.03 / 56.35 | --- | 55.85 / 56.07 / 56.34 | --- |
| OCR | LIP | sampling | 55.75 / 56.16 / 56.25 | --- | 55.83 / 56.12 / 56.23 | --- |
| OCR | LIP | RMI | 58.51 / 57.95 / 57.80 | --- | 58.53 / 58.32 / 57.95 | --- |
| OCR | LIP | RMI+sampling | 58.35 / 57.91 / 57.69 | --- | 58.31 / 57.78 / 58.02 | --- |

Note: best denotes the checkpoint with the highest pseudo-validation score during training, while last corresponds to the final training epoch.

To reproduce these results:

python src/semantic_segmentation/train.py experiment=<experiment> model=<model>
# e.g. Cityscapes
python src/semantic_segmentation/train.py experiment=cityscapes/baseline
python src/semantic_segmentation/train.py experiment=cityscapes/baseline_ocr

Models

  1. Download the pretrained weights and place them in the pretrained_weights/ directory. By default, we use PaddleClas weights, as they consistently outperform the standard ImageNet weights.
  2. If you prefer ImageNet-pretrained weights, update the pretrained field in the model config file or set it directly on the command line: model.pretrained=ImageNet
  3. To disable pretrained weights entirely, use model.pretrained=None (see the examples below).
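For instance:

# Use ImageNet weights instead of PaddleClas
python src/semantic_segmentation/train.py experiment=cityscapes/baseline model.pretrained=ImageNet
# Train from randomly initialized weights
python src/semantic_segmentation/train.py experiment=cityscapes/baseline model.pretrained=None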
| Model | Download Weights | Config File |
|---|---|---|
| HRNet (Source, Paper) | Source, ImageNet, PaddleClas | hrnet.yaml |
| OCR (Source, Paper) | ⇑ Use HRNet weights above ⇑ | ocr.yaml |
| FCN (Source) | --- | fcn.yaml |
| DeepLabv3 (Source) | --- | deeplabv3.yaml |

Benchmark Datasets

Preprocessing Instructions:

  1. Download the required files for each dataset and place them together in a single folder.
  2. Unzip all files — note that some datasets may contain nested ZIP files, which must also be unzipped.
  3. Run the preprocessing script corresponding to each dataset: python preprocess_{dataset}.py input_dir={input_dir} output_dir={output_dir}
    • --input_dir, -i: Path to the folder containing the downloaded and extracted files.
    • --output_dir, -o: Path to the folder where the preprocessed dataset will be saved.
    • Note: preprocessing scripts are located in src/semantic_segmentation/datasets/preprocessing
  4. After preprocessing, each dataset will follow the structure below:
    dataset_name
    ├── images
    │   ├── train
    │   └── val
    └── labels
        ├── train
        └── val
    
  5. (Optional) Delete the raw data; it is no longer needed.
  6. Set the root_data argument in the config file (config/dataset/{dataset_name}.yaml) to the output_dir used in the preprocessing script, as in the example below.
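For example, for Cityscapes (paths are placeholders):

python src/semantic_segmentation/datasets/preprocessing/preprocess_cityscapes.py input_dir=/data/raw/cityscapes output_dir=/data/cityscapes

# afterwards, in config/dataset/cityscapes.yaml
root_data: /data/cityscapes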
| Dataset | Download | Config File | Preprocessing Script | Size (train/val/(test)) | Classes | Image Size (height/width) |
|---|---|---|---|---|---|---|
| Cityscapes (Source, Paper) | Images, Labels | cityscapes.yaml | preprocess_cityscapes.py | 2,975 / 500 | 19 | 1024 / 2048 |
| PASCAL-Context (Source1, Source2, Paper) | Images, Labels | pascal_context.yaml | preprocess_pascal_context.py | 4,998 / 5,105 | 59 | 71-500 / 142-500 |
| ADE20K (Source, Paper) | Images+Labels | ade20k.yaml | preprocess_ade20k.py | 20,210 / 2,000 | 150 | 96-2100 / 130-2100 |
| COCO Stuff (Source, Paper) | Images (train), Images (val), Labels | coco_stuff.yaml | preprocess_cocostuff.py | 118,287 / 5,000 | 171 | 51-640 / 59-640 |
| LIP (Source, Paper) | Images (trainVal_Images.zip), Labels (TrainVal_parsing_annotations.zip) | lip.yaml | preprocess_lip.py | 30,462 / 10,000 | 20 | 36-640 / 21-640 |
| DACL10k (Source, Paper) | Images+Labels | dacl10k.yaml | preprocess_dacl10k.py | 6,935 / 975 / 1,012 | 19 | --- |

Training:

python src/semantic_segmentation/train.py experiment=ade20k/baseline
python src/semantic_segmentation/train.py experiment=cityscapes/baseline
python src/semantic_segmentation/train.py experiment=coco_stuff/baseline
python src/semantic_segmentation/train.py experiment=lip/baseline
python src/semantic_segmentation/train.py experiment=pascal_context/baseline

Configuration Overview

An overview of all available config groups and their options, and how to apply them:

| Config Group | Options (YAML files) |
|---|---|
| logger | tensorboard, wandb |
| model | hrnet, ocr, fcn, deeplabv3 |
| optimizer | SGD, ADAM, ADAMW, MADGRAD |
| lr_scheduler | PolynomialLR, CosineAnnealing, CosineAnnealing_step |
| loss | CE, wCE, RMI, wRMI |
| metric | IoU, Dice |
| dataset | cityscapes, ade20k, coco_stuff, lip, pascal_context |
| hydra | default, quiet |

python src/semantic_segmentation/train.py model=ocr loss=RMI metric=Dice ...

Logging and Checkpointing

Results and checkpoints are stored under output_dir/Dataset_Name/Model_Name/Experiment_ID/Run_ID, and a typical run directory has the following structure:

logs/Cityscapes/hrnet/experiment_cityscapes.baseline__epochs_300__lr_0.04/2025-12-10_15-32-48_TTXAfd9y
    ├── .hydra                                  # Hydra configuration
    │     ├── config.yaml                       # Final composed config for this run
    │     ├── hydra.yaml                        # Hydra-specific configuration
    │     └── overrides.yaml                    # List of overrides used for this run
    ├── checkpoints                             # Model checkpoints
    │     ├── best_epoch200_IoU_epoch_0.79.ckpt # Checkpoint with the best validation metric
    │     └── last_epoch299_IoU_epoch_0.78.ckpt # Checkpoint from the final training epoch
    ├── eval_2025-12-11_15-15-20_7uFjCvpS       # Optional evaluation run
    │     ├── eval.log                          # Evaluation log file
    │     ├── events.out.tfevents....           # TensorBoard log
    │     ├── hparams.yaml
    │     ├── metric_log_test.jsonl             # Logged test metrics
    │     └── test_results.csv                  # Metrics per input file
    ├── eval_...                                # Additional evaluation runs
    │     └── ...
    ├── events.out.tfevents....                 # TensorBoard log
    ├── hparams.yaml                            # Hyperparameters for this training run
    ├── metric_log_val.jsonl                    # Validation metrics logged each epoch
    └── train.log                               # Training log file

To view the TensorBoard logs, run:

tensorboard --logdir=<ckpt/dir>

Evaluation and Inference

Output directories follow the structure:

ckpt_dir = output_dir/Dataset_Name/Model_Name/Experiment_ID/Run_ID
# Example
ckpt_dir = logs/Cityscapes/hrnet/experiment_cityscapes.baseline__epochs_300__lr_0.04/2025-12-10_15-32-48_TTXAfd9y

Evaluation
You can run evaluation by passing the checkpoint directory (replace <ckpt/dir> with the full path to the output directory of your training run):

python src/semantic_segmentation/eval.py ckpt_dir=<ckpt/dir>
# Specify which checkpoint to use
python src/semantic_segmentation/eval.py ckpt_dir=<ckpt/dir> inference.ckpt_type=last
python src/semantic_segmentation/eval.py ckpt_dir=<ckpt/dir> inference.ckpt_type=best
# Disable TTA and patch-wise inference
python src/semantic_segmentation/eval.py ckpt_dir=<ckpt/dir> inference.use_patch_inference=False
python src/semantic_segmentation/eval.py ckpt_dir=<ckpt/dir> inference.use_tta=False

Inference
Prediction works similarly, but additionally requires specifying an input directory and an output directory:

python src/semantic_segmentation/predict.py ckpt_dir=<ckpt/dir> input_dir=<input/dir> output_dir=<output/dir>
# Specify which checkpoint to use
python src/semantic_segmentation/predict.py ckpt_dir=<ckpt/dir> input_dir=<input/dir> output_dir=<output/dir> inference.ckpt_type=last  # Default
python src/semantic_segmentation/predict.py ckpt_dir=<ckpt/dir> input_dir=<input/dir> output_dir=<output/dir> inference.ckpt_type=best
# Enable/disable TTA and patch-wise inference
python src/semantic_segmentation/predict.py ckpt_dir=<ckpt/dir> input_dir=<input/dir> output_dir=<output/dir> inference.use_patch_inference=False  # Default
python src/semantic_segmentation/predict.py ckpt_dir=<ckpt/dir> input_dir=<input/dir> output_dir=<output/dir> inference.use_patch_inference=True
python src/semantic_segmentation/predict.py ckpt_dir=<ckpt/dir> input_dir=<input/dir> output_dir=<output/dir> inference.use_tta=True # Default
python src/semantic_segmentation/predict.py ckpt_dir=<ckpt/dir> input_dir=<input/dir> output_dir=<output/dir> inference.use_tta=False

Acknowledgements

This repository is developed and maintained by the Applied Computer Vision Lab (ACVL) of Helmholtz Imaging.

This repository was generated with copier using the napari-plugin-template.
