Official implementation of MultiDiffSense, a unified ControlNet-based diffusion model that generates realistic, physically grounded tactile sensor images across three sensor types (ViTac, TacTip, ViTacTip) from a single model. Generation is conditioned on CAD-derived depth maps and structured text prompts encoding contact pose and sensor modality.
- Overview
- Quick Start (Pre-trained Model)
- Repository Structure
- Installation
- Dataset Preparation
- Training
- Testing & Evaluation
- Ablation Studies
- Baseline Comparison (cGAN / Pix2Pix)
- Example Data
- Citation
- Acknowledgements
MultiDiffSense leverages ControlNet (built on Stable Diffusion 1.5) to translate depth map renderings of 3D objects into realistic tactile sensor images across three sensor modalities:
- TacTip -- Optical tactile sensor with pin-based deformation markers
- ViTac -- Vision-based tactile sensor (no markers)
- ViTacTip -- Hybrid vision-tactile sensor
The model is conditioned on:
- Depth maps (rendered from STL files) as spatial control signals
- Text prompts describing the 4-DOF contact pose and target sensor type
Generate tactile images directly using the pre-trained checkpoint (conditioned on short prompts + depth maps) from Hugging Face -- no training required. The checkpoint is downloaded automatically on first run.
pip install huggingface_hub
# Option 1: From a single depth map + text prompt:
python multidiffsense/controlnet/generate.py \
--source_image path/to/depth_map.png \
--prompt '{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}'
# Option 2: From a prompt file (batch) -- each line contains a depth map path and prompt:
python multidiffsense/controlnet/generate.py \
--dataset_dir datasets \
--prompt_json datasets/test/prompt_ViTacTip.json
Note: for Option 2, each line in the prompt file is a JSON object that specifies the depth map path (relative to --dataset_dir), the text prompt, and, for training/testing, the target image path (omitted at inference time, when there is no target image):
{"source": "source/1_0.png", "target": "target/1_ViTacTip_0.png", "prompt": {"sensor_context": "captured by a high-resolution vision only sensor ViTac.", "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}}For training from scratch, dataset preparation, evaluation, and ablation studies, see the sections below.
MultiDiffSense/
|-- cldm/ # ControlNet modules (at repo root)
| |-- cldm.py # ControlledUNet, ControlNet, ControlLDM
| |-- ddim_hacked.py # DDIM sampler for ControlNet
| |-- hack.py # CLIP and attention hacks
| |-- logger.py # Image logging callback
| |-- loss_plotter.py # Training loss visualisation callback
| +-- model.py # Model creation and checkpoint loading
|
|-- configs/ # Configuration files
| |-- controlnet_train.yaml # Training config (short prompts)
| +-- controlnet_train_long_prompt.yaml # Training config (long prompts, ablation 2)
|
|-- ldm/ # Latent Diffusion Model core (from CompVis/stable-diffusion)
| |-- data/ # Data utilities
| |-- models/
| | |-- diffusion/ # DDPM, DDIM, PLMS samplers
| | +-- autoencoder.py # VQ-VAE encoder/decoder
| +-- modules/ # UNet, attention, encoders, EMA, distributions
|
|-- multidiffsense/ # Core source code
| |-- controlnet/ # ControlNet training/testing scripts
| | |-- train.py # Training (supports --no_prompt / --no_source ablation)
| | |-- test.py # Testing with quantitative metrics
| | |-- generate.py # Inference-only generation
| | +-- data_loader.py # Dataset class for ControlNet
| |
| |-- baseline_cgan/ # Pix2Pix (cGAN) baseline
| | |-- train.py # cGAN training
| | |-- test.py # cGAN testing with same metrics
| | |-- dataset_converter.py # Convert ControlNet format -> Pix2Pix format
| | +-- README.md # Baseline-specific setup instructions
| |
| |-- data_preparation/ # Dataset building pipeline
| | |-- all_processing.py # Orchestrator: run full pipeline per object
| | |-- source_processing.py # Render depth maps from STL (target-driven alignment)
| | |-- target_processing.py # Rename + resize tactile sensor images
| | |-- prompt_creation.py # Generate prompt.json (short or long style)
| | |-- ds_creation.py # Assemble mega dataset (merge per-object datasets)
| | |-- dataset_split.py # Train/val/test splitting (70/15/15)
| | +-- modality_split.py # Split prompts by sensor modality for evaluation
| |
| +-- evaluation/ # Evaluation utilities
| +-- metrics.py # SSIM, PSNR, MSE, LPIPS, FID computation
|
|-- data/ # Raw data directory (user-populated)
| +-- example/ # Minimal example dataset
| |-- stl/ # STL mesh files: <obj_id>.stl
| |-- csv/ # Per-object pose CSV: <obj_id>.csv
| +-- tactile/ # Tactile images per object/sensor
| +-- <obj_id>/
| |-- TacTip/target/
| |-- ViTac/target/
| +-- ViTacTip/target/
|
|-- datasets/ # Assembled dataset (generated by pipeline)
| |-- source/ # All depth maps (shared across splits)
| |-- target/ # All tactile images (shared across splits)
| |-- prompt.json # Merged short prompts
| |-- prompt_long.json # Merged long prompts (ablation 2)
| |-- train/
| | |-- prompt.json
| | +-- prompt_long.json
| |-- val/
| | |-- prompt.json
| | +-- prompt_long.json
| +-- test/
| |-- prompt.json
| |-- prompt_long.json
| |-- prompt_TacTip.json # Per-modality splits (from modality_split)
| |-- prompt_ViTac.json
| +-- prompt_ViTacTip.json
|
|-- models/ # Model checkpoints
| +-- cldm_v15.yaml # ControlNet + SD1.5 architecture config
|
|-- scripts/ # Shell scripts for common workflows
|-- figures/ # Figures included in README
|-- tool_add_control.py # Utility: create ControlNet init weights from SD1.5
|-- requirements.txt # Python dependencies
+-- README.md # This file
git clone https://github.com/sirine-b/MultiDiffSense.git
cd MultiDiffSense
conda env create -f environment.yml
conda activate multidiffsense

Alternatively, install the dependencies with pip:

git clone https://github.com/sirine-b/MultiDiffSense.git
cd MultiDiffSense
pip install -r requirements.txt

Download the Stable Diffusion v1.5 checkpoint and create the ControlNet initialisation weights:
bash scripts/prepare_model.sh

This will:
- Download v1-5-pruned.ckpt from Hugging Face
- Run tool_add_control.py to produce models/control_sd15_ini.ckpt
The full pipeline builds the training dataset from raw data in 4 steps.
Expected raw data structure:
data/example/
|-- stl/ # STL mesh files: <obj_id>.stl
|-- csv/ # Pose CSV files: <obj_id>.csv
+-- tactile/ # Tactile images per object/sensor
+-- <obj_id>/
|-- TacTip/target/
|-- ViTac/target/
+-- ViTacTip/target/
Process one or more objects end-to-end across all three sensor modalities:
python -m multidiffsense.data_preparation.all_processing \
--stl_dir data/example/stl \
--csv_dir data/example/csv \
--tactile_dir data/example/tactile \
--obj_ids 1

Processing order per object:
| Step | ViTac (1st) | TacTip (2nd) | ViTacTip (3rd) |
|---|---|---|---|
| Target processing | Rename + resize | Rename + resize | Rename + resize |
| Source processing | Generate from STL | Copy from ViTac | Copy from ViTac |
| Prompt creation | Generate from CSV | Generate from CSV | Generate from CSV |
Why ViTac first? Source (depth map) generation aligns each frame by extracting the object from the tactile image to determine its bounding box and centre position. ViTac images are vision-only with no pin markers on the sensor surface, making the object boundary much clearer and easier to segment than TacTip (pin markers) or ViTacTip (hybrid markers). Since the source depth maps represent the same object at the same pose regardless of sensor, they are generated once from ViTac and copied to the other two modalities.
The pipeline iterates only over frames that actually exist in the target directory (not the CSV row count), so missing or removed frames are handled gracefully.
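That frame-discovery step can be sketched as follows, assuming the <obj_id>_<sensor>_<frame>.png naming convention from target processing (existing_frames is a hypothetical helper, not a function in this repo):

```python
import re
from pathlib import Path

def existing_frames(target_dir, obj_id, sensor):
    """Collect frame indices from files actually present in target/,
    so frames missing from disk are skipped even if the CSV lists them."""
    pattern = re.compile(rf"{obj_id}_{sensor}_(\d+)\.png$")
    frames = []
    for p in Path(target_dir).iterdir():
        m = pattern.match(p.name)
        if m:
            frames.append(int(m.group(1)))
    return sorted(frames)
```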
Under the hood, this runs three sub-steps per sensor:
- Target processing (target_processing.py) -- renames raw tactile images to <obj_id>_<sensor>_<frame>.png and resizes to 512x512.
- Source processing (source_processing.py) -- uses target-driven alignment: segments the object in each target frame, then resizes, rotates, and positions the CAD depth map to match the target exactly. Uses Otsu's automatic thresholding (no per-object tuning). Only runs for ViTac; source images are copied to TacTip and ViTacTip.
- Prompt creation (prompt_creation.py) -- reads the per-object CSV and writes a JSONL prompt file. Supports --prompt_style short (default) or --prompt_style long for ablation studies (see Ablation Studies).
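To illustrate the Otsu-segmentation-plus-bounding-box idea behind the alignment step, here is a from-scratch NumPy sketch (not the code in source_processing.py, which runs a fuller pipeline):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the threshold maximising between-class variance."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = gray.size
    sum_all = np.dot(np.arange(256), hist)
    w0 = sum0 = 0.0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]                       # pixels at or below t
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        w1 = total - w0
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2    # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def object_bbox(gray):
    """Segment the (bright) object and return its bounding box (y0, y1, x0, x1)."""
    mask = gray > otsu_threshold(gray)
    ys, xs = np.nonzero(mask)
    return ys.min(), ys.max(), xs.min(), xs.max()
```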
Merge per-object datasets across all three sensor modalities into a single dataset:
python -m multidiffsense.data_preparation.ds_creation \
--tactile_dir data/example/tactile \
--output_dir datasets \
--object_ids 1 \
--sensors TacTip ViTac ViTacTip

This copies all source/target images into flat source/ and target/ directories and merges all per-object prompt.json files into one.
python -m multidiffsense.data_preparation.dataset_split \
--base_dir datasets \
--seed 16

Splits the merged prompt.json into train/, val/, test/ subdirectories (70/15/15). Groups by source image so all sensor modalities for the same contact stay in the same split. Images remain in the parent datasets/source/ and datasets/target/; only prompt files are placed in the split subdirectories.
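A minimal sketch of such a group-aware split (a hypothetical helper; dataset_split.py's exact shuffling and rounding may differ):

```python
import random

def split_by_source(entries, seed=16, ratios=(0.70, 0.15, 0.15)):
    """Split prompt entries 70/15/15, grouping by source image so that all
    sensor modalities of the same contact land in the same split."""
    groups = {}
    for e in entries:
        groups.setdefault(e["source"], []).append(e)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)        # deterministic for a fixed seed
    n = len(keys)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    def pick(ks):
        return [e for k in ks for e in groups[k]]
    return (pick(keys[:n_train]),
            pick(keys[n_train:n_train + n_val]),
            pick(keys[n_train + n_val:]))
```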
python -m multidiffsense.data_preparation.modality_split \
--prompt_path datasets/test/prompt.json \
--output_dir datasets/test

Creates prompt_TacTip.json, prompt_ViTac.json, and prompt_ViTacTip.json for per-sensor evaluation.
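Conceptually, the modality split only has to attribute each prompt entry to one sensor; a sketch assuming the sensor name can be read from the target filename (as in the examples above):

```python
SENSORS = ("ViTacTip", "ViTac", "TacTip")  # longest names first

def sensor_of(entry):
    """Infer the sensor modality of a prompt entry.

    Assumes the target filename follows <obj_id>_<sensor>_<frame>.png;
    longer names are checked first so 'ViTacTip' is not misread as 'ViTac'.
    """
    name = entry["target"]
    for s in SENSORS:
        if f"_{s}_" in name:
            return s
    raise ValueError(f"no sensor tag in {name!r}")
```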
Final dataset layout:
datasets/
|-- source/ # All depth maps (shared across splits)
|-- target/ # All tactile images (shared across splits)
|-- prompt.json # All samples
|-- train/prompt.json # Train split (prompt entries only, no images)
|-- val/prompt.json # Val split
+-- test/
|-- prompt.json
|-- prompt_TacTip.json
|-- prompt_ViTac.json
+-- prompt_ViTacTip.json
python multidiffsense/controlnet/train.py \
--config configs/controlnet_train.yaml \
--batch_size 8 \
--lr 1e-5 \
--max_epochs 150 \
--sd_locked

Key training parameters:
| Parameter | Default | Description |
|---|---|---|
| batch_size | 8 | Training batch size |
| lr | 1e-5 | Learning rate |
| max_epochs | 150 | Maximum training epochs |
| sd_locked | True | Freeze Stable Diffusion backbone |
| precision | 32 | Training precision |
| early_stop_patience | 10 | Early stopping patience |
Training logs and checkpoints are saved to results/lightning_logs/.
# Test on seen objects (test split)
python multidiffsense/controlnet/test.py \
--config configs/controlnet_train.yaml \
--checkpoint path/to/best_checkpoint.ckpt \
--modality ViTacTip \
--seen_objects \
--output_dir results/test_seen
# Test on unseen objects
python multidiffsense/controlnet/test.py \
--config configs/controlnet_train.yaml \
--checkpoint path/to/best_checkpoint.ckpt \
--modality TacTip \
--output_dir results/test_unseen

Reported metrics (computed per-image and aggregated): SSIM (Structural Similarity Index), PSNR (Peak Signal-to-Noise Ratio in dB), MSE (Mean Squared Error), LPIPS (Learned Perceptual Image Patch Similarity, AlexNet), and FID (Frechet Inception Distance).
Results are saved as a CSV file and visual grids (control | target | generated).
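SSIM, LPIPS, and FID require dedicated libraries, but the two pixel-space metrics can be illustrated with plain NumPy (a sketch, not the repository's metrics.py):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images with values in [0, 255]."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / err)
```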
All ablations involve retraining the model (not just test-time flag changes). Each ablation produces a separate checkpoint that is then evaluated.
Tests the contribution of each conditioning signal by training without it.
1a. Source only (no text prompt) -- train with empty prompts, depth map conditioning only:
# Train
python multidiffsense/controlnet/train.py \
--config configs/controlnet_train.yaml \
--no_prompt \
--output_suffix _no_prompt
# Test (use the no-prompt checkpoint, with --no_prompt to match)
python multidiffsense/controlnet/test.py \
--config configs/controlnet_train.yaml \
--checkpoint results_no_prompt/lightning_logs/.../best.ckpt \
--modality ViTacTip \
--no_prompt \
--output_dir results/ablation_no_prompt

1b. Prompt only (no depth map) -- train with blank source images, text prompt conditioning only:
# Train
python multidiffsense/controlnet/train.py \
--config configs/controlnet_train.yaml \
--no_source \
--output_suffix _no_source
# Test
python multidiffsense/controlnet/test.py \
--config configs/controlnet_train.yaml \
--checkpoint results_no_source/lightning_logs/.../best.ckpt \
--modality ViTacTip \
--no_source \
--output_dir results/ablation_no_source

The --output_suffix flag appends to the output directory (e.g. results_no_prompt/) so checkpoints from different ablations don't overwrite each other.
Tests whether richer text prompts improve generation quality.
Short prompt (default): sensor context + object pose.
{"sensor_context": "captured by a high-resolution vision only sensor ViTac.",
"object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}Long prompt: object description + contact description + sensor context + style tags + negatives + object pose.
{"object_description": "A edge-shaped object with distinct geometric features",
"contact_description": "Medium contact on the object surface with moderate indentation",
"sensor_context": "Captured by a high-resolution vision only sensor ViTac",
"style_tags": "High quality, detailed texture, realistic tactile response, sharp sensor reading",
"negatives": "Blurry, low quality, artifacts, noise, distortion",
"object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}Workflow -- generate both prompt types, then train separately:
# Step 1: Generate short prompts (default, already done in normal pipeline)
python -m multidiffsense.data_preparation.all_processing \
--stl_dir data/example/stl --csv_dir data/example/csv \
--tactile_dir data/example/tactile --obj_ids 1 \
--prompt_style short
# Step 2: Generate long prompts (saved as prompt_long.json alongside prompt.json)
python -m multidiffsense.data_preparation.all_processing \
--stl_dir data/example/stl --csv_dir data/example/csv \
--tactile_dir data/example/tactile --obj_ids 1 \
--prompt_style long
# Step 3: Assemble + split both
python -m multidiffsense.data_preparation.ds_creation \
--tactile_dir data/example/tactile --output_dir datasets \
--object_ids 1 --prompt_style short
python -m multidiffsense.data_preparation.ds_creation \
--tactile_dir data/example/tactile --output_dir datasets \
--object_ids 1 --prompt_style long
python -m multidiffsense.data_preparation.dataset_split --base_dir datasets --prompt_style short
python -m multidiffsense.data_preparation.dataset_split --base_dir datasets --prompt_style long
# Step 4: Train with short prompts (default config)
python multidiffsense/controlnet/train.py \
--config configs/controlnet_train.yaml
# Step 5: Train with long prompts (separate config pointing to prompt_long.json)
python multidiffsense/controlnet/train.py \
--config configs/controlnet_train_long_prompt.yaml

Short and long prompts coexist in the same dataset directory -- prompt.json and prompt_long.json sit side by side, sharing the same source/target images.
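Before reaching the CLIP text encoder, a structured prompt like the ones above has to be serialised into a single caption string. A hypothetical serialisation (the data loader's actual one may order or phrase fields differently, and would typically route negatives to the negative prompt rather than concatenating them):

```python
def prompt_to_caption(prompt):
    """Flatten a structured prompt dict into one caption string.

    Handles both short prompts (sensor_context + object_pose) and long
    prompts (extra description/style fields); missing fields are skipped.
    """
    pose = prompt["object_pose"]
    parts = [prompt.get(k) for k in ("object_description",
                                     "contact_description",
                                     "sensor_context",
                                     "style_tags")]
    parts = [p for p in parts if p]
    parts.append("pose x={x} y={y} z={z} yaw={yaw}".format(**pose))
    return ", ".join(parts)
```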
We compare against Pix2Pix as a conditional GAN baseline using the pytorch-CycleGAN-and-pix2pix framework.
git clone https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix external/pytorch-CycleGAN-and-pix2pix

python multidiffsense/baseline_cgan/dataset_converter.py \
--controlnet_dataset datasets \
--output_path external/pytorch-CycleGAN-and-pix2pix/datasets/depth_to_sensor \
--modality TacTip

cd external/pytorch-CycleGAN-and-pix2pix
python train.py \
--dataroot datasets/depth_to_sensor \
--name depth_to_sensor_experiment \
--model pix2pix \
--direction AtoB \
--n_epochs 200 \
--n_epochs_decay 100

bash scripts/test_pix2pix.sh

The test script computes the same metrics (SSIM, PSNR, MSE, LPIPS, FID) for fair comparison.
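The pix2pix aligned-dataset format stores each training sample as a single image with A (depth map) and B (tactile image) concatenated side by side; presumably this is the layout dataset_converter.py produces. A sketch of the core operation:

```python
import numpy as np

def make_aligned_pair(depth_rgb, tactile_rgb):
    """Concatenate A (depth map) and B (tactile image) horizontally,
    producing the single AB image pix2pix's aligned dataset mode expects."""
    assert depth_rgb.shape == tactile_rgb.shape, "A and B must match in size"
    return np.concatenate([depth_rgb, tactile_rgb], axis=1)
```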
Generate tactile images from depth maps without ground truth targets:
python multidiffsense/controlnet/generate.py \
--config configs/controlnet_train.yaml \
--checkpoint path/to/best_checkpoint.ckpt \
--dataset_dir datasets \
--prompt_json datasets/test/prompt_ViTacTip.json \
--output_dir results/generated

The data/example/ directory contains a minimal working example with:
- 1 object across 3 sensor modalities (TacTip, ViTac, ViTacTip)
- Per-object pose CSV in csv/<obj_id>.csv (tab-separated, 4-DOF pose)
- STL source file in stl/<obj_id>.stl
- Tactile images in tactile/<obj_id>/<sensor_type>/target/
To verify your installation, run the per-object pipeline on the example data:
python -m multidiffsense.data_preparation.all_processing \
--stl_dir data/example/stl \
--csv_dir data/example/csv \
--tactile_dir data/example/tactile \
--obj_ids 1

If you find this work useful, please cite:
@inproceedings{multidiffsense2026,
title = {MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose},
author = {Sirine Bhouri and Lan Wei and Jian-Qing Zheng and Dandan Zhang},
booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
year = {2026},
url = {https://arxiv.org/abs/2602.19348}
}

- ControlNet by Lvmin Zhang et al.
- Stable Diffusion by Rombach et al.
- pytorch-CycleGAN-and-pix2pix by Zhu et al.
This project is licensed under the MIT License -- see LICENSE for details.
