Input image reconstruction from features (small network)
- Video PPT Presentation and Code Demo: https://youtu.be/YzsRktf5dVI?si=K3JNTEeE23hOux2H
- Documentation: https://fau-my.sharepoint.com/:w:/g/personal/mmitayeegiri2024_fau_edu/IQC6R7wd59K5SbGaf26PDk2AAR5ckQtbKM0Hxw2__gFW8pc?e=aUlaMo
This project demonstrates how to reconstruct input images from deep CNN feature maps using a small decoder network.
It also includes a Streamlit-based user interface that allows you to upload or capture an image and view live reconstructions generated by the trained decoder.
Contents:
- Project Goals and Motivation
- Environment Setup
- Download Dataset
- Model Architecture
- Loss Function and Training Objective
- Training the Model
- Interactive UI
- Limitations and Future Work
This project investigates image reconstruction from intermediate convolutional features using a small decoder network. Instead of training a large end-to-end autoencoder, we freeze a lightweight convolutional encoder and learn only a compact decoder that maps feature maps back to the image domain. Concretely, we use a convolutional backbone to extract feature tensors of size $56\times56\times256$ from face images and train a decoder composed of upsampling convolutional stages and residual blocks to reconstruct $224\times224\times3$ RGB images. Training is performed on the CelebA-HQ face dataset (30,000 high-quality face images), a widely used benchmark for generative modeling and face synthesis.
The decoder is optimized with a pixel-space loss dominated by mean absolute error (MAE) and combined with an SSIM term, while evaluation uses a richer set of metrics: PSNR, SSIM, and LPIPS. SSIM provides a perceptually motivated measure of structural fidelity, and LPIPS compares deep feature activations from pretrained networks, which correlates better with human judgments than traditional distortion metrics. We conduct experiments on full-resolution face images and report both numerical scores and qualitative side-by-side reconstructions. Results show that even a relatively small decoder, trained only on frozen mid-level features, can recover the global structure and identity of faces, though fine details, contrast, and high-frequency textures remain challenging.
To make the system more accessible and reproducible, we package the full pipeline into a modular codebase with scripts for feature extraction, decoder training, and evaluation, along with a Streamlit user interface. The UI allows users to upload images or use a webcam, visualize encoder features, and view live reconstructions. We also provide a reproducibility checklist, detailed training logs, and fixed random seeds to ensure that experiments can be replicated under the specified environment. Overall, this project demonstrates a practical trade-off between model size and reconstruction quality, and serves as a compact, end-to-end example of feature-based image reconstruction suitable for course projects and future research extensions.
This project explores image reconstruction from intermediate CNN feature maps using a small, resource-efficient decoder network.
- The encoder is a frozen convolutional feature extractor.
- The decoder is a compact convolutional network trained to reconstruct the original $224\times224$ RGB image from encoder feature maps.
The project provides:
- A training pipeline (`train.py`) using Keras `model.fit`.
- A research-style evaluation script (`evaluate.py`) with MSE, PSNR, SSIM, and plots.
- A test script (`test_model.py`) for quick sanity checks.
- A Streamlit UI (`ui_app.py`) for:
  - Image upload reconstruction.
  - Optional live webcam reconstruction.
The focus is on using a small decoder under realistic compute and memory constraints (Apple Silicon, 16GB RAM), while maintaining reasonable reconstruction quality and providing a reproducible end-to-end workflow.
Project goals:
- Feature-to-Image Reconstruction: given an intermediate feature map from a CNN encoder, learn a decoder that can reconstruct the original $224\times224$ RGB image.
- Small Network Constraint: keep the decoder relatively small (hundreds of thousands of parameters, not tens of millions), to:
  - Run on constrained hardware (e.g., Apple Silicon).
  - Demonstrate that reasonable reconstructions are possible with limited capacity.
- End-to-End Workflow:
  - Robust training scripts (`train.py`).
  - Evaluation scripts (`evaluate.py`) with numerical metrics and visualizations.
  - A UI (`ui_app.py`) for interactive demos.
- Reproducibility and Documentation:
  - Pinned environment.
  - Clear dataset assumptions.
  - Correct handling of dataset cardinality and steps per epoch.
Feature-to-image reconstruction is closely related to interpretability: it provides intuition about how much information intermediate feature maps retain. Small decoders are relevant in:
- Low-resource deployment scenarios.
- Privacy-related questions (how much can be reconstructed from shared features).
- Educational contexts where hardware is limited.
The repository layout is:
CAP6415-Project-ImageReconstruction/
├── main.py # Optional CLI entry-point (train, evaluate)
├── requirements.txt # Pinned environment for macOS + Apple Silicon
├── README.md # Documentation
├── src/
│ ├── encoder.py # Builds frozen encoder (feature extractor)
│ ├── decoder.py # Builds small decoder network
│ ├── dataset.py # CelebA-HQ loader (224×224, [0,1])
│ ├── train.py # Training script (Keras model.fit)
│ ├── test_model.py # Sanity-check reconstruction script
│ └── evaluate.py # Evaluation: MSE/PSNR/SSIM + plots
├── app/
│ └── ui_app.py # Streamlit UI: upload + webcam reconstruction
├── dataset/
│ └── celeba_hq/ # CelebA-HQ images (30,000)
└── outputs/
└── evaluation/ # Metrics & plots from evaluate.py
Additional runtime directories:
- `src/models/decoder_checkpoints/`: stores trained decoder weights (e.g., `decoder_final.h5`).
- `outputs/eval_run*/`: stores evaluation metrics and figures.
The target environment:
- OS: macOS (Apple Silicon).
- CPU/GPU: Apple Silicon (e.g., Apple M5) with Metal acceleration.
- Python: 3.10.
- DL stack: `tensorflow-macos == 2.10.0` + `tensorflow-metal`.
Create and activate a dedicated environment:
conda create -n CV python=3.10 -y
conda activate CV

From the project root:

pip install -r requirements.txt

Key pinned packages:
- tensorflow-macos == 2.10.0
- tensorflow-metal == 0.7.0
- numpy == 1.23.5
- ml-dtypes == 0.2.0
- protobuf == 3.19.6
- opencv-python == 4.8.1.78
- scikit-image == 0.21.0
- matplotlib == 3.7.1
- streamlit == 1.22.0
- streamlit-webrtc
- altair == 4.2.2, vega-datasets == 0.9.0
- tqdm == 4.66.1
These versions avoid common compatibility issues such as NumPy / TensorFlow ABI mismatches and Protobuf descriptor errors.
The project uses CelebA-HQ, a dataset of high-quality face images. Place the images as:
CAP6415-Project-ImageReconstruction/dataset/celeba_hq/
00001.png
00002.png
...
(≈ 30,000 images)
No class subfolders are required; the dataset is treated as a single pool of face images.
The loader uses `tf.keras.utils.image_dataset_from_directory` to:
- Read all images under `dataset/celeba_hq/`.
- Resize each image to $224\times224$.
- Normalize to $[0, 1]$ (float32).
- Batch them (default batch size 8); a loader sketch follows below.
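The repository's `dataset.py` is the authoritative loader; the sketch below shows one way these steps could be implemented, reusing the `load_celeba_hq` name that `train.py` calls. The rescaling and target-pairing details are assumptions for illustration, not the repository's exact code.

```python
import tensorflow as tf

def load_celeba_hq(data_dir="dataset/celeba_hq", batch_size=8):
    """Load CelebA-HQ as an (input, target) dataset for reconstruction."""
    ds = tf.keras.utils.image_dataset_from_directory(
        data_dir,
        labels=None,            # one flat pool of images, no class labels
        image_size=(224, 224),  # resize on load
        batch_size=batch_size,
    )
    # Scale pixels to [0, 1] and pair each image with itself as the target.
    return ds.map(lambda x: (x / 255.0, x / 255.0),
                  num_parallel_calls=tf.data.AUTOTUNE)
```

Returning (image, image) pairs lets `model.fit` treat reconstruction as ordinary supervised training.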
To verify:
python src/dataset.py

You should see:
- "Found 30000 files belonging to 1 classes."
- A batch shape such as (4, 224, 224, 3).
- A pixel range of 0.0 to 1.0.
Given:
- $N = 30{,}000$ images.
- Batch size $B = 8$.

The number of steps (batches) in one full epoch is

$$\text{steps per epoch} = \frac{N}{B} = \frac{30000}{8} = 3750.$$

Thus, one epoch over the full dataset corresponds to 3750 training steps.
The encoder is a convolutional backbone used as a feature extractor:
- Input: $224\times224\times3$ RGB image.
- Output: feature map, typically $56\times56\times256$.
- During training: `encoder.trainable = False`.

It approximates a precomputed feature extractor for feature-to-image reconstruction.
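As a concrete illustration, a frozen extractor with exactly this output shape can be built by truncating VGG16 at `block3_conv3`, which emits $56\times56\times256$ features for a $224\times224$ input. This backbone choice is an assumption for illustration; `src/encoder.py` defines the actual architecture.

```python
import tensorflow as tf

def build_encoder():
    """Frozen feature extractor: 224x224x3 image -> 56x56x256 feature map."""
    base = tf.keras.applications.VGG16(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3)
    )
    # block3_conv3 outputs a 56x56x256 feature map for 224x224 inputs.
    encoder = tf.keras.Model(
        base.input, base.get_layer("block3_conv3").output, name="encoder"
    )
    encoder.trainable = False  # keep all encoder weights frozen
    return encoder
```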
The decoder is a small CNN:
- Input: encoder features, e.g., $56\times56\times256$.
- Output: reconstructed image, $224\times224\times3$.
- Uses upsampling (e.g., `UpSampling2D` + `Conv2D`) and residual blocks.
- Final layer: `Conv2D(3, kernel_size=3, activation="sigmoid")` to map outputs to $[0,1]$.

The total parameter count is kept on the order of a few hundred thousand.
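A minimal sketch of a decoder in this spirit, with two `UpSampling2D` stages and simple residual blocks. The layer widths are illustrative (roughly 260k parameters), not the exact configuration in `src/decoder.py`.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two 3x3 convs with a skip connection (channel counts must match)."""
    skip = x
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    return layers.Activation("relu")(layers.Add()([x, skip]))

def build_decoder(input_shape=(56, 56, 256)):
    """Small decoder: 56x56x256 features -> 224x224x3 image in [0, 1]."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    x = residual_block(x, 64)
    x = layers.UpSampling2D(2)(x)   # 56 -> 112
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = residual_block(x, 32)
    x = layers.UpSampling2D(2)(x)   # 112 -> 224
    out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)
    return tf.keras.Model(inp, out, name="decoder")
```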
The autoencoder combines the encoder and decoder:

$$\hat{x} = g(f(x)),$$

where:
- $f$ is the frozen encoder.
- $g$ is the trainable decoder.
- The training objective is to minimize the difference between $x$ and $\hat{x}$.
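Using the sketches above, this composition can be expressed as a single Keras model whose only trainable weights belong to the decoder:

```python
import tensorflow as tf

encoder = build_encoder()  # frozen f (see encoder sketch above)
decoder = build_decoder()  # trainable g (see decoder sketch above)

x = tf.keras.Input(shape=(224, 224, 3))
x_hat = decoder(encoder(x))  # x_hat = g(f(x))
autoencoder = tf.keras.Model(x, x_hat, name="autoencoder")
```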
The project uses a combination of mean absolute error (L1) and structural similarity (SSIM):
import tensorflow as tf

def ssim_l1_loss(y_true, y_pred, alpha=0.8):
    # Ensure consistent dtypes before computing the loss.
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    # Pixel-wise L1 term.
    l1 = tf.reduce_mean(tf.abs(y_true - y_pred))
    # Structural term: 1 - mean SSIM, so higher similarity means lower loss.
    ssim_val = tf.image.ssim(y_true, y_pred, max_val=1.0)
    ssim_loss = 1.0 - tf.reduce_mean(ssim_val)
    return alpha * l1 + (1.0 - alpha) * ssim_loss
- L1 encourages pixel-wise accuracy.
- SSIM focuses on structural similarity.
- $\alpha = 0.8$ gives more weight to L1 while retaining a structure-aware penalty.
The autoencoder is typically trained with the Adam optimizer and a modest learning rate to ensure stable convergence for the small decoder.
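A typical compile step under these assumptions; the exact learning rate is not stated in this README, so the value below is an assumption.

```python
autoencoder.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # assumed value
    loss=ssim_l1_loss,
)
```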
Previous issues occurred when manually setting steps_per_epoch = 1000, which conflicted with the true dataset cardinality and caused "Your input ran out of data" warnings.

In the final configuration:
- One epoch uses the entire dataset (3750 batches).
- `steps_per_epoch` is either inferred or explicitly set to the dataset cardinality.
train_ds = load_celeba_hq(batch_size=batch_size)
# Cache decoded images, shuffle with a fixed seed, and prefetch batches.
train_ds = train_ds.cache().shuffle(1000, seed=SEED).prefetch(tf.data.AUTOTUNE)

history = autoencoder.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[checkpoint_cb],
)
Keras automatically infers steps_per_epoch from the dataset cardinality.

Alternatively, set it explicitly:
num_batches = int(train_ds.cardinality().numpy()) # ≈ 3750
history = autoencoder.fit(
train_ds,
epochs=EPOCHS,
steps_per_epoch=num_batches,
callbacks=[checkpoint_cb],
)
From the project root:
python src/train.py

The training will:
- Build the encoder and decoder.
- Freeze the encoder parameters.
- Compile the autoencoder using `ssim_l1_loss`.
- Load existing weights for fine-tuning, if available.
- Train for the specified number of epochs.
- Save:
  - src/models/decoder_checkpoints/decoder_final.h5
  - loss_history.json
  - loss_curve.png
To evaluate the trained model:
python src/evaluate.py \
    --weights src/models/decoder_checkpoints/decoder_final.h5 \
    --num-samples 300 \
    --batch-size 8 \
    --save-dir outputs/eval_run1

The script:
- Builds the encoder and decoder, and loads the decoder weights.
- Samples images from CelebA-HQ.
- Computes:
  - Mean squared error (MSE).
  - Peak signal-to-noise ratio (PSNR) with data_range=1.0.
  - Structural similarity (SSIM) with data_range=1.0, channel_axis=-1.
- Saves to outputs/eval_run1/:
  - metrics_summary.json: means, standard deviations, and full per-image lists.
  - psnr_histogram.png, ssim_histogram.png.
  - sample_reconstructions.png: a grid of original vs. reconstructed images.
  - training_loss_curve.png (if loss history is present).
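The `data_range` and `channel_axis` arguments match the `scikit-image` 0.21 metric API pinned in `requirements.txt`. A per-image computation in that style might look like the following sketch (the function name is illustrative; arrays are float32 in $[0, 1]$):

```python
import numpy as np
from skimage.metrics import (
    mean_squared_error,
    peak_signal_noise_ratio,
    structural_similarity,
)

def image_metrics(original, reconstructed):
    """Compute MSE, PSNR, and SSIM for one pair of [0, 1] RGB images."""
    mse = mean_squared_error(original, reconstructed)
    psnr = peak_signal_noise_ratio(original, reconstructed, data_range=1.0)
    ssim = structural_similarity(
        original, reconstructed, data_range=1.0, channel_axis=-1
    )
    return {"mse": float(mse), "psnr": float(psnr), "ssim": float(ssim)}
```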
For a quick sanity check:

python src/test_model.py

The script:
- Loads the encoder and decoder with trained weights.
- Reconstructs a small batch of images.
- Prints basic metrics and may save a comparison figure.
Start the UI:

python -m streamlit run app/ui_app.py

Open the local URL (e.g., http://localhost:8501).

Image upload reconstruction:
- Upload a face image (JPG/PNG).
- The app resizes it to $224\times224$, normalizes it, and runs the encoder and decoder.
- It displays:
  - Original vs. reconstructed images.
  - MSE, PSNR, and SSIM for the uploaded sample.
Live webcam reconstruction:
- Uses `streamlit-webrtc` and `av`.
- Implements a `VideoProcessorBase` with a `recv()` method (the current API); a sketch follows below.
- Each frame is preprocessed, encoded, decoded, and rendered as a reconstructed stream.
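A condensed sketch of such a processor, reusing the encoder/decoder sketches above; the actual `ui_app.py` wiring and preprocessing details may differ.

```python
import av
import numpy as np
import tensorflow as tf
from streamlit_webrtc import VideoProcessorBase, webrtc_streamer

class ReconstructionProcessor(VideoProcessorBase):
    def __init__(self):
        self.encoder = build_encoder()  # frozen encoder (sketch above)
        self.decoder = build_decoder()  # decoder (sketch above)
        # Trained weights would be loaded here, e.g. from the checkpoint
        # path used by train.py (only valid for the matching architecture):
        # self.decoder.load_weights("src/models/decoder_checkpoints/decoder_final.h5")

    def recv(self, frame: av.VideoFrame) -> av.VideoFrame:
        img = frame.to_ndarray(format="rgb24")
        x = tf.image.resize(img, (224, 224)) / 255.0      # preprocess
        x_hat = self.decoder(self.encoder(x[None, ...]))  # reconstruct
        out = (np.clip(x_hat[0].numpy(), 0.0, 1.0) * 255).astype(np.uint8)
        return av.VideoFrame.from_ndarray(out, format="rgb24")

webrtc_streamer(key="reconstruction",
                video_processor_factory=ReconstructionProcessor)
```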
Environment used for the reported runs:
- Python 3.10.
- Conda environment `CV`.
- Dependencies installed from `requirements.txt`.
In train.py, set seeds:
import random
import numpy as np
import tensorflow as tf
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
Use the same seed for dataset shuffling:
train_ds = train_ds.shuffle(1000, seed=SEED)
Note that full bitwise determinism may not be guaranteed on GPU/Metal, but this significantly stabilizes the training outcome.
To reproduce results:
- Use the same CelebA-HQ dataset in `dataset/celeba_hq/`.
- Do not change the number or identity of images.

Keep fixed:
- `batch_size` (e.g., 8).
- Number of epochs (e.g., 30).
- Encoder and decoder architectures.
- Loss function (`ssim_l1_loss`, same $\alpha$).
- Learning rate and optimizer.

Ensure each epoch uses the full dataset:
- Recommended: do not set `steps_per_epoch`; let Keras infer it from the dataset cardinality.
- Alternatively: set `steps_per_epoch = int(train_ds.cardinality().numpy())` (approximately 3750).

Avoid contradictory manual settings (e.g., a steps_per_epoch that is too small combined with caching), which can cause "input ran out of data" warnings.
- Small Decoder: the constrained decoder limits reconstruction sharpness compared to large decoders or GANs.
- Face-Only Training: the model is trained only on CelebA-HQ faces; generalisation to non-face data is limited.
- Fixed Resolution: only $224\times224$ images are supported in the current configuration.
- No Adversarial Loss: reconstructions may appear over-smoothed compared to GAN-based methods.
Future work:
- A slightly deeper decoder while keeping it "small" overall.
- Additional perceptual loss terms, e.g., a VGG-based perceptual loss.
- Reconstruction from different encoder layers (early vs. deep features).
- Multi-resolution outputs and multi-scale training.
- An enhanced UI:
  - Compare different checkpoints.
  - Visualise error maps.
  - Toggle between different loss configurations.
Quick start:
- Clone the repository:
  git clone https://github.com/GouthamMallavolu/CAP6415-Project-ImageReconstruction
  cd CAP6415-Project-ImageReconstruction
- Set up the environment:
  conda create -n CV python=3.10 -y
  conda activate CV
  pip install -r requirements.txt
- Prepare the dataset:
  - Download CelebA-HQ: https://www.kaggle.com/datasets/lamsimon/celebahq?resource=download-directory&select=celeba_hq
  - Place the CelebA-HQ images under `dataset/celeba_hq/`.
- Verify the dataset loader:
  python src/dataset.py
- Extract features to speed up training:
  python src/extract_features.py
- Train the model:
  python src/train.py
- Test the model:
  python src/test_model.py
- Evaluate the model:
  python src/evaluate.py
- Run the UI:
  python -m streamlit run app/ui_app.py
Following these steps with the newly created conda environment, dataset layout, and scripts will reproduce the training behavior, evaluation metrics, and interactive demo for image reconstruction from CNN features using a small decoder network.
References:
- Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507.
- Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep Learning Face Attributes in the Wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2018). Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations (ICLR).
- Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4), 600–612.
- Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Autoencoder. Wikipedia, The Free Encyclopedia. Accessed 2024. (Overview of autoencoders and their use for unsupervised representation learning and reconstruction.)
Authors:
- Goutham Mallavolu - gmallavolu2024@fau.edu
- Maahir Mitayeegiri - mmitayeegiri2024@fau.edu

© 2025. This project was created for the CAP 6415 Computer Vision course at Florida Atlantic University.

