High-Performance Encrypted Deduplication with Segment Chunks & Index Locality
Evaluate-SCAIL is the full research and engineering implementation of SCAIL and P-SCAIL, two high-throughput encrypted deduplication systems designed to achieve petabyte-scale metadata processing on commodity hardware.
This repository provides the client, server, storage engine, sorted-index implementation, multiprocessing pipeline, and evaluation tools used to produce the published results.
SCAIL and P-SCAIL are detailed in the accompanying research:
- P-SCAIL – metadata scalability and parallel SCI architecture
- SCAIL NAS 2022 conference paper – segment-level client interface and metadata reduction
- PhD Thesis – complete design, algorithms, and evaluation results
Evaluate-SCAIL provides an end-to-end encrypted deduplication pipeline:
- Run and compare deduplication schemes: Base, Metadedup, SCAIL, P-SCAIL
- Load and process real-world datasets (FSL, MS/UBC), Docker layers, VM images, and synthetic traces
- Execute the full workflow: chunking → encryption → segment formation → client lookup → server-side deduplication → SCI update
- Benchmark performance, memory, upload volume, and metadata overhead
- Visualise disk I/O, throughput, and metadata trends
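To make the first stage of the workflow concrete, here is a minimal sketch of content-defined chunking (CDC). It is not taken from this repository: the gear-style hash, the `cdc_chunks` name, and the size parameters are all assumptions chosen for illustration.

```python
import hashlib

# Hypothetical gear table: one pseudo-random 64-bit value per byte value.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, avg_bits: int = 6,
               min_size: int = 16, max_size: int = 256) -> list:
    """Split data where a rolling hash hits a boundary pattern.

    Assumed parameters: a boundary is declared when the low `avg_bits`
    bits of the hash are zero, subject to min_size and max_size.
    """
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Shift-and-add update, so older bytes gradually leave the low bits.
        h = ((h << 1) + GEAR[b]) & MASK64
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


if __name__ == "__main__":
    sample = hashlib.shake_256(b"demo").digest(4096)
    parts = cdc_chunks(sample)
    print(len(parts), "chunks, reassembled ok:", b"".join(parts) == sample)
```

Because boundaries depend on content rather than fixed offsets, an insertion near the start of a file tends to disturb only nearby chunks, which is what makes the later deduplication stages effective.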
- Memory reduction up to 94 percent when using 2 MiB segments.
- Approximately 57 GiB DRAM required for 1 PB of unique deduplicated data using 2 MiB segments.
- Parallel throughput (16 processes):
  - LDLS: 273–434 GiB/s
  - SMPS: 6.9–10.0 GiB/s
- Metadata savings up to 97 percent across long-running workloads.
- Redundant segment uploads typically under 1 percent per backup for long-running workloads.
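As a back-of-the-envelope reading of the DRAM figure above, the short calculation below derives the index memory per segment that the headline numbers imply. The helper names are hypothetical, and the result depends on an assumption about units: whether "1 PB" means 2^50 bytes or 10^15 bytes. It says nothing about the actual index entry layout.

```python
MIB = 2**20
GIB = 2**30
PIB = 2**50  # assumption: "1 PB" read as a binary petabyte


def segment_count(unique_bytes: int, segment_bytes: int) -> int:
    """Number of whole segments needed to cover the unique data."""
    return unique_bytes // segment_bytes


def index_bytes_per_segment(unique_bytes: int, segment_bytes: int,
                            dram_bytes: int) -> float:
    """DRAM budget divided evenly across all segments."""
    return dram_bytes / segment_count(unique_bytes, segment_bytes)


if __name__ == "__main__":
    print(segment_count(PIB, 2 * MIB))                        # 536870912 segments
    print(index_bytes_per_segment(PIB, 2 * MIB, 57 * GIB))    # 114.0 bytes each
    # Decimal reading of 1 PB gives a slightly larger per-segment figure.
    print(round(index_bytes_per_segment(10**15, 2 * MIB, 57 * GIB), 1))
```

Under either reading the implied cost is on the order of a hundred or so bytes of DRAM per 2 MiB segment.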
flowchart TD
subgraph CLIENT1[" CLIENT - Phase 1: Prepare & Query "]
A[Client Files] --> B[CDC Chunking]
B --> C[MLE Encrypt Chunks]
C --> D[Segment Formation]
D --> E[Generate MFP Query]
end
subgraph SERVER1[" SERVER - Phase 1: Lookup "]
F[Metachunk Lookup]
G[Return Missing Segments List]
end
subgraph CLIENT2[" CLIENT - Phase 2: Upload "]
H[Upload Missing Segments:<br/>Encrypted Chunks + PEMCs]
end
subgraph SERVER2[" SERVER - Phase 2: Deduplication & Storage "]
I[SCI Pass: Sorted Chunk Index]
I --> J[Container Allocation]
J --> K[Index Updates]
end
E --> F
F --> G
G --> H
H --> I
style CLIENT1 fill:#2d3748,stroke:#4299e1,stroke-width:2px
style SERVER1 fill:#2d3748,stroke:#48bb78,stroke-width:2px
style CLIENT2 fill:#2d3748,stroke:#4299e1,stroke-width:2px
style SERVER2 fill:#2d3748,stroke:#48bb78,stroke-width:2px
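The two-phase exchange in the diagram can be sketched in a few lines of Python. This is an illustrative toy, not the repository's implementation: the `Server` class, `backup` function, and fingerprint construction are hypothetical, the XOR keystream is a stand-in for a real message-locked cipher, and sorting incoming fingerprints is only an assumption about what a sorted-index pass might look like.

```python
import hashlib
from typing import Dict, List, Tuple


def mle_encrypt(chunk: bytes) -> bytes:
    """Toy message-locked encryption: the key is derived from the chunk.
    XOR with a SHAKE-256 keystream is for illustration only, not a vetted cipher."""
    key = hashlib.sha256(chunk).digest()
    stream = hashlib.shake_256(key).digest(len(chunk))
    return bytes(a ^ b for a, b in zip(chunk, stream))


def form_segments(chunks: List[bytes], per_segment: int) -> List[List[bytes]]:
    """Group consecutive encrypted chunks into fixed-count segments."""
    return [chunks[i:i + per_segment] for i in range(0, len(chunks), per_segment)]


def segment_fingerprint(segment: List[bytes]) -> str:
    """Assumed fingerprint: a hash over the segment's chunk hashes."""
    h = hashlib.sha256()
    for c in segment:
        h.update(hashlib.sha256(c).digest())
    return h.hexdigest()


class Server:
    def __init__(self) -> None:
        self.segment_index = set()   # known segment fingerprints
        self.chunk_index = {}        # chunk fingerprint -> container slot
        self.containers = []         # stored encrypted chunks

    def lookup(self, fingerprints: List[str]) -> List[str]:
        """Phase 1: report which segments the server has not seen."""
        return [fp for fp in fingerprints if fp not in self.segment_index]

    def store(self, uploads: Dict[str, List[bytes]]) -> int:
        """Phase 2: chunk-level dedup of uploaded segments, in fingerprint order."""
        incoming = sorted((hashlib.sha256(c).hexdigest(), c)
                          for seg in uploads.values() for c in seg)
        new_chunks = 0
        for fp, c in incoming:
            if fp not in self.chunk_index:
                self.chunk_index[fp] = len(self.containers)
                self.containers.append(c)
                new_chunks += 1
        self.segment_index.update(uploads)  # record segment fingerprints
        return new_chunks


def backup(server: Server, chunks: List[bytes], per_segment: int = 2) -> Tuple[int, int]:
    """Run both phases; return (segments uploaded, new chunks stored)."""
    encrypted = [mle_encrypt(c) for c in chunks]
    segments = {segment_fingerprint(s): s
                for s in form_segments(encrypted, per_segment)}
    missing = server.lookup(list(segments))
    new_chunks = server.store({fp: segments[fp] for fp in missing})
    return len(missing), new_chunks


if __name__ == "__main__":
    srv = Server()
    print(backup(srv, [b"a", b"b", b"c", b"d"]))  # (2, 4)
    print(backup(srv, [b"a", b"b", b"c", b"e"]))  # (1, 1)
```

In the second backup only the changed segment is uploaded, and the server-side pass still catches the chunk inside it that was already stored, so the redundant upload does not become redundant storage.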
src/
client/
server/
repo/
config/
metrics/
datasets/
utilities/
papers/
- Python 3.9+
- Cython
- Ray
- Redis (optional)
git clone https://github.com/ammons-datalabs/Evaluate-SCAIL.git
cd Evaluate-SCAIL
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python src/sync_cython_files.py
python src/fsl_dedup.py
python src/ubc_dedup.py
python src/docker_dedup.py
python src/gen_dedup.py
python src/metrics/plot.py
python src/logs_viewer.py
python src/utilities/build_results/build_results_to_latex.py
python -m unittest discover -s src/tests
- P-SCAIL: Proposed Journal Paper
- SCAIL: Conference Paper NAS 2022
- PhD Thesis
Jaybe Ammons
PhD, Computer Science — Birkbeck, University of London
See repository for license details.