A Universal Chaos & Performance Framework for NVMe SSDs.
Simulate hyperscale SSD workloads and fault conditions to validate endurance, performance consistency, and data integrity under stress. This robust framework uses Python, Behave (BDD), SPDK fio, and QEMU/QMP for realistic NVMe device emulation and fault injection. It is designed for seamless integration into Jenkins CI/CD pipelines for repeatable chaos and regression testing.
It automates workload application, fault injection, and metric verification, so resilience and performance-consistency checks on hyperscale NVMe SSDs can be repeated on every run.
| Component | Technology | Role |
|---|---|---|
| Test Runner | Behave (BDD) | Executes scenario-driven tests and verifies pass/fail criteria. |
| Workload Generator | SPDK fio | Generates realistic, high-performance NVMe I/O workloads. |
| Fault Injection | QEMU + QMP | Emulates NVMe devices and injects faults (hot-unplug/plug) via QMP. |
| Latency Simulation | nbdkit / Python proxy | Simulates I/O slowdowns and latency spikes. |
- Validate SSD performance under SPDK fio workloads.
- Simulate power-loss (hot-unplug) and device hot-plug events.
- Inject latency spikes / I/O slowdowns.
- Observe system resilience, recovery, and workload impact.
- Automatically verify latency, IOPS, and throughput against defined thresholds.
- Integrate into CI/CD for repeatable chaos and regression testing.
- Endurance and Longevity: Simulate sustained high-write workloads to ensure throughput and latency remain stable over time. Hyperscale SSDs must endure constant, high-volume write workloads without wearing out.
- Performance Consistency & Latency Spikes: Consistent performance is more critical than peak performance. Maintain predictable IOPS and low latency under mixed workloads and shared stress.
- Reliability and Data Integrity: Recover from power loss or hot unplug events without data corruption. The goal is to ensure data remains intact and the device can be brought back online reliably.
- Multi-tenancy & QoS: A single SSD is often shared by multiple virtual machines. Validate QoS mechanisms to throttle one VM without affecting others.
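One way to put a number on "consistent rather than peak" performance is to compare a tail latency percentile with the mean. The sketch below is only an illustration: the function name `latency_consistency`, the nearest-rank percentile method, and the tail-to-mean ratio are assumptions, not metrics taken from this framework.

```python
import statistics


def latency_consistency(samples_ms, percentile=99):
    """Summarize latency samples (in ms) as mean, tail percentile, and their ratio.

    Hypothetical helper: a ratio close to 1.0 suggests steady latency,
    a large ratio suggests occasional spikes. Samples are assumed positive.
    """
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile using integer arithmetic: rank = ceil(p * n / 100).
    rank = -(-percentile * len(ordered) // 100)
    tail = ordered[max(rank, 1) - 1]
    mean = statistics.fmean(ordered)
    return {"mean_ms": mean, "tail_ms": tail, "tail_to_mean": tail / mean}
```

A workload with a few slow I/Os can keep a low mean while its tail-to-mean ratio grows, which is the kind of inconsistency the objectives above target.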
- 🐍 Python 3.10+ (recommended)
- 💻 Linux or Windows 11 host operating system.
- ⚙️ QEMU, nbdkit (Linux), fio, and SPDK installed and built.
- Clone the repository (Example):
    git clone https://github.com/luckyjoy/ssd_hyperscale.git
    cd ssd_hyperscale
- Install Python Dependencies:
    pip install -r requirements.txt
- Host Dependencies (Linux Example):
    sudo apt-get update
    sudo apt-get install -y qemu-system-x86 nbdkit nbdkit-filter-delay python3 socat fio git
- SPDK Setup:
    git clone https://github.com/spdk/spdk.git
    cd spdk
    git submodule update --init
    ./configure --with-fio=/usr/src/fio
    make -j$(nproc)
- Create VM Image:
    qemu-img create -f qcow2 vm-disk.qcow2 10G
    (Note: The VM image must have fio installed and SSH enabled for host access on localhost:2222.)
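Before starting a workload, it can help to confirm that the guest's SSH port is reachable from the host. This is a minimal sketch using only the standard library; the helper name `port_is_open` and the 2-second timeout are assumptions, and the check only proves that something accepts TCP connections on the port, not that SSH login works.

```python
import socket


def port_is_open(host="localhost", port=2222, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open()` with the defaults targets the localhost:2222 forward mentioned in the note above.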
- Linux:
    python3 qemu-faults/qemu_launch_vm.py
- Windows:
    python qemu-faults\qemu_launch_vm.py

fio spdk_fio_mix.job --output-format=json --output=reports/fio_guest.json

| Fault Type | Linux Command | Windows Command |
|---|---|---|
| Hot-Unplug NVMe | python3 qemu-faults/qmp_injector.py --socket /tmp/vm-test.qmp --action remove --device-id nvme0 | python qemu-faults\qmp_injector.py --socket \\.\pipe\vm-test_qmp --action remove --device-id nvme0 |
| Hot-Plug NVMe | python3 qemu-faults/qmp_injector.py --socket /tmp/vm-test.qmp --action add --device-spec '{"driver":"nvme","drive":"drive0","id":"nvme0"}' | python qemu-faults\qmp_injector.py --socket \\.\pipe\vm-test_qmp --action add --device-spec '{"driver":"nvme","drive":"drive0","id":"nvme0"}' |
| Latency Spike | ./qemu-faults/nbdkit_delay_server.sh ./vm-disk.qcow2 10810 100 200 | python qemu-faults\nbd_delay.py 10810 100 |
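For readers unfamiliar with QMP, the hot-unplug and hot-plug rows map onto the standard `device_del` and `device_add` commands. The sketch below is not the repository's `qmp_injector.py`; the function names `build_qmp_command` and `run_qmp` are hypothetical, and it assumes a Unix-domain QMP socket such as `/tmp/vm-test.qmp` on a Linux host.

```python
import json
import socket


def build_qmp_command(action, device_id=None, device_spec=None):
    """Translate a 'remove' or 'add' action into a QMP command dictionary."""
    if action == "remove":
        if not device_id:
            raise ValueError("remove requires a device id")
        return {"execute": "device_del", "arguments": {"id": device_id}}
    if action == "add":
        if not device_spec:
            raise ValueError("add requires a device spec")
        return {"execute": "device_add", "arguments": dict(device_spec)}
    raise ValueError(f"unknown action: {action}")


def run_qmp(socket_path, command, timeout=5.0):
    """Send one command over a QMP Unix socket and return QEMU's reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        json.loads(stream.readline())  # QMP greeting sent by QEMU on connect
        reply = None
        # Capabilities negotiation must precede any other command.
        for message in ({"execute": "qmp_capabilities"}, command):
            stream.write(json.dumps(message) + "\n")
            stream.flush()
            while True:
                reply = json.loads(stream.readline())
                # Skip asynchronous events; stop at the command's own reply.
                if "return" in reply or "error" in reply:
                    break
        return reply
```

For example, `run_qmp("/tmp/vm-test.qmp", build_qmp_command("remove", device_id="nvme0"))` would request removal of `nvme0`. Note that `device_del` only starts the unplug; QEMU reports completion later through a `DEVICE_DELETED` event, which this sketch does not wait for.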
The framework verifies metrics against example thresholds:
MAX_LATENCY_MS = 50
MIN_IOPS = 1000
MIN_THROUGHPUT_MB = 50
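A threshold check of this kind could look roughly like the sketch below. It assumes the fio 3.x JSON layout (`jobs[].read` / `jobs[].write` with `total_ios`, `iops`, `bw_bytes`, and `lat_ns.mean`) and assumes that the latency limit applies to mean total latency; the function name `check_fio_report` is hypothetical, and the framework's own step definitions may compare different fields.

```python
MAX_LATENCY_MS = 50
MIN_IOPS = 1000
MIN_THROUGHPUT_MB = 50


def check_fio_report(report):
    """Return a list of threshold violations; an empty list means pass."""
    failures = []
    for job in report.get("jobs", []):
        name = job.get("jobname", "?")
        for direction in ("read", "write"):
            stats = job.get(direction, {})
            if not stats.get("total_ios"):
                continue  # this direction was not exercised by the job
            latency_ms = stats["lat_ns"]["mean"] / 1_000_000  # ns to ms
            iops = stats["iops"]
            throughput_mb = stats["bw_bytes"] / 1_000_000  # bytes/s to MB/s
            if latency_ms > MAX_LATENCY_MS:
                failures.append(f"{name}/{direction}: latency {latency_ms:.2f} ms > {MAX_LATENCY_MS}")
            if iops < MIN_IOPS:
                failures.append(f"{name}/{direction}: IOPS {iops:.0f} < {MIN_IOPS}")
            if throughput_mb < MIN_THROUGHPUT_MB:
                failures.append(f"{name}/{direction}: throughput {throughput_mb:.1f} MB/s < {MIN_THROUGHPUT_MB}")
    return failures
```

The input would typically be the parsed contents of a file such as `reports/fio_guest.json`, loaded with `json.load`.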
Execute all BDD scenarios, excluding manual tests, and generate an HTML report:
behave --tags=@all --exclude "features/manual_tests" -f html-pretty -o reports\automation_report.html
The framework is built for automated, cross-platform execution using a Jenkins Pipeline.
A sample Jenkins Pipeline stage for fault injection tests:
stage('Fault Injection Tests') {
    matrix {
        axes {
            axis {
                name 'OS'
                values 'linux', 'windows'
            }
        }
        agent { label "${OS}-agent" }
        stages {
            stage('Run Behave') {
                steps {
                    script {
                        if ("${OS}" == "windows") {
                            bat 'behave --tags=@windows features/ssd_fault_injection.feature'
                        } else {
                            sh 'behave --tags=@linux features/ssd_fault_injection.feature'
                        }
                    }
                }
            }
        }
    }
}

./  # Root directory
├── behave.ini  # Behave runner configuration
├── build.bat  # Windows build script
├── environment.py  # Behave environment setup file
├── Jenkinsfile  # CI/CD pipeline definition
├── README.html
├── README.md
├── requirements.txt  # Python dependencies
├── run_full_test.ps1  # PowerShell full test runner
├── run_full_test.sh  # Bash full test runner
├── test.txt
├── data/  # Data generated by fio (e.g., JSON reports)
│   ├── fio.txt
│   ├── fio_guest.json
│   └── mixed_random_output.json
├── examples/  # Example fio job files and QEMU setup files
│   ├── multi_tenant_stress.fio
│   ├── nvme_queue_depth_saturate.fio
│   ├── random_multiIO_stress.fio
│   ├── write_endurance.fio
│   ├── latency_spike.io
│   ├── latency_spike_read_lat.json
│   └── qemu_ubuntu.bat
├── features/  # Behave BDD test specifications
│   ├── advance_hyperscale.feature
│   ├── ssd_comparision.feature
│   ├── ssd_fault_injection.feature
│   ├── ssd_mixed_io.feature
│   ├── ssd_performance.feature
│   ├── manual_tests/  # Tests tagged for manual execution
│   │   ├── manual_endurance.feature
│   │   ├── manual_power_cycle.feature
│   │   └── manual_thermal_throttling.feature
│   └── steps/  # Python step definitions for Behave
│       ├── ssd_comparison.py
│       └── ssd_steps.py
├── logs/  # Runtime logs and collected metrics
│   ├── execution.log
│   ├── scenario.log
│   └── *.log  # Various action and verification logs
├── qemu-faults/  # QEMU/QMP scripts for VM launch and fault injection
│   ├── qemu_launch_vm.py
│   ├── nbdkit_delay_server.sh
│   ├── nbd_delay.py
│   ├── qmp_injector.py
│   └── fault_runner.sh
├── allure-report/  # Dynamic history report files
├── allure-results/  # Behave-Allure raw results directory
├── spdk/  # SPDK fio jobfiles (legacy/core)
│   ├── fio_mixed_rw.job
│   ├── fio_multi_device.job
│   └── fio_multi_queue.job
├── .github/  # GitHub Actions CI/CD workflows
└── supports/  # Utility scripts for reporting, telemetry, and CI/CD integration
    ├── product.json
    ├── ssd_requirements.csv
    └── *.json, *.properties  # Other Allure Report support files
| Category | Details |
|---|---|
| Capabilities | Supports Windows 11 and Linux hosts. Handles hot-unplug, hot-plug, and latency spike injection. Determines pass/fail automatically from SPDK fio metrics. Ready for Jenkins CI/CD automation. |
| Limitations | QEMU QMP APIs can differ between versions. Hot-removing devices may leave the guest filesystem inconsistent (use ephemeral VM snapshots). nbdkit --filter=delay is required on Linux; Windows uses the Python proxy. Thresholds must be adjusted per workload and SSD type. SPDK fio JSON output is required for automated verification. |
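To illustrate how a Python proxy can stand in for nbdkit's delay filter, here is a generic TCP relay that holds each client-to-server chunk for a fixed time before forwarding it. It is a sketch under assumptions, not the repository's `nbd_delay.py`: the names `start_delay_proxy` and `_pump` are hypothetical, it assumes the backing server listens on a TCP port, and it delays whole chunks rather than individual NBD requests.

```python
import asyncio


async def _pump(reader, writer, delay_s):
    """Copy bytes from reader to writer, sleeping delay_s before each chunk."""
    try:
        while True:
            chunk = await reader.read(65536)
            if not chunk:
                break  # peer closed its side of the connection
            await asyncio.sleep(delay_s)
            writer.write(chunk)
            await writer.drain()
    finally:
        writer.close()


async def start_delay_proxy(listen_port, target_host, target_port, delay_ms):
    """Start a local proxy that delays client-to-target traffic by delay_ms."""
    delay_s = delay_ms / 1000.0

    async def handle(client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(target_host, target_port)
        except OSError:
            client_writer.close()
            return
        # Requests are delayed; responses are forwarded without extra delay.
        await asyncio.gather(
            _pump(client_reader, up_writer, delay_s),
            _pump(up_reader, client_writer, 0.0),
            return_exceptions=True,
        )

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

Delaying only the request direction keeps the added latency per round trip close to `delay_ms`; delaying both directions would roughly double it.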
- Fork the repository.
- Create a feature branch.
- Implement new Behave features, SPDK workloads, or fault injection methods.
- Run `behave` locally and verify results.
- Submit a Pull Request with a clear description.
Released under the MIT License: free to use, modify, and distribute.
📬 Contact: Bang Thien Nguyen ontario1998@gmail.com
“Performance is a feature, and reliability is its foundation.”