luckyjoy/hyperscale_ssd

🚀 SSD Hyperscale Simulation Test Framework

A Universal Chaos & Performance Framework for NVMe SSDs.

Simulate hyperscale SSD workloads and fault conditions to validate endurance, performance consistency, and data integrity under stress. The framework uses Python, Behave (BDD), SPDK fio, and QEMU/QMP for NVMe device emulation and fault injection, and is designed to run in Jenkins CI/CD pipelines for repeatable chaos and regression testing.


Python Behave QEMU fio Kubernetes Docker CI/CD Allure Report


💡 Project Overview

This framework tests the resilience and performance consistency of hyperscale NVMe SSDs by automating workload application, fault injection, and metric verification.

| Component | Technology | Role |
|---|---|---|
| Test Runner | Behave (BDD) | Executes scenario-driven tests and verifies pass/fail criteria. |
| Workload Generator | SPDK fio | Generates realistic, high-performance NVMe I/O workloads. |
| Fault Injection | QEMU + QMP | Emulates NVMe devices and injects faults (hot-unplug/plug) via QMP. |
| Latency Simulation | nbdkit / Python proxy | Simulates I/O slowdowns and latency spikes. |

🛠️ Key Test Purposes & Challenges

🧪 Test Purposes

  • Validate SSD performance under SPDK fio workloads.
  • Simulate power-loss (hot-unplug) and device hot-plug events.
  • Inject latency spikes / I/O slowdowns.
  • Observe system resilience, recovery, and workload impact.
  • Automatically verify latency, IOPS, and throughput against defined thresholds.
  • Integrate into CI/CD for repeatable chaos and regression testing.

📌 Key Challenges with Hyperscale SSDs

  • Endurance and Longevity: Hyperscale SSDs must sustain constant, high-volume writes without wearing out. Tests apply sustained write-heavy workloads and check that throughput and latency remain stable over time.
  • Performance Consistency & Latency Spikes: Consistent performance matters more than peak performance. Tests check that IOPS stay predictable and latency stays low under mixed workloads and shared stress.
  • Reliability and Data Integrity: The device must recover from power loss or hot-unplug events without data corruption and come back online reliably.
  • Multi-tenancy & QoS: A single SSD is often shared by multiple virtual machines. Tests validate that QoS mechanisms can throttle one VM without affecting the others.
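
One way to put a number on "consistency" is to compare tail latency with mean latency. The sketch below is illustrative only and is not taken from the framework: `percentile` and `consistency_report` are hypothetical helpers, and the nearest-rank percentile and the p99/mean "tail ratio" are assumed choices of metric.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence (pct between 0 and 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def consistency_report(latencies_ms):
    """Summarize latency samples; a large tail_ratio indicates latency spikes."""
    mean = statistics.fmean(latencies_ms)
    p99 = percentile(latencies_ms, 99)
    return {
        "mean_ms": mean,
        "p99_ms": p99,
        "max_ms": max(latencies_ms),
        "tail_ratio": p99 / mean,  # assumed metric: p99 relative to the mean
    }
```

A device with a low mean but a high `tail_ratio` would look fast on average while still producing the spikes that hurt shared workloads.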

🚀 Getting Started

🔧 Prerequisites

  • 🐍 Python 3.10+ (recommended)
  • 💻 Linux or Windows 11 host operating system.
  • ⚙️ QEMU, nbdkit (Linux), fio, and SPDK installed and built.

⚙️ Installation & Setup

  1. Clone the repository (Example):

    git clone https://github.com/luckyjoy/ssd_hyperscale.git
    cd ssd_hyperscale
  2. Install Python Dependencies:

    pip install -r requirements.txt
  3. Host Dependencies (Linux Example):

    sudo apt-get update
    sudo apt-get install -y qemu-system-x86 nbdkit nbdkit-filter-delay python3 socat fio git
  4. SPDK Setup:

    git clone https://github.com/spdk/spdk.git
    cd spdk
    git submodule update --init
    ./configure --with-fio=/usr/src/fio
    make -j$(nproc)
  5. Create VM Image:

    qemu-img create -f qcow2 vm-disk.qcow2 10G

    (Note: this command only creates an empty disk. The VM image must contain a guest OS with fio installed and SSH enabled, reachable from the host on localhost:2222.)


🔬 Running Tests

1. Launch VM

  • Linux: python3 qemu-faults/qemu_launch_vm.py
  • Windows: python qemu-faults\qemu_launch_vm.py
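
The contents of `qemu_launch_vm.py` are not shown here, so the following is only a sketch of the kind of QEMU command line such a launcher might assemble. It assumes a Linux host. The function name `build_qemu_args`, the NVMe backing file `nvme-test.img`, the memory size, and the serial string are all hypothetical. The ids `drive0` and `nvme0`, the QMP socket path, and SSH port 2222 are taken from the commands elsewhere in this README.

```python
def build_qemu_args(
    boot_image="vm-disk.qcow2",
    nvme_image="nvme-test.img",   # hypothetical backing file for the NVMe drive
    qmp_socket="/tmp/vm-test.qmp",
    ssh_port=2222,
    memory_mb=2048,               # assumed guest memory size
):
    """Return a qemu-system-x86_64 argument list (illustrative sketch only)."""
    return [
        "qemu-system-x86_64",
        "-m", str(memory_mb),
        # Boot disk created with qemu-img during setup.
        "-drive", f"file={boot_image},if=virtio,format=qcow2",
        # Backing drive plus emulated NVMe controller; ids match the fault-injection commands.
        "-drive", f"file={nvme_image},if=none,id=drive0,format=raw",
        "-device", "nvme,drive=drive0,id=nvme0,serial=hyperscale0",
        # QMP control socket used for hot-unplug and hot-plug.
        "-qmp", f"unix:{qmp_socket},server=on,wait=off",
        # User-mode networking with a host port forwarded to guest SSH.
        "-netdev", f"user,id=net0,hostfwd=tcp::{ssh_port}-:22",
        "-device", "virtio-net-pci,netdev=net0",
        "-nographic",
    ]
```

Passing the returned list to `subprocess.Popen` would start the VM. Building the arguments as a list rather than a shell string avoids quoting problems with paths.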

2. Normal Workload Test (Inside VM)

fio spdk_fio_mix.job --output-format=json --output=reports/fio_guest.json

3. Inject Faults (Chaos Testing)

| Fault Type | Linux Command | Windows Command |
|---|---|---|
| Hot-Unplug NVMe | `python3 qemu-faults/qmp_injector.py --socket /tmp/vm-test.qmp --action remove --device-id nvme0` | `python qemu-faults\qmp_injector.py --socket \\.\pipe\vm-test_qmp --action remove --device-id nvme0` |
| Hot-Plug NVMe | `python3 qemu-faults/qmp_injector.py --socket /tmp/vm-test.qmp --action add --device-spec '{"driver":"nvme","drive":"drive0","id":"nvme0"}'` | `python qemu-faults\qmp_injector.py --socket \\.\pipe\vm-test_qmp --action add --device-spec '{"driver":"nvme","drive":"drive0","id":"nvme0"}'` |
| Latency Spike | `./qemu-faults/nbdkit_delay_server.sh ./vm-disk.qcow2 10810 100 200` | `python qemu-faults\nbd_delay.py 10810 100` |
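
For readers unfamiliar with QMP, the hot-unplug and hot-plug actions above correspond to the standard `device_del` and `device_add` commands, sent as JSON after a `qmp_capabilities` handshake. The sketch below shows that exchange over a Unix socket. It is not the implementation of `qmp_injector.py`: `build_qmp_command` and `send_qmp` are hypothetical names, and the Windows named-pipe transport is not covered.

```python
import json
import socket


def build_qmp_command(action, device_id=None, device_spec=None):
    """Map an injector action ('remove' or 'add') to a QMP command dict."""
    if action == "remove":
        if not device_id:
            raise ValueError("remove requires a device id")
        return {"execute": "device_del", "arguments": {"id": device_id}}
    if action == "add":
        if not device_spec:
            raise ValueError("add requires a device spec")
        return {"execute": "device_add", "arguments": dict(device_spec)}
    raise ValueError(f"unknown action: {action}")


def _read_reply(reader):
    """Return the next non-event QMP message from a line-oriented reader."""
    while True:
        line = reader.readline()
        if not line:
            raise ConnectionError("QMP socket closed unexpectedly")
        message = json.loads(line)
        if "event" not in message:  # skip asynchronous events
            return message


def send_qmp(socket_path, command, timeout=5.0):
    """Connect to a QMP Unix socket, negotiate capabilities, and run one command."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        with sock.makefile("r", encoding="utf-8") as reader:
            _read_reply(reader)  # server greeting: {"QMP": {...}}
            reply = None
            for message in ({"execute": "qmp_capabilities"}, command):
                sock.sendall((json.dumps(message) + "\n").encode("utf-8"))
                reply = _read_reply(reader)
                if "error" in reply:
                    raise RuntimeError(reply["error"].get("desc", "QMP error"))
            return reply
```

For example, `send_qmp("/tmp/vm-test.qmp", build_qmp_command("remove", device_id="nvme0"))` would request the unplug. Note that `device_del` only starts the removal: QEMU reports completion later through a `DEVICE_DELETED` event, which this sketch does not wait for.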

4. Automatic Metric Verification

The framework verifies metrics against example thresholds:

  • MAX_LATENCY_MS = 50
  • MIN_IOPS = 1000
  • MIN_THROUGHPUT_MB = 50
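
The check itself can be sketched as follows, assuming the report is fio's JSON output (fio 3.x field names `iops`, `bw_bytes`, and `lat_ns`). The helper names are hypothetical, and several choices are assumptions rather than the framework's documented behavior: read and write stats are summed across jobs, the maximum total latency is compared against `MAX_LATENCY_MS`, and `MIN_THROUGHPUT_MB` is interpreted as MB/s.

```python
import json

# Example thresholds from this README (throughput unit assumed to be MB/s).
MAX_LATENCY_MS = 50
MIN_IOPS = 1000
MIN_THROUGHPUT_MB = 50


def summarize_fio(report):
    """Aggregate read and write stats across all jobs in a fio JSON report."""
    iops = 0.0
    bw_bytes = 0.0
    max_lat_ns = 0.0
    for job in report["jobs"]:
        for direction in ("read", "write"):
            stats = job.get(direction, {})
            iops += stats.get("iops", 0.0)
            bw_bytes += stats.get("bw_bytes", 0.0)  # bytes per second
            max_lat_ns = max(max_lat_ns, stats.get("lat_ns", {}).get("max", 0.0))
    return {
        "iops": iops,
        "throughput_mb": bw_bytes / 1e6,
        "max_latency_ms": max_lat_ns / 1e6,
    }


def check_thresholds(summary):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    if summary["max_latency_ms"] > MAX_LATENCY_MS:
        failures.append(f"latency {summary['max_latency_ms']:.1f} ms > {MAX_LATENCY_MS} ms")
    if summary["iops"] < MIN_IOPS:
        failures.append(f"IOPS {summary['iops']:.0f} < {MIN_IOPS}")
    if summary["throughput_mb"] < MIN_THROUGHPUT_MB:
        failures.append(f"throughput {summary['throughput_mb']:.1f} MB/s < {MIN_THROUGHPUT_MB} MB/s")
    return failures


def verify_report_file(path):
    """Load a fio JSON report from disk and return its threshold failures."""
    with open(path, encoding="utf-8") as handle:
        return check_thresholds(summarize_fio(json.load(handle)))
```

Calling `verify_report_file("reports/fio_guest.json")` on the report from step 2 would return the list of violated thresholds. A percentile such as p99 may be a better latency criterion than the maximum for noisy workloads.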

5. Running Behave Tests

Execute all BDD scenarios, excluding manual tests, and generate an HTML report:

behave --tags=@all --exclude "features/manual_tests" -f html-pretty -o reports/automation_report.html

⚙️ CI/CD Integration

The framework is built for automated, cross-platform execution using a Jenkins Pipeline.

💻 Jenkins CI/CD Workflow

A sample Jenkins Pipeline stage for fault injection tests:

stage('Fault Injection Tests') {
  matrix {
    axes { axis { name 'OS'; values 'linux', 'windows' } }
    agent { label "${OS}-agent" }
    stages {
      stage('Run Behave') {
        steps {
          script {
            if ("${OS}" == "windows") {
              bat 'behave --tags=@windows features/ssd_fault_injection.feature'
            } else {
              sh 'behave --tags=@linux features/ssd_fault_injection.feature'
            }
          }
        }
      }
    }
  }
}

🌳 Framework Architecture

./                           # Root directory
├─ behave.ini                # Behave runner configuration
├─ build.bat                 # Windows build script
├─ environment.py            # Behave environment setup file
├─ Jenkinsfile               # CI/CD pipeline definition
├─ README.html
├─ README.md
├─ requirements.txt          # Python dependencies
├─ run_full_test.ps1         # PowerShell full test runner
├─ run_full_test.sh          # Bash full test runner
├─ test.txt
├─ data/                     # Data generated by fio (e.g., JSON reports)
│  ├─ fio.txt
│  ├─ fio_guest.json
│  └─ mixed_random_output.json
├─ examples/                 # Example fio job files and QEMU setup files
│  ├─ multi_tenant_stress.fio
│  ├─ nvme_queue_depth_saturate.fio
│  ├─ random_multiIO_stress.fio
│  ├─ write_endurance.fio
│  ├─ latency_spike.io
│  ├─ latency_spike_read_lat.json
│  └─ qemu_ubuntu.bat
├─ features/                 # Behave BDD test specifications
│  ├─ advance_hyperscale.feature
│  ├─ ssd_comparision.feature
│  ├─ ssd_fault_injection.feature
│  ├─ ssd_mixed_io.feature
│  ├─ ssd_performance.feature
│  ├─ manual_tests/           # Tests tagged for manual execution
│  │  ├─ manual_endurance.feature
│  │  ├─ manual_power_cycle.feature
│  │  └─ manual_thermal_throttling.feature
│  └─ steps/                  # Python step definitions for Behave
│     ├─ ssd_comparison.py
│     └─ ssd_steps.py
├─ logs/                     # Runtime logs and collected metrics
│  ├─ execution.log
│  ├─ scenario.log
│  └─ *.log                  # Various action and verification logs
├─ qemu-faults/              # QEMU/QMP scripts for VM launch and fault injection
│  ├─ qemu_launch_vm.py
│  ├─ nbdkit_delay_server.sh
│  ├─ nbd_delay.py
│  ├─ qmp_injector.py
│  └─ fault_runner.sh
├─ allure-report/            # Dynamic history report files
├─ allure-results/           # Behave-Allure raw results directory
├─ spdk/                     # SPDK fio jobfiles (legacy/core)
│  ├─ fio_mixed_rw.job
│  ├─ fio_multi_device.job
│  └─ fio_multi_queue.job
├─ .github/                  # GitHub Actions CI/CD workflows
└─ supports/                 # Utility scripts for reporting, telemetry, and CI/CD integration
   ├─ product.json
   ├─ ssd_requirements.csv
   └─ *.json, *.properties # Other Allure Report support files


⚠️ Limitations & Capabilities

| Category | Details |
|---|---|
| Capabilities | Supports Windows 11 and Linux hosts. Handles hot-unplug, hot-plug, and latency spike injection. Includes automatic pass/fail based on SPDK fio metrics. Ready for Jenkins CI/CD automation. |
| Limitations | QEMU QMP APIs can differ between versions. Hot-removing devices may leave the guest filesystem inconsistent (use ephemeral VM snapshots). `nbdkit --filter=delay` is required on Linux; Windows uses the Python proxy. Thresholds must be adjusted per workload and SSD type. SPDK fio JSON output is required for automated verification. |

🀝 Contributing Guidelines

  1. Fork the repository
  2. Create a feature branch
  3. Implement new Behave features, SPDK workloads, or fault injection methods.
  4. Run behave locally and verify results.
  5. Submit a Pull Request with a clear description.

🪪 License

Released under the MIT License: free to use, modify, and distribute.


📬 Contact: Bang Thien Nguyen (ontario1998@gmail.com)


“Performance is a feature, and reliability is its foundation.”
