Note: Mark vague entries that lack a measurable target, interface specification, or test strategy with
`<!-- TODO: add measurable target, interface spec, test strategy -->`.
- RocksDB-based persistent storage with MVCC, WAL, BlobDB, multi-path SSTables, and async I/O
- Multi-model key encoding (relational, document, graph, vector, time-series) via `KeySchema`
- Backup/PITR, blob backends (Filesystem, S3, Azure Blob, WebDAV, GCS), and RAID-1 redundancy
- Field-level AES-256-GCM encryption, HMAC-SHA256 tamper detection, and structured audit logging
- Columnar format for analytics, adaptive write batching, and pluggable per-table compression
- Raft-MVCC integration (`RaftMVCCBridge`), hybrid logical clocks (HLC), and transaction retry management
- Tiered storage (hot NVMe → warm SATA → cold object storage) implemented in v1.6.0
- All blob backends must implement `IBlobBackend`; no direct SDK calls outside backend implementation files
- `THEMIS_PRODUCTION_MODE` must reject no-op (plaintext) encryption at startup — fail-closed is mandatory
- Data migration (tier moves, schema changes) must use copy-then-delete; a partial migration must never leave data inconsistent
- All new RocksDB column families must be registered in `KeySchema` before use; ad-hoc CF creation is forbidden
- `StorageAuditLogger` must record every write, delete, and compaction with caller identity and HLC timestamp
- The public `StorageEngine` API is frozen for v1.x; all new capabilities must be additive (no removed or renamed methods)
- Transaction retries must use exponential backoff with jitter; the max retry count must be configurable via `TransactionRetryConfig`
| Interface | Consumer | Notes |
|---|---|---|
| `IBlobBackend` | `BlobStorageManager`, `BlobRedundancyManager` | `put`/`get`/`delete`/`exists`; implemented by Filesystem, S3, Azure, GCS |
| `IEncryptionProvider` | `StorageEngine`, `SecuritySignatureManager` | AES-256-GCM key provision and rotation |
| `IKeyProvider` | `StorageEngine` | Key derivation per tenant and column family |
| `IIndexManager` | `StorageEngine`, `IndexMaintenance` | Rebuild, optimize, consistency check |
| `ICompressionStrategy` | `CompressedStorage`, `ColumnarFormat` | Snappy, Zstd, LZ4, Brotli, None; selected per column family |
| `IRaftMVCCBridge` | `MVCCStore`, Raft consensus layer | Append Raft log entry → apply to MVCC snapshot atomically |
| `IAuditLogger` | All storage write/delete paths | Structured JSON audit trail with HLC timestamps and caller ID |
Priority: High Target Version: v1.9.0 Status: 🟡 In Progress
- Tier-aware fragmentation analysis per index (hot/warm/cold thresholds independently configurable)
- Produces `IndexAnalysisReport` with `IndexRecommendation` (NONE / UPDATE_STATS / REORGANIZE / PARTIAL_REBUILD / FULL_REBUILD)
- Cron-based automatic scheduling via the existing `CronExpression` parser (standard 5-field POSIX cron + @-shortcuts)
- AI/ML intervention hook (`IIndexAnalysisAdvisor`) that can override the rule-based recommendation
- Full YAML configuration via `config/index_analyze.yaml` (or inline in `config.yaml` under `index_analyze:`); see the sketch after this list
- `IIndexAnalysisAdvisor::advise()` must be thread-safe; the scheduler calls it from a background thread
- Thresholds for warm and cold tiers MUST be ≥ hot-tier thresholds (looser = cheaper maintenance)
- AI advisor is disabled by default (`ai_advisor.enabled: false`); opt-in to avoid unexpected interventions in production
- Cron scheduler uses a copy-then-read config snapshot to avoid holding the mutex during RocksDB I/O
- No new dependencies are introduced; yaml-cpp and `CronExpression` are already in the build
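A sketch of what `config/index_analyze.yaml` could look like under the design above; all key names except `index_analyze:` and `ai_advisor.enabled` are illustrative assumptions, not the final schema:

```yaml
# Hypothetical layout; the actual keys are defined by IndexAnalyzeConfig.
index_analyze:
  schedule: "0 3 * * *"        # 5-field POSIX cron: daily at 03:00
  ai_advisor:
    enabled: false             # opt-in; disabled by default
  thresholds:
    hot:                       # tightest thresholds: most eager maintenance
      fragmentation_pct: 20
      stats_age_hours: 24
    warm:                      # must be >= the hot-tier values
      fragmentation_pct: 40
      stats_age_hours: 72
    cold:
      fragmentation_pct: 60
      stats_age_hours: 168
```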
```cpp
// AI/ML intervention hook
class IIndexAnalysisAdvisor {
public:
    virtual ~IIndexAnalysisAdvisor() = default;  // interface base: virtual dtor
    virtual std::optional<std::pair<IndexRecommendation, std::string>>
    advise(const IndexAnalysisReport& report) = 0;
};

// Per-index analysis
Result<IndexAnalysisReport> IndexAnalyzer::analyze(
    const std::string& index_name,
    storage::StorageTierLevel tier,
    std::optional<TierThresholds> overrides = std::nullopt);

// Batch analysis of all configured indices
std::vector<IndexAnalysisReport> IndexAnalyzer::analyzeAll();

// YAML config load
static Result<IndexAnalyzeConfig> IndexAnalyzeConfig::fromYamlFile(const std::string& path);

// Cron scheduler lifecycle
Result<void> startScheduled(); // validates cron expression, launches background thread
void stopScheduled();          // graceful shutdown
```

- Fragmentation estimate: `(total_sst - live_sst) / total_sst * 100 + l0_files * 2.0` using the RocksDB properties `rocksdb.total-sst-files-size`, `rocksdb.live-sst-files-size`, and `rocksdb.num-files-at-level0` (see the sketch after this list)
- Statistics staleness: tracked via `stats_age_hours`; initially approximated at 2 h until a metadata CF for stats-update-time is introduced
- AI advisor receives the fully populated preliminary `IndexAnalysisReport`; returning `std::nullopt` leaves the rule-based recommendation unchanged
- Scheduler loop: compute `CronExpression::getNextExecution(now)`, `cv_.wait_until(*next)`, run `analyzeAll()`, repeat
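A minimal sketch of the fragmentation estimate, assuming a raw `rocksdb::DB*` handle; in the real code path this would presumably live behind `RocksDBWrapper`:

```cpp
#include <rocksdb/db.h>
#include <string>

// Estimate fragmentation (%) for one column family using the formula above:
//   (total_sst - live_sst) / total_sst * 100 + l0_files * 2.0
double estimateFragmentation(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
    uint64_t total_sst = 0, live_sst = 0;
    db->GetIntProperty(cf, "rocksdb.total-sst-files-size", &total_sst);
    db->GetIntProperty(cf, "rocksdb.live-sst-files-size", &live_sst);

    // num-files-at-level0 is exposed as a string property
    std::string l0_str = "0";
    db->GetProperty(cf, "rocksdb.num-files-at-level0", &l0_str);
    const double l0_files = std::stod(l0_str);

    if (total_sst == 0) return 0.0;  // empty CF: nothing to analyze
    const double dead_ratio =
        static_cast<double>(total_sst - live_sst) / static_cast<double>(total_sst);
    return dead_ratio * 100.0 + l0_files * 2.0;
}
```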
- IA-01…IA-15 in `tests/test_index_analyzer.cpp` (`IndexAnalyzerFocusedTests`)
- IA-01..IA-04: threshold defaults and tier dispatch (pure logic, no DB)
- IA-05..IA-06: YAML load – error path + valid config
- IA-07: ctor guard for null `db_wrapper`
- IA-08..IA-09: `setConfig` / `setAdvisor` thread safety
- IA-10..IA-14: `classify()` function covering all five recommendations
- IA-15: `lastReports()` empty before first run
- `analyzeAll()` for 100 indices: ≤ 50 ms wall-clock (RocksDB property reads are non-blocking)
- Scheduler thread wake-up overhead: ≤ 1 ms (uses `condition_variable::wait_until`)
- AI advisor timeout: caller-controlled; no internal timeout imposed by `IndexAnalyzer`
- Cron expression is validated before thread launch; a malformed expression returns an error and no thread is started
- AI advisor exceptions are caught and logged; the rule-based recommendation is preserved
- `stopScheduled()` is called from the destructor; no thread is left running after object destruction
Priority: Medium Target Version: v1.8.0
`rocksdb_wrapper.cpp` line 1445: "TODO: Implement proper size calculation". The `RocksDBWrapper::getApproximateSize()` (or equivalent) method returns 0 or a placeholder, making disk-space monitoring, compaction triggers, and admin API storage metrics unreliable.
Implementation Notes:
- [x] Use the `rocksdb::DB::GetApproximateSizes()` API to compute the on-disk SST file sizes for a key range.
- [x] Alternatively, use `rocksdb::DB::GetIntProperty(rocksdb::DB::Properties::kTotalSstFilesSize)` for the total CF size.
- [x] Wire the result into `DiskSpaceMonitor` and the `/v1/admin/storage/stats` endpoint.
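A minimal sketch of the range-size computation, assuming a raw `rocksdb::DB*` and the default column family; the wrapper method would add error handling via its `Result` type:

```cpp
#include <rocksdb/db.h>

// Approximate on-disk size (bytes) of a key range, from SST files only.
uint64_t approximateRangeSize(rocksdb::DB* db,
                              const rocksdb::Slice& start,
                              const rocksdb::Slice& end) {
    rocksdb::Range range(start, end);
    uint64_t size = 0;

    rocksdb::SizeApproximationOptions opts;
    opts.include_files = true;       // count SST file bytes
    opts.include_memtables = false;  // exclude not-yet-flushed writes

    db->GetApproximateSizes(opts, db->DefaultColumnFamily(), &range, 1, &size);
    return size;
}
```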
Priority: Medium Target Version: v1.8.0
`security_signature_manager.cpp` line 110: "TODO: Implement proper RocksDB iteration when RocksDBWrapper supports it". Without iteration, the signature manager cannot verify integrity across all stored records.
Implementation Notes:
- [x] Add `RocksDBWrapper::iterateRange(start_key, end_key, callback)` that uses a `rocksdb::Iterator` under the hood.
- [x] Wire into `SecuritySignatureManager::verifyAll()` to scan all document keys and verify their signatures in sequence.
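A minimal sketch of such an `iterateRange()` helper, assuming a raw `rocksdb::DB*`; the callback returns `false` to stop iteration early:

```cpp
#include <rocksdb/db.h>
#include <functional>
#include <memory>

void iterateRange(rocksdb::DB* db,
                  const rocksdb::Slice& start_key,
                  const rocksdb::Slice& end_key,
                  const std::function<bool(const rocksdb::Slice& key,
                                           const rocksdb::Slice& value)>& callback) {
    rocksdb::ReadOptions opts;
    opts.iterate_upper_bound = &end_key;  // iterator stops before end_key

    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(opts));
    for (it->Seek(start_key); it->Valid(); it->Next()) {
        if (!callback(it->key(), it->value())) break;
    }
    // The real wrapper should check it->status() to surface I/O errors.
}
```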
Priority: Low Target Version: v1.8.0 Status: ✅ Implemented
`BlobRedundancyManager::createRocksDBListener()` returns a working `RocksDBBlobListener` subclassing `rocksdb::EventListener`. It overrides `OnTableFileDeleted` to mark affected blob locations unhealthy when backing SST files are deleted during compaction. The listener is registered via `rocksdb::Options::listeners` at database open time.
Implementation Notes:
- [x] `BlobRedundancyEventListener` subclasses `rocksdb::EventListener`; `OnTableFileDeleted` triggers re-replication of blobs whose backing SST was deleted.
- [x] Registered via `rocksdb::Options::listeners` at database open time.
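A minimal sketch of the listener wiring; `markUnhealthyBySstPath()` is a hypothetical name for the `BlobRedundancyManager` entry point that flags affected blob locations:

```cpp
#include <rocksdb/listener.h>
#include <memory>

class RocksDBBlobListener : public rocksdb::EventListener {
public:
    explicit RocksDBBlobListener(BlobRedundancyManager& mgr) : mgr_(mgr) {}

    void OnTableFileDeleted(const rocksdb::TableFileDeletionInfo& info) override {
        // An SST was removed (e.g., by compaction): mark blob locations
        // that referenced it as unhealthy so re-replication can kick in.
        mgr_.markUnhealthyBySstPath(info.file_path);  // hypothetical entry point
    }

private:
    BlobRedundancyManager& mgr_;
};

// Registration at database open time:
// options.listeners.push_back(std::make_shared<RocksDBBlobListener>(redundancy_mgr));
```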
Priority: High Target Version: v1.7.0 Status: ✅ Implemented
Raft-based distributed transactions across multiple nodes.
Features:
- Two-phase commit (2PC) protocol
- Cross-shard atomic operations
- Abort on any participant vote NO
- Thread-safe coordinator with per-transaction handles
Implementation:
- `include/storage/distributed_transaction_manager.h` — `DistributedTransactionManager`, `IDistributedShardParticipant`, `DistributedTransaction`, `DistributedOperation`
- `src/storage/distributed_transaction_manager.cpp` — full 2PC implementation
API:
```cpp
DistributedTransactionManager dtx_manager(nodes);
auto tx = dtx_manager.beginDistributedTransaction();

// Write to multiple shards
tx->put("shard1:key1", "value1");
tx->put("shard2:key2", "value2");

// Commit atomically across shards
tx->commit(); // 2PC protocol
```

Use Cases:
- Multi-tenant data isolation
- Geographic data distribution
- Horizontal scaling
Priority: High Target Version: v1.6.0
Automatic data migration based on access patterns.
Tiers:
- Hot: NVMe SSDs (frequent access)
- Warm: SATA SSDs (moderate access)
- Cold: Object storage (rare access, archival)
Policies:
- Age-based: Move data older than N days to warm/cold
- Access-based: Move rarely accessed data to cold tier
- Size-based: Move large blobs to object storage
Configuration:
```cpp
TieredStorageConfig config;
config.hot_tier_path = "/nvme/data";
config.warm_tier_path = "/sata/data";
config.cold_tier_backend = "s3://archive-bucket";
config.hot_to_warm_days = 30;
config.warm_to_cold_days = 90;

TieredStorageManager tiered(config);
```

Priority: Medium Target Version: v1.7.0 Status: ✅ Implemented
Space-efficient redundancy using erasure codes (e.g., Reed-Solomon).
Encoding Schemes:
- RS(10,4): 10 data + 4 parity blocks (40% overhead vs 200% for mirroring)
- RS(6,3): 6 data + 3 parity blocks (50% overhead)
- RS(4,2): 4 data + 2 parity blocks (50% overhead, faster)
Benefits:
- 50-70% storage savings vs mirroring
- Survives multiple node failures
- Configurable fault tolerance
Implementation: `src/storage/erasure_coding_backend.cpp` / `include/storage/erasure_coding_backend.h`
`BlobRedundancyManager` activates erasure coding automatically when `BlobRedundancyConfig::mode == RedundancyMode::PARITY`.
Example:
```cpp
ErasureCodingConfig config;
config.data_shards = 10;  // RS(10,4): 40% overhead vs 200% for mirroring
config.parity_shards = 4;

ErasureCodingBackend backend(config);
backend.put("blob-123", data);  // Automatically encodes and distributes

// Survives loss of up to 4 blocks
auto result = backend.get("blob-123");  // Reconstructs from available blocks
```

Priority: Medium Target Version: v1.7.0
Zero-downtime schema changes for relational and document models.
Supported Operations:
- Add/drop columns
- Rename columns
- Change column types
- Add/drop indexes
- Partition tables
Migration Framework:
```cpp
SchemaMigrator migrator(storage);

// Define migration
migrator.addColumn("users", "phone_number", "VARCHAR(20)");
migrator.renameColumn("users", "email", "email_address");
migrator.addIndex("users", "email_address");

// Apply migration online (no downtime)
migrator.migrate(); // Background process, versioned migrations
```

Priority: Low Target Version: v1.8.0 Status: ✅ Implemented (v1.8.0)
Alternative to LSM-tree for write-heavy workloads.
Implementation: `src/storage/wom_tree.cpp` / `include/storage/wom_tree.h`
Advantages:
- Lower write amplification (2-5x vs 10-30x for LSM)
- Better for update-heavy workloads
- Reduced compaction overhead
Trade-offs:
- Higher space amplification
- Slower point reads
Priority: Low Target Version: v1.9.0
Immutable append-only log with cryptographic hashing for audit trails.
Features:
- Merkle tree structure
- Hash chaining (each block references the previous; see the sketch after this list)
- Tamper detection
- Replay protection
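A minimal sketch of the hash-chaining and verification logic, assuming a `sha256()` helper from the existing crypto utilities; the block layout and field names are illustrative:

```cpp
#include <cstdint>
#include <string>
#include <vector>

std::string sha256(const std::string& data);  // assumed helper (hex output)

struct LogBlock {
    uint64_t    index;
    std::string payload;
    std::string prev_hash;  // hash of the previous block: the "chain"
    std::string hash;       // sha256(index || payload || prev_hash)
};

// Append-only: a new block commits to the entire history via prev_hash.
LogBlock makeBlock(const std::vector<LogBlock>& chain, std::string payload) {
    LogBlock b;
    b.index = chain.size();
    b.payload = std::move(payload);
    b.prev_hash = chain.empty() ? std::string(64, '0') : chain.back().hash;
    b.hash = sha256(std::to_string(b.index) + b.payload + b.prev_hash);
    return b;
}

// Tamper detection: recompute every hash; any edit breaks all later links.
bool verifyChain(const std::vector<LogBlock>& chain) {
    for (size_t i = 0; i < chain.size(); ++i) {
        const LogBlock& b = chain[i];
        const std::string expected_prev =
            (i == 0) ? std::string(64, '0') : chain[i - 1].hash;
        if (b.prev_hash != expected_prev) return false;
        if (b.hash != sha256(std::to_string(b.index) + b.payload + b.prev_hash))
            return false;
    }
    return true;
}
```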
Use Cases:
- Audit logs
- Financial transactions
- Compliance requirements
Priority: High Target Version: v1.6.0 Status: ✅ Implemented
Use CUDA/ROCm for parallel compression/decompression.
Target Algorithms:
- ✅ Zstd (NVIDIA nvCOMP library) — `GpuCompressionAlgorithm::ZSTD` / `CompressionMethod::GPU_ZSTD`
- ✅ Snappy (GPU-accelerated variant) — `GpuCompressionAlgorithm::SNAPPY` / `CompressionMethod::GPU_SNAPPY`
- ✅ LZ4 (parallel decompress) — `GpuCompressionAlgorithm::LZ4` / `CompressionMethod::GPU_LZ4`
Implementation: `src/storage/gpu_compression.cpp` / `include/storage/gpu_compression.h`
GPU paths are enabled via the `THEMIS_ENABLE_CUDA` (nvCOMP) or `THEMIS_ENABLE_HIP` (ROCm) compile-time flags. All algorithms transparently fall back to CPU implementations (`zstd_codec`, libsnappy, liblz4) when no GPU is present, ensuring zero-dependency operation in environments without CUDA/HIP toolchains.
Expected Improvement: 5-10x compression throughput
Priority: High Target Version: v1.6.0
Leverage NVMe-specific features for better performance.
Optimizations:
- `io_uring`: Linux async I/O framework (see the sketch below)
- Multi-queue: Parallel I/O submission
- Zone namespaces (ZNS): Direct control over flash management
- Direct I/O: Bypass page cache for predictable latency
Expected Improvement: 30-50% lower latency, 2x throughput
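A minimal sketch of a single asynchronous read via liburing, to illustrate the `io_uring` submit/complete flow; error handling is elided and the queue depth (8) is arbitrary:

```cpp
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>

// Read 4 KiB from an NVMe-backed file with io_uring + Direct I/O.
int asyncReadExample(const char* path) {
    io_uring ring;
    io_uring_queue_init(8, &ring, 0);          // small submission/completion queues

    int fd = open(path, O_RDONLY | O_DIRECT);  // Direct I/O: bypass the page cache
    void* buf = nullptr;
    posix_memalign(&buf, 4096, 4096);          // O_DIRECT requires aligned buffers

    io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, /*offset=*/0);
    io_uring_submit(&ring);                    // hand the request to the kernel

    io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);            // block until completion
    int bytes_read = cqe->res;                 // < 0 means -errno
    io_uring_cqe_seen(&ring, cqe);

    free(buf);
    close(fd);
    io_uring_queue_exit(&ring);
    return bytes_read;
}
```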
Priority: Medium Target Version: v1.7.0
Machine learning-based compaction scheduling.
Approach:
- Monitor read/write patterns
- Predict compaction impact
- Schedule compactions during low-load periods
- Adjust compaction triggers dynamically
Expected Improvement: 20-30% less compaction CPU overhead
Priority: Medium Target Version: v1.7.0
Eliminate memory copies when transferring blobs between backends.
Techniques:
- `sendfile()` for local filesystem (see the sketch below)
- S3 multipart upload with streaming
- Memory-mapped files for large blobs
Expected Improvement: 40-60% faster blob transfers
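A minimal sketch of the local-filesystem path using Linux `sendfile(2)`, which moves bytes between file descriptors without copying them through userspace; retry/errno handling is elided:

```cpp
#include <sys/sendfile.h>
#include <cstddef>

// Copy `length` bytes from src_fd to dst_fd with zero userspace copies.
bool zeroCopyTransfer(int src_fd, int dst_fd, size_t length) {
    off_t offset = 0;  // sendfile advances this for us
    while (static_cast<size_t>(offset) < length) {
        ssize_t sent = sendfile(dst_fd, src_fd, &offset,
                                length - static_cast<size_t>(offset));
        if (sent <= 0) return false;  // real code: retry on EINTR, report errno
    }
    return true;
}
```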
Priority: Low Target Version: v1.8.0
Replace standard Bloom filters with more efficient alternatives.
Options:
- Cuckoo filters: 20% less space for same false positive rate
- Blocked Bloom filters: Better cache locality
- Xor filters: Fastest queries, immutable
Expected Improvement: 15-20% faster point lookups
Priority: Medium Target Version: v1.7.0
Split monolithic RocksDBWrapper into specialized classes.
Proposed Structure:
```
RocksDBWrapper (main interface)
├─ RocksDBReader (read operations)
├─ RocksDBWriter (write operations)
├─ RocksDBTransaction (transaction mgmt)
├─ RocksDBIterator (iteration)
├─ RocksDBConfig (configuration)
└─ RocksDBMonitor (statistics/metrics)
```
Benefits:
- Smaller, more focused classes
- Easier testing
- Better separation of concerns
Priority: Low Target Version: v1.8.0
Create more granular interfaces for blob backends.
Current:
class IBlobStorageBackend {
// All methods required
};Proposed:
```cpp
class IBlobReader {
public:
    virtual ~IBlobReader() = default;
    virtual Result<std::vector<uint8_t>> get(const std::string& id) = 0;
};

class IBlobWriter {
public:
    virtual ~IBlobWriter() = default;
    virtual Result<BlobRef> put(const std::string& id, const std::vector<uint8_t>& data) = 0;
};

class IBlobDeleter {
public:
    virtual ~IBlobDeleter() = default;
    virtual Result<void> del(const std::string& id) = 0;
};

// Compose interfaces
class IBlobStorageBackend : public IBlobReader, public IBlobWriter, public IBlobDeleter {};
```

Benefits:
- Read-only backends (S3 Glacier)
- Write-only backends (append-only logs)
- Fine-grained permissions
Priority: Medium Target Version: v1.7.0
Merge BackupManager and PITRManager into single cohesive API.
Proposed:
```cpp
class RecoveryManager {
public:
    // Backup operations
    Result<BackupId> createBackup(BackupType type = INCREMENTAL);
    Result<void> restoreBackup(BackupId id);

    // PITR operations
    Result<SnapshotId> createSnapshot();
    Result<void> restoreToTimestamp(Timestamp ts);

    // Combined operations
    Result<void> restoreToBackupOrSnapshot(Timestamp ts); // Auto-select best method
};
```

Priority: Low Target Version: v1.8.0
Allow custom key encoding schemes via a plugin API (see the sketch after the list below).
Benefits:
- Domain-specific key formats
- Custom sorting orders
- Tenant-specific schemas
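A sketch of what such a plugin interface could look like; `IKeyEncoder` and its methods are hypothetical, not an existing API:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical plugin interface for custom key encoding schemes.
class IKeyEncoder {
public:
    virtual ~IKeyEncoder() = default;

    // Encode a logical key into the byte order RocksDB will sort by.
    virtual std::vector<uint8_t> encode(const std::string& logical_key) const = 0;

    // Inverse mapping; must round-trip exactly for every encoded key.
    virtual std::string decode(const std::vector<uint8_t>& storage_key) const = 0;

    // Name under which the plugin registers (e.g., per tenant or table).
    virtual std::string name() const = 0;
};

// Registration sketch: KeySchema would look encoders up by name, e.g.
// key_schema.registerEncoder(std::make_unique<TenantPrefixEncoder>());
```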
Severity: Medium Reported: v1.5.0
Write stalls occur when L0 files accumulate faster than compaction.
Workaround:
```cpp
config.level0_slowdown_writes_trigger = 20;
config.level0_stop_writes_trigger = 36;
config.max_background_jobs = 8; // Increase compaction threads
```

Fix: Implement adaptive compaction scheduling
Planned Fix: v1.6.0
Severity: Medium Reported: v1.5.0
BlobStorageManager doesn't automatically retry on backend failure.
Workaround: Manually retry or use BlobRedundancyManager
Fix: Add automatic failover with circuit breaker pattern
Planned Fix: v1.6.0
Severity: Low Reported: v1.5.1
Old PITR snapshots accumulate, consuming disk space.
Workaround: Manually call `cleanupOldSnapshots()`
Fix: Add background cleanup job with retention policy
Planned Fix: v1.6.1
Severity: Low Reported: v1.5.2
Default backoff can lead to long delays for contended keys.
Workaround: Configure custom backoff strategy
Fix: Implement jittered exponential backoff (see the sketch below)
Planned Fix: v1.6.0
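A minimal sketch of the planned fix using "full jitter": exponential growth sets the ceiling, and a uniform random draw below it desynchronizes retries on contended keys. The base/cap defaults stand in for fields of `TransactionRetryConfig`:

```cpp
#include <algorithm>
#include <chrono>
#include <random>

std::chrono::milliseconds jitteredBackoff(
    int attempt,  // 0-based retry count
    std::chrono::milliseconds base = std::chrono::milliseconds(10),
    std::chrono::milliseconds cap  = std::chrono::milliseconds(5000)) {
    // Exponential ceiling, clamped to the cap (shift capped to avoid overflow)
    const long long ceiling = std::min<long long>(
        cap.count(), base.count() << std::min(attempt, 20));

    // Full jitter: uniform in [0, ceiling]
    static thread_local std::mt19937_64 rng{std::random_device{}()};
    std::uniform_int_distribution<long long> dist(0, ceiling);
    return std::chrono::milliseconds(dist(rng));
}
```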
Severity: Resolved Reported: v1.5.0 Fixed: v2.0.0
Columnar storage correctness issues are resolved; `ColumnarFormat` is production-ready for analytical workloads. `SIMDColumnFilter` provides vectorized predicate evaluation; `StorageParquetExporter` provides native Parquet v2 export.
Focus: Next-generation memory expansion
Explore CXL for:
- Disaggregated memory pools
- Shared RocksDB block cache across nodes
- Near-memory computation
Research Questions:
- Can we share memtables via CXL?
- What's the latency impact?
- How to handle coherency?
Focus: ML-based indexes for sorted data
Replace traditional B-trees with learned models:
- Predict key position from model
- Reduce memory footprint (models vs index)
- Faster lookups for skewed distributions
Research Questions:
- What ML models work best?
- How to handle updates efficiently?
- Can we learn RocksDB SSTable positions?
Focus: Intel Optane or future PMem technologies
Use PMem for:
- Persistent memtables (no WAL needed)
- Write-ahead log (faster than disk)
- Block cache (survive restarts)
Research Questions:
- How to handle PMem errors?
- What's the durability guarantee?
- Can we bypass filesystem?
Focus: Alternative to LSM-tree for MVCC
Explore MVBT for:
- Lower write amplification
- Better point update performance
- Simpler compaction
Research Questions:
- How to integrate with RocksDB?
- What's the space overhead?
- Can we match LSM scan performance?
Focus: Post-quantum cryptography for field encryption
Prepare for quantum computers:
- NIST PQC algorithms (CRYSTALS-Kyber, CRYSTALS-Dilithium)
- Hybrid classical/quantum schemes
- Key rotation strategies
Research Questions:
- What's the performance impact?
- How to migrate existing data?
- Which algorithms to standardize on?
Breaking Changes: None (additive)
New APIs:
```cpp
TieredStorageManager tiered(config);
storage->setTieredStorage(tiered);
```

Migration Steps:
- Update to v1.6.0
- Configure tiered storage
- Enable automatic data migration
- Monitor tier distribution
Timeline: 3 months gradual rollout
Breaking Changes: None (the transaction API is extended, backward compatible)
Old API:
```cpp
auto tx = db->beginTransaction();
tx->commit();
```

New API (backward compatible):

```cpp
auto tx = db->beginTransaction();                      // Local transaction
auto dtx = dtx_manager.beginDistributedTransaction();  // Distributed transaction
```

Migration Steps:
- Update to v1.7.0
- Test existing transactions (no changes needed)
- Optionally adopt distributed transactions
Timeline: 6 months parallel support
Breaking Changes: Storage engine selection at creation
Configuration Change:
```cpp
// Old: Always use RocksDB (LSM-tree)
RocksDBWrapper::Config config;

// New: Choose storage engine
StorageEngineConfig config;
config.engine_type = EngineType::LSM; // or EngineType::WOM
```

Migration Steps:
- Update to v1.8.0
- Default is still LSM-tree (no changes needed)
- Test WOM tree on non-critical workloads
- Migrate write-heavy workloads to WOM
Timeline: 12 months evaluation period
Breaking Changes: Major API redesign
Refactored API:
```cpp
// Old: RocksDBWrapper directly
RocksDBWrapper db(config);

// New: Generic storage interface
auto storage = StorageFactory::create(config);
// Automatically selects the best engine (LSM, WOM, etc.)
```

Migration Steps:
- Update to v2.0.0
- Replace RocksDBWrapper with StorageFactory
- Update configuration format
- Rebuild and test
Automated Migration Tool: `scripts/migrate_storage_v2.sh`
Timeline: 24 months deprecation period (v1.x maintained in parallel)
We welcome contributions in the following areas:
- Additional compression algorithms (Brotli, LZMA) — Brotli implemented in `CompressionStrategy`
- Blob backend for Google Cloud Storage — `blob_backend_gcs.cpp` (requires `THEMIS_ENABLE_GCS`)
- Improved error messages and logging — ongoing
- Performance benchmarks for different workloads — ongoing
- Automatic failover for blob backends
- PITR snapshot cleanup automation
- Jittered exponential backoff for transaction retries — implemented in `TransactionRetryManager`
- Additional merge operators (sets, counters) — `SetMergeOperator`, `CounterMergeOperator`, `AppendMergeOperator`, `MaxMergeOperator` in `merge_operators.cpp`
- Distributed transactions (Raft-based) — `DistributedTransactionManager` (v1.7.0)
- Tiered storage implementation — `TieredStorageManager` (v1.6.0)
- Erasure coding for blob storage — `ErasureCodingBackend`, RS(k,m) (v1.7.0)
- GPU-accelerated compression — `GpuCompressionManager`, CUDA/ROCm with CPU fallback
- NVMe optimizations (io_uring, ZNS) — `NVMeManager` (v1.6.0)
Contribution Guide: See CONTRIBUTING.md
Have ideas for storage improvements? We'd love to hear from you:
- 💡 Feature requests: GitHub Issues
- 💬 Design discussions: GitHub Discussions
- 🐛 Bug reports: GitHub Issues
- 📊 Performance results: Share benchmarks in discussions
Last Updated: April 2026 Module Version: v2.0.0 Next Review: v2.1.0 Release
- Unit test coverage ≥ 80% for all storage classes: `MVCCStore`, `WALStorage`, `BackupManager`, `PITRManager`, and `BatchWriteOptimizer`
- Integration tests for read-after-write, PITR restore to a specific HLC timestamp, and cross-backend blob round-trips (Filesystem, S3-emulator, Azure-emulator)
- Fault-injection tests: kill the process during WAL replay, corrupt SSTable block checksums, verify self-healing recovery without data loss
- Tiered storage integration test: write a key on the hot tier, trigger migration to warm, verify the read returns the original value within 1 s
- Erasure coding round-trip test: encode with RS(4,2), drop any 2 shards, verify full decode correctness
- Encryption regression test: confirm the process fails at startup under `THEMIS_PRODUCTION_MODE` when no AES-256-GCM key is configured
- Sustained write throughput ≥ 100,000 ops/s on NVMe with batch size of 256 writes at 4 KB average value size
- p99 point-read latency ≤ 1 ms for hot-tier key lookups with bloom filter enabled
- Incremental backup throughput ≥ 500 MB/s on NVMe using parallel SSTable file copy
- Tiered-storage migration background I/O overhead ≤ 5% of sustained foreground write throughput
- Columnar scan throughput ≥ 4× scalar baseline on integer equality predicates with AVX2 SIMD (v2.0.0 target)
- Write amplification factor ≤ 10× under sustained uniform-random-write workload with level-based compaction
- The `THEMIS_PRODUCTION_MODE` startup check must reject any configuration without AES-256-GCM encryption; no silent degradation to plaintext storage
- All blob backend credentials (S3 access keys, Azure SAS, GCS ADC) must be loaded from environment variables or a secret store; plaintext credentials in config files are rejected
- HMAC-SHA256 tamper detection is verified on every read via `SecuritySignatureManager`; a mismatch returns `StorageError::TAMPERED` and raises an audit log event (see the sketch after this list)
- WAL replay must be idempotent: re-applying the same WAL sequence number must produce the same storage state without duplicate side effects
- `DiskSpaceMonitor` triggers write rejection at 95% disk capacity to prevent WAL and SSTable corruption from space exhaustion
- All backup bundles include a SHA-256 checksum manifest; the restore operation aborts if any file checksum does not match the manifest
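A minimal sketch of the read-path tamper check using OpenSSL, assuming the stored record carries the HMAC-SHA256 tag computed at write time; key retrieval via `IKeyProvider` is elided:

```cpp
#include <openssl/crypto.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <vector>

// Returns true if the value matches its stored HMAC-SHA256 tag.
// On false, the caller returns StorageError::TAMPERED and emits an audit event.
bool verifyRecordHmac(const std::vector<uint8_t>& value,
                      const std::vector<uint8_t>& stored_tag,
                      const std::vector<uint8_t>& key) {
    unsigned char computed[EVP_MAX_MD_SIZE];
    unsigned int computed_len = 0;
    HMAC(EVP_sha256(), key.data(), static_cast<int>(key.size()),
         value.data(), value.size(), computed, &computed_len);

    // Constant-time comparison to avoid timing side channels
    if (computed_len != stored_tag.size()) return false;
    return CRYPTO_memcmp(computed, stored_tag.data(), computed_len) == 0;
}
```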
All planned features in this document are grounded in the following peer-reviewed research and industry specifications (IEEE format):
- P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil, "The log-structured merge-tree (LSM-tree)," Acta Informatica, vol. 33, no. 4, pp. 351–385, 1996, doi: 10.1007/s002360050048. — Foundational design of RocksDB (`rocksdb_wrapper.cpp`); informs compaction strategy, write amplification trade-offs, and the `BatchWriteOptimizer`.
- S. Dong, M. Callaghan, L. Galanis, D. Borthakur, T. Savor, and M. Strum, "Optimizing space amplification in RocksDB," in Proc. 8th Biennial Conf. Innovative Data Systems Research (CIDR), 2017. [Online]. Available: https://www.cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf [Accessed: 2026-03-10] — Informs `CompactionManager` tuning, level-based compaction, and the BlobDB value separation for write amplification reduction.
- D. Ongaro and J. Ousterhout, "In search of an understandable consensus algorithm," in Proc. USENIX Annual Technical Conf. (ATC), 2014, pp. 305–319. [Online]. Available: https://raft.github.io/raft.pdf [Accessed: 2026-03-10] — Informs the `RaftMVCCBridge` design and the planned two-phase-commit (2PC) distributed transactions (`DistributedTransactionManager`, ROADMAP v1.7.0).
- J. N. Gray and A. Reuter, Transaction Processing: Concepts and Techniques. San Mateo, CA: Morgan Kaufmann, 1992. — Informs the 2PC coordinator/participant protocol design for cross-shard atomicity and the `TransactionRetryManager` exponential backoff strategy.
- I. S. Reed and G. Solomon, "Polynomial codes over certain finite fields," J. SIAM, vol. 8, no. 2, pp. 300–304, 1960, doi: 10.1137/0108018. — Theoretical foundation for the Reed-Solomon erasure coding in `BlobRedundancyManager` (ROADMAP v1.7.0, `ErasureCodingConfig`).
- A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4539–4551, Sep. 2010, doi: 10.1109/TIT.2010.2054295. — Informs minimum storage regenerating (MSR) codes for the erasure coding redundancy mode; motivates the RS(4,2) default parameter choice.
- A. Verbitski et al., "Amazon Aurora: Design considerations for high throughput cloud-native relational databases," in Proc. ACM SIGMOD Int. Conf. Management of Data, 2017, pp. 1041–1052, doi: 10.1145/3035918.3056101. — Informs the `TieredStorageManager` design (hot/warm/cold with background migration) and the PITR restore architecture.
- T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, "The case for learned index structures," in Proc. ACM SIGMOD Int. Conf. Management of Data, 2018, pp. 489–504, doi: 10.1145/3183713.3196909. — Research basis for the Learned Index Structures exploration area; motivates replacing B-trees with ML models for RocksDB SSTable position prediction.
- O. Balmau, D. Didona, R. Guerraoui, W. Zwaenepoel, H. Yuan, A. Arora, K. Gupta, and P. Konka, "TRIAD: Creating synergies between memory, disk and log in log structured key-value stores," in Proc. USENIX Annual Technical Conf. (ATC), 2017, pp. 363–375. — Informs the Write-Optimized Merge (WOM) Tree research area; motivates alternative compaction strategies for update-heavy workloads.
- A. Kemper and T. Neumann, "HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots," in Proc. IEEE 27th Int. Conf. Data Engineering (ICDE), 2011, pp. 195–206, doi: 10.1109/ICDE.2011.5767867. — Informs the MVCC snapshot isolation design in `MVCCStore` and motivates the planned Multi-Version B-Trees (MVBT) research area.
- P. Boncz, M. Zukowski, and N. Nes, "MonetDB/X100: Hyper-pipelining query execution," in Proc. 2nd Biennial Conf. Innovative Data Systems Research (CIDR), 2005, pp. 225–237. — Informs the planned vectorized execution (AVX2 SIMD) in `ColumnarFormat` (ROADMAP v2.0.0, `simd_filter.cpp`).
- T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner, "SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units," Proc. VLDB Endow., vol. 2, no. 1, pp. 385–394, 2009, doi: 10.14778/1687627.1687671. — Provides algorithmic foundations for SIMD-accelerated integer equality and range predicates targeting the `ColumnarFormat` scan throughput goal (≥ 4× scalar baseline).
- Apache Software Foundation, "Apache Parquet format specification," Apache Parquet, 2023. [Online]. Available: https://parquet.apache.org/docs/file-format/ [Accessed: 2026-03-10] — Specification basis for the planned native Parquet export from `ColumnarFormat` (ROADMAP v2.0.0, `parquet_exporter.cpp`).
- National Institute of Standards and Technology, "Module-lattice-based key-encapsulation mechanism standard," FIPS 203, 2024, doi: 10.6028/NIST.FIPS.203. — Defines the CRYSTALS-Kyber (ML-KEM) standard referenced in the Quantum-Resistant Encryption research area.
- National Institute of Standards and Technology, "Module-lattice-based digital signature standard," FIPS 204, 2024, doi: 10.6028/NIST.FIPS.204. — Defines CRYSTALS-Dilithium (ML-DSA) for post-quantum signatures; relevant to the Quantum-Resistant Encryption research area in `security_signature.cpp`.
- J. Axboe, "Efficient I/O with io_uring," 2019. [Online]. Available: https://kernel.dk/io_uring.pdf [Accessed: 2026-03-10] — Basis for the NVMe Optimizations work (`io_uring` kernel bypass) and the Zero-Copy Blob Transfers performance area.
- P. Mishra, U. Roesler, J. Luo, and R. Zhao, "CXL: Enabling innovations in memory through an open industry-standard interconnect," IEEE Micro, vol. 41, no. 3, pp. 8–17, May–Jun. 2021, doi: 10.1109/MM.2021.3059102. — Reference for the CXL (Compute Express Link) Integration research area; motivates a disaggregated shared block cache across nodes.
- `src/storage/README.md` — Module overview and API entry points
- `src/storage/ARCHITECTURE.md` — Component diagram and data flow
- `src/storage/ROADMAP.md` — Implementation phases and planned milestones
- `docs/de/storage/README.md` — German-language secondary documentation
- `docs/de/storage/missing-implementations.md` — Reality-check findings
GAP-015 – identified via static analysis (2026-04-21). Reference: `docs/governance/SOURCECODE_COMPLIANCE_GOVERNANCE.md`.
Scope: `src/storage/backup_manager.cpp:940,979`
- libarchive (BSD licence) is already indirectly available in the dependency tree; it must be added as an explicit CMake target dependency
- Backup/restore semantics (gzip-compressed tarball) must be preserved
- The "directory" parameter from
POST /admin/backupmust be sandboxed to a configurable backup root before being passed toBackupManager
```cpp
// New helpers replacing system(tar):
Result<std::string> archiveDirectory(const std::filesystem::path& src_dir,
                                     const std::filesystem::path& dest_archive);
Result<std::string> extractArchive(const std::filesystem::path& src_archive,
                                   const std::filesystem::path& dest_dir);
```

- Uses `archive_write_open_filename` + `archive_write_add_filter_gzip` + `archive_write_set_format_pax_restricted` from `<archive.h>` (see the sketch after this list)
- Sandbox check: before calling `BackupManager`, `admin_api_handler.cpp` must validate that `body["directory"]` is within `config_.backup_root_dir`:

```cpp
if (!std::filesystem::weakly_canonical(dir).string().starts_with(backup_root)) {
    return makeErrorResponse(400, "backup path out of sandbox");
}
```

- `system()` calls are unsafe even with double-quoted arguments because a newline in `backup_dir` terminates the command and starts a new shell command
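A minimal sketch of the `archiveDirectory` write path with the libarchive calls named above; the `Result` wrapper and error handling are elided, and file contents are streamed in fixed-size chunks:

```cpp
#include <archive.h>
#include <archive_entry.h>
#include <filesystem>
#include <fstream>

// Sketch: pack src_dir into a gzip-compressed pax tarball at dest_archive.
bool archiveDirectorySketch(const std::filesystem::path& src_dir,
                            const std::filesystem::path& dest_archive) {
    archive* a = archive_write_new();
    archive_write_add_filter_gzip(a);            // .tar.gz
    archive_write_set_format_pax_restricted(a);  // portable tar variant
    archive_write_open_filename(a, dest_archive.string().c_str());

    for (const auto& p : std::filesystem::recursive_directory_iterator(src_dir)) {
        if (!p.is_regular_file()) continue;

        archive_entry* entry = archive_entry_new();
        const auto rel = std::filesystem::relative(p.path(), src_dir);
        archive_entry_set_pathname(entry, rel.string().c_str());
        archive_entry_set_size(entry, static_cast<la_int64_t>(p.file_size()));
        archive_entry_set_filetype(entry, AE_IFREG);
        archive_entry_set_perm(entry, 0644);
        archive_write_header(a, entry);

        std::ifstream in(p.path(), std::ios::binary);
        char buf[8192];
        while (in.read(buf, sizeof(buf)) || in.gcount() > 0)
            archive_write_data(a, buf, static_cast<size_t>(in.gcount()));
        archive_entry_free(entry);
    }
    const bool ok = (archive_write_close(a) == ARCHIVE_OK);
    archive_write_free(a);
    return ok;
}
```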
- Unit test: create a temp directory, archive it via `archiveDirectory`, extract it via `extractArchive`, compare the file trees
- Unit test: path `"../../etc"` as backup dir → 400 from the handler, no archive created
- Fuzz test: random path strings as `backup_dir` → no shell command executed
- Throughput: ≥ gzip(1) baseline (libarchive uses the same zlib backend)
- No measurable regression in CI backup integration tests
- No shell process is spawned, so shell injection and stray signal delivery to a child shell are eliminated
- On libarchive error: return a structured error; no partial archive is left on disk