Note: Mark vague entries that lack a measurable target, interface specification, or test strategy with `<!-- TODO: add measurable target, interface spec, test strategy -->`.
This document covers planned enhancements to ThemisDB's shared utilities subsystem, which provides foundational cross-cutting infrastructure consumed by all other modules: audit logging (audit_logger.cpp), cursor/pagination (cursor.cpp), HKDF key derivation (hkdf_helper.cpp, hkdf_cache.cpp), LEK (Local Encryption Key) management (lek_manager.cpp), structured logging (logger.cpp), data normalisation (normalizer.cpp), PII detection and pseudonymisation (pii_detection_engine.cpp, pii_detector.cpp, pii_pseudonymizer.cpp), PKI client (pki_client.cpp), retention manager (retention_manager.cpp), SAGA logging (saga_logger.cpp), serialisation (serialization.cpp), text processing (stemmer.cpp, stopwords.cpp), distributed tracing (tracing.cpp), ZSTD codec (zstd_codec.cpp), and geospatial utilities (geo/). The module is Beta-stage and is the primary dependency of all other ThemisDB modules.
- All utilities must be header-safe for inclusion in multiple translation units; shared state must be explicitly managed through thread-safe singleton or injected instances.
- The PII detection and pseudonymisation pipeline must operate in a streaming fashion to handle arbitrarily large documents without loading them fully into memory.
- Key derivation via `hkdf_helper.cpp` must use HKDF-SHA-256 per RFC 5869; raw symmetric keys must never appear in log output, audit records, or error messages.
- Utilities must have zero mandatory external network dependencies at runtime; `pki_client.cpp` and `tracing.cpp` degrade gracefully when their upstream services are unreachable.
| Interface | Consumer | Notes |
|---|---|---|
| `AuditLogger::log(event)` | All modules requiring compliance audit trail | Structured JSON; append-only; tamper-evident |
| `PIIDetectionEngine::scan(text)` | `prompt_engineering`, `training`, `ingestion` | Returns span list with entity type and confidence |
| `PIIPseudonymizer::pseudonymize(text, policy)` | Same as PII detection consumers | Deterministic per tenant; reversible under admin key |
| `HKDFHelper::derive_key(ikm, info, length)` | `timeseries`, `sharding`, `security` | Must not cache raw IKM beyond a single derive call |
| `LEKManager::get_current_dek(series_id)` | `timeseries`, storage module | Returns AES-256 DEK; rotates on schedule |
| `ZSTDCodec::compress()` / `decompress()` | `sharding`, `timeseries`, exporters | Streaming API; supports preset levels 1–22 |
| `SAGALogger::log_step(txn_id, step, state)` | `sharding`, transaction module, `training` | Write-ahead log for SAGA compensation recovery |
| `Tracing::start_span(name, parent)` | All modules | OpenTelemetry-compatible; no-op when collector unreachable |
Priority: Medium Target Version: v1.8.0
capability_auto_generator.cpp has 3 TODOs:
- Line 193: "Check last update time and compare with schedule interval" — schedule checking is a no-op; updates run every invocation regardless of interval.
- Line 373: "Store previous document count somewhere" — document count delta cannot be computed without persistence.
- Line 435: "Implement YAML serialization and file writing" — generated capabilities are not persisted to disk.
Implementation Notes:
- [x] Use a small RocksDB key (`utils_capgen_state`) to persist `last_run_timestamp` and `last_document_count`; load on construction.
- [x] At line 193: compare `now - last_run_timestamp` against `config_.schedule_interval_s`; skip regeneration if within interval.
- [x] At line 373: persist the current `document_count` to the state key after each successful run.
- [x] At line 435: serialize the generated `CapabilitySet` to YAML using `yaml-cpp` and atomically write via `ConfigPathResolver::resolveWritable()`.
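The interval check and document-count delta described in the notes above can be sketched as follows. This is a minimal illustration only: `CapGenState`, `regeneration_due`, and `document_count_delta` are hypothetical names, and the RocksDB load/store of the state key is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical in-memory view of the persisted state key; the field names
// follow the implementation notes, the struct itself is an assumption.
struct CapGenState {
    std::int64_t last_run_timestamp = 0;    // seconds since epoch, 0 = never ran
    std::uint64_t last_document_count = 0;  // count stored after the last successful run
};

// Returns true when a regeneration run is due.
inline bool regeneration_due(const CapGenState& state,
                             std::int64_t now_s,
                             std::int64_t schedule_interval_s) {
    assert(schedule_interval_s >= 0);
    if (state.last_run_timestamp == 0) return true;     // first run
    if (now_s < state.last_run_timestamp) return true;  // clock moved backwards: run rather than stall
    return now_s - state.last_run_timestamp >= schedule_interval_s;
}

// Signed change in document count since the last persisted run.
inline std::int64_t document_count_delta(const CapGenState& state,
                                         std::uint64_t current_count) {
    return static_cast<std::int64_t>(current_count) -
           static_cast<std::int64_t>(state.last_document_count);
}
```

Treating a backwards clock jump as "due" is one possible policy; skipping the run until the clock catches up would be equally defensible.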
Priority: High Target Version: v1.8.0
pki_client.cpp has 2 stub fallback paths:
- Line 456: "Fallback: stub behavior (base64 of hash)" — certificate issuance falls back to a non-standard base64 encoding instead of real PKCS#10 / X.509 certificate.
- Line 575: "Fallback stub verification: compare base64(hash) equality" — TLS certificate verification falls back to comparing base64 hashes instead of validating the certificate chain.
Implementation Notes:
- [x] Line 456: implement real PKCS#10 CSR generation and submission using the OpenSSL `X509_REQ_*` API; only fall back when ACME/internal CA is not configured.
- [x] Line 575: implement real X.509 chain verification using `X509_verify_cert()` with the configured trust store; never fall back to hash comparison for production traffic.
- [x] Add an explicit `#ifdef THEMIS_TEST_MODE` guard around the stub paths so they cannot be used in production builds.
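One way the `THEMIS_TEST_MODE` guard could look is sketched below. The function `verify_with_fallback` and its parameters are hypothetical and stand in for the real verification path; the point is only that the hash-comparison stub is compiled out unless the macro is defined, so a production build fails closed.

```cpp
#include <cassert>
#include <string>

// Hypothetical wrapper: `have_chain_result` says whether real X.509 chain
// verification ran, `chain_valid` carries its outcome. The two base64
// strings are the stub inputs from the legacy fallback path.
inline bool verify_with_fallback(bool have_chain_result,
                                 bool chain_valid,
                                 const std::string& expected_b64,
                                 const std::string& actual_b64) {
    if (have_chain_result) return chain_valid;
#ifdef THEMIS_TEST_MODE
    // Test builds only: legacy stub comparison.
    return expected_b64 == actual_b64;
#else
    // Production builds: the stub is not compiled in; reject.
    (void)expected_b64;
    (void)actual_b64;
    return false;
#endif
}
```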
Priority: High Target Version: v0.9.0
Refactor pii_detection_engine.cpp and pii_pseudonymizer.cpp to operate on a chunked streaming interface so that large legal documents (>100 MB) can be processed without full in-memory buffering. Entity spans that straddle chunk boundaries must be detected and merged correctly.
Implementation Notes:
- [x] Define a `PIIStreamScanner` class in `pii_detection_engine.cpp` with `scan_chunk(chunk, is_last)` → `PIISpanList`; internally maintains a lookahead buffer sized to the maximum entity length (configurable, default 256 bytes) to handle cross-boundary spans.
- [x] `pii_pseudonymizer.cpp` adds a companion `PIIStreamPseudonymizer::process_chunk()` that applies replacements using the span offsets from `PIIStreamScanner`; replacements are deterministic per `(entity_text, tenant_id)` using HMAC-SHA-256 keyed with the tenant pseudonymisation key from `lek_manager.cpp`.
- [x] The `regex_detection_engine.cpp` must also support chunk-boundary-aware matching; cross-chunk regex detection uses a sliding window overlap equal to `max_pattern_length` (implemented via `IPIIDetectionEngine::maxPatternLength()` and `RegexDetectionEngine::maxPatternLength()`, used by the `PIIStreamScanner` constructor to auto-derive `lookahead_bytes`).
- [x] Add a throughput benchmark in `benchmarks/` measuring end-to-end scan+pseudonymise throughput on 100 MB synthetic legal text (`bench_pii_stream_scanner.cpp`).
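To illustrate the cross-boundary handling described above, here is a minimal sketch of a chunked scanner with a carry-over buffer. It is not the `PIIStreamScanner` implementation: `StreamScannerSketch` is a hypothetical class, and it detects a single fixed-length stand-in pattern (`DDD-DD-DDDD`) instead of the real entity detectors. Because the pattern length is fixed, carrying fewer than one pattern length of unresolved bytes is enough here; variable-length entities need the larger configurable lookahead from the notes.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct PIISpan {
    std::size_t offset;  // absolute byte offset within the whole stream
    std::size_t length;
};

class StreamScannerSketch {
public:
    // Scans one chunk; bytes that cannot be resolved yet are carried over.
    std::vector<PIISpan> scan_chunk(std::string_view chunk, bool is_last) {
        std::string buf = carry_;
        buf.append(chunk);
        std::vector<PIISpan> spans;
        std::size_t i = 0;
        while (i + kPatternLen <= buf.size()) {
            if (matches_at(buf, i)) {
                spans.push_back({base_ + i, kPatternLen});
                i += kPatternLen;  // spans never overlap
            } else {
                ++i;
            }
        }
        if (is_last) {
            base_ += buf.size();
            carry_.clear();
        } else {
            base_ += i;             // everything before i is fully resolved
            carry_ = buf.substr(i); // at most kPatternLen - 1 bytes
        }
        return spans;
    }

private:
    static constexpr std::size_t kPatternLen = 11;  // "DDD-DD-DDDD"

    static bool matches_at(const std::string& s, std::size_t pos) {
        assert(pos + kPatternLen <= s.size());
        for (std::size_t k = 0; k < kPatternLen; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[pos + k]);
            const bool want_dash = (k == 3 || k == 6);
            if (want_dash ? (c != '-') : !std::isdigit(c)) return false;
        }
        return true;
    }

    std::string carry_;     // unresolved tail of the previous chunk
    std::size_t base_ = 0;  // absolute offset of carry_[0]
};
```

Memory use is bounded by the chunk size plus the carry buffer, independent of the total document size, which is the property the streaming requirement is after.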
Performance Targets:
- Streaming PII scan throughput: >100 MB/s per core for English legal text.
- Memory footprint during streaming scan of 1 GB document: <10 MB.
Priority: High Target Version: v0.9.0
Extend audit_logger.cpp to link audit entries into a cryptographic hash chain (each entry includes the SHA-256 hash of the previous entry). This makes offline tampering detectable without requiring a trusted external log service.
Implementation Notes:
- Add a `HashChainAuditWriter` class in `audit_logger.cpp` that maintains the running chain head in a memory-mapped file; on write, computes `entry_hash = SHA256(prev_hash || entry_json)` and appends both fields to the log record.
- Provide an `AuditLogVerifier::verify_chain(log_path)` tool (standalone binary) that replays the hash chain and reports the first tampered or missing entry.
- Chain head must be persisted to a separate `audit_chain_head.bin` file (fsync'd after each write) so verification can resume from any point without replaying the full log.
- Integrate `utils/hkdf_helper.cpp` to derive the initial chain seed from the cluster's root key so chain heads are cluster-specific and cannot be forged with a fresh chain.
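The linking and verification logic above can be sketched in a few lines. Important assumption: to keep the example dependency-free, a 64-bit FNV-1a hash is swapped in for SHA-256. FNV-1a is not cryptographically secure and is used here only to show the chain structure; `HashChainSketch`, `ChainedEntry`, and the in-memory `verify_chain` are hypothetical and ignore file persistence and the HKDF-derived seed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for SHA-256 (NOT secure); a real writer would use a SHA-256 digest.
inline std::uint64_t fnv1a64(const std::string& data) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

struct ChainedEntry {
    std::string entry_json;
    std::uint64_t prev_hash;
    std::uint64_t entry_hash;
};

// entry_hash = H(prev_hash || entry_json), with '|' as a simple separator.
inline std::uint64_t link_hash(std::uint64_t prev_hash, const std::string& entry_json) {
    return fnv1a64(std::to_string(prev_hash) + "|" + entry_json);
}

class HashChainSketch {
public:
    explicit HashChainSketch(std::uint64_t seed) : head_(seed) {}

    ChainedEntry append(const std::string& entry_json) {
        ChainedEntry e{entry_json, head_, link_hash(head_, entry_json)};
        head_ = e.entry_hash;  // new chain head
        return e;
    }

private:
    std::uint64_t head_;
};

// Returns the index of the first entry that fails verification,
// or log.size() when the whole chain is intact.
inline std::size_t verify_chain(const std::vector<ChainedEntry>& log, std::uint64_t seed) {
    std::uint64_t expected_prev = seed;
    for (std::size_t i = 0; i < log.size(); ++i) {
        const ChainedEntry& e = log[i];
        if (e.prev_hash != expected_prev ||
            e.entry_hash != link_hash(e.prev_hash, e.entry_json)) {
            return i;
        }
        expected_prev = e.entry_hash;
    }
    return log.size();
}
```

Because each entry commits to its predecessor, changing any byte of an earlier record invalidates that record's hash, and verification stops at the first mismatch.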
Performance Targets:
- Audit log write throughput: >10k events/s per node (SHA-256 computed inline, no external call).
- Hash chain verification throughput: >50k entries/s.
Priority: Medium Target Version: v0.9.0
Harden hkdf_cache.cpp to enforce per-entry TTL eviction and cap the maximum number of cached derived keys to prevent unbounded memory growth. Cached entries must be zeroed from memory on eviction to prevent derived key material from persisting in heap memory.
Implementation Notes:
- Refactor `hkdf_cache.cpp` to use a bounded LRU structure (max entries configurable, default 1000) with per-entry TTL (default 300 s); on TTL expiry or LRU eviction, call `OPENSSL_cleanse()` on the key buffer before deallocation.
- Add `HKDFCache::purge_by_ikm_hash()` to allow immediate invalidation of all entries derived from a given root key when that key is rotated in `lek_manager.cpp`.
- Cache hit/miss and eviction counts must be exported as Prometheus-compatible counters via `utils/tracing.cpp` span attributes.
- Ensure `hkdf_cache.cpp` is thread-safe under concurrent key derivation requests from the `timeseries` and `sharding` modules using a sharded mutex to reduce contention.
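A minimal sketch of the bounded LRU with per-entry TTL and zero-on-evict follows. Assumptions: `BoundedKeyCacheSketch` is a hypothetical single-threaded class (no sharded mutex, no `purge_by_ikm_hash()`, no metrics), the caller passes the current time explicitly so the behaviour is testable, and `secure_zero` is a stand-in for `OPENSSL_cleanse()` that relies on volatile writes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;
using KeyBytes = std::vector<std::uint8_t>;

// Stand-in for OPENSSL_cleanse(): volatile writes discourage the compiler
// from eliding the wipe.
inline void secure_zero(KeyBytes& key) {
    volatile std::uint8_t* p = key.data();
    for (std::size_t i = 0; i < key.size(); ++i) p[i] = 0;
}

class BoundedKeyCacheSketch {
public:
    BoundedKeyCacheSketch(std::size_t max_entries, std::chrono::seconds ttl)
        : max_entries_(max_entries), ttl_(ttl) {
        assert(max_entries > 0);
    }

    void put(const std::string& id, KeyBytes key, Clock::time_point now) {
        erase(id);  // replace an existing entry, wiping the old key
        if (lru_.size() == max_entries_) {
            const std::string victim = lru_.back().id;  // least recently used
            erase(victim);
        }
        lru_.push_front(Entry{id, std::move(key), now + ttl_});
        index_[id] = lru_.begin();
    }

    // Returns a copy of the key; the caller is responsible for wiping it.
    std::optional<KeyBytes> get(const std::string& id, Clock::time_point now) {
        auto it = index_.find(id);
        if (it == index_.end()) return std::nullopt;
        if (now >= it->second->expires_at) {  // TTL expired: wipe and drop
            erase(id);
            return std::nullopt;
        }
        lru_.splice(lru_.begin(), lru_, it->second);  // mark most recently used
        return it->second->key;
    }

    std::size_t size() const { return lru_.size(); }

private:
    struct Entry {
        std::string id;
        KeyBytes key;
        Clock::time_point expires_at;
    };

    void erase(const std::string& id) {
        auto it = index_.find(id);
        if (it == index_.end()) return;
        secure_zero(it->second->key);  // wipe before deallocation
        lru_.erase(it->second);
        index_.erase(it);
    }

    std::size_t max_entries_;
    std::chrono::seconds ttl_;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

Expired entries are only removed lazily on lookup in this sketch; a production cache would likely also sweep on a timer so expired key material does not linger.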
Performance Targets:
- Cache hit latency: <1 µs (pointer lookup, no crypto).
- HKDF-SHA-256 derive on cache miss: <100 µs (OpenSSL EVP HKDF).
- Memory overhead of 1000-entry cache: <1 MB.
Priority: Medium Target Version: v0.10.0
Extend logger.cpp with per-call-site log sampling and rate-limiting to prevent high-frequency code paths from flooding the log pipeline under load. Sampling rate is configurable per log level and module, and hot paths can declare their expected burst rate.
Implementation Notes:
- Add a `SampledLogger` decorator in `logger.cpp` that wraps the underlying `ILogger` interface; sampling is implemented with a token-bucket rate limiter (one bucket per unique `(file, line, level)` key) to avoid suppressing all instances of a message.
- Configuration is loaded from the `config` module at startup and hot-reloadable via SIGHUP handler; default sampling rates: DEBUG 1%, INFO 10%, WARN 100%, ERROR 100%.
- Suppressed log counts are emitted as a `logs_suppressed_total` counter via the observability subsystem so operators can detect unexpected sampling.
- `audit_logger.cpp` events must bypass the rate limiter entirely; compliance events must never be sampled.
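The per-call-site token bucket described above might look roughly like this. `CallSiteRateLimiterSketch` and `Level` are hypothetical names; the sketch is single-threaded, takes the timestamp as a parameter instead of reading a clock, and leaves out configuration reload and the audit bypass (audit events would simply never be routed through `allow`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

enum class Level { Debug, Info, Warn, Error };

class CallSiteRateLimiterSketch {
public:
    CallSiteRateLimiterSketch(double tokens_per_second, double burst)
        : rate_(tokens_per_second), burst_(burst) {
        assert(rate_ >= 0.0 && burst_ >= 1.0);
    }

    // Returns true when the call site may emit at time now_s (monotonic seconds).
    bool allow(const std::string& file, int line, Level level, double now_s) {
        const auto key = std::make_tuple(file, line, level);
        // A new call site starts with a full bucket.
        auto it = buckets_.try_emplace(key, Bucket{burst_, now_s}).first;
        Bucket& b = it->second;
        const double elapsed = std::max(0.0, now_s - b.last_s);
        b.tokens = std::min(burst_, b.tokens + elapsed * rate_);  // refill, capped at burst
        b.last_s = now_s;
        if (b.tokens >= 1.0) {
            b.tokens -= 1.0;
            return true;
        }
        ++suppressed_total_;  // would feed logs_suppressed_total
        return false;
    }

    std::uint64_t suppressed_total() const { return suppressed_total_; }

private:
    struct Bucket {
        double tokens;
        double last_s;
    };

    double rate_;
    double burst_;
    std::map<std::tuple<std::string, int, Level>, Bucket> buckets_;
    std::uint64_t suppressed_total_ = 0;
};
```

Keying the bucket on `(file, line, level)` means one noisy call site exhausts only its own budget, so unrelated messages at the same level keep flowing.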
Performance Targets:
- Logger hot path overhead with sampling enabled: <200 ns per suppressed log call.
- Rate-limiter memory overhead: <10 KB per 100 unique call sites.
Priority: Medium Target Version: v0.9.0
Add a compaction job and a public replay API to saga_logger.cpp so that completed SAGA transactions are compacted out of the active WAL and compensating transactions can be replayed programmatically during disaster recovery.
Implementation Notes:
- Implement `SAGALogCompactor::compact(before_txn_id)` in `saga_logger.cpp` that rewrites the WAL, retaining only steps for in-flight or failed transactions; completed transactions are archived to the `timeseries` module for audit retention.
- Add `SAGALogReplayer::replay_incomplete(recovery_handler)` that scans the WAL for transactions in `COMPENSATING` state and calls the provided handler for each unconfirmed compensation step; used by `sharding/cross_shard_transaction.cpp` during node recovery.
- Compaction must be atomic (write new WAL, fsync, rename); partial compaction must not leave the WAL in a corrupted state.
- Compaction progress must be emitted as a structured log entry to `audit_logger.cpp` for the compliance audit trail.
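The record-selection part of compaction can be sketched on an in-memory WAL as below. Everything here is an assumption for illustration: `SagaStep`, `StepState`, and `compact_wal` are hypothetical, and a transaction is treated as completed when it carries a `TxnCommitted` marker record. The atomic write, fsync, and rename of the new WAL file, and the archive hand-off, are not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical record layout; TxnCommitted is an assumed terminal marker.
enum class StepState { Done, Compensating, Failed, TxnCommitted };

struct SagaStep {
    std::uint64_t txn_id;
    std::string step;
    StepState state;
};

struct CompactionResult {
    std::vector<SagaStep> retained;  // stays in the active WAL
    std::vector<SagaStep> archived;  // handed off for audit retention
};

// Moves all records of completed transactions with txn_id < before_txn_id
// to the archive set; in-flight, compensating, and failed ones are retained.
inline CompactionResult compact_wal(const std::vector<SagaStep>& wal,
                                    std::uint64_t before_txn_id) {
    std::unordered_set<std::uint64_t> completed;
    for (const SagaStep& r : wal) {
        if (r.txn_id < before_txn_id && r.state == StepState::TxnCommitted) {
            completed.insert(r.txn_id);
        }
    }
    CompactionResult out;
    for (const SagaStep& r : wal) {
        if (completed.count(r.txn_id) != 0) {
            out.archived.push_back(r);
        } else {
            out.retained.push_back(r);
        }
    }
    return out;
}
```

The two-pass shape matters: a transaction's commit marker may appear after its earlier steps, so completion has to be known before any record is classified.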
Performance Targets:
- Compaction throughput: >50k SAGA step records/s.
- Replay scan of 1M-step WAL: <10 s.
- WAL write throughput: >20k step records/s under concurrent SAGA transactions from sharding module.
| Test Type | Coverage Target | Notes |
|---|---|---|
| Unit | >85% new code | Cover PIIStreamScanner chunk-boundary cases, HashChainAuditWriter, HKDFCache eviction, SampledLogger rate-limiting |
| Integration | Cross-module key derivation flows | Verify lek_manager.cpp → hkdf_helper.cpp → timeseries encryption round-trip |
| Security | PII detection recall ≥95% | Test against legal-domain PII fixture dataset; verify no PII leaks in audit log |
| Reliability | Audit chain tamper detection | Inject bit-flip in log file; verify AuditLogVerifier detects it |
| Performance | P99 < budgets above | Streaming PII scan, audit write throughput, HKDF cache hit latency |
| Metric | Current | Target | Method |
|---|---|---|---|
| PII scan throughput | ~20 MB/s | >100 MB/s | Streaming scanner microbenchmark on legal corpus |
| Audit log write throughput | ~3k events/s | >10k events/s | Hash-chain writer benchmark (SHA-256 inline) |
| HKDF derive (cache miss) | ~500 µs | <100 µs | OpenSSL EVP HKDF microbenchmark |
| HKDF cache hit latency | N/A (no cache) | <1 µs | LRU pointer-lookup microbenchmark |
| Logger hot path (suppressed, sampling) | ~800 ns | <200 ns | Flame-graph profiling of sampled logger |
| SAGA WAL write throughput | ~8k steps/s | >20k steps/s | Concurrent SAGA stress test from sharding module |
| SAGA compaction throughput | N/A | >50k records/s | Compaction microbenchmark on 10M-step WAL fixture |
- HKDF-derived keys must be zeroed from memory (`OPENSSL_cleanse`) immediately after use and must not appear in any log output, metric label, or error message; validate this in code review and via `audit_logger.cpp` scrub checks.
- The `audit_logger.cpp` hash chain must be verified on node startup; if the chain is broken, the node must refuse to start and alert operators rather than silently accepting a potentially tampered log.
- `pii_pseudonymizer.cpp` pseudonymisation keys must be tenant-scoped and derived via `hkdf_helper.cpp` from the tenant root key; cross-tenant pseudonymisation with a shared key is architecturally prohibited.
- `pki_client.cpp` certificate validation must reject expired, revoked (via OCSP stapling), and self-signed certificates unless explicitly whitelisted in the node trust store; no `verify=false` escape hatch in production builds.
- [!] Review whether `regex_detection_engine.cpp` ReDoS exposure exists on attacker-controlled PII patterns; fuzz the engine with the `fuzz/` harness before GA.
- `lek_manager.cpp` must enforce a maximum DEK age policy (default 30 days) and automatically trigger key rotation; rotation events must be logged to `audit_logger.cpp` with old and new key IDs (not key material).
GAP-005 – identified via static analysis (2026-04-21). Reference: `docs/governance/SOURCECODE_COMPLIANCE_GOVERNANCE.md`.
Scope: src/utils/checksum_utils.cpp:58 (calculateMD5)
- All existing callers of `calculateMD5()` must be migrated; the function must be renamed or deprecated with a clear error to prevent new callers.
- Binary format of stored checksums in metadata tables will change from 32-hex-char (MD5) to 64-hex-char (SHA-256); migration guide required.
```cpp
// New function (replaces calculateMD5):
std::string calculateSHA256(const std::string& file_path);
// Deprecated shim (compile-time warning):
[[deprecated("Use calculateSHA256; MD5 is cryptographically broken")]]
std::string calculateMD5(const std::string& file_path);
```

- Use `EVP_MD_CTX` + `EVP_sha256()` from OpenSSL (already a dependency):
  ```cpp
  EVP_MD_CTX* ctx = EVP_MD_CTX_new();
  EVP_DigestInit_ex(ctx, EVP_sha256(), nullptr);
  // read file in chunks, EVP_DigestUpdate per chunk
  EVP_DigestFinal_ex(ctx, digest, &len);
  EVP_MD_CTX_free(ctx);
  ```
- The existing `MD5_CTX` / `MD5_Init` / `MD5_Update` / `MD5_Final` OpenSSL APIs are deprecated in OpenSSL 3.0 and will be removed in a future release.
- Unit test: known file → SHA-256 digest matches `sha256sum` reference value
- Unit test: `calculateMD5` call → compiler deprecation warning (test via `-Werror=deprecated-declarations`, the GCC/Clang flag that covers `[[deprecated]]` functions)
- Migration test: update fixture checksums in all tests that use `calculateMD5`
- SHA-256 throughput (OpenSSL, AES-NI-class CPU): ≥ 500 MB/s for large files
- Overhead vs MD5: ≤ 2× (acceptable for a one-time file integrity check)
- MD5 collision attacks are feasible with commodity hardware (< 1 hour); SHA-256 has no known collision attacks
- OpenSSL's EVP interface is FIPS 140-3 compliant when using the FIPS provider
Stub: src/utils/input_validator.cpp — validateJsonStub(): returns nullopt (accept-all) when schema file absent; no WARN logged
Risk: Arbitrary JSON payloads pass validation silently; missing schema file is a silent security gap in production.
- Deploy JSON schema files to `THEMIS_SCHEMA_DIR` (env var or config YAML; default `/etc/themis/schemas/`).
- Log WARN in `validateJsonStub()` when schema file is absent (currently silent).
- Add `aql_request.json`, `query_request.json`, and `changefeed_request.json` schema files.
- Schema files must be read-only on disk; validate file permissions on startup.
- If schema file is malformed JSON, log ERROR and treat as "no schema" (fail-open) rather than crashing.
- With schema file: oversized query field → `validateAqlRequest()` returns error.
- Without schema file: arbitrary payload passes `validateJsonStub()` but subsequent hard-coded checks still apply.
- Warn in log: absent schema file → `WARN [InputValidator] schema file 'aql_request.json' not found in schema_dir`.
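The absent-schema warning could be wired in along these lines. This is a sketch under assumptions: `load_schema_text` and `WarnSink` are hypothetical names, the warning callback stands in for the real logger, and only the "file missing" branch is covered (JSON parsing and permission checks are omitted).

```cpp
#include <cassert>
#include <fstream>
#include <functional>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical logging hook; the real code would call the WARN logger.
using WarnSink = std::function<void(const std::string&)>;

// Returns the schema file contents, or nullopt after emitting a warning
// when the file cannot be opened.
inline std::optional<std::string> load_schema_text(const std::string& schema_dir,
                                                   const std::string& file_name,
                                                   const WarnSink& warn) {
    std::ifstream in(schema_dir + "/" + file_name);
    if (!in) {
        warn("[InputValidator] schema file '" + file_name +
             "' not found in schema_dir");
        return std::nullopt;
    }
    std::ostringstream text;
    text << in.rdbuf();
    return text.str();
}
```

Passing the sink as a parameter keeps the function testable: a unit test can capture the message and assert on it without touching the global logger.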