Samples or fully scans records from an Aerospike namespace (where object types are encoded as key prefixes, not separate sets) and reports the true storage cost per type, including per-record overhead and primary-index memory.
```
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

This installs dependencies for the local analysis scripts (`analyze_storage_report.py`,
`generate_dashboard.py`, `dashboard.py`). The main scan script
(`storage_leak_debug.py`) runs on the pod using the pod's existing Python
environment and has no extra dependencies.
```
kubectl cp storage_leak_debug.py <pod>:/data/storage-leak-debug/storage_leak_debug.py
```

The script must be run through IPython on the pod (for access to starkware
packages):

```
kubectl exec -it <pod> -- bash
$(find . -name "*ipython_py_binary") -c "%run /data/storage-leak-debug/storage_leak_debug.py [flags]"
```

Examples:
```
# Quick overview (10K samples)
$(find . -name "*ipython_py_binary") -c "%run /data/storage-leak-debug/storage_leak_debug.py"

# Larger sample for more accuracy
$(find . -name "*ipython_py_binary") -c "%run /data/storage-leak-debug/storage_leak_debug.py -n 50000 -o /data/storage-leak-debug/"

# Full scan of all records (hours on large DBs)
$(find . -name "*ipython_py_binary") -c "%run /data/storage-leak-debug/storage_leak_debug.py --full-scan -o /data/storage-leak-debug/"

# Full scan + dump specific type's keys for deletion/archival
$(find . -name "*ipython_py_binary") -c "%run /data/storage-leak-debug/storage_leak_debug.py --full-scan -o /data/storage-leak-debug/ --write-records /data/storage-leak-debug/tx_exec_v2.csv --write-records-filter tx_execution_info_version2"

# Analyze the merkle-facts set
$(find . -name "*ipython_py_binary") -c "%run /data/storage-leak-debug/storage_leak_debug.py --merkle-facts"

# Just the console summary, no CSVs
$(find . -name "*ipython_py_binary") -c "%run /data/storage-leak-debug/storage_leak_debug.py --no-csv"
```

Running in background (survives disconnects and auth expirations):
Full scans take hours. Rather than keeping an interactive session open, launch
the script in the background with `nohup` so it continues even if your
`kubectl exec` session disconnects:

```
kubectl exec <pod> -- bash -c 'nohup $(find . -name "*ipython_py_binary") -c "%run /data/storage-leak-debug/storage_leak_debug.py --full-scan -o /data/storage-leak-debug/" > /data/storage-leak-debug/output.log 2>&1 &'
```

This returns immediately. Check progress and completion with:
```
# Tail the log
kubectl exec <pod> -- tail -1 /data/storage-leak-debug/output.log

# Check if DONE marker exists (written on successful completion)
kubectl exec <pod> -- ls /data/storage-leak-debug/DONE
```

If the pod restarts or the script is killed, re-run with `--resume` to
continue from the last checkpoint:
```
kubectl exec <pod> -- bash -c 'nohup $(find . -name "*ipython_py_binary") -c "%run /data/storage-leak-debug/storage_leak_debug.py --full-scan --resume -o /data/storage-leak-debug/" > /data/storage-leak-debug/output.log 2>&1 &'
```

Options:
| Flag | Default | Description |
|---|---|---|
| `-n, --samples` | `10000` | Number of records to sample |
| `-o, --output-dir` | `/tmp/` | Directory for CSV output |
| `--full-scan` | off | Scan all records instead of sampling. Ignores `-n` |
| `--resume` | off | Resume a previously interrupted scan from the last checkpoint |
| `--record-overhead` | `64` | Per-record on-disk overhead in bytes (digest + metadata + bin headers) |
| `--merkle-facts` | off | Analyze the merkle-facts set |
| `--no-csv` | off | Print summary only, skip CSV generation |
| `--write-records PATH` | off | Write each record's key and device size to a CSV |
| `--write-records-filter TYPE` | off | Only write records matching this object type |
Output files (written to `--output-dir`):
- `type_summary.csv` -- per-type breakdown: count, device size, value size, index cost
- `DONE` -- marker file with the full summary, written on successful completion
- `checkpoint.json` -- resume state, written after each partition (deleted on completion)

When `--write-records` is used:
- `<path>.csv` -- one row per record: `key,size` (device size including overhead)
```
kubectl cp <pod>:/data/storage-leak-debug/type_summary.csv .
kubectl cp <pod>:/data/storage-leak-debug/DONE .   # verify completion
```

`kubectl cp` does not compress and can fail on large files (it uses tar
internally, which may hit memory/timeout limits). For large files like
`--write-records` output:
```
# 1. Compress on the pod (use -1 for fast compression, ~3-4x faster than default)
kubectl exec <pod> -- gzip -1 /data/storage-leak-debug/records.csv

# 2. If the .gz is under ~5GB, download directly:
kubectl exec <pod> -c batcher -- cat /data/storage-leak-debug/records.csv.gz > records.csv.gz

# 3. If the .gz is larger (or download truncates), split on the pod first:
kubectl exec <pod> -- bash -c "split -b 2G /data/storage-leak-debug/records.csv.gz /data/storage-leak-debug/records.csv.gz.part-"

# Download each chunk (use cat, not kubectl cp, for reliability):
kubectl exec <pod> -- ls /data/storage-leak-debug/records.csv.gz.part-*
for part in aa ab ac ad ae af ag ah ai aj; do
  kubectl exec <pod> -c batcher -- cat /data/storage-leak-debug/records.csv.gz.part-$part > records.csv.gz.part-$part || break
done

# Reassemble and verify:
cat records.csv.gz.part-* > records.csv.gz
gzip -t records.csv.gz   # integrity check
gunzip records.csv.gz

# Clean up parts on the pod:
kubectl exec <pod> -- rm /data/storage-leak-debug/records.csv.gz.part-*
```

`analyze_storage_report.py` reads `type_summary.csv` and produces a PNG with:
- KPI strip (total device size, value, overhead, index memory, record count)
- Cost breakdown by type (stacked bars: value / overhead / index)
- Object count by type
```
pip install matplotlib
python3 analyze_storage_report.py -i . -o storage_report.png

# Both main + merkle-facts sets
python3 analyze_storage_report.py -i . --both
```

`len(bin_data["value"])` (value size) is what your application wrote.
The actual storage cost per record is higher:
```
device_size = value_size + record_overhead
```

The default overhead of 64 bytes covers the Aerospike digest (20B), record
metadata (~26B), and single-bin header (~12B). Adjust with `--record-overhead`
if your setup differs (check `asinfo -v "namespace/<ns>"` for actual usage vs
object count).
On top of device storage, each record costs ~64 bytes in the primary index
(kept in memory). For types with many small records (e.g. tx_request_data at
~16 bytes/value), the index memory cost dominates the value size.
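As a rough illustration of this cost model (a sketch, not code from the script; the constants are the defaults documented above):

```python
RECORD_OVERHEAD = 64         # --record-overhead default: digest + metadata + bin header
INDEX_BYTES_PER_RECORD = 64  # primary-index entry kept in memory

def per_record_cost(value_size: int) -> tuple[int, int]:
    """Return (device_bytes, index_memory_bytes) for one record."""
    return value_size + RECORD_OVERHEAD, INDEX_BYTES_PER_RECORD

# e.g. a tx_request_data-style record with a ~16-byte value:
device, index = per_record_cost(16)
print(device, index)  # 80 bytes on disk, 64 bytes of index memory
```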
All scans use partition-filter-based iteration (each of the 4096 partitions scanned independently), which ensures each record is visited exactly once and avoids replica double-counting. When sampling, each partition returns an equal share of records for uniform coverage across the cluster.
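A minimal sketch of what partition-filter iteration looks like with the Aerospike Python client, assuming the `partition_filter` query policy of client 6.0+; the host, namespace, and callback are placeholders, not the script's actual code:

```python
import aerospike

NAMESPACE = "my_namespace"  # placeholder
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

counts = {}
for pid in range(4096):  # an Aerospike namespace always has 4096 partitions
    query = client.query(NAMESPACE, None)  # set=None covers every set in the namespace
    policy = {"partition_filter": {"begin": pid, "count": 1}}

    def handle(record, pid=pid):
        key, meta, bins = record
        counts[pid] = counts.get(pid, 0) + 1  # aggregate on the fly, never store records

    query.foreach(handle, policy)  # each master record is visited exactly once
```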
A full scan (`--full-scan`) visits every record in the namespace. The script queries the cluster for the total object count at startup (summing across all nodes and dividing by the replication factor) and shows progress with a percentage. Memory usage is constant — records are aggregated on the fly, not held in a list.
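A hedged sketch of that startup estimate using Aerospike info commands; the stat names and response parsing are assumptions about the server's `namespace/<ns>` info output, not taken from the script:

```python
import aerospike

NAMESPACE = "my_namespace"  # placeholder
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

total_objects = 0
repl_factor = 1
for node, (err, resp) in client.info_all(f"namespace/{NAMESPACE}").items():
    if err is not None:
        continue
    # Assumed response shape: "objects=123;...;effective_replication_factor=2;..."
    stats = dict(kv.split("=", 1) for kv in resp.split("\t")[-1].strip().split(";") if "=" in kv)
    total_objects += int(stats.get("objects", 0))
    repl_factor = int(stats.get("effective_replication_factor", repl_factor))

print(f"~{total_objects // repl_factor} unique records expected")
```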
A DONE marker file containing the full summary is written to the output
directory on successful completion, so you can verify the script finished even
if the pod restarts.
Full scans of large databases can take hours. If the pod crashes or the script
is interrupted, use --resume to continue from the last checkpoint:
```
$(find . -name "*ipython_py_binary") -c "%run /data/storage-leak-debug/storage_leak_debug.py --full-scan --resume -o /data/storage-leak-debug/"
```

The script saves a `checkpoint.json` file after each partition completes
(atomic write via tmp + rename). On resume, it loads the checkpoint, restores
all aggregation counters, and continues from the next partition. The records
file (if --write-records is used) is opened in append mode.
Safety: --resume without a checkpoint exits immediately rather than
starting a fresh scan. This prevents accidentally overwriting an existing
records file after a completed scan.
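The checkpoint pattern described above looks roughly like this (a sketch with illustrative paths and field names, not the script's actual schema):

```python
import json
import os
import sys

CHECKPOINT = "/data/storage-leak-debug/checkpoint.json"  # illustrative path

def save_checkpoint(next_partition: int, counters: dict) -> None:
    # Atomic write: dump to a temp file, then rename over the old checkpoint,
    # so a crash mid-write can never leave a truncated JSON behind.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_partition": next_partition, "counters": counters}, f)
    os.replace(tmp, CHECKPOINT)

def load_checkpoint(resume: bool) -> tuple[int, dict]:
    if not resume:
        return 0, {}
    if not os.path.exists(CHECKPOINT):
        # Safety guard: --resume without a checkpoint must not silently start a
        # fresh scan and overwrite an existing records file.
        sys.exit("--resume given but no checkpoint.json found")
    with open(CHECKPOINT) as f:
        state = json.load(f)
    return state["next_partition"], state["counters"]
```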
Copy `run_scan.sh` to the PVC and add a lifecycle hook to the container spec:

```yaml
lifecycle:
  postStart:
    exec:
      command: ["/bin/bash", "/data/storage-leak-debug/run_scan.sh"]
```

The script checks for a DONE marker and exits immediately if the scan already
completed. Otherwise it launches the scan in the background with `--resume`.
Progress is logged to `output.log` (line-based, no terminal escape codes) so
you can monitor with `kubectl exec <pod> -- tail /data/storage-leak-debug/output.log`.
`--write-records` dumps each record's full Aerospike key and device size to a
CSV. Use `--write-records-filter` to limit to a specific object type. The key
is written as the ASCII string (or `hex:`-prefixed hex for binary keys) and can
be reconstructed for deletion via `bytearray(key.encode("ascii"))`.
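For example, a hedged sketch of reading a `--write-records` CSV back and rebuilding keys for deletion; the namespace, set, CSV path, and header handling are assumptions, while the `hex:` convention and `bytearray` reconstruction come from the description above:

```python
import csv
import aerospike

NAMESPACE = "my_namespace"  # placeholder
SET_NAME = None             # assumption: records live in the namespace's null set

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

with open("tx_exec_v2.csv", newline="") as f:
    for key_str, _size in csv.reader(f):  # columns: key, size
        if key_str == "key":               # skip a header row if the file has one
            continue
        if key_str.startswith("hex:"):
            user_key = bytearray(bytes.fromhex(key_str[4:]))  # binary key
        else:
            user_key = bytearray(key_str.encode("ascii"))     # ASCII key
        client.remove((NAMESPACE, SET_NAME, user_key))        # delete the record
```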