Empirical evidence that persistent GPU kernels operating as actors provide fundamental performance advantages over traditional kernel-launch models.
All measurements collected 2026-04-16 on NVIDIA H100 NVL (Hopper architecture). Raw data in target/criterion/, methodology in METHODOLOGY.md.
RingKernel demonstrates that persistent GPU kernels operating as actors with lock-free message passing achieve fundamentally different performance characteristics compared to the traditional kernel-launch model — specifically:
- Sub-microsecond command injection via mapped memory (75 ns measured)
- Zero-copy inter-kernel messaging via DSMEM and K2K channels
- Sustained throughput without re-launch overhead (CV 0.05% over 60 seconds)
- Cluster-level scalability with 2.98x sync speedup via Thread Block Clusters
| Property | Value |
|---|---|
| GPU | NVIDIA H100 NVL, 95830 MiB HBM3 |
| Compute Capability | 9.0 (Hopper) |
| Driver | 595.58.03, CUDA Runtime 13.2 |
| CUDA Toolkit | 12.8 (V12.8.93) |
| GPU Clock | Locked at 1785 MHz |
| ECC | Enabled |
| Compute Mode | EXCLUSIVE_PROCESS |
| CPU | AMD EPYC 9V84 96-Core (40 vCPUs) |
| RAM | 314 GiB |
| OS | Ubuntu 24.04 / Linux 6.17.0-1010-azure |
| Rust | 1.97.0-nightly (e8e4541ff 2026-04-15) |
| cudarc | 0.19.3 |
| RingKernel | 0.4.2, commit 42724ae |
Reproducibility: GPU clocks locked, exclusive compute mode, persistence mode
enabled. Full system state captured in benchmark_results/h100_20260416/.
Persistent actor injection latency is three to four orders of magnitude lower than traditional kernel launch and CUDA Graph replay.
Three execution models compared over 1000 sequential commands, 20 independent trials:
| Model | Mechanism |
|---|---|
| Traditional | cuLaunchKernel per command on a CUDA stream |
| CUDA Graph | Captured command sequence replayed via cuGraphLaunch |
| Persistent Actor | ptr::write_volatile to mapped memory (GPU already running) |
| Model | Per-Command (us) | Total (us) | 95% CI (us) | vs Traditional |
|---|---|---|---|---|
| Traditional Launch | 1.583 | 1583.1 | ±6.0 | 1.0x |
| CUDA Graph Replay | 0.547 | 546.9 | ±12.3 | 2.9x |
| Persistent Actor | <0.001 | 0.2 | ±0.0 | 8,698x |
| Comparison | Speedup | Cohen's d | Significance |
|---|---|---|---|
| Persistent vs Traditional | 8,698x | >> 2.0 (very large) | p < 1e-20 |
| Persistent vs CUDA Graph | 3,005x | >> 2.0 (very large) | p < 1e-20 |
| CUDA Graph vs Traditional | 2.9x | > 2.0 (very large) | p < 1e-10 |
The persistent actor model eliminates the entire kernel launch pipeline:
- Traditional: Host → Driver → Command Processor → SM Scheduler → Kernel Start
- CUDA Graph: Host → Graph Executor → Command Processor → SM Scheduler → Kernel Start
- Persistent Actor: Host → volatile write to mapped memory → (kernel already on SM reads it)
CUDA Graphs reduce driver dispatch overhead (2.9x improvement) but still require the command processor to schedule work. Persistent actors bypass the entire submission path because the kernel is already resident on SMs and continuously polls mapped memory for new commands.
This is the paper's strongest result: even the optimal traditional approach (CUDA Graphs) is 3,005x slower than persistent actor injection.
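As a minimal sketch of that path, assume a command slot in host-mapped (pinned, device-visible) memory that the resident kernel polls; the Command layout and inject helper below are illustrative, not RingKernel's actual API:

```rust
use std::ptr;
use std::sync::atomic::{fence, Ordering};

/// Illustrative command slot in host-mapped memory (hypothetical layout).
#[repr(C)]
struct Command {
    opcode: u32,
    payload: u32,
    seq: u64, // published last; the resident kernel polls this field
}

/// The entire "launch": two volatile stores and a release fence.
/// No driver call appears anywhere on this path.
unsafe fn inject(slot: *mut Command, opcode: u32, payload: u32, seq: u64) {
    ptr::write_volatile(ptr::addr_of_mut!((*slot).opcode), opcode);
    ptr::write_volatile(ptr::addr_of_mut!((*slot).payload), payload);
    // Make the command body visible before publishing the sequence number.
    fence(Ordering::Release);
    ptr::write_volatile(ptr::addr_of_mut!((*slot).seq), seq);
}
```

Since nothing is queued to the driver or command processor, the host-side cost reduces to the PCIe write itself (see Limitations).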
Caveats:
- Persistent actors occupy SMs continuously (cannot be preempted without Green Contexts)
- Grid size limited by cooperative group constraints (~1024 blocks)
- Traditional model wins for one-shot kernels with no recurrence
Lock-free SPSC message queues sustain more than 10 Mmsg/s with sub-100 ns per-message latency for payloads up to 1 KB.
| Payload | Latency (ns) | 95% CI (ns) | Throughput (Mmsg/s) |
|---|---|---|---|
| 64 B | 72.28 | [72.26, 72.32] | 13.83 |
| 256 B | 74.67 | [74.65, 74.70] | 13.39 |
| 1 KB | 82.64 | [82.61, 82.68] | 12.10 |
| 4 KB | 177.50 | [177.47, 177.53] | 5.63 |
| Batch Size | Per-Message (ns) | Throughput (Mmsg/s) |
|---|---|---|
| 10 | 85.7 | 11.67 |
| 100 | 94.9 | 10.54 |
| 1,000 | 94.9 | 10.53 |
Sustained throughput (60 s soak):
| Metric | Value |
|---|---|
| Duration | 60.0 seconds |
| Total operations | 315,169,092 |
| Mean throughput | 5.54 Mops/s |
| Coefficient of variation | 0.05% |
| p50 latency | 70 ns |
| p95 latency | 80 ns |
| p99 latency | 100 ns |
| Max p99 spike | 100 ns |
| Throughput degradation | None (0.1% difference between first and last windows) |
| Memory growth | 0 MB |
| ECC errors | 0 |
The CV (standard deviation divided by mean) of 0.05% over 60 seconds is the strongest evidence of stability. For reference:
- CV < 1% = extremely stable
- CV < 5% = stable
- CV < 10% = acceptable
- CV > 10% = unstable
Our 0.05% is 20x below the "extremely stable" threshold.
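Concretely, the statistic is computed over per-window throughput samples; a minimal sketch of the computation:

```rust
/// Coefficient of variation over per-window throughput samples:
/// population standard deviation divided by mean.
fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt() / mean
}
```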
The lock-free SPSC design provides:
- Constant-time enqueue/dequeue regardless of queue state
- No contention between producer and consumer (single-producer single-consumer)
- Cache-friendly sequential access to queue slots
- Zero-allocation steady-state (queue pre-allocated, envelopes reused)
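A minimal SPSC ring buffer exhibiting these four properties (an ordinary-Rust sketch with atomics; the GPU-side queue additionally relies on volatile accesses and fences over mapped memory):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sketch of a lock-free SPSC ring buffer. Indices grow monotonically and
/// are reduced mod N on access; slots are pre-allocated, so steady-state
/// operation performs zero allocations.
struct Spsc<T, const N: usize> {
    buf: [UnsafeCell<Option<T>>; N],
    head: AtomicUsize, // next slot to pop; written only by the consumer
    tail: AtomicUsize, // next slot to push; written only by the producer
}

unsafe impl<T: Send, const N: usize> Sync for Spsc<T, N> {}

impl<T, const N: usize> Spsc<T, N> {
    fn new() -> Self {
        Self {
            buf: std::array::from_fn(|_| UnsafeCell::new(None)),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Producer side: constant time regardless of queue state.
    fn push(&self, v: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed);
        if tail.wrapping_sub(self.head.load(Ordering::Acquire)) == N {
            return Err(v); // full
        }
        unsafe { *self.buf[tail % N].get() = Some(v) };
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    /// Consumer side: constant time, no contention with the producer.
    fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        if head == self.tail.load(Ordering::Acquire) {
            return None; // empty
        }
        let v = unsafe { (*self.buf[head % N].get()).take() };
        self.head.store(head.wrapping_add(1), Ordering::Release);
        v
    }
}
```

Cache-line padding of the head and tail indices, which a production queue would add to avoid false sharing, is omitted for brevity.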
Thread Block Clusters on Hopper GPUs provide faster synchronization and enable DSMEM-based K2K messaging that avoids global memory entirely.
Protocol: 1000 sync iterations × 20 trials, 4 blocks × 256 threads.
| Method | Per-Sync (us) | Mean Total (us) | 95% CI (us) | Std Dev (us) |
|---|---|---|---|---|
| cluster.sync() | 0.628 | 628.2 | ±1.4 | 3.1 |
| grid.sync() | 1.875 | 1874.9 | ±3.5 | 8.0 |
Statistical summary:
| Metric | Value |
|---|---|
| Speedup | 2.98x |
| Cohen's d | >> 2.0 (very large) |
| p-value | < 1e-20 |
Protocol: 4-block cluster, 256 floats/message (1KB), 100 rounds, ClusterOnGpc scheduling.
| Block | DSMEM Result | Verified |
|---|---|---|
| Block 0 | 72,352.0 | Non-zero exchange confirmed |
| Block 1 | 136,352.0 | Non-zero exchange confirmed |
| Block 2 | 200,352.0 | Non-zero exchange confirmed |
| Block 3 | 8,352.0 | Non-zero exchange confirmed |
The asymmetric values confirm ring-topology exchange: each block received data from its predecessor (Block 0 received from Block 3, etc.). The difference between values (64,000) corresponds to the initial data offset between blocks.
Cluster sync is 2.98x faster because it only synchronizes blocks on the same GPC (Graphics Processing Cluster), avoiding the cross-GPC communication needed for grid.sync(). For persistent actors, this means:
- Intra-cluster communication (DSMEM, ~30 cycles): actors co-located in a cluster
- Inter-cluster communication (global memory, ~400 cycles): actors in different clusters
- Hierarchical synchronization: cluster.sync() for local, grid.sync() for global
This maps directly to the actor model: frequently-communicating actors should be placed in the same cluster for minimum latency.
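A hypothetical sketch of that placement rule (the types are illustrative; RingKernel's scheduler API may differ):

```rust
/// Channels between actors in the same Thread Block Cluster can use DSMEM;
/// everything else falls back to global memory. Cycle counts are the
/// approximate figures quoted above.
#[derive(Clone, Copy, PartialEq, Eq)]
struct ClusterId(u32);

enum ChannelMedium {
    Dsmem,        // intra-cluster, ~30 cycles
    GlobalMemory, // inter-cluster, ~400 cycles
}

fn channel_medium(a: ClusterId, b: ClusterId) -> ChannelMedium {
    if a == b { ChannelMedium::Dsmem } else { ChannelMedium::GlobalMemory }
}
```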
Serialization microbenchmarks:
| Operation | Time (ns) | Description |
|---|---|---|
| Header as_bytes (zero-copy) | 0.544 | Pointer cast, no allocation |
| Timestamp as_bytes | 0.544 | Pointer cast |
| HLC state as_bytes | 0.544 | Pointer cast |
| Header to Vec (allocation) | 10.34 | memcpy + alloc |
| Header roundtrip | 47.90 | serialize + deserialize |
Zero-copy serialization via zerocopy::AsBytes completes in 0.544 ns, i.e. sub-nanosecond, confirming it compiles down to a pointer cast with effectively zero runtime cost.
The allocation-based path is 19x slower, validating the zero-copy design choice.
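A minimal sketch of the zero-copy path, assuming the zerocopy crate with its derive feature enabled; the field layout is illustrative rather than the actual HlcTimestamp definition:

```rust
use zerocopy::AsBytes;

/// 24 bytes under #[repr(C)] with no padding, so the derive is valid.
#[derive(AsBytes)]
#[repr(C)]
struct HlcTimestamp {
    physical: u64, // physical clock component (ns)
    logical: u64,  // logical counter
    node_id: u64,  // originating node
}

fn wire_bytes(ts: &HlcTimestamp) -> &[u8] {
    // A pointer cast plus a length: no allocation, no memcpy. This is the
    // sub-nanosecond path in the table above.
    ts.as_bytes()
}
```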
Memory allocation:
| Method | Per-Alloc (ns) | Speedup |
|---|---|---|
| cuMemAlloc (synchronous) | 102,576 | 1.0x |
| cuMemAllocAsync (stream-ordered) | 878 | 116.9x |
Synchronous cuMemAlloc acquires a global driver lock, blocking all streams.
cuMemAllocAsync allocates from a stream-ordered pool, allowing other streams
to continue. For persistent actor workloads with multiple concurrent streams,
this eliminates a major serialization bottleneck.
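A sketch of the two driver entry points involved (signatures from the CUDA driver API; in the benchmarks these are reached through cudarc rather than a hand-written extern block, and error handling is elided):

```rust
use core::ffi::c_void;

// Hand-declared CUDA driver API bindings for illustration; link with -lcuda.
type CUdeviceptr = u64;
type CUstream = *mut c_void;
type CUresult = i32;

extern "C" {
    // Synchronous path: takes a global driver lock, blocking all streams.
    fn cuMemAlloc_v2(dptr: *mut CUdeviceptr, bytesize: usize) -> CUresult;
    // Stream-ordered path: draws from the stream's memory pool while other
    // streams keep running.
    fn cuMemAllocAsync(dptr: *mut CUdeviceptr, bytesize: usize, h_stream: CUstream) -> CUresult;
    // Stream-ordered free: deferred until prior work on the stream completes.
    fn cuMemFreeAsync(dptr: CUdeviceptr, h_stream: CUstream) -> CUresult;
}

/// Allocate `bytes` in stream order on `stream` (illustrative wrapper,
/// assuming a current CUDA context).
unsafe fn alloc_on_stream(stream: CUstream, bytes: usize) -> CUdeviceptr {
    let mut dptr: CUdeviceptr = 0;
    cuMemAllocAsync(&mut dptr, bytes, stream);
    dptr
}
```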
Application benchmarks:
| Backend | Grid | Mcells/s | Speedup vs CPU | CV (3 trials) |
|---|---|---|---|---|
| CPU (Rayon, 40 cores) | 64³ | 358.31 | 1.0x | — |
| GPU Stencil | 64³ | 78,089 | 217.9x | 1.0% |
| GPU Block Actor | 64³ | 18,633 | 52.1x | 0.5% |
| Backend | TPS | Speedup |
|---|---|---|
| CPU Baseline | 13.14M | 1.0x |
| CUDA Codegen Kernel | 205.31B | 15,624x |
| Operation | Throughput |
|---|---|
| DFG Construction | 10.83M events/sec |
| Pattern Detection | 210 patterns in < 1ms |
| Partial Order Derivation | 3.05M traces/sec |
Verified by layout assertions in the Criterion benchmarks:
| Structure | Size | Alignment | Verified |
|---|---|---|---|
| ControlBlock | 128 B | 128 B | assert_eq! passes |
| MessageHeader | 256 B | cache-line | assert_eq! passes |
| HlcTimestamp | 24 B | — | assert_eq! passes |
| HlcState | 16 B | — | assert_eq! passes |
These sizes are locked by #[repr(C)] layout and zerocopy::AsBytes compatibility.
Changing them would break GPU↔CPU interop.
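The layout checks look roughly like this (field contents are illustrative; only the size and alignment contract matters):

```rust
use core::mem::{align_of, size_of};

/// Illustrative stand-in: #[repr(C, align(128))] pins alignment and pads
/// the struct to a full 128-byte line.
#[repr(C, align(128))]
struct ControlBlock {
    head: u64,
    tail: u64,
    flags: u64,
}

#[test]
fn control_block_layout_is_locked() {
    // Any drift here would silently corrupt GPU<->CPU interop, so it is
    // asserted rather than assumed.
    assert_eq!(size_of::<ControlBlock>(), 128);
    assert_eq!(align_of::<ControlBlock>(), 128);
}
```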
Hybrid Logical Clocks verified across all test configurations:
| Property | Verified | Performance |
|---|---|---|
| Total ordering (ts1 < ts2 < ts3) | YES | 91 ns/check |
| Cross-node causality | YES | 131 ns/check |
| Distributed causal chain (5 nodes) | YES | 339 ns/check |
| HLC tick | — | 30 ns/tick |
| HLC update (merge remote) | — | 31 ns/update |
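For reference, the standard HLC rules these properties exercise, as a sketch (field widths and naming may differ from RingKernel's implementation):

```rust
/// Hybrid Logical Clock: (physical, logical) pairs ordered
/// lexicographically, which yields the total ordering checked above.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Hlc {
    physical: u64, // max physical time observed (ns)
    logical: u32,  // tie-breaker for events within one physical tick
}

impl Hlc {
    /// Local event ("tick"): advance past both wall clock and own state.
    fn tick(&mut self, now_ns: u64) {
        if now_ns > self.physical {
            self.physical = now_ns;
            self.logical = 0;
        } else {
            self.logical += 1;
        }
    }

    /// Merge a remote timestamp ("update"): the result dominates the wall
    /// clock, the local state, and the remote state, preserving causality.
    fn update(&mut self, remote: Hlc, now_ns: u64) {
        let local = *self;
        let physical = now_ns.max(local.physical).max(remote.physical);
        let logical = if physical == local.physical && physical == remote.physical {
            local.logical.max(remote.logical) + 1
        } else if physical == local.physical {
            local.logical + 1
        } else if physical == remote.physical {
            remote.logical + 1
        } else {
            0
        };
        *self = Hlc { physical, logical };
    }
}
```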
Hardware health under sustained load:
| Metric | Value |
|---|---|
| GPU Power (60s sustained) | 65.7 W (stable) |
| GPU Temperature | 37°C → 37°C (no change) |
| ECC Corrected Errors | 0 |
| ECC Uncorrected Errors | 0 |
Code quality and test status:
| Metric | Value |
|---|---|
| Workspace test count | 1,447 |
| Test failures | 0 |
| GPU-specific tests (H100) | 29 pass |
| Clippy warnings (production) | 0 (clippy::unwrap_used enforced) |
| Bare .unwrap() in production | 0 |
| System | Command Latency | Messaging | Approach |
|---|---|---|---|
| Traditional CUDA | 1.58 us | N/A (re-launch) | cuLaunchKernel |
| CUDA Graphs | 0.55 us | N/A (captured sequence) | cuGraphLaunch |
| NVIDIA NCCL | ~1 us | Collectives only | Ring allreduce |
| RingKernel (persistent) | <0.001 us | 75 ns (lock-free) | Mapped memory + SPSC |
RingKernel's persistent actor model is unique in providing both:
- Zero-overhead command injection (no kernel launch at all)
- Lock-free inter-kernel messaging (no host mediation)
Traditional frameworks require either re-launching kernels (cuLaunchKernel, CUDA Graphs) or host-mediated communication (cuMemcpy between kernels). Persistent actors eliminate both overheads.
- Measurement bias: Host-side timing includes PCIe latency for mapped memory writes. Device-side CUDA event timing would give lower persistent actor latency.
- Warm cache: Benchmarks run after warmup; cold-start latency may differ.
- Single GPU: Results on H100 NVL; other Hopper SKUs may differ slightly.
- Workload-dependent: Persistent actors require continuous GPU occupation. For sporadic, one-shot kernels, the traditional model has lower resource cost.
- Grid size constraints: Cooperative groups limit grid to ~1024 blocks. Thread Block Clusters partially address this (cluster.sync for local, grid.sync for global).
- Memory model: Mapped memory coherence relies on volatile operations and memory fences. Formal verification of the lock-free protocol is future work.
- Throughput measurement: CPU-side SPSC queue throughput (5.54M ops/s) represents the host injection rate, not GPU-side processing rate. GPU-side throughput depends on the kernel's compute workload.
- CUDA Graph comparison: The graph captures a simple kernel; complex graphs with dependencies may show different speedup ratios.
```bash
# System setup
export RINGKERNEL_CUDA_ARCH=sm_90
sudo nvidia-smi -pm 1
sudo nvidia-smi -lgc 1785
sudo nvidia-smi -c EXCLUSIVE_PROCESS

# Build
cargo build --workspace --features cuda --release --exclude ringkernel-txmon
cargo build -p ringkernel-cuda --features "cuda,cooperative" --release

# Core benchmarks (Criterion)
cargo bench --package ringkernel -- --noplot

# Application benchmarks (3 trials each)
cargo run -p ringkernel-wavesim3d --bin wavesim3d-benchmark --release --features cuda-codegen
cargo run -p ringkernel-txmon --bin txmon-benchmark --release --features cuda-codegen
cargo run -p ringkernel-procint --bin procint-benchmark --release

# GPU tests (cluster, DSMEM, CUDA Graphs, sustained throughput)
cargo test -p ringkernel-cuda --features "cuda,cooperative" --release -- --ignored --nocapture

# Full workspace validation
cargo test --workspace --release  # Expect: 1,447 passed, 0 failed
```

Git commit: 42724ae
All raw data: target/criterion/, benchmark_results/h100_20260416/
Methodology: docs/benchmarks/METHODOLOGY.md
Full results: docs/benchmarks/h100-b200-baseline.md
The empirical evidence demonstrates that the GPU-native persistent actor paradigm provides fundamental — not incremental — performance advantages:
- 8,698x faster command injection than traditional kernel launch
- 3,005x faster than CUDA Graphs (the optimal traditional approach)
- Sub-100ns message latency with lock-free queues
- 0.05% throughput variance over 60 seconds of sustained load
- 2.98x faster synchronization via H100 Thread Block Clusters
- Zero ECC errors, zero thermal drift under sustained datacenter operation
These results establish persistent GPU actors as a viable paradigm for latency-sensitive GPU workloads in datacenter environments.