# RGPU-Acceleration

Remote GPU acceleration for video transcoding over QUIC/UDP with TCP fallback: offload video transcoding from a VPS/cloud server to home machines with NVIDIA GPUs.

## Features
- Multi-Protocol Transport: QUIC (primary) with automatic TCP fallback
- Zstandard Compression: 25-35% bandwidth reduction with minimal CPU overhead
- Load Balancing: Consistent hashing for multi-GPU server clusters
- Health Monitoring: Real-time GPU temperature, memory, and utilization tracking
- Service Discovery: mDNS auto-discovery of GPU servers on local networks
- Ring Buffers: Efficient segment queuing with jitter compensation
## Architecture

```
┌─────────────────┐      QUIC/TCP       ┌─────────────────┐
│   VPS Client    │ ◄────────────────►  │   Home Server   │
│  (Compression)  │   Binary Protocol   │  (NVIDIA GPU)   │
│  (Discovery)    │                     │    (NVENC)      │
│  (Load Balance) │                     │  (Health Mon)   │
└─────────────────┘                     └─────────────────┘
        │                                       │
  Video Source                            FFmpeg NVENC
  (segmented)                          (h264_nvenc, etc)
        │                                       │
   ┌────┴────┐                           ┌─────┴─────┐
   │  Zstd   │                           │   Zstd    │
   │ Compress│                           │ Decompress│
   └─────────┘                           └───────────┘
```
## Requirements

**Home server (GPU node):**
- NVIDIA GPU with NVENC support
- FFmpeg with NVENC enabled
- Python 3.8+
- Open UDP port 4433 (QUIC) and TCP port 4434

**VPS client:**
- FFmpeg (for segmentation/reassembly)
- Python 3.8+
## Installation

```bash
git clone https://github.com/TheKhosa/RGPU-Acceleration.git
cd RGPU-Acceleration
pip install -r requirements.txt
pip install -e .
```

## Quick Start

On the home server:

```bash
# Generate TLS certificates
python -m rgpu.home_server --generate-certs

# Start server with mDNS advertising
python -m rgpu.home_server
```

On the VPS:

```bash
# Transcode with auto-discovery and compression
python -m rgpu.vps_client --host YOUR_HOME_IP transcode input.mkv output.mp4
```

## Usage

### Home Server

```bash
# Basic usage (auto node ID, mDNS enabled)
python -m rgpu.home_server

# Custom configuration
python -m rgpu.home_server --config config/home_config.yaml

# Specify node ID and ports
python -m rgpu.home_server --node-id gpu-rtx4090 --port 4433 --tcp-port 4434

# Disable mDNS advertising
python -m rgpu.home_server --no-mdns
```

### VPS Client

```bash
# Transcode with compression and packed protocol (default)
python -m rgpu.vps_client transcode input.mkv output.mp4

# Force specific transport
python -m rgpu.vps_client --transport quic transcode input.mkv output.mp4
python -m rgpu.vps_client --transport tcp transcode input.mkv output.mp4

# Disable compression
python -m rgpu.vps_client --no-compression transcode input.mkv output.mp4

# Use legacy v1 protocol (for compatibility)
python -m rgpu.vps_client --legacy-protocol transcode input.mkv output.mp4

# Use preset
python -m rgpu.vps_client transcode input.mkv output.mp4 --preset 720p
```

## Performance Optimizations

- 100x Latency Reduction: Frame latency drops from 500-2000ms to 5-20ms
- Process Reuse: Single FFmpeg process handles thousands of frames
- CUDA Context Caching: Avoid repeated GPU initialization overhead
- Configuration-Based Pooling: Automatic encoder reuse for matching configs
- Pre-allocated Buffer Pools: Eliminate allocation overhead during streaming
- FrameBuffer Context Manager: Automatic pool return on scope exit
- Memory Copy Reduction: 3-5 copies reduced to 1 per segment
- Multi-Size Pool Support: Efficient handling of different resolutions
- GPU Scheduler: Load-balanced scheduling across multiple GPUs
- 3-Stage Pipeline: Receive → Encode → Send run concurrently
- Health-Aware Routing: Avoid overloaded or hot GPUs
- Consistent Hashing: Job affinity for cache locality
- Quality Ladders: Predefined levels from 240p to 4K60
- Bandwidth Estimation: EWMA-based network measurement
- Buffer-Aware Adaptation: Quality adjusts to buffer health
- Smooth Transitions: Hysteresis prevents oscillation
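The EWMA-based bandwidth estimation can be sketched as a running blend of per-window throughput samples. This is an illustrative implementation, not the project's internal code; the `alpha` weight and kbps units are assumptions.

```python
class EWMABandwidthEstimator:
    """Smooth noisy throughput samples with an exponentially weighted moving average."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha          # weight given to the newest sample
        self.estimate_kbps = None   # no estimate until the first sample arrives

    def update(self, bytes_transferred: int, duration_ms: float) -> float:
        # bits per millisecond == kilobits per second
        sample_kbps = bytes_transferred * 8 / duration_ms
        if self.estimate_kbps is None:
            self.estimate_kbps = sample_kbps
        else:
            self.estimate_kbps = (self.alpha * sample_kbps
                                  + (1 - self.alpha) * self.estimate_kbps)
        return self.estimate_kbps
```

The hysteresis mentioned above then sits on top of this estimate: quality only switches after several consecutive windows agree, which is what prevents oscillation.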
Optional native components:
- rgpu-nvenc: Direct NVENC bindings (placeholder, CUDA SDK required)
- rgpu-core: Protocol/buffer implementations for 10-100x speedup
- Python FFI: ctypes-compatible C API for easy integration
Wire-format optimizations:
- LEB128 Variable-Length Encoding: Segment indices use 1-3 bytes instead of fixed 4 bytes
- Job ID Registry: Replace 8-64 byte job_id strings with 1-byte indices (67-83% savings)
- Packed Segment Headers: 4-8 bytes vs 24+ bytes in v0.2 (79% overhead reduction)
- Binary Job Params: ~25 bytes vs ~600 bytes JSON (96% savings)
- Zero-Copy Unpacking: memoryview-based parsing for efficient processing
- Message Batching: Coalesce small messages to reduce header overhead
Robustness improvements:
- Concurrency Safety: Added asyncio.Lock for all shared state
- Memory Bounds: Bounded pending data buffers (10MB/stream, 100MB total)
- TTL Cleanup: Automatic cleanup of stale jobs and segments
- Complete Shutdown: Proper cleanup of connections and resources
- Input Validation: Size limits and field validation on all protocol messages
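LEB128 varints, as used above for segment indices, fit in a few lines. This is a generic unsigned-LEB128 sketch, not the project's own codec:

```python
def leb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # set the continuation bit
        else:
            out.append(byte)
            return bytes(out)

def leb128_decode(buf: bytes, offset: int = 0):
    """Decode a varint starting at offset; return (value, bytes_consumed)."""
    result = shift = 0
    pos = offset
    while True:
        byte = buf[pos]
        result |= (byte & 0x7F) << shift
        pos += 1
        if not byte & 0x80:           # continuation bit clear: last byte
            return result, pos - offset
        shift += 7
```

Indices below 128 take a single byte, and anything up to 2^21 - 1 fits in three, which matches the 1-3 byte claim above.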

**Zstandard Compression**
- Automatic compression of video segments (level 5 default)
- 25-35% bandwidth reduction
- Only applied when beneficial (>5% savings)
- Configurable compression level (1-19)

**Multi-Protocol Transport**
- QUIC as primary protocol (30-50% lower latency)
- Automatic TCP fallback when UDP is blocked
- Protocol racing: tries both, uses first to succeed

**Load Balancing**
- Consistent hashing for segment routing
- GPU utilization-aware routing
- Automatic failover on node failure
- Health-based node weighting

**Service Discovery**
- mDNS auto-discovery of GPU servers
- Manual server configuration support
- Hybrid mode (mDNS + manual)

**Health Monitoring**
- Real-time GPU metrics (temp, memory, utilization)
- System metrics (CPU, RAM)
- 1-second health check intervals
- Warning/critical thresholds

**Ring Buffers**
- Efficient O(1) segment queuing
- Jitter buffer for network variance
- Configurable buffer sizes
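The consistent-hash routing described above can be sketched with a minimal hash ring. Virtual nodes and the SHA-1 key hash here are illustrative choices; the project's actual hashing scheme may differ:

```python
import hashlib
from bisect import bisect, insort

class ConsistentHashRing:
    """Map segment keys to nodes so removing one node only remaps its own share."""

    def __init__(self, nodes=(), vnodes: int = 64):
        self.vnodes = vnodes    # virtual nodes per server smooth the distribution
        self.ring = {}          # ring position -> node name
        self.positions = []     # sorted ring positions
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self.ring[pos] = node
            insort(self.positions, pos)

    def get_node(self, key: str) -> str:
        # first position clockwise from the key's hash, wrapping around the ring
        idx = bisect(self.positions, self._hash(key)) % len(self.positions)
        return self.ring[self.positions[idx]]
```

Routing a key like `f"{job_id}:{segment_index}"` through `get_node` gives a job's segments stable affinity to the same server, which is what enables the cache locality noted earlier.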
## Configuration

Home server (config/home_config.yaml):

```yaml
server:
  node_id: ""                  # Auto-generated if empty
  host: "0.0.0.0"
  port: 4433
  tcp_port: 4434
  max_concurrent_jobs: 2
  segment_buffer_size: 50
  enable_mdns: true
  health_check_interval: 1.0
  gpu_temp_warning: 80.0
  gpu_temp_critical: 90.0
```

VPS client:

```yaml
client:
  server_host: "your-home-ip"
  server_port: 4433
  tcp_port: 4434
  transport: "auto"            # quic, tcp, or auto
  enable_compression: true
  compression_level: 5
  enable_load_balancing: false
  enable_mdns_discovery: true
```

## Protocol

### Message Types

| Type | Name | Direction | Purpose |
|---|---|---|---|
| 0x01 | JOB_CREATE | VPS→Home | New transcode job |
| 0x02 | JOB_ACK | Home→VPS | Job accepted |
| 0x03 | JOB_REJECT | Home→VPS | Job rejected |
| 0x10 | SEGMENT_DATA | VPS→Home | Compressed video segment |
| 0x11 | SEGMENT_DONE | Home→VPS | Encoded segment returned |
| 0x20 | JOB_COMPLETE | VPS→Home | No more segments |
| 0x21 | JOB_FINISHED | Home→VPS | All processing done |
| 0x30 | HEALTH_REQUEST | VPS→Home | Request health report |
| 0x31 | HEALTH_REPORT | Home→VPS | Health status |
| 0xF0 | HEARTBEAT | Both | Keep connection alive |
| 0xF1 | PING | Both | Latency measurement |
| 0xF2 | PONG | Both | Latency response |
| 0xFF | ERROR | Both | Error with details |
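The message-type codes above map naturally onto an `IntEnum`. The names here simply mirror the table; they are not necessarily the project's internal definitions:

```python
from enum import IntEnum

class MessageType(IntEnum):
    JOB_CREATE     = 0x01  # VPS -> Home: new transcode job
    JOB_ACK        = 0x02  # Home -> VPS: job accepted
    JOB_REJECT     = 0x03  # Home -> VPS: job rejected
    SEGMENT_DATA   = 0x10  # VPS -> Home: compressed video segment
    SEGMENT_DONE   = 0x11  # Home -> VPS: encoded segment returned
    JOB_COMPLETE   = 0x20  # VPS -> Home: no more segments
    JOB_FINISHED   = 0x21  # Home -> VPS: all processing done
    HEALTH_REQUEST = 0x30  # request a health report
    HEALTH_REPORT  = 0x31  # health status reply
    HEARTBEAT      = 0xF0  # keep connection alive
    PING           = 0xF1  # latency measurement
    PONG           = 0xF2  # latency response
    ERROR          = 0xFF  # error with details
```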
### Legacy v1 Segment Layout

```
[2 bytes: job_id length]
[N bytes: job_id]
[4 bytes: segment_index]
[1 byte:  is_last]
[1 byte:  compression_type]   # 0=none, 1=zstd, 2=lz4
[4 bytes: original_size]      # Size before compression
[N bytes: data]
```
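The v1 layout packs directly with `struct`. Big-endian (network) byte order is an assumption here; the layout above doesn't state it:

```python
import struct

def pack_segment_v1(job_id: str, index: int, is_last: bool,
                    compression: int, original_size: int, data: bytes) -> bytes:
    """Pack a SEGMENT_DATA payload in the v1 layout (big-endian assumed)."""
    jid = job_id.encode()
    header = struct.pack(f">H{len(jid)}sIBBI", len(jid), jid,
                         index, int(is_last), compression, original_size)
    return header + data

def unpack_segment_v1(buf: bytes):
    """Inverse of pack_segment_v1; returns the header fields plus the data tail."""
    (jlen,) = struct.unpack_from(">H", buf, 0)
    job_id = buf[2:2 + jlen].decode()
    index, is_last, comp, orig = struct.unpack_from(">IBBI", buf, 2 + jlen)
    return job_id, index, bool(is_last), comp, orig, buf[12 + jlen:]
```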
### Packed v2 Layout

```
Frame Header (2-3 bytes):
  [4 bits: version=2][4 bits: frame_type]
  [1-4 bytes: payload_length (LEB128)]

Packed Segment (4-8 bytes header vs 24+ in v1):
  [1 byte: job_index]          # From job registry
  [1-3 bytes: segment_index]   # LEB128 varint
  [1 byte: flags]              # keyframe|last|compressed|reserved
  [0-4 bytes: original_size]   # LEB128, only if compressed
  [N bytes: payload data]
```
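Assembling the packed v2 segment header looks roughly like this. The flag bit positions are an illustrative assumption, as is the varint helper; the real layout is defined by the project:

```python
# Assumed flag bit positions, for illustration only
FLAG_KEYFRAME, FLAG_LAST, FLAG_COMPRESSED = 0x01, 0x02, 0x04

def _leb128(value: int) -> bytes:
    """Unsigned LEB128 varint encoder."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def pack_segment_v2(job_index: int, seg_index: int, flags: int,
                    original_size: int, payload: bytes) -> bytes:
    """Build a packed v2 segment: fixed job byte, varint index, flags, optional size."""
    header = bytes([job_index]) + _leb128(seg_index) + bytes([flags])
    if flags & FLAG_COMPRESSED:
        header += _leb128(original_size)   # size field only present when compressed
    return header + payload
```

For a small segment index and no compression, the header is just three bytes against 19+ for the v1 layout, which is where the 67-83% savings in the table below come from.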
### Overhead Comparison

| Component | v1 (Legacy) | v2 (Packed) | Savings |
|---|---|---|---|
| Frame header | 5 bytes | 2-3 bytes | 40-60% |
| Segment header | 24+ bytes | 4-8 bytes | 67-83% |
| Job params | ~600 bytes | ~25 bytes | 96% |
| Per 1000 segments | ~29KB | ~6KB | 79% |
## Multi-Server Load Balancing

For multiple GPU servers:

```yaml
# VPS client config
client:
  enable_load_balancing: true
  load_balance_strategy: "consistent_hash"  # or "round_robin", "least_loaded"
  enable_mdns_discovery: true
  manual_servers:
    - host: "192.168.1.100"
      port: 4433
    - host: "192.168.1.101"
      port: 4433
```

## Compression Levels

| Level | Speed | Ratio | Use Case |
|---|---|---|---|
| 1-3 | Fast | Low | High bandwidth, low CPU |
| 4-6 | Balanced | Medium | General use (default: 5) |
| 7-10 | Slow | High | Low bandwidth |
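The "only applied when beneficial" rule from the compression feature amounts to comparing sizes before deciding to send compressed. Sketched here with stdlib `zlib` standing in for Zstandard so it runs anywhere; the project itself uses the zstd codec at the levels in the table above:

```python
import zlib

def maybe_compress(segment: bytes, min_saving: float = 0.05):
    """Return (payload, compressed_flag); keep compression only if it saves >5%."""
    candidate = zlib.compress(segment, 6)   # stand-in for zstd level 1-19
    if len(candidate) < len(segment) * (1 - min_saving):
        return candidate, True
    return segment, False                   # overhead not worth it: send raw
```

Already-compressed video payloads typically fail the 5% test, so they travel untouched rather than wasting CPU on both ends.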
## Tuning

- `segment_buffer_size`: 50 (default) - increase for high-latency networks
- `jitter_buffer_min_ms`: 20ms - minimum buffering delay
- `jitter_buffer_max_ms`: 200ms - maximum buffering delay
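A minimal sketch of the segment ring buffer behind these settings, using `collections.deque` for O(1) push/pop. The project's actual buffer also tracks arrival timing for jitter compensation, which is omitted here:

```python
from collections import deque

class SegmentBuffer:
    """Bounded FIFO of segments with O(1) operations at both ends."""

    def __init__(self, capacity: int = 50):   # mirrors segment_buffer_size
        self._queue = deque(maxlen=capacity)

    def push(self, segment) -> None:
        # when full, deque(maxlen=...) silently drops the oldest segment
        self._queue.append(segment)

    def pop(self):
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)
```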
## Pipe Streaming

The `rgpu-pipe` command enables real-time streaming transcoding via stdin/stdout pipes:

```bash
# Basic usage: transcode raw video from FFmpeg
ffmpeg -i input.mp4 -f rawvideo -pix_fmt nv12 -s 1920x1080 pipe:1 | \
  rgpu-pipe --host GPU_SERVER -w 1920 -h 1080 | \
  ffmpeg -f h264 -i pipe:0 -c copy output.mp4

# HEVC encoding at 8Mbps for 4K
ffmpeg -i input.mp4 -f rawvideo -pix_fmt nv12 -s 3840x2160 pipe:1 | \
  rgpu-pipe --host GPU_SERVER -w 3840 -h 2160 --codec hevc_nvenc --bitrate 8000 | \
  ffmpeg -f hevc -i pipe:0 -c copy output.mp4
```

## MediaMTX Integration

RGPU integrates with MediaMTX via the `runOnReady` hook:
```yaml
# mediamtx.yml - GPU-accelerated stream transcoding
paths:
  # Original stream input
  camera1:
    source: rtsp://192.168.1.100/stream

  # Transcoded output using remote GPU
  camera1_hd:
    runOnReady: >
      ffmpeg -i rtsp://localhost:$RTSP_PORT/camera1
      -f rawvideo -pix_fmt nv12 -s 1920x1080 pipe:1 |
      rgpu-pipe --host GPU_SERVER -w 1920 -h 1080 --codec h264_nvenc --bitrate 4000 |
      ffmpeg -f h264 -i pipe:0 -c copy -f rtsp rtsp://localhost:$RTSP_PORT/$MTX_PATH
    runOnReadyRestart: yes

  # Low-latency stream for WebRTC
  camera1_webrtc:
    runOnReady: >
      ffmpeg -i rtsp://localhost:$RTSP_PORT/camera1
      -f rawvideo -pix_fmt nv12 -s 1280x720 pipe:1 |
      rgpu-pipe --host GPU_SERVER -w 1280 -h 720 --low-latency --bitrate 2000 |
      ffmpeg -f h264 -i pipe:0 -c copy -f rtsp rtsp://localhost:$RTSP_PORT/$MTX_PATH
    runOnReadyRestart: yes
```

## Python API

### Stream Transcoding

For programmatic integration:
```python
import asyncio
from rgpu import StreamTranscoder, StreamConfig, StreamFrame

async def transcode_stream():
    config = StreamConfig(
        width=1920,
        height=1080,
        fps_num=30,
        codec="h264_nvenc",
        bitrate_kbps=4000,
        low_latency=True,
    )
    transcoder = StreamTranscoder(config)
    await transcoder.connect("gpu-server", 4433)

    async for encoded in transcoder.transcode_stream(raw_frames()):
        output.write(encoded.data)

    await transcoder.close()

asyncio.run(transcode_stream())
```

### Pipeline Encoding

For high-performance parallel encoding:
```python
import asyncio
from rgpu import (
    EncoderPool, EncoderConfig, GPUScheduler,
    TranscodePipeline, create_abr, get_ladder,
)

async def pipeline_encode():
    # Initialize components
    scheduler = GPUScheduler()
    await scheduler.start()

    encoder_pool = EncoderPool(max_encoders=4)
    await encoder_pool.start()

    config = EncoderConfig(
        width=1920,
        height=1080,
        fps=30,
        codec="h264_nvenc",
        bitrate_kbps=4000,
    )

    # Create pipeline
    pipeline = TranscodePipeline(
        scheduler=scheduler,
        encoder_pool=encoder_pool,
        config=config,
        num_encode_workers=2,
    )

    # Handle encoded frames
    pipeline.on_encoded(lambda frame: output.write(frame.data))
    await pipeline.start()

    # Submit frames
    for raw_frame in frames:
        await pipeline.submit_frame(raw_frame)

    await pipeline.stop()
    await encoder_pool.stop()
    await scheduler.stop()

asyncio.run(pipeline_encode())
```

### Adaptive Bitrate Control

For network-aware quality adaptation:
```python
from rgpu import ABRController, get_ladder

# Create ABR controller with streaming ladder
abr = ABRController(ladder=get_ladder("streaming"))

# Update with network measurements
await abr.update_bandwidth(bytes_transferred=1024000, duration_ms=1000)
await abr.update_buffer(level_ms=3000)

# Get recommended quality
quality = await abr.select_quality()
print(f"Selected: {quality.name} @ {quality.bitrate_kbps} kbps")

# Get encoder config for quality
config = quality.to_encoder_config()
```

## Troubleshooting

QUIC connection fails:
- Ensure UDP port 4433 is open
- Check that the firewall allows UDP traffic
- Try the TCP fallback: `--transport tcp`

Poor throughput or choppy playback:
- Increase the compression level
- Use TCP for more stable delivery
- Adjust the jitter buffer settings

Verify NVENC support on the home server:

```bash
ffmpeg -encoders | grep nvenc
nvidia-smi
```

## License

MIT