
feat: depth-2 encoder pipelining #12

Open
porkloin wants to merge 1 commit into hgaiser:main from porkloin:pipelining

@porkloin porkloin commented May 7, 2026

A smaller-scope follow-up to #10 that only includes encoder pipelining at depth 2.

This allows the encoder to begin processing the next frame before the current frame is complete.

As with #10, this aims to address latency in situations where the GPU is saturated by rendering work, especially when the frame budget is tight (4K at 175 fps, for example).

Each codec encoder (H.264, H.265, AV1) gets a pair of slots; each slot owns its own command buffer, query pool, fence, bitstream buffer, input image, and DPB output slot. encode() becomes:

  1. Drain the slot we're about to overwrite: wait on its fence, read the bitstream from its query pool, and build the packet from those bytes plus metadata cached at submit time.
  2. Record the new submission into that slot and submit it to the encode queue without waiting for completion.
  3. Advance to the other slot for the next call.
  4. Return the packet drained in step 1 (which is for one frame ago).
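
The four steps above can be sketched as a small, GPU-free Rust model. This is an illustration under stated assumptions, not the code in this PR: `Slot`, `Pending`, `Packet`, and `PipelinedEncoder` are hypothetical names, the `Option<Packet>` return type is an assumption, the "encode" simply copies the input bytes, and the fence wait and query-pool read are reduced to comments.

```rust
/// Metadata cached at submit time (hypothetical fields).
#[derive(Debug, PartialEq)]
struct Pending {
    frame_index: u64,
    is_key: bool,
}

/// What encode() hands back to the caller (hypothetical shape).
#[derive(Debug, PartialEq)]
struct Packet {
    frame_index: u64,
    is_key: bool,
    data: Vec<u8>,
}

/// Stand-in for one slot. A real slot would own a command buffer,
/// query pool, fence, bitstream buffer, input image, and DPB slot.
#[derive(Default)]
struct Slot {
    bitstream: Vec<u8>,       // stand-in for the bitstream buffer
    pending: Option<Pending>, // Some(..) while a submission is in flight
}

struct PipelinedEncoder {
    slots: [Slot; 2],
    current: usize,
    next_frame: u64,
}

impl PipelinedEncoder {
    fn new() -> Self {
        Self {
            slots: [Slot::default(), Slot::default()],
            current: 0,
            next_frame: 0,
        }
    }

    /// Step 1: drain a slot. Returns None if nothing was submitted into it.
    fn drain(slot: &mut Slot) -> Option<Packet> {
        let pending = slot.pending.take()?;
        // A real encoder would wait on the slot's fence here and read the
        // encode feedback before touching the bitstream bytes.
        Some(Packet {
            frame_index: pending.frame_index,
            is_key: pending.is_key,
            data: std::mem::take(&mut slot.bitstream),
        })
    }

    /// Single-argument encode: drain, submit without waiting, advance.
    fn encode(&mut self, frame: &[u8]) -> Option<Packet> {
        let slot = &mut self.slots[self.current];
        let drained = Self::drain(slot);

        // Step 2: "record and submit" the new frame into the same slot.
        // Here the fake bitstream is just a copy of the input.
        slot.bitstream = frame.to_vec();
        slot.pending = Some(Pending {
            frame_index: self.next_frame,
            is_key: self.next_frame == 0,
        });
        self.next_frame += 1;

        // Step 3: advance to the other slot for the next call.
        self.current = (self.current + 1) % 2;

        // Step 4: return whatever was drained in step 1.
        drained
    }
}
```

In this toy model the first two calls return `None`, because each slot is empty the first time it is visited, and with strict alternation the packet drained on a given call belongs to the frame submitted into that same slot two calls earlier. How that lines up with the "one frame ago" wording in step 4 depends on details of the real encoder that are not shown here. A real implementation would also need some way to drain whatever is still in flight when the stream ends, which this sketch leaves out.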

Unlike #10, this is a non-breaking change for Encoder::encode(): the single-argument function signature remains the same. It also doesn't touch the converter process or do any parallel convert/encode.

There is a real trade-off with this approach: since packets are effectively delivered one frame late, end-to-end latency will rise by whatever the current host processing latency is. However, the increased throughput should make this a worthwhile trade-off.

Benchmarking:

I've been benchmarking this with moonshine PR 77, which adds the ability to benchmark without a real Moonlight client.

The important part of benchmarking this is to have a workload that saturates your GPU so completely that the real host encoder latency begins to exceed the frame budget for the requested frame rate. For me, on an AMD 9070 XT, GravityMark at 4K HDR and 175 fps with raytracing has worked well for producing those conditions with a bench-friendly executable.
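
As a quick sanity check on what "exceeds the frame budget" means numerically: the budget is the reciprocal of the target frame rate. A minimal sketch, with hypothetical helper names `frame_budget_us` and `exceeds_budget`:

```rust
/// Frame interval in whole microseconds for a target frame rate.
/// Integer division truncates; `fps` must be non-zero or this panics.
fn frame_budget_us(fps: u32) -> u64 {
    1_000_000 / u64::from(fps)
}

/// True when a single stage's latency alone is longer than one frame interval.
fn exceeds_budget(latency_us: u64, fps: u32) -> bool {
    latency_us > frame_budget_us(fps)
}
```

At 175 fps this gives 5714 us, which matches the frame interval printed in the bench reports.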

Steps to benchmark:

  1. Download GravityMark from https://gravitymark.tellusim.com/
  2. Check out the benchmark branch from moonshine PR 77
  3. Update the pixelforge dependency in the local moonshine branch to point at this branch: pixelforge = { git = "https://github.com/porkloin/pixelforge", branch = "pipelining", features = ["dmabuf"] }
  4. cargo build --release --bin moonshine
  5. ./target/release/moonshine /path/to/config.toml bench --duration 30 --warmup 5 --codec hevc --resolution 3840x2160 --fps 175 --hdr /path/to/gravitymark/run_fullscreen_vk_rt.sh -close 0

My results:

With this PR:

  ======================================================================
   moonshine bench report
  ======================================================================
   config:    3840x2160 @ 175Hz, 50000000 bps, Hevc, hdr=true
   duration:  30.00s elapsed (target 30s, warmup 5s, 418 frames discarded)
   frames:    3814 (0 key)  observed_fps=152.55
   bitrate:   43585757 bps observed (target 50000000 bps)
   spikes:    3790 frames > 5714us frame interval (99.4%)

   stage            min     p50     p95     p99     max     (microseconds)
   -----            ---     ---     ---     ---     ---
   channel_wait       1    9778   13136   14495   16309
   import             0       0       0       0       3
   convert          163    6300    6856   12733   13729
   encode            61     120     136     149    3856
   packetize          1       1       2       2       6
   send               0       0       1       2       8
   total           2414   16323   19891   21290   22983

   gpu           min     p50     p95     p99     max     (249 samples)
   ---           ---     ---     ---     ---     ---
   sclk MHz     2990    3048    3076    3104    3117
   busy %         87     100     100     100     100

   worst spikes (frame >5714us with nearest GPU sample):
      t (s)   total (us)   convert (us)   encode (us)   sclk MHz   busy %
      22.11        22983           6596            74       3058      100
      28.87        22970           6719           117       3039      100
      28.93        22765           6806           121       3039      100
      29.23        22628           6668           121       3022      100
      29.99        22621           6809           124       3024      100
      28.18        22578           6613            71       3031      100
      28.28        22356           6682            65       3023      100
      25.64        22331           6755            70       3025      100
      24.79        22285           6693            73       3041      100
      26.75        22119           6651            67       3009      100
  ======================================================================

Without this PR:

  ======================================================================
   moonshine bench report
  ======================================================================
   config:    3840x2160 @ 175Hz, 50000000 bps, Hevc, hdr=true
   duration:  30.00s elapsed (target 30s, warmup 5s, 341 frames discarded)
   frames:    2891 (0 key)  observed_fps=115.63
   bitrate:   33037412 bps observed (target 50000000 bps)
   spikes:    2891 frames > 5714us frame interval (100.0%)

   stage            min     p50     p95     p99     max     (microseconds)
   -----            ---     ---     ---     ---     ---
   channel_wait    4062   11112   20900   22272   23508
   import             0       0       0       0       4
   convert          160    2559    8762    9120    9599
   encode          4160    4485    4980    5546    6039
   packetize          1       1       2       3       6
   send               0       1       2       2       5
   total          10546   18121   28699   29040   35734

   gpu           min     p50     p95     p99     max     (248 samples)
   ---           ---     ---     ---     ---     ---
   sclk MHz     3008    3148    3187    3195    3197
   busy %         76      88      92      93      94

   worst spikes (frame >5714us with nearest GPU sample):
      t (s)   total (us)   convert (us)   encode (us)   sclk MHz   busy %
      29.24        35734           7667          4758       3050       88
      29.57        35471           7473          4486       3140       81
      28.69        35388           7143          4813       3069       80
      29.77        35385           7324          4643       3088       81
      22.11        30609           4429          4449       3160       82
      20.09        29870           4409          4638       3163       84
      10.49        29553           6270          4501       3173       88
      29.61        29425           5013          4815       3140       81
      21.88        29325           2917          4804       3140       89
      28.53        29321           4237          4738       3092       84
  ======================================================================

Consolidated Results:

With this PR:

      observed_fps  = 152.55
      encode p50    = 120 us                                                                                                                                             
      encode p99    = 149 us                                               
      total p99     = 21290 us

Without this PR:

      observed_fps  = 115.63                                               
      encode p50    = 4485 us                                                                                                                                            
      encode p99    = 5546 us                                              
      total p99     = 29040 us                                                                                                                                           
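
To put the consolidated numbers in relative terms, a small helper is enough. This is only a convenience sketch; `percent_change` is a hypothetical name and not part of moonshine or pixelforge.

```rust
/// Relative change from `before` to `after`, in percent.
/// Positive means an increase, negative a decrease.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

Applied to the figures above, observed fps rises by roughly 32% (115.63 to 152.55) and total p99 drops by roughly 27% (29040 us to 21290 us).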
