feat: overlap convert and encode, pipelined encoders#10
Conversation
Damn. Those convert times are way longer than what I expected. Granted, I think I tested with 1440p, but I remember convert taking ~200-300 microseconds. I considered further improvements like these before, but since the entire pipeline took ~1-3 ms on my system I didn't think it was worthwhile. I'm more curious now why the convert step is seemingly so slow on your system. I could maybe test it on an AMD iGPU for comparison.
Yeah it might be worth testing on an AMD card, this might be a case of AMD problems that don't affect nvidia at all. Edit: also I'm gonna go ahead and mark this one "ready for review" until you have a chance to look into it more. I still don't have an nvidia card so all of my testing has been on AMD hardware |
I tested this on the AMD iGPU in my previous laptop (5800H CPU) and got the following results after applying some optimizations. This is 4K, testing just the color conversion using an example application in this branch. Can you try that benchmark to see what your processing times are? I would expect them to be at least as good as mine, considering your GPU. If it is still much slower, I would suspect something else is using your GPU. For what it's worth, I still had decent speeds before these optimizations:
@hgaiser finally got a chance to test, and yeah, that bench runs super fast. With optimizations: Without optimizations:

I think I may have been a bit hung up on the idea that serial convert and encode was the source of the stutter under simultaneous render+streaming load. The convert step is slow if you run a full integration benchmark with the moonshine benchmark branch I have and use a 4K HDR 175 Hz bench of GravityMark RT, but after digging into it more and trying alternative options, it looks like overlapping convert and encode isn't really that important and can probably be dropped.

What actually does help a lot in that circumstance is the encode pipelining, which allows the next frame to start encoding even while the previous frame is still in flight. I'm going to close this PR and reopen with something that won't require any API change or semaphore work, but should still have the same observable perf improvement on a benchmark where rendering load and streaming load are simultaneous on the same GPU.
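For anyone reading this later, the pipelining idea can be sketched with plain threads. This is a hypothetical sketch, not the actual pixelforge API — the real implementation would track in-flight frames with Vulkan fences/semaphores rather than OS threads — but the shape is the same: a bounded channel lets frame N+1 be converted and handed off while frame N is still being encoded.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

// Stand-ins for the real GPU steps (sleeps simulate work).
fn convert(frame: u32) -> u32 {
    thread::sleep(Duration::from_millis(2));
    frame
}

fn encode(frame: u32) -> u32 {
    thread::sleep(Duration::from_millis(4));
    frame
}

/// Convert frames on this thread while a second thread encodes them.
/// The bounded channel (capacity 1) means frame N+1 can be converted
/// while frame N is still "in flight" in the encoder.
fn run_pipeline(frames: u32) -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(1);
    let encoder = thread::spawn(move || {
        let mut out = Vec::new();
        while let Ok(frame) = rx.recv() {
            out.push(encode(frame));
        }
        out
    });
    for frame in 0..frames {
        tx.send(convert(frame)).unwrap();
    }
    drop(tx); // closes the channel so the encoder thread finishes
    encoder.join().unwrap()
}

fn main() {
    // Wall time trends toward max(convert, encode) per frame instead of
    // convert + encode, which is the point of overlapping the two stages.
    let encoded = run_pipeline(8);
    println!("encoded {} frames in order: {:?}", encoded.len(), encoded);
}
```

Frame order is preserved because the channel is FIFO and there is a single encoder thread; only the stages overlap, not the frames within a stage.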
I've been using moonshine to stream at 4K 175 fps, and host latency was very high because moonshine/pixelforge was unable to keep up with the frame budget at that rate (~5.7 ms). This PR stacks two changes that improve consistency of frame delivery by:
I benchmarked this using my moonshine benchmark PR at 4K HDR 175 fps GravityMark with ray tracing on a 9070 XT / 9800X3D host system. At this resolution and rate the encoder is oversaturated (~5.7 ms frame budget vs ~4-8 ms GPU encode time), so the improvements are easier to see. 1080p@60 probably wouldn't look much different, because at least on my GPU it doesn't apply enough contention/pressure to the device.
Subjectively, I noticed a very significant reduction in stutter and jitter in an MH Wilds stream at 2560x1440@120, HDR.
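For context on the budget numbers above, the per-frame budget is just 1000/fps. A quick sketch (plain arithmetic, not tied to any pixelforge code):

```rust
/// Frame budget in milliseconds for a given refresh rate.
fn frame_budget_ms(fps: f64) -> f64 {
    1000.0 / fps
}

fn main() {
    // At 175 fps, every stage on the critical path must fit in ~5.7 ms.
    // With an assumed ~4-8 ms GPU encode time, a serialized convert+encode
    // easily overruns, while keeping a second frame in flight lets the
    // encoder stay busy back to back.
    println!("175 fps -> {:.2} ms budget", frame_budget_ms(175.0));
    println!("120 fps -> {:.2} ms budget", frame_budget_ms(120.0));
}
```

(8.33 ms is the budget at 120 fps; 175 fps leaves only ~5.7 ms per frame.)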
Note: this is a breaking change for `Encoder::encode()` and will require (moonshine PR here) to be merged after for moonshine compatibility.