[rlsw] ESP32 optimizations #5827
Adds a `sw_rcp(x)` inline reciprocal that on Xtensa (ESP32 / ESP32-S3
LX6/LX7) emits a `recip0.s` seed plus two Newton-Raphson refinement
steps -- 1-ULP accurate in ~7 instructions, all in FPU registers. On
every other target it expands to plain `1.0f/x`, so generated code is
byte-identical to before for non-Xtensa builds.
Replaces the hot-path `1.0f/x` calls that were previously compiling to
the `__divsf3` software helper on Xtensa:
- perspective divide (1/w) in triangle clip-and-project (PCT and PC paths)
- line and point clip-and-project NDC conversion
- triangle span setup: dxRcp, blockLenRcp, wRcpA, wRcpB
- triangle scanline setup: h02Rcp, h01Rcp, h12Rcp
- axis-aligned quad: wRcp, hRcp
- line rasterizer: stepRcp
Other `1.0f/x` uses (matrix translate/normalize, texture init `tx`/`ty`,
sw_matrix_rotate inverse-length) are not on the per-pixel hot path and
are left untouched.
Measured on ESP32-S3 @ 240 MHz, R5G6B5 240x240, textured 3D model:
contributes to a ~10-15% rasterization speedup.
Made-with: Cursor
Adds an opt-in ESP-DSP code path for ESP32 / ESP32-S3 builds. ESP-DSP is
ESP-IDF's official optimized math library and ships hand-vectorized
kernels that beat the scalar implementations on Xtensa.
Two integration points:
1. `sw_matrix_mul_rst` -> `dspm_mult_4x4x4_f32` for any 4x4*4x4 multiply
(used for MVP build, gluLookAt, push/multiply, etc.). rlsw stores
matrices column-major and ESP-DSP reads row-major; the comment on the
call site explains why the flat-buffer call still produces the
correct column-major product (transpose-of-transposes equivalence).
2. `sw_immediate_push_vertex` -> `dspm_mult_4x4x1_f32` for the per-vertex
clip-space transform. Because ESP-DSP expects a row-major matrix in
this case, a row-major copy `matMVP_rm[16]` is maintained alongside
`matMVP` and refreshed once per `isDirtyMVP` rebuild in
`sw_immediate_begin`. Cost is 16 scalar copies per matrix update,
amortized over thousands of vertices per frame.
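The layout argument behind both integration points can be checked on a host machine. The following is a minimal sketch, not the code from this commit: `mul4x4_row_major` and `mul4x4x1_row_major` are hypothetical stand-ins for row-major kernels in the style of `dspm_mult_4x4x4_f32` and `dspm_mult_4x4x1_f32`, and swapping the operands is one assumed way of realizing the transpose-of-transposes equivalence (the actual argument order used at the rlsw call site is not shown here).

```c
#include <assert.h>

/* Hypothetical stand-in for a row-major 4x4 * 4x4 kernel:
   out = a * b, with all three buffers interpreted as row-major. */
static void mul4x4_row_major(const float *a, const float *b, float *out)
{
    assert(out != a && out != b);   /* this sketch does not support aliasing */
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) sum += a[r*4 + k] * b[k*4 + c];
            out[r*4 + c] = sum;
        }
    }
}

/* Reference column-major product: element (row r, col c) lives at [c*4 + r]. */
static void mul4x4_col_major_ref(const float *a, const float *b, float *out)
{
    for (int c = 0; c < 4; c++) {
        for (int r = 0; r < 4; r++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) sum += a[k*4 + r] * b[c*4 + k];
            out[c*4 + r] = sum;
        }
    }
}

/* Column-major A*B through the row-major kernel. A column-major buffer read
   as row-major is the transpose, and B^T * A^T = (A*B)^T, which is exactly
   A*B stored column-major. */
static void mul4x4_col_major_via_row_kernel(const float *a, const float *b, float *out)
{
    mul4x4_row_major(b, a, out);
}

/* Row-major copy of a column-major matrix: 16 scalar copies, the role the
   commit message gives to matMVP_rm. */
static void transpose4x4(const float *src, float *dst)
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[r*4 + c] = src[c*4 + r];
}

/* Hypothetical stand-in for a row-major 4x4 * 4x1 kernel: out = m * v. */
static void mul4x4x1_row_major(const float *m, const float *v, float *out)
{
    for (int r = 0; r < 4; r++) {
        float sum = 0.0f;
        for (int c = 0; c < 4; c++) sum += m[r*4 + c] * v[c];
        out[r] = sum;
    }
}
```

For the matrix-vector case no operand swap is available, which is why a transposed copy is needed there while the matrix-matrix case can reuse the column-major buffers directly.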
Detection is **opt-in** via `SW_USE_ESP_DSP` so existing ESP-IDF projects
that don't depend on the `esp-dsp` component keep building unchanged.
A user enables it from CMakeLists.txt (or anywhere before including
rlgl.h):
    target_compile_definitions(${COMPONENT_LIB} PRIVATE SW_USE_ESP_DSP=1)
and adds the dependency to `idf_component.yml`:
    espressif/esp-dsp: "^1.4.0"
Measured on ESP32-S3 @ 240 MHz, R5G6B5 240x240, textured 3D model:
contributes meaningfully to the overall frame-time improvement
(combined with sw_rcp).
Made-with: Cursor
Adds an opt-in compile-time flag that replaces the SW_REPEAT wrap chain
with a bitmask (`x & (size-1)`) for power-of-two textures. NPOT textures
keep using the original `sw_fract` / signed-modulo paths via a runtime
`(size & (size-1)) == 0` check, so SW_REPEAT remains correct for them.
Affects two samplers:
- `sw_texture_sample_nearest`: drops the `floorf` + multiply + cast for
POT textures in REPEAT mode (saves a software call on Xtensa).
- `sw_texture_sample_linear`: replaces the `(x % w + w) % w` two-step
modulo (a software divide on Xtensa) with a single bitwise AND for
POT textures in REPEAT mode. Two's-complement int wrap covers
negative coordinates correctly.
Off by default: for POT textures sampled with negative UVs, bitmask wrap
can differ from `sw_fract` wrap by one texel at the boundary. That is
imperceptible at typical resolutions but technically a behavior change,
so the flag stays off unless requested and existing users get
bit-for-bit identical output. Opt in if you control your asset UVs and
want the speedup:
    #define SW_TEXTURE_REPEAT_POT_FAST
This addresses the long-standing TODO comment "If the textures are POT,
avoid the division for SW_REPEAT" in `sw_texture_sample_linear`.
Made-with: Cursor
    #ifdef SW_TEXTURE_REPEAT_POT_FAST
    if ((tex->sWrap == SW_REPEAT) && ((tex->width & tex->wMinus1) == 0))
I think it's fine to restrict SW_REPEAT to POT textures only when SW_TEXTURE_REPEAT_POT_FAST is defined; this would remove all the extra branches here.
This would require updating swTexParameteri so that SW_REPEAT is applied only if the texture is POT; otherwise the wrap mode is left unchanged and RLSW.errCode is set to SW_INVALID_OPERATION.
We could also store a bool isPOT; in sw_texture_t, determined during sw_texture_alloc.
We should also explicitly set default parameters in sw_texture_alloc, like SW_CLAMP for wrapping.
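A rough sketch of what this suggestion could look like, under stated assumptions: every type and function name below (`texture_sketch_t`, `texture_sketch_init`, `texture_sketch_set_wrap_s`, the `TEX_*` enums) is hypothetical and only mirrors the roles of `sw_texture_t`, `sw_texture_alloc`, `swTexParameteri`, `SW_CLAMP`/`SW_REPEAT` and `SW_INVALID_OPERATION`. The error is returned here rather than written to a global such as `RLSW.errCode`, purely to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the rlsw wrap modes and error codes. */
typedef enum { TEX_CLAMP, TEX_REPEAT } tex_wrap_t;
typedef enum { TEX_NO_ERROR, TEX_INVALID_OPERATION } tex_err_t;

/* Hypothetical, reduced texture record with the cached isPOT flag. */
typedef struct {
    int width, height;
    bool isPOT;              /* computed once at allocation time */
    tex_wrap_t sWrap, tWrap;
} texture_sketch_t;

static bool is_pot(int size)
{
    return size > 0 && (size & (size - 1)) == 0;
}

/* Allocation-time setup: cache isPOT and set explicit default wrap modes. */
static void texture_sketch_init(texture_sketch_t *tex, int width, int height)
{
    assert(tex != NULL);
    tex->width  = width;
    tex->height = height;
    tex->isPOT  = is_pot(width) && is_pot(height);
    tex->sWrap  = TEX_CLAMP;
    tex->tWrap  = TEX_CLAMP;
}

/* Parameter setter: REPEAT is accepted only for POT textures; otherwise the
   current mode is left unchanged and an error is reported. */
static tex_err_t texture_sketch_set_wrap_s(texture_sketch_t *tex, tex_wrap_t wrap)
{
    assert(tex != NULL);
    if (wrap == TEX_REPEAT && !tex->isPOT) return TEX_INVALID_OPERATION;
    tex->sWrap = wrap;
    return TEX_NO_ERROR;
}
```

With the check moved to the setter, a sampler could assume that any texture in REPEAT mode is POT and apply the bitmask unconditionally, which is the branch removal the comment is after.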
I usually try to avoid storing information that can be calculated when needed, but I understand that isPOT would be useful for every texture access. I forget that we are operating one level below OpenGL...
Even though ESP32 is not "natively" supported by raylib, I think it would be interesting to consider how to implement the necessary components for this in rlsw.
This is how we solved it in the esp32 repo. Double-buffer the color framebuffer and run the display transfer on a dedicated FreeRTOS task pinned to core 0, overlapping it with rendering on core 1.
rlsw.h changes:
Port layer changes:
SwapScreenBuffer flow:
Memory cost: ~115 KB PSRAM (second color buffer) + 4 KB task stack. Since render time (~42 ms) > transfer time (~15 ms), the wait never blocks in […]
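The figures in this comment can be sanity-checked with a few lines of arithmetic. This is only an idealized model, not the port-layer code: it assumes the 240x240 R5G6B5 (2 bytes per pixel) framebuffer quoted in the measurements elsewhere on this page, and it ignores synchronization overhead and PSRAM bus contention. All function names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of one color buffer. */
static size_t color_buffer_bytes(int width, int height, int bytes_per_pixel)
{
    assert(width > 0 && height > 0 && bytes_per_pixel > 0);
    return (size_t)width * (size_t)height * (size_t)bytes_per_pixel;
}

/* Single-buffered: the transfer starts only after rendering has finished. */
static int frame_ms_serial(int render_ms, int transfer_ms)
{
    return render_ms + transfer_ms;
}

/* Double-buffered with the transfer on the other core: the slower of the
   two stages sets the frame time (idealized, no contention modeled). */
static int frame_ms_overlapped(int render_ms, int transfer_ms)
{
    return render_ms > transfer_ms ? render_ms : transfer_ms;
}
```

Under these assumptions the second buffer costs 240 * 240 * 2 = 115200 bytes, matching the quoted ~115 KB, and the frame time drops from 42 + 15 = 57 ms to 42 ms because the transfer hides entirely behind rendering.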
Agree. I think ESP32 is a great testbed for further performance improvements that can be extrapolated to other microcontrollers, I'd love to see raylib running on RPI Pico 2 in the future! 😄
@jensroth-git @Bigfoot71 thanks for the improvement and the review! I'm merging the changes! Feel free to send other PRs for further revisions and improvements! 😄
[rlsw] ESP32 / Xtensa hot-path optimizations (sw_rcp + ESP-DSP + opt-in POT wrap)
Summary
Three small, isolated, opt-out-safe optimizations to `rlsw.h` that together
roughly halve frame time on ESP32-S3 for textured 3D rendering, while
staying bit-for-bit identical on existing desktop / RISC-V / non-ESP32 builds.
These were extracted from a larger ESP32 port experiment after a code review
with @Bigfoot71, who suggested upstreaming the parts related to `sw_rcp` and
`dspm_mult_4x4x?_f32` while leaving the rest (clear, conversions, sampler
refactor, async double-buffer) for a separate effort.
The PR is split into three logically-isolated commits so each can be reviewed
or reverted independently.
1. `sw_rcp`: Xtensa `recip0.s` fast reciprocal (commit 1)
Float division on Xtensa LX6 / LX7 (ESP32, ESP32-S3) compiles to a software
`__divsf3` call. The rasterizer performs multiple `1.0f/x` operations per
triangle setup, per 16-pixel affine block, and per vertex transform.
A new `sw_rcp(x)` helper emits the hardware `recip0.s` seed plus two
Newton-Raphson refinement steps, 1-ULP accurate in ~7 instructions, all in
FPU registers. On every other target it expands to plain `1.0f/x`, so
generated code is byte-identical to before for non-Xtensa builds.
Applied only to documented hot-path reciprocals:
- perspective divide (`1/w`) in clip-and-project (PCT and PC paths)
- triangle span setup: `dxRcp`, `blockLenRcp`, `wRcpA`, `wRcpB`
- triangle scanline setup: `h02Rcp`, `h01Rcp`, `h12Rcp`
- axis-aligned quad: `wRcp`, `hRcp`
- line rasterizer: `stepRcp`
Other `1.0f/x` uses (`sw_matrix_translate` length, texture init `tx`/`ty`)
are not on the per-pixel hot path and are left untouched.
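The seed-plus-refinement idea can be tried on any host without Xtensa hardware. The sketch below is not the `sw_rcp` implementation from this PR: `rcp_seed_sim` is a hypothetical stand-in that imitates a low-precision hardware seed by truncating the true reciprocal to about 8 significant bits, and the refinement uses the textbook Newton-Raphson update `r = r * (2 - x * r)`, which roughly doubles the number of correct bits per step. The exact instruction sequence emitted on Xtensa may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware reciprocal seed. On Xtensa the seed
   would come from recip0.s; here a coarse estimate is faked by keeping only
   the sign, exponent and top 8 mantissa bits of 1.0f/x. */
static float rcp_seed_sim(float x)
{
    float r = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u;            /* clear the low 15 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Reciprocal via seed plus two Newton-Raphson steps. The sketch only targets
   finite, nonzero inputs in the normal float range. */
static float rcp_sketch(float x)
{
    assert(x != 0.0f);
    float r = rcp_seed_sim(x);      /* relative error below about 2^-8  */
    r = r * (2.0f - x * r);         /* error squared: about 2^-16       */
    r = r * (2.0f - x * r);         /* squared again: float precision   */
    return r;
}
```

Two steps are enough because the error squares on each iteration: an 8-bit seed reaches roughly 16 and then 32 correct bits, which exceeds the 24-bit float significand, so the result is limited by rounding rather than by the iteration.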
2. ESP-DSP matrix kernels (commit 2)
ESP-DSP is ESP-IDF's official optimized math library and ships hand-vectorized
kernels for the matrix sizes rlsw uses. Two integration points:
- `sw_matrix_mul_rst` → `dspm_mult_4x4x4_f32`: used for MVP build, lookat,
  push/multiply, etc. The flat-buffer call still produces the correct
  column-major product (transpose-of-transposes equivalence; see comment).
- `sw_immediate_push_vertex` → `dspm_mult_4x4x1_f32`: the per-vertex clip
  transform. ESP-DSP wants a row-major matrix here, so a `matMVP_rm[16]`
  row-major copy is maintained alongside `matMVP` and refreshed once per
  `isDirtyMVP` rebuild in `sw_immediate_begin`.
Detection is opt-in via `SW_USE_ESP_DSP` so existing ESP-IDF projects
that don't depend on the `esp-dsp` component keep building unchanged.
A user enables it from CMakeLists.txt (or anywhere before including rlgl.h):
    target_compile_definitions(${COMPONENT_LIB} PRIVATE SW_USE_ESP_DSP=1)
and adds the dependency to `idf_component.yml`:
    espressif/esp-dsp: "^1.4.0"
3. `SW_TEXTURE_REPEAT_POT_FAST`: opt-in POT bitmask wrap (commit 3)
Addresses the long-standing
`// NOTE: If the textures are POT, avoid the division for SW_REPEAT` TODO in
`sw_texture_sample_linear`.
When defined, textures whose width/height are powers of two use a bitmask
wrap (`x & (size-1)`) instead of `floorf`-based fractional wrap (nearest)
or the signed `(x % w + w) % w` chain (linear). NPOT textures keep using
the original paths via a runtime `(size & (size-1)) == 0` check, so
SW_REPEAT remains correct for them.
Off by default: for POT textures sampled with negative UV coordinates,
bitmask wrap (two's complement) can differ from `sw_fract` wrap by one
texel at the boundary. Imperceptible at typical resolutions but technically
a behavior change, so the flag stays off unless requested and existing users
get bit-for-bit identical output.
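For the integer (linear-sampler) path, the equivalence between the two-step modulo and the bitmask can be shown in isolation. The helpers below are illustrative and are not the functions in `rlsw.h`; they assume a two's-complement `int`, which holds on Xtensa and on every mainstream compiler target. The float `sw_fract` path used by the nearest sampler is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Runtime power-of-two test, as described for the NPOT fallback. */
static bool is_pot(int size)
{
    return size > 0 && (size & (size - 1)) == 0;
}

/* Original-style REPEAT wrap: two modulo operations, correct for any size. */
static int wrap_repeat_mod(int x, int size)
{
    assert(size > 0);
    return (x % size + size) % size;
}

/* Fast REPEAT wrap: a single AND, valid only when size is a power of two. */
static int wrap_repeat_mask(int x, int size)
{
    assert(is_pot(size));
    return x & (size - 1);
}

/* Dispatch that keeps NPOT textures on the modulo path. */
static int wrap_repeat(int x, int size)
{
    return is_pot(size) ? wrap_repeat_mask(x, size) : wrap_repeat_mod(x, size);
}
```

For example, with `size = 8` both paths map `x = -1` to texel 7, while `size = 6` is not a power of two and falls back to the modulo path, mapping `x = -1` to texel 5.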
Numbers
Measured on ESP32-S3 @ 240 MHz, 240×240 R5G6B5 framebuffer, textured 3D model
with depth testing, all three optimizations enabled:
[results table not recoverable; surviving fragment: "modeltime"]
(Larger frame-time numbers in the original write-up include a separate
async-double-buffer change in the platform layer that's not part of this PR.)
Risk / compatibility
- […] `#if defined(__XTENSA__)`, `#ifdef SW_HAS_ESP_DSP`, or
  `#ifdef SW_TEXTURE_REPEAT_POT_FAST`.
- […] `sw_rcp` activates (auto-on via `__XTENSA__`). Should be a strict perf
  improvement; `recip0.s` + 2 N-R steps gives 1-ULP accuracy, equivalent to
  `1.0f/x`.
- `SW_USE_ESP_DSP`: requires `espressif/esp-dsp` in the project's component
  manifest. Without it, the include of `dspm_mult.h` fails fast at compile
  time.
- `SW_TEXTURE_REPEAT_POT_FAST`: NPOT textures are unaffected (runtime
  branch). POT + negative UVs differ by one texel from `sw_fract` wrap.
Testing
- `-fsyntax-only` clean for the full `rlsw.h` (and via `rlgl.h` with
  `GRAPHICS_API_OPENGL_SOFTWARE`), with and without
  `SW_TEXTURE_REPEAT_POT_FAST`.
- […] `PLATFORM=Memory` + `OPENGL_VERSION=Software` (which routes through
  rlsw), ran the `shapes_basic_shapes` example, captured the rlsw framebuffer
  via `TakeScreenshot`. The `sw_rcp` scalar fallback produces a correctly
  rasterized scene (shapes/text/lines all rendered as expected). Verified
  the build is clean both with and without `-DSW_TEXTURE_REPEAT_POT_FAST`.
  (Note: the resulting PNG is upside-down due to an unrelated upstream
  bug in `rlReadScreenPixels` -- it unconditionally vertical-flips assuming
  real-`glReadPixels` bottom-left origin, but `swReadPixels` already returns
  top-down. Visible content is correct; can be fixed in a separate PR.)
- […] (`SW_USE_ESP_DSP`): cannot be exercised on x86 since the kernels don't
  exist outside ESP-IDF. Sits behind a compile-time guard so builds without
  the flag are byte-identical to the scalar version. Validated end-to-end on
  real hardware (next bullet).
- […] the full optimization stack (`__XTENSA__` `sw_rcp` + `SW_USE_ESP_DSP` +
  `SW_TEXTURE_REPEAT_POT_FAST`). Rendering is correct and the perf delta
  matches the table above.
Credits
- `rlsw` design and review: @Bigfoot71 (who suggested upstreaming the
  "parts related to `sw_rcp` and `dspm_mult_4x4x4/1_f32`").
Created in Cursor