[rlsw] ESP32 optimizations by jensroth-git · Pull Request #5827 · raysan5/raylib

jensroth-git · 2026-04-30T16:08:01Z

[rlsw] ESP32 / Xtensa hot-path optimizations (sw_rcp + ESP-DSP + opt-in POT wrap)

Summary

Three small, isolated optimizations to rlsw.h, each behind a compile-time
guard, that together reduce render time on ESP32-S3 for textured 3D rendering
(from ~44 ms to ~39 ms; see Numbers), while staying bit-for-bit identical by
default on existing desktop / RISC-V / non-ESP32 builds.

These were extracted from a larger ESP32 port experiment after a code review
with @Bigfoot71, who suggested upstreaming the parts related to sw_rcp and
dspm_mult_4x4x?_f32 while leaving the rest (clear, conversions, sampler
refactor, async double-buffer) for a separate effort.

The PR is split into three logically-isolated commits so each can be reviewed
or reverted independently.

1. `sw_rcp` — Xtensa `recip0.s` fast reciprocal (commit 1)

Float division on Xtensa LX6 / LX7 (ESP32, ESP32-S3) compiles to a software
__divsf3 call. The rasterizer performs multiple 1.0f/x operations per
triangle setup, per 16-pixel affine block, and per vertex transform.

A new sw_rcp(x) helper emits the hardware recip0.s seed plus two
Newton-Raphson refinement steps — 1-ULP accurate in ~7 instructions, all in
FPU registers. On every other target it expands to plain 1.0f/x, so
generated code is byte-identical to before for non-Xtensa builds.
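The seed-plus-refinement idea can be sketched portably as below. This is a minimal illustration, not the PR's `sw_rcp`: `rcp_seed` and `rcp_sketch` are hypothetical names, and the bit-trick seed is only a stand-in for the hardware `recip0.s` instruction, so the accuracy reached after two steps here is not claimed to match the 1-ULP figure above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the hardware seed: a crude bit-trick estimate
   of 1/x. Intended for finite, nonzero, normal floats only. */
static inline float rcp_seed(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);   /* type-pun without aliasing UB */
    bits = 0x7EF311C7u - bits;        /* roughly negates the exponent */
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* Seed plus two Newton-Raphson steps: y <- y * (2 - x*y).
   Each step roughly squares the relative error of the estimate. */
static inline float rcp_sketch(float x)
{
    assert(x != 0.0f);
    float y = rcp_seed(x);
    y = y * (2.0f - x * y);
    y = y * (2.0f - x * y);
    return y;
}
```

With a seed that is already good to a few percent, two refinement steps bring `x * rcp_sketch(x)` very close to 1; a more accurate hardware seed needs the same two steps to reach a much tighter bound.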

Applied only to documented hot-path reciprocals:

- perspective divide (1/w) in clip-and-project (PCT and PC paths)
- line/point clip-and-project NDC conversion
- triangle span: dxRcp, blockLenRcp, wRcpA, wRcpB
- triangle scanline: h02Rcp, h01Rcp, h12Rcp
- axis-aligned quad: wRcp, hRcp
- line rasterizer: stepRcp

Other 1.0f/x uses (sw_matrix_translate length, texture init tx/ty)
are not on the per-pixel hot path and are left untouched.

2. ESP-DSP matrix kernels (commit 2)

ESP-DSP is ESP-IDF's official optimized math library and ships hand-vectorized
kernels for the matrix sizes rlsw uses. Two integration points:

- sw_matrix_mul_rst → dspm_mult_4x4x4_f32: used for MVP build, lookat,
  push/multiply, etc. The flat-buffer call still produces the correct
  column-major product (transpose-of-transposes equivalence; see comment).
- sw_immediate_push_vertex → dspm_mult_4x4x1_f32: the per-vertex clip
  transform. ESP-DSP wants a row-major matrix here, so a matMVP_rm[16]
  row-major copy is maintained alongside matMVP and refreshed once per
  isDirtyMVP rebuild in sw_immediate_begin.
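The transpose-of-transposes equivalence can be checked with a small sketch. Everything here is hypothetical and stands in for the real code: `mul4x4_row_major` plays the role of a row-major kernel, and the sketch swaps the operands to obtain the column-major product, which is one way to apply the identity (AB)^T = B^T A^T. Whether the PR swaps operands at the call site or relies on rlsw's own argument order is not shown in this write-up.

```c
#include <assert.h>

/* Hypothetical stand-in for a row-major 4x4 kernel: C = A * B.
   C must not alias A or B. */
static void mul4x4_row_major(const float *A, const float *B, float *C)
{
    assert(C != A && C != B);
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) sum += A[r*4 + k] * B[k*4 + c];
            C[r*4 + c] = sum;
        }
    }
}

/* Reference: C = A * B with all matrices stored column-major,
   i.e. element (row r, col c) lives at index [c*4 + r]. */
static void mul4x4_col_major_ref(const float *A, const float *B, float *C)
{
    for (int c = 0; c < 4; c++) {
        for (int r = 0; r < 4; r++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) sum += A[k*4 + r] * B[c*4 + k];
            C[c*4 + r] = sum;
        }
    }
}

/* A column-major buffer read as row-major is the transpose, so the row-major
   kernel applied to (B, A) yields (AB)^T in row-major, which is the same flat
   buffer as AB in column-major. */
static void mul4x4_col_major_via_row_major(const float *A, const float *B, float *C)
{
    mul4x4_row_major(B, A, C);
}

/* Row-major copy of a column-major matrix: a transpose of the flat buffer. */
static void transpose4x4(const float *src, float *dst)
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[r*4 + c] = src[c*4 + r];
}
```

The `transpose4x4` helper illustrates the 16 scalar copies needed to keep a row-major mirror of a column-major matrix, as described for matMVP_rm.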

Detection is opt-in via SW_USE_ESP_DSP so existing ESP-IDF projects
that don't depend on the esp-dsp component keep building unchanged.
A user enables it from CMakeLists.txt (or anywhere before including rlgl.h):

target_compile_definitions(${COMPONENT_LIB} PRIVATE SW_USE_ESP_DSP=1)

and adds the dependency to idf_component.yml:

dependencies:
  espressif/esp-dsp: "^1.4.0"

3. `SW_TEXTURE_REPEAT_POT_FAST` — opt-in POT bitmask wrap (commit 3)

Addresses the long-standing TODO in sw_texture_sample_linear:
"// NOTE: If the textures are POT, avoid the division for SW_REPEAT".

When defined, textures whose width/height are powers of two use a bitmask
wrap (x & (size-1)) instead of floorf-based fractional wrap (nearest)
or the signed (x % w + w) % w chain (linear). NPOT textures keep using
the original paths via a runtime (size & (size-1)) == 0 check, so
SW_REPEAT remains correct for them.

Off by default: for POT textures sampled with negative UV coordinates,
bitmask wrap (two's complement) can differ from sw_fract wrap by one
texel at the boundary. Imperceptible at typical resolutions but technically
a behavior change, so existing users get bit-for-bit identical output.
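A minimal sketch of the two integer wrap strategies and the runtime POT check follows. The function names are hypothetical, not rlsw's. It covers only the integer texel-index wrap, where the bitmask and the signed modulo agree for power-of-two sizes on two's-complement targets; it does not model the float-to-texel conversion where the one-texel difference described above arises.

```c
#include <assert.h>
#include <stdbool.h>

/* True when size is a positive power of two. */
static inline bool is_pot(int size)
{
    return size > 0 && (size & (size - 1)) == 0;
}

/* Signed modulo wrap: valid for any positive size, costs two divisions. */
static inline int wrap_repeat_mod(int x, int size)
{
    return (x % size + size) % size;
}

/* Bitmask wrap: valid only for power-of-two sizes, a single AND.
   Relies on two's-complement representation for negative x. */
static inline int wrap_repeat_mask(int x, int size)
{
    assert(is_pot(size));
    return x & (size - 1);
}

/* Runtime dispatch: POT sizes take the fast path, NPOT keep the modulo. */
static inline int wrap_repeat(int x, int size)
{
    return is_pot(size) ? wrap_repeat_mask(x, size) : wrap_repeat_mod(x, size);
}
```

For example, with a size of 8, a coordinate of -1 wraps to texel 7 on both paths, while a size of 6 falls back to the modulo path.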

Numbers

Measured on ESP32-S3 @ 240 MHz, 240×240 R5G6B5 framebuffer, textured 3D model
with depth testing, all three optimizations enabled:

| Phase | Before | After (this PR, render only) |
| --- | --- | --- |
| Frame `model` time | ~44 ms | ~39 ms |

(Larger frame-time numbers in the original write-up include a separate
async-double-buffer change in the platform layer that's not part of this PR.)

Risk / compatibility

- Non-ESP32 builds: no behavior change. All new code paths sit behind
  #if defined(__XTENSA__), #ifdef SW_HAS_ESP_DSP, or
  #ifdef SW_TEXTURE_REPEAT_POT_FAST.
- ESP32 build without flags: only sw_rcp activates (auto-on via
  __XTENSA__). Should be a strict perf improvement; recip0.s + 2 N-R
  steps gives 1-ULP accuracy, equivalent to 1.0f/x.
- ESP32 build with SW_USE_ESP_DSP: requires espressif/esp-dsp in
  the project's component manifest. Without it, the include of dspm_mult.h
  fails fast at compile time.
- SW_TEXTURE_REPEAT_POT_FAST: NPOT textures are unaffected (runtime
  branch). POT + negative UVs differ by one texel from sw_fract wrap.

Testing

- Static check (Windows x86_64, clang): -fsyntax-only clean for the full
  rlsw.h (and via rlgl.h with GRAPHICS_API_OPENGL_SOFTWARE), with and
  without SW_TEXTURE_REPEAT_POT_FAST.
- End-to-end on x86_64 (Windows, mingw-w64 / gcc 14.2): built raylib with
  PLATFORM=Memory + OPENGL_VERSION=Software (which routes through rlsw),
  ran the shapes_basic_shapes example, captured the rlsw framebuffer via
  TakeScreenshot. The sw_rcp scalar fallback produces a correctly
  rasterized scene (shapes/text/lines all rendered as expected). Verified
  the build is clean both with and without -DSW_TEXTURE_REPEAT_POT_FAST.
  (Note: the resulting PNG is upside-down due to an unrelated upstream
  bug in rlReadScreenPixels -- it unconditionally vertical-flips assuming
  real-glReadPixels bottom-left origin, but swReadPixels already returns
  top-down. Visible content is correct; can be fixed in a separate PR.)
- ESP-DSP path (SW_USE_ESP_DSP): cannot be exercised on x86 since the
  kernels don't exist outside ESP-IDF. Sits behind a compile-time guard so
  builds without the flag are byte-identical to the scalar version. Validated
  end-to-end on real hardware (next bullet).
- ESP32-S3: end-to-end tested in a private fork of raylib-on-IDF with
  the full optimization stack (__XTENSA__ sw_rcp + SW_USE_ESP_DSP +
  SW_TEXTURE_REPEAT_POT_FAST). Rendering is correct and the perf delta
  matches the table above.

Credits

- Original rlsw design and review: @Bigfoot71
- ESP32 perf investigation and changes in this PR: @jensroth-git
- Kept narrow per @Bigfoot71's review feedback ("you could open a PR for the
  parts related to sw_rcp and dspm_mult_4x4x4/1_f32").

Created in Cursor

Adds a `sw_rcp(x)` inline reciprocal that on Xtensa (ESP32 / ESP32-S3 LX6/LX7) emits a `recip0.s` seed plus two Newton-Raphson refinement steps -- 1-ULP accurate in ~7 instructions, all in FPU registers. On every other target it expands to plain `1.0f/x`, so generated code is byte-identical to before for non-Xtensa builds.

Replaces the hot-path `1.0f/x` calls that were previously compiling to the `__divsf3` software helper on Xtensa:

- perspective divide (1/w) in triangle clip-and-project (PCT and PC paths)
- line and point clip-and-project NDC conversion
- triangle span setup: dxRcp, blockLenRcp, wRcpA, wRcpB
- triangle scanline setup: h02Rcp, h01Rcp, h12Rcp
- axis-aligned quad: wRcp, hRcp
- line rasterizer: stepRcp

Other `1.0f/x` uses (matrix translate/normalize, texture init `tx`/`ty`, sw_matrix_rotate inverse-length) are not on the per-pixel hot path and are left untouched.

Measured on ESP32-S3 @ 240 MHz, R5G6B5 240x240, textured 3D model: contributes to a ~10-15% rasterization speedup.

Made-with: Cursor

Adds an opt-in ESP-DSP code path for ESP32 / ESP32-S3 builds. ESP-DSP is ESP-IDF's official optimized math library and ships hand-vectorized kernels that beat the scalar implementations on Xtensa.

Two integration points:

1. `sw_matrix_mul_rst` -> `dspm_mult_4x4x4_f32` for any 4x4*4x4 multiply (used for MVP build, gluLookAt, push/multiply, etc.). rlsw stores matrices column-major and ESP-DSP reads row-major; the comment on the call site explains why the flat-buffer call still produces the correct column-major product (transpose-of-transposes equivalence).
2. `sw_immediate_push_vertex` -> `dspm_mult_4x4x1_f32` for the per-vertex clip-space transform. Because ESP-DSP expects a row-major matrix in this case, a row-major copy `matMVP_rm[16]` is maintained alongside `matMVP` and refreshed once per `isDirtyMVP` rebuild in `sw_immediate_begin`. Cost is 16 scalar copies per matrix update, amortized over thousands of vertices per frame.

Detection is **opt-in** via `SW_USE_ESP_DSP` so existing ESP-IDF projects that don't depend on the `esp-dsp` component keep building unchanged. A user enables it from CMakeLists.txt (or anywhere before including rlgl.h):

target_compile_definitions(${COMPONENT_LIB} PRIVATE SW_USE_ESP_DSP=1)

and adds the dependency to `idf_component.yml`:

espressif/esp-dsp: "^1.4.0"

Measured on ESP32-S3 @ 240 MHz, R5G6B5 240x240, textured 3D model: contributes meaningfully to the overall frame-time improvement (combined with sw_rcp).

Made-with: Cursor

Adds an opt-in compile-time flag that replaces the SW_REPEAT wrap chain with a bitmask (`x & (size-1)`) for power-of-two textures. NPOT textures keep using the original `sw_fract` / signed-modulo paths via a runtime `(size & (size-1)) == 0` check, so SW_REPEAT remains correct for them.

Affects two samplers:

- `sw_texture_sample_nearest`: drops the `floorf` + multiply + cast for POT textures in REPEAT mode (saves a software call on Xtensa).
- `sw_texture_sample_linear`: replaces the `(x % w + w) % w` two-step modulo (a software divide on Xtensa) with a single bitwise AND for POT textures in REPEAT mode. Two's-complement int wrap covers negative coordinates correctly.

Off by default: for POT textures sampled with negative UVs, bitmask wrap can differ from `sw_fract` wrap by one texel at the boundary. That is imperceptible at typical resolutions but technically a behavior change, so existing users get bit-for-bit identical output. Opt in if you control your asset UVs and want the speedup:

#define SW_TEXTURE_REPEAT_POT_FAST

This addresses the long-standing TODO comment "If the textures are POT, avoid the division for SW_REPEAT" in `sw_texture_sample_linear`.

Made-with: Cursor

Bigfoot71 · 2026-04-30T22:46:03Z

+#ifdef SW_TEXTURE_REPEAT_POT_FAST
+    if ((tex->sWrap == SW_REPEAT) && ((tex->width & tex->wMinus1) == 0))


I think it's fine to restrict SW_REPEAT to POT textures only when SW_TEXTURE_REPEAT_POT_FAST is defined; this would remove all the extra branches here.

This would require updating swTexParameteri so SW_REPEAT is applied only if the texture is POT, otherwise leave it unchanged and set RLSW.errCode = SW_INVALID_OPERATION.

We could also store a bool isPOT; in sw_texture_t determined during sw_texture_alloc.

We should also explicitly set default parameters in sw_texture_alloc, like SW_CLAMP for wrapping.
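A minimal sketch of that suggestion, under stated assumptions: all types and function names below are hypothetical stand-ins (not the real sw_texture_t, sw_texture_alloc, or swTexParameteri), and the error enum only mimics the role of SW_INVALID_OPERATION.

```c
#include <stdbool.h>

typedef enum { SKETCH_WRAP_CLAMP, SKETCH_WRAP_REPEAT } sketch_wrap_t;
typedef enum { SKETCH_NO_ERROR, SKETCH_INVALID_OPERATION } sketch_err_t;

typedef struct {
    int width, height;
    bool isPOT;              /* computed once at allocation */
    sketch_wrap_t sWrap, tWrap;
} sketch_texture_t;

static bool sketch_is_pot(int v)
{
    return v > 0 && (v & (v - 1)) == 0;
}

/* Allocation-time setup: cache the POT flag and set explicit defaults. */
static void sketch_texture_init(sketch_texture_t *tex, int w, int h)
{
    tex->width = w;
    tex->height = h;
    tex->isPOT = sketch_is_pot(w) && sketch_is_pot(h);
    tex->sWrap = SKETCH_WRAP_CLAMP;
    tex->tWrap = SKETCH_WRAP_CLAMP;
}

/* Parameter setter: REPEAT is accepted only for POT textures; otherwise the
   current wrap mode is left unchanged and an error is reported. */
static sketch_err_t sketch_texture_set_wrap(sketch_texture_t *tex, sketch_wrap_t mode)
{
    if (mode == SKETCH_WRAP_REPEAT && !tex->isPOT) return SKETCH_INVALID_OPERATION;
    tex->sWrap = mode;
    tex->tWrap = mode;
    return SKETCH_NO_ERROR;
}
```

With the check moved to the setter, a sampler could assume that any texture in REPEAT mode is POT and use the bitmask without a per-sample branch.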

Bigfoot71 · 2026-04-30T23:09:48Z

> (Larger frame-time numbers in the original write-up include a separate
> async-double-buffer change in the platform layer that's not part of this PR.)

Even though ESP32 is not "natively" supported by raylib, I think it would be interesting to consider how to implement the necessary components for this in rlsw.

jensroth-git · 2026-05-01T08:36:47Z

> > (Larger frame-time numbers in the original write-up include a separate
> > async-double-buffer change in the platform layer that's not part of this PR.)
>
> Even though ESP32 is not "natively" supported by raylib, I think it would be interesting to consider how to implement the necessary components for this in rlsw.

This is how we solved it in the esp32 repo.

Double-buffer the color framebuffer and run the display transfer on a dedicated FreeRTOS task pinned to core 0, overlapping it with rendering on core 1.

rlsw.h changes:

- sw_default_framebuffer_t now holds sw_texture_t color[2] + int colorIndex
- sw_default_framebuffer_alloc/free manage both buffers
- New public function swSwapColorBuffer() toggles the render target and returns a pointer to the just-rendered buffer

Port layer changes:

- raylib_port_flush_async(src, w, h): stores buffer pointer, notifies flush task, returns immediately
- raylib_port_wait_flush(): blocks until the previous transfer completes
- Flush task: persistent FreeRTOS task on core 0 (4 KB stack), sleeps via ulTaskNotifyTake, runs vflip + byte-swap + DMA, signals completion

SwapScreenBuffer flow:

wait_flush()          // ensure previous transfer finished
rendered = swSwapColorBuffer()  // swap render target, get finished buffer
flush_async(rendered) // kick off transfer on core 0

Memory cost: ~115 KB PSRAM (second color buffer) + 4 KB task stack.

Since render time (~42 ms) > transfer time (~15 ms), the wait never blocks in
steady state. Effective swap cost drops to ~30 µs.
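The buffer-swap half of this scheme can be sketched as below. This is only an illustration under assumptions: `fb_sketch_t` and its functions are hypothetical names rather than the actual sw_default_framebuffer_t or swSwapColorBuffer, raw R5G6B5 pixel pointers stand in for sw_texture_t, and the flush task, task notification, and DMA transfer are left out entirely.

```c
#include <stdint.h>

/* Hypothetical double-buffered color target. */
typedef struct {
    uint16_t *color[2];   /* two R5G6B5 color buffers */
    int colorIndex;       /* index of the buffer currently rendered into */
} fb_sketch_t;

/* Buffer the rasterizer should draw into right now. */
static uint16_t *fb_sketch_render_target(const fb_sketch_t *fb)
{
    return fb->color[fb->colorIndex];
}

/* Toggle the render target and return the buffer that was just finished,
   so the caller can hand it to the display transfer. */
static uint16_t *fb_sketch_swap_color(fb_sketch_t *fb)
{
    uint16_t *rendered = fb->color[fb->colorIndex];
    fb->colorIndex ^= 1;
    return rendered;
}
```

In the flow described above, the pointer returned by the swap is what would be passed to the asynchronous flush, while rendering of the next frame proceeds into the other buffer.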

raysan5 · 2026-05-01T17:38:26Z

+#ifdef SW_TEXTURE_REPEAT_POT_FAST
+    if ((tex->sWrap == SW_REPEAT) && ((tex->width & tex->wMinus1) == 0))


I usually try to avoid storing information that can be calculated when needed, but I understand that isPOT would be useful for every texture access. I forget that we are operating one level below OpenGL...

raysan5 · 2026-05-01T17:41:48Z

> Even though ESP32 is not "natively" supported by raylib, I think it would be interesting to consider how to implement the necessary components for this in rlsw.

Agree. I think ESP32 is a great testbed for further performance improvements that can be extrapolated to other microcontrollers, I'd love to see raylib running on RPI Pico 2 in the future! 😄

raysan5 · 2026-05-06T10:39:39Z

@jensroth-git @Bigfoot71 thanks for the improvement and the review! I'm merging the changes! Feel free to send other PRs for further revisions and improvements! 😄

jensroth-git added 3 commits April 30, 2026 16:32

raysan5 changed the title ~~Rlsw esp32 optimizations~~ [rlsw] ESP32 optimizations Apr 30, 2026

Bigfoot71 reviewed Apr 30, 2026

raysan5 reviewed May 1, 2026

raysan5 merged commit 7207c03 into raysan5:master May 6, 2026

raysan5 merged 3 commits into raysan5:master from jensroth-git:rlsw-esp32-optimizations
