[rlsw] ESP32 optimizations#5827

Merged
raysan5 merged 3 commits into raysan5:master from jensroth-git:rlsw-esp32-optimizations
May 6, 2026

@jensroth-git
Contributor

[rlsw] ESP32 / Xtensa hot-path optimizations (sw_rcp + ESP-DSP + opt-in POT wrap)

Summary

Three small, isolated, opt-out-safe optimizations to rlsw.h that together
reduce render time on ESP32-S3 for textured 3D rendering (about 44 ms to
39 ms in the measurement under Numbers), while staying bit-for-bit identical
on existing desktop / RISC-V / non-ESP32 builds.

These were extracted from a larger ESP32 port experiment after a code review
with @Bigfoot71, who suggested upstreaming the parts related to sw_rcp and
dspm_mult_4x4x?_f32 while leaving the rest (clear, conversions, sampler
refactor, async double-buffer) for a separate effort.

The PR is split into three logically-isolated commits so each can be reviewed
or reverted independently.


1. sw_rcp — Xtensa recip0.s fast reciprocal (commit 1)

Float division on Xtensa LX6 / LX7 (ESP32, ESP32-S3) compiles to a software
__divsf3 call. The rasterizer performs multiple 1.0f/x operations per
triangle setup, per 16-pixel affine block, and per vertex transform.

A new sw_rcp(x) helper emits the hardware recip0.s seed plus two
Newton-Raphson refinement steps — 1-ULP accurate in ~7 instructions, all in
FPU registers. On every other target it expands to plain 1.0f/x, so
generated code is byte-identical to before for non-Xtensa builds.
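
As an illustration of the seed-plus-refinement idea, here is a minimal portable sketch. It does not use recip0.s: demo_rcp_seed is a hypothetical stand-in that fakes a low-precision hardware seed by keeping only the top 8 mantissa bits of 1.0f/x, and demo_rcp is not the sw_rcp from this PR. Each Newton-Raphson step y = y*(2 - x*y) roughly squares the relative error, which is why two steps are enough to reach float precision from an 8-bit seed.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for a hardware seed instruction: 1/x with only the
// top 8 mantissa bits kept (an assumption, not the real recip0.s behavior).
static float demo_rcp_seed(float x)
{
    float y = 1.0f/x;
    uint32_t bits;
    memcpy(&bits, &y, sizeof(bits));
    bits &= 0xFFFF8000u;    // keep sign, exponent and the top 8 mantissa bits
    memcpy(&y, &bits, sizeof(bits));
    return y;
}

// Seed plus two Newton-Raphson refinement steps for 1/x
static float demo_rcp(float x)
{
    float y = demo_rcp_seed(x);
    y = y*(2.0f - x*y);     // relative error roughly 2^-8 -> 2^-16
    y = y*(2.0f - x*y);     // roughly 2^-16 -> float rounding level
    return y;
}
```

On Xtensa the seed and the refinement steps would be FPU instructions instead of a division and bit masking; the structure of the refinement is the part this sketch is meant to show.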

Applied only to documented hot-path reciprocals:

  • perspective divide (1/w) in clip-and-project (PCT and PC paths)
  • line/point clip-and-project NDC conversion
  • triangle span: dxRcp, blockLenRcp, wRcpA, wRcpB
  • triangle scanline: h02Rcp, h01Rcp, h12Rcp
  • axis-aligned quad: wRcp, hRcp
  • line rasterizer: stepRcp

Other 1.0f/x uses (sw_matrix_translate length, texture init tx/ty)
are not on the per-pixel hot path and are left untouched.

2. ESP-DSP matrix kernels (commit 2)

ESP-DSP is ESP-IDF's official optimized math library and ships hand-vectorized
kernels for the matrix sizes rlsw uses. Two integration points:

  • sw_matrix_mul_rst -> dspm_mult_4x4x4_f32: used for MVP build, lookat,
    push/multiply, etc. The flat-buffer call still produces the correct
    column-major product (transpose-of-transposes equivalence; see comment).

  • sw_immediate_push_vertex -> dspm_mult_4x4x1_f32: the per-vertex clip
    transform. ESP-DSP wants a row-major matrix here, so a matMVP_rm[16]
    row-major copy is maintained alongside matMVP and refreshed once per
    isDirtyMVP rebuild in sw_immediate_begin.
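
The transpose-of-transposes equivalence can be checked with a small sketch. demo_mul_rowmajor below is a plain reference loop standing in for a row-major kernel, not the ESP-DSP implementation, and all demo_ names are hypothetical. A column-major buffer read as row-major is the transpose, so feeding the operands in swapped order gives B^T * A^T = (A*B)^T, which read back as column-major is A*B. Which operand order the actual rlsw call site uses is not shown here.

```c
// Reference row-major 4x4 multiply: C = A*B, element (r,c) at index r*4 + c.
// Stand-in for a row-major kernel; this is not the ESP-DSP code.
static void demo_mul_rowmajor(const float *A, const float *B, float *C)
{
    for (int r = 0; r < 4; r++)
    {
        for (int c = 0; c < 4; c++)
        {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) sum += A[r*4 + k]*B[k*4 + c];
            C[r*4 + c] = sum;
        }
    }
}

// Reference column-major 4x4 multiply: C = A*B, element (r,c) at index c*4 + r
static void demo_mul_colmajor(const float *A, const float *B, float *C)
{
    for (int r = 0; r < 4; r++)
    {
        for (int c = 0; c < 4; c++)
        {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) sum += A[k*4 + r]*B[c*4 + k];
            C[c*4 + r] = sum;
        }
    }
}

// Column-major A*B computed by the row-major kernel with swapped operands
static void demo_mul_colmajor_via_rowmajor(const float *A, const float *B, float *C)
{
    demo_mul_rowmajor(B, A, C);
}
```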

Detection is opt-in via SW_USE_ESP_DSP so existing ESP-IDF projects
that don't depend on the esp-dsp component keep building unchanged.
A user enables it from CMakeLists.txt (or anywhere before including rlgl.h):

target_compile_definitions(${COMPONENT_LIB} PRIVATE SW_USE_ESP_DSP=1)

and adds the dependency to idf_component.yml:

dependencies:
  espressif/esp-dsp: "^1.4.0"

3. SW_TEXTURE_REPEAT_POT_FAST — opt-in POT bitmask wrap (commit 3)

Addresses the long-standing // NOTE: If the textures are POT, avoid the division for SW_REPEAT TODO in sw_texture_sample_linear.

When defined, textures whose width/height are powers of two use a bitmask
wrap (x & (size-1)) instead of floorf-based fractional wrap (nearest)
or the signed (x % w + w) % w chain (linear). NPOT textures keep using
the original paths via a runtime (size & (size-1)) == 0 check, so
SW_REPEAT remains correct for them.

Off by default: for POT textures sampled with negative UV coordinates,
bitmask wrap (two's complement) can differ from sw_fract wrap by one
texel at the boundary. Imperceptible at typical resolutions but technically
a behavior change, so existing users get bit-for-bit identical output.


Numbers

Measured on ESP32-S3 @ 240 MHz, 240×240 R5G6B5 framebuffer, textured 3D model
with depth testing, all three optimizations enabled:

| Phase | Before | After (this PR, render only) |
|---|---|---|
| Frame model time | ~44 ms | ~39 ms |

(Larger frame-time numbers in the original write-up include a separate
async-double-buffer change in the platform layer that's not part of this PR.)

Risk / compatibility

  • Non-ESP32 builds: no behavior change. All new code paths sit behind
    #if defined(__XTENSA__), #ifdef SW_HAS_ESP_DSP, or
    #ifdef SW_TEXTURE_REPEAT_POT_FAST.
  • ESP32 build without flags: only sw_rcp activates (auto-on via
    __XTENSA__). Should be a strict perf improvement; recip0.s + 2 N-R
    steps gives 1-ULP accuracy, so results can differ from 1.0f/x in the
    last bit.
  • ESP32 build with SW_USE_ESP_DSP: requires espressif/esp-dsp in
    the project's component manifest. Without it, the include of dspm_mult.h
    fails fast at compile time.
  • SW_TEXTURE_REPEAT_POT_FAST: NPOT textures are unaffected (runtime
    branch). POT + negative UVs differ by one texel from sw_fract wrap.

Testing

  • Static check (Windows x86_64, clang): -fsyntax-only clean for the full
    rlsw.h (and via rlgl.h with GRAPHICS_API_OPENGL_SOFTWARE), with and
    without SW_TEXTURE_REPEAT_POT_FAST.
  • End-to-end on x86_64 (Windows, mingw-w64 / gcc 14.2): built raylib with
    PLATFORM=Memory + OPENGL_VERSION=Software (which routes through rlsw),
    ran the shapes_basic_shapes example, captured the rlsw framebuffer via
    TakeScreenshot. The sw_rcp scalar fallback produces a correctly
    rasterized scene (shapes/text/lines all rendered as expected). Verified
    the build is clean both with and without -DSW_TEXTURE_REPEAT_POT_FAST.
    (Note: the resulting PNG is upside-down due to an unrelated upstream
    bug in rlReadScreenPixels -- it unconditionally vertical-flips assuming
    real-glReadPixels bottom-left origin, but swReadPixels already returns
    top-down. Visible content is correct; can be fixed in a separate PR.)
  • ESP-DSP path (SW_USE_ESP_DSP): cannot be exercised on x86 since the
    kernels don't exist outside ESP-IDF. Sits behind a compile-time guard so
    builds without the flag are byte-identical to the scalar version. Validated
    end-to-end on real hardware (next bullet).
  • ESP32-S3: end-to-end tested in a private fork of raylib-on-IDF with
    the full optimization stack (__XTENSA__ sw_rcp + SW_USE_ESP_DSP +
    SW_TEXTURE_REPEAT_POT_FAST). Rendering is correct and the perf delta
    matches the table above.

Credits

  • Original rlsw design and review: @Bigfoot71
  • ESP32 perf investigation and changes in this PR: @jensroth-git
  • Kept narrow per @Bigfoot71's review feedback ("you could open a PR for the
    parts related to sw_rcp and dspm_mult_4x4x4/1_f32").

Created in Cursor

Adds a `sw_rcp(x)` inline reciprocal that on Xtensa (ESP32 / ESP32-S3
LX6/LX7) emits a `recip0.s` seed plus two Newton-Raphson refinement
steps -- 1-ULP accurate in ~7 instructions, all in FPU registers.
On every other target it expands to plain `1.0f/x`, so generated code
is byte-identical to before for non-Xtensa builds.

Replaces the hot-path `1.0f/x` calls that were previously compiling to
the `__divsf3` software helper on Xtensa:

  - perspective divide (1/w) in triangle clip-and-project (PCT and PC paths)
  - line and point clip-and-project NDC conversion
  - triangle span setup: dxRcp, blockLenRcp, wRcpA, wRcpB
  - triangle scanline setup: h02Rcp, h01Rcp, h12Rcp
  - axis-aligned quad: wRcp, hRcp
  - line rasterizer: stepRcp

Other `1.0f/x` uses (matrix translate/normalize, texture init `tx`/`ty`,
sw_matrix_rotate inverse-length) are not on the per-pixel hot path and
are left untouched.

Measured on ESP32-S3 @ 240 MHz, R5G6B5 240x240, textured 3D model:
contributes to a ~10-15% rasterization speedup.

Made-with: Cursor

Adds an opt-in ESP-DSP code path for ESP32 / ESP32-S3 builds. ESP-DSP is
ESP-IDF's official optimized math library and ships hand-vectorized
kernels that beat the scalar implementations on Xtensa.

Two integration points:

  1. `sw_matrix_mul_rst` -> `dspm_mult_4x4x4_f32` for any 4x4*4x4 multiply
     (used for MVP build, gluLookAt, push/multiply, etc.). rlsw stores
     matrices column-major and ESP-DSP reads row-major; the comment on the
     call site explains why the flat-buffer call still produces the
     correct column-major product (transpose-of-transposes equivalence).

  2. `sw_immediate_push_vertex` -> `dspm_mult_4x4x1_f32` for the per-vertex
     clip-space transform. Because ESP-DSP expects a row-major matrix in
     this case, a row-major copy `matMVP_rm[16]` is maintained alongside
     `matMVP` and refreshed once per `isDirtyMVP` rebuild in
     `sw_immediate_begin`. Cost is 16 scalar copies per matrix update,
     amortized over thousands of vertices per frame.
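
A minimal sketch of the row-major copy idea, with hypothetical demo_ names: the column-major matrix is transposed once (16 scalar copies), and a plain reference loop stands in for the row-major matrix-by-vector kernel. Neither function is the rlsw or ESP-DSP code.

```c
// Build a row-major copy of a column-major 4x4 matrix (16 scalar copies)
static void demo_make_rowmajor(const float *colMajor, float *rowMajor)
{
    for (int r = 0; r < 4; r++)
    {
        for (int c = 0; c < 4; c++) rowMajor[r*4 + c] = colMajor[c*4 + r];
    }
}

// Reference row-major 4x4 matrix times 4x1 column vector
static void demo_mul_4x4x1(const float *rowMajor, const float *v, float *out)
{
    for (int r = 0; r < 4; r++)
    {
        float sum = 0.0f;
        for (int k = 0; k < 4; k++) sum += rowMajor[r*4 + k]*v[k];
        out[r] = sum;
    }
}
```

The copy only has to be refreshed when the source matrix changes, so the per-vertex cost is just the matrix-by-vector product.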

Detection is **opt-in** via `SW_USE_ESP_DSP` so existing ESP-IDF projects
that don't depend on the `esp-dsp` component keep building unchanged.
A user enables it from CMakeLists.txt (or anywhere before including
rlgl.h):

    target_compile_definitions(${COMPONENT_LIB} PRIVATE SW_USE_ESP_DSP=1)

and adds the dependency to `idf_component.yml`:

    espressif/esp-dsp: "^1.4.0"

Measured on ESP32-S3 @ 240 MHz, R5G6B5 240x240, textured 3D model:
contributes meaningfully to the overall frame-time improvement
(combined with sw_rcp).

Made-with: Cursor

Adds an opt-in compile-time flag that replaces the SW_REPEAT wrap chain
with a bitmask (`x & (size-1)`) for power-of-two textures. NPOT textures
keep using the original `sw_fract` / signed-modulo paths via a runtime
`(size & (size-1)) == 0` check, so SW_REPEAT remains correct for them.

Affects two samplers:

  - `sw_texture_sample_nearest`: drops the `floorf` + multiply + cast for
    POT textures in REPEAT mode (saves a software call on Xtensa).
  - `sw_texture_sample_linear`: replaces the `(x % w + w) % w` two-step
    modulo (a software divide on Xtensa) with a single bitwise AND for
    POT textures in REPEAT mode. Two's-complement int wrap covers
    negative coordinates correctly.
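
For integer texel coordinates the two wraps agree, including for negative values, as long as the size is a power of two and ints are two's complement. A minimal sketch with hypothetical demo_ names:

```c
// Signed two-step modulo wrap, valid for any positive width
static int demo_wrap_mod(int x, int w)
{
    return (x % w + w) % w;
}

// Bitmask wrap, valid only when w is a power of two
static int demo_wrap_and(int x, int w)
{
    return x & (w - 1);
}
```

For example, with w = 8 both functions map -1 to 7 and 9 to 1.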

Off by default: for POT textures sampled with negative UVs, bitmask wrap
can differ from `sw_fract` wrap by one texel at the boundary. That is
imperceptible at typical resolutions but technically a behavior change,
so existing users get bit-for-bit identical output. Opt in if you
control your asset UVs and want the speedup:

    #define SW_TEXTURE_REPEAT_POT_FAST

This addresses the long-standing TODO comment "If the textures are POT,
avoid the division for SW_REPEAT" in `sw_texture_sample_linear`.

Made-with: Cursor

raysan5 changed the title from "Rlsw esp32 optimizations" to "[rlsw] ESP32 optimizations" on Apr 30, 2026
Comment thread src/external/rlsw.h
Comment on lines +2474 to +2475
#ifdef SW_TEXTURE_REPEAT_POT_FAST
if ((tex->sWrap == SW_REPEAT) && ((tex->width & tex->wMinus1) == 0))
Contributor

I think it's fine to restrict SW_REPEAT to POT textures only when SW_TEXTURE_REPEAT_POT_FAST is defined; this would remove all the extra branches here.

This would require updating swTexParameteri so SW_REPEAT is applied only if the texture is POT, otherwise leave it unchanged and set RLSW.errCode = SW_INVALID_OPERATION.

We could also store a bool isPOT; in sw_texture_t determined during sw_texture_alloc.

We should also explicitly set default parameters in sw_texture_alloc, like SW_CLAMP for wrapping.
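
A rough sketch of this suggestion, using hypothetical demo_ types and functions rather than the actual sw_texture_t, swTexParameteri, or RLSW.errCode definitions: isPOT is computed once at allocation, wrapping defaults to clamp, and a repeat request on an NPOT texture is rejected without changing state.

```c
#include <stdbool.h>

typedef enum { DEMO_CLAMP, DEMO_REPEAT } demo_wrap_t;
typedef enum { DEMO_NO_ERROR, DEMO_INVALID_OPERATION } demo_err_t;

typedef struct {
    int width, height;
    bool isPOT;             // cached once so samplers need no per-access check
    demo_wrap_t sWrap, tWrap;
} demo_texture_t;

static bool demo_is_pot(int size)
{
    return (size > 0) && ((size & (size - 1)) == 0);
}

// Set explicit defaults at allocation time, including the cached POT flag
static void demo_texture_init(demo_texture_t *tex, int width, int height)
{
    tex->width = width;
    tex->height = height;
    tex->isPOT = demo_is_pot(width) && demo_is_pot(height);
    tex->sWrap = DEMO_CLAMP;
    tex->tWrap = DEMO_CLAMP;
}

// Apply a wrap mode; repeat on an NPOT texture is refused and state is kept
static demo_err_t demo_texture_set_wrap(demo_texture_t *tex, demo_wrap_t wrap)
{
    if ((wrap == DEMO_REPEAT) && !tex->isPOT) return DEMO_INVALID_OPERATION;
    tex->sWrap = wrap;
    tex->tWrap = wrap;
    return DEMO_NO_ERROR;
}
```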

Owner

I usually try to avoid storing information that can be calculated when needed, but I understand that isPOT would be useful for every texture access. I forget that we are operating one level below OpenGL...

@Bigfoot71
Contributor

> (Larger frame-time numbers in the original write-up include a separate
> async-double-buffer change in the platform layer that's not part of this PR.)

Even though ESP32 is not "natively" supported by raylib, I think it would be interesting to consider how to implement the necessary components for this in rlsw.

@jensroth-git
Contributor Author

> > (Larger frame-time numbers in the original write-up include a separate
> > async-double-buffer change in the platform layer that's not part of this PR.)
>
> Even though ESP32 is not "natively" supported by raylib, I think it would be interesting to consider how to implement the necessary components for this in rlsw.

This is how we solved it in the esp32 repo.

Double-buffer the color framebuffer and run the display transfer on a dedicated FreeRTOS task pinned to core 0, overlapping it with rendering on core 1.

rlsw.h changes:

  • sw_default_framebuffer_t now holds sw_texture_t color[2] + int colorIndex
  • sw_default_framebuffer_alloc/free manage both buffers
  • New public function swSwapColorBuffer() toggles the render target and
    returns a pointer to the just-rendered buffer

Port layer changes:

  • raylib_port_flush_async(src, w, h) — stores buffer pointer, notifies
    flush task, returns immediately
  • raylib_port_wait_flush() — blocks until the previous transfer completes
  • Flush task: persistent FreeRTOS task on core 0 (4 KB stack), sleeps via
    ulTaskNotifyTake, runs vflip + byte-swap + DMA, signals completion

SwapScreenBuffer flow:

wait_flush()          // ensure previous transfer finished
rendered = swSwapColorBuffer()  // swap render target, get finished buffer
flush_async(rendered) // kick off transfer on core 0

Memory cost: ~115 KB PSRAM (second color buffer) + 4 KB task stack.

Since render time (~42 ms) > transfer time (~15 ms), the wait never blocks in
steady state. Effective swap cost drops to ~30 µs.
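
A minimal sketch of the buffer toggle described above, using plain arrays and hypothetical demo_ names instead of sw_texture_t and the actual swSwapColorBuffer. The FreeRTOS task and DMA side is left out; only the swap bookkeeping is shown. At 240x240 with 16-bit R5G6B5 pixels, one buffer is 240*240*2 = 115200 bytes, which matches the ~115 KB figure.

```c
#include <stdint.h>

#define DEMO_FB_WIDTH   240
#define DEMO_FB_HEIGHT  240

typedef struct {
    uint16_t color[2][DEMO_FB_WIDTH*DEMO_FB_HEIGHT];   // two R5G6B5 color buffers
    int colorIndex;                                     // buffer currently rendered to
} demo_framebuffer_t;

// Buffer the rasterizer should draw into for the current frame
static uint16_t *demo_render_target(demo_framebuffer_t *fb)
{
    return fb->color[fb->colorIndex];
}

// Toggle the render target and return the buffer that was just rendered,
// so it can be handed to the display transfer while the next frame is drawn
static const uint16_t *demo_swap_color_buffer(demo_framebuffer_t *fb)
{
    const uint16_t *rendered = fb->color[fb->colorIndex];
    fb->colorIndex ^= 1;
    return rendered;
}
```

The returned pointer stays valid for the transfer as long as the caller waits for the previous flush before swapping again, which is the role of the wait step in the flow above.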

@raysan5
Owner

raysan5 commented May 1, 2026

> Even though ESP32 is not "natively" supported by raylib, I think it would be interesting to consider how to implement the necessary components for this in rlsw.

Agree. I think ESP32 is a great testbed for further performance improvements that can be extrapolated to other microcontrollers, I'd love to see raylib running on RPI Pico 2 in the future! 😄

raysan5 merged commit 7207c03 into raysan5:master on May 6, 2026
@raysan5
Owner

raysan5 commented May 6, 2026

@jensroth-git @Bigfoot71 thanks for the improvement and the review! I'm merging the changes! Feel free to send other PRs for further revisions and improvements! 😄
