Optimize asyncio shared router for reduced NIF overhead and lock contention#6

Closed
benoitc wants to merge 14 commits into main from asyncio-router-optimization

Conversation

benoitc (Owner) commented Feb 23, 2026

Summary

  • Increase PENDING_HASH_SIZE from 128 to 512 for higher capacity before rejection
  • Add off_heap mailbox to router for reduced GC pressure under high message load
  • Add combined handle_fd_event_and_reselect/2 NIF that reduces NIF call overhead
  • Only signal pthread_cond on 0->1 queue transition to reduce contention
  • Implement snapshot-under-lock in py_run_once for reduced lock contention

Also adds test/py_event_loop_bench.erl for measuring event throughput.
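The "signal only on the 0->1 transition" optimization can be sketched in Python (a hedged analogue of the C pthread_cond code, assuming a single consumer; class and method names are illustrative, not the project's API):

```python
import threading
from collections import deque

class WakeOnFirstQueue:
    """Queue that signals its condition variable only on the
    empty -> non-empty transition. A consumer only sleeps when the
    queue is empty, so producers appending to a non-empty queue can
    skip the (contended) notify. Single-consumer sketch."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            was_empty = not self._items
            self._items.append(item)
            if was_empty:
                # 0 -> 1 transition: the consumer may be asleep.
                self._cond.notify()

    def drain(self, timeout=None):
        with self._cond:
            # wait_for handles the predicate + timeout correctly,
            # including spurious wakeups.
            self._cond.wait_for(lambda: self._items, timeout=timeout)
            items = list(self._items)
            self._items.clear()
            return items
```

With multiple consumers this scheme would lose wakeups; the single-consumer assumption mirrors one router thread draining the queue.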

🤖 Generated with Claude Code

Optimize asyncio shared router for reduced NIF overhead and lock contention

- Increase PENDING_HASH_SIZE from 128 to 512 for higher capacity
- Add off_heap mailbox to router for reduced GC pressure
- Add combined handle_fd_event_and_reselect/2 NIF (reduces NIF calls)
- Only signal pthread_cond on 0->1 queue transition
- Implement snapshot-under-lock in py_run_once for reduced contention

Also adds test/py_event_loop_bench.erl for measuring event throughput.
New architecture uses Erlang mailbox as event queue instead of pthread_cond:
- py_event_loop_proc.erl: Event process receives FD/timer events directly
- py_event_loop_v2.erl: Drop-in replacement for py_event_router
- Timers fire directly to event process (no dispatch_timer NIF hop)
- FD events from enif_select go directly to event process

New NIFs:
- event_loop_set_event_proc/2: Set event process for a loop
- poll_via_proc/2: Poll via event process message passing

Backward compatible: legacy py_event_router still works.
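The mailbox-as-event-queue idea above can be sketched in Python, with a queue.Queue standing in for the Erlang process mailbox (illustrative names only, not the actual py_event_loop_proc API):

```python
import queue
import threading

class EventProc:
    """Producers (timer callbacks, FD watchers) post events straight
    into one mailbox; poll() just reads from it. There is no
    condition-variable dispatch layer between producer and consumer,
    which is the point of the V2 architecture."""

    def __init__(self):
        self._mailbox = queue.Queue()

    def send_event(self, event):
        # e.g. an FD-readiness event delivered directly, no extra hop
        self._mailbox.put(event)

    def start_timer(self, delay_s, event):
        # Timer fires directly into the mailbox (no dispatch hop).
        threading.Timer(delay_s, self._mailbox.put, args=(event,)).start()

    def poll(self, timeout_s):
        """Block for the first event, then drain whatever else is queued."""
        events = []
        try:
            events.append(self._mailbox.get(timeout=timeout_s))
            while True:
                events.append(self._mailbox.get_nowait())
        except queue.Empty:
            pass
        return events
```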
benoitc commented Feb 23, 2026

Performance Improvements

Commit 1: Shared Router Optimizations

  • Increased PENDING_HASH_SIZE from 128 to 512
  • Added off_heap mailbox to router
  • Combined handle_fd_event_and_reselect/2 NIF (reduces NIF calls)
  • Wake pthread_cond only on 0→1 queue transition
  • Snapshot-under-lock in py_run_once for reduced contention

Commit 2: Event Process Architecture (New)

Introduces py_event_loop_proc - uses Erlang mailbox as event queue:

  Metric            V1 (Router)   V2 (Event Process)   Improvement
  Timer throughput  49,104/sec    1,327,669/sec        27x faster

Why it's faster:

  • Timers fire directly to event process (no dispatch_timer NIF hop)
  • FD events from enif_select go directly to event process
  • No pthread_cond signaling for Erlang-side event collection

Usage:

{ok, LoopRef, EventProc} = py_event_loop_v2:new(),
Events = py_event_loop_v2:poll(EventProc, TimeoutMs).

Backward compatible - legacy py_event_router still works.

benoitc commented Feb 23, 2026

Extended Events (Commit 3)

Event process now handles all event types through unified mailbox:

  Event Type                                             Use Case
  call_result, call_error                                Sync Python call completions
  async_result, async_error                              Async Python call completions
  subprocess_exit, subprocess_stdout, subprocess_stderr  Subprocess I/O
  socket_data, socket_closed, socket_error               Network events
  {tcp, ...}, {udp, ...} (native)                        Direct gen_tcp/gen_udp handling
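Routing events from the unified mailbox might look like the following Python sketch (handler names and the dispatch shape are assumptions for illustration, not the project's actual code):

```python
def dispatch(event, handlers):
    """Route one event tuple from the unified mailbox to its handler
    group, following the event-type table above."""
    kind = event[0]
    if kind in ("call_result", "call_error",
                "async_result", "async_error"):
        handlers["calls"](event)           # sync/async call completions
    elif kind.startswith("subprocess_"):
        handlers["subprocess"](event)      # exit / stdout / stderr
    elif kind.startswith("socket_"):
        handlers["sockets"](event)         # data / closed / error
    elif kind in ("tcp", "udp"):
        handlers["native"](event)          # raw gen_tcp/gen_udp-style messages
    else:
        raise ValueError(f"unknown event type: {kind}")
```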

Benchmark

=== Extended Events Benchmark ===
Events per type: 50000

  call_result          5,626,829 events/sec
  async_result         5,703,205 events/sec
  subprocess_stdout    6,165,228 events/sec
  socket_data          5,469,861 events/sec

  Average: 5,741,281 events/sec

All 113 tests pass.

benoitc commented Feb 23, 2026

Python asyncio Integration (Commit 4)

The ErlangEventLoop now handles all extended event types for unified event processing:

Event Types Added

  Constant                      Value  Use Case
  EVENT_TYPE_CALL_RESULT        10     Sync call succeeded
  EVENT_TYPE_CALL_ERROR         11     Sync call failed
  EVENT_TYPE_ASYNC_RESULT       12     Async call succeeded
  EVENT_TYPE_ASYNC_ERROR        13     Async call failed
  EVENT_TYPE_SUBPROCESS_EXIT    20     Process exited
  EVENT_TYPE_SUBPROCESS_STDOUT  21     Stdout data
  EVENT_TYPE_SUBPROCESS_STDERR  22     Stderr data
  EVENT_TYPE_SOCKET_DATA        30     Socket received data
  EVENT_TYPE_SOCKET_CLOSED      31     Socket closed
  EVENT_TYPE_SOCKET_ERROR       32     Socket error

Registration API

loop = ErlangEventLoop()

# For py.call_async results
callback_id = loop._next_callback_id()
loop._register_async_future(callback_id, my_future)

# For subprocess
loop._register_subprocess(callback_id, protocol, transport)

# For socket/TCP/UDP
loop._register_socket(callback_id, protocol, transport)

All 24 event loop tests pass.

benoitc force-pushed the asyncio-router-optimization branch from 1e358ad to 2e14f78 on February 23, 2026 at 14:25
Phase 1 of unified event-driven architecture.

- Add py_callback_id module with atomic counter
- Initialize counter in erlang_python_sup
- Uses persistent_term + atomics for lock-free, thread-safe ID generation
- IDs are monotonically increasing positive integers starting from 1

This provides unique callback IDs for correlating async operations
with their results in subsequent phases.
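A loose Python analogue of the py_callback_id scheme (CPython's itertools.count makes next() effectively atomic, roughly mirroring the persistent_term + atomics approach; this is a sketch, not the Erlang module):

```python
import itertools

# Process-wide, monotonically increasing callback ID source.
# IDs start from 1, as described in the commit.
_counter = itertools.count(1)

def next_callback_id():
    """Return a unique positive integer ID, safe to call from
    multiple threads without an explicit lock."""
    return next(_counter)
```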
Add call_handlers map to state for tracking pending call results.
New message handlers:
- {register_call, CallbackId, Caller, Ref} - Register call handler
- {unregister_call, CallbackId} - Unregister before result arrives
- {call_result, CallbackId, Result} - Dispatch result to caller
- {call_error, CallbackId, Error} - Dispatch error to caller

Results are delivered as {py_result, Ref, Result} or {py_error, Ref, Error}
to the registered caller. Handlers work in both normal loop and wait_loop.
Safe to unregister before result arrives.
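The call_handlers bookkeeping described above can be sketched as follows (the caller's mailbox is modeled as a plain list; message shapes follow the commit text, the function names are illustrative):

```python
# Map of callback ID -> (caller mailbox, ref) for pending calls.
call_handlers = {}

def register_call(callback_id, caller_mailbox, ref):
    call_handlers[callback_id] = (caller_mailbox, ref)

def unregister_call(callback_id):
    # Safe to call before the result arrives; a late result for this
    # ID is then dropped silently in deliver().
    call_handlers.pop(callback_id, None)

def deliver(callback_id, tag, payload):
    """Dispatch a call_result/call_error to the registered caller as
    ("py_result", Ref, Result) or ("py_error", Ref, Error)."""
    handler = call_handlers.pop(callback_id, None)
    if handler is None:
        return                      # caller unregistered: drop
    mailbox, ref = handler
    mailbox.append((tag, ref, payload))
```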

Phase 2 of unified event-driven architecture.
Submit Python calls to a background worker thread that delivers
results via enif_send to py_event_loop_proc. Worker thread is
lazily started after Python initialization.

New files: c_src/py_submit.{c,h}, test/py_submit_test.erl
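A rough Python analogue of the py_submit worker: calls go onto a job queue served by a lazily started background thread, which posts results into the event process's mailbox (here a list; enif_send is not modeled, and all names are illustrative):

```python
import queue
import threading

_jobs = queue.Queue()
_worker = None

def _run(mailbox):
    # Background worker: execute each call and deliver the outcome
    # as a call_result/call_error event.
    while True:
        callback_id, fn, args = _jobs.get()
        try:
            mailbox.append(("call_result", callback_id, fn(*args)))
        except Exception as exc:
            mailbox.append(("call_error", callback_id, exc))

def submit(mailbox, callback_id, fn, *args):
    """Queue a call for the worker, starting it lazily on first use
    (as the commit describes for post-Python-init startup)."""
    global _worker
    if _worker is None:
        _worker = threading.Thread(target=_run, args=(mailbox,), daemon=True)
        _worker.start()
    _jobs.put((callback_id, fn, args))
```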

- Delete py_async_worker.erl, py_async_worker_sup.erl, py_async_pool.erl
- Remove async worker supervision from erlang_python_sup.erl
- Update py:async_gather to use py_async_driver (submit all, await all)
- Update py:async_stream to use async_stream_helper Python module
- Remove legacy async NIF exports from py_nif.erl
- Remove legacy async NIF table entries from py_nif.c
- Add priv/async_stream_helper.py for async generator collection

All async operations now go through py_async_driver which uses
the unified ErlangEventLoop via py_event_loop_proc.
- Add test/py_unified_bench.erl with benchmarks for:
  - Synchronous py:call throughput and latency
  - Async py:async_call with latency percentiles (p50, p90, p99, p999)
  - Concurrent request handling at various concurrency levels
  - Async gather batch performance

- Add docs/architecture.md documenting:
  - Component architecture diagram
  - Event-driven async flow
  - NIF architecture and GIL management
  - ASGI integration
  - Callback mechanism
  - Performance characteristics

- Update README.md with link to architecture docs
- Update docs/scalability.md to remove deprecated num_async_workers config

Run benchmarks: rebar3 as test shell, then py_unified_bench:run_all()
benoitc commented Feb 23, 2026

Benchmark Results

Ran the new unified architecture benchmarks (py_unified_bench:run_all()):

--- Synchronous py:call ---
  71,438 ops/sec, 13 μs avg

--- Async py:async_call ---
  15,968 ops/sec
  Latency: p50=60μs, p90=69μs, p99=98μs, p999=133μs

--- Concurrent Requests ---
  10 workers:  182,049 ops/sec (p50=51μs)
  50 workers:  102,732 ops/sec (p50=145μs)
  100 workers:  38,032 ops/sec (p50=172μs)

--- Async Gather ---
  Batch of 10: 18,440 ops/sec, 542 μs/batch
  Batch of 50: 18,088 ops/sec, 2,764 μs/batch

The unified event loop handles async operations well. Sync calls hit ~71K/sec which is solid given GIL constraints. Async has more overhead (~16K/sec) but provides non-blocking behavior and good tail latencies.

Concurrent throughput peaks around 10 workers and degrades beyond ~50 workers as contention kicks in.

- Update waiter field type spec to match actual 4-tuple storage
- Fix pattern match in handle_msg for DOWN message
- Update test_error_handling to accept flexible error formats
- Fix wait_loop escaping when cancel_timer arrives: inline timer
  cancellation instead of calling handle_cancel_timer which tail-calls
  loop/1 and exits wait mode, causing poll to hang indefinitely

- Fix async_gather mailbox leak: drain remaining py_result/py_error
  messages when an early error occurs to prevent leftover messages
  in caller's mailbox
- py_asgi:run_async/5: use Opts parameter for custom runner
- py_event_loop.c: fix OOM cleanup to return ALL events to freelist
- py_async_driver: cache event_proc pid in persistent_term for fast lookup
- py_event_loop_proc: simplify handle_msg DOWN, add dialyzer nowarn
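The async_gather drain fix can be sketched in Python (the caller's mailbox is modeled as a list of tuples; function and variable names are illustrative, not the project's code):

```python
def drain_gather_mailbox(mailbox, pending_refs):
    """On an early error, consume the py_result/py_error messages
    still owed to the other pending refs so they do not linger in
    the caller's mailbox, while leaving unrelated messages intact."""
    remaining = []
    for msg in mailbox:
        if msg[0] in ("py_result", "py_error") and msg[1] in pending_refs:
            continue                  # drop the stale reply
        remaining.append(msg)         # keep unrelated messages
    mailbox[:] = remaining
```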
@benoitc benoitc closed this Feb 23, 2026
@benoitc benoitc deleted the asyncio-router-optimization branch February 23, 2026 20:34