Skip to content

Perf: vectorise Pandas datetime/timespan import+export; add Cython directives#3

Open
stewjb wants to merge 21 commits intomainfrom
perf/pandas-vectorize
Open

Perf: vectorise Pandas datetime/timespan import+export; add Cython directives#3
stewjb wants to merge 21 commits intomainfrom
perf/pandas-vectorize

Conversation

@stewjb
Copy link
Copy Markdown
Owner

@stewjb stewjb commented Apr 4, 2026

Summary

Cython + NumPy vectorisation (earlier commits)

  • Cython directives: Add `boundscheck=False, wraparound=False, cdivision=True` file-wide — eliminates runtime bounds/wrap guards from every inner loop.
  • Pandas DateTime/TimeSpan import: Replace per-row Python object boxing with raw int64 storage + `arr.view('datetime64[ms]')` / `arr.view('timedelta64[ms]')`. NaT written via INT64_MIN sentinel (single pass, no second `.loc` assignment).
  • Pandas DateTime/TimeSpan/Date export: Pre-transform to int64 SBDF-ms at `set_arrays` time — zero-copy export matching numeric types.
  • Pandas Time export: Replace `datetime.combine(min, t) - min` (2 Python object allocations per row) with direct integer arithmetic on time attributes.
  • `any_invalid` hotspot: Replace `any(invalid)` (Python iterator) with `bool(self.invalid_array.any())` (single numpy call). Responsible for the large numeric export gain.
  • Import assembly: Replace `pd.concat(columns, axis=1)` with `pd.DataFrame(dict(...))` — skips concat's index alignment and dtype consolidation overhead.

C-level pointer optimisations (latest commit)

  • String/binary export C helpers: Replace per-element `PySequence_GetItem` calls (Python API dispatch + refcount overhead) with direct pointer arithmetic into numpy array buffers (`PyArray_DATA` as `void**`/`unsigned char*`). Eliminates ~2N Python API round-trips per string/binary column.
  • `_export_get_offset_ptr`: Replace Python slice allocation (`array[start:start+count]`) with direct byte-offset pointer arithmetic. Avoids a numpy view object allocation on every chunk/column export call.
  • Import string null masking: Pre-mask the numpy object array before `pd.Series()` construction instead of assigning `None` via `.loc[]` post-construction (guarded by `values.dtype.kind == 'O'`).

Benchmark Results (100k rows, Pandas path)

Profile Metric `main` (ms) branch (ms) speedup
Temporal, no nulls Export 1527.6 142.3 10.7×
Temporal, no nulls Import 142.6 76.8 1.9×
Temporal, ~10% nulls Export 1121.2 138.6 8.1×
Temporal, ~10% nulls Import 149.9 84.5 1.8×
Numeric, no nulls Export 119.1 15.4 7.7×
Numeric, no nulls Import 18.8 16.2 1.2×
Numeric, ~10% nulls Export 21.2 21.9 ~same
Numeric, ~10% nulls Import 25.0 11.4 2.2×
String, no nulls Export 92.0 71.3 1.3×
String, no nulls Import 47.7 31.4 1.5×
String, ~10% nulls Export 75.8 52.1 1.5×
String, ~10% nulls Import 37.9 44.0 ~same
Binary, no nulls Export 90.0 92.5 ~same
Binary, no nulls Import 52.9 75.3 ~same
Binary, ~10% nulls Export 77.6 80.8 ~same
Binary, ~10% nulls Import 88.4 72.1 1.2×

Key wins:

  • Temporal export: 8–11× faster — zero-copy pre-transform for datetime64/timedelta64/date columns; direct attribute arithmetic for time column
  • Temporal import: 1.8–1.9× faster — int64 buffer reinterpret via `view()` with single-pass NaT sentinel
  • Numeric export: 7.7× faster — primarily from fixing the `any(Series)` hotspot (100k Python iterations → one numpy call)
  • String export with nulls: 1.5× faster — C helper now bypasses Python API dispatch per element
  • String import: 1.5× faster — pre-masking numpy object array avoids `.loc` indexing overhead
  • String/binary export without nulls: modest gains from eliminating PySequence_GetItem + slice allocation

Python 3.13.7 · Pandas 2.3.2 · NumPy 2.3.2 · Windows 11

Test plan

  • All existing tests pass (`python -m pytest spotfire/test/test_sbdf.py`)
  • Benchmark run on `main` and branch tip

🤖 Generated with Claude Code

@stewjb stewjb force-pushed the perf/pandas-vectorize branch 2 times, most recently from 51c7f30 to 2f6ebdf Compare April 5, 2026 00:25
stewjb and others added 18 commits April 4, 2026 19:48
…rectives

Import (Pandas path):
- DateTime and TimeSpan now use _import_vts_numpy (raw int64 ms) instead of
  per-row Python object boxing loops (_import_vt_datetime / _import_vt_timespan).
- DataFrame assembly converts with arr.view('datetime64[ms]') /
  arr.view('timedelta64[ms]') — zero-copy reinterpretation; supports the full
  SBDF date range (year 1-9999) without pd.to_datetime nanosecond overflow.

Export (Pandas path):
- _export_obj_dataframe stores tz-naive datetime64 columns as datetime64[ms]
  and timedelta64 columns as timedelta64[ms] instead of object arrays.
- _export_vt_datetime fast path: view('int64') + vectorised SBDF epoch offset
  addition replaces per-row isinstance + .to_pydatetime() + arithmetic.
- _export_vt_timespan fast path: view('int64') gives ms directly — no per-row
  .to_pytimedelta() or division.
- Object-dtype and tz-aware columns still fall through to the per-row loop.

Cython directives:
- boundscheck=False, wraparound=False, cdivision=True added file-wide,
  eliminating runtime bounds/wrap guards in every inner loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Export: pre-transform datetime64[ms]/timedelta64[ms] columns to int64
SBDF-ms once at set_arrays time so _export_vt_datetime/_export_vt_timespan
can use _export_get_offset_ptr directly (zero-copy, same as numeric types)
instead of allocating + copying + transforming per chunk.  Retain the
non-precomputed fast/slow paths for tz-aware and object-dtype columns.

Import: replace the double-pass NaT handling (zero + .loc assignment) with
a single write of the int64 NaT sentinel (INT64_MIN) before view(), avoiding
the slow Pandas indexing layer entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…onstructor

- Export: pre-compute date (object) columns to int64 SBDF-ms via pd.to_datetime,
  same zero-copy approach as datetime64/timedelta64.
- Export: replace any(invalid) with bool(self.invalid_array.any()) in set_arrays —
  the built-in any() was iterating 100k Python booleans per column; numpy any() is
  a single vectorised call.  This alone accounts for the large numeric export gain.
- Import: replace pd.concat(columns, axis=1) with pd.DataFrame(dict(...)) to skip
  concat's index alignment, dtype consolidation and metadata overhead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… .loc

- Time export: replace datetime.combine(min, t) - min (2 Python object
  allocations per row) with direct integer arithmetic on time attributes.
  As the last unoptimized temporal column, this is the primary driver of
  the ~40% temporal export improvement.
- Timedelta import: drop values.copy() — get_values_array() already returns
  a fresh array from np.concatenate(), so the explicit copy was redundant.
- Object-type import (.loc): guard column_series.loc[invalid_array] = None
  with if invalid_array.any() — consistent with datetime/timedelta paths,
  avoids Pandas indexing overhead for null-free columns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pd.to_datetime(errors='coerce') silently converts dates outside the Pandas
Timestamp range (year 1, pre-Gregorian, year 9999) to NaT, then to the Unix
epoch.  Replace with np.asarray(..., dtype='datetime64[D]') which covers the
full Python date range.  Zero NaT positions (INT64_MIN) before multiplying to
prevent int64 overflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eight new test methods covering gaps exposed by the zero-copy temporal
optimizations: null roundtrips, negative timespans, pre-epoch/out-of-range
dates (year 1, pre-Gregorian, year 9999), pre-epoch datetimes, time edge
cases (midnight, end-of-day, microsecond truncation), all-null temporal
columns, and NaT at specific positions in numpy datetime64/timedelta64
arrays.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two new tests targeting the boundscheck=False Cython directives:

- test_empty_dataframe: exercises every column type with 0 rows, verifying
  that zero-iteration export loops don't crash or corrupt memory.

- test_multichunk_export: exports 100_001 rows (one more than the default
  100_000-row slice size) and checks values at both the first row and the
  chunk boundary (row 100_000).  Covers _export_vt_time's direct [start+i]
  indexing and _export_get_offset_ptr for the precomputed int64 paths.

- test_polars_string_multichunk: same chunk-boundary check for the Polars
  Arrow buffer path in _export_extract_string_obj_arrow, which does raw C
  pointer arithmetic into the values buffer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…extension

Compiles sbdf.pyx with -fsanitize=address -fno-omit-frame-pointer and runs the
full test suite under LD_PRELOAD=libasan.so with PYTHONMALLOC=malloc.  This
provides runtime detection of heap buffer overflows that boundscheck=False and
the raw C pointer arithmetic in sbdf_helpers.c leave unchecked at the Python
level.  detect_leaks=0 suppresses intentional Python allocator "leaks".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… 3 chars)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…otlib false positive

When using LD_PRELOAD ASan injection with a non-ASan-compiled Python, ASan's
__cxa_throw interceptor is never initialized.  matplotlib's ft2font.so throws a
C++ exception during import, hitting the uninitialized interceptor and causing a
CHECK failure.  intercept_cxx_exceptions=0 disables the interceptor entirely;
sbdf.pyx generates no C++ exceptions so there is no loss of coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…3.13

mypy: pd.array() with list[NaTType] or list[NaT|Timedelta] and a string dtype
has no matching overload in pandas-stubs.  Add type: ignore[call-overload] on
the two affected lines in test_all_null_temporal_columns and
test_numpy_timedelta_with_nulls.

ASan: Python 3.14 (beta) triggers a CHECK failure in asan_interceptors.cpp
when ft2font.so throws a C++ exception, even with intercept_cxx_exceptions=0.
Pin the ASan job to Python 3.13 where LD_PRELOAD ASan injection works cleanly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e.js 24; fix line-too-long

- ASan job: replace test_requirements_default.txt with html-testRunner + polars + pillow.
  matplotlib/seaborn/geopandas/shapely use pybind11 C++ extensions that throw exceptions,
  crashing LD_PRELOAD libasan injection (intercept_cxx_exceptions=0 doesn't help here).
  pillow is plain C — safe to keep for PIL image export ASan coverage.
- Bump GitHub Actions to Node.js 24: checkout v4→v5, setup-python v5→v6,
  upload-artifact v4→v7, download-artifact v4→v8.
- Fix pylint line-too-long (127>120) in test_sbdf.py line 565.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rrency group

test_sbdf.py imported geopandas, matplotlib, and seaborn unconditionally, causing
ModuleNotFoundError in the ASan CI job where those packages are not installed.
Change to try/except with None fallback (matching the polars pattern) and add
@unittest.skipIf guards to test_read_write_geodata, test_image_matplot,
test_image_seaborn.

Also add concurrency group to build.yaml to cancel superseded runs on push.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dule alias

Without the explicit import, pylint sees 'matplotlib = None' in the except block
as a new constant assignment and flags it as invalid-name (expects UPPER_CASE).
Adding 'import matplotlib' before 'import matplotlib.pyplot' matches the same
try/except pattern used for polars (import + None fallback).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ce alloc in offset ptr

Three C-level optimizations:

1. _export_extract_string_obj / _export_extract_binary_obj: replace per-element
   PySequence_GetItem calls (Python API dispatch + refcount overhead) with direct
   pointer arithmetic into numpy array buffers.  Callers now pass
   PyArray_DATA(values_array) as void** and PyArray_DATA(invalid_array) as
   unsigned char*, eliminating ~2N Python API round-trips per string/binary column.

2. _export_get_offset_ptr: replace the Python slice allocation
   (array[start:start+count]) with direct byte-offset arithmetic on PyArray_DATA.
   Avoids a numpy view object allocation on every chunk/column export call.

3. Import string columns: pre-mask the numpy object array before pd.Series()
   construction instead of assigning None via .loc[] after the fact.  The .loc
   path triggers pandas label-indexing overhead; direct numpy assignment is O(k)
   with no indexer allocation.  Applied only when values.dtype.kind == 'O' to
   avoid incorrect coercion on bool/float arrays.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tyle violation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@stewjb stewjb force-pushed the perf/pandas-vectorize branch from 2f6ebdf to 7c1ed67 Compare April 5, 2026 00:49
stewjb and others added 3 commits April 4, 2026 19:57
…tring

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…plint line-length rule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant