
Add native Polars DataFrame support #99

Open
stewjb wants to merge 41 commits into spotfiresoftware:main from stewjb:main

Conversation


@stewjb stewjb commented Mar 24, 2026

Closes #98

Summary

  • Export: export_data() now accepts polars.DataFrame and polars.Series directly, mapping Polars dtypes to SBDF types without any Pandas intermediary. Supported types: Boolean, Int8/16/32, Int64, Float32/64, Utf8/String, Date, Datetime, Duration, Time, Binary, Decimal, Categorical.
  • Import: import_data() gains an output_format parameter (default "pandas" for backwards compatibility). When output_format="polars", a polars.DataFrame is built directly from the raw numpy arrays — no Pandas DataFrame is created at any point.
  • Dependency: Polars is added as an optional dependency (spotfire[polars]), following the same pattern as spotfire[geo] and spotfire[plot].
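
The dtype-to-SBDF mapping described above can be sketched conceptually like this (the type-ID constants and helper name are illustrative stand-ins, not the actual sbdf.pyx internals):

```python
# Illustrative only: the real mapping and type-ID constants live in
# spotfire/sbdf.pyx; these values are placeholders for the sketch.
SBDF_BOOLTYPEID, SBDF_INTTYPEID, SBDF_LONGTYPEID = 1, 2, 3
SBDF_DOUBLETYPEID, SBDF_STRINGTYPEID, SBDF_DATETYPEID = 4, 5, 6

_POLARS_TO_SBDF = {
    "Boolean": SBDF_BOOLTYPEID,
    "Int32": SBDF_INTTYPEID,
    "Int64": SBDF_LONGTYPEID,
    "Float64": SBDF_DOUBLETYPEID,
    "String": SBDF_STRINGTYPEID,
    "Date": SBDF_DATETYPEID,
}

def infer_sbdf_type(dtype_name: str, column: str) -> int:
    """Return the SBDF type id for a Polars dtype name, or raise."""
    try:
        return _POLARS_TO_SBDF[dtype_name]
    except KeyError:
        raise ValueError(f"unknown dtype for {column}: {dtype_name}") from None
```

Unknown dtypes fail loudly rather than falling through, matching the SBDFError behavior described later in the thread.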

Performance benefit

The previous workaround required polars_df.to_pandas() before export, which doubles peak memory usage and adds 2–5 seconds of conversion time at 10M rows. The native path eliminates this entirely for export.

Spotfire data function context

When running inside a Spotfire data function, SBDF import and export happen automatically via data_function.py — users never call import_data or export_data directly. This has two implications:

Export (output variables): Full benefit. A user can build a polars.DataFrame in their script and return it as an output variable — export_data() handles it natively with no conversion.

Import (input variables): No benefit from import_data(output_format="polars"). Input data is always loaded by the framework via sbdf.import_data(self._file) (no output_format argument), so input variables always arrive in the script as pd.DataFrame. Users who want Polars for processing would still need to call pl.from_pandas(input_df) themselves. Fixing this properly would require changes to data_function.py and a mechanism for users to declare their preference — out of scope for this PR.

In short: the output_format parameter on import_data is primarily useful outside the Spotfire data function context (e.g. standalone scripts using the spotfire package directly). Inside a data function, only the export side benefits.

Test plan

  • test_write_polars_basic — export DataFrame with common types, re-import as Pandas and verify data
  • test_write_polars_nulls — null values are preserved through the roundtrip
  • test_write_polars_series — Polars Series export works
  • test_import_as_polars — import with output_format="polars" returns a native polars.DataFrame
  • test_polars_roundtrip — full Polars → SBDF → Polars roundtrip
  • All 72 existing tests continue to pass (1 pre-existing skip unrelated to this change)
  • pylint 10.00/10, mypy clean, cython-lint clean

🤖 Generated with Claude Code

stewjb and others added 9 commits March 23, 2026 20:06
…tetime, scatter compat

- Fix Categorical/Enum dtype: was incorrectly trying to recurse into
  dtype.categories (which doesn't exist on the dtype object); now casts
  series to Utf8 and maps to SBDF_STRINGTYPEID directly
- Add Enum dtype support (previously raised SBDFError)
- Warn on UInt64 export: values above Int64 max will overflow silently
- Warn on timezone-aware Datetime export: tz info is not preserved in SBDF
- Warn on Decimal export: marked experimental, precision may be lost
- Fix scatter() compatibility: add AttributeError fallback to set_at_idx()
  for older Polars versions within the supported range
- Add tests for all of the above

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add polars to test_requirements_default.txt so SbdfPolarsTest is
  actually executed in CI (previously skipped due to missing import)
- Add spotfire[polars] row to extras table in README
- Add usage note explaining Spotfire's bundled Python lacks Polars and
  that SPKs bundling Polars will be ~44 MB larger than typical packages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Raise SBDFError for unknown output_format values (previously fell
  through silently to Pandas)
- Emit SBDFWarning when Categorical/Enum columns are exported as String,
  consistent with existing UInt64 and timezone warnings
- Add test_invalid_output_format: verifies bad output_format raises
- Add test_write_polars_empty: verifies empty DataFrame exports cleanly
- Add test_write_polars_series_nulls: verifies null preservation in Series
- Add test_polars_categorical_warns: verifies Categorical warning fires

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A Polars Series of [None, None, None] has dtype pl.Null (no type can
be inferred). Previously this raised SBDFError with "unknown dtype".
Now it exports as an all-invalid String column, consistent with how
all-None Pandas columns are handled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CI static analysis runs mypy without polars installed; add
type: ignore[import-not-found] so mypy skips the missing stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explain non-obvious choices that would otherwise prompt review questions:
- Why dtype.__class__.__name__ instead of isinstance()
- Why scatter()/set_at_idx() try/except exists and which versions it covers
- Why is_object_numpy_type() cpdef wrapper is needed for a cdef attribute
- Why the output_format polars path short-circuits before pd.concat
- Why the Null dtype path returns a placeholder array

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…olars versions (>= 0.20)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Adds native Polars support to Spotfire’s SBDF import/export layer to avoid Pandas conversions (improving memory usage and performance for large datasets), and wires it up as an optional extra.

Changes:

  • Add polars as an optional dependency (spotfire[polars]) and enable it in dev/test setups.
  • Extend sbdf.export_data() to accept polars.DataFrame / polars.Series directly, with dtype→SBDF mapping.
  • Extend sbdf.import_data() with output_format to optionally construct a native polars.DataFrame without creating a Pandas DataFrame.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| spotfire/sbdf.pyx | Implements Polars import/export paths and dtype mappings; adds output_format to import_data(). |
| spotfire/sbdf.pyi | Updates type stub for import_data() to include output_format. |
| spotfire/test/test_sbdf.py | Adds Polars-focused unit tests for export/import/roundtrip + warnings. |
| pyproject.toml | Adds polars extra and includes it in dev extra. |
| test_requirements_default.txt | Installs Polars for test runs. |
| README.md | Documents spotfire[polars] and the new import/export behavior. |
| .gitignore | Ignores .venv, uv.lock, and .claude. |


Comment on lines +1238 to +1241
```python
context.set_valuetype_id(_export_infer_valuetype_from_polars_dtype(series.dtype, f"column '{col}'"))
invalids = series.is_null().to_numpy()
context.set_arrays(_export_polars_series_to_numpy(context, series), invalids)
column_metadata.append({})
```

Copilot AI Mar 24, 2026


In the Polars export path, invalids are derived from series.is_null(), which does not mark floating-point NaN values as invalid. In the existing Pandas path pd.isnull() treats NaN as missing, so exporting a Polars float column containing NaN will write NaNs as real values instead of SBDF invalids (behavior mismatch vs Pandas and likely incorrect for Spotfire missing-values semantics). Consider treating NaN as invalid for Float32/Float64 columns (e.g., combine is_null() with is_nan() when applicable).
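
The semantics Copilot is asking for can be sketched in plain Python (a stand-in for combining Polars' is_null() and is_nan(); the helper name is made up):

```python
import math

def invalid_mask(values, is_float):
    """Mark both None and NaN as invalid for float columns, mirroring
    pd.isnull() semantics; other columns treat only None as invalid."""
    if is_float:
        return [v is None or (isinstance(v, float) and math.isnan(v))
                for v in values]
    return [v is None for v in values]

# NaN and None are both "missing" for floats under Pandas-style semantics.
float_col = [1.0, float("nan"), None, 2.5]
assert invalid_mask(float_col, is_float=True) == [False, True, True, False]
```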

Author


Added series.is_null() combined with series.is_nan() for float columns so NaN is treated as missing, like the Pandas path.

```python
if na_value is not None:
    return np.asarray(series.fill_null(na_value).to_numpy(allow_copy=True),
                      dtype=context.get_numpy_dtype())
else:
```

Copilot AI Mar 24, 2026


_export_polars_series_to_numpy converts to an object ndarray when na_value is None. For Polars Datetime / Duration series, to_numpy() already produces datetime64 / timedelta64 arrays that the existing SBDF exporters can handle, so forcing dtype=object will box scalars and create an unnecessary copy (hurting the performance goal of this PR). Consider special-casing datetime/timespan to keep the native NumPy dtype (ideally normalized to the SBDF-supported resolution) instead of casting to object.

Suggested change

```diff
-else:
+else:
+    # For Datetime/Duration, keep native NumPy datetime64/timedelta64 dtypes instead of boxing to object.
+    if dtype_name in ("Datetime", "Duration"):
+        return series.to_numpy(allow_copy=True)
```

Author


Datetime and Duration now go to NumPy early, keeping their native dtypes.

- Move output_format validation to top of import_data() for fail-fast
  behaviour before the file is opened
- Raise SBDFError in _import_polars_dtype fallback instead of silently
  returning Utf8 for unknown SBDF type IDs
- Treat NaN as invalid (missing) for Float32/Float64 columns, matching
  Pandas pd.isnull() behaviour; add test_write_polars_float_nan
- Keep native datetime64/timedelta64 arrays for Datetime/Duration columns
  instead of boxing to object dtype (avoids unnecessary copy)
- Add @overload signatures to sbdf.pyi so callers get pd.DataFrame for
  the default output_format="pandas" and Any for output_format="polars"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stewjb commented Mar 24, 2026

@vrane-tibco @bbassett-tibco @mpanke-tibco thanks for considering this PR. Let me know if you all have thoughts.

stewjb and others added 2 commits March 24, 2026 19:47
_export_obj_dict_of_lists (line 1313): np.array(n) where n is an integer
creates a 0-dimensional array, not a 1-D array of length n. Every
export_data({"col": [...]}) call would raise IndexError. Fixed to
np.empty(shape, ...).

_export_obj_iterable (lines 1358-1366): np.append inside a for loop
reallocates the entire array on every iteration — O(n²) for a column
of n rows. Replaced with list accumulation and a single np.array()
call at the end.

Add test_export_dict_of_lists and test_export_list to cover both paths
(previously untested, which is why the bug went undetected).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
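
The list-accumulation fix described above is the standard remedy for quadratic append loops; sketched here with the stdlib array module standing in for NumPy:

```python
from array import array

def collect_quadratic(rows):
    # Anti-pattern analogous to np.append in a loop: each concatenation
    # copies the whole buffer, so n rows cost O(n**2) element copies.
    out = array("q")
    for r in rows:
        out = out + array("q", [r])
    return out

def collect_linear(rows):
    # Fix: accumulate in a Python list, materialize once at the end.
    acc = []
    for r in rows:
        acc.append(r)
    return array("q", acc)

# Both produce identical results; only the cost differs.
assert list(collect_quadratic(range(5))) == list(collect_linear(range(5)))
```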
For Int32/Int64 columns, the previous code constructed a pd.Series and
then assigned nulls via .loc[mask] = None in a second pass, which
triggers Pandas dtype coercion overhead internally.

Replace with pd.arrays.IntegerArray(values, mask) which constructs the
nullable integer array with the validity mask in a single operation,
avoiding the second pass entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vrane-tibco
Collaborator

@stewjb Appreciate the work here — the dtype mapping is thorough and the export mechanics are solid. But I have two concerns, one architectural and one about the performance framing.

On metadata -

This package isn't just a data serializer - it's specifically the bridge that carries Spotfire's interpretation of that data back and forth. The three things Spotfire cares about beyond raw values are spotfire_table_metadata, spotfire_column_metadata, and spotfire_type. All three are attached to Pandas objects, and the entire surface API (copy_metadata, get_spotfire_types, set_spotfire_types, set_geocoding_table, all exported from __init__.py) is built on top of those Pandas extension points.
Polars DataFrames don't support any of this. So right now:

  • On import: in sbdf.pyx, _import_build_polars_dataframe (line 875) only receives column_names and importer_contexts. The table_metadata and column_metadata that were correctly parsed from the SBDF file a few lines above simply never make it into the returned DataFrame; they are silently dropped.

  • On export: in sbdf.pyx, _export_obj_polars_dataframe hardcodes {} for both table and column metadata (lines 1248, 1251). The Pandas path reads these from the object; the Polars path has nowhere to read them from.

  • In the public API: copy_metadata(), get_spotfire_types(), and set_spotfire_types() all raise TypeError on non-Pandas objects (lines 32–35, 68–69, 86–87). So a user who gets a pl.DataFrame back from import_data(output_format="polars") and immediately tries to use any of those three functions hits a confusing error.

On performance -

The export side is genuinely better than the to_pandas() workaround.

The import side, though, is a different story. There is a genuine performance gain for numeric columns (Boolean, Integer, LongInteger, Float, Real): the Polars path does pl.Series(values=numpy_array), and Polars can reference the NumPy buffer directly for these types. That is genuine zero-copy. But for String or Datetime columns I doubt there will be a performance gain.

Take strings as a concrete example. In sbdf.pyx, _import_vt_string (lines 534–536) runs a Python loop that creates a Python str object for every row and stores them in a NPY_OBJECT array — that's unavoidable because SBDF stores strings as C char pointers, not Arrow buffers. Then _import_build_polars_dataframe calls .tolist() on that object array (line 726) to produce a Python list, because Polars can't consume a NPY_OBJECT array directly. Then pl.Series(values=list) re-encodes all those strings into Polars' Arrow buffer. So at peak memory you have the NPY_OBJECT array, the Python list, and the Arrow buffer all alive simultaneously - three representations of the same data. The Pandas import path just wraps the NPY_OBJECT array in a Series header and stops there: two representations, same str objects shared by reference.

So for import: numeric columns genuinely benefit, strings and datetimes are actually worse than the Pandas path in both memory and time.


stewjb commented Apr 1, 2026

@vrane-tibco this is good feedback. Thank you for taking the time to consider this. If you all feel the metadata is non-negotiable for this package, I understand that. We can fork this package if we deem this important. The biggest benefit is avoiding the doubling of peak memory when using Polars, but we can increase our memory to work around it.

The metadata loss is a limitation of polars, which you've mentioned.

Polars has an outstanding issue to address this (pola-rs/polars#5117), but it's been open since 2022 which isn't promising.

My suggestion would be to log a warning to the user on the limitation of Polars on import/export. That would handle your point about dropping the metadata silently. I don't think there is a clean way to do that hand-off natively into Polars. The alternative I can see is to return a list with the dataframe and metadata unpacked. That seems clunky and confusing. A possible compromise is another function strictly for Polars that returns the metadata unpacked.
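
The "separate function" compromise could look something like this (an entirely hypothetical API sketch; plain data stands in for a real pl.DataFrame):

```python
from typing import Any, NamedTuple

class PolarsImportResult(NamedTuple):
    """Hypothetical return shape for a Polars-specific import function:
    the DataFrame plus the metadata that Polars itself cannot carry."""
    data: Any                 # would be a polars.DataFrame
    table_metadata: dict
    column_metadata: dict

def import_data_polars_sketch(raw):
    # Stand-in for parsing an SBDF file; returns data and metadata unpacked
    # instead of silently dropping the metadata.
    return PolarsImportResult(
        data=raw["columns"],
        table_metadata=raw.get("table_meta", {}),
        column_metadata=raw.get("column_meta", {}),
    )

res = import_data_polars_sketch(
    {"columns": {"x": [1, 2]}, "table_meta": {"k": "v"}, "column_meta": {}})
assert res.table_metadata == {"k": "v"}
```

A named tuple keeps the common case ergonomic (`res.data`) while making the metadata impossible to lose accidentally.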

Your comment on the confusing error makes sense and I can address it.

On performance, the export benefit is the main driver for me of this PR.

As of now, I don't know how the user would specify within Spotfire that they want a Polars dataframe returned instead of Pandas. Personally, I just use df = spotfire_data_table within a data function, and I believe the only way to specify a Polars output would be to hack into the internals of this package.

One proposal would be to drop the import functionality totally given you can't reach it as intended currently. I included it because it helps for testing and if polars adoption becomes more widespread in the future.

I think we can address the concerns on datetimes and strings. I definitely think we can get it down to only 2 copies of the same data, which is still a net improvement over having to use .to_pandas(), which is the current default and holds 2 copies of the whole df in that process.

stewjb and others added 10 commits April 3, 2026 12:00
…etime import

- Emit SBDFWarning on both Polars import and export paths pointing to
  pola-rs/polars#5117 so metadata loss is never silent.
- Raise TypeError with a Polars-specific message from copy_metadata(),
  get_spotfire_types(), and set_spotfire_types() instead of a generic error.
- For the Polars import path, bypass the Python-boxing importers for
  DateTime/Date/TimeSpan: store raw int64 ms values via _import_vts_numpy,
  then in _import_build_polars_dataframe subtract the SBDF-to-Unix epoch
  offset in-place and reinterpret via .view() — reducing peak memory from
  3 live copies to 1-2 (down from creating Python datetime objects).
- String/Time/Binary/Decimal import: release the concatenated numpy array
  before building the Polars Arrow buffer (del + clear_values_arrays()) to
  cap peak at 2 live copies instead of 3.
- Add get_value_type_id() and clear_values_arrays() cpdef helpers on
  _ImportContext to support the above without Cython-level casts.
- Add 6 new tests covering the metadata warning and descriptive error paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
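
The epoch arithmetic in the commit above rests on the fixed offset between the SBDF epoch (0001-01-01) and the Unix epoch, which stdlib datetime can derive (the function name here is a sketch, not the Cython implementation):

```python
from datetime import datetime

# Days between the SBDF epoch (0001-01-01) and the Unix epoch (1970-01-01).
EPOCH_OFFSET_DAYS = (datetime(1970, 1, 1) - datetime(1, 1, 1)).days
EPOCH_OFFSET_MS = EPOCH_OFFSET_DAYS * 86_400_000

assert EPOCH_OFFSET_DAYS == 719_162

def sbdf_ms_to_unix_ms(values):
    """Shift SBDF millisecond timestamps to Unix milliseconds in place,
    mirroring the in-place subtraction on the int64 NumPy array."""
    for i, v in enumerate(values):
        values[i] = v - EPOCH_OFFSET_MS
    return values

# The SBDF representation of 1970-01-01 00:00:00 maps to Unix ms == 0.
assert sbdf_ms_to_unix_ms([EPOCH_OFFSET_MS]) == [0]
```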
Verifies that the in-place epoch-shift + .view('datetime64[D]') path in
_import_build_polars_dataframe produces identical results to the reference
np.astype('datetime64[D]') conversion across six dates: the SBDF epoch
(0001-01-01), one day before and the day of the Unix epoch, one day after,
a recent date, and the maximum representable date (9999-12-31).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous view('datetime64[ms]') approach always triggered a copy inside
Polars: _normalise_numpy_dtype() unconditionally calls .astype(np.int64) on
any datetime64 input before passing to the Rust constructor.

Verified via mutation test (numpy array modified after Series construction):
- Datetime: pl.Series(int64, Int64).cast(Datetime('ms')) — zero-copy; Int64
  and Datetime('ms') share the same int64 Arrow buffer (metadata-only cast).
- Duration: pl.Series(int64, Int64).cast(Duration('ms')) — same, zero-copy.
- Date: pl.Date is int32 internally, so int64→int32 narrowing is unavoidable
  (1 copy via .astype(np.int32)); pl.Series(int32, Date) is then zero-copy.
  Total: 2 copies from C data (down from 3 in the original NPY_OBJECT path).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_import_vt_date_int32 writes directly into an NPY_INT32 slice array at the
C level, so pl.Series(int32, pl.Date) in _import_build_polars_dataframe is
then zero-copy — eliminating the prior int64→int32 astype() narrowing copy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
For null-free numeric columns, skip fill_null and use to_numpy(allow_copy=False)
to return a direct view of the Arrow buffer.

For Datetime/Date/Duration/Time, extract raw integer buffers from the Polars Series
(zero-copy when null-free) and route through four new Polars-specific C-level
exporter functions that perform epoch/unit conversion in a tight C loop, completely
bypassing the Python-object-boxing loop in the generic exporters:
- _export_vt_polars_datetime: int64 ms (Unix) → add SBDF epoch offset
- _export_vt_polars_date: int32 days → int64 ms (SBDF epoch)
- _export_vt_polars_timespan: int64 ms passthrough (no epoch needed)
- _export_vt_polars_time: int64 ns → int64 ms

Columns with nulls fall back to a fill-zero copy (Arrow's validity bitmap cannot
be expressed inline in a numpy int array), but are still processed by the C loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stub files must not contain a concrete @overload implementation alongside
the overload variants; mypy rejects it with 'An implementation for an
overloaded function is not allowed in a stub file'.  Remove the offending
line, leaving only the two typed overloads.

Also suppress call-overload at the one test site that intentionally passes
an invalid output_format value to exercise the SBDFError path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…port

SBDF null slots may contain sentinel values (e.g. INT64_MAX) which,
after the ms→ns ×1_000_000 scale in _import_vt_time_int64, exceed
Polars' valid Time range [0, 86_400_000_000_000 ns].  Zero them out
before passing the int64 buffer to pl.Series(dtype=pl.Time); the
invalids array then overwrites those slots with None.

Also adds OutputFormat enum, cython-lint-friendly named export
constants, and fixes the sbdf.pyi stub to use TYPE_CHECKING guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
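
A plain-Python sketch of the sentinel fix described above (the range bound comes from the commit; the function name is hypothetical):

```python
NS_PER_DAY = 86_400_000_000_000  # pl.Time valid range is [0, NS_PER_DAY) ns

def sanitize_time_ns(values, invalid):
    """Zero out null slots whose sentinel values fall outside the valid
    pl.Time range; the invalids mask later overwrites them with None."""
    for i, bad in enumerate(invalid):
        if bad and not 0 <= values[i] < NS_PER_DAY:
            values[i] = 0
    return values

INT64_MAX = 2**63 - 1
vals = [3_600_000_000_000, INT64_MAX, 0]   # 01:00:00, sentinel, midnight
assert sanitize_time_ns(vals, [False, True, False]) == [3_600_000_000_000, 0, 0]
```

Valid slots are never touched, so the sentinel rewrite cannot corrupt real data.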
polars is an optional dependency not installed in the CI lint
environment; the TYPE_CHECKING guard in sbdf.pyi is sufficient for
runtime, but mypy still needs the override to suppress
import-not-found on the stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OutputFormat is no longer a str subclass; passing a raw string now
raises SBDFError.  Updated all call sites in tests and README to use
OutputFormat.POLARS / OutputFormat.PANDAS, and tightened the .pyi
overloads to Literal[OutputFormat.*] accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stewjb commented Apr 3, 2026

@vrane-tibco addressing your point on import copy performance — several optimizations have
landed since your review that change the picture for both strings and datetimes.

String import

Your analysis was correct at the time of review: the NPY_OBJECT array, the Python list,
and the Arrow buffer were all live simultaneously — three representations. Two changes
since then:

  1. The NPY_OBJECT array is now explicitly freed (del values) before pl.Series() is
    called, so peak memory is two representations (list + Arrow), not three. This matches
    the Pandas path (NPY_OBJECT array + pd.Series, which share the same str objects).

  2. .tolist() copies object references, not string bytes. The Python str objects
    created in _import_vt_string are the same objects in both the NumPy array and the
    list — no string data is duplicated at that step. So there are two copies of string
    content across the whole pipeline: once into Python str objects, once into the
    Arrow Utf8 buffer. The Pandas path also has two (Python str objects + the NPY_OBJECT
    array header), so the two paths are now equivalent in both peak memory and copy count.
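
The reference-vs-bytes distinction in point 2 is easy to demonstrate with stdlib containers standing in for the NPY_OBJECT array and the .tolist() result:

```python
# Copying a container of strings copies references, not character data:
# both containers end up holding the *same* str objects.
src = ["alpha" * 1000, "beta" * 1000]   # stand-in for the NPY_OBJECT array
dst = list(src)                          # stand-in for .tolist()

assert dst == src                                  # equal values
assert all(a is b for a, b in zip(src, dst))       # identical objects
```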

Datetime import

Your concern was that datetimes would not see a performance gain. The current
implementation avoids Python object boxing entirely:

  • _import_vts_numpy reads raw int64 ms directly from the SBDF C buffer into a NumPy
    int64 array — no Python datetime objects are created.
  • The epoch offset is subtracted in-place on the NumPy array (no allocation).
  • pl.Series(int64).cast(pl.Datetime('ms')) is a zero-copy metadata change — Datetime
    and Int64 share the same Arrow int64 backing store.

The Pandas path (_import_vt_datetime) runs a Python loop boxing each value into a
pd.Timestamp, which is significantly more expensive.

Full import picture

| Type | Copies (Polars path) |
| --- | --- |
| Bool / Int / Float | 1 (file → NumPy; Polars may share buffer) |
| String / Binary | 2 (SBDF bytes → str objects; str objects → Arrow Utf8); peak 2 live |
| Datetime | 1 (file → int64 NumPy; epoch subtract in-place; zero-copy cast to Datetime) |
| Date | 1 (file → int32 days NumPy; zero-copy cast to Date) |
| Duration | 1 (file → int64 NumPy; zero-copy cast to Duration) |
| Time | 1 (file → int64 ns NumPy; zero-copy cast to Time) |

String is now equivalent to the Pandas path. All temporal types are strictly better —
the Pandas path boxes every value into a Python object; the Polars path works entirely
at the int64 level and lets Polars handle the reinterpretation.


stewjb commented Apr 3, 2026

This is getting large for one PR. Here are my thoughts on splitting it up.

PR A — Export

export_data() accepts pl.DataFrame and pl.Series directly. This is the part you
explicitly praised — no new public API surface, no metadata questions, just an overload
on an existing function.

Covers:

  • dtype → SBDF type mapping for all Polars types
  • Zero-copy temporal export (_export_vt_polars_*, _export_polars_setup_arrays)
  • Edge cases: Null dtype, Categorical, UInt64 overflow, tz-aware Datetime, NaN as missing
  • Metadata warning on export
  • All test_write_polars_* tests
  • pyproject.toml (spotfire[polars] extra, test requirements)
  • README (spotfire[polars] installation and export_data() usage)

PR B — Import

import_data() gains output_format=OutputFormat.POLARS returning a native
pl.DataFrame. Keeping this separate means the import discussion doesn't block the
export merge.

Covers:

  • OutputFormat enum and stub changes
  • _import_build_polars_dataframe and all C-level import functions
  • Zero-copy temporal import (datetime epoch shift, date int32, timespan, time ns)
  • Null sentinel fix for pl.Time
  • Metadata warning on import; descriptive TypeError on copy_metadata /
    get_spotfire_types / set_spotfire_types when passed a Polars object
  • All import, roundtrip, and metadata error tests
  • README (import_data() usage with OutputFormat.POLARS)

PR A could land quickly given it's the uncontested part, and PR B can absorb the metadata/import discussion on its own timeline.

stewjb and others added 9 commits April 3, 2026 21:13
All errors pre-dated this PR but were blocking CI on the fork.
Added targeted # type: ignore[...] annotations with the narrowest
applicable error codes rather than broad suppression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix dict-of-lists export bug and O(n²) iterable export loop
* Build nullable integer columns with mask in one shot on import
* Fix pre-existing mypy errors in data_function.py and test_sbdf.py

# Conflicts:
#	spotfire/test/test_sbdf.py
Polars stores strings as Arrow LargeUtf8: a flat UTF-8 bytes buffer plus
an int64 offsets buffer. Previously, export went through
series.to_numpy() (one Python str object per row) and then the C helper
re-encoded each string to UTF-8 via PyObject_Str + str.encode().

This commit adds _export_extract_string_obj_arrow() in sbdf_helpers.c,
which reads the raw UTF-8 bytes and offsets directly -- no Python API
calls in the inner loop. The Cython side obtains raw pointers via
PyArray_DATA() on zero-copy numpy views of the Arrow buffers.

The dispatch path (polars_exporter_id = _POL_EXP_STRING = 5) mirrors
the existing temporal fast paths. Categorical and Enum columns are cast
to Utf8 before the Arrow path is taken. A guard asserts the Arrow type
is large_string (int64 offsets) and raises SBDFError if not.

Benchmarked at 100k rows, string no-nulls (psutil, 7 reps):
  pandas baseline:          58ms
  old polars (via pandas):  71ms
  new polars (Arrow direct): 26ms  (-56% vs pandas, -64% vs old polars)

The remaining time is dominated by sbdf_str_create_len (one malloc +
memcpy per string), which is unavoidable in the current SBDF format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
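
The Arrow LargeUtf8 layout this commit reads — one flat UTF-8 byte buffer plus n+1 int64 offsets — can be illustrated in pure Python (the C helper walks the same buffers without creating str objects; the decode here is only for illustration):

```python
# Arrow LargeUtf8: string i is data[offsets[i]:offsets[i + 1]].
buf = "hé".encode() + b"llo" + b"world"   # flat UTF-8 bytes buffer
offsets = [0, 3, 6, 6, 11]                # int64 offsets, len == n + 1
                                          # (note the empty string at i == 2)

def decode_large_utf8(buf, offsets):
    """Slice each value out of the flat buffer using adjacent offsets."""
    return [buf[offsets[i]:offsets[i + 1]].decode("utf-8")
            for i in range(len(offsets) - 1)]

assert decode_large_utf8(buf, offsets) == ["hé", "llo", "", "world"]
```

Offsets are byte positions, not character counts — "hé" occupies 3 bytes — which is why the C side can do pure pointer arithmetic.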
series.to_arrow() requires pyarrow. CI test environments install
spotfire[polars] without pyarrow, causing ModuleNotFoundError on all
Polars string export tests. Wrap the Arrow fast path in try/except
ImportError so it degrades gracefully to the existing to_numpy() path
when pyarrow is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hars

pylint line-too-long (C0301) flagged lines 98-99 after the type: ignore
annotations were added. Split the assertEqual calls to keep each line
within the 120-character limit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
E302: add second blank line before OutputFormat class and _ExportContext
decorator.  E127: align continuation lines with opening parenthesis in
set_arrow_string, _export_polars_series_to_numpy, _export_vt_polars_string,
and the sbdf_helpers.pxi extern declaration.  E115/E117: fix comment
indentation inside except blocks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rk profiles

Temporal Polars columns with nulls were being cast to float64 (nan for nulls)
instead of int64 before passing to the C exporter, which read the buffer as
long long* and got garbage values.  Fix: call fill_null(0) after the int cast
so to_numpy() always returns the expected integer dtype; the invalids mask
already records which positions are null so the sentinel is never read.

Adds temporal_nulls (datetime/date/duration/time, ~10% nulls) and binary /
binary_nulls profiles to benchmark.py to cover remaining SBDF value types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Perf: export Polars String columns directly from Arrow LargeUtf8 buffers

stewjb commented Apr 4, 2026

Benchmark Results

Measured on Python 3.14.3 / Polars 1.39.3 / Pandas 2.3.3 / NumPy 2.4.3, 7 reps (first excluded as warmup).
Time = mean of 7 reps. Δ RSS = peak increase in resident set size during the call (captures Arrow/Rust/C allocations).
"Old Polars workaround" = .to_pandas() before export / pl.from_pandas() after import.

Export time — 100,000 rows

| Profile | pandas | polars (old: via pandas) | polars (this PR) | Speedup vs old |
| --- | --- | --- | --- | --- |
| Numeric, no nulls | 46 ms | 54 ms | 27 ms | 2.0× |
| Numeric, ~10% nulls | 25 ms | 24 ms | 16 ms | 1.5× |
| String, no nulls | 139 ms | 139 ms | 54 ms | 2.6× |
| String, ~10% nulls | 257 ms | 209 ms | 96 ms | 2.2× |
| Temporal (datetime/date/duration/time), no nulls | 1809 ms | 1758 ms | 35 ms | 50× |
| Temporal (datetime/date/duration/time), ~10% nulls | 1711 ms | 1459 ms | 54 ms | 27× |
| Binary, no nulls | 149 ms | 199 ms | 105 ms | 1.9× |
| Binary, ~10% nulls | 134 ms | 172 ms | 101 ms | 1.7× |

Export memory (Δ RSS) — 100,000 rows

| Profile | pandas | polars (old: via pandas) | polars (this PR) |
| --- | --- | --- | --- |
| Numeric, no nulls | 0 MB | 0 MB | 0 MB |
| Numeric, ~10% nulls | 0 MB | 0 MB | 0 MB |
| String, no nulls | 0.3 MB | 0.9 MB | 1.3 MB |
| String, ~10% nulls | 0.1 MB | 0.9 MB | 0.9 MB |
| Temporal (datetime/date/duration/time), no nulls | 4.0 MB | 4.0 MB | 0 MB |
| Temporal (datetime/date/duration/time), ~10% nulls | 2.0 MB | 2.0 MB | 0.1 MB |
| Binary, no nulls | 10.0 MB | 12.2 MB | 10.6 MB |
| Binary, ~10% nulls | 8.4 MB | 13.1 MB | 10.4 MB |

Import time — 100,000 rows

| Profile | → pandas | → polars (old: via pandas) | → polars (this PR) | Speedup vs old |
| --- | --- | --- | --- | --- |
| Numeric, no nulls | 18 ms | 33 ms | 7 ms | 4.7× |
| Numeric, ~10% nulls | 11 ms | 14 ms | 13 ms | 1.1× |
| String, no nulls | 75 ms | 90 ms | 100 ms | ~even |
| String, ~10% nulls | 65 ms | 93 ms | 70 ms | 1.3× |
| Temporal (datetime/date/duration/time), no nulls | 136 ms | 247 ms | 56 ms | 4.4× |
| Temporal (datetime/date/duration/time), ~10% nulls | 145 ms | 245 ms | 54 ms | 4.5× |
| Binary, no nulls | 91 ms | 111 ms | 84 ms | 1.3× |
| Binary, ~10% nulls | 76 ms | 102 ms | 80 ms | 1.3× |

Import memory (Δ RSS) — 100,000 rows

| Profile | → pandas | → polars (old: via pandas) | → polars (this PR) |
| --- | --- | --- | --- |
| Numeric, no nulls | 0.2 MB | 0 MB | 0 MB |
| Numeric, ~10% nulls | 0 MB | 0 MB | 0.4 MB |
| String, no nulls | 0.8 MB | 1.6 MB | 3.0 MB |
| String, ~10% nulls | 3.9 MB | 1.5 MB | 0.8 MB |
| Temporal (datetime/date/duration/time), no nulls | 1.5 MB | 1.5 MB | 5.4 MB |
| Temporal (datetime/date/duration/time), ~10% nulls | 1.6 MB | 5.7 MB | 5.5 MB |
| Binary, no nulls | 9.8 MB | 10.6 MB | 0 MB |
| Binary, ~10% nulls | 0.1 MB | 10.4 MB | 10.6 MB |

Notes

  • Temporal export shows the largest gain (~27–50×): the old path boxes every value into a Python datetime object when converting through pandas; the new path casts directly to int64 in C with no Python allocation. The 4 MB RSS spike in the old paths disappears entirely.
  • String export (~2–2.6×): the new path reads raw UTF-8 bytes directly from Polars' Arrow LargeUtf8 buffer in C, bypassing str object creation and re-encoding. The small RSS increase reflects the SBDF output buffer, not intermediate Python objects.
  • String import is roughly flat on time — building an Arrow buffer directly is comparable work to converting through pandas.
  • Binary (~1.7–1.9× export, ~1.3× import): benefits from skipping the Polars → pandas conversion chain; the memory profile is similar across paths since the raw byte data dominates.
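
For reproducing numbers of this shape locally, a minimal stdlib harness follows the same protocol (mean over reps, first excluded as warmup). Note that the actual benchmarks above used psutil RSS; tracemalloc here only sees Python-heap allocations, so it will miss Arrow/Rust/C buffers:

```python
import time
import tracemalloc

def bench(fn, reps=7):
    """Time fn over reps runs (first excluded as warmup); report the mean
    wall time and the peak Python-heap allocation of the last run."""
    times = []
    peak = 0
    for _ in range(reps):
        tracemalloc.start()
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    mean_s = sum(times[1:]) / len(times[1:])
    return mean_s, peak

# Example workload: an allocation-heavy list build.
mean_s, peak = bench(lambda: [i * i for i in range(10_000)])
```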

stewjb and others added 8 commits April 3, 2026 22:25
benchmark.py is a local development tool and should not be committed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
np.where(invalids)[0] returns an ndarray; pl.Series.scatter() accepts it
directly. The .tolist() conversion was allocating an unnecessary Python list
on every null-containing column import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…boundary safety

Exports 100_001 rows of a Polars String column, forcing a second SBDF row
slice (start=100_000, count=1), and asserts the value at the chunk boundary
is correct.  Covers the raw C pointer arithmetic in _export_extract_string_obj_arrow
which is not bounds-checked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lars/pyarrow

polars is an optional dependency; pyarrow only arrives transitively through it.
Adding test_requirements_no_polars.txt causes build.yaml's test-environment matrix
to automatically pick up a second CI slot that runs the full test suite with neither
library installed.  SbdfPolarsTest is skipped via @unittest.skipIf(pl is None, ...);
all Pandas tests must pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add two tests to SbdfPolarsTest verifying that the Polars and Pandas
import/export code paths produce identical data for all 11 non-Decimal
SBDF value types with one null per column (rotating positions 0–4):

- test_all_dtypes_export_polars_vs_pandas_path: exports the same data via
  the native Polars path and the Pandas path, imports both back as Pandas,
  and asserts frame equality.

- test_all_dtypes_import_polars_vs_pandas_path: imports a single SBDF file
  as both a Polars and a Pandas DataFrame, then compares null positions and
  non-null values column by column.

Helpers:
- _all_dtypes_polars_df(): canonical Polars source with all SBDF-compatible types.
- _all_dtypes_pandas_df(): equivalent Pandas source (avoids pyarrow dependency).
- _assert_import_paths_equivalent(): per-column null + value comparison using
  Series.to_list(), which works without pyarrow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or pylint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erload

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ong (131/120)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Allow use of polars instead of pandas

3 participants