
Add native Polars DataFrame support #99

Open
stewjb wants to merge 41 commits into spotfiresoftware:main from stewjb:main

Conversation


@stewjb stewjb commented Mar 24, 2026

Closes #98

Summary

  • Export: export_data() now accepts polars.DataFrame and polars.Series directly, mapping Polars dtypes to SBDF types without any Pandas intermediary. Supported types: Boolean, Int8/16/32, Int64, Float32/64, Utf8/String, Date, Datetime, Duration, Time, Binary, Decimal, Categorical.
  • Import: import_data() gains an output_format parameter (default "pandas" for backwards compatibility). When output_format="polars", a polars.DataFrame is built directly from the raw numpy arrays — no Pandas DataFrame is created at any point.
  • Dependency: Polars is added as an optional dependency (spotfire[polars]), following the same pattern as spotfire[geo] and spotfire[plot].
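
The dtype-to-SBDF mapping described above can be sketched conceptually like this (the type-ID constants and helper name are illustrative stand-ins, not the actual sbdf.pyx internals):

```python
# Illustrative only: the real mapping and type-ID constants live in
# spotfire/sbdf.pyx; these values are placeholders for the sketch.
SBDF_BOOLTYPEID, SBDF_INTTYPEID, SBDF_LONGTYPEID = 1, 2, 3
SBDF_DOUBLETYPEID, SBDF_STRINGTYPEID, SBDF_DATETYPEID = 4, 5, 6

_POLARS_TO_SBDF = {
    "Boolean": SBDF_BOOLTYPEID,
    "Int32": SBDF_INTTYPEID,
    "Int64": SBDF_LONGTYPEID,
    "Float64": SBDF_DOUBLETYPEID,
    "String": SBDF_STRINGTYPEID,
    "Date": SBDF_DATETYPEID,
}

def infer_sbdf_type(dtype_name: str, column: str) -> int:
    """Return the SBDF type id for a Polars dtype name, or raise."""
    try:
        return _POLARS_TO_SBDF[dtype_name]
    except KeyError:
        raise ValueError(f"unknown dtype for {column}: {dtype_name}") from None
```

Unknown dtypes fail loudly rather than falling through, matching the SBDFError behavior described later in the thread.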

Performance benefit

The previous workaround required polars_df.to_pandas() before export, which doubles peak memory usage and adds 2–5 seconds of conversion time at 10M rows. The native path eliminates this entirely for export.

Spotfire data function context

When running inside a Spotfire data function, SBDF import and export happen automatically via data_function.py — users never call import_data or export_data directly. This has two implications:

Export (output variables): Full benefit. A user can build a polars.DataFrame in their script and return it as an output variable — export_data() handles it natively with no conversion.

Import (input variables): No benefit from import_data(output_format="polars"). Input data is always loaded by the framework via sbdf.import_data(self._file) (no output_format argument), so input variables always arrive in the script as pd.DataFrame. Users who want Polars for processing would still need to call pl.from_pandas(input_df) themselves. Fixing this properly would require changes to data_function.py and a mechanism for users to declare their preference — out of scope for this PR.

In short: the output_format parameter on import_data is primarily useful outside the Spotfire data function context (e.g. standalone scripts using the spotfire package directly). Inside a data function, only the export side benefits.

Test plan

  • test_write_polars_basic — export DataFrame with common types, re-import as Pandas and verify data
  • test_write_polars_nulls — null values are preserved through the roundtrip
  • test_write_polars_series — Polars Series export works
  • test_import_as_polars — import with output_format="polars" returns a native polars.DataFrame
  • test_polars_roundtrip — full Polars → SBDF → Polars roundtrip
  • All 72 existing tests continue to pass (1 pre-existing skip unrelated to this change)
  • pylint 10.00/10, mypy clean, cython-lint clean

🤖 Generated with Claude Code

stewjb and others added 9 commits March 23, 2026 20:06
…tetime, scatter compat

- Fix Categorical/Enum dtype: was incorrectly trying to recurse into
  dtype.categories (which doesn't exist on the dtype object); now casts
  series to Utf8 and maps to SBDF_STRINGTYPEID directly
- Add Enum dtype support (previously raised SBDFError)
- Warn on UInt64 export: values above Int64 max will overflow silently
- Warn on timezone-aware Datetime export: tz info is not preserved in SBDF
- Warn on Decimal export: marked experimental, precision may be lost
- Fix scatter() compatibility: add AttributeError fallback to set_at_idx()
  for older Polars versions within the supported range
- Add tests for all of the above

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add polars to test_requirements_default.txt so SbdfPolarsTest is
  actually executed in CI (previously skipped due to missing import)
- Add spotfire[polars] row to extras table in README
- Add usage note explaining Spotfire's bundled Python lacks Polars and
  that SPKs bundling Polars will be ~44 MB larger than typical packages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Raise SBDFError for unknown output_format values (previously fell
  through silently to Pandas)
- Emit SBDFWarning when Categorical/Enum columns are exported as String,
  consistent with existing UInt64 and timezone warnings
- Add test_invalid_output_format: verifies bad output_format raises
- Add test_write_polars_empty: verifies empty DataFrame exports cleanly
- Add test_write_polars_series_nulls: verifies null preservation in Series
- Add test_polars_categorical_warns: verifies Categorical warning fires

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A Polars Series of [None, None, None] has dtype pl.Null (no type can
be inferred). Previously this raised SBDFError with "unknown dtype".
Now it exports as an all-invalid String column, consistent with how
all-None Pandas columns are handled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CI static analysis runs mypy without polars installed; add
type: ignore[import-not-found] so mypy skips the missing stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explain non-obvious choices that would otherwise prompt review questions:
- Why dtype.__class__.__name__ instead of isinstance()
- Why scatter()/set_at_idx() try/except exists and which versions it covers
- Why is_object_numpy_type() cpdef wrapper is needed for a cdef attribute
- Why the output_format polars path short-circuits before pd.concat
- Why the Null dtype path returns a placeholder array

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…olars versions (>= 0.20)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Adds native Polars support to Spotfire’s SBDF import/export layer to avoid Pandas conversions (improving memory usage and performance for large datasets), and wires it up as an optional extra.

Changes:

  • Add polars as an optional dependency (spotfire[polars]) and enable it in dev/test setups.
  • Extend sbdf.export_data() to accept polars.DataFrame / polars.Series directly, with dtype→SBDF mapping.
  • Extend sbdf.import_data() with output_format to optionally construct a native polars.DataFrame without creating a Pandas DataFrame.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| spotfire/sbdf.pyx | Implements Polars import/export paths and dtype mappings; adds output_format to import_data(). |
| spotfire/sbdf.pyi | Updates type stub for import_data() to include output_format. |
| spotfire/test/test_sbdf.py | Adds Polars-focused unit tests for export/import/roundtrip + warnings. |
| pyproject.toml | Adds polars extra and includes it in dev extra. |
| test_requirements_default.txt | Installs Polars for test runs. |
| README.md | Documents spotfire[polars] and the new import/export behavior. |
| .gitignore | Ignores .venv, uv.lock, and .claude. |


Comment on lines +1238 to +1241
```python
context.set_valuetype_id(_export_infer_valuetype_from_polars_dtype(series.dtype, f"column '{col}'"))
invalids = series.is_null().to_numpy()
context.set_arrays(_export_polars_series_to_numpy(context, series), invalids)
column_metadata.append({})
```

Copilot AI Mar 24, 2026


In the Polars export path, invalids are derived from series.is_null(), which does not mark floating-point NaN values as invalid. In the existing Pandas path pd.isnull() treats NaN as missing, so exporting a Polars float column containing NaN will write NaNs as real values instead of SBDF invalids (behavior mismatch vs Pandas and likely incorrect for Spotfire missing-values semantics). Consider treating NaN as invalid for Float32/Float64 columns (e.g., combine is_null() with is_nan() when applicable).
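
The semantics Copilot is asking for can be sketched in plain Python (a stand-in for combining Polars' is_null() and is_nan(); the helper name is made up):

```python
import math

def invalid_mask(values, is_float):
    """Mark both None and NaN as invalid for float columns, mirroring
    pd.isnull() semantics; other columns treat only None as invalid."""
    if is_float:
        return [v is None or (isinstance(v, float) and math.isnan(v))
                for v in values]
    return [v is None for v in values]

# NaN and None are both "missing" for floats under Pandas-style semantics.
float_col = [1.0, float("nan"), None, 2.5]
assert invalid_mask(float_col, is_float=True) == [False, True, True, False]
```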

Author


Added series.is_null() combined with series.is_nan() for float columns so NaN is treated as missing, like the Pandas path.

```python
if na_value is not None:
    return np.asarray(series.fill_null(na_value).to_numpy(allow_copy=True),
                      dtype=context.get_numpy_dtype())
else:
```

Copilot AI Mar 24, 2026


_export_polars_series_to_numpy converts to an object ndarray when na_value is None. For Polars Datetime / Duration series, to_numpy() already produces datetime64 / timedelta64 arrays that the existing SBDF exporters can handle, so forcing dtype=object will box scalars and create an unnecessary copy (hurting the performance goal of this PR). Consider special-casing datetime/timespan to keep the native NumPy dtype (ideally normalized to the SBDF-supported resolution) instead of casting to object.

Suggested change

```diff
-else:
+else:
+    # For Datetime/Duration, keep native NumPy datetime64/timedelta64 dtypes instead of boxing to object.
+    if dtype_name in ("Datetime", "Duration"):
+        return series.to_numpy(allow_copy=True)
```

Author


Datetime and Duration now go to NumPy early, keeping their native dtypes.

- Move output_format validation to top of import_data() for fail-fast
  behaviour before the file is opened
- Raise SBDFError in _import_polars_dtype fallback instead of silently
  returning Utf8 for unknown SBDF type IDs
- Treat NaN as invalid (missing) for Float32/Float64 columns, matching
  Pandas pd.isnull() behaviour; add test_write_polars_float_nan
- Keep native datetime64/timedelta64 arrays for Datetime/Duration columns
  instead of boxing to object dtype (avoids unnecessary copy)
- Add @overload signatures to sbdf.pyi so callers get pd.DataFrame for
  the default output_format="pandas" and Any for output_format="polars"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stewjb commented Mar 24, 2026

@vrane-tibco @bbassett-tibco @mpanke-tibco thanks for considering this PR. Let me know if you all have thoughts.

stewjb and others added 2 commits March 24, 2026 19:47
_export_obj_dict_of_lists (line 1313): np.array(n) where n is an integer
creates a 0-dimensional array, not a 1-D array of length n. Every
export_data({"col": [...]}) call would raise IndexError. Fixed to
np.empty(shape, ...).

_export_obj_iterable (lines 1358-1366): np.append inside a for loop
reallocates the entire array on every iteration — O(n²) for a column
of n rows. Replaced with list accumulation and a single np.array()
call at the end.

Add test_export_dict_of_lists and test_export_list to cover both paths
(previously untested, which is why the bug went undetected).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
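
The list-accumulation fix described above is the standard remedy for quadratic append loops; sketched here with the stdlib array module standing in for NumPy:

```python
from array import array

def collect_quadratic(rows):
    # Anti-pattern analogous to np.append in a loop: each concatenation
    # copies the whole buffer, so n rows cost O(n**2) element copies.
    out = array("q")
    for r in rows:
        out = out + array("q", [r])
    return out

def collect_linear(rows):
    # Fix: accumulate in a Python list, materialize once at the end.
    acc = []
    for r in rows:
        acc.append(r)
    return array("q", acc)

# Both produce identical results; only the cost differs.
assert list(collect_quadratic(range(5))) == list(collect_linear(range(5)))
```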
For Int32/Int64 columns, the previous code constructed a pd.Series and
then assigned nulls via .loc[mask] = None in a second pass, which
triggers Pandas dtype coercion overhead internally.

Replace with pd.arrays.IntegerArray(values, mask) which constructs the
nullable integer array with the validity mask in a single operation,
avoiding the second pass entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vrane-tibco
Collaborator

@stewjb Appreciate the work here — the dtype mapping is thorough and the export mechanics are solid. But I have two concerns, one architectural and one about the performance framing.

On metadata -

This package isn't just a data serializer - it's specifically the bridge that carries Spotfire's interpretation of that data back and forth. The three things Spotfire cares about beyond raw values are spotfire_table_metadata, spotfire_column_metadata, and spotfire_type. All three are attached to Pandas objects, and the entire surface API (copy_metadata, get_spotfire_types, set_spotfire_types, set_geocoding_table, all exported from __init__.py) is built on top of those Pandas extension points.
Polars DataFrames don't support any of this. So right now:

  • On import: in sbdf.pyx, _import_build_polars_dataframe (line 875) only receives column_names and importer_contexts. The table_metadata and column_metadata that were correctly parsed from the SBDF file a few lines above simply never make it into the returned DataFrame; they are silently dropped.

  • On export: in sbdf.pyx, _export_obj_polars_dataframe hardcodes {} for both table and column metadata (lines 1248, 1251). The Pandas path reads these from the object; the Polars path has nowhere to read them from.

  • In the public API: copy_metadata(), get_spotfire_types(), and set_spotfire_types() all raise TypeError on non-Pandas objects (lines 32–35, 68–69, 86–87). So a user who gets a pl.DataFrame back from import_data(output_format="polars") and immediately tries to use any of those three functions hits a confusing error.

On performance -

The export side is genuinely better than the to_pandas() workaround.

The import side, though, is a different story. There is a genuine performance gain for numeric columns (Boolean, Integer, LongInteger, Float, Real): the Polars path does pl.Series(values=numpy_array), and Polars can reference the NumPy buffer directly for these types. That is genuine zero-copy. But for String or Datetime columns I doubt there will be a performance gain.

Take strings as a concrete example. In sbdf.pyx, _import_vt_string (lines 534–536) runs a Python loop that creates a Python str object for every row and stores them in a NPY_OBJECT array — that's unavoidable because SBDF stores strings as C char pointers, not Arrow buffers. Then _import_build_polars_dataframe calls .tolist() on that object array (line 726) to produce a Python list, because Polars can't consume a NPY_OBJECT array directly. Then pl.Series(values=list) re-encodes all those strings into Polars' Arrow buffer. So at peak memory you have the NPY_OBJECT array, the Python list, and the Arrow buffer all alive simultaneously - three representations of the same data. The Pandas import path just wraps the NPY_OBJECT array in a Series header and stops there: two representations, same str objects shared by reference.

So for import: numeric columns genuinely benefit, strings and datetimes are actually worse than the Pandas path in both memory and time.


stewjb commented Apr 1, 2026

@vrane-tibco this is good feedback. Thank you for taking the time to consider this. If you all feel the metadata is non-negotiable for this package, I understand that. We can fork this package if we deem this important. The biggest benefit is avoiding the doubling of peak memory when using Polars, but we can increase our memory to work around it.

The metadata loss is a limitation of polars, which you've mentioned.

Polars has an outstanding issue to address this (pola-rs/polars#5117), but it's been open since 2022 which isn't promising.

My suggestion would be to log a warning to the user on the limitation of Polars on import/export. That would handle your point about dropping the metadata silently. I don't think there is a clean way to do that hand-off natively into Polars. The alternative I can see is to return a list with the dataframe and metadata unpacked. That seems clunky and confusing. A possible compromise is another function strictly for Polars that returns the metadata unpacked.
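
The "separate function" compromise could look something like this (an entirely hypothetical API sketch; plain data stands in for a real pl.DataFrame):

```python
from typing import Any, NamedTuple

class PolarsImportResult(NamedTuple):
    """Hypothetical return shape for a Polars-specific import function:
    the DataFrame plus the metadata that Polars itself cannot carry."""
    data: Any                 # would be a polars.DataFrame
    table_metadata: dict
    column_metadata: dict

def import_data_polars_sketch(raw):
    # Stand-in for parsing an SBDF file; returns data and metadata unpacked
    # instead of silently dropping the metadata.
    return PolarsImportResult(
        data=raw["columns"],
        table_metadata=raw.get("table_meta", {}),
        column_metadata=raw.get("column_meta", {}),
    )

res = import_data_polars_sketch(
    {"columns": {"x": [1, 2]}, "table_meta": {"k": "v"}, "column_meta": {}})
assert res.table_metadata == {"k": "v"}
```

A named tuple keeps the common case ergonomic (`res.data`) while making the metadata impossible to lose accidentally.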

Your comment on the confusing error makes sense and I can address it.

On performance, the export benefit is the main driver for me of this PR.

As of now, I don't know how the user would specify within Spotfire that they want a Polars dataframe returned instead of Pandas. Personally, I just use df = spotfire_data_table within a data function, and I believe the only way to specify a Polars output would be to hack into the internals of this package.

One proposal would be to drop the import functionality totally given you can't reach it as intended currently. I included it because it helps for testing and if polars adoption becomes more widespread in the future.

I think we can address the concerns on datetimes and strings. I definitely think we can get it down to only 2 copies of the same data, which is still a net improvement over having to use .to_pandas(), which is the current default and holds 2 copies of the whole df in that process.

stewjb and others added 10 commits April 3, 2026 12:00
…etime import

- Emit SBDFWarning on both Polars import and export paths pointing to
  pola-rs/polars#5117 so metadata loss is never silent.
- Raise TypeError with a Polars-specific message from copy_metadata(),
  get_spotfire_types(), and set_spotfire_types() instead of a generic error.
- For the Polars import path, bypass the Python-boxing importers for
  DateTime/Date/TimeSpan: store raw int64 ms values via _import_vts_numpy,
  then in _import_build_polars_dataframe subtract the SBDF-to-Unix epoch
  offset in-place and reinterpret via .view() — reducing peak memory from
  3 live copies to 1-2 (down from creating Python datetime objects).
- String/Time/Binary/Decimal import: release the concatenated numpy array
  before building the Polars Arrow buffer (del + clear_values_arrays()) to
  cap peak at 2 live copies instead of 3.
- Add get_value_type_id() and clear_values_arrays() cpdef helpers on
  _ImportContext to support the above without Cython-level casts.
- Add 6 new tests covering the metadata warning and descriptive error paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
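
The epoch arithmetic in the commit above rests on the fixed offset between the SBDF epoch (0001-01-01) and the Unix epoch, which stdlib datetime can derive (the function name here is a sketch, not the Cython implementation):

```python
from datetime import datetime

# Days between the SBDF epoch (0001-01-01) and the Unix epoch (1970-01-01).
EPOCH_OFFSET_DAYS = (datetime(1970, 1, 1) - datetime(1, 1, 1)).days
EPOCH_OFFSET_MS = EPOCH_OFFSET_DAYS * 86_400_000

assert EPOCH_OFFSET_DAYS == 719_162

def sbdf_ms_to_unix_ms(values):
    """Shift SBDF millisecond timestamps to Unix milliseconds in place,
    mirroring the in-place subtraction on the int64 NumPy array."""
    for i, v in enumerate(values):
        values[i] = v - EPOCH_OFFSET_MS
    return values

# The SBDF representation of 1970-01-01 00:00:00 maps to Unix ms == 0.
assert sbdf_ms_to_unix_ms([EPOCH_OFFSET_MS]) == [0]
```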
Verifies that the in-place epoch-shift + .view('datetime64[D]') path in
_import_build_polars_dataframe produces identical results to the reference
np.astype('datetime64[D]') conversion across six dates: the SBDF epoch
(0001-01-01), one day before and the day of the Unix epoch, one day after,
a recent date, and the maximum representable date (9999-12-31).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous view('datetime64[ms]') approach always triggered a copy inside
Polars: _normalise_numpy_dtype() unconditionally calls .astype(np.int64) on
any datetime64 input before passing to the Rust constructor.

Verified via mutation test (numpy array modified after Series construction):
- Datetime: pl.Series(int64, Int64).cast(Datetime('ms')) — zero-copy; Int64
  and Datetime('ms') share the same int64 Arrow buffer (metadata-only cast).
- Duration: pl.Series(int64, Int64).cast(Duration('ms')) — same, zero-copy.
- Date: pl.Date is int32 internally, so int64→int32 narrowing is unavoidable
  (1 copy via .astype(np.int32)); pl.Series(int32, Date) is then zero-copy.
  Total: 2 copies from C data (down from 3 in the original NPY_OBJECT path).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_import_vt_date_int32 writes directly into an NPY_INT32 slice array at the
C level, so pl.Series(int32, pl.Date) in _import_build_polars_dataframe is
then zero-copy — eliminating the prior int64→int32 astype() narrowing copy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
For null-free numeric columns, skip fill_null and use to_numpy(allow_copy=False)
to return a direct view of the Arrow buffer.

For Datetime/Date/Duration/Time, extract raw integer buffers from the Polars Series
(zero-copy when null-free) and route through four new Polars-specific C-level
exporter functions that perform epoch/unit conversion in a tight C loop, completely
bypassing the Python-object-boxing loop in the generic exporters:
- _export_vt_polars_datetime: int64 ms (Unix) → add SBDF epoch offset
- _export_vt_polars_date: int32 days → int64 ms (SBDF epoch)
- _export_vt_polars_timespan: int64 ms passthrough (no epoch needed)
- _export_vt_polars_time: int64 ns → int64 ms

Columns with nulls fall back to a fill-zero copy (Arrow's validity bitmap cannot
be expressed inline in a numpy int array), but are still processed by the C loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stub files must not contain a concrete @overload implementation alongside
the overload variants; mypy rejects it with 'An implementation for an
overloaded function is not allowed in a stub file'.  Remove the offending
line, leaving only the two typed overloads.

Also suppress call-overload at the one test site that intentionally passes
an invalid output_format value to exercise the SBDFError path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…port

SBDF null slots may contain sentinel values (e.g. INT64_MAX) which,
after the ms→ns ×1_000_000 scale in _import_vt_time_int64, exceed
Polars' valid Time range [0, 86_400_000_000_000 ns].  Zero them out
before passing the int64 buffer to pl.Series(dtype=pl.Time); the
invalids array then overwrites those slots with None.

Also adds OutputFormat enum, cython-lint-friendly named export
constants, and fixes the sbdf.pyi stub to use TYPE_CHECKING guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
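
A plain-Python sketch of the sentinel fix described above (the range bound comes from the commit; the function name is hypothetical):

```python
NS_PER_DAY = 86_400_000_000_000  # pl.Time valid range is [0, NS_PER_DAY) ns

def sanitize_time_ns(values, invalid):
    """Zero out null slots whose sentinel values fall outside the valid
    pl.Time range; the invalids mask later overwrites them with None."""
    for i, bad in enumerate(invalid):
        if bad and not 0 <= values[i] < NS_PER_DAY:
            values[i] = 0
    return values

INT64_MAX = 2**63 - 1
vals = [3_600_000_000_000, INT64_MAX, 0]   # 01:00:00, sentinel, midnight
assert sanitize_time_ns(vals, [False, True, False]) == [3_600_000_000_000, 0, 0]
```

Valid slots are never touched, so the sentinel rewrite cannot corrupt real data.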
polars is an optional dependency not installed in the CI lint
environment; the TYPE_CHECKING guard in sbdf.pyi is sufficient for
runtime, but mypy still needs the override to suppress
import-not-found on the stub.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OutputFormat is no longer a str subclass; passing a raw string now
raises SBDFError.  Updated all call sites in tests and README to use
OutputFormat.POLARS / OutputFormat.PANDAS, and tightened the .pyi
overloads to Literal[OutputFormat.*] accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stewjb commented Apr 3, 2026

@vrane-tibco addressing your point on import copy performance — several optimizations have
landed since your review that change the picture for both strings and datetimes.

String import

Your analysis was correct at the time of review: the NPY_OBJECT array, the Python list,
and the Arrow buffer were all live simultaneously — three representations. Two changes
since then:

  1. The NPY_OBJECT array is now explicitly freed (del values) before pl.Series() is
    called, so peak memory is two representations (list + Arrow), not three. This matches
    the Pandas path (NPY_OBJECT array + pd.Series, which share the same str objects).

  2. .tolist() copies object references, not string bytes. The Python str objects
    created in _import_vt_string are the same objects in both the NumPy array and the
    list — no string data is duplicated at that step. So there are two copies of string
    content across the whole pipeline: once into Python str objects, once into the
    Arrow Utf8 buffer. The Pandas path also has two (Python str objects + the NPY_OBJECT
    array header), so the two paths are now equivalent in both peak memory and copy count.
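
The reference-vs-bytes distinction in point 2 is easy to demonstrate with stdlib containers standing in for the NPY_OBJECT array and the .tolist() result:

```python
# Copying a container of strings copies references, not character data:
# both containers end up holding the *same* str objects.
src = ["alpha" * 1000, "beta" * 1000]   # stand-in for the NPY_OBJECT array
dst = list(src)                          # stand-in for .tolist()

assert dst == src                                  # equal values
assert all(a is b for a, b in zip(src, dst))       # identical objects
```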

Datetime import

Your concern was that datetimes would not see a performance gain. The current
implementation avoids Python object boxing entirely:

  • _import_vts_numpy reads raw int64 ms directly from the SBDF C buffer into a NumPy
    int64 array — no Python datetime objects are created.
  • The epoch offset is subtracted in-place on the NumPy array (no allocation).
  • pl.Series(int64).cast(pl.Datetime('ms')) is a zero-copy metadata change — Datetime
    and Int64 share the same Arrow int64 backing store.

The Pandas path (_import_vt_datetime) runs a Python loop boxing each value into a
pd.Timestamp, which is significantly more expensive.

Full import picture

| Type | Copies (Polars path) |
| --- | --- |
| Bool / Int / Float | 1 (file → NumPy; Polars may share buffer) |
| String / Binary | 2 (SBDF bytes → str objects; str objects → Arrow Utf8); peak 2 live |
| Datetime | 1 (file → int64 NumPy; epoch subtract in-place; zero-copy cast to Datetime) |
| Date | 1 (file → int32 days NumPy; zero-copy cast to Date) |
| Duration | 1 (file → int64 NumPy; zero-copy cast to Duration) |
| Time | 1 (file → int64 ns NumPy; zero-copy cast to Time) |

String is now equivalent to the Pandas path. All temporal types are strictly better —
the Pandas path boxes every value into a Python object; the Polars path works entirely
at the int64 level and lets Polars handle the reinterpretation.


stewjb commented Apr 3, 2026

This is getting large for one PR. Here are my thoughts on splitting it up.

PR A — Export

export_data() accepts pl.DataFrame and pl.Series directly. This is the part you
explicitly praised — no new public API surface, no metadata questions, just an overload
on an existing function.

Covers:

  • dtype → SBDF type mapping for all Polars types
  • Zero-copy temporal export (_export_vt_polars_*, _export_polars_setup_arrays)
  • Edge cases: Null dtype, Categorical, UInt64 overflow, tz-aware Datetime, NaN as missing
  • Metadata warning on export
  • All test_write_polars_* tests
  • pyproject.toml (spotfire[polars] extra, test requirements)
  • README (spotfire[polars] installation and export_data() usage)

PR B — Import

import_data() gains output_format=OutputFormat.POLARS returning a native
pl.DataFrame. Keeping this separate means the import discussion doesn't block the
export merge.

Covers:

  • OutputFormat enum and stub changes
  • _import_build_polars_dataframe and all C-level import functions
  • Zero-copy temporal import (datetime epoch shift, date int32, timespan, time ns)
  • Null sentinel fix for pl.Time
  • Metadata warning on import; descriptive TypeError on copy_metadata /
    get_spotfire_types / set_spotfire_types when passed a Polars object
  • All import, roundtrip, and metadata error tests
  • README (import_data() usage with OutputFormat.POLARS)

PR A could land quickly given it's the uncontested part, and PR B can absorb the metadata/import discussion on its own timeline.

stewjb and others added 9 commits April 3, 2026 21:13
All errors pre-dated this PR but were blocking CI on the fork.
Added targeted # type: ignore[...] annotations with the narrowest
applicable error codes rather than broad suppression.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix dict-of-lists export bug and O(n²) iterable export loop
* Build nullable integer columns with mask in one shot on import
* Fix pre-existing mypy errors in data_function.py and test_sbdf.py

# Conflicts:
#	spotfire/test/test_sbdf.py
Polars stores strings as Arrow LargeUtf8: a flat UTF-8 bytes buffer plus
an int64 offsets buffer. Previously, export went through
series.to_numpy() (one Python str object per row) and then the C helper
re-encoded each string to UTF-8 via PyObject_Str + str.encode().

This commit adds _export_extract_string_obj_arrow() in sbdf_helpers.c,
which reads the raw UTF-8 bytes and offsets directly -- no Python API
calls in the inner loop. The Cython side obtains raw pointers via
PyArray_DATA() on zero-copy numpy views of the Arrow buffers.

The dispatch path (polars_exporter_id = _POL_EXP_STRING = 5) mirrors
the existing temporal fast paths. Categorical and Enum columns are cast
to Utf8 before the Arrow path is taken. A guard asserts the Arrow type
is large_string (int64 offsets) and raises SBDFError if not.

Benchmarked at 100k rows, string no-nulls (psutil, 7 reps):
  pandas baseline:          58ms
  old polars (via pandas):  71ms
  new polars (Arrow direct): 26ms  (-56% vs pandas, -64% vs old polars)

The remaining time is dominated by sbdf_str_create_len (one malloc +
memcpy per string), which is unavoidable in the current SBDF format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
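
The Arrow LargeUtf8 layout this commit reads — one flat UTF-8 byte buffer plus n+1 int64 offsets — can be illustrated in pure Python (the C helper walks the same buffers without creating str objects; the decode here is only for illustration):

```python
# Arrow LargeUtf8: string i is data[offsets[i]:offsets[i + 1]].
buf = "hé".encode() + b"llo" + b"world"   # flat UTF-8 bytes buffer
offsets = [0, 3, 6, 6, 11]                # int64 offsets, len == n + 1
                                          # (note the empty string at i == 2)

def decode_large_utf8(buf, offsets):
    """Slice each value out of the flat buffer using adjacent offsets."""
    return [buf[offsets[i]:offsets[i + 1]].decode("utf-8")
            for i in range(len(offsets) - 1)]

assert decode_large_utf8(buf, offsets) == ["hé", "llo", "", "world"]
```

Offsets are byte positions, not character counts — "hé" occupies 3 bytes — which is why the C side can do pure pointer arithmetic.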
series.to_arrow() requires pyarrow. CI test environments install
spotfire[polars] without pyarrow, causing ModuleNotFoundError on all
Polars string export tests. Wrap the Arrow fast path in try/except
ImportError so it degrades gracefully to the existing to_numpy() path
when pyarrow is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hars

pylint line-too-long (C0301) flagged lines 98-99 after the type: ignore
annotations were added. Split the assertEqual calls to keep each line
within the 120-character limit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
E302: add second blank line before OutputFormat class and _ExportContext
decorator.  E127: align continuation lines with opening parenthesis in
set_arrow_string, _export_polars_series_to_numpy, _export_vt_polars_string,
and the sbdf_helpers.pxi extern declaration.  E115/E117: fix comment
indentation inside except blocks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rk profiles

Temporal Polars columns with nulls were being cast to float64 (nan for nulls)
instead of int64 before passing to the C exporter, which read the buffer as
long long* and got garbage values.  Fix: call fill_null(0) after the int cast
so to_numpy() always returns the expected integer dtype; the invalids mask
already records which positions are null so the sentinel is never read.

Adds temporal_nulls (datetime/date/duration/time, ~10% nulls) and binary /
binary_nulls profiles to benchmark.py to cover remaining SBDF value types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Perf: export Polars String columns directly from Arrow LargeUtf8 buffers

stewjb commented Apr 4, 2026

Benchmark Results

Measured on Python 3.14.3 / Polars 1.39.3 / Pandas 2.3.3 / NumPy 2.4.3, 7 reps (first excluded as warmup).
Time = mean of 7 reps. Δ RSS = peak increase in resident set size during the call (captures Arrow/Rust/C allocations).
"Old Polars workaround" = .to_pandas() before export / pl.from_pandas() after import.

Export time — 100,000 rows

| Profile | pandas | polars (old: via pandas) | polars (this PR) | Speedup vs old |
| --- | --- | --- | --- | --- |
| Numeric, no nulls | 46 ms | 54 ms | 27 ms | 2.0× |
| Numeric, ~10% nulls | 25 ms | 24 ms | 16 ms | 1.5× |
| String, no nulls | 139 ms | 139 ms | 54 ms | 2.6× |
| String, ~10% nulls | 257 ms | 209 ms | 96 ms | 2.2× |
| Temporal (datetime/date/duration/time), no nulls | 1809 ms | 1758 ms | 35 ms | 50× |
| Temporal (datetime/date/duration/time), ~10% nulls | 1711 ms | 1459 ms | 54 ms | 27× |
| Binary, no nulls | 149 ms | 199 ms | 105 ms | 1.9× |
| Binary, ~10% nulls | 134 ms | 172 ms | 101 ms | 1.7× |

Export memory (Δ RSS) — 100,000 rows

| Profile | pandas | polars (old: via pandas) | polars (this PR) |
| --- | --- | --- | --- |
| Numeric, no nulls | 0 MB | 0 MB | 0 MB |
| Numeric, ~10% nulls | 0 MB | 0 MB | 0 MB |
| String, no nulls | 0.3 MB | 0.9 MB | 1.3 MB |
| String, ~10% nulls | 0.1 MB | 0.9 MB | 0.9 MB |
| Temporal (datetime/date/duration/time), no nulls | 4.0 MB | 4.0 MB | 0 MB |
| Temporal (datetime/date/duration/time), ~10% nulls | 2.0 MB | 2.0 MB | 0.1 MB |
| Binary, no nulls | 10.0 MB | 12.2 MB | 10.6 MB |
| Binary, ~10% nulls | 8.4 MB | 13.1 MB | 10.4 MB |

Import time — 100,000 rows

| Profile | → pandas | → polars (old: via pandas) | → polars (this PR) | Speedup vs old |
| --- | --- | --- | --- | --- |
| Numeric, no nulls | 18 ms | 33 ms | 7 ms | 4.7× |
| Numeric, ~10% nulls | 11 ms | 14 ms | 13 ms | 1.1× |
| String, no nulls | 75 ms | 90 ms | 100 ms | ~even |
| String, ~10% nulls | 65 ms | 93 ms | 70 ms | 1.3× |
| Temporal (datetime/date/duration/time), no nulls | 136 ms | 247 ms | 56 ms | 4.4× |
| Temporal (datetime/date/duration/time), ~10% nulls | 145 ms | 245 ms | 54 ms | 4.5× |
| Binary, no nulls | 91 ms | 111 ms | 84 ms | 1.3× |
| Binary, ~10% nulls | 76 ms | 102 ms | 80 ms | 1.3× |

Import memory (Δ RSS) — 100,000 rows

| Profile | → pandas | → polars (old: via pandas) | → polars (this PR) |
| --- | --- | --- | --- |
| Numeric, no nulls | 0.2 MB | 0 MB | 0 MB |
| Numeric, ~10% nulls | 0 MB | 0 MB | 0.4 MB |
| String, no nulls | 0.8 MB | 1.6 MB | 3.0 MB |
| String, ~10% nulls | 3.9 MB | 1.5 MB | 0.8 MB |
| Temporal (datetime/date/duration/time), no nulls | 1.5 MB | 1.5 MB | 5.4 MB |
| Temporal (datetime/date/duration/time), ~10% nulls | 1.6 MB | 5.7 MB | 5.5 MB |
| Binary, no nulls | 9.8 MB | 10.6 MB | 0 MB |
| Binary, ~10% nulls | 0.1 MB | 10.4 MB | 10.6 MB |

Notes

  • Temporal export shows the largest gain (~27–50×): the old path boxes every value into a Python datetime object when converting through pandas; the new path casts directly to int64 in C with no Python allocation. The 4 MB RSS spike in the old paths disappears entirely.
  • String export (~2–2.6×): the new path reads raw UTF-8 bytes directly from Polars' Arrow LargeUtf8 buffer in C, bypassing str object creation and re-encoding. The small RSS increase reflects the SBDF output buffer, not intermediate Python objects.
  • String import is roughly flat on time — building an Arrow buffer directly is comparable work to converting through pandas.
  • Binary (~1.7–1.9× export, ~1.3× import): benefits from skipping the Polars → pandas conversion chain; the memory profile is similar across paths since the raw byte data dominates.
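
For reproducing numbers of this shape locally, a minimal stdlib harness follows the same protocol (mean over reps, first excluded as warmup). Note that the actual benchmarks above used psutil RSS; tracemalloc here only sees Python-heap allocations, so it will miss Arrow/Rust/C buffers:

```python
import time
import tracemalloc

def bench(fn, reps=7):
    """Time fn over reps runs (first excluded as warmup); report the mean
    wall time and the peak Python-heap allocation of the last run."""
    times = []
    peak = 0
    for _ in range(reps):
        tracemalloc.start()
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    mean_s = sum(times[1:]) / len(times[1:])
    return mean_s, peak

# Example workload: an allocation-heavy list build.
mean_s, peak = bench(lambda: [i * i for i in range(10_000)])
```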

stewjb and others added 8 commits April 3, 2026 22:25
benchmark.py is a local development tool and should not be committed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
np.where(invalids)[0] returns an ndarray; pl.Series.scatter() accepts it
directly. The .tolist() conversion was allocating an unnecessary Python list
on every null-containing column import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…boundary safety

Exports 100_001 rows of a Polars String column, forcing a second SBDF row
slice (start=100_000, count=1), and asserts the value at the chunk boundary
is correct.  Covers the raw C pointer arithmetic in _export_extract_string_obj_arrow
which is not bounds-checked.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lars/pyarrow

polars is an optional dependency; pyarrow only arrives transitively through it.
Adding test_requirements_no_polars.txt causes build.yaml's test-environment matrix
to automatically pick up a second CI slot that runs the full test suite with neither
library installed.  SbdfPolarsTest is skipped via @unittest.skipIf(pl is None, ...);
all Pandas tests must pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add two tests to SbdfPolarsTest verifying that the Polars and Pandas
import/export code paths produce identical data for all 11 non-Decimal
SBDF value types with one null per column (rotating positions 0–4):

- test_all_dtypes_export_polars_vs_pandas_path: exports the same data via
  the native Polars path and the Pandas path, imports both back as Pandas,
  and asserts frame equality.

- test_all_dtypes_import_polars_vs_pandas_path: imports a single SBDF file
  as both a Polars and a Pandas DataFrame, then compares null positions and
  non-null values column by column.

Helpers:
- _all_dtypes_polars_df(): canonical Polars source with all SBDF-compatible types.
- _all_dtypes_pandas_df(): equivalent Pandas source (avoids pyarrow dependency).
- _assert_import_paths_equivalent(): per-column null + value comparison using
  Series.to_list(), which works without pyarrow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or pylint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erload

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ong (131/120)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Allow use of polars instead of pandas

3 participants