Add native Polars DataFrame support#99
stewjb wants to merge 41 commits into spotfiresoftware:main
Conversation
…tetime, scatter compat

- Fix Categorical/Enum dtype: was incorrectly trying to recurse into dtype.categories (which doesn't exist on the dtype object); now casts series to Utf8 and maps to SBDF_STRINGTYPEID directly
- Add Enum dtype support (previously raised SBDFError)
- Warn on UInt64 export: values above Int64 max will overflow silently
- Warn on timezone-aware Datetime export: tz info is not preserved in SBDF
- Warn on Decimal export: marked experimental, precision may be lost
- Fix scatter() compatibility: add AttributeError fallback to set_at_idx() for older Polars versions within the supported range
- Add tests for all of the above

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add polars to test_requirements_default.txt so SbdfPolarsTest is actually executed in CI (previously skipped due to missing import)
- Add spotfire[polars] row to extras table in README
- Add usage note explaining Spotfire's bundled Python lacks Polars and that SPKs bundling Polars will be ~44 MB larger than typical packages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Raise SBDFError for unknown output_format values (previously fell through silently to Pandas)
- Emit SBDFWarning when Categorical/Enum columns are exported as String, consistent with existing UInt64 and timezone warnings
- Add test_invalid_output_format: verifies bad output_format raises
- Add test_write_polars_empty: verifies empty DataFrame exports cleanly
- Add test_write_polars_series_nulls: verifies null preservation in Series
- Add test_polars_categorical_warns: verifies Categorical warning fires

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A Polars Series of [None, None, None] has dtype pl.Null (no type can be inferred). Previously this raised SBDFError with "unknown dtype". Now it exports as an all-invalid String column, consistent with how all-None Pandas columns are handled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CI static analysis runs mypy without polars installed; add type: ignore[import-not-found] so mypy skips the missing stub. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explain non-obvious choices that would otherwise prompt review questions:

- Why dtype.__class__.__name__ instead of isinstance()
- Why the scatter()/set_at_idx() try/except exists and which versions it covers
- Why the is_object_numpy_type() cpdef wrapper is needed for a cdef attribute
- Why the output_format polars path short-circuits before pd.concat
- Why the Null dtype path returns a placeholder array

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…olars versions (>= 0.20) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Adds native Polars support to Spotfire’s SBDF import/export layer to avoid Pandas conversions (improving memory usage and performance for large datasets), and wires it up as an optional extra.
Changes:
- Add
polarsas an optional dependency (spotfire[polars]) and enable it in dev/test setups. - Extend
sbdf.export_data()to acceptpolars.DataFrame/polars.Seriesdirectly, with dtype→SBDF mapping. - Extend
sbdf.import_data()withoutput_formatto optionally construct a nativepolars.DataFramewithout creating a Pandas DataFrame.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `spotfire/sbdf.pyx` | Implements Polars import/export paths and dtype mappings; adds `output_format` to `import_data()`. |
| `spotfire/sbdf.pyi` | Updates type stub for `import_data()` to include `output_format`. |
| `spotfire/test/test_sbdf.py` | Adds Polars-focused unit tests for export/import/roundtrip + warnings. |
| `pyproject.toml` | Adds `polars` extra and includes it in the `dev` extra. |
| `test_requirements_default.txt` | Installs Polars for test runs. |
| `README.md` | Documents `spotfire[polars]` and the new import/export behavior. |
| `.gitignore` | Ignores `.venv`, `uv.lock`, and `.claude`. |
```python
context.set_valuetype_id(_export_infer_valuetype_from_polars_dtype(series.dtype, f"column '{col}'"))
invalids = series.is_null().to_numpy()
context.set_arrays(_export_polars_series_to_numpy(context, series), invalids)
column_metadata.append({})
```
In the Polars export path, invalids are derived from series.is_null(), which does not mark floating-point NaN values as invalid. In the existing Pandas path pd.isnull() treats NaN as missing, so exporting a Polars float column containing NaN will write NaNs as real values instead of SBDF invalids (behavior mismatch vs Pandas and likely incorrect for Spotfire missing-values semantics). Consider treating NaN as invalid for Float32/Float64 columns (e.g., combine is_null() with is_nan() when applicable).
Added series.is_null() combined with series.is_nan() for float columns, so NaN is treated as missing the same way Pandas does.
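The combined mask can be sketched with plain NumPy; in the actual code the two masks come from the Polars `Series.is_null()` and `Series.is_nan()` methods, but the boolean logic is the same:

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])
null_mask = np.array([False, False, True])   # stand-in for series.is_null().to_numpy()
nan_mask = np.isnan(values) & ~null_mask     # stand-in for series.is_nan() on floats

# NaN is folded into the invalids mask, matching pd.isnull() semantics
invalids = null_mask | nan_mask
assert invalids.tolist() == [False, True, True]
```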
```python
if na_value is not None:
    return np.asarray(series.fill_null(na_value).to_numpy(allow_copy=True),
                      dtype=context.get_numpy_dtype())
else:
```
_export_polars_series_to_numpy converts to an object ndarray when na_value is None. For Polars Datetime / Duration series, to_numpy() already produces datetime64 / timedelta64 arrays that the existing SBDF exporters can handle, so forcing dtype=object will box scalars and create an unnecessary copy (hurting the performance goal of this PR). Consider special-casing datetime/timespan to keep the native NumPy dtype (ideally normalized to the SBDF-supported resolution) instead of casting to object.
```python
else:
    # For Datetime/Duration, keep native NumPy datetime64/timedelta64 dtypes instead of boxing to object.
    if dtype_name in ("Datetime", "Duration"):
        return series.to_numpy(allow_copy=True)
```
Datetime and Duration columns now go to NumPy early, keeping their native dtypes.
- Move output_format validation to top of import_data() for fail-fast behaviour before the file is opened
- Raise SBDFError in _import_polars_dtype fallback instead of silently returning Utf8 for unknown SBDF type IDs
- Treat NaN as invalid (missing) for Float32/Float64 columns, matching Pandas pd.isnull() behaviour; add test_write_polars_float_nan
- Keep native datetime64/timedelta64 arrays for Datetime/Duration columns instead of boxing to object dtype (avoids unnecessary copy)
- Add @overload signatures to sbdf.pyi so callers get pd.DataFrame for the default output_format="pandas" and Any for output_format="polars"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vrane-tibco @bbassett-tibco @mpanke-tibco thanks for considering this PR. Let me know if you all have thoughts.
_export_obj_dict_of_lists (line 1313): np.array(n) where n is an integer
creates a 0-dimensional array, not a 1-D array of length n. Every
export_data({"col": [...]}) call would raise IndexError. Fixed to
np.empty(shape, ...).
_export_obj_iterable (lines 1358-1366): np.append inside a for loop
reallocates the entire array on every iteration — O(n²) for a column
of n rows. Replaced with list accumulation and a single np.array()
call at the end.
Add test_export_dict_of_lists and test_export_list to cover both paths
(previously untested, which is why the bug went undetected).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
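Both bugs described in this commit are easy to reproduce in isolation; a minimal sketch (the variable names here are illustrative, not the actual sbdf.pyx identifiers):

```python
import numpy as np

# Bug 1: np.array(n) with an integer n builds a 0-dimensional array holding
# the scalar n, not a 1-D array of length n, so any indexing raises IndexError.
n = 5
zero_d = np.array(n)
assert zero_d.shape == ()                 # 0-d: zero_d[0] would raise IndexError

buf = np.empty(n, dtype=object)           # the fix: a real 1-D array of length n
assert buf.shape == (5,)

# Bug 2: np.append reallocates and copies the whole array on every call,
# making a per-row loop O(n^2). Accumulate in a list and convert once instead.
values = ["a", "b", "c"]
acc = []
for v in values:
    acc.append(v)                         # O(1) amortized per row
result = np.array(acc, dtype=object)      # single O(n) conversion at the end
assert result.tolist() == ["a", "b", "c"]
```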
For Int32/Int64 columns, the previous code constructed a pd.Series and then assigned nulls via .loc[mask] = None in a second pass, which triggers Pandas dtype coercion overhead internally. Replace with pd.arrays.IntegerArray(values, mask) which constructs the nullable integer array with the validity mask in a single operation, avoiding the second pass entirely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
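The single-pass construction described above can be sketched directly with the public pandas API (example data invented for illustration):

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4], dtype=np.int64)
mask = np.array([False, True, False, False])   # True marks a null slot

# One-shot construction: values plus validity mask, with no second
# `.loc[mask] = None` pass and no dtype coercion overhead.
arr = pd.arrays.IntegerArray(values, mask)
s = pd.Series(arr)

assert s.isna().tolist() == [False, True, False, False]
assert str(s.dtype) == "Int64"
```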
@stewjb Appreciate the work here — the dtype mapping is thorough and the export mechanics are solid. But I have two concerns, one architectural and one about the performance framing.

On metadata: this package isn't just a data serializer; it's specifically the bridge that carries Spotfire's interpretation of that data back and forth. The three things Spotfire cares about beyond raw values are spotfire_table_metadata, spotfire_column_metadata, and spotfire_type. All three are attached to Pandas objects, and the entire surface API (copy_metadata, get_spotfire_types, set_spotfire_types, set_geocoding_table, all exported from `__init__.py`) is built on top of those Pandas extension points.
On performance: the export side is genuinely better than the to_pandas() workaround. The import side, though, is a different story. Numeric columns (Boolean, Integer, LongInteger, Float, Real) do gain: the Polars path does pl.Series(values=numpy_array), and Polars can reference the NumPy buffer directly for these types — genuine zero-copy. But for String or Datetime columns I doubt there is a gain.

Take strings as a concrete example. In sbdf.pyx, _import_vt_string (lines 534–536) runs a Python loop that creates a Python str object for every row and stores them in an NPY_OBJECT array; that's unavoidable, because SBDF stores strings as C char pointers, not Arrow buffers. Then _import_build_polars_dataframe calls .tolist() on that object array (line 726) to produce a Python list, because Polars can't consume an NPY_OBJECT array directly. Then pl.Series(values=list) re-encodes all those strings into Polars' Arrow buffer. So at peak memory you have the NPY_OBJECT array, the Python list, and the Arrow buffer all alive simultaneously: three representations of the same data. The Pandas import path just wraps the NPY_OBJECT array in a Series header and stops there: two representations, same str objects shared by reference.

So for import: numeric columns genuinely benefit; strings and datetimes are actually worse than the Pandas path in both memory and time.
@vrane-tibco this is good feedback. Thank you for taking the time to consider this.

If you all feel the metadata is non-negotiable for this package, I understand that. We can fork off this package if we deem this important. The biggest benefit is not doubling peak memory when using Polars, though we can increase our memory to get around this. The metadata loss is a limitation of Polars, which you've mentioned. Polars has an outstanding issue to address this (pola-rs/polars#5117), but it's been open since 2022, which isn't promising.

My suggestion would be to log a warning to the user on the limitation of Polars on import/export. That would handle your point about dropping the metadata silently. I don't think there is a clean way to do that hand-off natively into Polars. The alternative I can see is to return a list with the dataframe and metadata unpacked, but that seems clunky and confusing. A possible compromise is another function strictly for Polars that returns the metadata unpacked. Your comment on the confusing error makes sense and I can address it.

On performance, the export benefit is the main driver for me of this PR. As of now, I don't know how the user would specify within Spotfire that they want a Polars DataFrame returned instead of Pandas. Personally, I just use

One proposal would be to drop the import functionality totally, given you can't reach it as intended currently. I included it because it helps for testing and in case Polars adoption becomes more widespread in the future. I think we can address the concerns on datetimes and strings. I definitely think we can drop it to only 2 copies of the same data, which is still a net improvement over having to use .to_pandas(), which is the current default and holds 2 copies of the whole df in that process.
…etime import

- Emit SBDFWarning on both Polars import and export paths pointing to pola-rs/polars#5117 so metadata loss is never silent.
- Raise TypeError with a Polars-specific message from copy_metadata(), get_spotfire_types(), and set_spotfire_types() instead of a generic error.
- For the Polars import path, bypass the Python-boxing importers for DateTime/Date/TimeSpan: store raw int64 ms values via _import_vts_numpy, then in _import_build_polars_dataframe subtract the SBDF-to-Unix epoch offset in-place and reinterpret via .view() — reducing peak memory from 3 live copies to 1-2 (down from creating Python datetime objects).
- String/Time/Binary/Decimal import: release the concatenated numpy array before building the Polars Arrow buffer (del + clear_values_arrays()) to cap peak at 2 live copies instead of 3.
- Add get_value_type_id() and clear_values_arrays() cpdef helpers on _ImportContext to support the above without Cython-level casts.
- Add 6 new tests covering the metadata warning and descriptive error paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verifies that the in-place epoch-shift + .view('datetime64[D]') path in
_import_build_polars_dataframe produces identical results to the reference
np.astype('datetime64[D]') conversion across six dates: the SBDF epoch
(0001-01-01), one day before and the day of the Unix epoch, one day after,
a recent date, and the maximum representable date (9999-12-31).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous view('datetime64[ms]') approach always triggered a copy inside
Polars: _normalise_numpy_dtype() unconditionally calls .astype(np.int64) on
any datetime64 input before passing to the Rust constructor.
Verified via mutation test (numpy array modified after Series construction):
- Datetime: pl.Series(int64, Int64).cast(Datetime('ms')) — zero-copy; Int64
and Datetime('ms') share the same int64 Arrow buffer (metadata-only cast).
- Duration: pl.Series(int64, Int64).cast(Duration('ms')) — same, zero-copy.
- Date: pl.Date is int32 internally, so int64→int32 narrowing is unavoidable
(1 copy via .astype(np.int32)); pl.Series(int32, Date) is then zero-copy.
Total: 2 copies from C data (down from 3 in the original NPY_OBJECT path).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_import_vt_date_int32 writes directly into an NPY_INT32 slice array at the C level, so pl.Series(int32, pl.Date) in _import_build_polars_dataframe is then zero-copy — eliminating the prior int64→int32 astype() narrowing copy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
For null-free numeric columns, skip fill_null and use to_numpy(allow_copy=False) to return a direct view of the Arrow buffer.

For Datetime/Date/Duration/Time, extract raw integer buffers from the Polars Series (zero-copy when null-free) and route through four new Polars-specific C-level exporter functions that perform epoch/unit conversion in a tight C loop, completely bypassing the Python-object-boxing loop in the generic exporters:

- _export_vt_polars_datetime: int64 ms (Unix) → add SBDF epoch offset
- _export_vt_polars_date: int32 days → int64 ms (SBDF epoch)
- _export_vt_polars_timespan: int64 ms passthrough (no epoch needed)
- _export_vt_polars_time: int64 ns → int64 ms

Columns with nulls fall back to a fill-zero copy (Arrow's validity bitmap cannot be expressed inline in a numpy int array), but are still processed by the C loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
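The epoch/unit arithmetic those C loops perform can be sketched in Python. This is a sketch under the assumptions stated in the commit (SBDF temporal base unit is milliseconds with a 0001-01-01 epoch); the constant and function names here are illustrative, not the actual C symbols:

```python
from datetime import datetime, timezone

# Milliseconds between 0001-01-01 and the Unix epoch (1970-01-01), both UTC
_EPOCH_OFFSET_MS = int(
    (datetime(1970, 1, 1, tzinfo=timezone.utc)
     - datetime(1, 1, 1, tzinfo=timezone.utc)).total_seconds() * 1000)

def datetime_ms_to_sbdf(unix_ms: int) -> int:
    # mirrors _export_vt_polars_datetime: shift from the Unix to the SBDF epoch
    return unix_ms + _EPOCH_OFFSET_MS

def date_days_to_sbdf(days: int) -> int:
    # mirrors _export_vt_polars_date: int32 days -> int64 ms on the SBDF epoch
    return days * 86_400_000 + _EPOCH_OFFSET_MS

def time_ns_to_sbdf(ns: int) -> int:
    # mirrors _export_vt_polars_time: int64 ns -> int64 ms (no epoch involved)
    return ns // 1_000_000
```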
Stub files must not contain a concrete @overload implementation alongside the overload variants; mypy rejects it with 'An implementation for an overloaded function is not allowed in a stub file'. Remove the offending line, leaving only the two typed overloads. Also suppress call-overload at the one test site that intentionally passes an invalid output_format value to exercise the SBDFError path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…port SBDF null slots may contain sentinel values (e.g. INT64_MAX) which, after the ms→ns ×1_000_000 scale in _import_vt_time_int64, exceed Polars' valid Time range [0, 86_400_000_000_000 ns]. Zero them out before passing the int64 buffer to pl.Series(dtype=pl.Time); the invalids array then overwrites those slots with None. Also adds OutputFormat enum, cython-lint-friendly named export constants, and fixes the sbdf.pyi stub to use TYPE_CHECKING guard. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
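The sentinel-zeroing step is plain NumPy masking; a minimal sketch (the sentinel value and mask are invented for illustration):

```python
import numpy as np

MAX_TIME_NS = 86_400_000_000_000   # Polars' valid Time range is [0, 24h) in ns

# Slot 1 holds an SBDF null sentinel (e.g. INT64_MAX) that would overflow
# once scaled ms -> ns; the invalids mask records which slots are null.
ms = np.array([3_600_000, np.iinfo(np.int64).max], dtype=np.int64)
invalids = np.array([False, True])

ms[invalids] = 0                   # zero sentinel slots before scaling
ns = ms * 1_000_000                # ms -> ns, now safe for every slot
assert (ns >= 0).all() and (ns < MAX_TIME_NS).all()
```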
polars is an optional dependency not installed in the CI lint environment; the TYPE_CHECKING guard in sbdf.pyi is sufficient for runtime, but mypy still needs the override to suppress import-not-found on the stub. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OutputFormat is no longer a str subclass; passing a raw string now raises SBDFError. Updated all call sites in tests and README to use OutputFormat.POLARS / OutputFormat.PANDAS, and tightened the .pyi overloads to Literal[OutputFormat.*] accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
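The shape of that change can be sketched with a plain `enum.Enum`; this is an illustrative sketch, not the actual spotfire source (the real code raises SBDFError rather than ValueError):

```python
import enum

class OutputFormat(enum.Enum):     # deliberately NOT a str subclass
    PANDAS = "pandas"
    POLARS = "polars"

def check_output_format(value):
    # A raw string like "polars" is no longer accepted; callers must
    # pass OutputFormat.POLARS / OutputFormat.PANDAS explicitly.
    if not isinstance(value, OutputFormat):
        raise ValueError(f"unknown output format: {value!r}")
    return value

assert check_output_format(OutputFormat.POLARS) is OutputFormat.POLARS
try:
    check_output_format("polars")  # rejected: raw string
    raise AssertionError("should have raised")
except ValueError:
    pass
```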
@vrane-tibco addressing your point on import copy performance — several optimizations have landed since your review.

String import

Your analysis was correct at the time of review: the NPY_OBJECT array, the Python list, and the Arrow buffer were all alive simultaneously at peak. The import now releases the NPY_OBJECT array before building the Arrow buffer, capping peak at two live copies.

Datetime import

Your concern was that datetimes would not see a performance gain. The current path builds temporal columns from raw int64 buffers with metadata-only casts, so no Python datetime objects are created. The Pandas path, by contrast, still boxes each value into a Python object.

Full import picture

String is now equivalent to the Pandas path. All temporal types are strictly better.
This is getting large for one PR. Here are my thoughts on splitting it up.

PR A — Export
Covers:
PR B — Import
Covers:
PR A could land quickly given it's the uncontested part, and PR B can absorb the metadata/import discussion on its own timeline. |
All errors pre-dated this PR but were blocking CI on the fork. Added targeted # type: ignore[...] annotations with the narrowest applicable error codes rather than broad suppression. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix dict-of-lists export bug and O(n²) iterable export loop
* Build nullable integer columns with mask in one shot on import
* Fix pre-existing mypy errors in data_function.py and test_sbdf.py

# Conflicts:
#	spotfire/test/test_sbdf.py
Polars stores strings as Arrow LargeUtf8: a flat UTF-8 bytes buffer plus an int64 offsets buffer. Previously, export went through series.to_numpy() (one Python str object per row) and then the C helper re-encoded each string to UTF-8 via PyObject_Str + str.encode().

This commit adds _export_extract_string_obj_arrow() in sbdf_helpers.c, which reads the raw UTF-8 bytes and offsets directly, with no Python API calls in the inner loop. The Cython side obtains raw pointers via PyArray_DATA() on zero-copy numpy views of the Arrow buffers. The dispatch path (polars_exporter_id = _POL_EXP_STRING = 5) mirrors the existing temporal fast paths. Categorical and Enum columns are cast to Utf8 before the Arrow path is taken. A guard asserts the Arrow type is large_string (int64 offsets) and raises SBDFError if not.

Benchmarked at 100k rows, string no-nulls (psutil, 7 reps):

- pandas baseline: 58ms
- old polars (via pandas): 71ms
- new polars (Arrow direct): 26ms (-56% vs pandas, -64% vs old polars)

The remaining time is dominated by sbdf_str_create_len (one malloc + memcpy per string), which is unavoidable in the current SBDF format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
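The LargeUtf8 layout the C helper walks can be simulated without Arrow at all; a minimal sketch in pure NumPy (the byte buffer and offsets here are invented for illustration):

```python
import numpy as np

# Simulated Arrow LargeUtf8 layout: one flat UTF-8 byte buffer plus an
# (n + 1)-entry int64 offsets buffer; string i spans [offsets[i], offsets[i+1]).
buf = "abcdef".encode("utf-8")
offsets = np.array([0, 2, 5, 6], dtype=np.int64)   # "ab", "cde", "f"

def string_at(i: int) -> str:
    start, end = offsets[i], offsets[i + 1]
    return buf[start:end].decode("utf-8")

assert [string_at(i) for i in range(3)] == ["ab", "cde", "f"]
```

The C helper does the same pointer arithmetic directly on the two raw buffers, which is why no per-row Python string object is ever created on export.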
series.to_arrow() requires pyarrow. CI test environments install spotfire[polars] without pyarrow, causing ModuleNotFoundError on all Polars string export tests. Wrap the Arrow fast path in try/except ImportError so it degrades gracefully to the existing to_numpy() path when pyarrow is absent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hars pylint line-too-long (C0301) flagged lines 98-99 after the type: ignore annotations were added. Split the assertEqual calls to keep each line within the 120-character limit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
E302: add second blank line before OutputFormat class and _ExportContext decorator. E127: align continuation lines with opening parenthesis in set_arrow_string, _export_polars_series_to_numpy, _export_vt_polars_string, and the sbdf_helpers.pxi extern declaration. E115/E117: fix comment indentation inside except blocks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rk profiles Temporal Polars columns with nulls were being cast to float64 (nan for nulls) instead of int64 before passing to the C exporter, which read the buffer as long long* and got garbage values. Fix: call fill_null(0) after the int cast so to_numpy() always returns the expected integer dtype; the invalids mask already records which positions are null so the sentinel is never read. Adds temporal_nulls (datetime/date/duration/time, ~10% nulls) and binary / binary_nulls profiles to benchmark.py to cover remaining SBDF value types. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Perf: export Polars String columns directly from Arrow LargeUtf8 buffers
Benchmark Results

Measured on Python 3.14.3 / Polars 1.39.3 / Pandas 2.3.3 / NumPy 2.4.3, 7 reps (first excluded as warmup).

Export time — 100,000 rows
Export memory (Δ RSS) — 100,000 rows
Import time — 100,000 rows
Import memory (Δ RSS) — 100,000 rows
Notes
benchmark.py is a local development tool and should not be committed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
np.where(invalids)[0] returns an ndarray; pl.Series.scatter() accepts it directly. The .tolist() conversion was allocating an unnecessary Python list on every null-containing column import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
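The change amounts to dropping one conversion; a minimal sketch of why `.tolist()` was redundant (example mask invented for illustration):

```python
import numpy as np

invalids = np.array([False, True, False, True])

# np.where on a boolean mask returns a tuple of index arrays; element [0]
# is already an ndarray of the null positions, so no Python list is needed.
null_positions = np.where(invalids)[0]

assert isinstance(null_positions, np.ndarray)
assert null_positions.tolist() == [1, 3]
```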
…boundary safety Exports 100_001 rows of a Polars String column, forcing a second SBDF row slice (start=100_000, count=1), and asserts the value at the chunk boundary is correct. Covers the raw C pointer arithmetic in _export_extract_string_obj_arrow which is not bounds-checked. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lars/pyarrow polars is an optional dependency; pyarrow only arrives transitively through it. Adding test_requirements_no_polars.txt causes build.yaml's test-environment matrix to automatically pick up a second CI slot that runs the full test suite with neither library installed. SbdfPolarsTest is skipped via @unittest.skipIf(pl is None, ...); all Pandas tests must pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add two tests to SbdfPolarsTest verifying that the Polars and Pandas import/export code paths produce identical data for all 11 non-Decimal SBDF value types with one null per column (rotating positions 0–4):

- test_all_dtypes_export_polars_vs_pandas_path: exports the same data via the native Polars path and the Pandas path, imports both back as Pandas, and asserts frame equality.
- test_all_dtypes_import_polars_vs_pandas_path: imports a single SBDF file as both a Polars and a Pandas DataFrame, then compares null positions and non-null values column by column.

Helpers:

- _all_dtypes_polars_df(): canonical Polars source with all SBDF-compatible types.
- _all_dtypes_pandas_df(): equivalent Pandas source (avoids pyarrow dependency).
- _assert_import_paths_equivalent(): per-column null + value comparison using Series.to_list(), which works without pyarrow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…or pylint Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…erload Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ong (131/120) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #98
Summary
- `export_data()` now accepts `polars.DataFrame` and `polars.Series` directly, mapping Polars dtypes to SBDF types without any Pandas intermediary. Supported types: Boolean, Int8/16/32, Int64, Float32/64, Utf8/String, Date, Datetime, Duration, Time, Binary, Decimal, Categorical.
- `import_data()` gains an `output_format` parameter (default `"pandas"` for backwards compatibility). When `output_format="polars"`, a `polars.DataFrame` is built directly from the raw numpy arrays; no Pandas DataFrame is created at any point.
- Polars is an optional extra (`spotfire[polars]`), following the same pattern as `spotfire[geo]` and `spotfire[plot]`.

Performance benefit

The previous workaround required `polars_df.to_pandas()` before export, which doubles peak memory usage and adds 2–5 seconds of conversion time at 10M rows. The native path eliminates this entirely for export.

Spotfire data function context

When running inside a Spotfire data function, SBDF import and export happen automatically via `data_function.py`; users never call `import_data` or `export_data` directly. This has two implications:

- Export (output variables): full benefit. A user can build a `polars.DataFrame` in their script and return it as an output variable; `export_data()` handles it natively with no conversion.
- Import (input variables): no benefit from `import_data(output_format="polars")`. Input data is always loaded by the framework via `sbdf.import_data(self._file)` (no `output_format` argument), so input variables always arrive in the script as `pd.DataFrame`. Users who want Polars for processing would still need to call `pl.from_pandas(input_df)` themselves. Fixing this properly would require changes to `data_function.py` and a mechanism for users to declare their preference, which is out of scope for this PR.

In short: the `output_format` parameter on `import_data` is primarily useful outside the Spotfire data function context (e.g. standalone scripts using the `spotfire` package directly). Inside a data function, only the export side benefits.

Test plan

- `test_write_polars_basic` — export DataFrame with common types, re-import as Pandas and verify data
- `test_write_polars_nulls` — null values are preserved through the roundtrip
- `test_write_polars_series` — Polars Series export works
- `test_import_as_polars` — import with `output_format="polars"` returns a native `polars.DataFrame`
- `test_polars_roundtrip` — full Polars → SBDF → Polars roundtrip

🤖 Generated with Claude Code