19 May 22:21

d8fd418

v0.2.1 Latest

ks-xlsx-parser v0.2.1 — Packaging hotfix 📦

Headline: v0.2.0 shipped a broken wheel. `from ks_xlsx_parser.pipeline import parse_workbook` (and every public entry point that depends on it) raised `ModuleNotFoundError` for everyone who installed from PyPI. v0.2.1 fixes it.

If you're on 0.2.0:

pip install --upgrade ks-xlsx-parser   # or: uv pip install --upgrade ks-xlsx-parser
python -c "from ks_xlsx_parser.pipeline import parse_workbook; print('ok')"

What was broken

The source tree was laid out flat under src/ — pipeline.py and api.py as top-level modules, with 13 sibling packages (models, utils, parsers, analysis, …) next to them. The setuptools build picked up the packages but silently dropped the top-level modules, because [tool.setuptools.packages.find] only finds packages, never modules. The published wheel's top_level.txt:

analysis annotation charts chunking comparison export formula
ks_xlsx_parser models parsers rendering storage utils verification

— 14 top-level entries, 13 of them generic. Two consequences:

from ks_xlsx_parser.pipeline import ... failed because pipeline.py was never copied into the wheel.
Anyone with an unrelated models / utils / parsers package in site-packages had it shadowed by ks-xlsx-parser's internal ones.

CI couldn't catch this because the test matrix used an editable install — src/ lives on sys.path and the wheel-packaging defect is invisible.
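
A wheel is a zip archive, so this kind of defect can be spotted by listing the top-level names it contains. The following is a minimal sketch using only the standard library; the helper name `wheel_top_level` is illustrative and is not part of ks-xlsx-parser or of scripts/verify_wheel.py.

```python
import zipfile


def wheel_top_level(wheel) -> set[str]:
    """Return the top-level importable names found in a wheel.

    `wheel` is a path or a binary file object. Metadata directories
    (*.dist-info, *.data) are ignored.
    """
    names = set()
    with zipfile.ZipFile(wheel) as zf:
        for entry in zf.namelist():
            first = entry.split("/", 1)[0]
            if first.endswith((".dist-info", ".data")):
                continue
            # A top-level module appears as "pipeline.py", a package as "models/...".
            names.add(first.removesuffix(".py"))
    return names
```

Run against a locally built wheel (for example a file under `dist/`, assuming the default build output directory), a correctly nested layout should yield only `ks_xlsx_parser`, while a flat layout like the one described above would yield the generic names as well.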

What changed in 0.2.1

Proper nested package. Everything now lives under src/ks_xlsx_parser/. The wheel's top_level.txt contains only ks_xlsx_parser. Imports inside the package switched from from pipeline import … to relative / fully-qualified ks_xlsx_parser.pipeline.
scripts/verify_wheel.py: builds the wheel, installs it in a clean venv, and asserts that the public import surface resolves and that there is no namespace pollution. Wired into both ci.yml (a new, required wheel-check job) and release.yml (a Verify wheel step that runs before PyPI publish), so the class of bug that produced 0.2.0 should now fail before anything is published.
CI overhaul: separate test / wheel-check / lint / typecheck jobs; uv-backed installs (faster); make install-dev alias.
Docker benchmark + accuracy tracking: Dockerfile.bench reproduces the SpreadsheetBench retrieval benchmark; a new .github/workflows/benchmark.yml runs a 60-instance sample on every PR touching the parser and the full 912-instance corpus weekly. scripts/append_bench_history.py keeps a commit-over-commit history file. (Goal: text recall@5 > 0.90.)
Recall failure triage: eval_retrieval.py --emit-failures + scripts/triage_recall.py turn "recall@5 = X" into a bucket histogram with exemplar failures. docs/recall-investigation.md documents the diagnosis framework.

⚠️ Breaking — if you depended on the leaked top-level packages

Anyone whose downstream code did:

from models import WorkbookDTO          # ← used to "work" because of the leak
from parsers.workbook_parser import …

must update to:

from ks_xlsx_parser.models import WorkbookDTO
from ks_xlsx_parser.parsers.workbook_parser import …

If you only used the documented public surface (from ks_xlsx_parser import parse_workbook and ks_xlsx_parser.pipeline.parse_workbook), nothing changes — those names still resolve, and now they actually load.
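
To check whether a downstream codebase relied on the leaked names, one option is to scan its sources for absolute imports of them. This is a rough sketch, not a tool shipped with the project; `find_leaked_imports` and `LEAKED` are hypothetical names, and the scan will also flag your own unrelated `models` or `utils` packages, so treat hits as candidates to review.

```python
import ast

# The 13 generic names that 0.2.0 exposed at top level (from top_level.txt).
LEAKED = {
    "analysis", "annotation", "charts", "chunking", "comparison", "export",
    "formula", "models", "parsers", "rendering", "storage", "utils",
    "verification",
}


def find_leaked_imports(source: str) -> list[tuple[int, str]]:
    """Return (line number, module) for absolute imports of a leaked name."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are skipped.
            modules = [node.module]
        else:
            continue
        for module in modules:
            if module.split(".", 1)[0] in LEAKED:
                hits.append((node.lineno, module))
    return sorted(hits)
```

Each hit is a line whose import should gain the `ks_xlsx_parser.` prefix if it was in fact resolving to this library.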

Upgrade

pip install --upgrade ks-xlsx-parser
# or
uv pip install --upgrade ks-xlsx-parser

Then:

from ks_xlsx_parser import parse_workbook
result = parse_workbook(path="report.xlsx")
print(result.workbook.total_cells)
for chunk in result.chunks:
    print(chunk.source_uri, chunk.render_text[:120])

Verification used to ship this release

1041 passed, 11 deselected           # full test suite
verifying ks_xlsx_parser-0.2.1-py3-none-any.whl
wheel contents OK (61 entries, top-level: ks_xlsx_parser)
clean-venv import OK
wheel verification PASSED

scripts/verify_wheel.py ran as a release-workflow step before the PyPI publish, with the wheel that's now on PyPI.

Thanks

Frank for the bug report — the channel screenshot was the right level of detail to reproduce in five minutes.

ks-xlsx-parser v0.2.0 — Benchmark + Retrievability 📊

Headline: ks-xlsx-parser now has a head-to-head benchmark against Docling on the SpreadsheetBench corpus (912 task instances, 5,458 xlsx files). ks parses 99.945% of the corpus, ties Docling at recall@1, and wins at recall@3 (+2.7 pp) and recall@5 (+1.8 pp) on apples-to-apples retrieval, with 36.9% citation-grade geometric recall that Docling structurally cannot achieve.

Plus, three quiet RAG-breaking rendering bugs from 0.1.1 are gone.

What's new

🏁 SpreadsheetBench benchmark — `make bench`

A reproducible, parser-agnostic benchmark over real-world workbooks scraped from ExcelHome / Mr.Excel / r/excel:

| Metric | ks-xlsx-parser | Docling 2.93 | Δ |
| --- | --- | --- | --- |
| Parse success (5,458 files) | 99.945% | not run at scale | n/a |
| Recall@1 (text-match) | 0.580 | 0.579 | +0.1 pp (tied) |
| Recall@3 (text-match) | 0.697 | 0.670 | +2.7 pp |
| Recall@5 (text-match) | 0.704 | 0.686 | +1.8 pp |
| Recall@5 (geometric, A1 anchor overlap) | 0.369 | 0.000 | Docling has no per-chunk anchors |
| Mean parse time per file | 251 ms | 265 ms | ks ~5% faster |
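
For readers unfamiliar with the metric, recall@k is commonly computed as the fraction of queries for which at least one relevant chunk appears among the top k results. The sketch below assumes that common definition; the benchmark's exact scoring rules are in COMPARISON.md, and `recall_at_k` is an illustrative name rather than a function from the harness.

```python
def recall_at_k(ranked_hits: list[list[bool]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k.

    ranked_hits[i][j] is True when the j-th ranked chunk for query i
    is judged relevant (by text match or by anchor overlap).
    """
    if not ranked_hits:
        return 0.0
    found = sum(1 for hits in ranked_hits if any(hits[:k]))
    return found / len(ranked_hits)
```

Under this definition recall can only stay flat or rise as k grows, which is consistent with the 0.580 / 0.697 / 0.704 progression in the table.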

Why "geometric" recall matters for RAG: ks emits a sheet!A1:Z99 range with every chunk. A retrieval system that surfaces the chunk can render a citation that points at the exact source cells. Docling produces markdown without per-chunk anchors, so it can't satisfy this metric at all. This is the difference between "the answer was in the workbook" and "the answer was in cell C7 of the Revenue sheet."

Marker is intentionally absent — its xlsx → HTML → PDF → layout-model pipeline clocks >30 min per workbook on CPU. The harness supports adding a Marker adapter (tests/benchmarks/adapters/docling_adapter.py as a template); the speed wall is the obstacle.

Full methodology, capability matrix, and caveats: tests/benchmarks/reports/COMPARISON.md.

🔧 Three rendering bugs that were silently torpedoing retrieval

Comma-formatted numbers. 1272 rendered as "1,272.00" (Excel's display format), so a user query for "1272" failed to substring-match. Now: numeric cells render the raw value.
Spurious sci-notation. The [=] formula marker inflated a cell past column width, tripping a long-value fallback that rendered 1272 as "1.272000e+03". Now: column widths are computed with the same rendering pipeline that data rows use.
Embedded newlines in headers (common in CJK workbooks like "租金\n天数") tore apart the Markdown table grid. Now: collapsed to spaces.

These three together accounted for the entire retrieval-recall gap we initially measured against Docling.
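
The effect of the first and third fixes can be reproduced in a few lines. This is an illustrative sketch only: `render_numeric` and `clean_header` are hypothetical helpers, not the parser's actual rendering code.

```python
def render_numeric(value: float) -> str:
    """Render a number as its plain value, without thousands separators."""
    if float(value).is_integer():
        return str(int(value))
    return repr(float(value))


def clean_header(text: str) -> str:
    """Collapse whitespace runs, including embedded newlines, to single spaces."""
    return " ".join(text.split())


display = f"{1272:,.2f}"           # Excel-style display string: "1,272.00"
print("1272" in display)           # False: the comma breaks the substring match
print("1272" in render_numeric(1272))  # True
print(clean_header("租金\n天数"))    # header stays on one table row
```

The same substring problem applies to the scientific-notation form, since "1272" does not occur in "1.272000e+03" either.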

🧹 Segmenter — no more banded-table fragmentation

Removed _detect_style_boundaries from chunking/segmenter.py. The function split a coherent table into 5 fragments at fill-color band boundaries (year-banding, alternating-row shading), shedding header context from data rows. The connected-components + gap detection already handles real boundaries; fill banding is not a semantic one.

🛡️ GradientFill safety

Cells using GradientFill (rare but real — caught by SpreadsheetBench instance 118-8, 8 sheets / 1,244 cells previously lost) used to crash the sheet parser. Now: defensively skipped, sheet keeps parsing.

🐳 Productionization

Makefile: make bench, make bench-robust, make bench-retrieval
scripts/download_corpora.sh now fetches SpreadsheetBench v0.1
scripts/summarize_retrieval.py — re-aggregate a partial results.ndjson if a long run gets interrupted
New benchmark framework supports adding parsers (Marker, hucre, others) via the NDJSON-worker protocol; see tests/benchmarks/README.md

Reproduce

pip install -U ks-xlsx-parser==0.2.0      # or
git clone https://github.com/knowledgestack/ks-xlsx-parser
cd ks-xlsx-parser
make corpus-download                       # one-time, ~100 MB
make bench                                 # ~30 min for both benchmarks
open tests/benchmarks/reports/COMPARISON.md

Upgrading from 0.1.1

No breaking API changes. The only behavioral change is that render_text on numeric cells now contains the raw value instead of the Excel-display-formatted string (e.g. 1272 instead of 1,272.00). If you were relying on display formatting in retrieval keys or downstream regex parsing, switch to the cell's display_value field on the ChunkDTO. For everything else, drop-in.

Full changelog: CHANGELOG.md.

Thanks

To the SpreadsheetBench team at Renmin University for publishing a clean, real-world xlsx corpus with structured ground truth — none of this comparison would have been possible without it.
