Releases: knowledgestack/excel-parser
v0.2.1
ks-xlsx-parser v0.2.1 — Packaging hotfix 📦
Headline: v0.2.0 shipped a broken wheel. from ks_xlsx_parser.pipeline import parse_workbook (and every public entry point that depends on it) raised ModuleNotFoundError for everyone who installed from PyPI. v0.2.1 fixes it.
If you're on 0.2.0:
pip install --upgrade ks-xlsx-parser # or: uv pip install --upgrade ks-xlsx-parser
python -c "from ks_xlsx_parser.pipeline import parse_workbook; print('ok')"
What was broken
The source tree was laid out flat under src/ — pipeline.py and api.py as top-level modules, with 13 sibling packages (models, utils, parsers, analysis, …) next to them. The setuptools build picked up the packages but silently dropped the top-level modules, because [tool.setuptools.packages.find] only finds packages, never modules. The published wheel's top_level.txt:
analysis annotation charts chunking comparison export formula
ks_xlsx_parser models parsers rendering storage utils verification
— 14 top-level entries, 13 of them generic. Two consequences:
- from ks_xlsx_parser.pipeline import ... failed because pipeline.py was never copied into the wheel.
- Anyone with an unrelated models / utils / parsers package in site-packages had it shadowed by ks-xlsx-parser's internal ones.
CI couldn't catch this because the test matrix used an editable install — src/ lives on sys.path and the wheel-packaging defect is invisible.
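The package-versus-module distinction can be reproduced without setuptools. The sketch below is a simplified stand-in for directory-based package discovery, not the setuptools implementation: discover_packages and top_level_modules are hypothetical helper names, and the temporary layout only mimics the flat src/ tree described above. Because discovery only ever looks at directories, a loose pipeline.py next to them is never reported.

```python
import tempfile
from pathlib import Path


def discover_packages(src: Path) -> list[str]:
    """Return top-level directories that contain an __init__.py.

    Simplified imitation of directory-based package discovery:
    loose .py files are never considered.
    """
    return sorted(
        entry.name
        for entry in src.iterdir()
        if entry.is_dir() and (entry / "__init__.py").is_file()
    )


def top_level_modules(src: Path) -> list[str]:
    """Return top-level .py files, i.e. what the discovery above misses."""
    return sorted(path.stem for path in src.glob("*.py"))


def demo() -> tuple[list[str], list[str]]:
    """Build a small flat layout in a temp dir and inspect it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src"
        for pkg in ("models", "utils", "parsers"):
            (src / pkg).mkdir(parents=True)
            (src / pkg / "__init__.py").write_text("")
        for mod in ("pipeline", "api"):
            (src / f"{mod}.py").write_text("")
        return discover_packages(src), top_level_modules(src)


if __name__ == "__main__":
    packages, modules = demo()
    print("discovered packages:", packages)
    print("modules left behind:", modules)
```

An editable install hides the gap because src/ itself is on sys.path, so both lists are importable there.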
What changed in 0.2.1
- Proper nested package. Everything now lives under src/ks_xlsx_parser/. The wheel's top_level.txt contains only ks_xlsx_parser. Imports inside the package switched from from pipeline import … to relative / fully-qualified ks_xlsx_parser.pipeline.
- scripts/verify_wheel.py: builds the wheel, installs it in a clean venv, asserts the public import surface resolves and there's no namespace pollution. Wired into both ci.yml (new wheel-check job, required) and release.yml (a Verify wheel step that runs before PyPI publish). The class of bug that produced 0.2.0 cannot recur.
- CI overhaul: separate test / wheel-check / lint / typecheck jobs; uv-backed installs (faster); make install-dev alias.
- Docker benchmark + accuracy tracking: Dockerfile.bench reproduces the SpreadsheetBench retrieval benchmark; a new .github/workflows/benchmark.yml runs a 60-instance sample on every PR touching the parser and the full 912-instance corpus weekly. scripts/append_bench_history.py keeps a commit-over-commit history file. (Goal: text recall@5 > 0.90.)
- Recall failure triage: eval_retrieval.py --emit-failures + scripts/triage_recall.py turn "recall@5 = X" into a bucket histogram with exemplar failures. docs/recall-investigation.md documents the diagnosis framework.
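For readers who want a similar guard in their own projects, here is a minimal sketch of the namespace-pollution half of such a check. It is not the contents of scripts/verify_wheel.py, which these notes do not reproduce; the function names are hypothetical, and the clean-venv install step is left out. It relies only on the fact that a wheel is a zip archive.

```python
import io
import zipfile


def top_level_entries(wheel_bytes: bytes) -> set[str]:
    """Return the top-level names inside a wheel, ignoring *.dist-info."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        names = {name.split("/", 1)[0] for name in wheel.namelist()}
    return {name for name in names if not name.endswith(".dist-info")}


def check_wheel(wheel_bytes: bytes, expected: str = "ks_xlsx_parser") -> None:
    """Raise AssertionError if the wheel ships anything beyond `expected`."""
    found = top_level_entries(wheel_bytes)
    assert found == {expected}, f"unexpected top-level entries: {sorted(found)}"


def make_fake_wheel(paths: list[str]) -> bytes:
    """Build an in-memory zip with the given member paths (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path in paths:
            archive.writestr(path, "")
    return buffer.getvalue()


if __name__ == "__main__":
    leaky = make_fake_wheel([
        "ks_xlsx_parser/__init__.py",
        "models/__init__.py",
        "ks_xlsx_parser-0.2.0.dist-info/METADATA",
    ])
    print(sorted(top_level_entries(leaky)))
```

For a real wheel file, pass Path("dist/<wheel name>").read_bytes() to check_wheel.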
⚠️ Breaking — if you depended on the leaked top-level packages
Anyone whose downstream code did:
from models import WorkbookDTO # ← used to "work" because of the leak
from parsers.workbook_parser import …
must update to:
from ks_xlsx_parser.models import WorkbookDTO
from ks_xlsx_parser.parsers.workbook_parser import …
If you only used the documented public surface (from ks_xlsx_parser import parse_workbook and ks_xlsx_parser.pipeline.parse_workbook), nothing changes: those names still resolve, and now they actually load.
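To find affected imports in a downstream codebase, something like the following sketch may help. It is an illustration, not a tool shipped with the package: find_leaked_imports is a hypothetical name, and the LEAKED set is copied from the 0.2.0 top_level.txt listing above.

```python
import ast

# The 13 generic top-level names listed in the 0.2.0 wheel's top_level.txt.
LEAKED = {
    "analysis", "annotation", "charts", "chunking", "comparison", "export",
    "formula", "models", "parsers", "rendering", "storage", "utils",
    "verification",
}


def find_leaked_imports(source: str) -> list[tuple[int, str]]:
    """Return (line number, module) for absolute imports of leaked names."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules = [node.module]
        elif isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        else:
            # Relative imports and non-import nodes are not affected.
            continue
        for module in modules:
            if module.split(".", 1)[0] in LEAKED:
                hits.append((node.lineno, module))
    return sorted(hits)


if __name__ == "__main__":
    sample = "from models import WorkbookDTO\nimport os\n"
    print(find_leaked_imports(sample))
```

Names such as utils and models are common, so treat each hit as a candidate to review: it may refer to your own package rather than the leaked one.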
Upgrade
pip install --upgrade ks-xlsx-parser
# or
uv pip install --upgrade ks-xlsx-parser
Then:
from ks_xlsx_parser import parse_workbook
result = parse_workbook(path="report.xlsx")
print(result.workbook.total_cells)
for chunk in result.chunks:
    print(chunk.source_uri, chunk.render_text[:120])
Verification used to ship this release
1041 passed, 11 deselected # full test suite
verifying ks_xlsx_parser-0.2.1-py3-none-any.whl
wheel contents OK (61 entries, top-level: ks_xlsx_parser)
clean-venv import OK
wheel verification PASSED
scripts/verify_wheel.py ran as a release-workflow step before the PyPI publish, with the wheel that's now on PyPI.
Thanks
Frank for the bug report — the channel screenshot was the right level of detail to reproduce in five minutes.
See also
- CHANGELOG.md: full diff log.
- docs/recall-investigation.md: diagnosis framework for the recall→0.90 roadmap.
- docs/benchmark-local-setup.md: reproduce the benchmark on your laptop.
v0.2.0 — SpreadsheetBench benchmark + retrievability
ks-xlsx-parser v0.2.0 — Benchmark + Retrievability 📊
Headline: ks-xlsx-parser now has a head-to-head benchmark against Docling on the SpreadsheetBench corpus (912 task instances, 5,458 xlsx files). ks parses 99.945% of the corpus and ties Docling at recall@1 / wins at recall@3 (+2.7 pp) and recall@5 (+1.8 pp) on apples-to-apples retrieval, with 36.9% citation-grade geometric recall that Docling structurally cannot achieve.
Plus, three quiet RAG-breaking rendering bugs from 0.1.1 are gone.
What's new
🏁 SpreadsheetBench benchmark — make bench
A reproducible, parser-agnostic benchmark over real-world workbooks scraped from ExcelHome / Mr.Excel / r/excel:
| Metric | ks-xlsx-parser | Docling 2.93 | Δ |
|---|---|---|---|
| Parse success (5,458 files) | 99.945% | not run at scale | — |
| Recall@1 (text-match) | 0.580 | 0.579 | +0.1 pp (tied) |
| Recall@3 (text-match) | 0.697 | 0.670 | +2.7 pp |
| Recall@5 (text-match) | 0.704 | 0.686 | +1.8 pp |
| Recall@5 (geometric, A1 anchor overlap) | 0.369 | 0.000 | Docling has no per-chunk anchors |
| Mean parse time per file | 251 ms | 265 ms | ks ~5% faster |
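
As a reading aid for the text-match rows, the sketch below shows one common way to compute recall@k: a query counts as a hit if the expected answer text appears in any of the top k retrieved chunks. This is an assumption about the metric's general shape; the harness's actual matching rules are documented in tests/benchmarks/reports/COMPARISON.md and may differ.

```python
def recall_at_k(results: list[tuple[str, list[str]]], k: int) -> float:
    """Fraction of queries whose answer text appears in the top-k chunks.

    results: one (expected answer text, ranked chunk texts) pair per query.
    """
    if not results:
        return 0.0
    hits = sum(
        any(answer in chunk for chunk in ranked[:k])
        for answer, ranked in results
    )
    return hits / len(results)


if __name__ == "__main__":
    queries = [
        ("1272", ["total 1272", "other"]),       # hit at rank 1
        ("abc", ["x", "y", "contains abc"]),     # hit at rank 3
        ("zzz", ["a", "b"]),                     # never retrieved
    ]
    print(recall_at_k(queries, 1), recall_at_k(queries, 3))
```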
Why "geometric" recall matters for RAG: ks emits a sheet!A1:Z99 range with every chunk. A retrieval system that surfaces the chunk can render a citation that points at the exact source cells. Docling produces markdown without per-chunk anchors, so it can't satisfy this metric at all. This is the difference between "the answer was in the workbook" and "the answer was in cell C7 of the Revenue sheet."
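To make "A1 anchor overlap" concrete, here is a minimal sketch of testing whether a chunk's range covers a ground-truth cell. The helper names are hypothetical and this is not the benchmark's scoring code; it assumes plain references such as Revenue!A1:Z99, with no $ markers or quoted sheet names.

```python
import re

_RANGE = re.compile(
    r"^(?:(?P<sheet>.+)!)?(?P<c1>[A-Z]+)(?P<r1>\d+)"
    r"(?::(?P<c2>[A-Z]+)(?P<r2>\d+))?$"
)


def col_to_index(col: str) -> int:
    """Convert a column label such as 'A' or 'AA' to a 1-based index."""
    index = 0
    for ch in col:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_range(ref: str) -> tuple[str, int, int, int, int]:
    """Parse 'Sheet!A1:Z99' into (sheet, min_col, min_row, max_col, max_row)."""
    m = _RANGE.match(ref)
    if m is None:
        raise ValueError(f"not an A1 reference: {ref!r}")
    c1, r1 = col_to_index(m["c1"]), int(m["r1"])
    # A single-cell reference is treated as a 1x1 range.
    c2 = col_to_index(m["c2"]) if m["c2"] else c1
    r2 = int(m["r2"]) if m["r2"] else r1
    return (m["sheet"] or "", min(c1, c2), min(r1, r2), max(c1, c2), max(r1, r2))


def ranges_overlap(a: str, b: str) -> bool:
    """True if two A1 references share a sheet and at least one cell."""
    sa, ac1, ar1, ac2, ar2 = parse_range(a)
    sb, bc1, br1, bc2, br2 = parse_range(b)
    return sa == sb and ac1 <= bc2 and bc1 <= ac2 and ar1 <= br2 and br1 <= ar2


if __name__ == "__main__":
    print(ranges_overlap("Revenue!A1:Z99", "Revenue!C7"))
```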
Marker is intentionally absent — its xlsx → HTML → PDF → layout-model pipeline clocks >30 min per workbook on CPU. The harness supports adding a Marker adapter (tests/benchmarks/adapters/docling_adapter.py as a template); the speed wall is the obstacle.
Full methodology, capability matrix, and caveats: tests/benchmarks/reports/COMPARISON.md.
🔧 Three rendering bugs that were silently torpedoing retrieval
- Comma-formatted numbers. 1272 rendered as "1,272.00" (Excel's display format). A user query "1272" substring-missed. Now: numeric cells render the raw value.
- Spurious sci-notation. The [=] formula marker inflated a cell past column width, tripping a long-value fallback that rendered 1272 as "1.272000e+03". Now: column widths computed using the same rendering pipeline data rows will use.
- Embedded newlines in headers (common in CJK workbooks like "租金\n天数") tore apart the Markdown table grid. Now: collapsed to spaces.
These three together accounted for the entire retrieval-recall gap we initially measured against Docling.
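The failure modes are easy to reproduce in isolation. The snippet below only illustrates why substring retrieval missed and what the collapsed header looks like. It is not the parser's rendering code, and the whitespace-collapsing approach is an assumption.

```python
value = 1272

display = format(value, ",.2f")   # Excel-style display string
scientific = "%e" % value         # long-value fallback style
raw = str(value)                  # raw value, as rendered since 0.2.0

query = "1272"
# Substring matching succeeds only against the raw rendering.
matches = {
    "display": query in display,
    "scientific": query in scientific,
    "raw": query in raw,
}

header = "租金\n天数"
# Collapse embedded newlines (and any other whitespace runs) to single
# spaces so the header stays on one Markdown table row.
flat_header = " ".join(header.split())

if __name__ == "__main__":
    print(display, scientific, raw)
    print(matches)
    print(flat_header)
```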
🧹 Segmenter — no more banded-table fragmentation
Removed _detect_style_boundaries from chunking/segmenter.py. The function split a coherent table into 5 fragments at fill-color band boundaries (year-banding, alternating-row shading), shedding header context from data rows. The connected-components + gap detection already handles real boundaries; fill banding is not a semantic one.
🛡️ GradientFill safety
Cells using GradientFill (rare but real — caught by SpreadsheetBench instance 118-8, 8 sheets / 1,244 cells previously lost) used to crash the sheet parser. Now: defensively skipped, sheet keeps parsing.
🐳 Productionization
- Makefile: make bench, make bench-robust, make bench-retrieval
- scripts/download_corpora.sh now fetches SpreadsheetBench v0.1
- scripts/summarize_retrieval.py: re-aggregate a partial results.ndjson if a long run gets interrupted
- New benchmark framework supports adding parsers (Marker, hucre, others) via the NDJSON-worker protocol; see tests/benchmarks/README.md
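Re-aggregating a partial NDJSON file mostly comes down to tolerating a truncated final record. The sketch below shows that idea under stated assumptions: the field name hit_at_5 is hypothetical, and the real schema and logic belong to scripts/summarize_retrieval.py.

```python
import json


def summarize_ndjson(lines: list[str], metric: str = "hit_at_5") -> dict:
    """Aggregate a possibly truncated NDJSON stream.

    `metric` is a hypothetical boolean field name. Unparsable lines (for
    example a half-written final record) are counted and skipped.
    """
    total = hits = skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if not isinstance(record, dict):
            skipped += 1
            continue
        total += 1
        hits += bool(record.get(metric))
    return {
        "records": total,
        "skipped": skipped,
        "rate": hits / total if total else 0.0,
    }


if __name__ == "__main__":
    partial = ['{"hit_at_5": true}', '{"hit_at_5": false}', '{"hit_at_5": tr']
    print(summarize_ndjson(partial))
```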
Reproduce
pip install -U ks-xlsx-parser==0.2.0 # or
git clone https://github.com/knowledgestack/ks-xlsx-parser
cd ks-xlsx-parser
make corpus-download # one-time, ~100 MB
make bench # ~30 min for both benchmarks
open tests/benchmarks/reports/COMPARISON.md
Upgrading from 0.1.1
No breaking API changes. The only behavioral change is that render_text on numeric cells now contains the raw value instead of the Excel-display-formatted string (e.g. 1272 instead of 1,272.00). If you were relying on display formatting in retrieval keys or downstream regex parsing, switch to the cell's display_value field on the ChunkDTO. For everything else, drop-in.
Full changelog: CHANGELOG.md.
Thanks
To the SpreadsheetBench team at Renmin University for publishing a clean, real-world xlsx corpus with structured ground truth — none of this comparison would have been possible without it.