Performance: 35× speedup across matchers + Coma accuracy improvements by kPsarakis · Pull Request #96 · delftdata/valentine

kPsarakis · 2026-04-08T06:24:34Z

Summary

Resolves #88

Across-the-board performance and accuracy work on every matcher, plus dead-code removal and coverage tests for the touched paths.

Headline: ~35× wall-clock speedup on the full NYU Open Data benchmark (1048 s → 30 s). Mean F1 holds or improves on three of the five matchers, and the two small drops (≤ 0.011) are within run-to-run noise. Coma gains +0.047 F1 from targeted accuracy work.

Performance: before / after

Full NYU Open Data benchmark, 10 dataset pairs, single machine, sequential.

| Matcher | Total (s) | Worst pair (s) | Mean F1 |
|---|---|---|---|
| Coma | 74.87 → 8.46 (8.9×) | 24.41 → 2.68 (9.1×) | 0.754 → 0.802 (+0.047) |
| Cupid | 191.45 → 6.17 (31.0×) | 78.17 → 2.02 (38.8×) | 0.480 → 0.495 (+0.015) |
| DistributionBased | 36.86 → 5.25 (7.0×) | 13.81 → 2.77 (5.0×) | 0.678 → 0.667 (−0.011) |
| JaccardDistanceMatcher | 739.17 → 3.96 (186.9×) | 203.98 → 1.61 (126.4×) | 0.645 → 0.635 (−0.010) |
| SimilarityFlooding | 5.75 → 5.78 (≈) | 1.56 → 1.47 (1.1×) | 0.505 → 0.509 (+0.004) |
| Total | 1048.10 → 29.61 (35.4×) | — | — |

DistributionBased and Jaccard F1 deltas (≤ 0.011) are within run-to-run noise on these matchers.

What changed

Performance

Coma — TF-IDF cosine fast path. TfidfCorpus builds float32 sparse CSR matrices, caches per-column vectorisations on object identity, and memoises pair-level similarities on a symmetric (id, id) key. InstancesCM evaluates InstancesDirect and InstancesAll on the same list — the pair cache collapses both calls into one matmul.
Cupid — WordNet caching. Cached wn.synsets and wn.wup_similarity (symmetric key), plus the English stopword frozenset and the all_lemma_names corpus walk. The cold-path lemma walk used to dominate; it's now paid once per process.
DistributionBased — vectorised quantile histograms. QuantileHistogram.add_values replaces a Python bucket_binary_search loop with a single np.searchsorted + np.bincount over precomputed lower/upper-bound arrays. __slots__ on the histogram, lru_cache on the constant _bucket_distance_matrix(n), and a global ranks pickle cache to avoid re-unpickling per column.
JaccardDistanceMatcher — rapidfuzz cdist. Replaced the per-pair Python loop with rapidfuzz.process.cdist, dispatched to the smaller side (rows × cols favours small rows), with score_cutoff=threshold so rapidfuzz can short-circuit. This is the matcher that moved most in absolute time (−735 s).
Type inference fix. BaseTable.get_data_type now treats pandas "str" / "string" dtypes as text, not as unknown. Free F1 wins for Cupid and SF (which read data_type) and a prerequisite for the Coma accuracy work.
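For reviewers unfamiliar with the np.searchsorted + np.bincount pattern mentioned for DistributionBased, here is a minimal sketch of the idea. It is not the code in QuantileHistogram.add_values: the function names, the upper-bounds-only bucket layout, and the rule that out-of-range values are clamped into the last bucket are all assumptions made for illustration.

```python
import numpy as np


def bucket_counts_loop(values, upper_bounds):
    """Reference version: one Python-level binary search per value."""
    counts = [0] * len(upper_bounds)
    for v in values:
        lo, hi = 0, len(upper_bounds) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if v <= upper_bounds[mid]:
                hi = mid
            else:
                lo = mid + 1
        counts[lo] += 1
    return counts


def bucket_counts_vectorised(values, upper_bounds):
    """Same counts, computed with two NumPy calls instead of a Python loop."""
    values = np.asarray(values, dtype=np.float64)
    upper = np.asarray(upper_bounds, dtype=np.float64)
    # side="left" yields the first index i with value <= upper[i].
    idx = np.searchsorted(upper, values, side="left")
    # Assumption: values above the last bound are clamped into the last bucket.
    idx = np.minimum(idx, len(upper) - 1)
    # bincount tallies how many values landed in each bucket index.
    return np.bincount(idx, minlength=len(upper))


if __name__ == "__main__":
    bounds = [1.0, 2.0, 3.0, 4.0]
    data = [0.5, 1.0, 1.5, 2.0, 3.9, 10.0]
    print(bucket_counts_loop(data, bounds))
    print(bucket_counts_vectorised(data, bounds).tolist())
```

Both functions agree on the sample data. The vectorised form moves the per-value work into compiled NumPy code, which is where a speedup of this kind would come from.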

Coma accuracy (+0.047 F1)

Three small additions, each tested independently against the full bench:

Token Jaccard inside NameCM. New tokens.py splits column names into camelCase / snake_case / digit-run tokens and computes a Jaccard score over them. The score is folded into NameCM with a maximum, so it can only raise a name score and never dilutes the trigram score on well-formed names.
50-entry abbreviation dictionary. addr→address, num→number, approx→approximation, etc. Both the short form and the expansion are kept in the token tuple, so BlkNum matches both BlkNum exactly and block_number via expansion.
Weighted matcher combination. Switched from uniform average to weighted, with InstancesCM getting a 1.3× weight (tuned empirically — 1.5 over-weights, 1.2 under-weights). Schema matchers stay at 1.0.
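To make the token Jaccard and abbreviation expansion concrete, here is a rough sketch under stated assumptions. It is not the contents of tokens.py: the regex, the function names (tokenize, token_jaccard, fold_with_max), the use of a frozenset rather than a tuple, and the four-entry dictionary (an illustrative subset, with blk→block inferred from the BlkNum example) are all hypothetical.

```python
import re

# Illustrative subset only; the PR's dictionary has 50 entries.
ABBREVIATIONS = {
    "addr": "address",
    "num": "number",
    "approx": "approximation",
    "blk": "block",  # assumed entry, inferred from the BlkNum example
}

# Acronym runs, capitalised or lowercase words, and digit runs.
_TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize(name):
    """Split a column name into lowercase tokens, keeping short and long forms."""
    tokens = set()
    for raw in _TOKEN_RE.findall(name):
        token = raw.lower()
        tokens.add(token)
        if token in ABBREVIATIONS:
            tokens.add(ABBREVIATIONS[token])
    return frozenset(tokens)


def token_jaccard(a, b):
    """Jaccard similarity between the token sets of two column names."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fold_with_max(trigram_score, a, b):
    """Combine with max so the token score can raise, but never lower, the result."""
    return max(trigram_score, token_jaccard(a, b))


if __name__ == "__main__":
    print(sorted(tokenize("BlkNum")))
    print(token_jaccard("BlkNum", "block_number"))
    print(fold_with_max(0.2, "BlkNum", "block_number"))
```

In this sketch BlkNum and block_number share the expanded tokens block and number, so the pair scores above zero even though the raw strings have little trigram overlap. Taking the maximum leaves pairs with an already high trigram score unchanged.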

Experiments tried and rejected

Documented here so they don't get re-attempted:

SiblingsCM in build_matchers — +41% Coma time, zero F1 movement (flat schemas have no sibling structure).
Datatype gate as score multiplier — every floor (0.5 / 0.8 / 0.9) regressed Coma F1, because real ground-truth matches frequently span types (varchar IDs ↔ int IDs, varchar dates ↔ date dates).
WordNet third arm in NameCM — −0.037 F1, +87% time. maximum combination doesn't fence off the noise: WordNet's high-scoring false positives on common tokens still win bidirectional selection.

Dead code removal

After the vectorised add_values rewrite the legacy paths are unreachable:

QuantileHistogram.bucket_binary_search
QuantileHistogram.normalize_values
QuantileHistogram.calc_dist_matrix
The 7-tuple branch in process_columns (callers always pass 8-tuples now)

Tests / coverage

15 new targeted tests in tests/test_coverage_gaps.py for the touched paths. Per-file coverage on the previously gap-flagged files:

| File | Before | After |
|---|---|---|
| `coma/similarity/tfidf.py` | 87.50% | 99% |
| `coma/similarity/tokens.py` | (new file) | 100% |
| `cupid/linguistic_matching.py` | 89.23% | 96% |
| `distribution_based/clustering_utils.py` | 70.00% | 98% |
| `distribution_based/column_model.py` | 86.95% | 100% |
| `distribution_based/quantile_histogram.py` | 80.64% | 96% |
| `jaccard_distance/jaccard_distance.py` | 90.47% | 100% |

213 tests pass.

Test plan

pytest -q tests — 213 passed
Full NYU Open Data benchmark — see table above
Coverage report on every touched file
Each accuracy change re-benched independently to catch regressions

codecov · 2026-04-08T06:26:02Z

Codecov Report

❌ Patch coverage is 95.43726% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.97%. Comparing base (a0d488d) to head (effd12d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| valentine/algorithms/cupid/linguistic_matching.py | 90.76% | 6 Missing ⚠️ |
| .../algorithms/distribution_based/clustering_utils.py | 81.25% | 2 Missing and 1 partial ⚠️ |
| ...lgorithms/distribution_based/quantile_histogram.py | 90.00% | 1 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #96      +/-   ##
==========================================
+ Coverage   95.44%   95.97%   +0.52%     
==========================================
  Files          50       51       +1     
  Lines        2351     2435      +84     
  Branches      366      368       +2     
==========================================
+ Hits         2244     2337      +93     
+ Misses         64       61       -3     
+ Partials       43       37       -6

| Flag | Coverage Δ | |
|---|---|---|
| unit | `95.97% <95.43%> (+0.52%)` | ⬆️ |


| Files with missing lines | Coverage Δ | |
|---|---|---|
| valentine/algorithms/coma/coma.py | `100.00% <100.00%> (ø)` | |
| valentine/algorithms/coma/matchers.py | `95.60% <100.00%> (+0.09%)` | ⬆️ |
| valentine/algorithms/coma/similarity/tfidf.py | `98.87% <100.00%> (+3.10%)` | ⬆️ |
| valentine/algorithms/coma/similarity/tokens.py | `100.00% <100.00%> (ø)` | |
| valentine/algorithms/coma/similarity/trigram.py | `94.11% <100.00%> (+1.26%)` | ⬆️ |
| ...tine/algorithms/distribution_based/column_model.py | `100.00% <100.00%> (+7.69%)` | ⬆️ |
| ...lgorithms/distribution_based/distribution_based.py | `99.02% <100.00%> (+0.03%)` | ⬆️ |
| ...ne/algorithms/jaccard_distance/jaccard_distance.py | `100.00% <100.00%> (+3.84%)` | ⬆️ |
| ...e/algorithms/similarity_flooding/string_matcher.py | `88.23% <100.00%> (-5.97%)` | ⬇️ |
| valentine/data_sources/base_table.py | `88.23% <100.00%> (+0.23%)` | ⬆️ |
... and 4 more


improve performance across the board

c91a019

kPsarakis self-assigned this Apr 8, 2026

remove dead code and update tests

3e86b62

kPsarakis changed the title ~~improve performance across the board~~ Performance: 35× speedup across matchers + Coma accuracy improvements Apr 8, 2026

kPsarakis requested a review from chrisk21 April 8, 2026 07:07

apply ruff rules

effd12d

kPsarakis wants to merge 3 commits into master from improve-performance
