Performance: 35× speedup across matchers + Coma accuracy improvements #96

Open
kPsarakis wants to merge 3 commits into master from improve-performance


kPsarakis (Member) commented Apr 8, 2026

Summary

Resolves #88

Across-the-board performance and accuracy work on every matcher, plus dead-code removal and coverage tests for the touched paths.

Headline: ~35× wall-clock speedup on the full NYU Open Data benchmark (1048 s → 30 s), with no F1 regression beyond run-to-run noise (largest drop: −0.011) and a +0.047 Coma F1 improvement from targeted accuracy work.

Performance: before / after

Full NYU Open Data benchmark, 10 dataset pairs, single machine, sequential.

| Matcher | Total (s) | Worst pair (s) | Mean F1 |
|---|---|---|---|
| Coma | 74.87 → 8.46 (8.9×) | 24.41 → 2.68 (9.1×) | 0.754 → 0.802 (+0.047) |
| Cupid | 191.45 → 6.17 (31.0×) | 78.17 → 2.02 (38.8×) | 0.480 → 0.495 (+0.015) |
| DistributionBased | 36.86 → 5.25 (7.0×) | 13.81 → 2.77 (5.0×) | 0.678 → 0.667 (−0.011) |
| JaccardDistanceMatcher | 739.17 → 3.96 (186.9×) | 203.98 → 1.61 (126.4×) | 0.645 → 0.635 (−0.010) |
| SimilarityFlooding | 5.75 → 5.78 (≈) | 1.56 → 1.47 (1.1×) | 0.505 → 0.509 (+0.004) |
| Total | 1048.10 → 29.61 (35.4×) | | |

DistributionBased and Jaccard F1 deltas (≤ 0.011) are within run-to-run noise on these matchers.

What changed

Performance

  • Coma — TF-IDF cosine fast path. TfidfCorpus builds float32 sparse CSR matrices, caches per-column vectorisations on object identity, and memoises pair-level similarities on a symmetric (id, id) key. InstancesCM evaluates InstancesDirect and InstancesAll on the same list — the pair cache collapses both calls into one matmul.
  • Cupid — WordNet caching. Cached wn.synsets and wn.wup_similarity (symmetric key), plus the English stopword frozenset and the all_lemma_names corpus walk. The cold-path lemma walk used to dominate; it's now paid once per process.
  • DistributionBased — vectorised quantile histograms. QuantileHistogram.add_values replaces a Python bucket_binary_search loop with a single np.searchsorted + np.bincount over precomputed lower/upper-bound arrays. __slots__ on the histogram, lru_cache on the constant _bucket_distance_matrix(n), and a global ranks pickle cache to avoid re-unpickling per column.
  • JaccardDistanceMatcher — rapidfuzz cdist. Replaced the per-pair Python loop with rapidfuzz.process.cdist, dispatched to the smaller side (rows × cols favours small rows), with score_cutoff=threshold so rapidfuzz can short-circuit. This is the matcher that moved most in absolute time (−735 s).
  • Type inference fix. BaseTable.get_data_type now treats pandas "str" / "string" dtypes as text, not as unknown. Free F1 wins for Cupid and SF (which read data_type) and a prerequisite for the Coma accuracy work.
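
The symmetric-key caching mentioned for Coma's pair similarities and Cupid's wn.wup_similarity can be sketched as below. This is a minimal illustration, not the PR's code: SymmetricCache and char_overlap are hypothetical names, and the key here is built from the (orderable) arguments themselves rather than from object identity as TfidfCorpus is described as doing.

```python
class SymmetricCache:
    """Memoise a symmetric function, f(a, b) == f(b, a), under an order-free key.

    Hypothetical helper for illustration; assumes the arguments are
    hashable and mutually orderable (for example, strings).
    """

    def __init__(self, func):
        self._func = func
        self._cache = {}
        self.misses = 0  # number of times the wrapped function actually ran

    def __call__(self, a, b):
        # Order the pair so (a, b) and (b, a) share a single cache entry.
        key = (a, b) if a <= b else (b, a)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._func(a, b)
        return self._cache[key]


def char_overlap(a: str, b: str) -> float:
    """Toy symmetric similarity: Jaccard over the two character sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


sim = SymmetricCache(char_overlap)
first = sim("street", "road")   # computed
second = sim("road", "street")  # same pair in reverse order: served from the cache
```

With an order-free key, a matcher that scores the same pair in both directions pays for the underlying computation once, which is the effect described for InstancesDirect and InstancesAll sharing one matmul.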

Coma accuracy (+0.047 F1)

Three small additions, each tested independently against the full bench:

  1. Token Jaccard inside NameCM. New tokens.py splits column names on camelCase, snake_case, and digit-run boundaries and computes a Jaccard similarity over the resulting tokens. It is folded into NameCM with maximum, so it can only raise a score and never dilutes the trigram score on well-formed names.
  2. 50-entry abbreviation dictionary. addr→address, num→number, approx→approximation, etc. Both the short form and the expansion are kept in the token tuple, so BlkNum matches both BlkNum exactly and block_number via expansion.
  3. Weighted matcher combination. Switched from uniform average to weighted, with InstancesCM getting a 1.3× weight (tuned empirically — 1.5 over-weights, 1.2 under-weights). Schema matchers stay at 1.0.
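
The three additions above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of tokens.py: the function names, the regular expression, the two-entry ABBREVIATIONS dictionary, and the use of a frozenset instead of a token tuple are all hypothetical. Only the 1.3 weight for InstancesCM and the use of maximum come from the description.

```python
import re

# Hypothetical two-entry stand-in for the PR's 50-entry abbreviation dictionary.
ABBREVIATIONS = {"blk": "block", "num": "number"}

# Alternatives, in order: an acronym run, a capitalised or lower-case word, a digit run.
# Underscores and other separators match nothing and are skipped by findall.
_TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize(name: str) -> frozenset:
    """Split a column name into lower-cased tokens, keeping each short form and its expansion."""
    tokens = set()
    for tok in _TOKEN_RE.findall(name):
        tok = tok.lower()
        tokens.add(tok)
        if tok in ABBREVIATIONS:
            tokens.add(ABBREVIATIONS[tok])
    return frozenset(tokens)


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the two token sets (0.0 if neither name yields tokens)."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def name_similarity(trigram_score: float, a: str, b: str) -> float:
    """Combine with max, so the token score can raise but never lower the trigram score."""
    return max(trigram_score, token_jaccard(a, b))


def weighted_mean(scores: dict, weights: dict) -> float:
    """Weighted average of per-matcher scores; matchers without an entry default to weight 1.0."""
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    return sum(s * weights.get(name, 1.0) for name, s in scores.items()) / total_weight


jaccard = token_jaccard("BlkNum", "block_number")
combined = weighted_mean({"name": 0.5, "instances": 0.9}, {"instances": 1.3})
```

In this sketch BlkNum tokenises to blk, block, num, and number, so it shares two of four tokens with block_number, while an identical name still scores 1.0.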

Experiments tried and rejected

Documented here so they don't get re-attempted:

  • SiblingsCM in build_matchers — +41% Coma time, zero F1 movement (flat schemas have no sibling structure).
  • Datatype gate as score multiplier — every floor (0.5 / 0.8 / 0.9) regressed Coma F1, because real ground-truth matches frequently span types (varchar IDs ↔ int IDs, varchar dates ↔ date dates).
  • WordNet third arm in NameCM — −0.037 F1, +87% time. maximum combination doesn't fence off the noise: WordNet's high-scoring false positives on common tokens still win bidirectional selection.

Dead code removal

After the vectorised add_values rewrite the legacy paths are unreachable:

  • QuantileHistogram.bucket_binary_search
  • QuantileHistogram.normalize_values
  • QuantileHistogram.calc_dist_matrix
  • The 7-tuple branch in process_columns (callers always pass 8-tuples now)

Tests / coverage

15 new targeted tests in tests/test_coverage_gaps.py for the touched paths. Per-file coverage on the previously gap-flagged files:

| File | Before | After |
|---|---|---|
| coma/similarity/tfidf.py | 87.50% | 99% |
| coma/similarity/tokens.py | (new file) | 100% |
| cupid/linguistic_matching.py | 89.23% | 96% |
| distribution_based/clustering_utils.py | 70.00% | 98% |
| distribution_based/column_model.py | 86.95% | 100% |
| distribution_based/quantile_histogram.py | 80.64% | 96% |
| jaccard_distance/jaccard_distance.py | 90.47% | 100% |

213 tests pass.

Test plan

  • pytest -q tests — 213 passed
  • Full NYU Open Data benchmark — see table above
  • Coverage report on every touched file
  • Each accuracy change re-benched independently to catch regressions

kPsarakis self-assigned this Apr 8, 2026

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 95.43726% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.97%. Comparing base (a0d488d) to head (effd12d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| valentine/algorithms/cupid/linguistic_matching.py | 90.76% | 6 Missing ⚠️ |
| .../algorithms/distribution_based/clustering_utils.py | 81.25% | 2 Missing and 1 partial ⚠️ |
| ...lgorithms/distribution_based/quantile_histogram.py | 90.00% | 1 Missing and 2 partials ⚠️ |

@@            Coverage Diff             @@
##           master      #96      +/-   ##
==========================================
+ Coverage   95.44%   95.97%   +0.52%     
==========================================
  Files          50       51       +1     
  Lines        2351     2435      +84     
  Branches      366      368       +2     
==========================================
+ Hits         2244     2337      +93     
+ Misses         64       61       -3     
+ Partials       43       37       -6     
| Flag | Coverage Δ |
|---|---|
| unit | `95.97% <95.43%> (+0.52%)` ⬆️ |


| Files with missing lines | Coverage Δ |
|---|---|
| valentine/algorithms/coma/coma.py | `100.00% <100.00%> (ø)` |
| valentine/algorithms/coma/matchers.py | `95.60% <100.00%> (+0.09%)` ⬆️ |
| valentine/algorithms/coma/similarity/tfidf.py | `98.87% <100.00%> (+3.10%)` ⬆️ |
| valentine/algorithms/coma/similarity/tokens.py | `100.00% <100.00%> (ø)` |
| valentine/algorithms/coma/similarity/trigram.py | `94.11% <100.00%> (+1.26%)` ⬆️ |
| ...tine/algorithms/distribution_based/column_model.py | `100.00% <100.00%> (+7.69%)` ⬆️ |
| ...lgorithms/distribution_based/distribution_based.py | `99.02% <100.00%> (+0.03%)` ⬆️ |
| ...ne/algorithms/jaccard_distance/jaccard_distance.py | `100.00% <100.00%> (+3.84%)` ⬆️ |
| ...e/algorithms/similarity_flooding/string_matcher.py | `88.23% <100.00%> (-5.97%)` ⬇️ |
| valentine/data_sources/base_table.py | `88.23% <100.00%> (+0.23%)` ⬆️ |

... and 4 more

kPsarakis changed the title from "improve performance across the board" to "Performance: 35× speedup across matchers + Coma accuracy improvements" Apr 8, 2026
kPsarakis requested a review from chrisk21 April 8, 2026 07:07

Development

Successfully merging this pull request may close these issues.

Performance: profile and optimize algorithm speed

1 participant