Performance: 35× speedup across matchers + Coma accuracy improvements #96
Open
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #96 +/- ##
==========================================
+ Coverage 95.44% 95.97% +0.52%
==========================================
Files 50 51 +1
Lines 2351 2435 +84
Branches 366 368 +2
==========================================
+ Hits 2244 2337 +93
+ Misses 64 61 -3
+ Partials 43 37 -6
Summary
Resolves #88
Across-the-board performance and accuracy work on every matcher, plus dead-code removal and coverage tests for the touched paths.
Headline: ~35× wall-clock speedup on the full NYU Open Data benchmark (1048s → 30s) with no F1 regressions and a +0.047 Coma F1 improvement from targeted accuracy work.
Performance: before / after
Full NYU Open Data benchmark, 10 dataset pairs, single machine, sequential.
DistributionBased and Jaccard F1 deltas (≤ 0.011) are within run-to-run noise on these matchers.
What changed
Performance
- `TfidfCorpus` builds float32 sparse CSR matrices, caches per-column vectorisations on object identity, and memoises pair-level similarities on a symmetric `(id, id)` key.
- `InstancesCM` evaluates `InstancesDirect` and `InstancesAll` on the same list, so the pair cache collapses both calls into one matmul.
- Caches `wn.synsets` and `wn.wup_similarity` (symmetric key), plus the English stopword frozenset and the `all_lemma_names` corpus walk. The cold-path lemma walk used to dominate; it's now paid once per process.
- `QuantileHistogram.add_values` replaces a Python `bucket_binary_search` loop with a single `np.searchsorted` + `np.bincount` over precomputed lower/upper-bound arrays.
- `__slots__` on the histogram, `lru_cache` on the constant `_bucket_distance_matrix(n)`, and a global ranks pickle cache to avoid re-unpickling per column.
- Uses `rapidfuzz.process.cdist`, dispatched to the smaller side (rows × cols favours small rows), with `score_cutoff=threshold` so rapidfuzz can short-circuit. This is the matcher that moved most in absolute time (−735 s).
- `BaseTable.get_data_type` now treats pandas `"str"` / `"string"` dtypes as text, not as unknown. Free F1 wins for Cupid and SF (which read `data_type`) and a prerequisite for the Coma accuracy work.

Coma accuracy (+0.047 F1)
Three small additions, each tested independently against the full bench:
- `NameCM`: new `tokens.py` splits column names into camelCase / snake_case / digit-run tokens and computes a Jaccard similarity over them. Folded into `NameCM` with `maximum` so it can only help and never dilutes the trigram score on well-formed names.
- Abbreviations are expanded (`addr` → `address`, `num` → `number`, `approx` → `approximation`, etc.). Both the short form and the expansion are kept in the token tuple, so `BlkNum` matches both `BlkNum` exactly and `block_number` via expansion.
- Switched from `average` to `weighted`, with `InstancesCM` getting a 1.3× weight (tuned empirically: 1.5 over-weights, 1.2 under-weights). Schema matchers stay at 1.0.

Experiments tried and rejected
Documented here so they don't get re-attempted:
- `build_matchers`: +41% Coma time, zero F1 movement (flat schemas have no sibling structure).
- `NameCM`: −0.037 F1, +87% time. The `maximum` combination doesn't fence off the noise: WordNet's high-scoring false positives on common tokens still win bidirectional selection.

Dead code removal
After the vectorised `add_values` rewrite the legacy paths are unreachable:
- `QuantileHistogram.bucket_binary_search`
- `QuantileHistogram.normalize_values`
- `QuantileHistogram.calc_dist_matrix`
- `process_columns` (callers always pass 8-tuples now)

Tests / coverage
15 new targeted tests in `tests/test_coverage_gaps.py` for the touched paths. Per-file coverage on the previously gap-flagged files:
- `coma/similarity/tfidf.py`
- `coma/similarity/tokens.py`
- `cupid/linguistic_matching.py`
- `distribution_based/clustering_utils.py`
- `distribution_based/column_model.py`
- `distribution_based/quantile_histogram.py`
- `jaccard_distance/jaccard_distance.py`

213 tests pass.
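
As an illustration of the kind of equivalence check that can guard a rewrite like the vectorised `add_values`, the sketch below compares a per-value binary-search loop against a single `np.searchsorted` + `np.bincount` pass. Both function names, the use of upper-bound edges only, and the choice to clamp overflow into the last bucket are assumptions made for this example; none of it is taken from the repository's code.

```python
import bisect

import numpy as np


def bucket_counts_loop(values, upper_bounds):
    """Scalar reference: one binary search per value (hypothetical stand-in)."""
    counts = [0] * len(upper_bounds)
    for v in values:
        # Index of the first bucket whose upper bound is >= v.
        idx = bisect.bisect_left(upper_bounds, v)
        # Assumption: values above the last bound fall into the last bucket.
        idx = min(idx, len(upper_bounds) - 1)
        counts[idx] += 1
    return np.asarray(counts)


def bucket_counts_vectorised(values, upper_bounds):
    """Same bucketing in one searchsorted + bincount pass."""
    idx = np.searchsorted(upper_bounds, values, side="left")
    idx = np.minimum(idx, len(upper_bounds) - 1)
    return np.bincount(idx, minlength=len(upper_bounds))


upper_bounds = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5, 2.5, 3.0, 9.0]

loop_counts = bucket_counts_loop(values, upper_bounds)
fast_counts = bucket_counts_vectorised(values, upper_bounds)
assert np.array_equal(loop_counts, fast_counts)
```

`side="left"` in `np.searchsorted` follows the same convention as `bisect.bisect_left`, which is what lets the two paths agree on values that sit exactly on a bucket edge.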
Test plan
`pytest -q tests`: 213 passed
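
The token-level name comparison described under "Coma accuracy" can be sketched roughly as follows. This is a minimal illustration, not the contents of `tokens.py`: the regular expression, the function names, the three-entry abbreviation table (limited to the examples named above), and the trigram Jaccard used as a stand-in for the existing trigram score are all assumptions.

```python
import re

# Hypothetical table; only the abbreviations named in this PR are listed.
ABBREVIATIONS = {"addr": "address", "num": "number", "approx": "approximation"}

# Acronym runs, capitalised or lowercase words, and digit runs.
_TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize(name):
    """Split a column name into lowercase tokens, keeping short form and expansion."""
    tokens = []
    for tok in _TOKEN_RE.findall(name):
        tok = tok.lower()
        tokens.append(tok)
        expansion = ABBREVIATIONS.get(tok)
        if expansion is not None:
            tokens.append(expansion)
    return tuple(tokens)


def jaccard(a, b):
    """Jaccard similarity of two collections, 0.0 when both are empty."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def trigrams(name):
    """Character trigrams of the lowercased name (simplified stand-in)."""
    s = name.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def name_similarity(a, b):
    """Take the maximum so the token score can raise but never lower the trigram score."""
    trigram_score = jaccard(trigrams(a), trigrams(b))
    token_score = jaccard(tokenize(a), tokenize(b))
    return max(trigram_score, token_score)


print(tokenize("BlkNum"))                         # ('blk', 'num', 'number')
print(name_similarity("BlkNum", "block_number"))  # 0.25
```

In this sketch `BlkNum` and `block_number` share only the expanded token `number`, so the token score is 1/4, still above the trigram score for the same pair; a fuller abbreviation table would raise it further.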