This module mines robust cross-page positive patch pairs from real Pecha scans.

Inputs:
- Patch metadata parquet (typically `meta/patches.parquet`)
- Patch images on disk (`patches/.../patch_*.png`)
- DINOv2-compatible image encoder (optional projection head checkpoint)

Pipeline:
- Load/clean metadata and apply filters (`ink_ratio_min`, optional `boundary_score_min`).
- Embed patches per `scale_w` with L2-normalized vectors.
- Build a FAISS inner-product index per scale.
- Retrieve filtered neighbors with exclusion rules (same page/line/nearby).
- Keep only mutual-nearest-neighbor candidates.
- Stage 1 (fast): build a per-source shortlist of mutual candidates (no stability/multiscale/signature checks yet).
- Stage 2 (slow): verify only the shortlist with deterministic stability checks and optional multi-scale/signature checks.
- Keep top pairs per source patch.
- Save `mnn_pairs.parquet` and a JSON summary.
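
The retrieval and mutual-nearest-neighbor steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's implementation: it swaps the FAISS index for a brute-force NumPy matrix product (an exact inner-product index such as `faiss.IndexFlatIP` would play that role at scale), it reduces the exclusion rules to "same page" only, and the names `l2_normalize`, `mutual_nearest_pairs`, and `top_k` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def mutual_nearest_pairs(emb, page_ids, top_k=2):
    """Return (i, j, sim) for cross-page pairs that are in each other's top-k.

    Hypothetical helper: brute-force stand-in for a per-scale FAISS search.
    """
    emb = l2_normalize(np.asarray(emb, dtype=np.float32))
    page_ids = np.asarray(page_ids)

    # Inner product on unit vectors (what an inner-product index would score).
    sim = emb @ emb.T

    # Simplified exclusion rule: never match patches on the same page.
    # This also removes self-matches on the diagonal.
    sim[page_ids[:, None] == page_ids[None, :]] = -np.inf

    k = min(top_k, sim.shape[1])
    topk = np.argsort(-sim, axis=1)[:, :k]  # best neighbors first

    pairs = []
    for i in range(sim.shape[0]):
        for j in topk[i]:
            j = int(j)
            if not np.isfinite(sim[i, j]):
                continue  # neighbor was excluded
            # Keep the pair once (i < j) and only if j also ranks i in its top-k.
            if i < j and i in topk[j]:
                pairs.append((i, j, float(sim[i, j])))
    return pairs


# Toy data: two patches on page 0, two near-duplicates on page 1.
emb = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
pages = [0, 0, 1, 1]
for i, j, s in mutual_nearest_pairs(emb, pages, top_k=1):
    print(i, j, round(s, 3))
```

The mutual check is what makes the candidates conservative: a patch that is merely the closest available match for many sources is dropped unless it also ranks that source among its own nearest neighbors.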

Usage:

    python -m pechabridge.cli.mine_mnn_pairs \
      --dataset /path/to/out_dataset \
      --meta /path/to/out_dataset/meta/patches.parquet \
      --out /path/to/out_dataset/meta/mnn_pairs.parquet \
      --config /path/to/configs/mnn_mining.yaml \
      --num-workers 8 \
      --debug_dump 50

`--debug_dump N` writes random pair preview grids to `<dataset>/debug/mnn_pairs/`.

Parquet columns:
- `src_patch_id`, `dst_patch_id`
- `src_doc_id`, `src_page_id`, `src_line_id`, `src_scale_w`
- `dst_doc_id`, `dst_page_id`, `dst_line_id`, `dst_scale_w`
- `sim`
- `rank_src_to_dst`, `rank_dst_to_src`
- `stability_count`, `stability_ratio`
- `multi_scale_ok`
- `notes`

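
As a rough illustration of how these columns might be consumed downstream, the sketch below filters a pairs table by similarity and stability. It assumes that `sim` and `stability_ratio` are numeric and that higher is better; the function name `select_confident_pairs` and both thresholds are hypothetical and not part of the module.

```python
import pandas as pd


def select_confident_pairs(pairs, min_sim=0.8, min_stability_ratio=0.5):
    """Keep pairs passing both thresholds, highest similarity first.

    Hypothetical helper; the thresholds are placeholders, not recommended values.
    """
    mask = (pairs["sim"] >= min_sim) & (pairs["stability_ratio"] >= min_stability_ratio)
    return pairs.loc[mask].sort_values("sim", ascending=False).reset_index(drop=True)


# Small in-memory table using a subset of the output columns.
# On real output this would come from something like:
#   pairs = pd.read_parquet("/path/to/out_dataset/meta/mnn_pairs.parquet")
pairs = pd.DataFrame(
    {
        "src_patch_id": [1, 2, 3],
        "dst_patch_id": [10, 20, 30],
        "sim": [0.95, 0.70, 0.90],
        "stability_ratio": [1.0, 1.0, 0.25],
    }
)
print(select_confident_pairs(pairs))
```

Note that `pd.read_parquet` needs a parquet engine such as `pyarrow` to be installed.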
Summary JSON:
- same path as output parquet with suffix `.summary.json`
- includes counts, sim/stability stats, and top doc/page match sources

Performance notes:
- `performance.two_stage_verify: true` (default) is recommended for large scales.
- For a quick first pass, disable `stability.enabled` and `multiscale.enabled`.
- Reduce `mining.topK` / `mining.mutual_topK` for faster candidate generation.
- `--num-workers` controls source-loop mining threads (and also embedding DataLoader / FAISS threads).
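
One way to picture the "quick first pass" settings is as a small set of overrides layered on a baseline config. The sketch below is only an illustration: it assumes the dotted option names map to nested mappings in the YAML file, the numeric values are placeholders rather than the module's defaults, and `deep_update`, `BASE`, and `QUICK_PASS` are hypothetical names.

```python
import copy


def deep_update(base, overrides):
    """Return a copy of `base` with nested `overrides` merged in."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Assumed layout; the topK values are placeholders, not real defaults.
BASE = {
    "performance": {"two_stage_verify": True},
    "stability": {"enabled": True},
    "multiscale": {"enabled": True},
    "mining": {"topK": 50, "mutual_topK": 20},
}

# Quick first pass: skip the slow checks and shrink the candidate lists.
QUICK_PASS = {
    "stability": {"enabled": False},
    "multiscale": {"enabled": False},
    "mining": {"topK": 20, "mutual_topK": 10},
}

quick_config = deep_update(BASE, QUICK_PASS)
print(quick_config)
```

Keeping the overrides separate from the baseline makes it easy to re-run the full verification later with the original settings.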