Filter repeating running heads and feet before segmentation #616
de-code wants to merge 3 commits into
Related to eLifePathways/ScienceBeam2.0#61.

Add a pre-segmentation noise filter that detects layout blocks repeating at the top or bottom of pages across a document (running heads and running feet), based on block position and cross-page text repetition. Detected blocks are excluded from the segmentation model input and preserved in the output XML as <note type="running-head"> / <note type="running-foot"> elements for auditability.

Enabled by default via noise_filter_enabled in config.yml.
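For reviewers who want a concrete picture of the approach described above, here is a minimal sketch of position plus cross-page repetition detection. It is not the code in this PR: `Block`, `normalize`, `find_running_blocks`, `to_note`, and the `margin` and `min_page_fraction` thresholds are all hypothetical names and values chosen for illustration, and the digit masking in `normalize` is an assumption about how changing page numbers might be handled.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple
from xml.etree import ElementTree as ET


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block; coordinates are fractions of page height."""
    page: int        # zero-based page index
    y_top: float     # 0.0 is the top of the page
    y_bottom: float  # 1.0 is the bottom of the page
    text: str


def normalize(text: str) -> str:
    # Collapse whitespace, lowercase, and mask digit runs so that
    # "Page 3" and "Page 4" compare equal (an assumption of this sketch).
    return re.sub(r"\d+", "#", " ".join(text.lower().split()))


def find_running_blocks(
    blocks: List[Block],
    page_count: int,
    margin: float = 0.1,             # assumed height of the top/bottom zones
    min_page_fraction: float = 0.5,  # assumed share of pages that must repeat
) -> Dict[Block, str]:
    """Map each detected block to 'running-head' or 'running-foot'."""
    pages_by_key: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    candidates: List[Tuple[Block, Tuple[str, str]]] = []
    for block in blocks:
        if block.y_bottom <= margin:
            zone = "running-head"
        elif block.y_top >= 1.0 - margin:
            zone = "running-foot"
        else:
            continue  # blocks in the page body are never candidates
        key = (zone, normalize(block.text))
        pages_by_key[key].add(block.page)
        candidates.append((block, key))
    # Require the same normalized text in the same zone on enough pages.
    min_pages = max(2, int(min_page_fraction * page_count))
    return {
        block: key[0]
        for block, key in candidates
        if len(pages_by_key[key]) >= min_pages
    }


def to_note(block: Block, note_type: str) -> ET.Element:
    # Keep the filtered text in the output as <note type="..."> for auditing.
    note = ET.Element("note", {"type": note_type})
    note.text = block.text
    return note
```

Blocks returned by `find_running_blocks` would be dropped from the segmentation input and serialized with `to_note`; everything else passes through unchanged. Requiring both a margin position and repetition across pages is what keeps a one-off title at the top of the first page, or a repeated phrase in the body, from being filtered.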
ScienceBeam Parser Evaluation

| Corpus | grobid 0.9.0-crf | sciencebeam-parser:main-5b07693a-20260527.2200 | sciencebeam-parser:pr-616-e32a7154-20260527.2242 |
| --- | --- | --- | --- |
| Overall (59 docs across 6 corpora) | 60 docs | 59 docs | 59 docs |
| biorxiv (9 docs) | 10 docs | 9 docs | 9 docs |
| ore (10 docs) | 10 docs | 10 docs | 10 docs |
| pkp (10 docs) | 10 docs | 10 docs | 10 docs |
| scielo_br (10 docs) | 10 docs | 10 docs | 10 docs |
| scielo_mx (10 docs) | 10 docs | 10 docs | 10 docs |
| scielo_preprints-jats (10 docs) | 10 docs | 10 docs | 10 docs |