Filter repeating running heads and feet before segmentation #616

Draft
de-code wants to merge 3 commits into main from filter-header-footer
@de-code
Collaborator

@de-code de-code commented May 27, 2026

Related to https://github.com/eLifePathways/ScienceBeam2.0/issues/61

Add a pre-segmentation noise filter that detects layout blocks whose text repeats at the top or bottom of pages across a document (running heads, running feet), using position and cross-page text repetition. Detected blocks are excluded from the segmentation model input and preserved in the output XML as <note type="running-head"> / <note type="running-foot"> elements for auditability.

Enabled by default via noise_filter_enabled in config.yml.
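
To make the description above concrete, here is a minimal sketch of how position plus cross-page repetition could flag running heads and feet. It is an illustration only, not the code in this PR: the `Block` class, the `margin` and `min_pages` thresholds, and the whitespace/case normalization are all assumptions introduced for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical layout block; not the parser's actual data model."""
    page: int        # zero-based page index
    y_center: float  # vertical centre, 0.0 = top of page, 1.0 = bottom
    text: str


def zone_of(block: Block, margin: float) -> Optional[str]:
    """Return a note type for blocks in the top or bottom margin, else None."""
    if block.y_center <= margin:
        return "running-head"
    if block.y_center >= 1.0 - margin:
        return "running-foot"
    return None


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivially different repeats match."""
    return " ".join(text.split()).casefold()


def find_running_blocks(
    blocks: List[Block], min_pages: int = 3, margin: float = 0.1
) -> Dict[Block, str]:
    """Map each block judged to be a running head or foot to its note type."""
    # First pass: record on which pages each (zone, text) pair occurs.
    pages_by_key: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    for block in blocks:
        zone = zone_of(block, margin)
        key_text = normalize(block.text)
        if zone is not None and key_text:
            pages_by_key[(zone, key_text)].add(block.page)

    # Second pass: flag margin blocks whose text repeats on enough pages.
    flagged: Dict[Block, str] = {}
    for block in blocks:
        zone = zone_of(block, margin)
        key_text = normalize(block.text)
        if zone is None or not key_text:
            continue
        if len(pages_by_key[(zone, key_text)]) >= min_pages:
            flagged[block] = zone
    return flagged


blocks = [
    Block(0, 0.03, "Journal of Examples"),
    Block(0, 0.50, "Introduction text"),
    Block(1, 0.03, "Journal of  Examples"),
    Block(1, 0.50, "Methods text"),
    Block(2, 0.03, "journal of examples"),
    Block(2, 0.97, "A one-off footnote"),
]
noise = find_running_blocks(blocks)
kept = [b for b in blocks if b not in noise]  # what the segmenter would see
```

In this sketch the three top-of-page blocks are flagged as `running-head`, while the single bottom-of-page block is kept because it does not repeat. Requiring both a margin position and repetition across pages is what keeps one-off marginal content and repeated body text out of the filter.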
@de-code de-code changed the title from "Filter running headers, footers and page numbers before segmentation" to "Filter repeating running heads and feet before segmentation" May 27, 2026
@github-actions

github-actions Bot commented May 27, 2026

ScienceBeam Parser Evaluation

Overall (59 docs across 6 corpora)

grobid 0.9.0-crf: 60 docs | sciencebeam-parser:main-5b07693a-20260527.2200: 59 docs | sciencebeam-parser:pr-616-e32a7154-20260527.2242: 59 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-5b07693a-20260527.2200 | sciencebeam-parser:pr-616-e32a7154-20260527.2242 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-5b07693a-20260527.2200 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.504 | 0.453 | 0.406 | -0.099 | -0.047 |
| title (levenshtein) | string | 0.643 | 0.509 | 0.436 | -0.207 | -0.073 |
| title (edit_sim) | string | 0.639 | 0.605 | 0.575 | -0.064 | -0.030 |
| abstract (levenshtein) | string | 0.642 | 0.516 | 0.622 | -0.020 | +0.106 |
| abstract (edit_sim) | string | 0.662 | 0.546 | 0.624 | -0.038 | +0.078 |
| author_full_names (levenshtein) | partial_ulist | 0.701 | 0.679 | 0.683 | -0.019 | +0.004 |
| author_full_names (edit_sim) | partial_ulist | 0.717 | 0.704 | 0.709 | -0.008 | +0.006 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.481 | 0.494 | +0.494 | +0.013 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.532 | 0.551 | +0.551 | +0.019 |
| keywords (levenshtein) | partial_ulist | 0.500 | 0.000 | 0.000 | -0.500 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.457 | 0.000 | 0.000 | -0.457 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.223 | 0.295 | 0.278 | +0.055 | -0.017 |
| body_section_titles (edit_sim) | partial_list | 0.224 | 0.298 | 0.287 | +0.063 | -0.011 |
| acknowledgement (levenshtein) | string | 0.264 | 0.303 | 0.483 | +0.219 | +0.180 |
| acknowledgement (edit_sim) | string | 0.374 | 0.418 | 0.479 | +0.105 | +0.061 |
| first_reference_text (levenshtein) | string | 0.000 | 0.386 | 0.396 | +0.396 | +0.010 |
| first_reference_text (edit_sim) | string | 0.000 | 0.554 | 0.568 | +0.568 | +0.014 |
| reference_title (levenshtein) | partial_list | 0.282 | 0.298 | 0.460 | +0.177 | +0.162 |
| reference_title (edit_sim) | partial_list | 0.306 | 0.321 | 0.453 | +0.147 | +0.132 |
| reference_doi (levenshtein) | partial_ulist | 0.546 | 0.315 | 0.339 | -0.206 | +0.025 |
| reference_doi (edit_sim) | partial_ulist | 0.448 | 0.324 | 0.348 | -0.101 | +0.024 |

biorxiv (9 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-5b07693a-20260527.2200: 9 docs | sciencebeam-parser:pr-616-e32a7154-20260527.2242: 9 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-5b07693a-20260527.2200 | sciencebeam-parser:pr-616-e32a7154-20260527.2242 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-5b07693a-20260527.2200 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.889 | 0.941 | 0.941 | +0.052 | +0.000 |
| title (levenshtein) | string | 0.947 | 0.941 | 0.941 | -0.006 | +0.000 |
| title (edit_sim) | string | 0.939 | 0.944 | 0.944 | +0.005 | +0.000 |
| abstract (levenshtein) | string | 0.947 | 0.364 | 0.875 | -0.072 | +0.511 |
| abstract (edit_sim) | string | 0.947 | 0.541 | 0.919 | -0.028 | +0.378 |
| author_full_names (levenshtein) | partial_ulist | 0.970 | 0.962 | 0.962 | -0.008 | +0.000 |
| author_full_names (edit_sim) | partial_ulist | 0.933 | 0.926 | 0.926 | -0.007 | +0.000 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.907 | 0.962 | +0.962 | +0.054 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.891 | 0.958 | +0.958 | +0.067 |
| keywords (levenshtein) | partial_ulist | 0.901 | 0.000 | 0.000 | -0.901 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.907 | 0.000 | 0.000 | -0.907 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.516 | 0.819 | 0.830 | +0.314 | +0.011 |
| body_section_titles (edit_sim) | partial_list | 0.472 | 0.764 | 0.782 | +0.310 | +0.017 |
| acknowledgement (levenshtein) | string | 0.750 | 0.875 | 0.941 | +0.191 | +0.066 |
| acknowledgement (edit_sim) | string | 0.726 | 0.871 | 0.917 | +0.191 | +0.046 |
| first_reference_text (levenshtein) | string | 0.000 | 0.875 | 0.941 | +0.941 | +0.066 |
| first_reference_text (edit_sim) | string | 0.000 | 0.870 | 0.961 | +0.961 | +0.090 |
| reference_title (levenshtein) | partial_list | 0.766 | 0.291 | 0.891 | +0.125 | +0.600 |
| reference_title (edit_sim) | partial_list | 0.727 | 0.343 | 0.850 | +0.123 | +0.507 |
| reference_doi (levenshtein) | partial_ulist | 0.954 | 0.860 | 0.950 | -0.004 | +0.090 |
| reference_doi (edit_sim) | partial_ulist | 0.871 | 0.775 | 0.880 | +0.009 | +0.104 |

ore (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-5b07693a-20260527.2200: 10 docs | sciencebeam-parser:pr-616-e32a7154-20260527.2242: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-5b07693a-20260527.2200 | sciencebeam-parser:pr-616-e32a7154-20260527.2242 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-5b07693a-20260527.2200 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.462 | 1.000 | 1.000 | +0.538 | +0.000 |
| title (levenshtein) | string | 0.462 | 1.000 | 1.000 | +0.538 | +0.000 |
| title (edit_sim) | string | 0.547 | 1.000 | 1.000 | +0.453 | +0.000 |
| abstract (levenshtein) | string | 0.571 | 0.571 | 0.571 | +0.000 | +0.000 |
| abstract (edit_sim) | string | 0.680 | 0.600 | 0.618 | -0.062 | +0.018 |
| author_full_names (levenshtein) | partial_ulist | 0.757 | 0.897 | 0.938 | +0.182 | +0.041 |
| author_full_names (edit_sim) | partial_ulist | 0.757 | 0.898 | 0.939 | +0.182 | +0.041 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.805 | 0.824 | +0.824 | +0.019 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.802 | 0.821 | +0.821 | +0.019 |
| keywords (levenshtein) | partial_ulist | 0.431 | 0.000 | 0.000 | -0.431 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.395 | 0.000 | 0.000 | -0.395 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.276 | 0.101 | 0.107 | -0.169 | +0.006 |
| body_section_titles (edit_sim) | partial_list | 0.301 | 0.189 | 0.195 | -0.106 | +0.006 |
| acknowledgement (levenshtein) | string | 0.833 | 1.000 | 1.000 | +0.167 | +0.000 |
| acknowledgement (edit_sim) | string | 0.888 | 1.000 | 1.000 | +0.112 | +0.000 |
| first_reference_text (levenshtein) | string | 0.000 | 0.462 | 0.462 | +0.462 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.739 | 0.723 | +0.723 | -0.016 |
| reference_title (levenshtein) | partial_list | 0.237 | 0.424 | 0.536 | +0.299 | +0.112 |
| reference_title (edit_sim) | partial_list | 0.270 | 0.439 | 0.521 | +0.251 | +0.082 |
| reference_doi (levenshtein) | partial_ulist | 0.681 | 0.016 | 0.019 | -0.662 | +0.003 |
| reference_doi (edit_sim) | partial_ulist | 0.489 | 0.004 | 0.004 | -0.485 | -0.000 |

pkp (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-5b07693a-20260527.2200: 10 docs | sciencebeam-parser:pr-616-e32a7154-20260527.2242: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-5b07693a-20260527.2200 | sciencebeam-parser:pr-616-e32a7154-20260527.2242 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-5b07693a-20260527.2200 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.571 | 0.182 | 0.182 | -0.390 | +0.000 |
| title (levenshtein) | string | 0.667 | 0.182 | 0.182 | -0.485 | +0.000 |
| title (edit_sim) | string | 0.564 | 0.317 | 0.317 | -0.247 | +0.000 |
| abstract (levenshtein) | string | 0.824 | 0.889 | 0.889 | +0.065 | +0.000 |
| abstract (edit_sim) | string | 0.744 | 0.730 | 0.730 | -0.014 | +0.000 |
| author_full_names (levenshtein) | partial_ulist | 0.853 | 0.831 | 0.831 | -0.022 | +0.000 |
| author_full_names (edit_sim) | partial_ulist | 0.843 | 0.845 | 0.845 | +0.002 | +0.000 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.522 | 0.489 | +0.489 | -0.033 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.438 | 0.438 | +0.438 | +0.000 |
| keywords (levenshtein) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| body_section_titles (edit_sim) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| acknowledgement (levenshtein) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| acknowledgement (edit_sim) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| first_reference_text (levenshtein) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| reference_title (levenshtein) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| reference_title (edit_sim) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| reference_doi (levenshtein) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| reference_doi (edit_sim) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |

scielo_br (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-5b07693a-20260527.2200: 10 docs | sciencebeam-parser:pr-616-e32a7154-20260527.2242: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-5b07693a-20260527.2200 | sciencebeam-parser:pr-616-e32a7154-20260527.2242 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-5b07693a-20260527.2200 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.462 | 0.462 | 0.182 | -0.280 | -0.280 |
| title (levenshtein) | string | 0.462 | 0.462 | 0.182 | -0.280 | -0.280 |
| title (edit_sim) | string | 0.483 | 0.479 | 0.361 | -0.123 | -0.118 |
| abstract (levenshtein) | string | 0.714 | 0.462 | 0.500 | -0.214 | +0.038 |
| abstract (edit_sim) | string | 0.620 | 0.477 | 0.526 | -0.094 | +0.049 |
| author_full_names (levenshtein) | partial_ulist | 0.667 | 0.571 | 0.520 | -0.147 | -0.051 |
| author_full_names (edit_sim) | partial_ulist | 0.708 | 0.645 | 0.592 | -0.116 | -0.053 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.348 | 0.307 | +0.307 | -0.041 |
| keywords (levenshtein) | partial_ulist | 0.429 | 0.000 | 0.000 | -0.429 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.387 | 0.000 | 0.000 | -0.387 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.247 | 0.375 | 0.263 | +0.017 | -0.112 |
| body_section_titles (edit_sim) | partial_list | 0.242 | 0.399 | 0.294 | +0.052 | -0.105 |
| acknowledgement (levenshtein) | string | 0.000 | 0.000 | 1.000 | +1.000 | +1.000 |
| acknowledgement (edit_sim) | string | 0.632 | 0.683 | 1.000 | +0.368 | +0.317 |
| first_reference_text (levenshtein) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.385 | 0.384 | +0.384 | -0.001 |
| reference_title (levenshtein) | partial_list | 0.147 | 0.379 | 0.313 | +0.166 | -0.066 |
| reference_title (edit_sim) | partial_list | 0.248 | 0.423 | 0.360 | +0.111 | -0.064 |
| reference_doi (levenshtein) | partial_ulist | 0.889 | 0.333 | 0.333 | -0.556 | +0.000 |
| reference_doi (edit_sim) | partial_ulist | 0.671 | 0.538 | 0.538 | -0.133 | +0.000 |

scielo_mx (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-5b07693a-20260527.2200: 10 docs | sciencebeam-parser:pr-616-e32a7154-20260527.2242: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-5b07693a-20260527.2200 | sciencebeam-parser:pr-616-e32a7154-20260527.2242 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-5b07693a-20260527.2200 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.182 | 0.182 | 0.182 | -0.000 | +0.000 |
| title (levenshtein) | string | 0.571 | 0.333 | 0.182 | -0.390 | -0.152 |
| title (edit_sim) | string | 0.589 | 0.393 | 0.335 | -0.254 | -0.058 |
| abstract (levenshtein) | string | 0.333 | 0.333 | 0.462 | +0.128 | +0.128 |
| abstract (edit_sim) | string | 0.524 | 0.439 | 0.492 | -0.032 | +0.053 |
| author_full_names (levenshtein) | partial_ulist | 0.389 | 0.323 | 0.357 | -0.032 | +0.035 |
| author_full_names (edit_sim) | partial_ulist | 0.398 | 0.332 | 0.376 | -0.021 | +0.045 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.168 | 0.216 | +0.216 | +0.047 |
| keywords (levenshtein) | partial_ulist | 0.532 | 0.000 | 0.000 | -0.532 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.447 | 0.000 | 0.000 | -0.447 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| body_section_titles (edit_sim) | partial_list | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| acknowledgement (levenshtein) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| acknowledgement (edit_sim) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| first_reference_text (levenshtein) | string | 0.000 | 0.364 | 0.364 | +0.364 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.575 | 0.590 | +0.590 | +0.014 |
| reference_title (levenshtein) | partial_list | 0.368 | 0.198 | 0.405 | +0.038 | +0.208 |
| reference_title (edit_sim) | partial_list | 0.361 | 0.242 | 0.390 | +0.029 | +0.149 |
| reference_doi (levenshtein) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| reference_doi (edit_sim) | partial_ulist | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |

scielo_preprints-jats (10 docs)

grobid 0.9.0-crf: 10 docs | sciencebeam-parser:main-5b07693a-20260527.2200: 10 docs | sciencebeam-parser:pr-616-e32a7154-20260527.2242: 10 docs

| Field (method) | Type | grobid 0.9.0-crf | sciencebeam-parser:main-5b07693a-20260527.2200 | sciencebeam-parser:pr-616-e32a7154-20260527.2242 | Δ grobid 0.9.0-crf | Δ sciencebeam-parser:main-5b07693a-20260527.2200 |
| --- | --- | --- | --- | --- | --- | --- |
| title (exact) | string | 0.462 | 0.000 | 0.000 | -0.462 | +0.000 |
| title (levenshtein) | string | 0.750 | 0.182 | 0.182 | -0.568 | +0.000 |
| title (edit_sim) | string | 0.714 | 0.533 | 0.533 | -0.180 | +0.000 |
| abstract (levenshtein) | string | 0.462 | 0.462 | 0.462 | +0.000 | +0.000 |
| abstract (edit_sim) | string | 0.457 | 0.486 | 0.488 | +0.030 | +0.001 |
| author_full_names (levenshtein) | partial_ulist | 0.574 | 0.517 | 0.517 | -0.057 | +0.000 |
| author_full_names (edit_sim) | partial_ulist | 0.663 | 0.598 | 0.598 | -0.064 | +0.000 |
| affiliation_text (levenshtein) | partial_ulist | 0.000 | 0.696 | 0.737 | +0.737 | +0.041 |
| affiliation_text (edit_sim) | partial_ulist | 0.000 | 0.580 | 0.606 | +0.606 | +0.026 |
| keywords (levenshtein) | partial_ulist | 0.706 | 0.000 | 0.000 | -0.706 | +0.000 |
| keywords (edit_sim) | partial_ulist | 0.608 | 0.000 | 0.000 | -0.608 | +0.000 |
| body_section_titles (levenshtein) | partial_list | 0.302 | 0.528 | 0.523 | +0.221 | -0.005 |
| body_section_titles (edit_sim) | partial_list | 0.328 | 0.480 | 0.499 | +0.171 | +0.019 |
| acknowledgement (levenshtein) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| acknowledgement (edit_sim) | string | 0.000 | 0.000 | 0.000 | +0.000 | +0.000 |
| first_reference_text (levenshtein) | string | 0.000 | 0.667 | 0.667 | +0.667 | +0.000 |
| first_reference_text (edit_sim) | string | 0.000 | 0.787 | 0.789 | +0.789 | +0.003 |
| reference_title (levenshtein) | partial_list | 0.177 | 0.495 | 0.657 | +0.480 | +0.162 |
| reference_title (edit_sim) | partial_list | 0.231 | 0.482 | 0.637 | +0.406 | +0.155 |
| reference_doi (levenshtein) | partial_ulist | 0.751 | 0.734 | 0.795 | +0.044 | +0.061 |
| reference_doi (edit_sim) | partial_ulist | 0.659 | 0.669 | 0.717 | +0.057 | +0.048 |
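
For reading the Δ columns: the values in the tables above are consistent with "PR score minus baseline score", computed once against grobid and once against main. The sketch below recomputes that difference from the displayed biorxiv scores. The function name `score_deltas` and the dictionary layout are assumptions for illustration, not the evaluation job's actual code.

```python
from typing import Dict


def score_deltas(
    candidate: Dict[str, float], baseline: Dict[str, float]
) -> Dict[str, float]:
    """Return candidate minus baseline for every field present in both."""
    return {
        field: round(candidate[field] - baseline[field], 3)
        for field in candidate
        if field in baseline
    }


# Displayed biorxiv scores for main (baseline) and this PR (candidate).
main_scores = {
    "title (exact)": 0.941,
    "abstract (levenshtein)": 0.364,
    "reference_title (levenshtein)": 0.291,
}
pr_scores = {
    "title (exact)": 0.941,
    "abstract (levenshtein)": 0.875,
    "reference_title (levenshtein)": 0.891,
}

deltas = score_deltas(pr_scores, main_scores)
for field, delta in deltas.items():
    print(f"{field}: {delta:+.3f}")
```

Recomputing from the rounded table values can differ from the reported Δ in the last digit. For example, the overall title (exact) row gives 0.406 - 0.504 = -0.098 while the table shows -0.099, which suggests the reported deltas were taken from unrounded scores.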

@de-code de-code deployed to benchmark May 27, 2026 22:42 — with GitHub Actions Active