fix: use .search() instead of .match() for negative patterns in PruningContentFilter #1808

Open
jnMetaCode wants to merge 1 commit into unclecode:main from jnMetaCode:fix/pruning-negative-pattern-match

@jnMetaCode

Summary

PruningContentFilter._compute_class_id_weight() uses self.negative_patterns.match() (lines 771, 775) to check CSS classes and element IDs against negative patterns. However, .match() only matches at the start of the string, so a class string like "main sidebar-nav" is not caught by a pattern intended to match "sidebar" or "nav" anywhere in it.

The base class RelevantContentFilter.is_excluded() (line 327) correctly uses .search() for the same self.negative_patterns regex. This fix aligns _compute_class_id_weight with that behavior.
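
To illustrate the difference described above, here is a minimal sketch. The regex is a hypothetical stand-in for self.negative_patterns (the real pattern in content_filter_strategy.py may differ); only the standard-library re behavior is being shown.

```python
import re

# Hypothetical stand-in for the filter's negative pattern; the real regex may differ.
negative_patterns = re.compile(r"nav|footer|sidebar", re.IGNORECASE)

classes = "main sidebar-nav"

# .match() only attempts a match at position 0 of the string.
anchored = negative_patterns.match(classes)

# .search() scans the whole string for the first position that matches.
scanned = negative_patterns.search(classes)

print(anchored)          # None
print(scanned.group(0))  # sidebar
```

With .match(), the class string is only flagged when a negative term happens to come first, e.g. "sidebar-nav main".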

Changes

  • content_filter_strategy.py line 771: self.negative_patterns.match(classes) → self.negative_patterns.search(classes)
  • content_filter_strategy.py line 775: self.negative_patterns.match(element_id) → self.negative_patterns.search(element_id)

Impact

Without this fix, negative patterns (e.g., sidebar, nav, footer) fail to penalize elements when the matching substring is not at the start of the class/id string, causing irrelevant content to score higher than it should.
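
The scoring effect can be sketched roughly as follows. This is not the project's code: class_id_weight, NEGATIVE, PENALTY, and the use_search flag are hypothetical names and values chosen for illustration, assuming each negative hit on the class string or the id subtracts a fixed penalty.

```python
import re

# Hypothetical pattern and penalty; the real values in the library may differ.
NEGATIVE = re.compile(r"nav|footer|sidebar", re.IGNORECASE)
PENALTY = 0.5


def class_id_weight(classes, element_id, use_search):
    """Return a (non-positive) weight for an element's class list and id."""
    # Pick the old behavior (.match) or the proposed behavior (.search).
    check = NEGATIVE.search if use_search else NEGATIVE.match
    score = 0.0
    if classes and check(" ".join(classes)):
        score -= PENALTY
    if element_id and check(element_id):
        score -= PENALTY
    return score


# Negative terms appear, but not at the start of either string.
before = class_id_weight(["main", "sidebar-nav"], "page-footer", use_search=False)
after = class_id_weight(["main", "sidebar-nav"], "page-footer", use_search=True)

print(before)  # 0.0
print(after)   # -1.0
```

Under these assumptions, the element escapes both penalties with .match() and receives both with .search().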

…ngContentFilter

Signed-off-by: JiangNan <1394485448@qq.com>