
fix: _fallback_html_strip uses HTMLParser instead of regex, leaks script content #3214

@mrveiss

Context

Discovered during code review of PR #3204 (CodeQL fixes for #3164).

Problem

In autobot-backend/api/knowledge.py, the _fallback_html_strip() function was updated to replace a regex-based HTML tag stripper with a new _TagStripper(HTMLParser) inner class. Two issues:

1. Script/style content leaks into extracted text

The same file already has _HtmlTextExtractor(HTMLParser) (line ~643), which suppresses <script> and <style> tag content via _in_script and _in_style flags. The new _TagStripper class has no equivalent handling, so any script or style text in the HTML is included in the extracted output.
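To make the leak concrete, here is a minimal sketch. NaiveTagStripper and SuppressingExtractor are hypothetical stand-ins for _TagStripper and _HtmlTextExtractor, written from the description above rather than copied from knowledge.py, so the real classes may differ in detail. Only the _in_script and _in_style flag names come from this issue.

```python
from html.parser import HTMLParser


class NaiveTagStripper(HTMLParser):
    """Hypothetical stand-in for _TagStripper: keeps every text node."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # No script/style check, so their bodies are collected as text.
        self.parts.append(data)


class SuppressingExtractor(HTMLParser):
    """Hypothetical stand-in for _HtmlTextExtractor: skips script/style text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_script = False
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        elif tag == "style":
            self._in_style = False

    def handle_data(self, data):
        if not (self._in_script or self._in_style):
            self.parts.append(data)


def extract(parser_cls, html_text):
    """Feed html_text through parser_cls and return whitespace-normalized text."""
    parser = parser_cls()
    parser.feed(html_text)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


SAMPLE = "<p>Hello</p><script>alert('x')</script><style>p{color:red}</style><p>World</p>"
naive_text = extract(NaiveTagStripper, SAMPLE)
safe_text = extract(SuppressingExtractor, SAMPLE)
```

With this sample, naive_text contains the script and style bodies, while safe_text contains only the paragraph text.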

2. Fallback defeats its own purpose

The function docstring says: "Fallback HTML stripping using regex when parser fails." It is called (line ~712) inside an except Exception block when _HtmlTextExtractor.feed() raises. But the fix replaced the regex fallback with another HTMLParser invocation, so if HTMLParser is what failed in the primary path, the fallback will likely fail the same way.
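The shared failure mode can be illustrated with a small sketch. The class names are hypothetical stand-ins, and the failing input (bytes instead of str) is an assumption chosen only because it reliably makes HTMLParser.feed() raise. The actual input that trips the primary path in knowledge.py is not known from this issue.

```python
from html.parser import HTMLParser


class PrimaryExtractor(HTMLParser):
    """Hypothetical stand-in for _HtmlTextExtractor."""


class FallbackStripper(HTMLParser):
    """Hypothetical stand-in for _TagStripper."""


def feed_error(parser_cls, content):
    """Return the exception type raised by feed(), or None if it succeeds."""
    try:
        parser_cls().feed(content)
    except Exception as exc:
        return type(exc)
    return None


# Assumed failing input: feed() expects str, so bytes raises inside HTMLParser.
bad_input = b"<p>bytes, not str</p>"
primary_error = feed_error(PrimaryExtractor, bad_input)
fallback_error = feed_error(FallbackStripper, bad_input)
```

Both subclasses inherit the same feed(), so an input that breaks one breaks the other in the same way, and the fallback adds no resilience.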

Expected behavior

  • The fallback should either reuse _HtmlTextExtractor with error handling, or use a true regex fallback
  • Script/style content must never leak into extracted text
  • The docstring should match the actual implementation
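One possible shape for the "true regex fallback" option is sketched below. This is not the project's implementation: the function name fallback_html_strip and the pattern names are illustrative, and the behavior for unclosed script/style blocks (drop everything to the end of the input) is an assumption. It never touches HTMLParser, removes script and style blocks before stripping the remaining tags, and its docstring describes what it does.

```python
import html
import re

# Complete <script>...</script> and <style>...</style> blocks, any case.
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# An unclosed script/style block: assume everything after it is not text.
_UNCLOSED_RE = re.compile(r"<(script|style)\b[^>]*>.*", re.IGNORECASE | re.DOTALL)
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")


def fallback_html_strip(content: str) -> str:
    """Fallback HTML stripping using regex only, for when the parser fails.

    Script and style content is dropped before tags are removed, so it
    never appears in the returned text.
    """
    text = _SCRIPT_STYLE_RE.sub(" ", content)
    text = _UNCLOSED_RE.sub(" ", text)
    text = _COMMENT_RE.sub(" ", text)
    text = _TAG_RE.sub(" ", text)
    text = html.unescape(text)
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


stripped = fallback_html_strip(
    "<p>Hello</p><SCRIPT type='x'>alert(1)</SCRIPT><style>p{}</style><p>A &amp; B</p>"
)
unclosed = fallback_html_strip("<p>Hi</p><script>var x = 1;")
```

A regex like this is only suitable for best-effort text extraction. It should not be relied on to sanitize HTML that will later be rendered.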

Files

  • autobot-backend/api/knowledge.py: _fallback_html_strip() and _TagStripper inner class
