fix: _fallback_html_strip uses HTMLParser instead of regex, leaks script content #3214
Description
Context
Discovered during code review of PR #3204 (CodeQL fixes for #3164).
Problem
In autobot-backend/api/knowledge.py, the _fallback_html_strip() function was updated to replace a regex-based HTML tag stripper with a new _TagStripper(HTMLParser) inner class. Two issues:
1. Script/style content leaks into extracted text
The same file already has _HtmlTextExtractor(HTMLParser) (line ~643) which properly suppresses <script> and <style> tag content via _in_script and _in_style flags. The new _TagStripper class does not suppress these tags, so any script or style text in the HTML will be included in the extracted output.
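The leak can be reproduced with a minimal sketch. Both classes below are hypothetical stand-ins written for illustration, not the actual code in knowledge.py: NaiveTagStripper assumes _TagStripper only collects text nodes, and SuppressingExtractor assumes _HtmlTextExtractor gates handle_data on the _in_script and _in_style flags described above.

```python
from html.parser import HTMLParser


class NaiveTagStripper(HTMLParser):
    """Hypothetical stand-in for _TagStripper: keeps every text node."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # No check for the enclosing tag, so script/style bodies are kept.
        self.parts.append(data)


class SuppressingExtractor(HTMLParser):
    """Hypothetical stand-in for _HtmlTextExtractor: drops script/style text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_script = False
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False
        elif tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Only collect text that is outside <script> and <style>.
        if not (self._in_script or self._in_style):
            self.parts.append(data)


def extract(parser_cls, html_text):
    """Feed html_text to a fresh parser and return whitespace-normalized text."""
    parser = parser_cls()
    parser.feed(html_text)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


SAMPLE = (
    "<p>Hello</p><script>alert('x')</script>"
    "<style>p{color:red}</style><p>World</p>"
)
naive_text = extract(NaiveTagStripper, SAMPLE)
safe_text = extract(SuppressingExtractor, SAMPLE)
```

With this sample, naive_text still contains the script and style bodies, while safe_text contains only the paragraph text.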
2. Fallback defeats its own purpose
The function docstring says: "Fallback HTML stripping using regex when parser fails." It is called (line ~712) inside an except Exception block when _HtmlTextExtractor.feed() raises. But the fix replaced the regex fallback with another HTMLParser invocation — if HTMLParser is what failed in the primary path, the fallback will likely fail the same way.
Expected behavior
- The fallback should either reuse _HtmlTextExtractor with error handling, or use a true regex fallback
- Script/style content must never leak into extracted text
- The docstring should match the actual implementation
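One possible shape for the "true regex fallback" option is sketched below. The function name fallback_html_strip and all pattern names are assumptions made for illustration; the actual signature and return type of _fallback_html_strip() are not shown in this issue. The sketch removes script and style blocks (including unclosed ones) before stripping the remaining tags, and it does not use HTMLParser at all. Whether a regex of this shape satisfies the CodeQL checks that motivated PR #3204 is not something this sketch verifies.

```python
import html
import re

# Complete <script>...</script> and <style>...</style> blocks, body included.
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# An opening <script>/<style> with no closing tag: drop everything after it.
_UNCLOSED_RE = re.compile(
    r"<(?:script|style)\b[^>]*>.*\Z", re.IGNORECASE | re.DOTALL
)
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")


def fallback_html_strip(content: str) -> str:
    """Fallback HTML stripping using regex only (hypothetical sketch)."""
    text = _SCRIPT_STYLE_RE.sub(" ", content)
    text = _UNCLOSED_RE.sub(" ", text)
    text = _COMMENT_RE.sub(" ", text)
    text = _TAG_RE.sub(" ", text)
    # Decode entities last so escaped markup such as &lt;b&gt; stays as text.
    text = html.unescape(text)
    return " ".join(text.split())
```

Because the fallback shares no code with the primary parser path, a failure in HTMLParser cannot repeat here, and the docstring now describes what the function does.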
Files
autobot-backend/api/knowledge.py: _fallback_html_strip() and _TagStripper inner class