fix: _fallback_html_strip uses HTMLParser instead of regex, leaks script content #3214

## Context

Discovered during code review of PR #3204 (CodeQL fixes for #3164).

## Problem

In `autobot-backend/api/knowledge.py`, `_fallback_html_strip()` was updated to replace its regex-based HTML tag stripper with a new `_TagStripper(HTMLParser)` inner class. This change introduces two issues:

### 1. Script/style content leaks into extracted text

The same file already has `_HtmlTextExtractor(HTMLParser)` (line ~643), which suppresses the content of `<script>` and `<style>` elements via its `_in_script` and `_in_style` flags. The new `_TagStripper` class has no equivalent handling, so any script or style text in the HTML is included in the extracted output.
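
The difference can be reproduced in isolation. The sketch below is illustrative only: `NaiveTagStripper` and `SuppressingExtractor` are hypothetical stand-ins written for this issue, not the actual `_TagStripper` or `_HtmlTextExtractor` code, and the suppressing variant uses a depth counter rather than the `_in_script`/`_in_style` flags.

```python
from html.parser import HTMLParser


class NaiveTagStripper(HTMLParser):
    """Hypothetical stand-in for _TagStripper: keeps every text node."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Script and style bodies arrive here as ordinary data.
        self.parts.append(data)


class SuppressingExtractor(HTMLParser):
    """Hypothetical stand-in for _HtmlTextExtractor: drops script/style text."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only collect text that is outside any script/style element.
        if not self._skip_depth:
            self.parts.append(data)


def extract(parser_cls, html):
    """Feed html through parser_cls and return whitespace-normalized text."""
    parser = parser_cls()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


SAMPLE = (
    "<p>Hello</p><script>alert('x')</script>"
    "<style>p{color:red}</style><p>World</p>"
)

naive_text = extract(NaiveTagStripper, SAMPLE)
clean_text = extract(SuppressingExtractor, SAMPLE)
```

With this sample, `naive_text` contains the `alert('x')` and `p{color:red}` fragments, while `clean_text` contains only the paragraph text.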

### 2. Fallback defeats its own purpose

The function docstring says: *"Fallback HTML stripping using regex when parser fails."* The function is called (line ~712) inside an `except Exception` block when `_HtmlTextExtractor.feed()` raises. The fix, however, replaced the regex fallback with another `HTMLParser` invocation. If `HTMLParser` is what failed in the primary path, the fallback will likely fail the same way on the same input.

## Expected behavior

- The fallback should either reuse `_HtmlTextExtractor` with error handling, or use a true regex fallback
- Script/style content must never leak into extracted text
- The docstring should match the actual implementation
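
One way the "true regex fallback" option could look is sketched below. This is a best-effort illustration, not the project's implementation: the function name `fallback_html_strip` and all pattern names are assumptions, and regex stripping is inherently approximate for malformed HTML. The point is only that script/style blocks are removed before generic tags, and that no `HTMLParser` is involved.

```python
import re
from html import unescape

# Complete <script>...</script> and <style>...</style> blocks, any case.
_SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# An opening <script>/<style> that is never closed: drop through end of input.
_UNCLOSED_RE = re.compile(
    r"<(script|style)\b[^>]*>.*\Z", re.IGNORECASE | re.DOTALL
)
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")


def fallback_html_strip(html: str) -> str:
    """Fallback HTML stripping using regex only (hypothetical sketch)."""
    text = _SCRIPT_STYLE_RE.sub(" ", html)
    text = _UNCLOSED_RE.sub(" ", text)
    text = _COMMENT_RE.sub(" ", text)
    text = _TAG_RE.sub(" ", text)
    # Decode entities last, then collapse runs of whitespace.
    return " ".join(unescape(text).split())


example = fallback_html_strip(
    "<p>Hi &amp; bye</p><SCRIPT type='x'>var a = 1 < 2;</SCRIPT>"
    "<style>p{}</style><p>End"
)
```

Because script and style blocks are removed first, a `<` inside script text (as in `1 < 2` above) cannot confuse the generic tag pattern. The unclosed-block pattern errs on the side of dropping text rather than leaking script content.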

## Files

- `autobot-backend/api/knowledge.py` — `_fallback_html_strip()` and `_TagStripper` inner class

## Related

- PR #3204, issue #3164
