Context
Currently, the Chrome Side Panel extension for summarize extracts main text from the rendered DOM using Readability, with media URLs (YouTube, Twitter, direct audio/video, podcasts) always routed to the backend daemon (URL mode/extractOnly) via shouldPreferUrlMode(url). Reddit threads could be more robustly extracted through Reddit's .json API, but the extension only leverages this in CLI if users manually add .json.
Problem
- Page content extraction for Reddit threads is unstable/noisy using Readability.
- Reddit exposes a solid
.json API that returns well-structured data for both the initial post and comments.
- We want to improve "in-browser/extension" summarization experience for Reddit, while keeping all media-specific and
preferUrl logic unaffected for all other URLs.
Hard Requirements
DO NOT alter the preferUrl logic, its flow, or cache rules.
- If
shouldPreferUrlMode(url) is true for a URL (e.g., YouTube/Twitter/media/podcast):
- The router must skip all new extractors and run only the original url-daemon extraction logic. No regression or behavioral change is allowed here.
- Only for
preferUrl === false URLs should the new extractors run.
Implementation Scope
A) Modular Extractor Router for extension background
- Define an "Extractor" interface:
match(ctx) and extract(ctx) → ExtractorResult | null
- Routing logic:
- If preferUrl: run url-daemon, do NOT try new extractors (and log this fact).
- Otherwise: try
reddit-thread => page-readability => url-daemon, in order, falling back as needed.
- Maintain all existing caching logic; router outputs are cached for summarize/chat.
B) RedditThreadExtractor (for .json API, structured text output)
- Match: reddit.com URLs with
/comments/, extracting subreddit+postId.
- Fetch
.json (async, never sync XHR) from Reddit with browser creds, parse, and format:
- Post header (author, subreddit, time, title, post body, comment count).
- Comments: recursive with 2-space indentation per level,
[ISO date] author (score:N): body. Truncate per-comment bodies and stop at max comments/depth/sum length.
- Only trigger if
preferUrl is false.
- On any fetch/parse failure, return null (router fallback).
C) Logging
- All extraction attempts must add diagnostic logs (
extractor.route.start, extractor.try, extractor.success, extractor.fail, extractor.route.preferUrlHardSwitch for preferUrl skips).
Not in Scope
- Do not modify
shouldPreferUrlMode (no new logic, no changes allowed here at all).
- Do not touch daemon/core extractor router or its logic.
- Do not sync browser cookies/session to daemon.
- No settings UI for these behaviors in this phase.
Acceptance Criteria
- For all media URLs (
preferUrl true), logs show router does not run new extractors (must log preferUrlHardSwitch). No regression allowed.
- For Reddit threads (
/comments/), output from reddit-thread extractor is used (with structured post + comments), falls back to page or url-daemon if needed, never crashes.
- Logs diagnostically record every routing/extractor decision for debugging.
- Daemon is never given browser cookies/session; logs are privacy safe.
Reference
Context
Currently, the Chrome Side Panel extension for
summarizeextracts main text from the rendered DOM using Readability, with media URLs (YouTube, Twitter, direct audio/video, podcasts) always routed to the backend daemon (URL mode/extractOnly) viashouldPreferUrlMode(url). Reddit threads could be more robustly extracted through Reddit's.jsonAPI, but the extension only leverages this in CLI if users manually add.json.Problem
.jsonAPI that returns well-structured data for both the initial post and comments.preferUrllogic unaffected for all other URLs.Hard Requirements
DO NOT alter the preferUrl logic, its flow, or cache rules.
shouldPreferUrlMode(url)is true for a URL (e.g., YouTube/Twitter/media/podcast):preferUrl === falseURLs should the new extractors run.Implementation Scope
A) Modular Extractor Router for extension background
match(ctx)andextract(ctx)→ExtractorResult | nullreddit-thread=>page-readability=>url-daemon, in order, falling back as needed.B) RedditThreadExtractor (for .json API, structured text output)
/comments/, extracting subreddit+postId..json(async, never sync XHR) from Reddit with browser creds, parse, and format:[ISO date] author (score:N): body. Truncate per-comment bodies and stop at max comments/depth/sum length.preferUrlis false.C) Logging
extractor.route.start,extractor.try,extractor.success,extractor.fail,extractor.route.preferUrlHardSwitchfor preferUrl skips).Not in Scope
shouldPreferUrlMode(no new logic, no changes allowed here at all).Acceptance Criteria
preferUrltrue), logs show router does not run new extractors (must logpreferUrlHardSwitch). No regression allowed./comments/), output fromreddit-threadextractor is used (with structured post + comments), falls back to page or url-daemon if needed, never crashes.Reference