…ource papers
Why
---
Most IEEE / ACM / Springer / Elsevier results come back from their
plugins with pdf_url=None because the publisher sites are paywalled
even when the paper itself is open access. The OA copy usually exists
somewhere else (the author's institutional repo, an arXiv preprint,
ResearchGate, etc.) and Unpaywall indexes ~50M of them keyed by DOI.
Without this resolver the pipeline's per-paper PPT emit gate cuts those
papers because pdf_url is missing, even though a downloadable PDF was
one HTTP roundtrip away.
How
---
New module autopapertoppt/core/oa_resolver.py runs after dedup + rank +
top-tier filter. For every paper still missing pdf_url:
1. Unpaywall by DOI via https://api.unpaywall.org/v2/{doi}. Requires
   AUTOPAPERTOPPT_CONTACT_EMAIL for politeness; when it is unset the
   lookup is skipped and a one-shot WARNING is logged. Returns
   best_oa_location.url_for_pdf when found.
2. arXiv title search for papers without DOI (or where Unpaywall
   missed). Uses arXiv's field-restricted ti:"<title>" syntax and
   accepts only exact normalised-title matches, so loosely similar
   titles do not get adopted by accident.
Both lookups are best-effort and never raise; concurrency is capped at
5 by a semaphore; HTTPS-only is enforced by the existing transport
wrapper.
Surfaces
--------
- run_search now takes resolve_oa: bool = True. Default ON.
- CLI: --no-oa-resolve flag to skip the resolver per run.
- All existing fake_run_search mocks across tests/test_cli.py,
  tests/test_mcp_tools.py, tests/gui/test_search_page.py updated to
  accept **_kwargs so the new kwarg does not break them.
Note: OpenAlex and Semantic Scholar parsers ALREADY surface their OA
URL fields (best_oa_location.pdf_url, openAccessPdf.url); this PR
doesn't touch them. The resolver only kicks in for papers whose source
plugin returns no pdf_url, which is almost exclusively IEEE / ACM (via
Crossref) / DBLP / paywalled OpenAlex hits.
Tests
-----
+11 tests in tests/test_oa_resolver.py covering: early-exit when all
papers have pdf_url, Unpaywall happy path, arXiv fallback when
Unpaywall misses, both miss, email-unset skip, one-shot warning flag,
fuzzy title matching, arxiv-sourced paper skip, exact-match only,
https-only enforcement.
Docs
----
configuration.md: AUTOPAPERTOPPT_CONTACT_EMAIL row mentions the new
Unpaywall use.
architecture.md: pipeline diagram updated + new "OA PDF resolution"
subsection.
cli.md: --no-oa-resolve row.
462 tests pass, ruff + bandit clean.
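The two lookups described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in oa_resolver.py: the helper names (unpaywall_url, pick_pdf_url, normalise_title, titles_match) are hypothetical, the network call itself is left out, and the normalisation rule (lower-case, collapse non-alphanumerics) is an assumption about what "exact normalised-title match" means.

```python
import re
from typing import Any, Optional
from urllib.parse import quote

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the Unpaywall lookup URL; the email query parameter is required by the API."""
    return f"{UNPAYWALL_BASE}{quote(doi, safe='/')}?email={quote(email)}"


def pick_pdf_url(payload: dict[str, Any]) -> Optional[str]:
    """Return best_oa_location.url_for_pdf if present and HTTPS, else None."""
    best = payload.get("best_oa_location") or {}
    url = best.get("url_for_pdf")
    if isinstance(url, str) and url.startswith("https://"):
        return url
    return None


def normalise_title(title: str) -> str:
    """Assumed normalisation: lower-case and collapse punctuation / whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def titles_match(candidate: str, wanted: str) -> bool:
    """Accept an arXiv hit only when the normalised titles are identical."""
    return normalise_title(candidate) == normalise_title(wanted)
```

Keeping the payload parsing separate from the HTTP call makes the "never raise" contract easier to test, since malformed or empty responses simply map to None.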
…v_id/S2/CORE
For broader coverage out of the box and bigger pdf_url hit-rate on the
'public papers I can read' axis.
Default flips
-------------
- IEEE plugin: was opt-IN via AUTOPAPERTOPPT_ENABLE_IEEE_SCRAPING=1.
  Now default-ON, opt-OUT via AUTOPAPERTOPPT_DISABLE_IEEE_SCRAPING=1.
- Scholar plugin: same flip, opt-OUT via
  AUTOPAPERTOPPT_DISABLE_SCHOLAR_SCRAPING=1.
- DEFAULT_SOURCES now includes scholar (it was previously missing).
- top_tier_only CLI flag flipped: --all-venues (off by default) is
  replaced by --top-tier-only (off by default). Most IEEE / ACM
  workshop papers live outside the whitelist; the previous default
  silently dropped them.
FetcherConfig gains an opt_out_env_var field so list_sources can report
enablement correctly for the new gating semantics.
OA resolver: three new strategies
---------------------------------
Previous resolver: Unpaywall -> arXiv title search.
New chain (returns first hit):
1. arXiv-ID direct: derive https://arxiv.org/pdf/{id}.pdf from
   paper.arxiv_id when set. Zero network, highest precision. Strips
   trailing v<N> version suffix.
2. Unpaywall by DOI: unchanged.
3. Semantic Scholar OA index: query
   /graph/v1/paper/DOI:{doi}?fields=openAccessPdf. S2's index is
   partially disjoint from Unpaywall.
4. CORE.ac.uk: 200M+ institutional/regional OA repos. Needs
   AUTOPAPERTOPPT_CORE_API_KEY (free), skipped silently otherwise.
5. arXiv title search: unchanged, last resort.
Strategies 2-4 are data-driven through a _DOI_STRATEGIES tuple to keep
_resolve_one's cognitive complexity under the ruff/SonarQube cap of 10.
Tests
-----
- test_ieee.py / test_scholar.py: opt-in test renamed to opt-out,
  autouse fixture clears the DISABLE flag instead of setting ENABLE.
- test_mcp_tools.py: list_sources output schema updated
  (needs_env_var -> opt_in_env_var + opt_out_env_var).
- test_cli.py: top_tier_filter_on_by_default renamed to
  top_tier_filter_off_by_default; asserts the inverse.
- test_oa_resolver.py: +5 tests covering arxiv_id direct, S2 fallback,
  CORE fallback, CORE-key-missing skip, arxiv_id version stripping.
467 tests pass, ruff + bandit clean.
Docs
----
configuration.md: env-var table updated for the flipped semantics and
the new CORE key.
architecture.md: pipeline diagram and OA resolution subsection updated
to the 5-strategy chain.
cli.md: --top-tier-only row replaces --all-venues with revised default.
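As an illustration of strategy 1 and of the data-driven "first hit wins" loop, a sketch along these lines would work. The names arxiv_pdf_url and first_hit are hypothetical, and the strategies are modelled as plain synchronous callables for brevity, whereas the real resolver is presumably async.

```python
import re
from typing import Callable, Optional, Sequence

_VERSION_SUFFIX = re.compile(r"v\d+$")

# Hypothetical strategy signature: take a DOI, return a PDF URL or None.
Strategy = Callable[[str], Optional[str]]


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Derive the PDF URL from an arXiv id, dropping a trailing v<N> version suffix."""
    bare = _VERSION_SUFFIX.sub("", arxiv_id.strip())
    return f"https://arxiv.org/pdf/{bare}.pdf"


def first_hit(doi: str, strategies: Sequence[Strategy]) -> Optional[str]:
    """Try each DOI-keyed strategy in order and return the first non-empty URL."""
    for strategy in strategies:
        url = strategy(doi)
        if url:
            return url
    return None
```

Holding the DOI strategies in a tuple means adding a new index is a one-line change to the tuple rather than another branch in the resolver function.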
Two production-quality improvements on top of the new resolver +
default flips.
Semantic Scholar
----------------
- The OA resolver's _query_semantic_scholar now honours
AUTOPAPERTOPPT_S2_API_KEY (sent as x-api-key header). Without the
key the resolver hit the anonymous tier (~1 req/s) and rate-limited
fast under burst. The source plugin already used the key; only the
resolver's separate query path was missing it.
- Added an in-process _S2_CACHE dict keyed by DOI so re-querying the
same paper within one run is a no-op. Negative results (not in OA
index) are cached too. 429 responses are NOT cached (so the user
can retry after setting the API key).
Google Scholar captcha
----------------------
Google's captcha is served as HTTP 200 with an HTML form, so the
old status-code-only checks missed it; the plugin just returned an
empty result and the pipeline retried for the rest of the run,
deepening the IP block.
- New _is_captcha_response detects the bot-check page by URL path
(/sorry/) and known body markers ('unusual traffic',
captcha-form, g-recaptcha, ...).
- New _engage_captcha_cooldown sets a process-level timestamp
(default 30 min). Subsequent Scholar requests in the same run
raise SourceUnavailableError immediately without issuing HTTP,
so we don't burn the rate-limit budget once we know Google has
flagged us.
- The error message now tells the user what to do: rotate IP (VPN),
set AUTOPAPERTOPPT_DISABLE_SCHOLAR_SCRAPING=1, or wait.
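The detection and cooldown logic could look roughly like this. It is a sketch under assumptions: the marker list only contains the three markers named above, the cooldown is modelled as a small class with an injectable clock (the commit describes a process-level timestamp), and SourceUnavailableError here is a local stand-in for the project's exception.

```python
import time
from typing import Callable, Optional

_CAPTCHA_BODY_MARKERS = ("unusual traffic", "captcha-form", "g-recaptcha")


class SourceUnavailableError(RuntimeError):
    """Local stand-in for the project's exception of the same name."""


def is_captcha_response(url: str, body: str) -> bool:
    """Detect Google's bot-check page, which arrives as HTTP 200 with an HTML form."""
    if "/sorry/" in url:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in _CAPTCHA_BODY_MARKERS)


class CaptchaCooldown:
    """Refuse further requests for a fixed window after a captcha is seen."""

    def __init__(self, seconds: float = 30 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._seconds = seconds
        self._clock = clock
        self._until: Optional[float] = None

    def engage(self) -> None:
        """Start (or restart) the cooldown window."""
        self._until = self._clock() + self._seconds

    def check(self) -> None:
        """Raise without issuing HTTP while the cooldown is active."""
        if self._until is not None and self._clock() < self._until:
            raise SourceUnavailableError(
                "Scholar captcha cooldown active: rotate IP, set "
                "AUTOPAPERTOPPT_DISABLE_SCHOLAR_SCRAPING=1, or wait."
            )
```

A caller would run check() before every request and engage() whenever is_captcha_response returns True.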
Tests
-----
+5 tests: S2 cache hit, S2 x-api-key header sent when env var set,
captcha URL detection, captcha body markers, cooldown engages after
first hit + raises immediately on the second call.
471 tests pass, ruff + bandit clean.
Docs
----
configuration.md: AUTOPAPERTOPPT_S2_API_KEY row now mentions the
OA resolver path. Scholar / IEEE rows already documented from the
previous flip commit.
httpx-based Scholar scrapes get captcha'd within a few requests because
Google's detection flags non-browser User-Agents and the predictable
request shape. WebRunner (https://webrunner.readthedocs.io/) drives a
real Chrome via Selenium with the
--disable-blink-features=AutomationControlled flag, which survives
Google's standard heuristics.
How it works
------------
sources/scholar/webrunner_backend.py:
is_available(): True iff je_web_runner is importable AND
  AUTOPAPERTOPPT_DISABLE_WEBRUNNER is not set. Cheap repeated calls.
fetch_serp_html(query): async wrapper that runs the Selenium block in
  asyncio.to_thread so the pipeline's event loop isn't blocked while
  Chrome boots (5-10 s cold). Uses Chrome with the auto-control disable
  flag, headless=new, lang=en-US, 1280x720 window: same fingerprint as
  the WebRunner google_search.py example.
sources/scholar/fetcher.py: search() picks WebRunner when is_available,
falls back to httpx on any RuntimeError (e.g. Chrome not installed).
The httpx path keeps the captcha cooldown logic from the previous
commit as a second-line defence.
Optional extra
--------------
pyproject.toml gains a [scholar] extra:
pip install autopapertoppt[scholar] # adds je_web_runner
The [dev] extra also pulls it in so the test suite covers both
backends. Without the extra, Scholar falls back to the httpx path
silently, so there is no behavioural regression.
Tests
-----
+3 tests in tests/sources/test_scholar.py:
- WebRunner backend is used when is_available returns True
- WebRunner failure (Chrome boot error) falls back to httpx
- DISABLE_WEBRUNNER=1 makes is_available return False
The autouse fixture defaults DISABLE_WEBRUNNER=1 so existing tests
exercise the httpx path; only the three new tests opt in.
474 tests pass, ruff + bandit clean.
Docs
----
configuration.md: AUTOPAPERTOPPT_DISABLE_WEBRUNNER row added with
behaviour, install command for the [scholar] extra, and the CI override
use case.
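The "browser first, httpx on RuntimeError" routing can be sketched as below. fetch_serp and its four parameters are hypothetical names; the blocking browser function and the async HTTP function are passed in so the sketch stays independent of Selenium and httpx.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_serp(
    query: str,
    browser_fetch: Callable[[str], str],          # blocking, e.g. a Selenium-driven fetch
    http_fetch: Callable[[str], Awaitable[str]],  # async fallback, e.g. an httpx-based fetch
    browser_available: Callable[[], bool],
) -> str:
    """Prefer the real-browser backend; fall back to HTTP when it is missing or fails."""
    if browser_available():
        try:
            # Run the blocking browser code in a worker thread so the event loop stays free.
            return await asyncio.to_thread(browser_fetch, query)
        except RuntimeError:
            pass  # e.g. Chrome not installed; fall through to the HTTP path
    return await http_fetch(query)
```

asyncio.to_thread needs Python 3.9 or newer; on older interpreters loop.run_in_executor would play the same role.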
Promoted je_web_runner from the [scholar] extra to the main dependency list. The Scholar plugin's WebRunner-first behaviour (captcha-resilient real-browser path) is now the default on every 'pip install autopapertoppt' install rather than gated behind an opt-in extra. Users without Chrome on PATH or who explicitly set AUTOPAPERTOPPT_DISABLE_WEBRUNNER=1 still get the httpx scrape fallback, including the captcha cooldown logic. No behavioural regression — only the default capability widened. Removed the now-redundant [scholar] extra from pyproject.toml. Anyone who had 'pip install autopapertoppt[scholar]' in their deploy gets je_web_runner via the base dep set; the extra-less install is now strictly more capable. Docs updated: configuration.md no longer references the extra, just the DISABLE_WEBRUNNER opt-out and its use cases (CI without a Chrome binary, latency-sensitive runs).
…olar
Google's bot detection rejects WebRunner requests too once an IP is
flagged from earlier scraping. The reliable workaround is to seed a
real Google sign-in into a Chrome profile dir once; subsequent
headless runs reuse the session cookie and Google trusts the
authenticated request.
Two new env vars in sources/scholar/webrunner_backend.py:
AUTOPAPERTOPPT_CHROME_PROFILE_DIR
When set, passes --user-data-dir=<path> to Chrome so cookies,
login state, and captcha clearance survive across CLI runs.
AUTOPAPERTOPPT_CHROME_HEADLESS
Default '1' (headless). Set '0' for the one-time interactive
sign-in that seeds the profile dir. When non-headless the
Chrome window holds open for 60 s (vs the 3 s headless wait)
so the user has time to log in / accept consent banners /
solve any captcha.
_build_chrome_args extracted from _drive_chrome_sync so the args
construction is unit-testable without spinning up Chrome.
Tests
-----
+3 tests covering: default headless + no profile, profile dir
passed as --user-data-dir, HEADLESS=0 drops the headless flag.
477 tests pass, ruff + bandit clean.
Docs
----
configuration.md gains two env var rows + a new 'Suppressing
Scholar captchas with a persistent Chrome profile' recipe walking
through the one-time interactive setup and the steady-state
headless usage, with caveats about profile-dir locking, cookie
expiry, and treating the profile dir as a secret.
The interactive-login phase opens Chrome visibly for the user to sign into Google. Users naturally close the Chrome window the moment login completes (which is what we want — the cookies are already persisted to the profile dir at that point) but closing the window invalidates the WebDriver session, so the subsequent page_source access raises or returns None and the parser crashes with AttributeError. Wrap the page_source access in a defensive try; on None or exception return an empty SERP HTML shell that the parser handles as 'valid but zero results'. The cookie store under --user-data-dir is already on disk before any of this code runs, so the interactive seed has done its job even when page_source fails. User-facing log message updated to call out that the user MAY close the window once logged in.
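A minimal sketch of the defensive read, assuming a hypothetical helper name safe_page_source and a hypothetical empty-shell constant. It catches Exception broadly to stay self-contained; with Selenium installed, selenium.common.exceptions.WebDriverException would be the narrower choice.

```python
from typing import Any

# Assumed shape of the "valid but zero results" shell; the real one may differ.
EMPTY_SERP_HTML = "<html><body></body></html>"


def safe_page_source(driver: Any) -> str:
    """Read driver.page_source, tolerating a window the user has already closed."""
    try:
        html = driver.page_source
    except Exception:
        # The WebDriver session is gone once the window closes; treat it as zero results.
        return EMPTY_SERP_HTML
    return html if html else EMPTY_SERP_HTML
```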
Two changes pulling the WebRunner story together.
IEEE via WebRunner
------------------
New sources/ieee/webrunner_backend.py drives a visible Chrome to the
IEEE Xplore origin and runs the existing /rest/search POST inside the
real-browser context via execute_async_script. IEEE's endpoint blocks
httpx-style POSTs (TLS handshake / JS-engine fingerprint) but accepts
the same request when it originates from a real Chrome page that
already holds the session cookies set by the home page.
Flow per search:
1. Boot visible Chrome
2. Navigate to https://ieeexplore.ieee.org/Xplore/home.jsp
   (sets the session cookies the REST endpoint requires)
3. execute_async_script -> fetch('/rest/search', POST, body)
   (right cookies, right Origin, right fingerprint)
4. Parse the returned JSON with the existing parse_search_record
fetch_by_id similarly navigates to /document/<arnumber> and returns
page_source for the existing HTML parser.
sources/ieee/fetcher.py gains _scrape_search and _scrape_document
helpers that try WebRunner first (when je_web_runner is importable +
AUTOPAPERTOPPT_DISABLE_WEBRUNNER is not set) and fall back to the httpx
scrape on RuntimeError. The API path (with AUTOPAPERTOPPT_IEEE_API_KEY)
is unchanged.
Visible-only
------------
Per the new policy WebRunner never runs headless: Google's detection is
more aggressive against headless signatures, and the user wanted to see
what's happening. Removed:
- AUTOPAPERTOPPT_CHROME_HEADLESS env var (no longer read)
- the _build_chrome_args tuple return (no more headless flag to
  communicate)
- the conditional interactive-vs-headless wait logic
Replaced with a single 15-second visible page-load wait; users can
close the window early if the page is ready.
ACM intentionally not touched: it queries Crossref API (no scraping),
so WebRunner adds nothing.
Tests
-----
- IEEE: +2 tests covering WebRunner-routed search + httpx fallback on
  WebRunner RuntimeError. Autouse fixture defaults DISABLE_WEBRUNNER=1
  so the existing httpx-transport tests still exercise their intended
  path.
- Scholar: existing _build_chrome_args tests rewritten for the
  always-visible signature.
478 tests pass, ruff + bandit clean.
Docs
----
configuration.md updated: DISABLE_WEBRUNNER now covers both Scholar and
IEEE; CHROME_PROFILE_DIR works for both; CHROME_HEADLESS row removed
entirely.
The real bottleneck for IEEE / ACM / Springer / Elsevier / Wiley
papers is the PDF download step, not the search. httpx-style GETs to
those publisher CDNs reliably 403 because they fingerprint the TLS
handshake + JS engine; the search-via-Crossref (ACM) or
search-via-WebRunner (IEEE) paths already work fine.
New autopapertoppt/fetchers/webrunner_pdf.py provides:
is_available() - selenium importable + DISABLE_WEBRUNNER not set
should_use_webrunner() - host suffix check against a curated list
of paywalled publisher CDNs
download_via_browser() - boot Chrome with download prefs, navigate,
wait for the file to land in the configured
temp dir, validate %PDF magic, move to target
Chrome is configured to save PDFs straight to disk
(plugins.always_open_pdf_externally + download.default_directory)
instead of opening the built-in PDF viewer; otherwise we couldn't
grab the bytes. The profile-dir env var the rest of WebRunner uses
is honoured here too, so institutional auth cookies surface
subscription PDFs.
pdf_download.py routes paywalled URLs through this path first; on
False (Chrome failed to boot, navigation failed, timeout, magic-byte
mismatch) it falls back to the existing httpx path. Other URLs
(arxiv, NCBI PMC, institutional repos) skip the WebRunner path
entirely and use httpx directly.
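The host-suffix routing might be sketched like this. Only ieeexplore.ieee.org and dl.acm.org are hosts named in this PR; the remaining entries are illustrative assumptions, and the curated list in webrunner_pdf.py may differ.

```python
from urllib.parse import urlparse

# Illustrative list: only the IEEE and ACM hosts are confirmed by the PR text.
_PAYWALLED_HOST_SUFFIXES = (
    "ieeexplore.ieee.org",
    "dl.acm.org",
    "link.springer.com",
    "sciencedirect.com",
    "onlinelibrary.wiley.com",
)


def should_use_webrunner(url: str) -> bool:
    """True when the URL's host is, or is a subdomain of, a listed publisher host."""
    try:
        host = urlparse(url).hostname
    except ValueError:
        return False  # malformed URL
    if not host:
        return False
    host = host.lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in _PAYWALLED_HOST_SUFFIXES
    )
```

Matching on "." + suffix rather than a bare endswith avoids treating a look-alike host such as notdl.acm.org.evil.com, or a host that merely ends in the same letters, as a publisher CDN.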
ACM searches still go through Crossref REST API (no scraping needed
or wanted there), but ACM PDF URLs (dl.acm.org/...) now route
through WebRunner.
Tests
-----
+8 tests in tests/test_webrunner_pdf.py covering:
- paywalled-host detection for IEEE/ACM/Springer/Elsevier/Wiley/OUP/Nature/Science
- non-paywalled hosts (arxiv, NCBI, generic) skip WebRunner
- garbage URLs return False
- DISABLE_WEBRUNNER=1 short-circuits is_available
- integration: pdf_download routes paywalled paper through WebRunner
- integration: arxiv paper bypasses WebRunner
496 tests pass, ruff + bandit clean.
Docs
----
configuration.md's DISABLE_WEBRUNNER row now lists the full set of
paywalled publisher hosts the PDF path covers.
Two bugs the visible-Chrome run made obvious.
Bug 1: je_web_runner singleton race
-----------------------------------
The Scholar and IEEE WebRunner backends both imported the
module-level singleton webdriver_wrapper_instance and both
called set_driver(...) against it. When the pipeline fans out
sources via asyncio.gather, the two backends ran concurrently and
stomped on the same singleton — one Chrome became orphaned, the
other's execute_async_script / page_source read a window in the
wrong state. The symptom was silent: no exception, no log, just
Chrome stuck on IEEE Xplore home page with zero records returned.
Fix: new shared helper autopapertoppt/fetchers/webrunner_browser.py
that uses selenium.webdriver.Chrome directly, spinning a FRESH
driver per call. No shared state across backends; each call owns
and quits its own Chrome.
Both source backends + the PDF download path now route through
make_driver() instead of the je_web_runner singleton. The
je_web_runner dependency stays in pyproject (it's a base dep) but
the project no longer uses its singleton.
Bug 2: captcha doesn't wait for the user
----------------------------------------
The Scholar backend used a fixed 15s sleep after navigation, then
read page_source. If Google served a captcha, the user couldn't
solve it in 15s and the next iteration just dropped the source.
Fix: webrunner_browser.wait_for_captcha_solved polls every 2s for
up to 5 minutes. The poll checks URL fragments (/sorry/, /captcha,
/recaptcha) AND body markers ('unusual traffic', captcha-form,
g-recaptcha, etc.). When the captcha clears (user solved it,
Google redirected back to the real SERP), the loop returns True
and the caller reads page_source. Timeout → returns False, source
falls through to the empty-SERP shell path.
Same wait now applies to IEEE — if IEEE serves a 'verify you're
human' page (some regions / IPs), the user has the same 5-minute
window to clear it.
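The polling loop described for Bug 2 can be sketched as follows. The get_state callable (returning the current URL and page body), the looks_like_captcha helper, and the injectable clock / sleep parameters are all hypothetical; the real wait_for_captcha_solved presumably takes the driver directly.

```python
import time
from typing import Callable, Tuple

_URL_FRAGMENTS = ("/sorry/", "/captcha", "/recaptcha")
_BODY_MARKERS = ("unusual traffic", "captcha-form", "g-recaptcha")


def looks_like_captcha(url: str, body: str) -> bool:
    """Check both the URL fragments and the body markers named in the commit."""
    lowered_url, lowered_body = url.lower(), body.lower()
    return (any(frag in lowered_url for frag in _URL_FRAGMENTS)
            or any(marker in lowered_body for marker in _BODY_MARKERS))


def wait_for_captcha_solved(
    get_state: Callable[[], Tuple[str, str]],
    *,
    timeout: float = 300.0,   # 5 minutes
    interval: float = 2.0,    # poll every 2 s
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the captcha page is gone, False if the timeout elapses first."""
    deadline = clock() + timeout
    while True:
        url, body = get_state()
        if not looks_like_captcha(url, body):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a Selenium driver, get_state could be as small as lambda: (driver.current_url, driver.page_source).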
Also added: IEEE WebRunner now logs the record count it received
so future debugging doesn't have the 'no log line at all' mystery.
496 tests pass, ruff + bandit clean.
Move Code Quality Requirements (290 lines), Project-Specific Compliance Patterns, Slide Deck Rules, and Environment / env-vars sections out of the always-loaded CLAUDE.md (671 -> 94 lines) into four new reference subagents under .claude/agents/: code-quality-reviewer, compliance-auditor, slide-deck-rules, env-vars. CLAUDE.md keeps the overview, Definition of Done pointer, Git Commit hygiene, and a condensed IEEE / publisher CDN "browser automation mandatory" HARD RULE summary that points at the compliance-auditor subagent for the full audit checklist. Also add the IEEE WebRunner sanity gate to dod-verify (grep for --headless / direct httpx POSTs to ieeexplore /rest/search outside the webrunner_backend), update AGENTS.md "Where to look for the rest" to point at the new subagents, and update tests/test_agents_md.py to treat CLAUDE.md + every .claude/agents/*.md as one combined Claude-rules document so the mirror-with-AGENTS.md tests still pass after the split. .gitignore: add chrome_profile / chrome-profile / chrome_profiles / selenium-debug.log / *.crdownload so WebRunner-driven Selenium runs don't accidentally commit browser state.
…DF download
The CLI's built-in WebRunner backend boots Chrome from inside asyncio.gather,
which works for unattended runs but leaves the LLM agent no room to make
per-step decisions. These scripts are the LLM-as-agent path: the LLM in the
editor session invokes them via Bash, the script opens a visible Chrome
window (no headless), and the LLM decides URLs / inspects HTML / picks
papers / drives downloads.
Search + parse:
* scripts/llm_driven_search.py -- boot Chrome, hit Scholar SERP for a chosen
query, JS-fetch IEEE /rest/search from inside the IEEE origin, dump
HTML + JSON to exports/_llm_scratch/ for the next step to read.
* scripts/llm_parse_results.py -- read the dumped artefacts, parse via the
project's scholar / ieee parsers, dedup + rank + export xlsx + md.
Per-publisher PDF download (shared helpers in scripts/_pdf_downloaders.py,
thin single-paper CLIs reusing them):
* download_ieee -- /document/<arnumber> sets cookies, /stamp/stamp.jsp
serves an iframe wrapper, fallback extracts the iframe src
(/stampPDF/getPDF.jsp) and lets Chrome's plugins.always_open_pdf_externally
auto-download.
* download_acm -- /doi/<doi> cookies, /doi/pdf/<doi> streams the PDF.
* download_springer -- /article/<doi> with /chapter/<doi> fallback,
/content/pdf/<doi>.pdf streams the PDF.
* dispatch_for_url -- pick the right downloader by URL host; pulls DOI
out of the URL when the xlsx DOI column is empty.
* CLIs: scripts/llm_download_ieee_pdf.py <arnumber>,
scripts/llm_download_acm_pdf.py <doi>,
scripts/llm_download_springer_pdf.py <doi>.
Batch driver:
* scripts/llm_download_pdfs.py <xlsx> -- read the Papers sheet, group by
publisher, open ONE Chrome session, walk all rows, summarise ok/fail.
Idempotent: papers whose canonical <id>.pdf already exists and validates
skip immediately. Validated end-to-end on a test-time-compute-scaling
query: 7/7 PDFs (6 IEEE + 1 ACM, 10.8 MB total), all %PDF-1.4..1.6 head
+ %%EOF tail. Re-run is full cache-hit.
PDF integrity validation: _is_valid_pdf checks both %PDF- magic header AND
%%EOF tail marker to reject HTML "Sign in / Get access" gates that
publishers serve to unauthenticated visitors.
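A minimal sketch of the two-sided check, assuming the whole file is already in memory. The function name is_valid_pdf and the 1024-byte tail window are assumptions; PDF readers conventionally look for %%EOF near the end of the file, so a bounded tail search is a common approximation.

```python
def is_valid_pdf(data: bytes) -> bool:
    """Accept only payloads with a %PDF- header and an %%EOF marker near the end."""
    has_header = data.startswith(b"%PDF-")
    has_trailer = b"%%EOF" in data[-1024:]
    return has_header and has_trailer
```

An HTML "Sign in" page fails the header test, and a download that was cut off mid-transfer fails the trailer test.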
New downloads are detected by baselining the download directory with
_snapshot_pdfs and set-diffing in _wait_for_new_pdf, instead of deleting
every .pdf before each step, so a batch of 7 papers ends with all 7
PDFs on disk rather than only the last successful one.
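The snapshot-and-diff idea could be sketched like this. The un-prefixed names, the timeout defaults, and the tie-break on modification time are assumptions; the sketch relies on Chrome writing in-progress downloads under a .crdownload name, so a *.pdf glob only sees finished files.

```python
import time
from pathlib import Path
from typing import Optional


def snapshot_pdfs(directory: Path) -> set[Path]:
    """Record which PDFs exist before a download is triggered."""
    return set(directory.glob("*.pdf"))


def wait_for_new_pdf(directory: Path, before: set[Path],
                     timeout: float = 60.0, interval: float = 0.5) -> Optional[Path]:
    """Poll until a PDF not in the baseline appears; return it, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        new_files = snapshot_pdfs(directory) - before
        if new_files:
            # If several appeared at once, pick the most recently modified one.
            return max(new_files, key=lambda p: p.stat().st_mtime)
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

Because nothing is deleted, earlier downloads in the same batch stay on disk and simply become part of the next baseline.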
Repo root now has only the English README.md plus a readmes/ directory containing the 13 translated copies (de, es, fr, hi, id, it, ja, ko, pt, ru, vi, zh-CN, zh-TW). The English README's language picker points at readmes/README.<lang>.md and each translated README's English link points back at ../README.md. Each docs/<lang>/index.rst that mentioned README.<lang>.md by name is updated to reference readmes/README.<lang>.md so the Sphinx-rendered docs page links to the file's real path. tests/test_i18n.py: update readme_overrides + the default-path formatter to look under readmes/, and update the zh-TW path entry in test_zh_tw_files_use_traditional_chinese_vocabulary so the S-Chinese vocab guard keeps protecting the moved file.
The LLM-as-agent path was running searches that involved IEEE / Scholar / paywalled publishers without first checking whether the user had VPN or institutional access. Without VPN: Scholar captchas within a few requests, IEEE serves abstract-only / 403 for the PDF stage, and the user spends minutes watching a useless Chrome window. CLAUDE.md HARD RULE now requires the agent to confirm VPN status (recall from conversation, or AskUserQuestion) BEFORE running 'python -m autopapertoppt -q ...' or any scripts/llm_*.py invocation. When the user says no VPN, the source mix is restricted to open sources (arxiv,openalex,pubmed,crossref,dblp,openaire). compliance-auditor: the in-practice list now leads with the VPN confirmation step before the WebRunner-vs-httpx rules. paper-summary-author: source-level browser-automation rule prefaced with a VPN gate paragraph naming AskUserQuestion as the canonical asking mechanism.
The previous commit bundled ieee and scholar together as the "skip when no VPN" set. That was wrong: Google Scholar is publicly accessible (no subscription / institutional auth needed), so it stays in the source mix even when the user has no VPN. Only IEEE drops, because IEEE's /rest/search + PDF stamp.jsp paths require institutional access for anything useful. Updated source mix when no VPN: arxiv,openalex,pubmed,crossref,dblp,openaire,scholar (was: arxiv,openalex,pubmed,crossref,dblp,openaire; scholar was incorrectly excluded). Chrome still boots for the scholar branch, for resilience against Google's captcha, but the SERP itself works fine without VPN. CLAUDE.md HARD RULE summary, compliance-auditor in-practice rule #1, and paper-summary-author VPN-gate paragraph all corrected to name IEEE / ACM / Springer (paywalled publishers) as the trigger and to keep scholar in the no-VPN mix.
…sked for a deck Decision tree already said "rich thesis-style is the default deliverable", but the rule was being interpreted as aspirational. Add an explicit "Default CLI invocation" subsection that gives the canonical command shape and names --lightweight / --no-pdf as anti-patterns when the user asks for a real deck. Those flags are now only valid for ad-hoc smoke tests, CLI-regression debugging, or when the user explicitly names them. Also append the matching bullet to the Anti-patterns section so a quick skim catches the rule.
…end-to-end runbook
scripts/_pdf_downloaders.py:
Add download_arxiv (open access; /pdf/<id>.pdf direct),
download_aclanthology (open; landing-URL + .pdf), download_neurips
(open; hash/<id>-Abstract → file/<id>-Paper-Conference.pdf swap),
download_openreview (open; forum?id → pdf?id). Split dispatch_for_url
into _dispatch_paywalled / _dispatch_open_access /
_dispatch_by_doi_prefix to stay under the cognitive-complexity cap.
DOI-prefix fallback (10.1109 → IEEE, 10.1145 → ACM, 10.1007 / 10.1038 →
Springer) catches opaque hosts (openalex.org / semanticscholar.org)
where the URL alone is uninformative.
scripts/llm_download_pdfs.py:
Wire the new handlers via _invoke(publisher, args, driver, out_dir).
identifier in the plan is now a tuple so NeurIPS's (year, hash) rides
through cleanly.
paper-summary-author.md:
Add "End-to-end runbook (search → rich deck): DO NOT ASK MID-FLIGHT"
with five phases (discovery + CLI primary attempt → no-pdf_url fallback
→ rich authoring → audits → optional commit) and a decision-rules table
covering every failure point so future sessions don't pause for user
input mid-run.
scripts/_overflow_check.py:
Headless overflow inspector for one-or-more .pptx. Honours
TEXT_TO_FIT_SHAPE / SHAPE_TO_FIT_TEXT auto-fit modes so shrink_to_fit
textboxes don't false-flag. Excludes page_number / footer shapes from
the FOOTER_GUARD = 7.05" check since those shapes ARE the footer band.
scripts/_dump_pdf_text.py, scripts/_inspect_xlsx.py:
Internal scratch utilities the LLM-as-agent flow uses to read PDFs /
xlsx rows when the Read tool can't decode them directly. Underscored to
mark them as not user-facing CLIs.
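The DOI-prefix fallback and the OpenReview URL swap can be illustrated with a short sketch. The prefix table comes from the commit text; publisher_for_doi and openreview_pdf_url are hypothetical names, and the real _dispatch_by_doi_prefix presumably returns a downloader rather than a label.

```python
from typing import Optional

# Prefix → publisher mapping as listed in the commit message.
_DOI_PREFIX_TO_PUBLISHER = {
    "10.1109": "ieee",
    "10.1145": "acm",
    "10.1007": "springer",
    "10.1038": "springer",
}


def publisher_for_doi(doi: str) -> Optional[str]:
    """Map a DOI to a publisher label by its registrant prefix, or None if unknown."""
    prefix = doi.strip().lower().split("/", 1)[0]
    return _DOI_PREFIX_TO_PUBLISHER.get(prefix)


def openreview_pdf_url(forum_url: str) -> str:
    """Swap an OpenReview forum?id=... landing URL for the matching pdf?id=... URL."""
    return forum_url.replace("/forum?id=", "/pdf?id=", 1)
```

This is what lets a row whose URL points at openalex.org or semanticscholar.org still reach the right downloader, as long as its DOI column is populated.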
…tw rich decks
Four papers from the "speculative decoding LLM inference" query, each
authored by reading the full PDF (downloaded via
scripts/llm_download_pdfs.py; ACL, arXiv, IEEE, NeurIPS dispatchers
respectively):
- xia2024unlocking: ACL 2024 Findings survey (Spec-Bench)
- spector2023accelerating: ICML 2023 staged speculative decoding (3.16x)
- xu2024edgellm: IEEE TMC 2025 on-device speculative (9.3x)
- svirschevski2024specexec: NeurIPS 2024 RAM/SSD offload + parallel spec
Each Paper carries a rich PaperSummary with pain_points (4 quadrants),
research_question, contributions_detailed (capped at 4 per the
anti-pattern), headline_metrics (5-6 KPIs), technique_table,
method_sections, evaluation_sections, system_flow (5-6 stages), 3 RQs
with rq_results tables, core_observation, limitations and future_work,
all in Traditional Chinese.
URLs / venues copied verbatim from the source xlsx; URL audit confirms
4/4 match. Overflow check passes for all 4 decks (headless inspector
now respects auto_size = TEXT_TO_FIT_SHAPE / SHAPE_TO_FIT_TEXT, so
shrink_to_fit textboxes don't false-flag).
A 5th xlsx row (OpenAlex W4405717632 with IEEE DOI
10.1109/tmc.2024.3513457) is an OpenAlex wrapper of the same paper as
xu2024edgellm; the IEEE DOI alone cannot be resolved to an arnumber, so
the dispatcher (correctly) skips it.
Output: exports/speculative-decoding-zh-tw/<key>-zh-tw.pptx × 4.
Summary
- … selenium.webdriver.Chrome via the new autopapertoppt/fetchers/webrunner_browser.py helper (persistent user-data dir + interactive login + captcha-wait). Paywalled PDF downloads for IEEE / ACM / Springer also go through this path. The httpx branch in each plugin is kept only as a CI / no-Chrome safety net.
- … scripts/llm_*.py). New per-publisher PDF downloaders (download_ieee / _acm / _springer / _arxiv / _aclanthology / _neurips / _openreview) plus a batch driver that walks an xlsx in one Chrome session. Companion scripts/llm_driven_search.py + llm_parse_results.py let an LLM in the editor drive Scholar SERP + IEEE /rest/search itself when the CLI's async backend isn't a fit. Working end-to-end example: scripts/regen_speculative_decoding_zh_tw.py ships 4 hand-authored Traditional Chinese rich decks (Xia ACL Findings, Spector ICML, Xu IEEE TMC, Svirschevski NeurIPS).
- … autopapertoppt/core/oa_resolver.py walks arXiv-ID direct → Unpaywall (AUTOPAPERTOPPT_CONTACT_EMAIL) → Semantic Scholar → CORE.ac.uk → arXiv title search to recover pdf_url for paywalled-source papers; lifts coverage 40-70 pp on IEEE / ACM / Springer / Elsevier-heavy queries.
- … CLAUDE.md slimmed from 671 -> 94 lines; the verbose sections moved into four new reference subagents under .claude/agents/ (code-quality-reviewer, compliance-auditor, slide-deck-rules, env-vars). paper-summary-author.md now carries a 5-phase end-to-end runbook + decision-rules table so future sessions run search -> rich-deck end-to-end without pausing for user input. Non-English READMEs moved into readmes/.
- … ieee (Scholar stays in the mix because Google Scholar is public).
- … tests/test_oa_resolver.py (+320), tests/test_webrunner_pdf.py (+129), expanded tests/sources/test_ieee.py + test_scholar.py for the WebRunner paths, mirror tests in tests/test_agents_md.py updated to treat CLAUDE.md + every .claude/agents/*.md as one combined Claude-rules document.
Test plan
- py -m pytest tests/: 496 passed locally
- py -m ruff check .: clean across autopapertoppt/, sources/, scripts/, tests/
- py -m bandit -c pyproject.toml -r autopapertoppt/ sources/ scripts/: 0 issues
- … speculative decoding LLM inference returned 5 papers across ACL / arXiv / IEEE / NeurIPS / OpenAlex
- python -m scripts.llm_download_pdfs <xlsx> saved 4/4 valid PDFs (%PDF-...%%EOF)
- python -m scripts.regen_speculative_decoding_zh_tw produced 4 zh-tw decks (18-19 slides each), URL audit 4/4 matches xlsx, overflow inspector 4/4 PASS
- python -m autopapertoppt -q "..." --lang zh-tw --export pptx,xlsx,bib --yes against a fresh query end-to-end with VPN on, confirm Chrome window pops up for the IEEE step