
Visible-Chrome publisher flows + LLM-as-agent runbook + doc restructure #9

Merged
JE-Chen merged 18 commits into main from dev
May 19, 2026

@JE-Chen JE-Chen commented May 19, 2026

Summary

  • Visible-Chrome publisher flows. Google Scholar and IEEE Xplore are now routed through a real, visible selenium.webdriver.Chrome via the new autopapertoppt/fetchers/webrunner_browser.py helper (persistent user-data dir + interactive login + captcha-wait). Paywalled PDF downloads for IEEE / ACM / Springer also go through this path. The httpx branch in each plugin is kept only as a CI / no-Chrome safety net.
  • LLM-as-agent batch flow (scripts/llm_*.py). New per-publisher PDF downloaders (download_ieee / _acm / _springer / _arxiv / _aclanthology / _neurips / _openreview) plus a batch driver that walks an xlsx in one Chrome session. Companion scripts/llm_driven_search.py + llm_parse_results.py let an LLM in the editor drive Scholar SERP + IEEE /rest/search itself when the CLI's async backend isn't a fit. Working end-to-end example: scripts/regen_speculative_decoding_zh_tw.py ships 4 hand-authored Traditional Chinese rich decks (Xia ACL Findings, Spector ICML, Xu IEEE TMC, Svirschevski NeurIPS).
  • OA PDF resolver. New post-dedup resolver under autopapertoppt/core/oa_resolver.py walks Unpaywall (AUTOPAPERTOPPT_CONTACT_EMAIL) → arXiv title search → S2 cache → CORE.ac.uk to recover pdf_url for paywalled-source papers; lifts coverage 40-70 pp on IEEE / ACM / Springer / Elsevier-heavy queries.
  • Doc restructure. CLAUDE.md slimmed from 671 -> 94 lines; the verbose sections moved into four new reference subagents under .claude/agents/ (code-quality-reviewer, compliance-auditor, slide-deck-rules, env-vars). paper-summary-author.md now carries a 5-phase end-to-end runbook + decision-rules table so future sessions run search -> rich-deck end-to-end without pausing for user input. Non-English READMEs moved into readmes/.
  • VPN gate. Hard rule + matching subagent + memory entry: confirm VPN status (recall or AskUserQuestion) before triggering any paywalled-publisher search; when no VPN, skip only ieee (Scholar stays in the mix because Google Scholar is public).
  • Tests. New tests/test_oa_resolver.py (+320), tests/test_webrunner_pdf.py (+129), expanded tests/sources/test_ieee.py + test_scholar.py for the WebRunner paths, mirror tests in tests/test_agents_md.py updated to treat CLAUDE.md + every .claude/agents/*.md as one combined Claude-rules document.

Test plan

  • py -m pytest tests/ — 496 passed locally
  • py -m ruff check . — clean across autopapertoppt/, sources/, scripts/, tests/
  • py -m bandit -c pyproject.toml -r autopapertoppt/ sources/ scripts/ — 0 issues
  • Search smoke — speculative decoding LLM inference returned 5 papers across ACL / arXiv / IEEE / NeurIPS / OpenAlex
  • LLM-driven download smoke — python -m scripts.llm_download_pdfs <xlsx> saved 4/4 valid PDFs (%PDF-...%%EOF)
  • Rich-deck smoke — python -m scripts.regen_speculative_decoding_zh_tw produced 4 zh-tw decks (18-19 slides each), URL audit 4/4 matches xlsx, overflow inspector 4/4 PASS
  • Manual: open one of the 4 generated decks in PowerPoint to eyeball wrapping at 1280x720
  • Manual: kick python -m autopapertoppt -q "..." --lang zh-tw --export pptx,xlsx,bib --yes against a fresh query end-to-end with VPN on, confirm Chrome window pops up for the IEEE step

JE-Chen added 18 commits May 18, 2026 15:58
…ource papers

Why
---
Most IEEE / ACM / Springer / Elsevier results come back from their
plugins with pdf_url=None because the publisher sites are paywalled
even when the paper itself is open access. The OA copy usually
exists somewhere else - the author's institutional repo, an arXiv
preprint, ResearchGate, etc. - and Unpaywall indexes ~50M of them
keyed by DOI. Without this resolver the pipeline's per-paper PPT
emit gate cuts those papers because pdf_url is missing, even
though a downloadable PDF was one HTTP roundtrip away.

How
---
New module autopapertoppt/core/oa_resolver.py runs after dedup +
rank + top-tier filter. For every paper still missing pdf_url:

  1. Unpaywall by DOI via https://api.unpaywall.org/v2/{doi}.
     Requires AUTOPAPERTOPPT_CONTACT_EMAIL for politeness; skipped
     silently with a one-shot WARNING when unset. Returns
     best_oa_location.url_for_pdf when found.

  2. arXiv title search for papers without DOI (or where Unpaywall
     missed). Uses arXiv's field-restricted ti:"<title>" syntax,
     accepts only exact normalised-title matches so loosely similar
     titles do not get adopted by accident.

Both lookups are best-effort and never raise; concurrency capped
at 5 by a semaphore; HTTPS-only enforced by the existing transport
wrapper.
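
A minimal sketch of the two guard rails described above: exact normalised-title matching, and a semaphore-capped, never-raising fan-out. The names normalise_title, titles_match and resolve_all are illustrative and are not taken from oa_resolver.py; the lookup coroutine is injected so the sketch runs without network access.

```python
import asyncio
import re
import unicodedata


def normalise_title(title: str) -> str:
    """Lowercase, drop accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(re.findall(r"[a-z0-9]+", ascii_only.lower()))


def titles_match(wanted: str, candidate: str) -> bool:
    """Accept only exact matches after normalisation (no fuzzy threshold)."""
    left = normalise_title(wanted)
    return left != "" and left == normalise_title(candidate)


async def resolve_all(papers, lookup, limit: int = 5):
    """Run lookup(paper) for every paper with at most `limit` in flight.

    Best-effort: any exception from a lookup becomes None instead of
    propagating, so one bad paper cannot abort the batch.
    """
    semaphore = asyncio.Semaphore(limit)

    async def resolve_one(paper):
        async with semaphore:
            try:
                return await lookup(paper)
            except Exception:  # best-effort by design
                return None

    return await asyncio.gather(*(resolve_one(p) for p in papers))
```

Results come back in input order, so the caller can zip them against the paper list.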

Surfaces
--------
- run_search now takes resolve_oa: bool = True. Default ON.
- CLI: --no-oa-resolve flag to skip the resolver per run.
- All existing fake_run_search mocks across tests/test_cli.py,
  tests/test_mcp_tools.py, tests/gui/test_search_page.py updated
  to accept **_kwargs so the new kwarg does not break them.

Note: OpenAlex and Semantic Scholar parsers ALREADY surface their
OA URL fields (best_oa_location.pdf_url, openAccessPdf.url) - this
PR doesn't touch them. The resolver only kicks in for papers whose
source plugin returns no pdf_url, which is almost exclusively IEEE
/ ACM (via Crossref) / DBLP / paywalled OpenAlex hits.

Tests
-----
+11 tests in tests/test_oa_resolver.py covering: early-exit when
all papers have pdf_url, Unpaywall happy path, arXiv fallback when
Unpaywall misses, both miss, email-unset skip, one-shot warning
flag, fuzzy title matching, arxiv-sourced paper skip, exact-match
only, https-only enforcement.

Docs
----
configuration.md: AUTOPAPERTOPPT_CONTACT_EMAIL row mentions the
new Unpaywall use. architecture.md: pipeline diagram updated +
new "OA PDF resolution" subsection. cli.md: --no-oa-resolve row.

462 tests pass, ruff + bandit clean.
…v_id/S2/CORE

Broadens coverage out of the box and raises the pdf_url hit rate for
publicly readable papers.

Default flips
-------------
- IEEE plugin: was opt-IN via AUTOPAPERTOPPT_ENABLE_IEEE_SCRAPING=1.
  Now default-ON, opt-OUT via AUTOPAPERTOPPT_DISABLE_IEEE_SCRAPING=1.
- Scholar plugin: same flip, opt-OUT via
  AUTOPAPERTOPPT_DISABLE_SCHOLAR_SCRAPING=1.
- DEFAULT_SOURCES now includes scholar (it was previously missing).
- top_tier_only CLI default flipped: the top-tier filter used to be on
  by default with --all-venues as the opt-out; it is now off by default
  with --top-tier-only as the opt-in. Most IEEE / ACM workshop papers
  live outside the whitelist; the previous default silently dropped
  them.

FetcherConfig gains an opt_out_env_var field so list_sources can
report enablement correctly for the new gating semantics.
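
One way the combined opt-in / opt-out gating could be evaluated, as a hedged sketch. The FetcherConfig below is a simplified stand-in with only the two env-var fields named in this PR, and is_enabled is a hypothetical helper rather than the project's actual function.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class FetcherConfig:
    """Simplified stand-in; the real class carries more fields."""

    name: str
    opt_in_env_var: Optional[str] = None
    opt_out_env_var: Optional[str] = None


def is_enabled(config: FetcherConfig, env: Mapping[str, str]) -> bool:
    """Opt-out wins; otherwise opt-in sources need their flag set to '1'."""
    if config.opt_out_env_var and env.get(config.opt_out_env_var) == "1":
        return False
    if config.opt_in_env_var:
        return env.get(config.opt_in_env_var) == "1"
    return True
```

With this shape, the IEEE plugin is on for an empty environment and off only when AUTOPAPERTOPPT_DISABLE_IEEE_SCRAPING=1 is present.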

OA resolver — three new strategies
-----------------------------------
Previous resolver: Unpaywall -> arXiv title search.

New chain (returns first hit):

  1. arXiv-ID direct  — derive https://arxiv.org/pdf/{id}.pdf from
     paper.arxiv_id when set. Zero network, highest precision.
     Strips trailing v<N> version suffix.
  2. Unpaywall by DOI — unchanged.
  3. Semantic Scholar OA index — query
     /graph/v1/paper/DOI:{doi}?fields=openAccessPdf. S2's index is
     partially disjoint from Unpaywall.
  4. CORE.ac.uk — 200M+ institutional/regional OA repos. Needs
     AUTOPAPERTOPPT_CORE_API_KEY (free), skipped silently otherwise.
  5. arXiv title search — unchanged, last resort.

Strategies 2-4 are data-driven through a _DOI_STRATEGIES tuple to
keep _resolve_one's cognitive complexity under the ruff/SonarQube
cap of 10.
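
The arXiv-ID shortcut and the data-driven DOI chain can be sketched as below. This is a synchronous illustration under stated assumptions: arxiv_pdf_url and first_hit are hypothetical names, and the real resolver's strategies are presumably async callables rather than plain functions.

```python
import re
from typing import Callable, Optional, Sequence

Strategy = Callable[[str], Optional[str]]


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build the PDF URL from an arXiv id, dropping a trailing v<N>."""
    bare = re.sub(r"v\d+$", "", arxiv_id.strip())
    return f"https://arxiv.org/pdf/{bare}.pdf"


def first_hit(doi: str, strategies: Sequence[Strategy]) -> Optional[str]:
    """Try each DOI strategy in order and return the first non-empty URL.

    Keeping the strategies in a tuple turns an if/elif ladder into one
    loop, which is what keeps the branching count of the caller low.
    """
    for strategy in strategies:
        try:
            url = strategy(doi)
        except Exception:  # a failing strategy must not stop the chain
            url = None
        if url:
            return url
    return None
```

Adding a sixth source then means appending one callable to the tuple instead of adding another branch.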

Tests
-----
- test_ieee.py / test_scholar.py: opt-in test renamed to opt-out,
  autouse fixture clears the DISABLE flag instead of setting ENABLE.
- test_mcp_tools.py: list_sources output schema updated
  (needs_env_var -> opt_in_env_var + opt_out_env_var).
- test_cli.py: top_tier_filter_on_by_default renamed to
  top_tier_filter_off_by_default; asserts the inverse.
- test_oa_resolver.py: +5 tests covering arxiv_id direct, S2
  fallback, CORE fallback, CORE-key-missing skip, arxiv_id version
  stripping.

467 tests pass, ruff + bandit clean.

Docs
----
configuration.md: env-var table updated for the flipped semantics
and the new CORE key. architecture.md: pipeline diagram and OA
resolution subsection updated to the 5-strategy chain. cli.md:
--top-tier-only row replaces --all-venues with revised default.
Two production-quality improvements on top of the new resolver +
default flips.

Semantic Scholar
----------------
- The OA resolver's _query_semantic_scholar now honours
  AUTOPAPERTOPPT_S2_API_KEY (sent as x-api-key header). Without the
  key the resolver hit the anonymous tier (~1 req/s) and rate-limited
  fast under burst. The source plugin already used the key; only the
  resolver's separate query path was missing it.
- Added an in-process _S2_CACHE dict keyed by DOI so re-querying the
  same paper within one run is a no-op. Negative results (not in OA
  index) are cached too. 429 responses are NOT cached (so the user
  can retry after setting the API key).

Google Scholar captcha
----------------------
Google's captcha is served as HTTP 200 with an HTML form, so the
old status-code-only checks missed it; the plugin just returned an
empty result and the pipeline retried for the rest of the run,
deepening the IP block.

- New _is_captcha_response detects the bot-check page by URL path
  (/sorry/) and known body markers ('unusual traffic',
  captcha-form, g-recaptcha, ...).
- New _engage_captcha_cooldown sets a process-level timestamp
  (default 30 min). Subsequent Scholar requests in the same run
  raise SourceUnavailableError immediately without issuing HTTP,
  so we don't burn the rate-limit budget once we know Google has
  flagged us.
- The error message now tells the user what to do: rotate IP (VPN),
  set AUTOPAPERTOPPT_DISABLE_SCHOLAR_SCRAPING=1, or wait.
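
The detection and cooldown logic can be sketched as follows. The marker list contains only the three body markers quoted above (the real list is longer), CaptchaCooldown is a hypothetical wrapper around the process-level timestamp, and SourceUnavailableError is redefined locally as a stand-in so the snippet is self-contained.

```python
import time
from urllib.parse import urlparse

_BODY_MARKERS = ("unusual traffic", "captcha-form", "g-recaptcha")


class SourceUnavailableError(RuntimeError):
    """Local stand-in for the project's exception of the same name."""


def is_captcha_response(url: str, body: str) -> bool:
    """A 200 response is still a bot check if the URL or body says so."""
    if "/sorry/" in urlparse(url).path:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in _BODY_MARKERS)


class CaptchaCooldown:
    """Remember when a captcha was seen and refuse requests until expiry."""

    def __init__(self, seconds: float = 30 * 60, clock=time.monotonic):
        self._seconds = seconds
        self._clock = clock
        self._until = None

    def engage(self) -> None:
        self._until = self._clock() + self._seconds

    def active(self) -> bool:
        return self._until is not None and self._clock() < self._until

    def guard(self) -> None:
        """Call before issuing HTTP; raises without touching the network."""
        if self.active():
            raise SourceUnavailableError(
                "Scholar captcha cooldown active: rotate IP, disable "
                "Scholar scraping, or wait."
            )
```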

Tests
-----
+5 tests: S2 cache hit, S2 x-api-key header sent when env var set,
captcha URL detection, captcha body markers, cooldown engages after
first hit + raises immediately on the second call.

471 tests pass, ruff + bandit clean.

Docs
----
configuration.md: AUTOPAPERTOPPT_S2_API_KEY row now mentions the
OA resolver path. Scholar / IEEE rows already documented from the
previous flip commit.
httpx-based Scholar scrapes get captcha'd within a few requests
because Google's detection flags non-browser User-Agents and the
predictable request shape. WebRunner (https://webrunner.readthedocs.io/)
drives a real Chrome via Selenium with the
--disable-blink-features=AutomationControlled flag, which survives
Google's standard heuristics.

How it works
------------
sources/scholar/webrunner_backend.py:

  is_available() — True iff je_web_runner is importable AND
  AUTOPAPERTOPPT_DISABLE_WEBRUNNER is not set. Cheap repeated calls.

  fetch_serp_html(query) — async wrapper that runs the Selenium
  block in asyncio.to_thread so the pipeline's event loop isn't
  blocked while Chrome boots (5-10 s cold). Uses Chrome with the
  --disable-blink-features=AutomationControlled flag, headless=new,
  lang=en-US, and a 1280x720 window, the same fingerprint as the
  WebRunner google_search.py example.

sources/scholar/fetcher.py — search() picks WebRunner when
is_available, falls back to httpx on any RuntimeError (e.g. Chrome
not installed). The httpx path keeps the captcha cooldown logic
from the previous commit as a second-line defence.
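
The backend selection reduces to the pattern below: run the blocking browser call in a worker thread, and drop to the HTTP path on RuntimeError. search_html and its injected callables are illustrative names, not the signatures in sources/scholar/fetcher.py.

```python
import asyncio
from typing import Awaitable, Callable


async def search_html(
    query: str,
    browser_fetch: Callable[[str], str],
    http_fetch: Callable[[str], Awaitable[str]],
    browser_available: Callable[[], bool],
) -> str:
    """Prefer the real-browser backend, fall back to HTTP on RuntimeError."""
    if browser_available():
        try:
            # The Selenium call blocks, so keep it off the event loop.
            return await asyncio.to_thread(browser_fetch, query)
        except RuntimeError:
            pass  # e.g. Chrome not installed; use the HTTP path instead
    return await http_fetch(query)
```

Only RuntimeError triggers the fallback here; other exception types propagate, which keeps genuine bugs visible.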

Optional extra
--------------
pyproject.toml gains a [scholar] extra:

  pip install autopapertoppt[scholar]    # adds je_web_runner

The [dev] extra also pulls it in so the test suite covers both
backends. Without the extra, Scholar falls back to the httpx
path silently — no behavioural regression.

Tests
-----
+3 tests in tests/sources/test_scholar.py:
  - WebRunner backend is used when is_available returns True
  - WebRunner failure (Chrome boot error) falls back to httpx
  - DISABLE_WEBRUNNER=1 makes is_available return False

The autouse fixture defaults DISABLE_WEBRUNNER=1 so existing tests
exercise the httpx path; only the three new tests opt in.

474 tests pass, ruff + bandit clean.

Docs
----
configuration.md: AUTOPAPERTOPPT_DISABLE_WEBRUNNER row added with
behaviour, install command for the [scholar] extra, and the CI
override use case.
Promoted je_web_runner from the [scholar] extra to the main
dependency list. The Scholar plugin's WebRunner-first behaviour
(captcha-resilient real-browser path) is now the default on every
'pip install autopapertoppt' install rather than gated behind
an opt-in extra.

Users without Chrome on PATH or who explicitly set
AUTOPAPERTOPPT_DISABLE_WEBRUNNER=1 still get the httpx scrape
fallback, including the captcha cooldown logic. No behavioural
regression — only the default capability widened.

Removed the now-redundant [scholar] extra from pyproject.toml.
Anyone who had 'pip install autopapertoppt[scholar]' in their
deploy gets je_web_runner via the base dep set; the extra-less
install is now strictly more capable.

Docs updated: configuration.md no longer references the extra,
just the DISABLE_WEBRUNNER opt-out and its use cases (CI without
a Chrome binary, latency-sensitive runs).
…olar

Google's bot detection rejects WebRunner requests too once an IP is
flagged from earlier scraping. The reliable workaround is to seed a
real Google sign-in into a Chrome profile dir once; subsequent
headless runs reuse the session cookie and Google trusts the
authenticated request.

Two new env vars in sources/scholar/webrunner_backend.py:

  AUTOPAPERTOPPT_CHROME_PROFILE_DIR
    When set, passes --user-data-dir=<path> to Chrome so cookies,
    login state, and captcha clearance survive across CLI runs.

  AUTOPAPERTOPPT_CHROME_HEADLESS
    Default '1' (headless). Set '0' for the one-time interactive
    sign-in that seeds the profile dir. When non-headless the
    Chrome window holds open for 60 s (vs the 3 s headless wait)
    so the user has time to log in / accept consent banners /
    solve any captcha.

_build_chrome_args extracted from _drive_chrome_sync so the args
construction is unit-testable without spinning up Chrome.
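
As of this commit, the argument construction might look roughly like the sketch below (a later commit in this PR removes the headless switch). The function name build_chrome_args, the returned wait value, and the exact flag order are assumptions; the flags themselves and the 3 s / 60 s waits are the ones quoted in these commit messages.

```python
from typing import List, Mapping, Tuple


def build_chrome_args(env: Mapping[str, str]) -> Tuple[List[str], int]:
    """Return (chrome_args, wait_seconds) for the given environment."""
    headless = env.get("AUTOPAPERTOPPT_CHROME_HEADLESS", "1") != "0"
    args = [
        "--disable-blink-features=AutomationControlled",
        "--lang=en-US",
        "--window-size=1280,720",
    ]
    if headless:
        args.append("--headless=new")
    profile_dir = env.get("AUTOPAPERTOPPT_CHROME_PROFILE_DIR")
    if profile_dir:
        # Persist cookies and login state across CLI runs.
        args.append(f"--user-data-dir={profile_dir}")
    # Interactive runs hold the window open long enough to sign in.
    return args, (3 if headless else 60)
```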

Tests
-----
+3 tests covering: default headless + no profile, profile dir
passed as --user-data-dir, HEADLESS=0 drops the headless flag.

477 tests pass, ruff + bandit clean.

Docs
----
configuration.md gains two env var rows + a new 'Suppressing
Scholar captchas with a persistent Chrome profile' recipe walking
through the one-time interactive setup and the steady-state
headless usage, with caveats about profile-dir locking, cookie
expiry, and treating the profile dir as a secret.
The interactive-login phase opens Chrome visibly for the user to
sign into Google. Users naturally close the Chrome window the
moment login completes (which is what we want — the cookies are
already persisted to the profile dir at that point) but closing
the window invalidates the WebDriver session, so the subsequent
page_source access raises or returns None and the parser crashes
with AttributeError.

Wrap the page_source access in a defensive try; on None or
exception return an empty SERP HTML shell that the parser
handles as 'valid but zero results'. The cookie store under
--user-data-dir is already on disk before any of this code runs,
so the interactive seed has done its job even when page_source
fails.
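
A minimal sketch of the defensive read, assuming only that the driver exposes Selenium's page_source attribute. EMPTY_SERP_HTML and safe_page_source are hypothetical names, and the exact shell markup the parser expects is not given in the commit.

```python
EMPTY_SERP_HTML = "<html><body></body></html>"


def safe_page_source(driver) -> str:
    """Return page_source, or an empty shell if the window was closed."""
    try:
        html = driver.page_source
    except Exception:  # session invalidated when the user closed Chrome
        return EMPTY_SERP_HTML
    return html if html else EMPTY_SERP_HTML
```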

User-facing log message updated to call out that the user MAY
close the window once logged in.
Two changes pulling the WebRunner story together.

IEEE via WebRunner
------------------
New sources/ieee/webrunner_backend.py drives a visible Chrome to
the IEEE Xplore origin and runs the existing /rest/search POST
inside the real-browser context via execute_async_script. IEEE's
endpoint blocks httpx-style POSTs (TLS handshake / JS-engine
fingerprint) but accepts the same request when it originates from
a real Chrome page that already holds the session cookies set by
the home page.

Flow per search:
  1. Boot visible Chrome
  2. Navigate to https://ieeexplore.ieee.org/Xplore/home.jsp
     (sets the session cookies the REST endpoint requires)
  3. execute_async_script -> fetch('/rest/search', POST, body)
     (right cookies, right Origin, right fingerprint)
  4. Parse the returned JSON with the existing parse_search_record

fetch_by_id similarly navigates to /document/<arnumber> and
returns page_source for the existing HTML parser.
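
The search flow above can be sketched against any object that offers Selenium's get, set_script_timeout and execute_async_script methods. The request payload shape and the "records" response key are assumptions made for illustration; search_in_browser is a hypothetical name.

```python
import json
from typing import Any, Dict, List

IEEE_HOME = "https://ieeexplore.ieee.org/Xplore/home.jsp"

# Selenium passes its completion callback as the last script argument.
_FETCH_JS = """
const [path, body, done] = arguments;
fetch(path, {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  credentials: 'include',
  body: body,
}).then(r => r.text()).then(done).catch(e => done(''));
"""


def search_in_browser(driver, payload: Dict[str, Any]) -> List[dict]:
    """POST to /rest/search from inside the IEEE origin and parse JSON."""
    driver.set_script_timeout(30)
    driver.get(IEEE_HOME)  # sets the session cookies the endpoint needs
    raw = driver.execute_async_script(
        _FETCH_JS, "/rest/search", json.dumps(payload)
    )
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return []  # blocked page or empty reply instead of JSON
    records = data.get("records", []) if isinstance(data, dict) else []
    return records if isinstance(records, list) else []
```

Because the fetch runs inside the page, the browser attaches the cookies and Origin header itself.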

sources/ieee/fetcher.py gains _scrape_search and _scrape_document
helpers that try WebRunner first (when je_web_runner is importable
+ AUTOPAPERTOPPT_DISABLE_WEBRUNNER is not set) and fall back to
the httpx scrape on RuntimeError. The API path (with
AUTOPAPERTOPPT_IEEE_API_KEY) is unchanged.

Visible-only
------------
Per the new policy WebRunner never runs headless — Google's
detection is more aggressive against headless signatures, and the
user wanted to see what's happening. Removed:

  - AUTOPAPERTOPPT_CHROME_HEADLESS env var (no longer read)
  - the _build_chrome_args tuple return (no more headless flag
    to communicate)
  - the conditional interactive-vs-headless wait logic

Replaced with a single 15-second visible page-load wait; users
can close the window early if the page is ready.

ACM intentionally not touched — it queries Crossref API (no
scraping), so WebRunner adds nothing.

Tests
-----
- IEEE: +2 tests covering WebRunner-routed search + httpx
  fallback on WebRunner RuntimeError. Autouse fixture defaults
  DISABLE_WEBRUNNER=1 so the existing httpx-transport tests
  still exercise their intended path.
- Scholar: existing _build_chrome_args tests rewritten for the
  always-visible signature.

478 tests pass, ruff + bandit clean.

Docs
----
configuration.md updated: DISABLE_WEBRUNNER now covers both
Scholar and IEEE; CHROME_PROFILE_DIR works for both; CHROME_HEADLESS
row removed entirely.
The real bottleneck for IEEE / ACM / Springer / Elsevier / Wiley
papers is the PDF download step, not the search. httpx-style GETs to
those publisher CDNs reliably 403 because they fingerprint the TLS
handshake + JS engine; the search-via-Crossref (ACM) or
search-via-WebRunner (IEEE) paths already work fine.

New autopapertoppt/fetchers/webrunner_pdf.py provides:

  is_available()         - selenium importable + DISABLE_WEBRUNNER not set
  should_use_webrunner() - host suffix check against a curated list
                           of paywalled publisher CDNs
  download_via_browser() - boot Chrome with download prefs, navigate,
                           wait for the file to land in the configured
                           temp dir, validate %PDF magic, move to target

Chrome is configured to save PDFs straight to disk
(plugins.always_open_pdf_externally + download.default_directory)
instead of opening the built-in PDF viewer; otherwise we couldn't
grab the bytes. The profile-dir env var the rest of WebRunner uses
is honoured here too, so institutional auth cookies surface
subscription PDFs.

pdf_download.py routes paywalled URLs through this path first; on
False (Chrome failed to boot, navigation failed, timeout, magic-byte
mismatch) it falls back to the existing httpx path. Other URLs
(arxiv, NCBI PMC, institutional repos) skip the WebRunner path
entirely and use httpx directly.
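
The routing check and the download preferences could look like the sketch below. The host list is a guess at plausible suffixes for the publishers named in the tests, not the curated list in webrunner_pdf.py; the two preference keys are the ones named above, and chrome_download_prefs is a hypothetical helper.

```python
from typing import Dict
from urllib.parse import urlparse

# Assumed suffixes, for illustration only.
PAYWALLED_HOST_SUFFIXES = (
    "ieeexplore.ieee.org",
    "dl.acm.org",
    "link.springer.com",
    "sciencedirect.com",
    "onlinelibrary.wiley.com",
    "academic.oup.com",
    "nature.com",
    "science.org",
)


def should_use_webrunner(url: str) -> bool:
    """True when the URL's host is, or is a subdomain of, a listed host."""
    try:
        host = (urlparse(url).hostname or "").lower()
    except ValueError:
        return False
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in PAYWALLED_HOST_SUFFIXES
    )


def chrome_download_prefs(download_dir: str) -> Dict[str, object]:
    """Prefs that make Chrome save PDFs to disk instead of viewing them."""
    return {
        "download.default_directory": download_dir,
        "plugins.always_open_pdf_externally": True,
    }
```

Matching on a leading dot avoids treating a look-alike such as notnature.com as a publisher host.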

ACM searches still go through Crossref REST API (no scraping needed
or wanted there), but ACM PDF URLs (dl.acm.org/...) now route
through WebRunner.

Tests
-----
+8 tests in tests/test_webrunner_pdf.py covering:
  - paywalled-host detection for IEEE/ACM/Springer/Elsevier/Wiley/OUP/Nature/Science
  - non-paywalled hosts (arxiv, NCBI, generic) skip WebRunner
  - garbage URLs return False
  - DISABLE_WEBRUNNER=1 short-circuits is_available
  - integration: pdf_download routes paywalled paper through WebRunner
  - integration: arxiv paper bypasses WebRunner

496 tests pass, ruff + bandit clean.

Docs
----
configuration.md's DISABLE_WEBRUNNER row now lists the full set of
paywalled publisher hosts the PDF path covers.
Two bugs the visible-Chrome run made obvious.

Bug 1: je_web_runner singleton race
-----------------------------------
The Scholar and IEEE WebRunner backends both imported the
module-level singleton webdriver_wrapper_instance and both
called set_driver(...) against it. When the pipeline fans out
sources via asyncio.gather, the two backends ran concurrently and
stomped on the same singleton — one Chrome became orphaned, the
other's execute_async_script / page_source read a window in the
wrong state. The symptom was silent: no exception, no log, just
Chrome stuck on IEEE Xplore home page with zero records returned.

Fix: new shared helper autopapertoppt/fetchers/webrunner_browser.py
that uses selenium.webdriver.Chrome directly, spinning a FRESH
driver per call. No shared state across backends; each call owns
and quits its own Chrome.
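
The ownership rule (one driver per call, always quit) fits a small context manager. In this sketch the factory is injected; in practice it would presumably wrap selenium.webdriver.Chrome, and fresh_driver is an illustrative name rather than the helper's actual API.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def fresh_driver(factory: Callable[[], object]) -> Iterator[object]:
    """Create a driver for this call only and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        try:
            driver.quit()
        except Exception:
            pass  # window already closed by the user; nothing to clean up
```

Concurrent backends each enter their own `with fresh_driver(...)` block, so no call can observe another call's window.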

Both source backends + the PDF download path now route through
make_driver() instead of the je_web_runner singleton. The
je_web_runner dependency stays in pyproject (it's a base dep) but
the project no longer uses its singleton.

Bug 2: captcha doesn't wait for the user
----------------------------------------
The Scholar backend used a fixed 15s sleep after navigation, then
read page_source. If Google served a captcha, the user couldn't
solve it in 15s and the next iteration just dropped the source.

Fix: webrunner_browser.wait_for_captcha_solved polls every 2s for
up to 5 minutes. The poll checks URL fragments (/sorry/, /captcha,
/recaptcha) AND body markers ('unusual traffic', captcha-form,
g-recaptcha, etc.). When the captcha clears (user solved it,
Google redirected back to the real SERP), the loop returns True
and the caller reads page_source. Timeout → returns False, source
falls through to the empty-SERP shell path.
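
The polling loop can be sketched as below, with sleep and clock injectable so it can be exercised without waiting. Only the markers quoted above are included, and looks_like_captcha is a hypothetical predicate name.

```python
import time

_URL_MARKERS = ("/sorry/", "/captcha", "/recaptcha")
_BODY_MARKERS = ("unusual traffic", "captcha-form", "g-recaptcha")


def looks_like_captcha(url: str, body: str) -> bool:
    lowered = body.lower()
    return any(m in url for m in _URL_MARKERS) or any(
        m in lowered for m in _BODY_MARKERS
    )


def wait_for_captcha_solved(
    driver,
    timeout: float = 300.0,
    interval: float = 2.0,
    sleep=time.sleep,
    clock=time.monotonic,
) -> bool:
    """Poll until the bot-check page is gone; False when time runs out."""
    deadline = clock() + timeout
    while True:
        if not looks_like_captcha(driver.current_url, driver.page_source or ""):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```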

Same wait now applies to IEEE — if IEEE serves a 'verify you're
human' page (some regions / IPs), the user has the same 5-minute
window to clear it.

Also added: IEEE WebRunner now logs the record count it received
so future debugging doesn't have the 'no log line at all' mystery.

496 tests pass, ruff + bandit clean.
Move Code Quality Requirements (290 lines), Project-Specific Compliance
Patterns, Slide Deck Rules, and Environment / env-vars sections out of
the always-loaded CLAUDE.md (671 -> 94 lines) into four new reference
subagents under .claude/agents/: code-quality-reviewer, compliance-auditor,
slide-deck-rules, env-vars. CLAUDE.md keeps the overview, Definition of
Done pointer, Git Commit hygiene, and a condensed IEEE / publisher CDN
"browser automation mandatory" HARD RULE summary that points at the
compliance-auditor subagent for the full audit checklist.

Also add the IEEE WebRunner sanity gate to dod-verify (grep for
--headless / direct httpx POSTs to ieeexplore /rest/search outside the
webrunner_backend), update AGENTS.md "Where to look for the rest" to
point at the new subagents, and update tests/test_agents_md.py to
treat CLAUDE.md + every .claude/agents/*.md as one combined Claude-rules
document so the mirror-with-AGENTS.md tests still pass after the split.

.gitignore: add chrome_profile / chrome-profile / chrome_profiles /
selenium-debug.log / *.crdownload so WebRunner-driven Selenium runs
don't accidentally commit browser state.
…DF download

The CLI's built-in WebRunner backend boots Chrome from inside asyncio.gather,
which works for unattended runs but takes away the LLM agent's ability to make
per-step decisions. These scripts are the LLM-as-agent path: the LLM in the
editor session invokes them via Bash, the script opens a visible Chrome
window (no headless), and the LLM decides URLs / inspects HTML / picks
papers / drives downloads.

Search + parse:
* scripts/llm_driven_search.py -- boot Chrome, hit Scholar SERP for a chosen
  query, JS-fetch IEEE /rest/search from inside the IEEE origin, dump
  HTML + JSON to exports/_llm_scratch/ for the next step to read.
* scripts/llm_parse_results.py -- read the dumped artefacts, parse via the
  project's scholar / ieee parsers, dedup + rank + export xlsx + md.

Per-publisher PDF download (shared helpers in scripts/_pdf_downloaders.py,
thin single-paper CLIs reusing them):
* download_ieee  -- /document/<arnumber> sets cookies, /stamp/stamp.jsp
  serves an iframe wrapper, fallback extracts the iframe src
  (/stampPDF/getPDF.jsp) and lets Chrome's plugins.always_open_pdf_externally
  auto-download.
* download_acm   -- /doi/<doi> cookies, /doi/pdf/<doi> streams the PDF.
* download_springer -- /article/<doi> with /chapter/<doi> fallback,
  /content/pdf/<doi>.pdf streams the PDF.
* dispatch_for_url -- pick the right downloader by URL host; pulls DOI
  out of the URL when the xlsx DOI column is empty.
* CLIs: scripts/llm_download_ieee_pdf.py <arnumber>,
        scripts/llm_download_acm_pdf.py  <doi>,
        scripts/llm_download_springer_pdf.py <doi>.

Batch driver:
* scripts/llm_download_pdfs.py <xlsx> -- read the Papers sheet, group by
  publisher, open ONE Chrome session, walk all rows, summarise ok/fail.
  Idempotent: papers whose canonical <id>.pdf already exists and validates
  skip immediately. Validated end-to-end on a test-time-compute-scaling
  query: 7/7 PDFs (6 IEEE + 1 ACM, 10.8 MB total), all %PDF-1.4..1.6 head
  + %%EOF tail. Re-run is full cache-hit.

PDF integrity validation: _is_valid_pdf checks both %PDF- magic header AND
%%EOF tail marker to reject HTML "Sign in / Get access" gates that
publishers serve to unauthenticated visitors.
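
A sketch of the two-sided integrity check. The 1024-byte tail window is an assumption (the commit does not say how far from the end %%EOF may sit), and is_valid_pdf here is an illustration rather than the script's _is_valid_pdf.

```python
from pathlib import Path


def is_valid_pdf(path, tail_bytes: int = 1024) -> bool:
    """True when the file starts with %PDF- and has %%EOF near its end."""
    target = Path(path)
    try:
        size = target.stat().st_size
        with target.open("rb") as handle:
            head = handle.read(5)
            handle.seek(max(0, size - tail_bytes))
            tail = handle.read()
    except OSError:
        return False
    return head == b"%PDF-" and b"%%EOF" in tail
```

An HTML login gate fails the header test, and a download cut off midway fails the tail test.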

New downloads are detected by baselining the directory with _snapshot_pdfs
and set-diffing in _wait_for_new_pdf, instead of deleting all .pdf files at
each step, so a batch of 7 papers ends with all 7 PDFs on disk rather than
only the last successful one.
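
The snapshot-and-diff idea, sketched with injectable sleep and clock. Function names mirror the ones mentioned above without the leading underscore, but the signatures, timeout and polling interval are assumptions.

```python
import time
from pathlib import Path
from typing import Optional, Set


def snapshot_pdfs(directory) -> Set[str]:
    """Names of the finished PDFs currently in the download directory."""
    return {p.name for p in Path(directory).glob("*.pdf")}


def wait_for_new_pdf(
    directory,
    before: Set[str],
    timeout: float = 60.0,
    interval: float = 0.5,
    sleep=time.sleep,
    clock=time.monotonic,
) -> Optional[Path]:
    """Return the first PDF that was not in `before`, or None on timeout."""
    deadline = clock() + timeout
    while True:
        new_names = snapshot_pdfs(directory) - before
        if new_names:
            return Path(directory) / sorted(new_names)[0]
        if clock() >= deadline:
            return None
        sleep(interval)
```

Chrome's in-progress *.crdownload files do not match the *.pdf glob, so a file is only reported once Chrome has renamed it.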
Repo root now has only the English README.md plus a readmes/ directory
containing the 13 translated copies (de, es, fr, hi, id, it, ja, ko,
pt, ru, vi, zh-CN, zh-TW). The English README's language picker
points at readmes/README.<lang>.md and each translated README's
English link points back at ../README.md.

Each docs/<lang>/index.rst that mentioned README.<lang>.md by name
is updated to reference readmes/README.<lang>.md so the Sphinx-rendered
docs page links to the file's real path.

tests/test_i18n.py: update readme_overrides + the default-path
formatter to look under readmes/, and update the zh-TW path entry
in test_zh_tw_files_use_traditional_chinese_vocabulary so the
Simplified-Chinese vocab guard keeps protecting the moved file.
The LLM-as-agent path was running searches that involved IEEE / Scholar /
paywalled publishers without first checking whether the user had VPN
or institutional access. Without VPN: Scholar captchas within a few
requests, IEEE serves abstract-only / 403 for the PDF stage, and the
user spends minutes watching a useless Chrome window.

CLAUDE.md HARD RULE now requires the agent to confirm VPN status
(recall from conversation, or AskUserQuestion) BEFORE running
'python -m autopapertoppt -q ...' or any scripts/llm_*.py invocation.
When the user says no VPN, the source mix is restricted to open
sources (arxiv,openalex,pubmed,crossref,dblp,openaire).

compliance-auditor: the in-practice list now leads with the VPN
confirmation step before the WebRunner-vs-httpx rules.

paper-summary-author: source-level browser-automation rule prefaced
with a VPN gate paragraph naming AskUserQuestion as the canonical
asking mechanism.
The previous commit bundled ieee and scholar together as the
"skip when no VPN" set. That was wrong: Google Scholar is publicly
accessible (no subscription / institutional auth needed), so it
stays in the source mix even when the user has no VPN. Only IEEE
drops, because IEEE's /rest/search + PDF stamp.jsp paths require
institutional access for anything useful.

Updated source mix when no VPN:
  arxiv,openalex,pubmed,crossref,dblp,openaire,scholar
(was: arxiv,openalex,pubmed,crossref,dblp,openaire — scholar was
incorrectly excluded).
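
Expressed as code, the corrected gate is a one-line filter. The no-VPN tuple is copied from the list above; appending only ieee for the VPN case is an assumption about the full default mix, and sources_for_run is a hypothetical helper.

```python
from typing import Tuple

NO_VPN_SOURCES: Tuple[str, ...] = (
    "arxiv", "openalex", "pubmed", "crossref", "dblp", "openaire", "scholar",
)


def sources_for_run(has_vpn: bool) -> Tuple[str, ...]:
    """Scholar is public and always stays; only ieee needs VPN access."""
    return NO_VPN_SOURCES + ("ieee",) if has_vpn else NO_VPN_SOURCES
```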

Chrome still boots for the scholar branch, for resilience against
Google's captcha, but the SERP itself works fine without VPN.

CLAUDE.md HARD RULE summary, compliance-auditor in-practice rule #1,
and paper-summary-author VPN-gate paragraph all corrected to name
IEEE / ACM / Springer (paywalled publishers) as the trigger and
to keep scholar in the no-VPN mix.
…sked for a deck

Decision tree already said "rich thesis-style is the default deliverable",
but the rule was being interpreted as aspirational. Add an explicit
"Default CLI invocation" subsection that gives the canonical command
shape and names --lightweight / --no-pdf as anti-patterns when the user
asks for a real deck. Those flags are now only valid for ad-hoc smoke
tests, CLI-regression debugging, or when the user explicitly names them.

Also append the matching bullet to the Anti-patterns section so a quick
skim catches the rule.
…end-to-end runbook

scripts/_pdf_downloaders.py:
  Add download_arxiv (open access; /pdf/<id>.pdf direct), download_aclanthology
  (open; landing-URL + .pdf), download_neurips (open; hash/<id>-Abstract → file/<id>-Paper-Conference.pdf
  swap), download_openreview (open; forum?id → pdf?id). Split dispatch_for_url into
  _dispatch_paywalled / _dispatch_open_access / _dispatch_by_doi_prefix to stay under
  the cognitive-complexity cap. DOI-prefix fallback (10.1109 → IEEE, 10.1145 → ACM,
  10.1007 / 10.1038 → Springer) catches opaque hosts (openalex.org / semanticscholar.org)
  where the URL alone is uninformative.
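
Two of the smaller pieces above, sketched in isolation: the DOI-prefix fallback table and the OpenReview forum-to-pdf rewrite. Function names are illustrative, and the prefix table holds only the four prefixes listed in this commit message.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

_DOI_PREFIX_TO_PUBLISHER = {
    "10.1109": "ieee",
    "10.1145": "acm",
    "10.1007": "springer",
    "10.1038": "springer",
}


def publisher_from_doi(doi: str) -> Optional[str]:
    """Map a DOI to a downloader name by its registrant prefix."""
    prefix = doi.strip().lower().split("/", 1)[0]
    return _DOI_PREFIX_TO_PUBLISHER.get(prefix)


def openreview_pdf_url(forum_url: str) -> Optional[str]:
    """Rewrite .../forum?id=<id> to .../pdf?id=<id>; None if not a match."""
    parsed = urlparse(forum_url)
    ids = parse_qs(parsed.query).get("id")
    if parsed.hostname != "openreview.net" or not ids:
        return None
    return f"https://openreview.net/pdf?id={ids[0]}"
```

The prefix lookup is what lets a row whose URL points at openalex.org still reach the IEEE or ACM downloader.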

scripts/llm_download_pdfs.py:
  Wire the new handlers via _invoke(publisher, args, driver, out_dir). identifier
  in the plan is now a tuple so NeurIPS's (year, hash) rides through cleanly.

paper-summary-author.md:
  Add "End-to-end runbook (search → rich deck) — DO NOT ASK MID-FLIGHT" with
  five phases (discovery + CLI primary attempt → no-pdf_url fallback → rich
  authoring → audits → optional commit) and a decision-rules table covering
  every failure point so future sessions don't pause for user input mid-run.

scripts/_overflow_check.py:
  Headless overflow inspector for one or more .pptx files. Honours TEXT_TO_FIT_SHAPE /
  SHAPE_TO_FIT_TEXT auto-fit modes so shrink_to_fit textboxes don't false-flag.
  Excludes page_number / footer shapes from the FOOTER_GUARD = 7.05" check
  since those shapes ARE the footer band.

scripts/_dump_pdf_text.py, scripts/_inspect_xlsx.py:
  Internal scratch utilities the LLM-as-agent flow uses to read PDFs / xlsx
  rows when the Read tool can't decode them directly. Underscored to mark
  them as not user-facing CLIs.
…tw rich decks

Four papers from the "speculative decoding LLM inference" query, each authored
by reading the full PDF (downloaded via scripts/llm_download_pdfs.py — ACL,
arXiv, IEEE, NeurIPS dispatchers respectively):

  - xia2024unlocking      — ACL 2024 Findings survey (Spec-Bench)
  - spector2023accelerating — ICML 2023 staged speculative decoding (3.16x)
  - xu2024edgellm         — IEEE TMC 2025 on-device speculative (9.3x)
  - svirschevski2024specexec — NeurIPS 2024 RAM/SSD offload + parallel spec

Each Paper carries a rich PaperSummary with pain_points (4 quadrants),
research_question, contributions_detailed (capped at 4 per the anti-pattern),
headline_metrics (5-6 KPIs), technique_table, method_sections, evaluation_sections,
system_flow (5-6 stages), 3 RQs with rq_results tables, core_observation,
limitations and future_work — all in Traditional Chinese.

URLs / venues copied verbatim from the source xlsx; URL audit confirms 4/4
match. Overflow check passes for all 4 decks (headless inspector now respects
auto_size = TEXT_TO_FIT_SHAPE / SHAPE_TO_FIT_TEXT, so shrink_to_fit textboxes
don't false-flag).

A 5th xlsx row (OpenAlex W4405717632 with IEEE DOI 10.1109/tmc.2024.3513457)
is an OpenAlex wrapper of the same paper as xu2024edgellm; the IEEE DOI alone
cannot be resolved to an arnumber, so the dispatcher (correctly) skips it.

Output: exports/speculative-decoding-zh-tw/<key>-zh-tw.pptx × 4.
@JE-Chen JE-Chen merged commit 7458d9c into main May 19, 2026
12 checks passed