feat: add structured archive export #17

Draft
chboishabba wants to merge 1 commit into simwai:master from chboishabba:codex/structured-sqlite-export

chboishabba commented May 15, 2026

Summary

This PR adds a structured archive layer for Perplexity exports before Markdown/vector indexing, and improves long-thread extraction beyond the initial captured API page.

The exporter now writes itir.perplexity.thread.v1 JSON with normalized messages, stable source thread/message IDs, thread metadata, and captured API provenance. Markdown export remains available as an optional sidecar, and the existing vector/RAG flow can still be enabled when Markdown sidecars are present.
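To make the shape of such an export concrete, here is a minimal sketch of what an itir.perplexity.thread.v1 document and its message normalization might look like. Only the schema name and the /rest/thread/<id> path come from this description; every field name, the `RawEntry` shape, and the `buildThread` helper are assumptions for illustration and may not match the actual implementation.

```typescript
// Hypothetical sketch: field names are assumptions, not the PR's real schema.
type Role = "user" | "assistant";

interface StructuredMessage {
  sourceMessageId: string; // stable ID derived from the source entry
  role: Role;
  text: string;
  index: number; // position within the thread
}

interface StructuredThread {
  schema: "itir.perplexity.thread.v1";
  sourceThreadId: string;
  title: string | null;
  messages: StructuredMessage[];
  provenance: {
    capturedFrom: string; // API path the entries came from
    capturedAt: string; // ISO 8601 timestamp
    rawEntries: unknown[]; // untouched entries, kept for auditing
  };
}

// Assumed shape of one captured API entry (one question/answer turn).
interface RawEntry {
  uuid: string;
  query: string;
  answer: string;
}

function buildThread(
  threadId: string,
  title: string | null,
  entries: RawEntry[],
  capturedAt: string,
): StructuredThread {
  const messages: StructuredMessage[] = [];
  for (const entry of entries) {
    // Each entry yields one user message and one assistant message.
    messages.push({
      sourceMessageId: `${entry.uuid}:user`,
      role: "user",
      text: entry.query.trim(),
      index: messages.length,
    });
    messages.push({
      sourceMessageId: `${entry.uuid}:assistant`,
      role: "assistant",
      text: entry.answer.trim(),
      index: messages.length,
    });
  }
  return {
    schema: "itir.perplexity.thread.v1",
    sourceThreadId: threadId,
    title,
    messages,
    provenance: {
      capturedFrom: `/rest/thread/${threadId}`,
      capturedAt,
      rawEntries: entries,
    },
  };
}
```

Deriving message IDs from the source entry ID, rather than from position or content, is what would let a re-export of the same thread map onto the same rows downstream.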

Why

The current Markdown-first flow is useful for reading and local vector search, but it makes dedupe, reimport, pagination validation, and downstream archive integrations harder than they need to be. A canonical thread/message export gives SQLite/MyChatArchive/ITIR-style tools a stable source of truth, while Markdown and vector indexes can be regenerated from that canonical layer.

A major practical issue is long Perplexity threads. The first browser-captured /rest/thread/<id> response can contain only the first page of entries. Without structured IDs and pagination checks, it is hard to know whether a run captured the whole thread, only page one, or duplicated page one repeatedly.

Pagination / long-thread behavior

This PR changes thread extraction so it does not stop at the first captured thread API response when Perplexity reports more pages:

  • captures the exact /rest/thread/<thread-id> response for the conversation being exported
  • follows next_cursor from page context with authenticated browser cookies
  • appends additional entries when Perplexity returns genuinely new pages
  • tracks entry identities across pages
  • stops pagination if Perplexity replays only duplicate entries, instead of inflating the archive with repeated page-one content
  • keeps the raw API response/entries in the structured export so downstream tools can audit what was captured

Keeping the raw data matters because this PR does not attempt to bypass Perplexity or Cloudflare protections, and it does not guarantee that pagination against the private web-app API will keep working. It makes successful pagination useful, and it makes failed or replayed pagination detectable and safe.
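The cursor-following and duplicate-page behavior described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code in this PR: the `Entry`, `Page`, and `FetchPage` types, the `collectEntries` name, and the `maxPages` safety limit are all hypothetical, and the real fetcher would issue the request from the page context with the browser's cookies.

```typescript
// Hypothetical types: the real API response shape is not shown in this PR text.
interface Entry {
  id: string;
}

interface Page {
  entries: Entry[];
  nextCursor: string | null;
}

// Injected fetcher, so the loop can be tested without a browser.
type FetchPage = (cursor: string) => Promise<Page>;

interface PaginationResult {
  entries: Entry[];
  pagesFetched: number;
  stoppedOn: "end" | "duplicate-page" | "page-limit";
}

async function collectEntries(
  firstPage: Page,
  fetchPage: FetchPage,
  maxPages = 50, // assumed safety limit against endless cursors
): Promise<PaginationResult> {
  const seen = new Set<string>();
  const entries: Entry[] = [];

  // Appends only entries whose ID has not been seen; returns how many were new.
  const addNew = (page: Page): number => {
    let added = 0;
    for (const entry of page.entries) {
      if (!seen.has(entry.id)) {
        seen.add(entry.id);
        entries.push(entry);
        added++;
      }
    }
    return added;
  };

  addNew(firstPage);
  let cursor = firstPage.nextCursor;
  let pagesFetched = 1;

  while (cursor !== null) {
    if (pagesFetched >= maxPages) {
      return { entries, pagesFetched, stoppedOn: "page-limit" };
    }
    const page = await fetchPage(cursor);
    pagesFetched++;
    // A page with no new IDs is treated as a replay: stop instead of looping.
    if (addNew(page) === 0) {
      return { entries, pagesFetched, stoppedOn: "duplicate-page" };
    }
    cursor = page.nextCursor;
  }
  return { entries, pagesFetched, stoppedOn: "end" };
}
```

Returning the stop reason alongside the entries lets a caller record whether a thread was fully captured or cut short by a replayed page, which is the detectability this section is after.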

For cases where Perplexity's private API still only yields the first page but the UI download contains more content, this PR also adds a downloaded Markdown recovery path. npm run bundle:perplexity-downloads converts downloaded Markdown chunks into the same structured JSON shape so the recovered turns can attach to the same canonical thread downstream.

Changes

  • Adds structured JSON export by default via EXPORT_STRUCTURED_JSON=true.
  • Makes Markdown export optional via EXPORT_MARKDOWN=true.
  • Adds normalized user/assistant message extraction from Perplexity API entries.
  • Adds authenticated cursor pagination for thread detail responses.
  • Adds duplicate-page detection so replayed first pages do not create fake extra messages.
  • Adds bundle:perplexity-downloads for converting downloaded Markdown chunks into the same structured JSON shape.
  • Defaults to headful browser mode because Cloudflare/Turnstile makes headless mode unreliable for this flow.
  • Adds unit coverage for config, structured file writes, pagination, duplicate-page replay, filename truncation, and downloaded Markdown bundling.
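As a rough illustration of how the export flags above might be read, the sketch below parses them from an environment-like object. EXPORT_STRUCTURED_JSON and EXPORT_MARKDOWN are named in this description; the HEADLESS variable, the `ExportConfig` shape, the helper names, and the assumption that Markdown is off unless enabled are all hypothetical.

```typescript
// Hypothetical config shape; not the PR's actual config module.
interface ExportConfig {
  structuredJson: boolean;
  markdown: boolean;
  headless: boolean;
}

// Treats "true" and "1" (case-insensitive) as enabled; unset or empty uses the fallback.
function parseFlag(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === "") return fallback;
  const normalized = value.trim().toLowerCase();
  return normalized === "true" || normalized === "1";
}

function loadExportConfig(env: Record<string, string | undefined>): ExportConfig {
  return {
    // Structured JSON is described as on by default.
    structuredJson: parseFlag(env.EXPORT_STRUCTURED_JSON, true),
    // Assumption: the Markdown sidecar is written only when requested.
    markdown: parseFlag(env.EXPORT_MARKDOWN, false),
    // Assumption: HEADLESS is a made-up variable name; headful is the default.
    headless: parseFlag(env.HEADLESS, false),
  };
}
```

In a Node.js script this would typically be called as `loadExportConfig(process.env)`; taking the environment as a parameter keeps the parsing easy to unit test.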

Validation

  • npm run type-check
  • npm run test:unit (25 tests passed)
