feat: add structured archive export by chboishabba · Pull Request #17 · simwai/perplexity-ai-export

chboishabba · 2026-05-15T13:17:06Z

Summary

This PR adds a structured archive layer for Perplexity exports before Markdown/vector indexing, and improves long-thread extraction beyond the initial captured API page.

The exporter now writes itir.perplexity.thread.v1 JSON with normalized messages, stable source thread/message IDs, thread metadata, and captured API provenance. Markdown export remains available as an optional sidecar, and the existing vector/RAG flow can still be enabled when Markdown sidecars are present.
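To make the description above concrete, here is a minimal sketch of what an `itir.perplexity.thread.v1` document might look like as TypeScript types. Only the schema identifier and the `/rest/thread/<thread-id>` path come from this PR description; every field name (`sourceThreadId`, `sourceMessageId`, `provenance`, and so on) and the `buildThread` helper are assumptions for illustration, not the PR's actual schema.

```typescript
// Hypothetical shape of a structured thread export.
// Field names are assumptions; only the schema string appears in the PR text.
interface StructuredMessage {
  sourceMessageId: string; // stable ID derived from the Perplexity entry
  role: "user" | "assistant";
  text: string;
}

interface StructuredThread {
  schema: "itir.perplexity.thread.v1";
  sourceThreadId: string;
  title: string;
  messages: StructuredMessage[];
  provenance: {
    endpoint: string; // which API path the data was captured from
    capturedAt: string; // ISO 8601 timestamp of the capture
    rawEntries: unknown[]; // untouched API entries, kept for auditing
  };
}

// Assemble a thread document from already-normalized messages.
function buildThread(
  sourceThreadId: string,
  title: string,
  messages: StructuredMessage[],
  rawEntries: unknown[],
  capturedAt: string,
): StructuredThread {
  return {
    schema: "itir.perplexity.thread.v1",
    sourceThreadId,
    title,
    messages,
    provenance: {
      endpoint: `/rest/thread/${sourceThreadId}`,
      capturedAt,
      rawEntries,
    },
  };
}
```

Keeping the raw entries next to the normalized messages is what would let a downstream tool re-derive or audit the normalization later.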

Why

The current Markdown-first flow is useful for reading and local vector search, but it makes dedupe, reimport, pagination validation, and downstream archive integrations harder than they need to be. A canonical thread/message export gives SQLite/MyChatArchive/ITIR-style tools a stable source of truth, while Markdown and vector indexes can be regenerated from that canonical layer.
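As a rough illustration of regenerating Markdown from the canonical layer, the sketch below dedupes messages by a stable ID and then renders them. The `ArchivedMessage` shape, the `## User` / `## Assistant` headings, and both function names are assumptions, not the exporter's actual output format.

```typescript
// Hypothetical canonical message; field names are assumptions.
interface ArchivedMessage {
  sourceMessageId: string;
  role: "user" | "assistant";
  text: string;
}

// Drop repeated messages by stable ID, keeping the first occurrence.
// This is what makes reimporting the same thread safe.
function dedupeMessages(messages: ArchivedMessage[]): ArchivedMessage[] {
  const seen = new Set<string>();
  const unique: ArchivedMessage[] = [];
  for (const message of messages) {
    if (seen.has(message.sourceMessageId)) continue;
    seen.add(message.sourceMessageId);
    unique.push(message);
  }
  return unique;
}

// Render a Markdown sidecar from canonical messages.
function renderMarkdown(title: string, messages: ArchivedMessage[]): string {
  const lines: string[] = [`# ${title}`, ""];
  for (const message of dedupeMessages(messages)) {
    const heading = message.role === "user" ? "## User" : "## Assistant";
    lines.push(heading, "", message.text.trim(), "");
  }
  return lines.join("\n");
}
```

Because the Markdown is derived rather than authoritative, it can be deleted and rebuilt at any time without losing data.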

A major practical issue is long Perplexity threads. The first browser-captured /rest/thread/<id> response can contain only the first page of entries. Without structured IDs and pagination checks, it is hard to know whether a run captured the whole thread, only page one, or duplicated page one repeatedly.

Pagination / long-thread behavior

This PR changes thread extraction so it does not stop at the first captured thread API response when Perplexity reports more pages:

- captures the exact /rest/thread/<thread-id> response for the conversation being exported
- follows next_cursor from page context with authenticated browser cookies
- appends additional entries when Perplexity returns genuinely new pages
- tracks entry identities across pages
- stops pagination if Perplexity replays only duplicate entries, instead of inflating the archive with repeated page-one content
- keeps the raw API response/entries in the structured export so downstream tools can audit what was captured
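The behavior listed above can be sketched as a cursor loop with duplicate-page detection. This is a minimal sketch under stated assumptions: the entry identity field is assumed to be `id`, the page fetcher is injected as a function so the browser and cookie handling stay out of scope, and `maxPages` is a hypothetical safety cap. Only `next_cursor` is named in this PR description.

```typescript
// Assumed entry shape: the real identity field may differ.
interface ThreadEntry {
  id: string;
  [key: string]: unknown;
}

interface ThreadPage {
  entries: ThreadEntry[];
  next_cursor: string | null;
}

// Injected fetcher; in the real exporter this would run in page context
// with authenticated browser cookies.
type FetchPage = (cursor: string | null) => Promise<ThreadPage>;

// Append only unseen entries and report how many were new.
function appendNewEntries(
  seen: Set<string>,
  collected: ThreadEntry[],
  page: ThreadEntry[],
): number {
  let added = 0;
  for (const entry of page) {
    if (seen.has(entry.id)) continue;
    seen.add(entry.id);
    collected.push(entry);
    added += 1;
  }
  return added;
}

async function collectThreadEntries(
  fetchPage: FetchPage,
  maxPages = 50, // hypothetical cap to avoid unbounded loops
): Promise<{ entries: ThreadEntry[]; pagesFetched: number; stoppedOnReplay: boolean }> {
  const seen = new Set<string>();
  const entries: ThreadEntry[] = [];
  let cursor: string | null = null;
  let pagesFetched = 0;
  let stoppedOnReplay = false;

  while (pagesFetched < maxPages) {
    const page: ThreadPage = await fetchPage(cursor);
    pagesFetched += 1;
    const added = appendNewEntries(seen, entries, page.entries);
    // A later page with nothing new is treated as a replay: stop here.
    if (added === 0 && pagesFetched > 1) {
      stoppedOnReplay = true;
      break;
    }
    if (!page.next_cursor) break;
    cursor = page.next_cursor;
  }
  return { entries, pagesFetched, stoppedOnReplay };
}
```

Returning `stoppedOnReplay` alongside the entries lets the caller record in the export whether pagination ended cleanly or was cut short by a replayed page.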

That last point is important: this does not pretend to bypass Perplexity/Cloudflare or guarantee all private webapp pagination works forever. It makes successful pagination useful, and failed/replayed pagination detectable and safe.

For cases where Perplexity's private API still only yields the first page but the UI download contains more content, this PR also adds a downloaded Markdown recovery path. npm run bundle:perplexity-downloads converts downloaded Markdown chunks into the same structured JSON shape so the recovered turns can attach to the same canonical thread downstream.

Changes

- Adds structured JSON export by default via EXPORT_STRUCTURED_JSON=true.
- Makes Markdown export optional via EXPORT_MARKDOWN=true.
- Adds normalized user/assistant message extraction from Perplexity API entries.
- Adds authenticated cursor pagination for thread detail responses.
- Adds duplicate-page detection so replayed first pages do not create fake extra messages.
- Adds bundle:perplexity-downloads for converting downloaded Markdown chunks into the same structured JSON shape.
- Defaults to headful browser mode because Cloudflare/Turnstile makes headless unreliable for this flow.
- Adds unit coverage for config, structured file writes, pagination, duplicate-page replay, filename truncation, and downloaded Markdown bundling.
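The two export flags above could be read along these lines. This is a sketch, not the PR's config code: `parseBooleanFlag`, `loadExportConfig`, and the `ExportConfig` shape are hypothetical, and the Markdown default of `false` is an assumption inferred from "optional"; only the two variable names and the structured-JSON default come from this description.

```typescript
interface ExportConfig {
  structuredJson: boolean;
  markdown: boolean;
}

// Interpret common truthy spellings; fall back when the variable is unset or blank.
function parseBooleanFlag(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === "") return fallback;
  const normalized = value.trim().toLowerCase();
  return normalized === "true" || normalized === "1" || normalized === "yes";
}

// The environment is passed in as a plain record so the function is easy to unit test.
function loadExportConfig(env: Record<string, string | undefined>): ExportConfig {
  return {
    // Structured JSON is on by default, per the PR description.
    structuredJson: parseBooleanFlag(env["EXPORT_STRUCTURED_JSON"], true),
    // Assumed default: Markdown sidecars are off unless requested.
    markdown: parseBooleanFlag(env["EXPORT_MARKDOWN"], false),
  };
}
```

In a Node script this would typically be called as `loadExportConfig(process.env)`.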

Validation

npm run type-check
npm run test:unit (25 tests passed)

feat: add structured archive export

89fffa0

chboishabba wants to merge 1 commit into simwai:master from chboishabba:codex/structured-sqlite-export
