feat: add structured archive export #17
Draft
chboishabba wants to merge 1 commit into
Summary
This PR adds a structured archive layer for Perplexity exports before Markdown/vector indexing, and improves long-thread extraction beyond the initial captured API page.
The exporter now writes `itir.perplexity.thread.v1` JSON with normalized messages, stable source thread/message IDs, thread metadata, and captured API provenance. Markdown export remains available as an optional sidecar, and the existing vector/RAG flow can still be enabled when Markdown sidecars are present.
Why
The current Markdown-first flow is useful for reading and local vector search, but it makes dedupe, reimport, pagination validation, and downstream archive integrations harder than they need to be. A canonical thread/message export gives SQLite/MyChatArchive/ITIR-style tools a stable source of truth, while Markdown and vector indexes can be regenerated from that canonical layer.
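To make the dedupe and reimport argument concrete, here is a minimal sketch of what a canonical thread record and an idempotent merge could look like. Only the schema string `itir.perplexity.thread.v1` comes from this PR; the field names (`threadId`, `messageId`, `role`, `text`) and the `mergeThreads` helper are hypothetical and are not claimed to match the exporter's actual output.

```typescript
// Hypothetical record shape: every field name here is an assumption,
// except the schema identifier, which is the one named in this PR.
interface ArchivedMessage {
  messageId: string; // stable source message ID
  role: "user" | "assistant";
  text: string;
}

interface ArchivedThread {
  schema: "itir.perplexity.thread.v1";
  threadId: string; // stable source thread ID
  messages: ArchivedMessage[];
}

// Merge a re-imported copy of a thread into an existing one.
// The first occurrence of each message ID wins, so importing the
// same export twice leaves the archive unchanged.
function mergeThreads(
  existing: ArchivedThread,
  incoming: ArchivedThread,
): ArchivedThread {
  if (existing.threadId !== incoming.threadId) {
    throw new Error("cannot merge records from different threads");
  }
  const seen = new Set<string>();
  const messages: ArchivedMessage[] = [];
  for (const message of [...existing.messages, ...incoming.messages]) {
    if (seen.has(message.messageId)) continue;
    seen.add(message.messageId);
    messages.push(message);
  }
  return { ...existing, messages };
}
```

With stable IDs, a downstream store can key on `threadId` and `messageId` instead of diffing rendered Markdown, which is the property the Markdown-first flow lacks.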
A major practical issue is long Perplexity threads. The first browser-captured `/rest/thread/<id>` response can contain only the first page of entries. Without structured IDs and pagination checks, it is hard to know whether a run captured the whole thread, only page one, or duplicated page one repeatedly.
Pagination / long-thread behavior
This PR changes thread extraction so it does not stop at the first captured thread API response when Perplexity reports more pages:
- `/rest/thread/<thread-id>` response for the conversation being exported
- `next_cursor` from page context with authenticated browser cookies
That last point is important: this does not pretend to bypass Perplexity/Cloudflare or guarantee all private webapp pagination works forever. It makes successful pagination useful, and failed/replayed pagination detectable and safe.
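One way to make replayed pagination detectable is to track which entry IDs and cursors have already been seen while pages arrive. The sketch below illustrates that idea only; the `Page` shape, the `ThreadPageAccumulator` class, and the status names are hypothetical and are not taken from this PR's code. It assumes each page exposes its entry IDs and an optional next cursor.

```typescript
// Hypothetical view of one captured thread page (field names are assumptions).
interface Page {
  entryIds: string[];
  nextCursor: string | null; // null when the API reports no further pages
}

type PageStatus = "continue" | "complete" | "replayed-page" | "repeated-cursor";

// Accumulates entries across pages and reports when pagination should stop.
class ThreadPageAccumulator {
  private readonly seenIds = new Set<string>();
  private readonly seenCursors = new Set<string>();
  readonly entryIds: string[] = [];

  accept(page: Page): PageStatus {
    let added = 0;
    for (const id of page.entryIds) {
      if (this.seenIds.has(id)) continue; // skip entries already captured
      this.seenIds.add(id);
      this.entryIds.push(id);
      added++;
    }
    // A non-empty page that adds nothing new is a replay of earlier content.
    if (page.entryIds.length > 0 && added === 0) return "replayed-page";
    if (page.nextCursor === null) return "complete";
    // Seeing the same cursor twice means following it would loop.
    if (this.seenCursors.has(page.nextCursor)) return "repeated-cursor";
    this.seenCursors.add(page.nextCursor);
    return "continue";
  }
}
```

A caller would request the next page only while `accept` returns `"continue"`, and could record any other status alongside the export so that a partial capture is marked as partial rather than silently treated as the whole thread.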
For cases where Perplexity's private API still only yields the first page but the UI download contains more content, this PR also adds a downloaded Markdown recovery path. `npm run bundle:perplexity-downloads` converts downloaded Markdown chunks into the same structured JSON shape so the recovered turns can attach to the same canonical thread downstream.
Changes
- `EXPORT_STRUCTURED_JSON=true`.
- `EXPORT_MARKDOWN=true`.
- `bundle:perplexity-downloads` for converting downloaded Markdown chunks into the same structured JSON shape.
Validation
- `npm run type-check`
- `npm run test:unit` (25 tests passed)