We have a local copy of the UFOpaedia wiki downloaded from Internet Archive + Wayback Machine snapshots. The raw data needs to be organized and indexed into a searchable, structured format — similar to how we structured the Discord exports.
What we already have (in `scratch/wiki-dump/ufopaedia-dump/`, gitignored):

| Source | File | Content |
| --- | --- | --- |
| 2020 XML dump | `ufopaediaorg-20200218-history.xml` (33 MB) | 14,400 pages, 1 revision each (current-only wikitext) |
| Wayback snapshots | `wayback/*.json` (14 files) | 14 key pages, 1-40 snapshots each (Feb 2020 – Feb 2026, rendered HTML) |
| Image metadata | `ufopaediaorg-20200218-images.txt` | 7,962 images (TSV: filename, URL, uploader) |
| Page titles | `ufopaediaorg-20200218-titles.txt` | 14,400 page titles |
| EsTeR edits | `ester_edits.json` | 50 edits tracked via Wayback Machine |
| Images | `images/` (~1.2 GB) | 15,900 image files with `.desc` metadata |
**Limitation:** The XML has only 1 revision per page (despite the filename saying "history"). Cloudflare blocks all live access to ufopaedia.org, so no fresh dump is possible. The Wayback snapshots partially cover the 2020-2026 gap for 14 key pages.
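As an example of consuming one of these inputs, the image metadata TSV could be loaded roughly like this (a sketch — the column order `filename, URL, uploader` is assumed from the inventory above and should be verified against the real file):

```python
import csv

def load_image_metadata(path):
    """Parse the image metadata TSV. Column order (filename, URL, uploader)
    is assumed from the inventory above — verify against the real file."""
    with open(path, encoding="utf-8") as fh:
        return [
            {"filename": row[0], "url": row[1], "uploader": row[2]}
            for row in csv.reader(fh, delimiter="\t")
            if len(row) >= 3  # skip blank or malformed lines
        ]
```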
## Task

Create `scratch/wiki-dump/index_wiki_dump.py` that parses all data sources into a unified index.

## Checklist
- [ ] Write `index_wiki_dump.py` using the Python standard library only (`xml.etree.ElementTree` `iterparse`, `json`, `re`, `pathlib`)
- [ ] Parse the 33 MB XML dump — extract title, namespace, page ID, revision metadata, and full wikitext per page
- [ ] Extract from wikitext: categories (`[[Category:...]]`), internal links (`[[...]]`), templates (`{{...}}`)
- [ ] Cross-reference with `wayback/*.json` — add snapshot timestamps and lengths to matching pages
- [ ] Generate individual page JSON files in `indexed/pages/` (one per page, 14,400 files)
- [ ] Generate namespace-grouped files in `indexed/by-namespace/` (`main.json`, `talk.json`, `user.json`, etc.)
- [ ] Generate master `indexed/index.json` with all pages' metadata
- [ ] Generate `indexed/apocalypse-pages.json` — filtered to only Apocalypse-related pages (title contains "(Apocalypse)" or has an Apocalypse category)
- [ ] Generate `indexed/stats.json` — total pages per namespace, top contributors, top categories
- [ ] Verify: script completes, `index.json` has 14,400 pages, `OpenApoc.json` has wikitext + wayback refs
- [ ] Consider downloading older history dumps (2012, 2014) from the Internet Archive to get actual revision history
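The parsing steps above could be sketched as follows. This is a minimal sketch, not the final script: the regexes are first approximations (nested templates and piped edge cases are not handled), and MediaWiki export tags are namespaced, hence the `localname` helper:

```python
import re
import xml.etree.ElementTree as ET

# First-approximation regexes for wikitext features — refine against real pages:
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)")
LINK_RE = re.compile(r"\[\[(?!Category:|File:|Image:)([^\]|#]+)")
TEMPLATE_RE = re.compile(r"\{\{([^}|]+)")

def localname(tag):
    """Strip the xmlns prefix that MediaWiki export files put on every tag."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(source):
    """Stream <page> records from a MediaWiki XML dump via iterparse,
    so the 33 MB file is never held in memory at once."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if localname(elem.tag) != "page":
            continue
        fields = {localname(child.tag): child for child in elem}
        text = ""
        rev = fields.get("revision")
        if rev is not None:
            for child in rev:
                if localname(child.tag) == "text":
                    text = child.text or ""
        yield {
            "title": fields["title"].text,
            "namespace": int(fields["ns"].text) if "ns" in fields else 0,
            "page_id": int(fields["id"].text),
            "wikitext": text,
            "categories": [m.strip() for m in CATEGORY_RE.findall(text)],
            "links": [m.strip() for m in LINK_RE.findall(text)],
            "templates": [m.strip() for m in TEMPLATE_RE.findall(text)],
        }
        elem.clear()  # release the subtree as soon as it has been indexed
```

Each yielded dict can then be written to `indexed/pages/` and accumulated into the master index; `elem.clear()` is what keeps the streaming parse flat in memory.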
## Output Structure
### Per-Page JSON Schema

```json
{
  "title": "Agents (Apocalypse)",
  "namespace": 0,
  "namespace_name": "main",
  "page_id": 1234,
  "dump_date": "2020-02-18",
  "revision": {
    "id": 89000,
    "parent_id": 88999,
    "timestamp": "2019-11-15T10:30:00Z",
    "contributor": "EsTeR",
    "contributor_id": 42,
    "comment": "Updated agent stats",
    "content_length": 5000
  },
  "wikitext": "Full wikitext content...",
  "categories": ["Apocalypse", "Agents"],
  "links": ["Agent Equipment (Apocalypse)", "Training (Apocalypse)"],
  "templates": ["Ref Open", "Apoc Icon"],
  "wayback_snapshots": [
    {
      "timestamp": "20201029221921",
      "date": "2020-10-29",
      "content_length": 32000
    }
  ],
  "has_wayback_updates": true
}
```
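Given page records shaped like this schema, `indexed/stats.json` could be aggregated with `collections.Counter` — a sketch assuming the field names shown above; the `most_common` cutoffs (20, 50) are arbitrary choices:

```python
from collections import Counter

def build_stats(pages):
    """Aggregate per-namespace counts, top contributors, and top categories
    from page records shaped like the per-page schema above."""
    by_namespace = Counter(p["namespace_name"] for p in pages)
    contributors = Counter(
        p["revision"]["contributor"]
        for p in pages
        if p["revision"].get("contributor")  # skip anonymous/missing
    )
    categories = Counter(c for p in pages for c in p["categories"])
    return {
        "pages_per_namespace": dict(by_namespace),
        "top_contributors": contributors.most_common(20),
        "top_categories": categories.most_common(50),
    }
```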
## Related Files

- `scratch/wiki-dump/scrape_wayback.py`
- `scratch/wiki-dump/dump_ufopaedia.py`
- `scratch/wiki-updates/ufopaedia-pages.json`
- `discord-exports/*.json`

## Labels
enhancement, documentation