Organize and index UFOpaedia wiki dump data #3

@ayrtondenner

Context

We have a local copy of the UFOpaedia wiki, downloaded from the Internet Archive and supplemented with Wayback Machine snapshots. The raw data needs to be organized and indexed into a searchable, structured format, similar to how we structured the Discord exports.

What we already have (in scratch/wiki-dump/ufopaedia-dump/, gitignored):

| Source | File | Content |
| --- | --- | --- |
| 2020 XML dump | ufopaediaorg-20200218-history.xml (33 MB) | 14,400 pages, 1 revision each (current-only wikitext) |
| Wayback snapshots | wayback/*.json (14 files) | 14 key pages, 1-40 snapshots each (Feb 2020 – Feb 2026, rendered HTML) |
| Image metadata | ufopaediaorg-20200218-images.txt | 7,962 images (TSV: filename, URL, uploader) |
| Page titles | ufopaediaorg-20200218-titles.txt | 14,400 page titles |
| EsTeR edits | ester_edits.json | 50 edits tracked via Wayback Machine |
| Images | images/ (~1.2 GB) | 15,900 image files with .desc metadata |

Limitation: The XML has only 1 revision per page (despite the filename saying "history"). Cloudflare blocks all live access to ufopaedia.org, so no fresh dump is possible. The Wayback snapshots partially cover the 2020-2026 gap, but only for the 14 key pages.

Task

Create scratch/wiki-dump/index_wiki_dump.py that parses all data sources into a unified index.

Checklist

  • Write index_wiki_dump.py using the Python standard library only (xml.etree.ElementTree.iterparse, json, re, pathlib)
  • Parse the 33 MB XML dump — extract title, namespace, page ID, revision metadata, full wikitext per page
  • Extract from wikitext: categories ([[Category:...]]), internal links ([[...]]), templates ({{...}})
  • Cross-reference with wayback/*.json — add snapshot timestamps and lengths to matching pages
  • Generate individual page JSON files in indexed/pages/ (one per page, 14,400 files)
  • Generate namespace-grouped files in indexed/by-namespace/ (main.json, talk.json, user.json, etc.)
  • Generate master indexed/index.json with all pages' metadata
  • Generate indexed/apocalypse-pages.json — filtered to only Apocalypse-related pages (title contains "(Apocalypse)" or has Apocalypse category)
  • Generate indexed/stats.json — total pages per namespace, top contributors, top categories
  • Verify: script completes, index.json has 14,400 pages, OpenApoc.json has wikitext + wayback refs
  • Consider downloading older history dumps (2012, 2014) from Internet Archive to get actual revision history
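The XML parsing step in the checklist could look roughly like the sketch below. It streams the dump with `iterparse` and clears each `<page>` element after use, so memory stays bounded. The helper names (`local`, `child_text`, `iter_pages`) are hypothetical, and the namespace URI in the sample is an assumption about the export version; the code strips whatever namespace the real dump declares.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Text of the first direct child with local tag `name`, else None."""
    if elem is None:
        return None
    for child in elem:
        if local(child.tag) == name:
            return child.text or ""
    return None


def iter_pages(source):
    """Yield one dict per <page>; `source` is a path or a binary file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        revision = next((c for c in elem if local(c.tag) == "revision"), None)
        yield {
            "title": child_text(elem, "title"),
            "namespace": int(child_text(elem, "ns") or 0),
            "page_id": int(child_text(elem, "id") or 0),
            "revision_id": int(child_text(revision, "id") or 0),
            "timestamp": child_text(revision, "timestamp"),
            "wikitext": child_text(revision, "text") or "",
        }
        elem.clear()  # free the finished page subtree


# Tiny stand-in for the real dump (namespace URI is an assumed example).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>OpenApoc</title>
    <ns>0</ns>
    <id>7</id>
    <revision>
      <id>100</id>
      <timestamp>2019-11-15T10:30:00Z</timestamp>
      <text>[[Category:Apocalypse]]</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

Contributor and comment fields would be read the same way from the `<revision>` element; they are left out here to keep the sketch short.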

Output Structure

scratch/wiki-dump/ufopaedia-dump/indexed/
├── index.json              # Master index: all pages with metadata
├── stats.json              # Statistics summary
├── by-namespace/           # Pages grouped by namespace
│   ├── main.json
│   ├── talk.json
│   ├── user.json
│   ├── file.json
│   ├── template.json
│   └── category.json
├── pages/                  # Individual page JSON files (14,400)
│   ├── Main_Page.json
│   ├── OpenApoc.json
│   └── ...
└── apocalypse-pages.json   # Filtered: Apocalypse-related pages only
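One possible way to produce this layout is sketched below. The helper names (`page_filename`, `dump_json`, `write_index`) are hypothetical. The namespace numbers are the standard MediaWiki defaults, but the dump's own `<siteinfo>` block should be treated as authoritative. Storing only titles in the by-namespace files is an assumption, since the issue does not specify their contents.

```python
import json
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# Standard MediaWiki namespace numbers for the files named in the tree above.
NAMESPACE_NAMES = {0: "main", 1: "talk", 2: "user", 6: "file", 10: "template", 14: "category"}


def page_filename(title):
    """Map a title to a file name, e.g. 'Main Page' -> 'Main_Page.json'."""
    # Replace characters that are unsafe on common filesystems.
    safe = re.sub(r'[<>:"/\\|?*]', "-", title.replace(" ", "_"))
    return safe + ".json"


def dump_json(path, data):
    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")


def write_index(pages, out_dir):
    """Write pages/, by-namespace/ and stats.json; return the stats dict."""
    out_dir = Path(out_dir)
    (out_dir / "pages").mkdir(parents=True, exist_ok=True)
    (out_dir / "by-namespace").mkdir(parents=True, exist_ok=True)

    groups = defaultdict(list)
    for page in pages:
        name = NAMESPACE_NAMES.get(page["namespace"], f"ns{page['namespace']}")
        groups[name].append(page["title"])  # assumption: titles only
        dump_json(out_dir / "pages" / page_filename(page["title"]), page)

    for name, titles in groups.items():
        dump_json(out_dir / "by-namespace" / f"{name}.json", titles)

    stats = {"pages_per_namespace": {n: len(t) for n, t in groups.items()}}
    dump_json(out_dir / "stats.json", stats)
    return stats


demo_pages = [
    {"title": "Main Page", "namespace": 0, "wikitext": ""},
    {"title": "Talk:Main Page", "namespace": 1, "wikitext": ""},
]
with tempfile.TemporaryDirectory() as tmp:
    stats = write_index(demo_pages, tmp)
    written = sorted(p.name for p in (Path(tmp) / "pages").iterdir())
```

Two distinct titles can map to the same file name after sanitizing (for example `A/B` and `A-B`), and case-insensitive filesystems add further collisions, so a real script would likely need a disambiguating suffix such as the page ID.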

Per-Page JSON Schema

{
  "title": "Agents (Apocalypse)",
  "namespace": 0,
  "namespace_name": "main",
  "page_id": 1234,
  "dump_date": "2020-02-18",
  "revision": {
    "id": 89000,
    "parent_id": 88999,
    "timestamp": "2019-11-15T10:30:00Z",
    "contributor": "EsTeR",
    "contributor_id": 42,
    "comment": "Updated agent stats",
    "content_length": 5000
  },
  "wikitext": "Full wikitext content...",
  "categories": ["Apocalypse", "Agents"],
  "links": ["Agent Equipment (Apocalypse)", "Training (Apocalypse)"],
  "templates": ["Ref Open", "Apoc Icon"],
  "wayback_snapshots": [
    { "timestamp": "20201029221921", "date": "2020-10-29", "content_length": 32000 }
  ],
  "has_wayback_updates": true
}
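The derived fields in this schema (`categories`, `links`, `templates`, and the `date` of each Wayback snapshot) could be computed along the lines of the sketch below. The function names are hypothetical, and the regular expressions are a deliberately simple approximation: they do not handle `<nowiki>` blocks or parser functions such as `{{#if:...}}`. Treating any category whose name contains "Apocalypse" as a match is also an assumption about the intended filter.

```python
import re
from datetime import datetime

CATEGORY_RE = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\||\]\])", re.IGNORECASE)
LINK_RE = re.compile(r"\[\[\s*([^\[\]|#]+?)\s*(?:[|#]|\]\])")
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|]+?)\s*(?=\||\}\})")


def unique(items):
    """De-duplicate while keeping first-seen order."""
    return list(dict.fromkeys(items))


def extract_fields(wikitext):
    """Return the categories, links and templates found in a page's wikitext."""
    links = [
        target
        for target in LINK_RE.findall(wikitext)
        if not target.lower().startswith("category:")  # categories are listed separately
    ]
    return {
        "categories": unique(CATEGORY_RE.findall(wikitext)),
        "links": unique(links),
        "templates": unique(TEMPLATE_RE.findall(wikitext)),
    }


def is_apocalypse(title, categories):
    """Filter used for apocalypse-pages.json (category match is an assumption)."""
    return "(Apocalypse)" in title or any("apocalypse" in c.lower() for c in categories)


def snapshot_entry(timestamp, content_length):
    """Build one wayback_snapshots item from a 14-digit Wayback timestamp."""
    date = datetime.strptime(timestamp, "%Y%m%d%H%M%S").date().isoformat()
    return {"timestamp": timestamp, "date": date, "content_length": content_length}


sample = (
    "{{Ref Open}} Agents use [[Agent Equipment (Apocalypse)|equipment]] "
    "and [[Training (Apocalypse)]]. {{Apoc Icon|size=32}} "
    "[[Category:Apocalypse]] [[Category:Agents]]"
)
fields = extract_fields(sample)
entry = snapshot_entry("20201029221921", 32000)
```

File and image links are kept in `links` here; whether they belong there or in a separate field is left open by the schema.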

Related Files

  • Existing scripts: scratch/wiki-dump/scrape_wayback.py, scratch/wiki-dump/dump_ufopaedia.py
  • Wiki updates tracking: scratch/wiki-updates/ufopaedia-pages.json
  • Discord exports (similar structure): discord-exports/*.json

Labels

enhancement, documentation
