Skip to content

Epic: Enrichment Provider System — Wikidata/Wikipedia and semantic web integrations #29

@gregoryfoster

Description

@gregoryfoster

Overview

This epic tracks the design and implementation of the person record enrichment provider system, beginning with Wikipedia/Wikidata integration and expanding to authoritative identity systems linked through the semantic web.

Background

Person Validator enriches person records with data from external authoritative sources. This epic establishes the full provider framework and implements the first set of providers, anchored on Wikidata as the hub of the semantic web identity graph.

Why Wikidata first

Wikidata (wikidata.org) is a free, machine-readable knowledge graph operated by the Wikimedia Foundation. Every person entity in Wikidata carries:

  • Structured biographical facts (birth/death dates, occupation, nationality)
  • A stable QID (e.g. Q23) that serves as a universal identifier
  • Links to dozens of external authoritative identity systems via external identifier properties (P214 for VIAF, P496 for ORCID, P2390 for Ballotpedia, P4386 for OpenSecrets, etc.)
  • Aliases in multiple languages and scripts
  • A sitelink to the corresponding Wikipedia article (when one exists)

A single Wikidata entity can yield the identifier keys needed to query VIAF, ORCID, Ballotpedia, OpenSecrets, Library of Congress, GND, IMDb, MusicBrainz, and ~2,394 other identity systems. This makes Wikidata the natural foundation for a cascading enrichment system.

Architecture summary

Providers execute in dependency-resolved rounds:

  • Round 1 (parallel): WikidataProvider — no dependencies
  • Round 2 (parallel): WikipediaProvider, VIAFProvider, ORCIDProvider, BallotpediaProvider, OpenSecretsProvider — all depend on wikidata_qid

When WikidataProvider cannot auto-link (ambiguous or low-confidence match), it creates a WikidataCandidateReview record for human adjudication in the Django admin. Accepting a candidate immediately triggers full downstream enrichment.

Wikidata external identifier taxonomy

Wikidata contains 10,028 external identifier properties. 2,394 apply to human persons (Q5). These are imported into the ExternalIdentifierProperty model and exposed in Django admin for administrative narrowing. WikidataProvider consults this table at runtime to extract and construct URLs for all enabled identifiers — no hardcoded allowlist.

Confidence convention

Link method Confidence
Auto-linked, unconfirmed (auto_linked review open) 0.75
Auto-linked, admin confirmed (confirmed review) 0.95
Admin-adjudicated from candidates (accepted review) 0.95
Derived from confirmed Wikidata identifier 0.90
Downstream provider data (VIAF facts, ORCID record) 0.85
Name alias — auto-linked, unconfirmed 0.70
Name alias — confirmed 0.80

See #31 for the confirmation mechanism design: WikidataCandidateReview carries an
auto_linked status for unambiguous matches, allowing admins to confirm (bump to 0.95)
or reject (rollback attributes and re-queue for manual search).

Implementation phases

Phase 0 — Foundation

Prerequisites for all providers. No user-visible enrichment output yet.

Phase 1 — WikidataProvider

Phase 2 — WikipediaProvider

Phase 3 — Priority Providers (in order)

Phase 4 — Cron Infrastructure

Future providers (researched, not yet scheduled)

Provider Wikidata Prop Coverage Notes
ISNI P213 Authors, performers, public figures Free REST API; complements VIAF
Library of Congress P244 US-catalogued persons id.loc.gov JSON API
GND (lobid.org) P227 European academics & cultural figures SPARQL federation from Wikidata works
MusicBrainz P434 Musicians, composers, producers Rich open API; no auth required
IMDb P345 Film/TV persons SPARQL federation via QLever confirmed working
Semantic Scholar P7924 Academic researchers Free REST API
SNAC P3430 Archival identity records Free REST API

SPARQL federation capability (future optimization)

Research confirmed that the Wikidata SPARQL endpoint supports SERVICE{} federation to:

  • GND (lobid.org): SERVICE <https://lobid.org/gnd/search> — returns GND biographical data
  • IMDb (QLever): SERVICE <https://qlever.dev/api/imdb> — returns IMDb person data including birth year and professions

Future providers for these sources may use federated SPARQL queries rather than separate HTTP clients.

Design document

Full design rationale, data models, API details, and all decisions: DESIGN-ENRICHMENT-PROVIDERS.md in the repository root.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestenrichmentEnrichment provider systemtrackingTracking / epic issue

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions