Overview
This epic tracks the design and implementation of the person record enrichment provider system, beginning with Wikipedia/Wikidata integration and expanding to authoritative identity systems linked through the semantic web.
Background
Person Validator enriches person records with data from external authoritative sources. This epic establishes the full provider framework and implements the first set of providers, anchored on Wikidata as the hub of the semantic web identity graph.
Why Wikidata first
Wikidata (wikidata.org) is a free, machine-readable knowledge graph operated by the Wikimedia Foundation. Every person entity in Wikidata carries:
- Structured biographical facts (birth/death dates, occupation, nationality)
- A stable QID (e.g. Q23) that serves as a universal identifier
- Links to dozens of external authoritative identity systems via external identifier properties (P214 for VIAF, P496 for ORCID, P2390 for Ballotpedia, P4386 for OpenSecrets, etc.)
- Aliases in multiple languages and scripts
- A sitelink to the corresponding Wikipedia article (when one exists)
A single Wikidata entity can yield the identifier keys needed to query VIAF, ORCID, Ballotpedia, OpenSecrets, Library of Congress, GND, IMDb, MusicBrainz, and the rest of the 2,394 identity systems whose identifier properties apply to persons. This makes Wikidata the natural foundation for a cascading enrichment system.
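As a rough illustration of how those identifier keys could be read out of an entity, the sketch below walks the claims section of a Wikidata entity JSON document and collects the values of external-id statements. The function name extract_external_ids and the trimmed sample entity are assumptions made for this example, not project code, and the identifier value in the sample is a placeholder.

```python
from typing import Any


def extract_external_ids(entity: dict[str, Any]) -> dict[str, list[str]]:
    """Collect external identifier values from a Wikidata entity, keyed by property ID."""
    found: dict[str, list[str]] = {}
    for prop_id, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            # Skip "novalue"/"somevalue" snaks and anything that is not an external identifier.
            if snak.get("snaktype") != "value" or snak.get("datatype") != "external-id":
                continue
            found.setdefault(prop_id, []).append(snak["datavalue"]["value"])
    return found


# Trimmed, hypothetical entity in the shape of Wikidata's entity JSON (placeholder values).
SAMPLE_ENTITY = {
    "id": "Q23",
    "claims": {
        "P214": [{"mainsnak": {"snaktype": "value", "datatype": "external-id",
                               "datavalue": {"value": "0000000001", "type": "string"}}}],
        "P496": [{"mainsnak": {"snaktype": "novalue", "datatype": "external-id"}}],
        "P569": [{"mainsnak": {"snaktype": "value", "datatype": "time",
                               "datavalue": {"value": {"time": "+1732-02-22T00:00:00Z"},
                                             "type": "time"}}}],
    },
}

external_ids = extract_external_ids(SAMPLE_ENTITY)
```

Only the P214 statement survives here: the P496 statement has no value, and P569 is a date rather than an external identifier.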
Architecture summary
Providers execute in dependency-resolved rounds:
- Round 1 (parallel): WikidataProvider — no dependencies
- Round 2 (parallel): WikipediaProvider, VIAFProvider, ORCIDProvider, BallotpediaProvider, OpenSecretsProvider — all depend on wikidata_qid
When WikidataProvider cannot auto-link (ambiguous or low-confidence match), it creates a WikidataCandidateReview record for human adjudication in the Django admin. Accepting a candidate immediately triggers full downstream enrichment.
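The round structure above can be sketched as a small planning function. This is a minimal illustration, assuming each provider declares the keys it requires and the keys it provides. The requires/provides dictionary layout and the plan_rounds name are hypothetical and may differ from the actual EnrichmentRunner design.

```python
def plan_rounds(providers: dict[str, dict[str, set[str]]]) -> list[list[str]]:
    """Group providers into rounds; a provider is ready once every key it requires exists."""
    available: set[str] = set()
    pending = dict(providers)
    rounds: list[list[str]] = []
    while pending:
        # Providers in one round have no dependencies on each other and could run in parallel.
        ready = sorted(name for name, spec in pending.items() if spec["requires"] <= available)
        if not ready:
            raise ValueError(f"Unsatisfiable dependencies: {sorted(pending)}")
        rounds.append(ready)
        # Keys produced in this round only become available to the next round.
        for name in ready:
            available |= pending.pop(name)["provides"]
    return rounds


# Assumed declarations mirroring the two rounds described in this section.
PROVIDERS = {
    "WikidataProvider": {"requires": set(), "provides": {"wikidata_qid"}},
    "WikipediaProvider": {"requires": {"wikidata_qid"}, "provides": set()},
    "VIAFProvider": {"requires": {"wikidata_qid"}, "provides": set()},
    "ORCIDProvider": {"requires": {"wikidata_qid"}, "provides": set()},
    "BallotpediaProvider": {"requires": {"wikidata_qid"}, "provides": set()},
    "OpenSecretsProvider": {"requires": {"wikidata_qid"}, "provides": set()},
}

rounds = plan_rounds(PROVIDERS)
```

Planning on provided keys rather than provider names means a later provider only needs to declare that it requires wikidata_qid, without knowing which provider supplies it.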
Wikidata external identifier taxonomy
Wikidata contains 10,028 external identifier properties. 2,394 apply to human persons (Q5). These are imported into the ExternalIdentifierProperty model and exposed in Django admin for administrative narrowing. WikidataProvider consults this table at runtime to extract and construct URLs for all enabled identifiers — no hardcoded allowlist.
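A minimal sketch of the table-driven URL construction follows, using a plain dataclass in place of the Django model so the example runs on its own. The field names (property_id, label, formatter_url, enabled) are assumptions about ExternalIdentifierProperty, and the formatter URLs and identifier values are illustrative. The "$1" placeholder follows the convention of Wikidata's formatter URL property.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentifierProperty:
    """Stand-in for one ExternalIdentifierProperty row (field names are assumed)."""
    property_id: str
    label: str
    formatter_url: str  # template in which "$1" marks where the identifier goes
    enabled: bool = True


def build_identifier_urls(external_ids: dict[str, str],
                          properties: list[IdentifierProperty]) -> dict[str, str]:
    """Return {label: url} for every enabled property the entity has a value for."""
    urls: dict[str, str] = {}
    for prop in properties:
        # Disabled rows are skipped, which is how admin narrowing would take effect.
        if not prop.enabled or prop.property_id not in external_ids:
            continue
        urls[prop.label] = prop.formatter_url.replace("$1", external_ids[prop.property_id])
    return urls


# Illustrative rows; the real table would be filled by the sync command.
PROPERTIES = [
    IdentifierProperty("P214", "VIAF", "https://viaf.org/viaf/$1"),
    IdentifierProperty("P496", "ORCID", "https://orcid.org/$1"),
    IdentifierProperty("P345", "IMDb", "https://www.imdb.com/name/$1", enabled=False),
]

urls = build_identifier_urls(
    {"P214": "0000000001", "P496": "0000-0000-0000-0000", "P345": "nm0000000"},
    PROPERTIES,
)
```

Because the lookup is driven entirely by the rows passed in, enabling or disabling an identifier system is a data change rather than a code change.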
Confidence convention
| Link method | Confidence |
|---|---|
| Auto-linked, unconfirmed (auto_linked review open) | 0.75 |
| Auto-linked, admin confirmed (confirmed review) | 0.95 |
| Admin-adjudicated from candidates (accepted review) | 0.95 |
| Derived from confirmed Wikidata identifier | 0.90 |
| Downstream provider data (VIAF facts, ORCID record) | 0.85 |
| Name alias — auto-linked, unconfirmed | 0.70 |
| Name alias — confirmed | 0.80 |
See #31 for the confirmation mechanism design: WikidataCandidateReview carries an auto_linked status for unambiguous matches, allowing admins to confirm (raising confidence to 0.95) or reject (rolling back attributes and re-queueing the record for manual search).
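The convention could be captured in code roughly as follows. This is a sketch only: the LinkMethod enum, the CONFIDENCE mapping, and confidence_after_review are hypothetical names chosen for illustration, while the numeric values are taken from the confidence table in this section.

```python
from enum import Enum
from typing import Optional


class LinkMethod(Enum):
    AUTO_LINKED_UNCONFIRMED = "auto_linked_unconfirmed"
    AUTO_LINKED_CONFIRMED = "auto_linked_confirmed"
    ADMIN_ADJUDICATED = "admin_adjudicated"
    DERIVED_FROM_CONFIRMED_ID = "derived_from_confirmed_id"
    DOWNSTREAM_PROVIDER = "downstream_provider"
    ALIAS_UNCONFIRMED = "alias_unconfirmed"
    ALIAS_CONFIRMED = "alias_confirmed"


# Values copied from the confidence convention table.
CONFIDENCE = {
    LinkMethod.AUTO_LINKED_UNCONFIRMED: 0.75,
    LinkMethod.AUTO_LINKED_CONFIRMED: 0.95,
    LinkMethod.ADMIN_ADJUDICATED: 0.95,
    LinkMethod.DERIVED_FROM_CONFIRMED_ID: 0.90,
    LinkMethod.DOWNSTREAM_PROVIDER: 0.85,
    LinkMethod.ALIAS_UNCONFIRMED: 0.70,
    LinkMethod.ALIAS_CONFIRMED: 0.80,
}


def confidence_after_review(decision: str) -> Optional[float]:
    """Confidence of an auto-linked match after an admin decision.

    Returns None for a rejection, signalling that the attributes should be
    rolled back and the record re-queued for manual search.
    """
    if decision == "confirm":
        return CONFIDENCE[LinkMethod.AUTO_LINKED_CONFIRMED]
    if decision == "reject":
        return None
    raise ValueError(f"Unknown decision: {decision!r}")
```

Keeping the numbers in one mapping makes it harder for individual providers to drift from the shared convention.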
Future providers (researched, not yet scheduled)
| Provider | Wikidata Prop | Coverage | Notes |
|---|---|---|---|
| ISNI | P213 | Authors, performers, public figures | Free REST API; complements VIAF |
| Library of Congress | P244 | US-catalogued persons | id.loc.gov JSON API |
| GND (lobid.org) | P227 | European academics & cultural figures | SPARQL federation from Wikidata works |
| MusicBrainz | P434 | Musicians, composers, producers | Rich open API; no auth required |
| IMDb | P345 | Film/TV persons | SPARQL federation via QLever confirmed working |
| Semantic Scholar | P7924 | Academic researchers | Free REST API |
| SNAC | P3430 | Archival identity records | Free REST API |
SPARQL federation capability (future optimization)
Research confirmed that the Wikidata SPARQL endpoint supports SERVICE{} federation to:
- GND (lobid.org): SERVICE <https://lobid.org/gnd/search> — returns GND biographical data
- IMDb (QLever): SERVICE <https://qlever.dev/api/imdb> — returns IMDb person data including birth year and professions
Future providers for these sources may use federated SPARQL queries rather than separate HTTP clients.
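To make the shape of such a query concrete, the sketch below assembles a federated SPARQL query string in Python without executing it. The build_federated_query helper is hypothetical, and the triple pattern inside the SERVICE block is a placeholder, since the vocabulary exposed by each remote endpoint is not described here. Only the endpoint URL and the P227 property come from this document.

```python
# Standard Wikidata prefixes, declared explicitly so the query is self-contained.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
)


def build_federated_query(qid: str, id_property: str,
                          service_url: str, service_pattern: str) -> str:
    """Build a query that reads an identifier from Wikidata, then delegates to a remote service."""
    return (
        f"{PREFIXES}"
        "SELECT * WHERE {\n"
        f"  wd:{qid} wdt:{id_property} ?externalId .\n"
        f"  SERVICE <{service_url}> {{\n"
        f"    {service_pattern}\n"
        "  }\n"
        "}"
    )


query = build_federated_query(
    qid="Q23",
    id_property="P227",  # GND ID
    service_url="https://lobid.org/gnd/search",
    # Placeholder pattern: a real provider would join on ?externalId using
    # whatever predicates the remote endpoint actually exposes.
    service_pattern="?record ?predicate ?value .",
)
```

A provider built this way would send the resulting string to the SPARQL endpoint in a single request instead of calling the remote source through its own HTTP client.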
Design document
Full design rationale, data models, API details, and all decisions: DESIGN-ENRICHMENT-PROVIDERS.md in the repository root.
Implementation phases
Phase 0 — Foundation
Prerequisites for all providers. No user-visible enrichment output yet.
- SocialPlatform → ExternalPlatform
- confidence + provenance to PersonName
- EnrichmentRun audit log model
- ExternalIdentifierProperty model + sync_wikidata_properties command
- EnrichmentRunner (dependency graph, parallel execution)
Phase 1 — WikidataProvider
- WikidataCandidateReview model + post-save signal
- WikidataProvider implementation
Phase 2 — WikipediaProvider
- WikipediaProvider implementation
Phase 3 — Priority Providers (in order)
- #24: BallotpediaProvider (US political figures)
- #25: OpenSecretsProvider (US federal campaign finance)
- #26: VIAFProvider (international library authority file)
- #27: ORCIDProvider (academic researcher identity)
- #36: Declarative provider enable/disable via Django admin (pre-Phase 4)
Phase 4 — Cron Infrastructure
- run_enrichment_cron command + systemd timer units