Add FIFA auto-update script; fix 6 outdated FIFA codes; add corrections layer#103

Merged
luccasmmg merged 5 commits into main from fix-fifa-codes on May 8, 2026

@olayway olayway commented May 8, 2026

Summary

  • Add scripts/fifa.py — fetches the authoritative list of FIFA member codes directly from inside.fifa.com/associations on every pipeline run, replacing the stale statoids-sourced values; fixes 6 outdated codes (see table below)
  • Add scripts/corrections.py — a general-purpose manual override layer (currently empty) that runs last in the pipeline, for cases where automated sources can't resolve an issue
  • Enforce LF line endings — all CSV-writing scripts now explicitly set lineterminator='\n' to prevent CRLF output on Windows runners
  • Fix Wikidata non-determinism for Antarctica — filter Q21590062 (Antarctic Treaty Area) from the Wikidata SPARQL results, which was randomly displacing the correct Q51 (Antarctica continent) due to both carrying P297=AQ
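
A manual override layer like the one in the second bullet could look roughly like the sketch below. This is not the actual scripts/corrections.py, which ships empty; the function name `apply_corrections`, the dictionary layout, and the `ISO3166-1-Alpha-2` key column are assumptions, and the Sudan entry is borrowed from an earlier commit in this PR purely for illustration.

```python
# Hypothetical override table: (alpha-2 code, column name) -> corrected value.
# In the merged PR the real table is empty; this entry is illustrative only.
CORRECTIONS = {
    ("SD", "FIFA"): "SDN",
}


def apply_corrections(rows, corrections, key="ISO3166-1-Alpha-2"):
    """Patch rows in place and return a log of the changes actually made.

    Each log entry is (code, column, old value, new value). Rows whose value
    already matches the override are left alone and not logged.
    """
    applied = []
    for row in rows:
        for (code, column), value in corrections.items():
            if row.get(key) == code and row.get(column) != value:
                applied.append((code, column, row.get(column), value))
                row[column] = value
    return applied


rows = [
    {"ISO3166-1-Alpha-2": "SD", "FIFA": "SUD"},
    {"ISO3166-1-Alpha-2": "LB", "FIFA": "LBN"},
]
print(apply_corrections(rows, CORRECTIONS))  # [('SD', 'FIFA', 'SUD', 'SDN')]
```

Running such a step last means an override wins over every automated source, which is why it only makes sense for values no upstream source gets right.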

Supersedes #75

#75 (open since 2020) proposes a manual edit to fix Sudan's FIFA code from SUD to SDN. This PR makes that fix redundant — and explains why a manual edit alone was never the right solution:

The pipeline re-generates country-codes.csv from upstream sources on every make update run. Any manual CSV edit gets silently overwritten the next time the pipeline runs, because the FIFA values come from statoids.com, which still carries SUD. A direct fix to the CSV would last only until the next automated commit.

This PR addresses the root cause: it replaces statoids as the FIFA source with a live scraper against inside.fifa.com, so the correct codes are produced automatically on every run. Sudan (SUD → SDN) is one of six corrections this approach yields.

How fifa.py works

inside.fifa.com/associations is a server-side rendered page that lists all 211 FIFA member associations as hyperlinks in the form /associations/<CODE>. A single HTTP request is enough to extract the complete valid code list.

For each row in country-codes.csv:

  • If the current FIFA code is already in the valid set → leave it unchanged
  • If it is missing or invalid → try ISO3166-1-Alpha-3 as a candidate (covers the majority of statoids lag cases, since many FIFA codes match ISO alpha-3)
  • If neither resolves → emit a warning and leave unchanged (can be handled in corrections.py for known edge cases)
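
The resolution steps above can be sketched as follows. This is a minimal illustration, not the code in scripts/fifa.py: the regular expression, the function names, and the `ISO3166-1-Alpha-2` column used in the warning are assumptions, and the HTML is an inline sample rather than a live fetch.

```python
import re
from typing import Optional, Set, Tuple

# Assumed link shape, based on the description above: /associations/<CODE>
CODE_LINK = re.compile(r'href="[^"]*/associations/([A-Z]{3})/?"')


def extract_valid_codes(html: str) -> Set[str]:
    """Collect every three-letter code that appears in an association link."""
    return set(CODE_LINK.findall(html))


def resolve_fifa(row: dict, valid: Set[str]) -> Tuple[str, Optional[str]]:
    """Return (FIFA code to write, warning message or None)."""
    current = (row.get("FIFA") or "").strip()
    if current in valid:
        return current, None  # already valid: leave unchanged
    candidate = (row.get("ISO3166-1-Alpha-3") or "").strip()
    if candidate in valid:
        return candidate, None  # ISO alpha-3 happens to be the FIFA code
    # Neither resolves: keep the old value and report it
    country = row.get("ISO3166-1-Alpha-2")
    return current, f"unresolved FIFA code {current!r} for {country}"


sample_html = (
    '<a href="/associations/SDN">Sudan</a> '
    '<a href="/associations/SGP">Singapore</a>'
)
valid_codes = extract_valid_codes(sample_html)
row = {"ISO3166-1-Alpha-2": "SD", "ISO3166-1-Alpha-3": "SDN", "FIFA": "SUD"}
print(resolve_fifa(row, valid_codes))  # ('SDN', None)
```

Keeping the resolution logic separate from the HTTP fetch makes it possible to test the fallback rules against a saved copy of the page.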

Why not statoids?

statoids.com (the previous source for FIFA codes) is no longer maintained (page says: Last updated: 2016-03-02). Some codes had not been updated in years:

| Country | ISO α-2 | Was | Now | Reason |
|---|---|---|---|---|
| Sudan | SD | SUD | SDN | Historical code, FIFA now uses SDN |
| South Sudan | SS | (blank) | SSD | Joined FIFA 2012, statoids had no entry |
| Lebanon | LB | LIB | LBN | LIB is the MARC library code, not FIFA |
| Gibraltar | GI | GBZ | GIB | Admitted to FIFA 2016 with code GIB |
| Singapore | SG | SIN | SGP | Previously unknown statoids error |
| Curaçao | CW | (blank) | CUW | Previously unknown missing entry |

The last two were only discovered by running the new scraper — they would not have been caught by a manual corrections approach.

Why not other sources?

| Source | Problem |
|---|---|
| footballsquads.co.uk | Only ~142 of 211 members; also has outdated codes |
| Wikipedia | CC BY-SA licence conflicts with this dataset's public-domain dedication |

Pipeline fix: LF line endings

All CSV-writing scripts (cleanup.py, corrections.py, fifa.py, reorder_columns.py, reorder_rows.py) now explicitly set lineterminator='\n'. Without this, Python's csv module defaults to \r\n (the CSV standard) and pandas uses the platform line separator, both of which produce CRLF on Windows runners — making the entire CSV appear changed on every pipeline run even when the data is identical.
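
The csv module behaviour described above can be reproduced in isolation. The sketch below writes to an in-memory buffer instead of a file, and the helper name `write_csv` is only for illustration; it is not taken from the pipeline scripts.

```python
import csv
import io

rows = [["Sudan", "SD", "SDN"], ["Lebanon", "LB", "LBN"]]


def write_csv(rows, **writer_kwargs) -> str:
    """Write rows with csv.writer into a string, with no newline translation."""
    buf = io.StringIO(newline="")
    csv.writer(buf, **writer_kwargs).writerows(rows)
    return buf.getvalue()


default_output = write_csv(rows)                 # csv's default terminator
lf_output = write_csv(rows, lineterminator="\n")  # explicit LF

print(repr(default_output))  # 'Sudan,SD,SDN\r\nLebanon,LB,LBN\r\n'
print(repr(lf_output))       # 'Sudan,SD,SDN\nLebanon,LB,LBN\n'
```

When writing to a real file, the csv documentation also recommends opening it with `newline=""` so the platform does not translate the terminator a second time.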

Pipeline fix: Wikidata non-determinism for Antarctica

Both Q51 (Antarctica — the continent) and Q21590062 (Antarctic Treaty Area) carry P297=AQ in Wikidata. The SPARQL query in wd_countries.sh returns both, and since wd_countries.py picks the first result, Antarctica's wikidata_id could flip between the two across runs depending on endpoint ordering. Q21590062 has been added to the existing filter list in wd_countries.sh alongside other known bad duplicates, ensuring Q51 is always used.
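
The actual fix lives in the SPARQL query in wd_countries.sh, but the effect can be illustrated in Python. The sketch below assumes results arrive as (alpha-2, Wikidata ID) pairs in endpoint order; `EXCLUDED_QIDS` and `pick_wikidata_ids` are hypothetical names, not identifiers from the repository.

```python
# Hypothetical exclusion set mirroring the filter list in wd_countries.sh;
# only the entry added by this PR is shown.
EXCLUDED_QIDS = {"Q21590062"}  # Antarctic Treaty Area


def pick_wikidata_ids(results):
    """Keep the first non-excluded Wikidata ID seen for each alpha-2 code."""
    chosen = {}
    for alpha2, qid in results:
        if qid in EXCLUDED_QIDS:
            continue
        chosen.setdefault(alpha2, qid)
    return chosen


# The endpoint may return the two AQ items in either order.
order_a = [("AQ", "Q51"), ("AQ", "Q21590062")]
order_b = [("AQ", "Q21590062"), ("AQ", "Q51")]

print(pick_wikidata_ids(order_a))  # {'AQ': 'Q51'}
print(pick_wikidata_ids(order_b))  # {'AQ': 'Q51'}
```

Without the exclusion, a first-result-wins rule would return Q21590062 for the second ordering, which is the flip described above.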

@olayway olayway marked this pull request as draft May 8, 2026 10:14
olayway and others added 3 commits May 8, 2026 13:06
Introduces scripts/corrections.py — a post-pipeline override step that
patches specific field values that are wrong in upstream sources and
would otherwise be overwritten on every `make update` run.

First use: fix four FIFA codes where statoids.com is outdated:
  SD  FIFA: SUD  -> SDN  (historical code, FIFA now uses SDN)
  SS  FIFA: ''   -> SSD  (South Sudan joined FIFA 2012, statoids has no entry)
  LB  FIFA: LIB  -> LBN  (LIB is Lebanon's MARC code, not its FIFA code)
  GI  FIFA: GBZ  -> GIB  (Gibraltar admitted to FIFA 2016 with code GIB)

The script docstring documents why no fully automated FIFA source exists
and what the alternatives are (FIFA's own site, footballsquads.co.uk,
Wikipedia) and why each falls short.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…a.com)

scripts/fifa.py fetches the full list of 211 valid FIFA member codes from
inside.fifa.com/associations in a single HTTP request, then validates and
corrects the FIFA column in country-codes.csv:
- Keeps existing codes that are already valid
- For invalid/missing codes, tries ISO 3166-1 alpha-3 as a candidate
  (covers the majority of statoids lag cases)
- Emits warnings for anything it cannot resolve automatically

Corrections found this run:
  SD  FIFA: SUD  -> SDN  (statoids lag)
  SS  FIFA: ''   -> SSD  (statoids missing entry)
  LB  FIFA: LIB  -> LBN  (wrong code from statoids)
  GI  FIFA: GBZ  -> GIB  (statoids lag, Gibraltar admitted 2016)
  SG  FIFA: SIN  -> SGP  (previously unknown error)
  CW  FIFA: ''   -> CUW  (previously unknown missing entry)

scripts/corrections.py is retained as an empty-but-documented framework
for future field-specific overrides that cannot be automated; the FIFA
entries it previously held are now handled by fifa.py on every pipeline run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation

goodtables is unmaintained and no longer installable; frictionless is its
actively maintained successor and validates the same datapackage.yml format.
Also corrects the target file reference (datapackage.json → datapackage.yml).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add lineterminator='\n' to all csv.writer/DictWriter calls and pandas
to_csv(), and convert the existing country-codes.csv from CRLF to LF.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@olayway olayway changed the title from "Add corrections layer; fix 4 outdated FIFA codes from statoids" to "Add FIFA auto-update script and corrections layer; fix 6 outdated FIFA codes" May 8, 2026
Both Q51 (Antarctica continent) and Q21590062 (Antarctic Treaty Area)
carry P297=AQ, causing non-deterministic ordering in the SPARQL results.
Q51 is the correct item; filter Q21590062 the same way other known bad
duplicates are handled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@olayway olayway changed the title from "Add FIFA auto-update script and corrections layer; fix 6 outdated FIFA codes" to "Add FIFA auto-update script; fix 6 outdated FIFA codes; add corrections layer" May 8, 2026
@olayway olayway marked this pull request as ready for review May 8, 2026 11:44
@olayway olayway requested a review from luccasmmg May 8, 2026 11:48

@luccasmmg luccasmmg left a comment

lgtm

@luccasmmg luccasmmg merged commit cb1f176 into main May 8, 2026