Add FIFA auto-update script; fix 6 outdated FIFA codes; add corrections layer #103
Merged
Introduces scripts/corrections.py, a post-pipeline override step that patches specific field values that are wrong in upstream sources and would otherwise be overwritten on every `make update` run.
First use: fix four FIFA codes where statoids.com is outdated:
SD FIFA: SUD -> SDN (historical code, FIFA now uses SDN)
SS FIFA: '' -> SSD (South Sudan joined FIFA 2012, statoids has no entry)
LB FIFA: LIB -> LBN (LIB is Lebanon's MARC code, not its FIFA code)
GI FIFA: GBZ -> GIB (Gibraltar admitted to FIFA 2016 with code GIB)
The script docstring documents why no fully automated FIFA source exists, what the alternatives are (FIFA's own site, footballsquads.co.uk, Wikipedia), and why each falls short.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…a.com)
scripts/fifa.py fetches the full list of 211 valid FIFA member codes from inside.fifa.com/associations in a single HTTP request, then validates and corrects the FIFA column in country-codes.csv:
- Keeps existing codes that are already valid
- For invalid/missing codes, tries ISO 3166-1 alpha-3 as a candidate (covers the majority of statoids lag cases)
- Emits warnings for anything it cannot resolve automatically
Corrections found this run:
SD FIFA: SUD -> SDN (statoids lag)
SS FIFA: '' -> SSD (statoids missing entry)
LB FIFA: LIB -> LBN (wrong code from statoids)
GI FIFA: GBZ -> GIB (statoids lag, Gibraltar admitted 2016)
SG FIFA: SIN -> SGP (previously unknown error)
CW FIFA: '' -> CUW (previously unknown missing entry)
scripts/corrections.py is retained as an empty-but-documented framework for future field-specific overrides that cannot be automated; the FIFA entries it previously held are now handled by fifa.py on every pipeline run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation goodtables is unmaintained and no longer installable; frictionless is its actively maintained successor and validates the same datapackage.yml format. Also corrects the target file reference (datapackage.json → datapackage.yml). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add lineterminator='\n' to all csv.writer/DictWriter calls and pandas to_csv(), and convert the existing country-codes.csv from CRLF to LF. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both Q51 (Antarctica continent) and Q21590062 (Antarctic Treaty Area) carry P297=AQ, causing non-deterministic ordering in the SPARQL results. Q51 is the correct item; filter Q21590062 the same way other known bad duplicates are handled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- `scripts/fifa.py`: fetches the authoritative list of FIFA member codes directly from `inside.fifa.com/associations` on every pipeline run, replacing the stale statoids-sourced values; fixes 6 outdated codes (see table below)
- `scripts/corrections.py`: a general-purpose manual override layer (currently empty) that runs last in the pipeline, for cases where automated sources can't resolve an issue
- All CSV-writing scripts set `lineterminator='\n'` to prevent CRLF output on Windows runners
- Filters `Q21590062` (Antarctic Treaty Area) from the Wikidata SPARQL results, which was randomly displacing the correct `Q51` (Antarctica continent) because both carry `P297=AQ`
Supersedes #75
#75 (open since 2020) proposes a manual edit to fix Sudan's FIFA code from `SUD` to `SDN`. This PR makes that fix redundant, and explains why a manual edit alone was never the right solution:
The pipeline re-generates `country-codes.csv` from upstream sources on every `make update` run. Any manual CSV edit gets silently overwritten the next time the pipeline runs, because the FIFA values come from statoids.com, which still carries `SUD`. A direct fix to the CSV would last only until the next automated commit.
This PR addresses the root cause: it replaces statoids as the FIFA source with a live scraper against `inside.fifa.com`, so the correct codes are produced automatically on every run. Sudan (`SUD` → `SDN`) is one of six corrections this approach yields.
How fifa.py works
`inside.fifa.com/associations` is a server-side rendered page that lists all 211 FIFA member associations as hyperlinks in the form `/associations/<CODE>`. A single HTTP request is enough to extract the complete valid code list.
For each row in `country-codes.csv`:
- Keeps the existing code if it is already valid
- For an invalid or missing code, tries `ISO3166-1-Alpha-3` as a candidate (covers the majority of statoids lag cases, since many FIFA codes match ISO alpha-3)
- Emits a warning for anything it cannot resolve automatically (use `corrections.py` for known edge cases)
Why not statoids?
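`statoids.com` (the previous source for FIFA codes) is no longer maintained (page says: Last updated: 2016-03-02). Some codes had not been updated in years:
| Country (ISO alpha-2) | statoids value | Current FIFA code |
|---|---|---|
| SD | `SUD` | `SDN` |
| SS | (empty) | `SSD` |
| LB | `LIB` | `LBN` |
| GI | `GBZ` | `GIB` |
| SG | `SIN` | `SGP` |
| CW | (empty) | `CUW` |
The last two were only discovered by running the new scraper; they would not have been caught by a manual corrections approach.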
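The fetch-and-validate flow described under "How fifa.py works" can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/fifa.py`: the function names, the regular expression, and the warning text are assumptions, and the sketch treats each FIFA cell as a single code.

```python
import re
import urllib.request

# Page named in this PR; the exact URL scheme is an assumption.
ASSOCIATIONS_URL = "https://inside.fifa.com/associations"

# Assumed link shape: /associations/<CODE> with a three-letter uppercase code.
CODE_LINK = re.compile(r"/associations/([A-Z]{3})\b")


def fetch_html(url=ASSOCIATIONS_URL):
    """Download the associations page with a single HTTP request."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_codes(html):
    """Return the set of member codes linked from the page."""
    return set(CODE_LINK.findall(html))


def resolve_fifa(current, alpha3, valid):
    """Return (code, warning) for one CSV row.

    Keep a valid existing code; otherwise fall back to the ISO alpha-3
    code if FIFA lists it; otherwise leave the value unchanged and
    return a warning message for manual follow-up.
    """
    if current in valid:
        return current, None
    if alpha3 in valid:
        return alpha3, None
    return current, f"cannot resolve FIFA code {current!r} (alpha-3 {alpha3!r})"
```

With a valid set containing `SDN`, a call such as `resolve_fifa("SUD", "SDN", valid)` returns the alpha-3 candidate with no warning, which is the Sudan case from the table. Rows that fail both checks are left untouched, so nothing is guessed silently.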
Why not other sources?
footballsquads.co.uk
Pipeline fix: LF line endings
All CSV-writing scripts (`cleanup.py`, `corrections.py`, `fifa.py`, `reorder_columns.py`, `reorder_rows.py`) now explicitly set `lineterminator='\n'`. Without this, Python's `csv` module defaults to `\r\n` (the CSV standard) and pandas uses the platform line separator, both of which produce CRLF on Windows runners, making the entire CSV appear changed on every pipeline run even when the data is identical.
Pipeline fix: Wikidata non-determinism for Antarctica
Both `Q51` (Antarctica, the continent) and `Q21590062` (Antarctic Treaty Area) carry `P297=AQ` in Wikidata. The SPARQL query in `wd_countries.sh` returns both, and since `wd_countries.py` picks the first result, Antarctica's `wikidata_id` could flip between the two across runs depending on endpoint ordering. `Q21590062` has been added to the existing filter list in `wd_countries.sh` alongside other known bad duplicates, ensuring `Q51` is always used.
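
The actual fix lives in the SPARQL filter in `wd_countries.sh`. The same idea can be illustrated in Python. This is a hypothetical sketch: `EXCLUDED_QIDS` and `pick_wikidata_ids` are not names from the repository, and the lowest-QID tie-break is an added assumption, not something this PR implements.

```python
# QIDs to drop. Q21590062 comes from this PR; the constant name is hypothetical.
EXCLUDED_QIDS = {"Q21590062"}


def pick_wikidata_ids(rows, excluded=EXCLUDED_QIDS):
    """Map each ISO alpha-2 code to one QID, independent of input order.

    rows: iterable of (alpha2, qid) pairs, as a SPARQL endpoint might return.
    """
    candidates = {}
    for alpha2, qid in rows:
        if qid not in excluded:
            candidates.setdefault(alpha2, set()).add(qid)
    # Assumption: the lowest numeric QID breaks any remaining tie,
    # so endpoint ordering cannot change the result.
    return {
        alpha2: min(qids, key=lambda q: int(q[1:]))
        for alpha2, qids in candidates.items()
    }
```

Calling `pick_wikidata_ids` with the two `AQ` rows in either order yields `Q51`, because the excluded item never becomes a candidate.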