Ethnicity, languages, and immigration#171

Closed
caitlink12 wants to merge 24 commits into v3.0.0-validation-infrastructure from ethnicity

@caitlink12

Added ICES survey cycles to the existing variables SDCDCGT_A and SDCGLNG in variables and variable_details

included ICES survey cycles to existing SDCDCGT_A variable and updated suffixes from _i to _m
included ICES survey cycles to existing SDCDCGT_B variable and updated suffixes from _i to _m
included ICES survey cycles to existing SDCGLNG variable
caitlink12 and others added 3 commits February 11, 2026 16:13
included ICES survey cycles into existing SDCGLNG variable
- Fixed typo in variable names (SDCDGT → SDCDCGT)
- Renamed _A/_B suffixes to descriptive _cat13/_cat7
- Extended SDCGLNG to 2019-2023 cycles with source mappings
  (SDC_025 for 2019-2021, LAN_01 for 2022-2023)
- Added cchs2009_m and cchs2010_m to SDCGLNG detail rows
- CEP-010 review document and L6 integration test results
…tity)

SDCFIMM: added 7 databases (2019-2023 master + PUMF). 2022-2023 master
requires recoding due to restructured categories (values swapped, 3-category
system collapsed to 2). 2022 PUMF uses SDCDGIMM source variable.

SDCDVABT: new harmonized variable covering 19 databases from 2005-2023.
Master files use SDCEFABT (2005), SDCDABT (2007-2014), SDCDVABT (2015+).
PUMF files use SDC_015 (2015-2020) and SDCDVABT (2022).

SDCGCGT discontinued in 2019+ (documented in CEP-010).
@DougManuel
Contributor

Reviewed all ethnicity variable worksheets in this PR. Changes look good.

SDCDCGT_cat13 / SDCDCGT_cat7

  • Renamed from SDCDCGT_A/B. Moving forward, we are trying to have more meaningful names than _A
  • Fixed typo in original variable names (SDCDGT → SDCDCGT, missing C)
  • Master-file only (8 databases, 2003-2018)

SDCGLNG — extended to 2023

  • Added 2019-2023 databases (master + PUMF)
  • Source variable: SDC_025 (2019-2021), LAN_01 (2022-2023)

SDCFIMM — extended to 2023

  • Added 7 databases (2019-2023 master + PUMF)
  • 2019-2021: same 2-category structure, source SDCDVIMM
  • 2022-2023 master: categories restructured (values swapped + non-permanent resident added as 3rd category). Recoding collapses categories 2 and 3 to harmonized value 1 (immigrant).
  • 2022 PUMF: source variable renamed to SDCDGIMM with swapped values
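
The SDCFIMM recoding described above can be sketched as a pair of lookup tables. This is a minimal illustration in Python rather than the worksheet rows themselves. Harmonized value 1 = immigrant comes from the review text; harmonized value 2 = non-immigrant and the source-side meaning of each code are assumptions inferred from "values swapped", and `recode_sdcfimm` is a hypothetical helper name.

```python
# Harmonized SDCFIMM codes. 1 = immigrant is stated in the review;
# 2 = non-immigrant is an assumption based on the 2-category structure.
IMMIGRANT, NON_IMMIGRANT = 1, 2

# 2022-2023 master (assumed source coding: 1 = non-immigrant, 2 = immigrant,
# 3 = non-permanent resident). Source codes 2 and 3 collapse to harmonized 1.
MASTER_2022_2023 = {1: NON_IMMIGRANT, 2: IMMIGRANT, 3: IMMIGRANT}

# 2022 PUMF (SDCDGIMM): two categories with swapped values.
PUMF_2022 = {1: NON_IMMIGRANT, 2: IMMIGRANT}


def recode_sdcfimm(value, mapping):
    """Return the harmonized code, or None when the source code is unmapped."""
    return mapping.get(value)


master_result = [recode_sdcfimm(v, MASTER_2022_2023) for v in (1, 2, 3, 9)]
pumf_result = [recode_sdcfimm(v, PUMF_2022) for v in (1, 2)]
print(master_result)
print(pumf_result)
```

Unmapped codes (such as a missing-value code like 9 in this sketch) fall through to `None` here; the real worksheets handle missing codes with their own rows.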

SDCDVABT — new variable (Aboriginal/Indigenous identity)

  • 19 databases from 2005 to 2023 (master + PUMF)
  • Binary harmonized categories: 1 = Aboriginal/Indigenous, 2 = Non-Aboriginal/Non-Indigenous
  • Source variables: SDCEFABT (2005), SDCDABT (2007-2014), SDCDVABT (2015+ master), SDC_015 (2015-2020 PUMF)
  • 2001-2003 excluded: SDCA_7L/SDCC_7L measure racial origin, not identity

SDCGCGT — discontinued in 2019+

  • SDCGCGT and related variables (SDCDGCGT, SDCDVCGT) were discontinued after 2017-2018. Replacement variables (SDCDVFLA, SDCDVVM) have different category structures and cannot be mapped to the existing harmonized variable.

Documented in CEP-010.

Review details

See ceps/cep-010-ethnicity/PR-171-review-summary.md for full L0-L6 review and source documentation.

@DougManuel
Contributor

Ready to merge from my perspective.

@rafdoodle rafdoodle changed the base branch from v3.0.0-validation-infrastructure to v3 February 13, 2026 18:42
@rafdoodle rafdoodle changed the base branch from v3 to v3.0.0-validation-infrastructure February 13, 2026 18:43
@rafdoodle rafdoodle changed the base branch from v3.0.0-validation-infrastructure to v3 February 13, 2026 18:49
@rafdoodle rafdoodle changed the base branch from v3 to v3.0.0-validation-infrastructure February 13, 2026 18:53
…ables and for immigration/language variables Doug worked on
Collaborator

@rafdoodle rafdoodle left a comment


Ethnicity and immigration status variables are good now. However, SDC_025 (2019-2021) and LAN_01 (2022-2023) are not valid extensions of SDCGLNG (Languages - can converse) as they have completely different non-missing language categories that cannot be recoded the same way as SDCGLNG:

SDCGLNG codes: 1 = English with or without another language, 2 = French with or without another language, 3 = English & French with or without other language, 4 = Neither English nor French (other)
SDC_025/LAN_01 codes: 1 = English only, 2 = French only, 3 = Both, 4 = Neither

They may instead be valid extensions of SDC_5A_1 (Knowledge of official languages).

Additionally, SDC_5A_1, SDCGLHM (Language(s) spoken at home), and SDCDFOLS (First official language spoken) all require a proper in-depth databaseStart and variableStart review for PUMF and Master, as their current information in the worksheets is hard to follow.

@rafdoodle rafdoodle changed the title from "Ethnicity" to "Ethnicity and languages" Feb 14, 2026
@DougManuel
Contributor

Review: PR #171 (Ethnicity and languages)

Reviewed 8 SDC variables: SDCDCGT_cat13/cat7 (renamed from _A/_B), SDCGLNG, SDCFIMM, SDCDVABT (new), SDCGCGT, SDCGLHM, SDC_5A_1. L6 integration passed for all PUMF-testable variables. Zero non-scope content changes.

Full review details in CEP-010 (ceps/cep-010-ethnicity/), updated with 2026-02-20 review findings.

Changes summary

  • Renames: SDCDCGT_A → SDCDCGT_cat13, SDCDCGT_B → SDCDCGT_cat7
  • Deleted: SDCGCGT_A (16 rows), SDCGLNG_A (10 rows)
  • New variable: SDCDVABT (Aboriginal/Indigenous identity, 24 rows, 2005-2023)
  • Extensions: SDCFIMM (5→19 rows, added 2019-2023), SDCGLNG (16→30 rows, added 2019-2023)
  • _i → _m conversion: Complete across all SDC variables

Fixes applied

  1. _NA::a → _NAa / _NA::b → _NAb in dummyVariable (76 rows across all SDC-prefixed variables): colons are invalid in identifiers.
  2. Trailing empty columns removed (19 extra columns in variable_details.csv header).
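
Both fixes above are mechanical, so they can be illustrated with a short sketch. The Python below is not the project's tooling (a later commit in this PR mentions fix-worksheets.R for normalisation); it only shows the two transformations on in-memory rows. The sample values such as `SDCFIMM_NA::a` and the helper name `clean_rows` are hypothetical.

```python
def clean_rows(rows):
    """Drop trailing unnamed columns and remove colons from dummyVariable.

    `rows` is a list of lists with the header as the first row.
    """
    header = rows[0]
    # Find the last column that actually has a header name.
    width = len(header)
    while width > 0 and header[width - 1].strip() == "":
        width -= 1
    cleaned = [row[:width] for row in rows]

    # Rewrite "_NA::a" / "_NA::b" to "_NAa" / "_NAb" in dummyVariable only.
    dv = cleaned[0].index("dummyVariable")
    for row in cleaned[1:]:
        if dv < len(row):
            row[dv] = row[dv].replace("_NA::", "_NA")
    return cleaned


sample = [
    ["variable", "dummyVariable", "", ""],
    ["SDCFIMM", "SDCFIMM_NA::a", "", ""],
    ["SDCFIMM", "SDCFIMM_NA::b", "", ""],
]
result = clean_rows(sample)
print(result)
```

The sketch trims columns based on the header alone, which matches the description of extra columns in the header; a stricter version would also confirm the dropped cells are empty in every data row.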

L6 integration results

  • All 9 PUMF cycles pass: SDCFIMM, SDCGCGT
  • Era-specific: SDCGLNG (2001-2010 only in PUMF), SDCDVABT (2015+ only), SDCGLHM (2007+), SDC_5A_1 (2011+)
  • Master-only: SDCDCGT_cat13, SDCDCGT_cat7 (ethnicity suppressed from PUMF)

Pre-existing issues (not introduced by this PR)

  • SDC_5A_1: 48 rows with _s databases

Recommendation

PR is good to merge. Close and delete the branch after merge.

- Replace _NA::a/_NA::b with _NAa/_NAb in SDC-prefixed dummyVariable (76 rows)
- Remove 19 extra empty columns from variable_details.csv header (41→22 columns)
- Update CEP-010 review summary with 2026-02-20 L6 integration results
- Correct invalid database names cchs2019_m/cchs2020_m -> cchs2019_2020_m
  in SDCGLNG, SDCDVABT, and SDCFIMM rows
- Extend SDC_5A_1 with Master 2011-2023 coverage (SDC_025 for 2011-2021,
  LAN_01 for 2022-2023); no recoding required (1:1 value code mapping)
- Add harmonization note to SDCGLNG description documenting the
  2011 label shift (SDC_025/LAN_01 constructs are functionally equivalent)
- SDCGLNG SDCDLNG 7->4 recoding rows for Master 2007-2010 confirmed present

Deferred: SDCGLHM/SDCDFOLS Master coverage gaps tracked in issue #178
@DougManuel
Contributor

Code review

Reviewed 13 SDC variables for PUMF and Master across 2001-2023.

L0-L1: Source variable verification (MCP-confirmed)

SDCGLNG 2019+ mappings (the focus of Rafidul's feedback):

  • SDC_025 for 2019-2021 Master: confirmed exists, 4-category structure matches PUMF grouped format
  • LAN_01 for 2022-2023 Master: confirmed exists, same 4-category structure
  • SDCDLNG 7→4 collapsing for pre-2010 Master: semantically correct

Other new/modified variable sources verified:

  • SDCDVABT: SDCDVABT (Master 2015-2023), SDC_015 (PUMF 2015-2020), SDCDABT (Master 2007-2014)
  • SDCFIMM: SDCDVIMM (2019-2023), SDCDGIMM (2022 PUMF, correctly reverses 1↔2 codes), SDCDVIMM 2022-2023 correctly handles new code 3 (non-permanent resident)
  • SDCDCGT_cat13/cat7: SDCDVCGT (Master 2015-2018), SDCDCGT (Master 2003-2014)
  • SDC_5A_1: SDC_025 (2015-2021), LAN_01 (2022-2023)

All era boundary defaults verified correct — no pre-2015 variable names leaking to post-2015 databases.

L6 integration test

rec_with_table() ran successfully for all PUMF cycles. See CEP-010 artifacts for details.

Cross-cycle results: 8 PUMF variables tested across 9 cycles (2001-2018). All show 100% valid within their expected coverage range. No step changes at era boundaries. SDCGLNG correctly limited to 2001-2010 (PUMF coverage ends there).

Issues found

1 issue (P1, confidence 95):

  1. _s → _m inconsistency between worksheets: variables.csv was updated from _s (deprecated share file) to _m for 6 variables, but variable_details.csv was not updated to match. This means rec_with_table() with database_name = "cchs2009_m" would find the variable in variables.csv but no matching recode rows in variable_details.csv.

    Affected variables: SDCGCBG, SDCGCBG_A, SDCGLHM_A, SDCGRES, SDCDFOLS, SDC_5A_1 (and derived variables immigration_der, pct_time_der)

    Fix: Update variable_details.csv to replace cchs2009_s → cchs2009_m, cchs2010_s → cchs2010_m, cchs2012_s → cchs2012_m for these variables' databaseStart and variableStart fields.
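
A mismatch of this kind can be detected by comparing, per variable, the databases declared in variables.csv against those covered by rows in variable_details.csv. The sketch below is a hypothetical Python stand-in, not the project's own checks. It assumes both worksheets have `variable` and `databaseStart` columns and that databaseStart is a comma-separated list; the sample rows are invented for illustration.

```python
def split_dbs(cell):
    """Split a comma-separated databaseStart cell into a set of names."""
    return {name.strip() for name in cell.split(",") if name.strip()}


def orphan_databases(variables_rows, details_rows):
    """Return {variable: [databases declared but lacking detail rows]}."""
    declared, covered = {}, {}
    for row in variables_rows:
        declared.setdefault(row["variable"], set()).update(
            split_dbs(row["databaseStart"]))
    for row in details_rows:
        covered.setdefault(row["variable"], set()).update(
            split_dbs(row["databaseStart"]))
    orphans = {}
    for variable, dbs in declared.items():
        missing = dbs - covered.get(variable, set())
        if missing:
            orphans[variable] = sorted(missing)
    return orphans


# Invented sample: variables.csv already uses _m, details still use _s.
variables_rows = [
    {"variable": "SDCGRES", "databaseStart": "cchs2001_p, cchs2009_m"},
]
details_rows = [
    {"variable": "SDCGRES", "databaseStart": "cchs2001_p, cchs2009_s"},
]
orphans = orphan_databases(variables_rows, details_rows)
print(orphans)
```

An empty result means every declared database has at least one matching detail row for that variable.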

Informational:

  • SDCGLNG has no coverage for 2011-2018 (PUMF or Master). SDC_025 is available on both PUMF and Master from 2015 — expansion opportunity for a follow-up.

Checked: era boundary defaults, databaseStart consistency, PUMF/Master naming, known error patterns, L6 PUMF integration.

…lability guidance

cchsflow-review:
- Add pre-2007 explicit mapping check (Check 7)
- Add DerivedVar mixed _p/_m detection (Check 8)
- Update era boundary section: concept-first with CCHS naming eras table
- Add DerivedVar feeder check under L6

cchsflow-validation:
- New skill with checks 1-8 including severity ratings

cchsflow-worksheets:
- Add PUMF availability by cycle table (2001-2023)
- Document cchs2021_p as invalid database name
- Add DerivedVar row splitting guidance and age feeder split table
Four docs covering: harmonization workflow (L0-L6), PUMF vs Master
splitting (including PUMF availability by cycle table), derived variable
functions, and variableStart/databaseStart authoring patterns.
… new checks

Add recode block terminology definition to Check 2b. Clarify that the
collision check is at the (database, recStart) level, not databaseStart
overlap alone. Reference check_recode_blocks() and check_invalid_databases()
automated checks.

Add cchs2021_p/2022_p/2023_p to Check 5 invalid database patterns with
context on why they don't exist as standalone PUMF files.
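
A check for the invalid database names mentioned in this thread might look like the following. It is a hypothetical Python sketch and not the check_invalid_databases() function referenced above, whose implementation is not shown here. The lookup table contains only names this PR calls invalid; the replacement is filled in only where the thread states one.

```python
# Invalid names taken from this PR's discussion. The value is the corrected
# name where the thread gives one, or None where it does not.
KNOWN_INVALID = {
    "cchs2019_m": "cchs2019_2020_m",
    "cchs2020_m": "cchs2019_2020_m",
    "cchs2021_p": None,
    "cchs2022_p": None,
    "cchs2023_p": None,
}


def find_invalid_databases(database_start):
    """Return (name, suggested_replacement) pairs found in a databaseStart cell."""
    names = [n.strip() for n in database_start.split(",") if n.strip()]
    return [(n, KNOWN_INVALID[n]) for n in names if n in KNOWN_INVALID]


flagged = find_invalid_databases("cchs2017_2018_m, cchs2019_m, cchs2021_p")
print(flagged)
```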

Add multi-block databaseStart fix rule to Step 10: narrow each block to only
the databases where its source variable exists; never replace the full
databaseStart (risks dropping shorthand-covered databases). Include the
Beyond Compare verification step.
Split the 68KB monolithic SKILL.md (1,066 lines) into a 372-line
orchestrator that delegates to focused docs:

- docs/worksheet-reference.md (moved from docs/)
- docs/l0-l2-documentation-review.md (L0-L2 checks)
- docs/l3-l5-worksheet-checks.md (L3-L5 checks)
- docs/l6-implementation-validation.md (rec_with_table testing)
- docs/csv-validation-and-fixes.md (validation tools + fix workflow)
- docs/review/ (Gem system prompt, notebook manifest/coverage)

Added prerequisite section requiring worksheet-reference.md be read
before any review. Added .gitignore exception for skill docs/ folders.
The sub-item numbered 3b is renumbered to 4, with subsequent items
shifted to 5-7 for consistent sequential numbering.
…dit, and dev mode

- Add PUMF-Master variable family pattern documentation to worksheet-reference.md
  explaining the systematic relationship between Master continuous, PUMF categorical,
  and _cont bridging variables (with DHH_AGE example and DHHGAGE_B footnote)
- Add Check 8: Completeness audit (8a missing-code rows, 8b cycle coverage,
  8c variable family completeness) to l3-l5-worksheet-checks.md
- Add --dev mode to SKILL.md for authoring/development use where omissions are P1
- Cross-reference variable family pattern from Check 3 (PUMF vs Master naming)
…done criteria

SKILL.md orchestrates existing docs (foundations, patterns, testing) and adds
a 5-step done criteria checklist that includes R CMD check — filling a gap
where package-level validation was missing from the DV function workflow.
Triage step now detects when PRs touch R/ or tests/ files, flags that
Step 7b package health check will run, and cross-references the
cchsflow-derive done criteria for new or modified functions. Also
strengthens GHA failure handling — treat failing CI as blocking.
- Remove orphan cchs2009_m, cchs2010_m, cchs2012_m from SDCDCGT_cat13
  and SDCDCGT_cat7 databaseStart (no matching variable_details rows)
- Migrate deprecated _s suffixes to _m for 6 in-scope variables:
  SDC_5A_1, SDCDFOLS, SDCGCBG, SDCGCBG_A, SDCGLHM_A, SDCGRES
  (59 row-level changes in variable_details.csv)
- Normalise CSV quoting via fix-worksheets.R (formatting only, no
  content changes beyond the above)
- Add Gem verification prompt and extracted CSVs for NotebookLM
  cross-check (CEP-010)

Relates to #179
@DougManuel
Contributor

Review summary — PR #171 (ethnicity branch)

Scope

13 ethnicity, language, and migration variables reviewed across 3 domains:

  • Ethnicity (4): SDCGCGT, SDCDCGT_cat13, SDCDCGT_cat7, SDCDVABT
  • Language (4): SDCGLNG, SDC_5A_1, SDCGLHM, SDCGLHM_A
  • Migration (5): SDCFIMM, SDCGCBG, SDCGCBG_A, SDCDFOLS, SDCGRES

Fixes applied

Fix 1 — databaseStart mismatch (P1): Removed orphan cchs2009_m, cchs2010_m, cchs2012_m from SDCDCGT_cat13 and SDCDCGT_cat7 databaseStart in variables.csv. These databases had no matching rows in variable_details.csv.

Fix 2 — _s to _m migration (pre-existing): Converted deprecated share file suffixes to single-year Master form for 6 variables (SDC_5A_1, SDCDFOLS, SDCGCBG, SDCGCBG_A, SDCGLHM_A, SDCGRES). 59 row-level changes in variable_details.csv across both databaseStart and variableStart db:: prefixes.

Fix 3 — CSV normalisation: Ran fix-worksheets.R to standardise quoting across both worksheets (formatting only).
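
The _s to _m migration in Fix 2 amounts to a suffix rewrite applied to both databaseStart lists and variableStart db:: prefixes. A minimal Python sketch of that rewrite is below; the regular expression and the helper name `migrate_s_to_m` are illustrative assumptions, and the actual change was made in the worksheets directly.

```python
import re

# Match a database name such as cchs2009_s or cchs2011_2012_s and capture
# everything before the deprecated _s suffix.
_S_SUFFIX = re.compile(r"\b(cchs\d{4}(?:_\d{4})?)_s\b")


def migrate_s_to_m(cell):
    """Rewrite deprecated _s database suffixes to _m inside a worksheet cell."""
    return _S_SUFFIX.sub(r"\1_m", cell)


database_start = migrate_s_to_m("cchs2001_p, cchs2009_s, cchs2010_s, cchs2012_s")
variable_start = migrate_s_to_m("cchs2009_s::SDCDFOLS, [SDCDFOLS]")
print(database_start)
print(variable_start)
```

Because the pattern is anchored on the `cchs` prefix and a word boundary after `_s`, PUMF names such as `cchs2001_p` and variable names that happen to contain `_s` are left alone.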

L6 integration test

All 9 PUMF-testable variables pass with 100% validity across cchs2001_p through cchs2017_2018_p (200/200 rows per cycle). No step changes at era boundaries. Results at ceps/cep-010-ethnicity/.

Gem verification (NotebookLM)

Cross-checked all 13 variables against StatCan data dictionaries. Findings:

| Finding | Status | Action |
| --- | --- | --- |
| SDCDCGT_cat13/cat7 ends at 2018m (no 2019+ Master ethnicity) | By design | Successor variables tracked in #179 |
| SDCGLNG / SDC_5A_1 overlap on 2019+ Master | By design | Both map SDC_025/LAN_01; different historical lineages |
| SDCGCGT binary (White/Non-white) | Correct | PUMF version is 2-category; 13-cat is Master-only |
| SDCDFOLS Master coverage limited to cchs2012_m | Pre-existing gap | Extension opportunity for follow-up |
| cchs2023_p not in SDCDVABT or SDCFIMM | Not actionable | No cchs2023_p database in cchsflow yet |

No blocking issues. All findings are either by-design, pre-existing scope, or tracked for follow-up.

Related

Captures NotebookLM cross-check results for all 13 ethnicity, language,
and migration variables. 0 blocking issues, 5 findings classified as
by-design, correct, pre-existing gap, or not actionable.
@DougManuel
Contributor

Reviewed and ready to merge. @rafdoodle for final checks.

This PR was fully reviewed (CEP-010, L0-L6 with Gem verification) and fixes applied in prior commits:

  • SDCDGT_A/B typo corrected to SDCDCGT_A/B
  • databaseStart synced between variables.csv and variable_details.csv
  • dummyVariable colon identifiers fixed (76 rows)
  • Trailing empty columns removed
  • _s → _m database suffixes resolved across all ethnicity/language variables
  • SDCGLNG extended to 2019-2023, SDCFIMM extended to 2019-2023, SDCDVABT added (2005-2023)

Re-confirmed today: no _s databases remain in any in-scope variable, both files are in sync.

CEP: ceps/cep-010-ethnicity/

@rafdoodle rafdoodle changed the title from "Ethnicity and languages" to "Ethnicity, languages, and immigration" Apr 9, 2026
Collaborator

@rafdoodle rafdoodle left a comment


Final worksheet and implementation changes — ethnicity branch

New intermediate variable

| Variable | Notes |
| --- | --- |
| SDCGRES_cont | New. Converts categorical SDCGRES (1 = 0–9 yr, 2 = 10+ yr) to continuous midpoints (4.5 / 15). Required as a feeder for pct_time_der and immigration_der PUMF blocks. |
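
The midpoint conversion in the row above is a two-entry lookup. The Python sketch below only illustrates the mapping stated there (1 → 4.5, 2 → 15); the function name is hypothetical and the real conversion lives in the worksheet rows.

```python
# SDCGRES category -> assumed years-in-Canada midpoint, as stated in the table:
# 1 = 0-9 years -> 4.5, 2 = 10+ years -> 15.
SDCGRES_MIDPOINTS = {1: 4.5, 2: 15.0}


def sdcgres_to_cont(code):
    """Return the continuous midpoint for an SDCGRES code, or None if unmapped."""
    return SDCGRES_MIDPOINTS.get(code)


midpoints = [sdcgres_to_cont(code) for code in (1, 2, 6)]
print(midpoints)
```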

SDC variables

| Variable | Detail |
| --- | --- |
| SDCDCGT_cat13 | Removed. Split into three purpose-specific variables below. |
| SDCDCGT_cat7 | Renamed/restructured + coverage extended. Replaces SDCDCGT_cat13. Cycles corrected to granular form (cchs2009_m, cchs2010_m, cchs2012_m added; cchs2009_2010_m, cchs2011_2012_m, cchs2013_2014_m replaced). Extended to cchs2001_m::SDCADRAC; now 11 master cycles. |
| SDCDCGT_pre2015 | New. Split from SDCDCGT_cat13. 14-category pre-2015 master ethnicity (cchs2001_m–cchs2013_2014_m). Extended to cchs2001_m::SDCADRAC; now 9 master cycles. |
| SDCDCGT_2015plus | New. Split from SDCDCGT_cat13. 14-category 2015+ master ethnicity (cchs2015_2016_m, cchs2017_2018_m). |
| SDCGCBG_A | Removed. Superseded by new SDCGCB. |
| SDCGCBG | Cleaned up. Removed incorrect master cycles (cchs2009_m, cchs2010_m, cchs2012_m); now PUMF-only (12 cycles). |
| SDCGCB | New. Master equivalent of SDCGCBG. Full master coverage (16 cycles, cchs2001_m–cchs2023_m). |
| SDCGRES_A | Removed. Superseded by new SDCDRES. |
| SDCGRES | Cleaned up. Removed incorrect master cycles (cchs2009_m, cchs2010_m, cchs2012_m); now PUMF-only (12 cycles). |
| SDCDRES | New. Master equivalent of SDCGRES. Continuous years in Canada. Full master coverage (16 cycles, cchs2001_m–cchs2023_m). |
| SDCGLHM | Removed. Split into SDCGLHM_cat4 (PUMF) and SDCGLHM_cat7 (master). |
| SDCGLHM_A | Removed. Replaced by SDCGLHM_cat7. |
| SDCGLHM_cat4 | New. PUMF-only (9 cycles). Renamed from SDCGLHM. |
| SDCGLHM_cat7 | New. Master-only (7 cycles, cchs2007_2008_m–cchs2013_2014_m). Renamed from SDCGLHM_A. |
| SDCGLNG | Split + cleaned up. Narrowed to PUMF-only (6 cycles, cchs2001_p–cchs2010_p). Master cycles moved to new SDCDLNG. |
| SDCDLNG | New. Master equivalent of SDCGLNG. Languages spoken (7 cycles, cchs2001_m–cchs2010_m). |
| SDCDFOLS | Coverage extended. Added full master coverage (cchs2011_2012_m–cchs2023_m); previously PUMF-only + cchs2012_m stub. |
| SDC_5A_1 | Coverage + ordering fixed. Corrected cycle ordering (cchs2011_2012_m before cchs2012_m). Extended to full master; simplified variableStart (removed redundant explicit mappings). |
| SDCDVABT | Block ordering fix (vd). Master blocks moved after PUMF blocks in variable_details.csv to enforce _p before _m convention. |

pct_time variables

| Variable | Detail |
| --- | --- |
| pct_time_der | Extended to master; feeder vars updated. Added 16 master cycles (cchs2001_m–cchs2023_m). Function renamed pct_time_fun → calculate_pct_time. PUMF feeder vars: [DHHGAGE_cont, SDCGCBG, SDCGRES_cont]; master: [DHH_AGE, SDCGCB, SDCDRES]. Per-cycle variableStart mappings added to variables.csv. |
| pct_time_der_cat10 | Extended to master; function renamed. databaseStart expanded to 28 cycles (matching pct_time_der). Function renamed pct_time_fun_cat → categorize_pct_time. |

immigration_der

| Variable | Detail |
| --- | --- |
| immigration_der | Extended to master; refactored. Removed stale _s cycles. Added 11 master cycles (cchs2001_m–cchs2017_2018_m), limited by SDCDCGT_cat7 availability. Function renamed immigration_fun → categorize_immigration; refactored from nested if_else2 to dplyr::case_when. Years parameter changed from categorical 1/2 to continuous with threshold (< 10 / ≥ 10), allowing SDCGRES_cont (PUMF) and SDCDRES (master) to be passed directly without an intermediate binning variable. PUMF feeder vars: [SDCFIMM, SDCGCBG, SDCGCGT, SDCGRES_cont]; master: [SDCFIMM, SDCGCB, SDCDCGT_cat7, SDCDRES]. Category labels updated to "Visible minority" language. variable_details.csv blocks restructured to interleaved PUMF/master format (v3 convention). dummyVariable format corrected: headers → N/A, NA rows → NAa/NAb (no colons). Per-cycle variableStart mappings added to variables.csv. |
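
The move to a continuous years parameter with a threshold at 10 means PUMF midpoints and master continuous values land in the same bin without a separate binning variable. The Python sketch below illustrates only that thresholding step. It is not the categorize_immigration function (which is written in R and returns immigration categories); the function name and the 1/2 output codes, borrowed from the SDCGRES coding, are assumptions for illustration.

```python
def years_in_canada_bin(years, threshold=10):
    """Bin continuous years in Canada: 1 if below the threshold, else 2.

    Returns None for missing input. Output codes mirror SDCGRES
    (1 = 0-9 years, 2 = 10+ years) purely for illustration.
    """
    if years is None:
        return None
    return 1 if years < threshold else 2


# PUMF feeder values are the SDCGRES_cont midpoints (4.5 and 15).
pumf_bins = [years_in_canada_bin(y) for y in (4.5, 15)]
# Master feeder values are continuous years (sample values invented).
master_bins = [years_in_canada_bin(y) for y in (3, 10, 27, None)]
print(pumf_bins)
print(master_bins)
```

Since 4.5 falls below the threshold and 15 falls above it, the midpoints recover the original SDCGRES category exactly, which is what lets both feeders share one code path.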

rafdoodle added a commit that referenced this pull request Apr 10, 2026
@rafdoodle
Collaborator

Changes manually merged to v3 via commit 702db21. Will close this PR now.
