Ethnicity, languages, and immigration #171
caitlink12 wants to merge 24 commits into v3.0.0-validation-infrastructure
- Fixed typo in variable names (SDCDGT → SDCDCGT)
- Renamed _A/_B suffixes to descriptive _cat13/_cat7
- Extended SDCGLNG to 2019-2023 cycles with source mappings (SDC_025 for 2019-2021, LAN_01 for 2022-2023)
- Added cchs2009_m and cchs2010_m to SDCGLNG detail rows
- CEP-010 review document and L6 integration test results
…tity)
SDCFIMM: added 7 databases (2019-2023 master + PUMF). 2022-2023 master requires recoding due to restructured categories (values swapped, 3-category system collapsed to 2). 2022 PUMF uses SDCDGIMM source variable.
SDCDVABT: new harmonized variable covering 19 databases from 2005-2023. Master files use SDCEFABT (2005), SDCDABT (2007-2014), SDCDVABT (2015+). PUMF files use SDC_015 (2015-2020) and SDCDVABT (2022).
SDCGCGT discontinued in 2019+ (documented in CEP-010).
Reviewed all ethnicity variable worksheets in this PR. Changes look good.
SDCDCGT_cat13 / SDCDCGT_cat7
SDCGLNG: extended to 2023
Added 2019-2023 databases (master + PUMF)
SDCFIMM: extended to 2023
SDCDVABT: new variable (Aboriginal/Indigenous identity)
SDCGCGT: discontinued in 2019+
Documented in CEP-010.
Review details
See ceps/cep-010-ethnicity/PR-171-review-summary.md for full L0-L6 review and source documentation.
Ready to merge from my perspective.
…ables and for immigration/language variables Doug worked on
Ethnicity and immigration status variables are good now. However, SDC_025 (2019-2021) and LAN_01 (2022-2023) are not valid extensions of SDCGLNG (Languages - can converse) as they have completely different non-missing language categories that cannot be recoded the same way as SDCGLNG:
SDCGLNG codes: 1 = English with or without another language, 2 = French with or without another language, 3 = English & French with or without other language, 4 = Neither English nor French (other)
SDC_025/LAN_01 codes: 1 = English only, 2 = French only, 3 = Both, 4 = Neither
They may instead be valid extensions of SDC_5A_1 (Knowledge of official languages).
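To make the mismatch concrete, here is a minimal Python sketch that places the two code lists quoted above side by side. The helper `mismatched_codes` is hypothetical and only compares label strings; it is not part of cchsflow, and deciding whether two categories are conceptually equivalent still requires the StatCan data dictionaries.

```python
# Category labels as quoted in the review comment above.
SDCGLNG_LABELS = {
    1: "English with or without another language",
    2: "French with or without another language",
    3: "English & French with or without other language",
    4: "Neither English nor French (other)",
}
SDC_025_LAN_01_LABELS = {
    1: "English only",
    2: "French only",
    3: "Both",
    4: "Neither",
}


def mismatched_codes(target, source):
    """Return codes whose labels differ between two code lists (hypothetical helper)."""
    return sorted(code for code in target if target[code] != source.get(code))


# The numeric codes line up, but every label differs, so a code-for-code
# copy would silently change what each category means.
mismatches = mismatched_codes(SDCGLNG_LABELS, SDC_025_LAN_01_LABELS)
```

The point of the sketch is only that matching numeric codes do not imply matching categories.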
Additionally, SDC_5A_1, SDCGLHM (Language(s) spoken at home), and SDCDFOLS (First official language spoken) all require a proper in-depth databaseStart and variableStart review for PUMF and Master, as their current information in the worksheets is hard to follow.
Review: PR #171 (Ethnicity and languages)
Reviewed 8 SDC variables: SDCDCGT_cat13/cat7 (renamed from _A/_B), SDCGLNG, SDCFIMM, SDCDVABT (new), SDCGCGT, SDCGLHM, SDC_5A_1. L6 integration passed for all PUMF-testable variables. Zero non-scope content changes. Full review details in CEP-010.
Changes summary
Fixes applied
L6 integration results
Pre-existing issues (not introduced by this PR)
Recommendation
PR is good to merge. Close and delete the branch after merge.
- Replace _NA::a/_NA::b with _NAa/_NAb in SDC-prefixed dummyVariable (76 rows)
- Remove 19 extra empty columns from variable_details.csv header (41→22 columns)
- Update CEP-010 review summary with 2026-02-20 L6 integration results
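The dummyVariable rename in the first item can be sketched as a small string rewrite. This is an illustrative Python sketch, not the script used in the commit: the function name `fix_dummy_variable` and the example value `SDCFIMM_NA::a` are assumptions, as is the idea that the `_NA::a`/`_NA::b` marker sits at the end of the name.

```python
import re


def fix_dummy_variable(name: str) -> str:
    """Rewrite a trailing '_NA::a' / '_NA::b' as '_NAa' / '_NAb' (hypothetical helper).

    Only SDC-prefixed names are touched, mirroring the scope of the commit.
    """
    if not name.startswith("SDC"):
        return name
    return re.sub(r"_NA::([ab])$", r"_NA\1", name)


# Example values are made up for illustration.
fixed = [fix_dummy_variable(n) for n in ["SDCFIMM_NA::a", "SDCFIMM_NA::b", "DHH_NA::a"]]
```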
- Correct invalid database names cchs2019_m/cchs2020_m -> cchs2019_2020_m in SDCGLNG, SDCDVABT, and SDCFIMM rows
- Extend SDC_5A_1 with Master 2011-2023 coverage (SDC_025 for 2011-2021, LAN_01 for 2022-2023); no recoding required (1:1 value code mapping)
- Add harmonization note to SDCGLNG description documenting the 2011 label shift (SDC_025/LAN_01 constructs are functionally equivalent)
- SDCGLNG SDCDLNG 7->4 recoding rows for Master 2007-2010 confirmed present

Deferred: SDCGLHM/SDCDFOLS Master coverage gaps tracked in issue #178
Code review
Reviewed 13 SDC variables for PUMF and Master across 2001-2023.
L0-L1: Source variable verification (MCP-confirmed)
SDCGLNG 2019+ mappings (the focus of Rafidul's feedback):
Other new/modified variable sources verified:
All era boundary defaults verified correct: no pre-2015 variable names leaking to post-2015 databases.
L6 integration test
Cross-cycle results: 8 PUMF variables tested across 9 cycles (2001-2018). All show 100% valid within their expected coverage range. No step changes at era boundaries. SDCGLNG correctly limited to 2001-2010 (PUMF coverage ends there).
Issues found
1 issue (P1, confidence 95):
Informational:
Checked: era boundary defaults, databaseStart consistency, PUMF/Master naming, known error patterns, L6 PUMF integration.
…lability guidance
cchsflow-review:
- Add pre-2007 explicit mapping check (Check 7)
- Add DerivedVar mixed _p/_m detection (Check 8)
- Update era boundary section: concept-first with CCHS naming eras table
- Add DerivedVar feeder check under L6

cchsflow-validation:
- New skill with checks 1-8 including severity ratings

cchsflow-worksheets:
- Add PUMF availability by cycle table (2001-2023)
- Document cchs2021_p as invalid database name
- Add DerivedVar row splitting guidance and age feeder split table
Four docs covering: harmonization workflow (L0-L6), PUMF vs Master splitting (including PUMF availability by cycle table), derived variable functions, and variableStart/databaseStart authoring patterns.
… new checks Add recode block terminology definition to Check 2b. Clarify that the collision check is at the (database, recStart) level, not databaseStart overlap alone. Reference check_recode_blocks() and check_invalid_databases() automated checks. Add cchs2021_p/2022_p/2023_p to Check 5 invalid database patterns with context on why they don't exist as standalone PUMF files. Add multi-block databaseStart fix rule to Step 10: narrow each block to only the databases where its source variable exists; never replace the full databaseStart (risks dropping shorthand-covered databases). Include the Beyond Compare verification step.
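As one possible reading of the collision rule, the Python sketch below flags rows of the same variable that claim the same database and the same recStart. It is not the `check_recode_blocks()` implementation referenced above; the function name `find_recode_collisions` and the example rows are hypothetical, and only the column names (`variable`, `databaseStart`, `recStart`) come from the worksheets.

```python
from collections import defaultdict


def find_recode_collisions(rows):
    """Group row indices by (variable, database, recStart); return groups with >1 row.

    rows: list of dicts with 'variable', 'databaseStart' (comma-separated), 'recStart'.
    Hypothetical helper for illustration only.
    """
    seen = defaultdict(list)
    for index, row in enumerate(rows):
        for database in (d.strip() for d in row["databaseStart"].split(",")):
            if database:
                seen[(row["variable"], database, row["recStart"])].append(index)
    return {key: idx for key, idx in seen.items() if len(idx) > 1}


# Rows 0 and 1 share databaseStart but differ in recStart: a normal recode block.
# Row 2 repeats recStart "1" for cchs2009_m: a collision under this reading.
rows = [
    {"variable": "SDCGLNG", "databaseStart": "cchs2007_2008_m, cchs2009_m", "recStart": "1"},
    {"variable": "SDCGLNG", "databaseStart": "cchs2007_2008_m, cchs2009_m", "recStart": "2"},
    {"variable": "SDCGLNG", "databaseStart": "cchs2009_m, cchs2010_m", "recStart": "1"},
]
collisions = find_recode_collisions(rows)
```

Keying on the full triple is what keeps ordinary blocks, whose rows share a databaseStart by design, from being reported.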
Split the 68KB monolithic SKILL.md (1,066 lines) into a 372-line orchestrator that delegates to focused docs:
- docs/worksheet-reference.md (moved from docs/)
- docs/l0-l2-documentation-review.md (L0-L2 checks)
- docs/l3-l5-worksheet-checks.md (L3-L5 checks)
- docs/l6-implementation-validation.md (rec_with_table testing)
- docs/csv-validation-and-fixes.md (validation tools + fix workflow)
- docs/review/ (Gem system prompt, notebook manifest/coverage)

Added prerequisite section requiring worksheet-reference.md be read before any review. Added .gitignore exception for skill docs/ folders.
The sub-item numbered 3b is renumbered to 4, with subsequent items shifted to 5-7 for consistent sequential numbering.
…dit, and dev mode
- Add PUMF-Master variable family pattern documentation to worksheet-reference.md explaining the systematic relationship between Master continuous, PUMF categorical, and _cont bridging variables (with DHH_AGE example and DHHGAGE_B footnote)
- Add Check 8: Completeness audit (8a missing-code rows, 8b cycle coverage, 8c variable family completeness) to l3-l5-worksheet-checks.md
- Add --dev mode to SKILL.md for authoring/development use where omissions are P1
- Cross-reference variable family pattern from Check 3 (PUMF vs Master naming)
…done criteria SKILL.md orchestrates existing docs (foundations, patterns, testing) and adds a 5-step done criteria checklist that includes R CMD check — filling a gap where package-level validation was missing from the DV function workflow.
Triage step now detects when PRs touch R/ or tests/ files, flags that Step 7b package health check will run, and cross-references the cchsflow-derive done criteria for new or modified functions. Also strengthens GHA failure handling — treat failing CI as blocking.
- Remove orphan cchs2009_m, cchs2010_m, cchs2012_m from SDCDCGT_cat13 and SDCDCGT_cat7 databaseStart (no matching variable_details rows)
- Migrate deprecated _s suffixes to _m for 6 in-scope variables: SDC_5A_1, SDCDFOLS, SDCGCBG, SDCGCBG_A, SDCGLHM_A, SDCGRES (59 row-level changes in variable_details.csv)
- Normalise CSV quoting via fix-worksheets.R (formatting only, no content changes beyond the above)
- Add Gem verification prompt and extracted CSVs for NotebookLM cross-check (CEP-010)

Relates to #179
Review summary: PR #171 (ethnicity branch)
Scope
13 ethnicity, language, and migration variables reviewed across 3 domains:
Fixes applied
Fix 1 (databaseStart mismatch, P1): Removed orphan
Fix 2
Fix 3 (CSV normalisation): Ran fix-worksheets.R to standardise quoting across both worksheets (formatting only).
L6 integration test
All 9 PUMF-testable variables pass with 100% validity across cchs2001_p through cchs2017_2018_p (200/200 rows per cycle). No step changes at era boundaries. Results at
Gem verification (NotebookLM)
Cross-checked all 13 variables against StatCan data dictionaries. Findings:
No blocking issues. All findings are either by-design, pre-existing scope, or tracked for follow-up.
Related
Captures NotebookLM cross-check results for all 13 ethnicity, language, and migration variables. 0 blocking issues, 5 findings classified as by-design, correct, pre-existing gap, or not actionable.
Reviewed and ready to merge. @rafidoole for final checks. This PR was fully reviewed (CEP-010, L0-L6 with Gem verification) and fixes applied in prior commits:
Re-confirmed today: no CEP:
…roperly renamed/reordered variable rows as well
Final worksheet and implementation changes — ethnicity branch
New intermediate variable
| Variable | Notes |
|---|---|
| SDCGRES_cont | New. Converts categorical SDCGRES (1 = 0–9 yr, 2 = 10+ yr) to continuous midpoints (4.5 / 15). Required as a feeder for pct_time_der and immigration_der PUMF blocks. |
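
The midpoint conversion described for SDCGRES_cont amounts to a two-entry lookup. The Python sketch below is illustrative only: in the package the mapping lives in worksheet rows rather than in a function, and the name `sdcgres_to_cont` and the choice to return `None` for any other code are assumptions.

```python
# Midpoints taken from the table above: 1 = 0-9 yr -> 4.5, 2 = 10+ yr -> 15.
SDCGRES_MIDPOINTS = {1: 4.5, 2: 15.0}


def sdcgres_to_cont(code):
    """Map a categorical SDCGRES code to its continuous midpoint (hypothetical helper).

    Codes outside the two valid categories return None in this sketch.
    """
    return SDCGRES_MIDPOINTS.get(code)
```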
SDC variables
| Variable | Detail |
|---|---|
| SDCDCGT_cat13 | Removed. Split into three purpose-specific variables below. |
| SDCDCGT_cat7 | Renamed/restructured + coverage extended. Replaces SDCDCGT_cat13. Cycles corrected to granular form (cchs2009_m, cchs2010_m, cchs2012_m added; cchs2009_2010_m, cchs2011_2012_m, cchs2013_2014_m replaced). Extended to cchs2001_m::SDCADRAC; now 11 master cycles. |
| SDCDCGT_pre2015 | New. Split from SDCDCGT_cat13. 14-category pre-2015 master ethnicity (cchs2001_m–cchs2013_2014_m). Extended to cchs2001_m::SDCADRAC; now 9 master cycles. |
| SDCDCGT_2015plus | New. Split from SDCDCGT_cat13. 14-category 2015+ master ethnicity (cchs2015_2016_m, cchs2017_2018_m). |
| SDCGCBG_A | Removed. Superseded by new SDCGCB. |
| SDCGCBG | Cleaned up. Removed incorrect master cycles (cchs2009_m, cchs2010_m, cchs2012_m); now PUMF-only (12 cycles). |
| SDCGCB | New. Master equivalent of SDCGCBG. Full master coverage (16 cycles, cchs2001_m–cchs2023_m). |
| SDCGRES_A | Removed. Superseded by new SDCDRES. |
| SDCGRES | Cleaned up. Removed incorrect master cycles (cchs2009_m, cchs2010_m, cchs2012_m); now PUMF-only (12 cycles). |
| SDCDRES | New. Master equivalent of SDCGRES. Continuous years in Canada. Full master coverage (16 cycles, cchs2001_m–cchs2023_m). |
| SDCGLHM | Removed. Split into SDCGLHM_cat4 (PUMF) and SDCGLHM_cat7 (master). |
| SDCGLHM_A | Removed. Replaced by SDCGLHM_cat7. |
| SDCGLHM_cat4 | New. PUMF-only (9 cycles). Renamed from SDCGLHM. |
| SDCGLHM_cat7 | New. Master-only (7 cycles, cchs2007_2008_m–cchs2013_2014_m). Renamed from SDCGLHM_A. |
| SDCGLNG | Split + cleaned up. Narrowed to PUMF-only (6 cycles, cchs2001_p–cchs2010_p). Master cycles moved to new SDCDLNG. |
| SDCDLNG | New. Master equivalent of SDCGLNG. Languages spoken (7 cycles, cchs2001_m–cchs2010_m). |
| SDCDFOLS | Coverage extended. Added full master coverage (cchs2011_2012_m–cchs2023_m); previously PUMF-only + cchs2012_m stub. |
| SDC_5A_1 | Coverage + ordering fixed. Corrected cycle ordering (cchs2011_2012_m before cchs2012_m). Extended to full master; simplified variableStart (removed redundant explicit mappings). |
| SDCDVABT | Block ordering fix (vd). Master blocks moved after PUMF blocks in variable_details.csv to enforce _p before _m convention. |
pct_time variables
| Variable | Detail |
|---|---|
| pct_time_der | Extended to master; feeder vars updated. Added 16 master cycles (cchs2001_m–cchs2023_m). Function renamed pct_time_fun → calculate_pct_time. PUMF feeder vars: [DHHGAGE_cont, SDCGCBG, SDCGRES_cont]; master: [DHH_AGE, SDCGCB, SDCDRES]. Per-cycle variableStart mappings added to variables.csv. |
| pct_time_der_cat10 | Extended to master; function renamed. databaseStart expanded to 28 cycles (matching pct_time_der). Function renamed pct_time_fun_cat → categorize_pct_time. |
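
The PUMF/master feeder split for pct_time_der can be pictured as a lookup keyed on the database suffix. This Python sketch is only an illustration of that split: the feeder lists are taken from the table above, while the helper `feeders_for` and the suffix-based dispatch are assumptions (the package records this in worksheet rows, not in a function).

```python
# Feeder variables for pct_time_der, as listed in the table above.
PCT_TIME_FEEDERS = {
    "_p": ["DHHGAGE_cont", "SDCGCBG", "SDCGRES_cont"],  # PUMF
    "_m": ["DHH_AGE", "SDCGCB", "SDCDRES"],             # master
}


def feeders_for(database: str):
    """Return the feeder variables for a database name (hypothetical helper)."""
    for suffix, feeders in PCT_TIME_FEEDERS.items():
        if database.endswith(suffix):
            return feeders
    raise ValueError(f"unrecognised database suffix: {database}")
```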
immigration_der
| Variable | Detail |
|---|---|
| immigration_der | Extended to master; refactored. Removed stale _s cycles. Added 11 master cycles (cchs2001_m–cchs2017_2018_m), limited by SDCDCGT_cat7 availability. Function renamed immigration_fun → categorize_immigration; refactored from nested if_else2 to dplyr::case_when. Years parameter changed from categorical 1/2 to continuous with threshold (< 10 / ≥ 10), allowing SDCGRES_cont (PUMF) and SDCDRES (master) to be passed directly without an intermediate binning variable. PUMF feeder vars: [SDCFIMM, SDCGCBG, SDCGCGT, SDCGRES_cont]; master: [SDCFIMM, SDCGCB, SDCDCGT_cat7, SDCDRES]. Category labels updated to "Visible minority" language. variable_details.csv blocks restructured to interleaved PUMF/master format (v3 convention). dummyVariable format corrected: headers → N/A, NA rows → NAa/NAb (no colons). Per-cycle variableStart mappings added to variables.csv. |
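
The switch to a continuous years parameter can be illustrated in isolation. The Python sketch below covers only the threshold step and is not categorize_immigration itself, which is an R function whose other inputs and output categories are not reproduced here; the name `years_band` and its return labels are hypothetical.

```python
def years_band(years_in_canada, threshold=10):
    """Classify continuous years in Canada against a threshold (hypothetical helper).

    Returns None for missing input, "under" for values below the threshold,
    and "at_or_over" otherwise.
    """
    if years_in_canada is None:
        return None
    return "under" if years_in_canada < threshold else "at_or_over"


# PUMF midpoints from SDCGRES_cont (4.5 and 15) and arbitrary master values
# from SDCDRES fall on the same side of the threshold as their categories.
pumf_bands = [years_band(4.5), years_band(15)]
master_bands = [years_band(7), years_band(10), years_band(32)]
```

Because the PUMF midpoints sit well inside their bands, the same comparison serves both file types, which is why no intermediate binning variable is needed.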
Changes manually merged to v3 via commit 702db21. Will close this PR now.
Included ICES survey cycles in existing variables SDCDCGT_A and SDCGLNG in variables and variable_details