ORosca/daacs-math-data-pipeline

DAACS Math Data Preparation and QA Pipelines

This repository contains R scripts for preparing, quality-checking, combining, and summarizing DAACS mathematics assessment data across multiple institutions and waves. The project builds clean student-level math datasets, diagnoses missingness and duplicate-score patterns, patches selected rows before deduplication, produces final descriptive summaries, and generates item-selection tables for the planned Bayesian/frequentist study. The repository also includes a focused descriptive script for the younger UAlbany subset.

Repository structure

The workflow is organized around one shared utilities file, five main pipeline scripts, one downstream item-selection script, and one additional descriptive subset script.

Shared utilities

  • utils_math_pipeline.R

This file stores reusable functions shared across scripts, including:

  • core helpers for assertions, saving, loading, ID normalization, and global_id creation
  • item-column helpers for finding, ordering, coercing, aligning, and counting item responses
  • shared recodes for ethnicity and yes/no variables
  • QID mapping helpers
  • missingness diagnostics
  • QA helpers
  • item-level sample-size summaries.
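
As an illustration of the item-column helpers, the sketch below finds QID columns by name pattern and counts answered items per student. The function names (find_item_cols, count_answered) and the regular expression are assumptions based on the QID format described in this README; the actual functions in utils_math_pipeline.R may differ.

```r
# Hypothetical helpers; not the actual contents of utils_math_pipeline.R.

# Return the names of columns that look like final QIDs, e.g. "Q025gH_p".
# Assumed pattern: "Q" + digits + domain letter + difficulty suffix.
find_item_cols <- function(df) {
  grep("^Q[0-9]+[glnsvw](E|M|H|H_p)$", names(df), value = TRUE)
}

# Count non-missing item responses for each row (student).
count_answered <- function(df) {
  items <- find_item_cols(df)
  rowSums(!is.na(df[, items, drop = FALSE]))
}

# Toy data; values are illustrative only.
toy <- data.frame(
  global_id = c("a_1", "a_2"),
  Q001nE    = c(1, NA),
  Q025gH_p  = c(0, 1),
  mathTime  = c(600, 540)
)

find_item_cols(toy)
count_answered(toy)
```

Matching on the full QID pattern rather than a simple "Q" prefix keeps non-item columns from being counted as responses.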

Main pipeline scripts

1. math_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R

Creates clean 2022 wide datasets for UMGC1, UA2, and their combined file from the already-wide AnSamp2 math dataset.

Purpose

  1. Load the already-wide 2022 UMGC1 + UA2 math dataset
  2. Rename item columns from legacy qid_ua2 names to final QIDs
  3. Standardize key demographic and metadata variables
  4. Create clean UMGC1, UA2, and combined wide datasets
  5. Run item- and dataset-level QA summaries.

Key conventions

  • age_d24: TCAUS if age < 24, AUS if age >= 24
  • ethnicity: White / Asian / Black / Hispanic / Other
  • pell: No / Yes
  • military: No / Yes
  • transfer: continuous transferred credits
  • mathTime: seconds.
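
The conventions above can be sketched in base R as follows. The input column names (age, pell_raw), the raw codes, and the helper recode_yes_no are assumptions made for illustration; only the target levels come from this README.

```r
# Toy input; column names and raw codes are assumptions.
students <- data.frame(
  age      = c(19, 24, 31, NA),
  pell_raw = c("Y", "n", "yes", NA),
  stringsAsFactors = FALSE
)

# age_d24: TCAUS if age < 24, AUS if age >= 24 (missing age stays missing).
students$age_d24 <- factor(
  ifelse(students$age < 24, "TCAUS", "AUS"),
  levels = c("TCAUS", "AUS")
)

# Hypothetical yes/no recode to the No / Yes convention.
# Unrecognized codes become NA rather than being guessed.
recode_yes_no <- function(x) {
  x <- tolower(trimws(as.character(x)))
  out <- rep(NA_character_, length(x))
  out[x %in% c("y", "yes", "1", "true")] <- "Yes"
  out[x %in% c("n", "no", "0", "false")] <- "No"
  factor(out, levels = c("No", "Yes"))
}

students$pell <- recode_yes_no(students$pell_raw)
students[, c("age", "age_d24", "pell")]
```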

Unit of analysis

  • final outputs: one row per student (global_id).

2. math_v2_ua23umgc-2022-24_qa_pipeline.R

Builds clean long and wide datasets for UA23, UMGC, and their combined 2022–2024 file from raw institution-level and item-level files.

Purpose

  1. Load institution-level and item-level raw files
  2. Standardize institution-level demographics and identifiers
  3. Standardize item-level response files and attach final QIDs
  4. Exclude students with fewer than 18 answered items
  5. Derive student-level mathCompletionDate and mathTime
  6. Build clean long and wide math datasets for UA23, UMGC, and combined
  7. Export QA summaries and cleaned outputs.
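
Steps 4 and 6 can be sketched as a filter on the long file followed by a reshape to wide. This is a minimal base-R illustration, not the script's actual code: the score column name is an assumption, and the threshold is lowered from 18 so that the toy data shows an exclusion.

```r
# Toy long file: one row per student-item response.
# Column names other than global_id and QID are assumptions.
long <- data.frame(
  global_id = c("s1", "s1", "s1", "s2", "s2"),
  QID       = c("Q001nE", "Q002gM", "Q003sH", "Q001nE", "Q002gM"),
  score     = c(1, 0, 1, 1, NA),
  stringsAsFactors = FALSE
)

min_items <- 3  # the pipeline uses 18; lowered here for the toy data

# Count answered (non-missing) items per student and keep those at or above the minimum.
answered  <- tapply(!is.na(long$score), long$global_id, sum)
keep_ids  <- names(answered)[answered >= min_items]
long_kept <- long[long$global_id %in% keep_ids, ]

# Reshape to one row per student with one column per QID.
wide <- reshape(
  long_kept,
  idvar     = "global_id",
  timevar   = "QID",
  direction = "wide"
)
names(wide) <- sub("^score\\.", "", names(wide))
wide
```

Filtering before reshaping keeps the long and wide files consistent, since both are then built from the same set of retained students.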

Key conventions

  • age_d24: TCAUS if age < 24, AUS if age >= 24
  • ethnicity: White / Asian / Black / Hispanic / Other
  • pell: No / Yes
  • military: No / Yes
  • transfer: continuous transferred credits
  • mathTime: seconds.

Units of analysis

  • long files: one row per student-item response
  • wide files: one row per student (global_id).

3. math_v2_umgc_ua_22_24_01_combined_qa_stack_and_missingness.R

Stacks the cleaned 2022 and 2022–2024 wide math files, harmonizes columns, and runs missingness diagnostics on the combined dataset.

Main tasks

  • load the two cleaned wide files
  • add provenance via source_file
  • harmonize types and columns
  • align both files to a common master column structure
  • stack them into one combined raw file
  • run missingness diagnostics on the stacked data
  • save the combined raw dataset and missingness outputs.
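
The align-and-stack tasks above could look roughly like the following base-R sketch. The helper align_to, the source_file labels, and the toy data are assumptions for illustration; only the source_file column name comes from this README.

```r
# Toy stand-ins for the two cleaned wide files; all values are illustrative.
wide_2022 <- data.frame(global_id = c("umgc1_1", "ua2_1"),
                        pell      = c("Yes", "No"),
                        Q001nE    = c(1, 0),
                        stringsAsFactors = FALSE)
wide_2224 <- data.frame(global_id = "ua23_1",
                        Q001nE    = 1L,
                        Q002gM    = 0L,
                        stringsAsFactors = FALSE)

# Tag provenance before stacking (labels are hypothetical).
wide_2022$source_file <- "2022"
wide_2224$source_file <- "2022-2024"

# Add any column missing from a file as NA, then order columns identically.
align_to <- function(df, master_cols) {
  for (col in setdiff(master_cols, names(df))) df[[col]] <- NA
  df[, master_cols, drop = FALSE]
}

master_cols <- union(names(wide_2022), names(wide_2224))
stacked <- rbind(align_to(wide_2022, master_cols),
                 align_to(wide_2224, master_cols))

# Share of missing values per column in the stacked file.
round(colMeans(is.na(stacked)), 2)
```

Aligning both files to one master column list before rbind makes structurally absent items show up as explicit NAs, which the missingness diagnostics can then report.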

4. math_v2_umgc_ua_22_24_02_combined_qa_duplicate_diagnostics.R

Detects duplicate students and duplicate score patterns in the stacked combined dataset and prepares deduplication artifacts for the finalization step. Note that the script's opening comment says “reading dataset,” but the script belongs to the math workflow and operates on the stacked combined math file.

Main tasks

  • detect duplicate global_ids
  • detect fully duplicated rows
  • detect duplicate substantive rows
  • detect duplicate score-only response patterns
  • group rows by identical score pattern
  • summarize whether duplicate-score rows are likely to represent the same person
  • prepare outputs such as rows_to_remove and patched UMGC rows for the finalization script.
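
A minimal sketch of three of these checks (duplicate IDs, fully duplicated rows, and identical score patterns) is shown below. The column names pattern_group and pattern_n and the toy values are assumptions; the script's own logic for deciding whether matching rows are the same person is not reproduced here.

```r
# Toy stacked data; values are illustrative only.
stacked <- data.frame(
  global_id = c("umgc1_7", "umgc_7", "ua2_3", "ua2_3"),
  Q001nE    = c(1, 1, 0, 0),
  Q002gM    = c(0, 0, 1, 1),
  Q003sH    = c(1, 1, NA, NA),
  stringsAsFactors = FALSE
)
item_cols <- c("Q001nE", "Q002gM", "Q003sH")

# Rows whose global_id occurs more than once.
dup_id <- stacked$global_id %in% stacked$global_id[duplicated(stacked$global_id)]

# Rows that are exact copies of an earlier row.
dup_full <- duplicated(stacked)

# Collapse each row's item scores into one string so identical
# response patterns can be grouped regardless of global_id.
pattern <- do.call(paste, c(stacked[item_cols], sep = "|"))
stacked$pattern_group <- as.integer(factor(pattern, levels = unique(pattern)))
stacked$pattern_n     <- as.integer(ave(seq_along(pattern), pattern, FUN = length))

stacked[stacked$pattern_n > 1, c("global_id", "pattern_group", "pattern_n")]
```

In the toy data, the first two rows share a score pattern under different IDs, which is the kind of case that needs a judgment about whether they represent the same person.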

5. math_v2_umgc_ua_22_24_03_combined_qa_finalize_and_describe.R

Applies duplicate removals, patches UMGC rows using non-missing values from matched UMGC1 rows, removes rows with almost entirely missing demographics, runs final missingness diagnostics, creates descriptive summaries, and saves the final cleaned combined math dataset.

Main tasks

  • load the stacked raw combined file
  • load rows_to_remove
  • load umgc_rows_patched
  • replace original UMGC rows with patched versions
  • remove selected UMGC1 duplicate rows
  • remove 7 rows missing all key demographics except military
  • enforce one row per global_id
  • run final missingness diagnostics
  • create categorical and numeric descriptives
  • generate item-level sample-size summaries
  • save the final cleaned dataset and QA outputs.
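
The patching idea, filling missing cells in a UMGC row from the matched UMGC1 row, can be illustrated with a small base-R helper. The function patch_missing, the column choices, and the toy values are assumptions; how rows are actually matched is decided in Script 4 and is not shown.

```r
# Toy rows for one student who appears in both sources; values are illustrative.
umgc_row  <- data.frame(global_id = "umgc_7",  pell = NA,    age = 27, transfer = NA,
                        stringsAsFactors = FALSE)
umgc1_row <- data.frame(global_id = "umgc1_7", pell = "Yes", age = 26, transfer = 12,
                        stringsAsFactors = FALSE)

# Fill only the missing cells of `target` with values from `donor`;
# existing non-missing values in `target` are left untouched.
patch_missing <- function(target, donor, cols) {
  for (col in cols) {
    fill <- is.na(target[[col]]) & !is.na(donor[[col]])
    target[[col]][fill] <- donor[[col]][fill]
  }
  target
}

patched <- patch_missing(umgc_row, umgc1_row, cols = c("pell", "age", "transfer"))
patched
```

Here pell and transfer are filled from the UMGC1 row, while age keeps the UMGC value because it was not missing.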

Additional analytic scripts

6. math_v2_ua_22_24_17-18_describe.R

Creates a focused descriptive subset for younger UAlbany students aged 17–18 after the final combined dataset has been created.

Purpose

  • subset the final cleaned dataset to ages 17–18
  • further subset to ua2 and ua23
  • run final missingness diagnostics
  • create descriptive summaries
  • create item-level sample-size tables
  • save the younger UAlbany subset and outputs.

This is a downstream descriptive script, not a core preprocessing pipeline. It depends on the final cleaned dataset produced by Script 5.
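
The two subsetting steps might be expressed as below. The column names age and institution are assumptions; the institution codes ua2 and ua23 come from the list above.

```r
# Toy final dataset; column names `age` and `institution` are assumed.
final <- data.frame(
  global_id   = c("ua2_1", "ua23_2", "umgc_3", "ua23_4"),
  institution = c("ua2", "ua23", "umgc", "ua23"),
  age         = c(17, 18, 18, 25),
  stringsAsFactors = FALSE
)

# Keep UAlbany students (ua2, ua23) aged 17 to 18.
# subset() drops rows where the condition is NA, e.g. missing age.
young_ua <- subset(final,
                   age >= 17 & age <= 18 & institution %in% c("ua2", "ua23"))
young_ua$global_id
```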

7. math_summary_table_for_item_selection.R

Creates six summary tables for item selection in the planned math study by counting eligible items across six math domains and four difficulty categories under response-count thresholds. The script derives domain from the lowercase letter embedded in QID:

  • g = geometry
  • l = lines_and_functions
  • n = number_and_calculation
  • s = statistics
  • v = variables_and_equations
  • w = word_problems.

Purpose

  1. Read item-level response-count summaries from item_sample_size_by_demo.csv
  2. Derive each item’s domain from QID
  3. Derive each item’s difficulty category from QID
  4. Build six domain-by-difficulty summary tables for:
    • IRT 2PL Fit: 300 responses
    • IRT 2PL Fit: 150 responses
    • Multivariable DIF: 200 per group
    • Multivariable DIF: 50 per group
    • Factor Level: 100 per level
    • Factor Level: 40 per level
  5. Save the six tables to Excel and CSV.

This is a downstream summary/reporting script rather than a data-cleaning pipeline. It relies on outputs from the combined final dataset workflow.
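
As a rough sketch of how one domain-by-difficulty table could be built, the example below counts items that meet a response threshold. It assumes domain and difficulty have already been derived from QID, that the threshold means "at least this many responses", and that the count column is called n_responses; the real column names in item_sample_size_by_demo.csv may differ.

```r
# Toy item summary; values and the n_responses column name are assumptions.
items <- data.frame(
  QID         = c("Q001nE", "Q002nM", "Q003gE", "Q004gH", "Q005sH_p"),
  domain      = c("number_and_calculation", "number_and_calculation",
                  "geometry", "geometry", "statistics"),
  difficulty  = c("E", "M", "E", "H", "H_p"),
  n_responses = c(420, 310, 160, 90, 305),
  stringsAsFactors = FALSE
)

# Fix the level order so every table has the same six rows and four columns,
# including empty cells.
items$domain <- factor(items$domain,
                       levels = c("geometry", "lines_and_functions",
                                  "number_and_calculation", "statistics",
                                  "variables_and_equations", "word_problems"))
items$difficulty <- factor(items$difficulty, levels = c("E", "M", "H", "H_p"))

# Count items per domain x difficulty cell that meet a response threshold.
count_eligible <- function(items, min_n) {
  ok <- items[items$n_responses >= min_n, ]
  table(domain = ok$domain, difficulty = ok$difficulty)
}

tab_300 <- count_eligible(items, 300)  # stricter IRT 2PL threshold
tab_150 <- count_eligible(items, 150)  # relaxed IRT 2PL threshold
tab_300
```

Using factors with fixed levels keeps the six tables directly comparable, because cells with zero eligible items are still printed.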

Recommended execution order

Run the scripts in this order:

  1. math_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R
  2. math_v2_ua23umgc-2022-24_qa_pipeline.R
  3. math_v2_umgc_ua_22_24_01_combined_qa_stack_and_missingness.R
  4. math_v2_umgc_ua_22_24_02_combined_qa_duplicate_diagnostics.R
  5. math_v2_umgc_ua_22_24_03_combined_qa_finalize_and_describe.R
  6. math_v2_ua_22_24_17-18_describe.R
  7. math_summary_table_for_item_selection.R

Scripts 1–5 are the main end-to-end math preparation and QA pipeline. Scripts 6–7 are downstream analytic/reporting scripts built on the final cleaned outputs.
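
If it is convenient to run everything in one go, a small wrapper along these lines could source the scripts in the listed order. The function run_pipeline and its script_dir argument are hypothetical, and the sketch assumes each script can be sourced on its own with paths resolved relative to its own folder; the repository does not ship such a runner.

```r
# Script names are taken from the list above, in execution order.
pipeline_scripts <- c(
  "math_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R",
  "math_v2_ua23umgc-2022-24_qa_pipeline.R",
  "math_v2_umgc_ua_22_24_01_combined_qa_stack_and_missingness.R",
  "math_v2_umgc_ua_22_24_02_combined_qa_duplicate_diagnostics.R",
  "math_v2_umgc_ua_22_24_03_combined_qa_finalize_and_describe.R",
  "math_v2_ua_22_24_17-18_describe.R",
  "math_summary_table_for_item_selection.R"
)

# Hypothetical runner: stops early if any script is missing,
# then sources each script in order.
run_pipeline <- function(script_dir = ".", scripts = pipeline_scripts) {
  paths   <- file.path(script_dir, scripts)
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing script(s): ", paste(missing, collapse = ", "))
  }
  for (p in paths) {
    message("Running ", basename(p))
    source(p, chdir = TRUE)  # resolve relative paths from the script's folder
  }
  invisible(paths)
}

# Example call (not run here):
# run_pipeline("path/to/repo")
```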

Key variable conventions

Across scripts, the main variables are standardized as follows:

  • global_id: unique student identifier
  • DAACS_ID: normalized student ID within source data
  • age_d24:
    • TCAUS if age < 24
    • AUS if age >= 24
  • ethnicity:
    • White
    • Asian
    • Black
    • Hispanic
    • Other
  • pell:
    • No
    • Yes
  • military:
    • No
    • Yes
  • transfer: continuous transferred credits
  • mathTime: seconds
  • item variables: final QID columns ordered consistently across datasets.
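
One way to guard these conventions is a small assertion helper run on each cleaned dataset. The function check_conventions below is a hypothetical sketch, not part of utils_math_pipeline.R, and it assumes missing values are allowed in the demographic columns.

```r
# Hypothetical QA check; expected levels are taken from the conventions above.
# Named stopifnot() messages require R >= 4.0.
check_conventions <- function(df) {
  stopifnot(
    "global_id must be unique" =
      anyDuplicated(df$global_id) == 0,
    "unexpected age_d24 value" =
      all(df$age_d24 %in% c("TCAUS", "AUS", NA)),
    "unexpected ethnicity value" =
      all(df$ethnicity %in% c("White", "Asian", "Black", "Hispanic", "Other", NA)),
    "unexpected pell value" =
      all(df$pell %in% c("No", "Yes", NA)),
    "unexpected military value" =
      all(df$military %in% c("No", "Yes", NA)),
    "mathTime must be non-negative seconds" =
      all(df$mathTime >= 0, na.rm = TRUE)
  )
  invisible(TRUE)
}

# Toy data; values are illustrative only.
ok <- data.frame(
  global_id = c("ua2_1", "umgc_2"),
  age_d24   = c("TCAUS", "AUS"),
  ethnicity = c("White", NA),
  pell      = c("No", "Yes"),
  military  = c("No", "No"),
  mathTime  = c(900, 1250),
  stringsAsFactors = FALSE
)
check_conventions(ok)
```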

QID construction

Final item IDs (QID) were created by mapping source-specific item identifiers to a single common naming system.
Depending on the source file, the mapping uses either:

  • qid_ua2 -> QID for already-wide 2022 files, or
  • question_id -> QID for raw item-level 2022–2024 files.

The QID is the harmonized item identifier used across all cleaned datasets. It also encodes item information in the label itself. For math items, the lowercase letter in the middle of the QID indicates the domain (g, l, n, s, v, w), and the suffix indicates assigned difficulty (E, M, H, H_p). For example, Q025gH_p identifies item 25, geometry domain, hard-plus category.
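
The encoding described above can be unpacked with a regular expression. The sketch below assumes the layout "Q" + item number + domain letter + difficulty suffix, inferred from the example Q025gH_p; the function parse_qid is hypothetical and returns NA for labels that do not fit.

```r
# Hypothetical parser for the assumed QID layout.
parse_qid <- function(qid) {
  pattern <- "^Q([0-9]+)([glnsvw])(E|M|H|H_p)$"
  ok <- grepl(pattern, qid)

  # Extract capture group i for matching QIDs; leave NA otherwise.
  part <- function(i) {
    out <- rep(NA_character_, length(qid))
    out[ok] <- sub(pattern, paste0("\\", i), qid[ok])
    out
  }

  domain_names <- c(g = "geometry", l = "lines_and_functions",
                    n = "number_and_calculation", s = "statistics",
                    v = "variables_and_equations", w = "word_problems")

  data.frame(
    QID         = qid,
    item_number = as.integer(part(1)),
    domain      = unname(domain_names[part(2)]),
    difficulty  = part(3),
    stringsAsFactors = FALSE
  )
}

parse_qid(c("Q025gH_p", "Q003nE", "not_a_qid"))
```

Anchoring the pattern at both ends keeps H and H_p distinct, since the suffix must account for the whole remainder of the label.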

Main outputs

Typical output folders created by the scripts include:

  • math_v2_umgc1ua2-anSamp2-2022_qa_outputs
  • math_v2_ua23umgc-2022-24_qa_outputs
  • math_v2_umgc_ua_22_24_combined_qa_outputs
  • math_v2_ua_22_24_17-18_describe.

Representative final outputs include:

  • cleaned wide 2022 UMGC1/UA2 datasets
  • cleaned long and wide 2022–2024 UA23/UMGC datasets
  • stacked combined raw file
  • duplicate diagnostics tables
  • rows_to_remove
  • umgc_rows_patched
  • final cleaned combined file math_v2_umgc_ua_22_24
  • missingness plots and tables
  • descriptive summaries
  • item-level sample-size summaries
  • younger UAlbany 17–18 subset outputs
  • item-selection summary tables for the planned study.
