This repository contains R scripts for preparing, quality-checking, combining, and summarizing DAACS mathematics assessment data across multiple institutions and waves. The project builds clean student-level math datasets, diagnoses missingness and duplicate-score patterns, patches selected rows before deduplication, produces final descriptive summaries, and generates item-selection tables for the planned Bayesian/frequentist study. The repository also includes a focused descriptive script for the younger UAlbany subset.
The workflow is organized around one shared utilities file, five main pipeline scripts, one downstream item-selection script, and one additional descriptive subset script.
utils_math_pipeline.R
This file stores reusable functions shared across scripts, including:
- core helpers for assertions, saving, loading, ID normalization, and `global_id` creation
- item-column helpers for finding, ordering, coercing, aligning, and counting item responses
- shared recodes for ethnicity and yes/no variables
- QID mapping helpers
- missingness diagnostics
- QA helpers
- item-level sample-size summaries.
math_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R
Creates clean 2022 wide datasets for UMGC1, UA2, and their combined file from the already-wide AnSamp2 math dataset.
Purpose
- Load the already-wide 2022 UMGC1 + UA2 math dataset
- Rename item columns from legacy `qid_ua2` names to final `QID`s
- Standardize key demographic and metadata variables
- Create clean UMGC1, UA2, and combined wide datasets
- Run item- and dataset-level QA summaries.
Key conventions
- `age_d24`: `TCAUS` if age < 24, `AUS` if age >= 24
- `ethnicity`: White / Asian / Black / Hispanic / Other
- `pell`: No / Yes
- `military`: No / Yes
- `transfer`: continuous transferred credits
- `mathTime`: seconds.
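The conventions above could be applied along the following lines. This is a minimal base-R sketch, not the code in `utils_math_pipeline.R`: the input columns `age`, `pell_raw`, and `military_raw`, the helper `recode_yes_no()`, and the accepted yes/no spellings are all assumptions made for illustration.

```r
# Hypothetical raw input; the column names and values are invented for this sketch.
raw <- data.frame(
  age          = c(19, 24, 31, NA),
  pell_raw     = c("yes", "No", "Y", NA),
  military_raw = c("no", "YES", "n", "N"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map assorted yes/no spellings onto the levels No / Yes.
# Anything unrecognized (including NA) stays missing.
recode_yes_no <- function(x) {
  x <- tolower(trimws(as.character(x)))
  out <- rep(NA_character_, length(x))
  out[x %in% c("yes", "y", "1", "true")] <- "Yes"
  out[x %in% c("no", "n", "0", "false")] <- "No"
  factor(out, levels = c("No", "Yes"))
}

clean <- data.frame(
  # TCAUS if age < 24, AUS if age >= 24; a missing age stays missing.
  age_d24  = factor(ifelse(raw$age < 24, "TCAUS", "AUS"),
                    levels = c("TCAUS", "AUS")),
  pell     = recode_yes_no(raw$pell_raw),
  military = recode_yes_no(raw$military_raw)
)

print(clean)
```

Keeping the recode in one helper means `pell` and `military` always share the same level order, which matters when the datasets are stacked later.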
Unit of analysis
- final outputs: one row per student (`global_id`).
math_v2_ua23umgc-2022-24_qa_pipeline.R
Builds clean long and wide datasets for UA23, UMGC, and their combined 2022–2024 file from raw institution-level and item-level files.
Purpose
- Load institution-level and item-level raw files
- Standardize institution-level demographics and identifiers
- Standardize item-level response files and attach final `QID`s
- Exclude students with fewer than 18 answered items
- Derive student-level `mathCompletionDate` and `mathTime`
- Build clean long and wide math datasets for UA23, UMGC, and combined
- Export QA summaries and cleaned outputs.
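The 18-item exclusion rule in the list above can be sketched on a long file as follows. The column names `QID` and `score`, the toy item IDs, and the choice to treat an item as answered when `score` is non-missing are assumptions for illustration; the actual script may define "answered" differently.

```r
# Hypothetical long file: one row per student-item response.
long <- data.frame(
  global_id = rep(c("s1", "s2"), each = 18),
  QID       = rep(sprintf("Q%03d", 1:18), times = 2),  # placeholder item IDs
  score     = c(rep(1, 18), rep(0, 17), NA),           # s2 skipped one item
  stringsAsFactors = FALSE
)

# Count answered items per student (assumption: answered = non-missing score).
answered <- tapply(!is.na(long$score), long$global_id, sum)

# Keep only students with at least 18 answered items.
keep_ids  <- names(answered)[answered >= 18]
long_kept <- long[long$global_id %in% keep_ids, ]

print(answered)
print(unique(long_kept$global_id))
```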
Key conventions
- `age_d24`: `TCAUS` if age < 24, `AUS` if age >= 24
- `ethnicity`: White / Asian / Black / Hispanic / Other
- `pell`: No / Yes
- `military`: No / Yes
- `transfer`: continuous transferred credits
- `mathTime`: seconds.
Units of analysis
- long files: one row per student-item response
- wide files: one row per student (`global_id`).
math_v2_umgc_ua_22_24_01_combined_qa_stack_and_missingness.R
Stacks the cleaned 2022 and 2022–2024 wide math files, harmonizes columns, and runs missingness diagnostics on the combined dataset.
Main tasks
- load the two cleaned wide files
- add provenance via `source_file`
- harmonize types and columns
- align both files to a common master column structure
- stack them into one combined raw file
- run missingness diagnostics on the stacked data
- save the combined raw dataset and missingness outputs.
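The stacking steps above can be illustrated with a small base-R sketch. The toy data frames, the `source_file` labels, and the helper `align_to_master()` are hypothetical; only the idea of aligning both files to a shared column set before stacking comes from the description.

```r
# Two hypothetical cleaned wide files with partly different columns.
wide_2022 <- data.frame(global_id = c("a1", "a2"), pell = c("No", "Yes"),
                        Q001gE = c(1, 0), stringsAsFactors = FALSE)
wide_2224 <- data.frame(global_id = "b1", transfer = 12,
                        Q001gE = 1, Q002lM = 0, stringsAsFactors = FALSE)

# Provenance column (the labels are placeholders).
wide_2022$source_file <- "2022"
wide_2224$source_file <- "2022_2024"

# Master column structure: every column seen in either file.
master_cols <- union(names(wide_2022), names(wide_2224))

# Hypothetical helper: add absent columns as NA, then order columns like the master.
align_to_master <- function(df, cols) {
  for (col in setdiff(cols, names(df))) df[[col]] <- NA
  df[, cols, drop = FALSE]
}

stacked <- rbind(align_to_master(wide_2022, master_cols),
                 align_to_master(wide_2224, master_cols))

# Simple missingness diagnostic: share of missing values per column.
missing_share <- colMeans(is.na(stacked))
print(round(missing_share, 2))
```

Columns that exist in only one source show up with a high missing share, which is expected after stacking and should be read separately from missingness within a source.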
math_v2_umgc_ua_22_24_02_combined_qa_duplicate_diagnostics.R
Detects duplicate students and duplicate score patterns in the stacked combined dataset and prepares deduplication artifacts for the finalization step. Although the script's opening comment says “reading dataset,” it belongs to the math workflow and reads the combined stacked math file.
Main tasks
- detect duplicate `global_id`s
- detect fully duplicated rows
- detect duplicate substantive rows
- detect duplicate score-only response patterns
- group rows by identical score pattern
- summarize whether duplicate-score rows are likely to represent the same person
- prepare outputs such as `rows_to_remove` and patched UMGC rows for the finalization script.
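A few of the duplicate checks listed above can be sketched in base R. The toy data, the item-column regex, and the flag names are assumptions; the real script's criteria for "substantive" duplicates and for judging whether two rows are the same person are not reproduced here.

```r
# Hypothetical stacked file with two item columns.
stacked <- data.frame(
  global_id   = c("u1", "u1", "u2", "u3", "u4"),
  source_file = c("2022", "2022_2024", "2022", "2022_2024", "2022"),
  Q001gE      = c(1, 1, 0, 0, 1),
  Q002lM      = c(0, 0, 1, 1, NA),
  stringsAsFactors = FALSE
)

# Assumption: item columns start with "Q" followed by a digit.
item_cols <- grep("^Q[0-9]", names(stacked), value = TRUE)

# Collapse each row's item scores into one string so identical patterns compare equal.
pattern <- do.call(paste, c(stacked[item_cols], sep = "|"))

flags <- data.frame(
  global_id = stacked$global_id,
  # Every row whose global_id occurs more than once.
  dup_global_id = stacked$global_id %in%
    stacked$global_id[duplicated(stacked$global_id)],
  # Rows identical to an earlier row on all columns.
  dup_full_row = duplicated(stacked),
  # Rows whose score pattern is shared with at least one other row.
  dup_score_pattern = ave(seq_along(pattern), pattern, FUN = length) > 1,
  # Group label: rows with the same pattern share a number.
  score_pattern_group = match(pattern, unique(pattern)),
  stringsAsFactors = FALSE
)

print(flags)
```

With many items an identical score pattern is unlikely to occur by chance, which is why pattern groups are a useful starting point for spotting the same person under different IDs.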
math_v2_umgc_ua_22_24_03_combined_qa_finalize_and_describe.R
Applies duplicate removals, patches UMGC rows using non-missing values from matched UMGC1 rows, removes rows with almost entirely missing demographics, runs final missingness diagnostics, creates descriptive summaries, and saves the final cleaned combined math dataset.
Main tasks
- load the stacked raw combined file
- load `rows_to_remove`
- load `umgc_rows_patched`
- replace original UMGC rows with patched versions
- remove selected UMGC1 duplicate rows
- remove 7 rows missing all key demographics except military
- enforce one row per `global_id`
- run final missingness diagnostics
- create categorical and numeric descriptives
- generate item-level sample-size summaries
- save the final cleaned dataset and QA outputs.
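The patching idea, filling missing UMGC values from a matched UMGC1 row without overwriting existing values, might look like the sketch below. The linking column `match_key` and the helper `patch_from()` are hypothetical; how the real scripts pair UMGC and UMGC1 rows is not shown in this README.

```r
# Hypothetical UMGC rows with gaps, and matched UMGC1 donor rows.
umgc <- data.frame(global_id = c("m1", "m2"), match_key = c("k1", "k2"),
                   pell = c(NA, "Yes"), transfer = c(NA, 30),
                   stringsAsFactors = FALSE)
umgc1 <- data.frame(global_id = c("x1", "x2"), match_key = c("k1", "k2"),
                    pell = c("No", "No"), transfer = c(15, NA),
                    stringsAsFactors = FALSE)

# Hypothetical helper: fill only the missing cells of `target` from `donor`.
patch_from <- function(target, donor, key, cols) {
  idx <- match(target[[key]], donor[[key]])  # donor row for each target row
  for (col in cols) {
    fill <- is.na(target[[col]]) & !is.na(idx)
    target[[col]][fill] <- donor[[col]][idx[fill]]
  }
  target
}

umgc_rows_patched <- patch_from(umgc, umgc1, "match_key", c("pell", "transfer"))

# One row per global_id: stop if any ID is repeated.
stopifnot(anyDuplicated(umgc_rows_patched$global_id) == 0)

print(umgc_rows_patched)
```

In this toy case `m1` receives both values from its donor, while `m2` keeps its own `pell` and `transfer` because they were already present.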
math_v2_ua_22_24_17-18_describe.R
Creates a focused descriptive subset for younger UAlbany students aged 17–18 after the final combined dataset has been created.
Purpose
- subset the final cleaned dataset to ages 17–18
- further subset to `ua2` and `ua23`
- run final missingness diagnostics
- create descriptive summaries
- create item-level sample-size tables
- save the younger UAlbany subset and outputs.
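The two subsetting steps above amount to a row filter. In this sketch the column names `age` and `institution` are assumptions; the README does not name the variables that hold age or the `ua2` / `ua23` labels.

```r
# Hypothetical slice of the final cleaned dataset.
final <- data.frame(
  global_id   = c("p1", "p2", "p3", "p4", "p5"),
  age         = c(17, 18, 19, 17, NA),
  institution = c("ua2", "ua23", "ua2", "umgc", "ua23"),
  stringsAsFactors = FALSE
)

# Keep ages 17-18 at the two UAlbany sources; rows with missing age drop out.
young_ua <- final[final$age %in% 17:18 &
                    final$institution %in% c("ua2", "ua23"), ]

print(young_ua)
```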
This is a downstream descriptive script, not a core preprocessing pipeline. It depends on the final cleaned dataset produced by Script 5.
math_summary_table_for_item_selection.R
Creates six summary tables for item selection in the planned math study by counting eligible items across six math domains and four difficulty categories under response-count thresholds. The script derives domain from the lowercase letter embedded in `QID`:
- `g` = geometry
- `l` = lines_and_functions
- `n` = number_and_calculation
- `s` = statistics
- `v` = variables_and_equations
- `w` = word_problems
Purpose
- Read item-level response-count summaries from `item_sample_size_by_demo.csv`
- Derive each item’s domain from `QID`
- Derive each item’s difficulty category from `QID`
- Build six domain-by-difficulty summary tables for:
  - IRT 2PL Fit: 300 responses
  - IRT 2PL Fit: 150 responses
  - Multivariable DIF: 200 per group
  - Multivariable DIF: 50 per group
  - Factor Level: 100 per level
  - Factor Level: 40 per level
- Save the six tables to Excel and CSV.
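One of the six tables could be built roughly as shown below. The toy items, the count column `n_total`, the helper `count_eligible()`, and the use of `>=` as the eligibility rule are assumptions; the actual layout of `item_sample_size_by_demo.csv` is not described here.

```r
# Hypothetical item-level response counts.
items <- data.frame(
  QID     = c("Q001gE", "Q002gE", "Q003gH", "Q004sM", "Q005sM", "Q006wH_p"),
  n_total = c(320, 140, 310, 305, 299, 450),
  stringsAsFactors = FALSE
)

# Domain: the lowercase letter after the digits. Difficulty: everything after it.
items$domain <- factor(sub("^Q[0-9]+([a-z]).*$", "\\1", items$QID),
                       levels = c("g", "l", "n", "s", "v", "w"))
items$difficulty <- factor(sub("^Q[0-9]+[a-z]", "", items$QID),
                           levels = c("E", "M", "H", "H_p"))

# Hypothetical helper: count items meeting a response threshold per cell.
# Fixed factor levels keep all 6 x 4 cells in the table, including zeros.
count_eligible <- function(df, threshold) {
  eligible <- df[df$n_total >= threshold, ]
  table(domain = eligible$domain, difficulty = eligible$difficulty)
}

tab_300 <- count_eligible(items, 300)  # e.g. IRT 2PL fit, 300 responses
tab_150 <- count_eligible(items, 150)  # e.g. IRT 2PL fit, 150 responses

print(tab_300)
print(tab_150)
```

Lowering the threshold from 300 to 150 adds `Q005sM` to the statistics/M cell, while `Q002gE` (140 responses) stays excluded under both.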
This is a downstream summary/reporting script rather than a data-cleaning pipeline. It relies on outputs from the combined final dataset workflow.
Run the scripts in this order:
1. `math_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R`
2. `math_v2_ua23umgc-2022-24_qa_pipeline.R`
3. `math_v2_umgc_ua_22_24_01_combined_qa_stack_and_missingness.R`
4. `math_v2_umgc_ua_22_24_02_combined_qa_duplicate_diagnostics.R`
5. `math_v2_umgc_ua_22_24_03_combined_qa_finalize_and_describe.R`
6. `math_v2_ua_22_24_17-18_describe.R`
7. `math_summary_table_for_item_selection.R`
Scripts 1–5 are the main end-to-end math preparation and QA pipeline. Scripts 6–7 are downstream analytic/reporting scripts built on the final cleaned outputs.
Across scripts, the main variables are standardized as follows:
- `global_id`: unique student identifier
- `DAACS_ID`: normalized student ID within source data
- `age_d24`: `TCAUS` if age < 24, `AUS` if age >= 24
- `ethnicity`: White / Asian / Black / Hispanic / Other
- `pell`: No / Yes
- `military`: No / Yes
- `transfer`: continuous transferred credits
- `mathTime`: seconds
- item variables: final `QID` columns ordered consistently across datasets.
Final item IDs (QID) were created by mapping source-specific item identifiers to a single common naming system.
Depending on the source file, the mapping uses either:
- `qid_ua2 -> QID` for already-wide 2022 files, or
- `question_id -> QID` for raw item-level 2022–2024 files.
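Both mapping routes reduce to a lookup against a crosswalk table. The sketch below is an assumption-heavy illustration: the crosswalk `qid_map`, the legacy names, the `question_id` values, and the toy data are all invented, and the project's own QID mapping helpers are not reproduced.

```r
# Hypothetical crosswalk with both source identifiers and the final QID.
qid_map <- data.frame(
  qid_ua2     = c("ua2_item_01", "ua2_item_02"),
  question_id = c(501, 502),
  QID         = c("Q001gE", "Q002lM"),
  stringsAsFactors = FALSE
)

# Route 1: already-wide 2022 file, rename legacy item columns to final QIDs.
wide <- data.frame(DAACS_ID = c("001", "002"),
                   ua2_item_01 = c(1, 0), ua2_item_02 = c(0, 1),
                   stringsAsFactors = FALSE)
idx <- match(names(wide), qid_map$qid_ua2)         # NA for non-item columns
names(wide)[!is.na(idx)] <- qid_map$QID[idx[!is.na(idx)]]

# Route 2: raw item-level file, attach the QID as a new column.
long <- data.frame(DAACS_ID = c("003", "003"), question_id = c(502, 501),
                   score = c(1, 0), stringsAsFactors = FALSE)
long$QID <- qid_map$QID[match(long$question_id, qid_map$question_id)]

print(names(wide))
print(long)
```

Using `match()` leaves unmapped identifiers as `NA` rather than silently dropping them, so they can be caught by a QA check.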
The QID is the harmonized item identifier used across all cleaned datasets. It also encodes item information in the label itself. For math items, the lowercase letter in the middle of the QID indicates the domain (g, l, n, s, v, w), and the suffix indicates assigned difficulty (E, M, H, H_p). For example, Q025gH_p identifies item 25, geometry domain, hard-plus category.
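The label structure described above can be unpacked with a regular expression. This is a minimal sketch: `Q025gH_p` is the example from the text, the other QIDs are invented, and the pattern assumes every math QID follows the form `Q`, digits, one domain letter, then a difficulty suffix.

```r
# First QID is the README's example; the rest are hypothetical.
qids <- c("Q025gH_p", "Q003nE", "Q110sM", "Q042wH")

# Domain letters and their labels.
domain_lookup <- c(g = "geometry", l = "lines_and_functions",
                   n = "number_and_calculation", s = "statistics",
                   v = "variables_and_equations", w = "word_problems")

# Capture groups: item number, domain letter, difficulty suffix.
pattern <- "^Q([0-9]+)([glnsvw])(E|M|H_p|H)$"
parts <- regmatches(qids, regexec(pattern, qids))

# Element 1 of each match is the full string; groups start at element 2.
# A QID that does not match yields NA in every derived column.
parsed <- data.frame(
  QID         = qids,
  item_number = as.integer(vapply(parts, `[`, character(1), 2)),
  domain      = unname(domain_lookup[vapply(parts, `[`, character(1), 3)]),
  difficulty  = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)

print(parsed)
```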
Typical output folders created by the scripts include:
- `math_v2_umgc1ua2-anSamp2-2022_qa_outputs`
- `math_v2_ua23umgc-2022-24_qa_outputs`
- `math_v2_umgc_ua_22_24_combined_qa_outputs`
- `math_v2_ua_22_24_17-18_describe`
Representative final outputs include:
- cleaned wide 2022 UMGC1/UA2 datasets
- cleaned long and wide 2022–2024 UA23/UMGC datasets
- stacked combined raw file
- duplicate diagnostics tables
- `rows_to_remove`
- `umgc_rows_patched`
- final cleaned combined file `math_v2_umgc_ua_22_24`
- missingness plots and tables
- descriptive summaries
- item-level sample-size summaries
- younger UAlbany 17–18 subset outputs
- item-selection summary tables for the planned study.