DAACS Math Data Preparation and QA Pipelines

This repository contains R scripts for preparing, quality-checking, combining, and summarizing DAACS mathematics assessment data across multiple institutions and waves. The project builds clean student-level math datasets, diagnoses missingness and duplicate-score patterns, patches selected rows before deduplication, produces final descriptive summaries, and generates item-selection tables for the planned Bayesian/frequentist study. The repository also includes a focused descriptive script for the younger UAlbany subset.

Repository structure

The workflow is organized around one shared utilities file, five main pipeline scripts, one downstream item-selection script, and one additional descriptive subset script.

Shared utilities

utils_math_pipeline.R

This file stores reusable functions shared across scripts, including:

core helpers for assertions, saving, loading, ID normalization, and global_id creation
item-column helpers for finding, ordering, coercing, aligning, and counting item responses
shared recodes for ethnicity and yes/no variables
QID mapping helpers
missingness diagnostics
QA helpers
item-level sample-size summaries

Main pipeline scripts

1. `math_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R`

Creates clean 2022 wide datasets for UMGC1, UA2, and their combined file from the already-wide AnSamp2 math dataset.

Purpose

Load the already-wide 2022 UMGC1 + UA2 math dataset
Rename item columns from legacy qid_ua2 names to final QIDs
Standardize key demographic and metadata variables
Create clean UMGC1, UA2, and combined wide datasets
Run item- and dataset-level QA summaries

Key conventions

age_d24: TCAUS if age < 24, AUS if age >= 24
ethnicity: White / Asian / Black / Hispanic / Other
pell: No / Yes
military: No / Yes
transfer: continuous transferred credits
mathTime: seconds

Unit of analysis

final outputs: one row per student (global_id)

2. `math_v2_ua23umgc-2022-24_qa_pipeline.R`

Builds clean long and wide datasets for UA23, UMGC, and their combined 2022–2024 file from raw institution-level and item-level files.

Purpose

Load institution-level and item-level raw files
Standardize institution-level demographics and identifiers
Standardize item-level response files and attach final QIDs
Exclude students with fewer than 18 answered items
Derive student-level mathCompletionDate and mathTime
Build clean long and wide math datasets for UA23, UMGC, and combined
Export QA summaries and cleaned outputs

Key conventions

age_d24: TCAUS if age < 24, AUS if age >= 24
ethnicity: White / Asian / Black / Hispanic / Other
pell: No / Yes
military: No / Yes
transfer: continuous transferred credits
mathTime: seconds

Units of analysis

long files: one row per student-item response
wide files: one row per student (global_id)
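
The answered-item exclusion and the long-to-wide step can be sketched in base R as follows. This is only an illustration under stated assumptions: the column names (global_id, QID, score), the toy QIDs, and the helper names filter_min_answered and to_wide are hypothetical and are not taken from the pipeline script.

```r
# Hypothetical long file: one row per student-item response.
# Column names (global_id, QID, score) and the QIDs are assumptions.
long <- data.frame(
  global_id = c("s1", "s1", "s1", "s2", "s2"),
  QID       = c("Q001gE", "Q002nM", "Q003sH", "Q001gE", "Q002nM"),
  score     = c(1, 0, 1, 1, NA),
  stringsAsFactors = FALSE
)

# Keep students with at least `min_items` answered (non-missing) items.
filter_min_answered <- function(d, min_items = 18) {
  answered <- tapply(!is.na(d$score), d$global_id, sum)
  keep_ids <- names(answered)[answered >= min_items]
  d[d$global_id %in% keep_ids, , drop = FALSE]
}

# Reshape long to wide: one row per global_id, one column per QID.
to_wide <- function(d) {
  w <- reshape(d[, c("global_id", "QID", "score")],
               idvar = "global_id", timevar = "QID",
               v.names = "score", direction = "wide")
  names(w) <- sub("^score\\.", "", names(w))
  rownames(w) <- NULL
  w
}

# Toy threshold of 3 so the example data is not emptied; the stated rule is 18.
kept <- filter_min_answered(long, min_items = 3)
wide <- to_wide(kept)
```

With the default threshold of 18, the toy data would be dropped entirely, which is why the example lowers it.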

3. `math_v2_umgc_ua_22_24_01_combined_qa_stack_and_missingness.R`

Stacks the cleaned 2022 and 2022–2024 wide math files, harmonizes columns, and runs missingness diagnostics on the combined dataset.

Main tasks

load the two cleaned wide files
add provenance via source_file
harmonize types and columns
align both files to a common master column structure
stack them into one combined raw file
run missingness diagnostics on the stacked data
save the combined raw dataset and missingness outputs
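
A minimal base-R sketch of the align-and-stack idea is shown below. The toy data frames, their columns, and the helper names align_to_master and stack_with_provenance are assumptions for illustration; only the source_file column name comes from this README.

```r
# Hypothetical cleaned wide files; the columns are assumptions for illustration.
wide_2022 <- data.frame(global_id = c("a1", "a2"), pell = c("No", "Yes"),
                        Q001gE = c(1, 0), stringsAsFactors = FALSE)
wide_2224 <- data.frame(global_id = "b1", military = "No",
                        Q001gE = 1, Q002nM = 0, stringsAsFactors = FALSE)

# Add any master columns a file lacks (filled with NA), then order the columns.
align_to_master <- function(d, master_cols) {
  for (col in setdiff(master_cols, names(d))) d[[col]] <- NA
  d[, master_cols, drop = FALSE]
}

# Align every file to the union of columns, tag its origin, and stack.
stack_with_provenance <- function(files) {
  master_cols <- unique(unlist(lapply(files, names), use.names = FALSE))
  aligned <- Map(function(d, src) {
    d <- align_to_master(d, master_cols)
    d$source_file <- src  # provenance column named in this README
    d
  }, files, names(files))
  out <- do.call(rbind, unname(aligned))
  rownames(out) <- NULL
  out
}

combined_raw <- stack_with_provenance(
  list(wide_2022 = wide_2022, wide_2224 = wide_2224)
)

# Simple missingness summary: proportion of NA values per column.
missing_prop <- colMeans(is.na(combined_raw))
```

Here the master column structure is simply the union of column names in order of first appearance; the actual script may define it differently.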

4. `math_v2_umgc_ua_22_24_02_combined_qa_duplicate_diagnostics.R`

Detects duplicate students and duplicate score patterns in the stacked combined dataset and prepares deduplication artifacts for the finalization step. The script's opening comment says “reading dataset,” but the script belongs to the math workflow and reads the stacked combined math file.

Main tasks

detect duplicate global_ids
detect fully duplicated rows
detect duplicate substantive rows
detect duplicate score-only response patterns
group rows by identical score pattern
summarize whether duplicate-score rows are likely to represent the same person
prepare outputs such as rows_to_remove and patched UMGC rows for the finalization script
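
The duplicate checks listed above could look roughly like this in base R. The toy data, the item-column pattern, and the variable names are assumptions for illustration, not the script's actual logic.

```r
# Hypothetical stacked data; columns and values are assumptions for illustration.
stacked <- data.frame(
  global_id = c("u1", "u2", "u2", "u3"),
  Q001gE    = c(1, 0, 0, 1),
  Q002nM    = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Assumes item columns are exactly those whose names start with "Q" plus digits.
item_cols <- grep("^Q[0-9]+", names(stacked), value = TRUE)

# Flag every row whose global_id appears more than once.
dup_id <- stacked$global_id %in% stacked$global_id[duplicated(stacked$global_id)]

# Flag fully duplicated rows (all columns identical), counting both copies.
dup_full <- duplicated(stacked) | duplicated(stacked, fromLast = TRUE)

# Build one key per row from item scores only; an NA score becomes the text "NA".
score_key <- do.call(paste, c(stacked[item_cols], sep = "|"))

# Group rows by identical score pattern; groups larger than 1 are candidates.
pattern_size <- ave(seq_along(score_key), score_key, FUN = length)
dup_score <- pattern_size > 1
```

In the toy data, u1 and u3 share a score pattern but have different IDs, which is the kind of case where one would still need to judge whether the rows are likely to be the same person.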

5. `math_v2_umgc_ua_22_24_03_combined_qa_finalize_and_describe.R`

Applies duplicate removals, patches UMGC rows using non-missing values from matched UMGC1 rows, removes rows that are missing all key demographics except military, runs final missingness diagnostics, creates descriptive summaries, and saves the final cleaned combined math dataset.

Main tasks

load the stacked raw combined file
load rows_to_remove
load umgc_rows_patched
replace original UMGC rows with patched versions
remove selected UMGC1 duplicate rows
remove 7 rows missing all key demographics except military
enforce one row per global_id
run final missingness diagnostics
create categorical and numeric descriptives
generate item-level sample-size summaries
save the final cleaned dataset and QA outputs
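
The patching idea (fill a UMGC row's missing cells from its matched UMGC1 row, without overwriting existing values) can be sketched as follows. The toy rows, the column set, and the helper name patch_from_donor are hypothetical, and the sketch assumes the two data frames are already matched row by row.

```r
# Hypothetical matched rows; column names and the pairing are assumptions.
umgc_row  <- data.frame(global_id = "m1", age = NA_real_, pell = "Yes",
                        ethnicity = NA_character_, stringsAsFactors = FALSE)
umgc1_row <- data.frame(global_id = "m1_old", age = 27, pell = "No",
                        ethnicity = "Asian", stringsAsFactors = FALSE)

# Fill only the missing cells of `target` from `donor`; existing values win.
# Assumes row i of `donor` is the match for row i of `target`.
patch_from_donor <- function(target, donor, cols) {
  for (col in cols) {
    gap <- is.na(target[[col]])
    target[[col]][gap] <- donor[[col]][gap]
  }
  target
}

patched <- patch_from_donor(umgc_row, umgc1_row,
                            cols = c("age", "pell", "ethnicity"))

# After removals, a check like this enforces one row per global_id.
stopifnot(anyDuplicated(patched$global_id) == 0)
```

In this example pell stays "Yes" because it was already present, while age and ethnicity are taken from the matched row.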

Additional analytic scripts

6. `math_v2_ua_22_24_17-18_describe.R`

Creates a focused descriptive subset for younger UAlbany students aged 17–18 after the final combined dataset has been created.

Purpose

subset the final cleaned dataset to ages 17–18
further subset to ua2 and ua23
run final missingness diagnostics
create descriptive summaries
create item-level sample-size tables
save the younger UAlbany subset and outputs

This is a downstream descriptive script, not a core preprocessing pipeline. It depends on the final cleaned dataset produced by Script 5.

7. `math_summary_table_for_item_selection.R`

Creates six summary tables for item selection in the planned math study. Each table counts eligible items across six math domains and four difficulty categories under one response-count threshold. The script derives the domain from the lowercase letter embedded in the QID:

g = geometry
l = lines_and_functions
n = number_and_calculation
s = statistics
v = variables_and_equations
w = word_problems

Purpose

Read item-level response-count summaries from item_sample_size_by_demo.csv
Derive each item’s domain from QID
Derive each item’s difficulty category from QID
Build six domain-by-difficulty summary tables for:
- IRT 2PL Fit: 300 responses
- IRT 2PL Fit: 150 responses
- Multivariable DIF: 200 per group
- Multivariable DIF: 50 per group
- Factor Level: 100 per level
- Factor Level: 40 per level
Save the six tables to Excel and CSV

This is a downstream summary/reporting script rather than a data-cleaning pipeline. It relies on outputs from the combined final dataset workflow.
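
As a rough sketch of how one domain-by-difficulty table might be counted under a threshold: the toy data, the n_responses column, and the helper name count_eligible are assumptions, and the real layout of item_sample_size_by_demo.csv is not shown in this README.

```r
# Hypothetical item-level counts; the n_responses column is an assumption,
# not the actual layout of item_sample_size_by_demo.csv.
item_counts <- data.frame(
  QID         = c("Q001gE", "Q002gM", "Q003nE", "Q004sH_p", "Q005gE"),
  n_responses = c(320, 180, 140, 305, 410),
  stringsAsFactors = FALSE
)

domains      <- c("g", "l", "n", "s", "v", "w")
difficulties <- c("E", "M", "H", "H_p")

# Count items per domain x difficulty cell that meet a response threshold.
count_eligible <- function(d, threshold) {
  eligible <- d[d$n_responses >= threshold, , drop = FALSE]
  domain <- factor(sub("^Q[0-9]+([a-z]).*$", "\\1", eligible$QID),
                   levels = domains)
  difficulty <- factor(sub("^Q[0-9]+[a-z]", "", eligible$QID),
                       levels = difficulties)
  table(domain = domain, difficulty = difficulty)
}

tab_300 <- count_eligible(item_counts, 300)  # e.g. IRT 2PL fit, 300 responses
tab_150 <- count_eligible(item_counts, 150)  # e.g. IRT 2PL fit, 150 responses
```

Using factors with fixed levels keeps all six domains and four difficulty categories in every table, even when a cell has no eligible items. The per-group and per-level thresholds would need counts broken out by group, which this sketch does not cover.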

Recommended execution order

Run the scripts in this order:

math_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R
math_v2_ua23umgc-2022-24_qa_pipeline.R
math_v2_umgc_ua_22_24_01_combined_qa_stack_and_missingness.R
math_v2_umgc_ua_22_24_02_combined_qa_duplicate_diagnostics.R
math_v2_umgc_ua_22_24_03_combined_qa_finalize_and_describe.R
math_v2_ua_22_24_17-18_describe.R
math_summary_table_for_item_selection.R

Scripts 1–5 are the main end-to-end math preparation and QA pipeline. Scripts 6–7 are downstream analytic/reporting scripts built on the final cleaned outputs.
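
One way to run the scripts in this order is a small driver like the sketch below. The helper run_pipeline is hypothetical and not part of the repository, and it assumes each script can be sourced on its own from the given directory, which this README does not confirm.

```r
# Execution order as listed in this README.
pipeline_scripts <- c(
  "math_v2_umgc1ua2-anSamp2-2022_qa_pipeline.R",
  "math_v2_ua23umgc-2022-24_qa_pipeline.R",
  "math_v2_umgc_ua_22_24_01_combined_qa_stack_and_missingness.R",
  "math_v2_umgc_ua_22_24_02_combined_qa_duplicate_diagnostics.R",
  "math_v2_umgc_ua_22_24_03_combined_qa_finalize_and_describe.R",
  "math_v2_ua_22_24_17-18_describe.R",
  "math_summary_table_for_item_selection.R"
)

# With dry_run = TRUE, only report which scripts are present; nothing is run.
run_pipeline <- function(scripts, dir = ".", dry_run = TRUE) {
  paths <- file.path(dir, scripts)
  found <- file.exists(paths)
  plan <- data.frame(step = seq_along(scripts), script = scripts,
                     found = found, stringsAsFactors = FALSE)
  if (dry_run) return(plan)
  if (!all(found)) {
    stop("Missing scripts: ", paste(scripts[!found], collapse = ", "))
  }
  for (p in paths) {
    message("Running ", p)
    # Assumption: each script is self-contained enough for a fresh environment.
    source(p, local = new.env())
  }
  invisible(plan)
}

plan <- run_pipeline(pipeline_scripts)  # dry run: lists steps, runs nothing
```

Calling run_pipeline(pipeline_scripts, dry_run = FALSE) would source the scripts in sequence and stop early if any file is missing.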

Key variable conventions

Across scripts, the main variables are standardized as follows:

global_id: unique student identifier
DAACS_ID: normalized student ID within source data
age_d24:
- TCAUS if age < 24
- AUS if age >= 24
ethnicity:
- White
- Asian
- Black
- Hispanic
- Other
pell:
- No
- Yes
military:
- No
- Yes
transfer: continuous transferred credits
mathTime: seconds
item variables: final QID columns ordered consistently across datasets
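
A minimal sketch of the age_d24 and yes/no conventions in base R follows. The input column pell_raw, the raw spellings it accepts, and the helper name recode_yes_no are assumptions for illustration; the shared recodes in utils_math_pipeline.R may differ.

```r
# Hypothetical input; column names and raw codings are assumptions.
students <- data.frame(
  age      = c(19, 24, 31, NA),
  pell_raw = c("Y", "n", "yes", NA),
  stringsAsFactors = FALSE
)

# age_d24: TCAUS if age < 24, AUS if age >= 24; missing ages stay NA.
students$age_d24 <- factor(
  ifelse(students$age < 24, "TCAUS", "AUS"),
  levels = c("TCAUS", "AUS")
)

# Yes/no recode: map assumed raw spellings onto the No / Yes convention.
# Anything unrecognized becomes NA rather than being guessed.
recode_yes_no <- function(x) {
  x <- tolower(trimws(as.character(x)))
  out <- rep(NA_character_, length(x))
  out[x %in% c("y", "yes", "1", "true")] <- "Yes"
  out[x %in% c("n", "no", "0", "false")] <- "No"
  factor(out, levels = c("No", "Yes"))
}

students$pell <- recode_yes_no(students$pell_raw)
```

A student aged exactly 24 falls in AUS, matching the age >= 24 rule above.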

QID construction

Final item IDs (QID) were created by mapping source-specific item identifiers to a single common naming system.
Depending on the source file, the mapping uses either:

qid_ua2 -> QID for already-wide 2022 files, or
question_id -> QID for raw item-level 2022–2024 files.

The QID is the harmonized item identifier used across all cleaned datasets. It also encodes item information in the label itself. For math items, the lowercase letter in the middle of the QID indicates the domain (g, l, n, s, v, w), and the suffix indicates assigned difficulty (E, M, H, H_p). For example, Q025gH_p identifies item 25, geometry domain, hard-plus category.
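
Splitting a QID into its parts could be sketched like this. The exact pattern (Q, digits, one domain letter, then the difficulty suffix) is inferred from the single example Q025gH_p and is an assumption, as is the helper name parse_qid.

```r
# Domain letters as documented for the item-selection script.
domain_names <- c(
  g = "geometry", l = "lines_and_functions", n = "number_and_calculation",
  s = "statistics", v = "variables_and_equations", w = "word_problems"
)

# Split QIDs such as "Q025gH_p" into item number, domain, and difficulty.
# QIDs that do not fit the assumed pattern are returned with NA parts.
parse_qid <- function(qid) {
  pattern <- "^Q([0-9]+)([glnsvw])(E|M|H_p|H)$"
  ok <- grepl(pattern, qid)
  item <- rep(NA_integer_, length(qid))
  domain <- rep(NA_character_, length(qid))
  difficulty <- rep(NA_character_, length(qid))
  item[ok] <- as.integer(sub(pattern, "\\1", qid[ok]))
  domain[ok] <- unname(domain_names[sub(pattern, "\\2", qid[ok])])
  difficulty[ok] <- sub(pattern, "\\3", qid[ok])
  data.frame(QID = qid, item = item, domain = domain,
             difficulty = difficulty, stringsAsFactors = FALSE)
}

parsed <- parse_qid(c("Q025gH_p", "Q003nE", "not_a_qid"))
```

Returning NA for non-matching IDs, rather than stopping, makes it easier to list unmapped items during QA.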

Main outputs

Typical output folders created by the scripts include:

math_v2_umgc1ua2-anSamp2-2022_qa_outputs
math_v2_ua23umgc-2022-24_qa_outputs
math_v2_umgc_ua_22_24_combined_qa_outputs
math_v2_ua_22_24_17-18_describe

Representative final outputs include:

cleaned wide 2022 UMGC1/UA2 datasets
cleaned long and wide 2022–2024 UA23/UMGC datasets
stacked combined raw file
duplicate diagnostics tables
rows_to_remove
umgc_rows_patched
final cleaned combined file math_v2_umgc_ua_22_24
missingness plots and tables
descriptive summaries
item-level sample-size summaries
younger UAlbany 17–18 subset outputs
item-selection summary tables for the planned study
