---
title: 'PPAS Challenge: Practical data concerns'
output:
  html_document:
    toc: True
  pdf_document:
    toc: True
---
## Background

In this challenge we use a [Kaggle dataset](https://www.kaggle.com/mazharkarimi/heart-disease-and-stroke-prevention/metadata) with data on the prevalence of cardiovascular disease and risk factors. We have created a synthetic, derivative dataset for the purposes of this challenge. The synthetic dataset contains 5 years of seriatim data on heart attack rates by state, year, sex, age, and race.

## Data license

The database license and content license that govern the original dataset can be found in a document in the PPAS GitHub repository.

## Goals

Prepare data for modeling and validation.

## Load data and packages

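```{r Load, warning = FALSE, message = FALSE}
library(dplyr)    # data manipulation
library(car)      # regression diagnostics such as vif()
library(stringr)  # string helpers

# Read the synthetic dataset; the file is expected in the working directory
heartattack <- readRDS("heartdiseasedataset_modified.RDS")
```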

## Review data

```{r DataReview}
head(heartattack)  # first six rows
str(heartattack)   # column types and a preview of each field
```

## Challenges

### 1) Deal with outliers and missing values

a) Find the number of missing values in each field.

```{r 1a_NumNA}
# YOUR CODE HERE ####
```
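
As a possible starting point, the sketch below counts missing values per column on a small toy data frame. The toy data and its column names are hypothetical and are not the challenge dataset; the same call should carry over to `heartattack`.

```r
# Toy data standing in for the challenge data; all column names are hypothetical.
toy <- data.frame(
  age  = c(34, NA, 51, 67, NA),
  sex  = c("F", "M", NA, "F", "M"),
  year = c(2015, 2016, 2017, 2018, 2019)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Dividing by `nrow(toy)` turns the counts into missing-value rates, which are often easier to compare across fields.
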
b) Does the missingness in any field correlate with the response variable or with the values of other fields? Consider using the `table()` function to cross-tabulate two categorical variables.
```{r 1b_NACorrelation}
# YOUR CODE HERE ####
```
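
One way to approach this, sketched on toy data with hypothetical column names, is to turn missingness into its own logical flag and cross-tabulate that flag against the response and against another field. This is only an illustration of the `table()` pattern, not an analysis of the challenge data.

```r
# Toy data standing in for the challenge data; all column names are hypothetical.
toy <- data.frame(
  heart_attack = c(1, 0, 1, 1, 0, 0, 1, 0),
  sex          = c("F", "M", "F", "M", "F", "M", "F", "M"),
  age          = c(NA, 45, NA, 60, 52, 38, NA, 71)
)

# Flag the rows where age is missing.
age_missing <- is.na(toy$age)

# Cross-tabulate the flag against the response and against another field.
tab_response <- table(age_missing = age_missing, heart_attack = toy$heart_attack)
tab_sex      <- table(age_missing = age_missing, sex = toy$sex)

# Row proportions are easier to compare than raw counts.
print(prop.table(tab_response, margin = 1))
print(prop.table(tab_sex, margin = 1))
```

If the row proportions differ noticeably between the missing and non-missing groups, the field is probably not missing completely at random, which matters for the choice of imputation method.
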
c) Impute the missing values.
```{r 1c_Imputation}
# YOUR CODE HERE ####
```
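
The challenge leaves the imputation method open. A minimal sketch of one simple option, assuming median imputation for a numeric field and most-frequent-value imputation for a categorical field, is shown below on toy data with hypothetical column names. Other methods may suit the real data better, especially if part (b) shows that missingness is related to other fields.

```r
# Toy data standing in for the challenge data; all column names are hypothetical.
toy <- data.frame(
  age = c(34, NA, 51, 67, NA),
  sex = c("F", "M", NA, "F", "F"),
  stringsAsFactors = FALSE
)

imputed <- toy

# Keep a flag so a later model can still see which values were imputed.
imputed$age_was_missing <- is.na(toy$age)

# Numeric field: replace NA with the median of the observed values.
imputed$age[is.na(imputed$age)] <- median(toy$age, na.rm = TRUE)

# Categorical field: replace NA with the most frequent observed level.
mode_sex <- names(which.max(table(toy$sex)))
imputed$sex[is.na(imputed$sex)] <- mode_sex

print(imputed)
```
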
### 2) Derive a new "geographic region" variable for the USA to reduce the dimensionality of the state field.
```{r 2_DeriveRegion}
# YOUR CODE HERE ####
```
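
Base R ships the vectors `state.name`, `state.abb`, and `state.region` in the `datasets` package, so one possible lookup is a `match()` against them. The sketch below assumes the state field holds full state names (if it holds abbreviations, `state.abb` would be the lookup key instead); the toy column name `state` is hypothetical. Note that `state.region` labels the Midwest as "North Central" and does not include the District of Columbia.

```r
# Toy data standing in for the challenge data; the column name is hypothetical.
toy <- data.frame(
  state = c("California", "Texas", "New York", "Ohio", "District of Columbia"),
  stringsAsFactors = FALSE
)

# Look up each state's position in state.name and pull the matching region.
toy$region <- as.character(state.region[match(toy$state, state.name)])

# DC is absent from state.name, so the lookup returns NA for it.
# The Census Bureau places DC in the South, so it is assigned by hand here.
toy$region[toy$state == "District of Columbia"] <- "South"

print(toy)
```

Checking `table(toy$region, useNA = "ifany")` afterwards is a quick way to catch spellings that failed to match.
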
### 3) Partition off a 30% holdout subset
```{r 3_PartitionHoldout}
# YOUR CODE HERE ####
```
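
A simple random split can be sketched as follows, on toy data with hypothetical columns. The seed value is arbitrary and is set only so that the split is reproducible; a stratified split may be preferable if the response is rare.

```r
# Fix the random seed so the same rows are held out on every run.
set.seed(42)

# Toy data standing in for the challenge data; column names are hypothetical.
toy <- data.frame(id = 1:100, x = rnorm(100))

# Draw 30% of the row indices without replacement for the holdout set.
n_holdout   <- round(0.30 * nrow(toy))
holdout_idx <- sample(seq_len(nrow(toy)), size = n_holdout)

holdout <- toy[holdout_idx, ]
train   <- toy[-holdout_idx, ]

print(c(train = nrow(train), holdout = nrow(holdout)))
```
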
### 4) Fit a logistic regression model to estimate heart attack odds given region, year, age, sex, and race.
```{r 4_FitLogisticModel}
# YOUR CODE HERE ####
```
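
The general shape of such a fit, using `glm()` with `family = binomial`, is sketched below on simulated data. The column names, the simulated effect sizes, and the reduced set of predictors are all assumptions made for illustration; in the challenge the model would be fit on the training partition with the full predictor list.

```r
set.seed(1)
n <- 500

# Simulated predictors; names and distributions are hypothetical.
toy <- data.frame(
  age    = sample(30:80, n, replace = TRUE),
  sex    = factor(sample(c("F", "M"), n, replace = TRUE)),
  region = factor(sample(c("Northeast", "South", "West"), n, replace = TRUE))
)

# Simulate a binary response whose log-odds rise with age and for sex == "M".
p <- plogis(-5 + 0.06 * toy$age + 0.4 * (toy$sex == "M"))
toy$heart_attack <- rbinom(n, 1, p)

# Logistic regression: binomial family with the default logit link.
fit <- glm(heart_attack ~ age + sex + region, data = toy, family = binomial)

# Exponentiated coefficients are odds ratios relative to the baseline levels.
odds_ratios <- exp(coef(fit))
print(summary(fit))
print(odds_ratios)
```

For the holdout check, `predict(fit, newdata = holdout, type = "response")` returns predicted probabilities rather than log-odds.
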
### 5) Test for multicollinearity among the predictor variables.
```{r 5_TestMulticollinearity}
# YOUR CODE HERE ####
```
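
A common diagnostic is the variance inflation factor (VIF). To show what it measures, the sketch below computes it by hand in base R for simulated numeric predictors: each predictor is regressed on the others, and the VIF is 1 / (1 - R^2) of that regression. The variables are hypothetical. For a fitted model, `vif()` from the `car` package loaded above is the usual shortcut, and it reports generalized VIFs for terms with more than one degree of freedom, such as multi-level factors.

```r
set.seed(2)
n <- 200

# Simulated predictors: x2 is nearly a copy of x1, x3 is independent of both.
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# Regress each predictor on all the others and convert R^2 into a VIF.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))
```

Values near 1 indicate little collinearity; a frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, although the threshold is a convention rather than a formal test.
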