-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy path02-DataVisualization.qmd
More file actions
448 lines (314 loc) · 20 KB
/
02-DataVisualization.qmd
File metadata and controls
448 lines (314 loc) · 20 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
---
title: "Data Visualization with ggplot2"
knitr:
opts_chunk:
warning: false
message: false
error: false
---
## Learning Objectives
1. Become familiar with the components of a `ggplot2` graph (i.e. the grammar of graphics)
1. Use best practices for different graph types
1. Understand how to use visualizations to tell a data story
## Getting set up
1. Go to `File > New Project`
2. In `Create project from` menu choose `Existing Directory`
3. Browse to `Desktop > Session02_DataVisualization`
4. Select the check box that says `Open in New Session`
4. Open the companion script called `02-DataVisualization.R`
5. Use the `library()` function to load the `tidyverse`, `viridis`, and `usmap` packages.
6. Use the `read_csv()` function to import the `yearly_rates_joined` and the `region_summary` csv files.
::: {.callout-tip appearence="minimal" collapse="true"}
### Follow Along at Home
Posit (RStudio) Cloud is a browser-based version of RStudio. It will
allow you to use RStudio without needing to download anything to your
computer. Posit Cloud automatically organizes things into Projects. You can also easily share your R projects with others.
Get Started:
1. Create your free RStudio Cloud account at
<https://posit.cloud/plans/free> if you haven't already.
2. Go to the class project <https://posit.cloud/content/8458074>
3. Note the text that marks this as a Temporary Copy. Select the
`Save a Permanent Copy` button to begin working!
:::
```{r}
library(tidyverse)
library(viridis)
library(usmap)
yearly_rates_joined <- read_csv("data/yearly_rates_joined.csv")
region_summary <- read_csv("data/region_summary.csv")
```
## Why Data Visualization?
Visualization is an important process which can help us explore, understand, analyze, and communicate about data. Visualizations, including many kinds of graphs, charts, maps, animations, and infographics, can be far more effective at quickly communicating important points than raw numbers alone. But visualizations also have the power to mislead. And so throughout this class, we'll be covering some good data visualization practices. Slides accompanying this section can be found here: <https://osf.io/yk5bx>^[Slides created by the [Visualizing the Future project](https://visualizingthefuture.github.io/), made possible in part by the Institute of Museum and Library Services, RE-73-18-0059-18.]
## The Data for This Lesson
In this lesson, we will continue to use the measles data we were working with in the Data Wrangling lesson. By the end of that lesson, we had joined our measles tibble with another tibble of population data. This enabled us to calculate the incidence rate of measles in each state and each year. We also combined that data with the states tibble so we could look at regional and division trends as well.
## **`ggplot2`** Basics
Next, we will learn about **`ggplot2`** - a tidyverse package for visualizing data. It is a powerful and flexible R package that allows you to create fully customizable, publication quality graphics. The `gg` in **`ggplot2`** stands for grammar of graphics. The grammar of graphics is the underlying philosophy of the package. It focuses on creating graphics in layers. Start with the data – map the data to the axes and to aesthetic qualities like size, shape, and color and geometries like dots, lines, and polygons. Further refine the appearance of your plot by adjusting scales and legends, labels, coordinate systems, and adding annotations.
All `ggplot2` graphs start with the same basic template:
```
<DATA> %>%
ggplot(aes(<MAPPINGS>)) +
<GEOM_FUNCTION>() +
<Additional GEOMS, SCALES, THEMES, etc. . . >
```
All graphs start with the ggplot function and the data. We’ll use the pipe to pipe the data to the function.
```{r}
region_summary %>%
ggplot()
```
We see that even this initializes the plot area of RStudio.
Next, we define a mapping (using the aesthetic, or `aes()`, function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc. Here we will say that the x axis should contain the affiliation variable. Note how the x-axis populates with some numbers and tick marks.
```{r}
region_summary %>%
ggplot(aes(x=region, y=avg_rate))
```
Next we need to add `‘geoms’` – graphical representations of the data in the plot (points, lines, bars). **`ggplot2`** offers many different geoms for common graph types. To add a geom to the plot use the `+` operator.
```{r}
region_summary %>%
ggplot(aes(x=region, y=avg_rate)) +
geom_bar(stat = "identity")
```
If you want the y axis to display something other than count, you need to make a couple of small adjustments. First - specify the `y` variable in the `aes()` function, and change the stat argument from it’s default of “count” to “identity” This tells it to base the y axis on the specified variable.
## Setting vs mapping aesthetics
When working with **`ggplot2`**, it's important to understand the difference between *setting* aesthetic properties and *mapping* them. All geoms have certain visual attributes that can be modified. Polygons like bars, have the properties fill and color. You can change the inside color of a bar with `fill`, and the border with `color`. We can modify the defaults with the `fill` and `color` arguments in the `geom_bar()` layer. (I've also increased the `linewidth` to make it easier to see the border color)
```{r}
region_summary %>%
ggplot(aes(x=region, y=avg_rate)) +
geom_bar(stat = "identity",
fill="blue",
color="purple",
linewidth=1.5,
width = 0.8)
```
::: {.callout-note}
How did we know the color names "blue" and "purple" would work in the code above? R has 657 (!!) built in color names. You can see them by calling the function `colors()`. You can also specify colors using rgb and hexadecimal codes.
:::
Now we have manually set a value for the fill and color. To create our initial graph, we used the `mapping` argument and the `aes()` function to map the x axis to the `region` variable. Watch what happens if we map the fill property to the `region` variable as well.
```{r}
region_summary %>%
ggplot(aes(x=region, y=avg_rate, fill=region)) +
geom_bar(stat = "identity")
```
As we'll see later in this lesson, mapping a variable to an aesthetic will be especially helpful when we have a third variable to display.
::: {.callout-note}
When you map an aesthetic with `aes()` in the `ggplot()` function it is inherited by all subsequent layers. When you map in a `geom_*()` function it is applied only to that layer.
:::
## Telling a data story
Now let's start using ggplot2 to help us answer our research question - how did the introduction of the vaccine affect measles rates in the country? We'll do this with a line graph, which is useful for showing change over time.
First, we need to use our `**dplyr**` skills to summarize the data.
```{r}
yearly_count <- yearly_rates_joined %>%
group_by(Year) %>%
summarize(TotalCount = sum(TotalCount))
```
We pipe to `ggplot()` and assign `Year` to the x-axis and `TotalCount` to the y-axis with the `aes()` function. The canvas and axes are ready.
```{r}
yearly_count %>%
ggplot(aes(x=Year, y=TotalCount))
```
Now we can add a `geom` layer to add our line. Let's also be sure to save our work to an object.
```{r}
yearly_count %>%
ggplot(aes(x=Year, y=TotalCount)) +
geom_line()
```
It might be nice to see where each data point falls on the line. To do this we can add another geometry layer.
```{r}
yearly_count %>%
ggplot(aes(x=Year, y=TotalCount)) +
geom_line() +
geom_point()
```
There are many ways to customize your plot, like changing the color or line type, adding labels and annotations. One thing that would make our graph easier to read is tick marks at each decade on the x-axis. There are a number of functions in **`ggplot2`** for altering the scale. We want to alter the x-axis scale, which holds continuous data, so we can use the `scale_x_continuous()` function. Note that when you start to write the name of the function, RStudio will supply you with other similarly named functions.
`scale_x_continuous()` has an argument called `breaks` which allows you to alter where the axis tick marks occur. We can use that together with `seq()` to say put a tick mark every 10 places between 1900 and 2000.
```{r}
yearly_count %>%
ggplot(aes(x=Year, y=TotalCount)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(from=1900, to=2000, by=10))
```
Now we can move beyond basic exploration and start to use our graph to analyze and tell stories about our data. One important trend we might notice, is the sharp decrease in cases in the 1960s. The measles vaccine was introduced in 1963. We can use our visualization to tell the story of the vaccine's impact.
Let's drop a reference line at 1963 to clearly indicate on the graph when the vaccine was introduced. To do this we add a `geom_vline()` and the `annotate()` function. There are multiple ways of adding lines and text to a plot, but these will serve us well for this case. Note that you can change features of lines such as color, type, and size. We can supply coordinates to `annotate()` to position the annotation where we want.
```{r}
yearly_count %>%
ggplot(aes(x=Year, y=TotalCount)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(from=1900, to=2000, by=10)) +
geom_vline(xintercept = 1963, color = "red", linetype= "dashed") +
annotate(geom = "label", x=1963, y=800000, label="1963: vaccine introduced")
```
Finally, let's add a title and axis labels to our plot with the `labs()` function. Note that axis labels will automatically be supplied from the column names, but you can use this function to override those defaults.
```{r}
yearly_count_line <- yearly_count %>%
ggplot(aes(x=Year, y=TotalCount)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(from=1900, to=2000, by=10)) +
geom_vline(xintercept = 1963, color = "red", linetype= "dashed") +
annotate(geom = "label", x=1963, y=800000, label="1963: vaccine introduced") +
labs(title = "Measles Cases Decrease After Vaccine Introduced", x = "Year", y = "Total Measles Case Count")
```
Now, we have a pretty nice looking graph. Finally, let's save our plot to a png file, so we can share it or put it in reports. To do this we use the function called `ggsave()`.
```{r save plot, eval=FALSE, echo=TRUE}
ggsave("figures/yearly_measles_count.png", plot = yearly_count_line)
```
## Working with Three Variables
If we have different groups in our data, we might want to use the graph to compare them. Let's create another line graph with a line for each region. Since the regions are different sizes, let's compare the average rate instead of the total count.
First, let's summarize our data
```{r}
regional_rates <- yearly_rates_joined %>%
group_by(Year, region) %>%
summarize(avg_rate = mean(epi_rate, na.rm=TRUE))
```
```{r}
regional_rates %>%
ggplot(aes(x=Year, y=avg_rate, group=region, color=region)) +
geom_line()
```
::: {.callout-note title="Working with color palettes"}
While **`ggplot`** comes with a default color palette, there are numerous other palettes out there you can use, such as:
- [**`RColorBrewer`**](https://r-graph-gallery.com/38-rcolorbrewers-palettes.html)
- [**`viridis`**](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html)
- [**`ggthemes`**](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/)
- [**`ggsci`**](https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html)
- [**`wesanderson`**](https://github.com/karthik/wesanderson#readme)
:::
Let’s try applying a viridis palette. viridis was designed to be especially robust for many forms of color-blindness. It is also meant to print well in grey scale.
To do this, we need add another `scale_` function. This time `scale_color_viridis()`.
```{r}
regional_rates %>%
ggplot(aes(x=Year, y=avg_rate, group=region, color=region)) +
geom_line(linewidth=1) +
scale_color_viridis(discrete=TRUE)
```
::: {.callout-note}
Learn more from the [**`viridis`** documentation](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html#the-color-scales)
:::
Let's adjust tick marks and add labels again
```{r}
regional_rates %>%
ggplot(aes(x=Year, y=avg_rate, group=region, color=region)) +
geom_line(linewidth=1) +
scale_color_viridis(discrete=TRUE) +
scale_x_continuous(breaks = seq(from=1900, to=2000, by=10)) +
labs(x="", y= "Incidence Rate of measles \n per 100,000 persons", color="Region")
```
## Changing the theme
The theme of a **`ggplot2`** graph controls the overall look and all non-data elements of the plot. There are several built-in themes which can be applied as another layer. Start typing `theme_` in RStudio to see a list of themes. You can also use the `theme()` function to modify aspects of an existing theme. Here we apply `theme_classic()` which removes the grid lines and grey background of the default theme.
```{r}
regional_rates %>%
ggplot(aes(x=Year, y=avg_rate, group=region, color=region)) +
geom_line(linewidth=1) +
scale_color_viridis(discrete=TRUE) +
scale_x_continuous(breaks = seq(from=1900, to=2000, by=10)) +
labs(title = "Measles Cases in the 20th Century", x="", y= "Average rate\nper 100,000", color="Region") +
theme_classic()
```
In addition to setting an overall theme, we can tinker with individual elements of a theme with the `theme()` function. Check out the [`ggplot2` documentation](https://ggplot2.tidyverse.org/) for all the elements that can be adjusted. Here we are going to make an adjustment to the y axis label. It's good practice to make as much of your text horizontal as possible for ease of reading.
```{r}
regional_rates %>%
ggplot(aes(x=Year, y=avg_rate, group=region, color=region)) +
geom_line(linewidth=1) +
scale_color_viridis(discrete=TRUE) +
scale_x_continuous(breaks = seq(from=1900, to=2000, by=10)) +
labs(title = "Measles Cases in the 20th Century", x="", y= "Average rate\nper 100,000", color="Region") +
theme_classic() +
theme(axis.title.y = element_text(angle=0, vjust = 0.5, hjust = 0.5))
```
## Faceting and Small Multiples
Even with the adjustments we made, it can be difficult to understand a graph with too much data. Even with just five lines it can be hard to see what's happening. A good practice is to break out each group into individual graphs called small multiples or facets.
```{r}
regional_rates %>%
ggplot(aes(x=Year, y=avg_rate, group=region, color=region)) +
geom_line(linewidth=1) +
scale_color_viridis(discrete=TRUE) +
scale_x_continuous(breaks = seq(from=1900, to=2000, by=10)) +
labs(title = "Measles Cases in the 20th Century", x="", y= "Average rate\nper 100,000", color="Region") +
facet_wrap(~region, nrow=2) +
theme_classic() +
theme(axis.title.y = element_text(angle=0, vjust = 0.5, hjust = 0.5))
```
::: {.callout-note collapse=true title="Highlighting"}
We could also use highlighting to do away with noise in a line graph.
Then we can create two `geom_line` layers and highlight just the one in the facet.
```{r}
tmp <- regional_rates %>%
mutate(region2=region)
tmp %>%
ggplot(aes(x=Year, y=avg_rate)) +
geom_line(data=tmp %>% dplyr::select(-region), aes(group=region2), color="grey", linewidth=0.5, alpha=0.5) +
geom_line(aes(color=region), color="#69b3a2", linewidth=1.2 ) +
scale_x_continuous(breaks = seq(from=1900, to=2000, by=10)) +
theme_minimal() +
theme(
legend.position="none",
plot.title = element_text(size=14),
panel.grid = element_blank()
) +
ggtitle("A comparison of measles cases by Region") +
facet_wrap(~region, ncol = 2)
```
:::
## Maps
While we were successful at creating a bar chart to compare measles rates in each state, it is often more helpful to use a map to visualize geographic data. There are multiple types of map-based visualizations in R and tools for creating them. While it is possible to make interactive and animated maps in R, in this lesson, we will only cover static maps.
In this lesson, we will focus on creating **choropleths**. Despite the funny name, this is a visualization you have likely seen many many times. A choropleth is a map that links geographic areas or boundaries to some numeric variable.
**`ggplot2`** needs a little help to make map visualizations. Depending on the geographies you want to map, you may need to find geoJSON or shapefiles. There are also several packages in R that come pre-loaded with background maps of common geographies. We'll be using one in this lesson called **`usmap`**. There are several advantages to this package:
1. It contains maps of the US with both state and county boundaries.
2. You can create maps based on census regions and divisions. 3. Alaska and Hawaii are included, while many map packages only have a map of the continental US.
4. It creates the map as a **`ggplot2`** object, so you can customize the visualization with **`ggplot2`** functions (i.e. the things you've been learning in this lesson!)
We've installed **`usmap`** in your RStudio Cloud project, so now let's load it into our session.
```{r}
library(usmap)
```
The main function in this package is `plot_usmap`. When you call it without any arguments, you get the background map of the US.
```{r}
plot_usmap()
```
By default it shows state boundaries, but we could also ask it to show county boundaries
```{r}
plot_usmap(regions="counties")
```
Since we do not have that level of data in our dataset, we'll use the default option. There are two required arguments to `plot_usmap()`.
1. The first is a data frame specified with the `data` argument. This data frame must have a column called `state` or `fips` which contains state names or FIPS (Federal Information Processing) codes. FIPS codes must be used for county level data. This data frame must also have a column of values for each state or FIPS.
2. The second argument is the name of the column that contains the values, specified by the `value` argument.
Let's first create a data frame with just our 1963 data.
```{r}
measles1963df <- yearly_rates_joined %>%
filter(Year==1963)
```
Now let's plot our data with `plot_usmap()`. Remember it's important to use rate here rather than our raw count numbers since we are dealing with areas of vastly different populations.
```{r}
plot_usmap(data=measles1963df, values = "epi_rate")
```
Let's try with our `viridis` color palette.
```{r}
plot_usmap(data=measles1963df, values = "epi_rate") +
scale_fill_viridis()
```
Note how the brighter areas seem to highlight the areas of greater concern.
If you prefer the darker colors to represent higher rates, and lighter to represent lower, we can switch the direction of the palette with the `direction` argument.
```{r}
plot_usmap(data=measles1963df, values = "epi_rate") +
scale_fill_viridis(direction = -1)
```
Let's try another of the **`viridis`** palettes.
```{r}
plot_usmap(data=measles1963df, values = "epi_rate") +
scale_fill_viridis(option = "rocket", direction = -1)
```
Let's add a title, assign to an object, and save to a png file.
```{r eval=FALSE, echo=TRUE}
map_1963 <- plot_usmap(data=measles1963df, values = "epi_rate") +
scale_fill_viridis(option = "rocket", direction = -1) +
labs(title = "Incidence Rate of Measles per 100,000 people in 1963")
ggsave(filename = "figures/map_1963.png", plot = map_1963, bg = "white")
```
## Next Steps: From BeginnR to PractitionR
I hope you enjoyed this very brief introduction to R. You may be wondering - where do you go from here?
There are tons of R classes and tutorials on the internet, but the best way to learn R is to use it! I recommend picking a data set and just playing around. There's no harm in making mistakes along the way. It's much easier to find a useful tutorial if you look for ones that teach a specific task you want to accomplish.
Also, check out these helpful resources:
1. [R for Data Science](https://r4ds.hadley.nz/), by Hadley Wickham
1. [Tidyverse documentation](https://www.tidyverse.org/)
1. [R Graph Gallery](https://r-graph-gallery.com/)
1. [R Graphics Cookbook](https://r-graphics.org/)