caa/caa.Rmd at master · nancyum/caa · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
---
title: "caa"
author: Nancy Um
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

This file accompanies the article "What Do We Know about the Future of Art
History? Let’s Start by Looking at Its Past, Sixty Years of Dissertations," by Nancy Um, published in *caa.reviews*. It draws on 17 years of data about art history dissertations drawn from the College Art Association's dissertation roster, published in [caa.reviews](http://www.caareviews.org/dissertations). The steps taken to generate the visualizations are detailed below.

---

## Bringing in the data

The file `caaTOTAL_OR.csv` was harvested from the dissertations roster using the script `caa.py` written by Kenneth Chiu of Binghamton University. It yielded the dissertations that were completed from 2002 to 2018. The dataset was then extensively cleaned with OpenRefine and combined with some of the entries that did not populate due to formatting errors, and had to be input by hand.

```{r}

# Activate packages
library(tidyverse)
library(ggplot2)

# Read in `caaTOTAL_OR.csv`
diss <- read.csv(file = "caaTOTAL_OR.csv")

# Isolate the unique dissertations, eliminating duplicates.
uniq_diss <- arrange(diss, desc(Year), desc(Last.Name), desc(First.Name))
uniq_diss <- distinct(uniq_diss, Last.Name, First.Name, .keep_all = TRUE)

```

## Generating the Visualizations

Each chunk provides the code to generate one of the figures that was published in the *caa.reviews* article.

## Figure 5 - Dissertations by Year, 2002-2018

This visualization explores how many unique dissertations were completed each year.

```{r}

#New df that calculates unique dissertations by year
dissbyyear <- uniq_diss %>%
  group_by(Year) %>%
  summarise(Freq = n())

#Figure 5
fig5 <- ggplot(data=dissbyyear, aes(x=Year, y=Freq, fill = Freq)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=Freq), vjust = 2,color="white") +
  theme(axis.text.x = element_text(angle = 90), axis.title.y = element_blank()) +
  ggtitle("Dissertations by Year, 2002-2018")

#Print fig5
fig5 + guides(fill=guide_legend(title="Range"))

```

## Figure 6 - Dissertations by Institution, 2002-2018

This visualization looks at all seventy-five institutions and how many dissertations were completed at each over the 17-year period.

```{r}

# Calculate the total unique dissertations by institution
dissbyinst <- uniq_diss %>%
  group_by(Institution) %>%
  summarise(Freq = n())

# Figure 6
fig6 <- ggplot(data=dissbyinst, aes(x=Institution, y=Freq, fill = Freq)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90), axis.title.y = element_blank()) +
   coord_flip() +
  ggtitle("Dissertations by Institution, 2002-2018")

# Print fig6
fig6 + guides(fill=guide_legend(title="Range"))  + theme(axis.title.x = element_blank(), axis.title.y = element_blank())
```

## Figure 7 - Institutions with 50 or more dissertations, 2002-2018

This visualization streamlines the data that appeared in figure 6, by filtering only those institutions that completed 50 or more dissertations during the 17-year period.

```{r}

# Filter the institutions that produced 50 or more dissertations during this time period.
dissbyinst <- arrange(dissbyinst, desc(Institution), desc(Freq))
HiFreqInst <- filter(dissbyinst, Freq > 49)

# Figure 7
fig7 <- ggplot(data=HiFreqInst, aes(x = Institution, y = Freq, fill = Freq)) +
    geom_bar(stat = 'identity') +
    coord_flip() +
    geom_text(aes(label=Freq), nudge_y = -8, vjust = .5,color="white", size = 3) +
  theme(axis.text.x = element_text(angle = 90), axis.title.y = element_blank()) +
  ggtitle("Institutions with 50 or more dissertations, 2002-2018")

# Print fig7
fig7 + guides(fill=guide_legend(title="Range")) + theme(axis.title.x = element_blank(), axis.title.y = element_blank())
```

## Figure 8 - Primary Advisers of 17 or more dissertations, 2002-2018

This visualization looks at primary advisors, of which there were 958 in total. It should be noted that some dissertations listed more than one advisor, up to four maximum. The first name listed was taken as the primary advisor in each case. It includes only those advisors who supervised 17 or more dissertations.

```{r}

# Isolate the unique advisers and their most recent institutions. By doing so, each adviser only appears once, even if they have had more than one affiliation over the past 17 years.
uniq_advis <- arrange(diss, desc(Year), Advisor.1, Institution)
uniq_advis <- distinct(uniq_advis, Advisor.1, .keep_all = TRUE)
uniq_advis <- select(uniq_advis, Advisor.1, Institution)

# Make a new dataframe which singles out those who have advised 17 or more dissertations over the past 17 years.
Freqs <- uniq_diss %>%
  group_by(Advisor.1) %>%
  summarise(Freq = n())
Freqs <- arrange(Freqs, desc(Freq))
HiFreqs <- filter(Freqs, Freq > 16)

# Join the two dataframes so that each of the top advisers appears with their most recent institutional affiliation.
HiFreqTotal <- left_join(HiFreqs, uniq_advis, by="Advisor.1")

# For a proper visualization, change the order of the names on the y axis. They need to be reordered alphabetically by last name, not by first initial. Need to perform two regex functions to reorder them properly.
HiFreqOrder <- mutate(HiFreqTotal, advis_last = Advisor.1)
HiFreqOrder <- mutate(HiFreqOrder, advis_last = str_replace(HiFreqOrder$advis_last, "([A-Z]\\.)", ""))
HiFreqOrder <- mutate(HiFreqOrder, advis_last = str_replace(HiFreqOrder$advis_last, "([A-Z]\\-)", ""))
HiFreqOrder <- arrange(HiFreqOrder, advis_last)
HiFreqOrder <- mutate(HiFreqOrder, Num=row_number())

# This line of script forces a new order of the y axis, so that the elements are listed alphabetically by last name, rather than by first initial.
HiFreqOrder$Advisor.1 <- factor(HiFreqOrder$Advisor.1, levels = HiFreqOrder$Advisor.1[order(HiFreqOrder$Num)])

# Figure 8
fig8 <- ggplot(data = HiFreqOrder, aes(x= Advisor.1, y = Freq, fill = Institution)) +
  geom_bar(stat = 'identity') +
    coord_flip()+
  geom_text(aes(label=Institution), nudge_y = -3, vjust = .5, color = "white", size = 3.5) +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
  ggtitle("Primary Advisers of 17 or more dissertations, 2002-2018")

# Print fig8
fig8 + theme(axis.title.x = element_blank(), axis.title.y = element_blank())
```


## Figure 9 - Top Terms in Art History Dissertation Titles, 2002-2018

For this visualization, all of the unique dissertation titles were analyzed for the most commonly appearing terms in them. Stopwords were removed and only those that were featured 50 times or more were included (although a few abbreviations, such as ca and de, were excluded).

```{r}

# Activate packages
library(tokenizers)
library(dplyr)
library(stringr)
library(tidytext)
library(tibble)
library(tidyr)

# Start by tokenizing
analysis <- mutate(uniq_diss, word = tokenize_words(as.character(Title)))

# Unnest tokens so that each word has its own row.
analysis <- unnest(analysis, word)

# Each word has its own row and is attached to the metadata. Calculate the words by frequency of occurrence.
analysis2 <- analysis %>%
  group_by(word)%>%
  summarize(count = n()) %>%
  arrange(-count)

# Read in the stopword list
stopwordlist <- stop_words %>%
  filter(lexicon == "SMART") %>%
  select(word)

# Anti-join the stopword list with the dissertation terms to eliminate stopwords.
analysis3 <- anti_join(analysis2, stopwordlist, by="word") %>%
  arrange(-count) %>%
  top_n(50)

# Filter out incomplete words like "de" and "ca"
analysis3 <- filter(analysis3, word != "de")
analysis3 <- filter(analysis3, word != "ca")

# Figure 9
ggplot(analysis3, aes(x = word, y = count, color = word)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90), axis.title.x = element_blank(), axis.title.y = element_blank(), legend.position = "none") +
  coord_flip() +
   ggtitle("Top Terms in Art History Dissertation Titles, 2002-2018")

```

The final set of visualizations are based on the tricky category of subject areas, which changed frequently during the 17 years under consideration. In order to make sense of the various categories, we need to divide them up into the three different grouping rubrics, Chronology, Geography, and Subject Area. For more information on these rubrics see the [Dissertation Submission Guidelines](http://www.caareviews.org/about/dissertations). Each subject field was coded in the file `subjects.csv`.

```{r}

# Code the subjects according to rubric and then read the file `subjects.csv` back in.
subjectscoded <- read.csv(file = "subjects.csv")

#Join it with the diss dataset, so that a new column called caaCAT appears.
diss <- left_join(diss, subjectscoded, by="Subject")

```

## Figure 10

```{r}

# Filter only the chronological categories
ChronologyOnly <- filter(diss, caaCAT == "Chronology")

# Figure 10
ggplot(ChronologyOnly, aes(x=Year, y=SubjectReconciled)) +
   theme(axis.text.x = element_text(angle = 90), axis.title.y = element_blank()) +
  geom_count()

```

## Figure 11

```{r}

# New df that includes only geographic fields
GeographyOnly <- filter(diss, caaCAT == "Geographic")

# Figure11
ggplot(GeographyOnly, aes(x=Year, y=SubjectReconciled)) +
   theme(axis.text.x = element_text(angle = 90), axis.title.y = element_blank()) +
  geom_count()

```

## Figure 12

```{r}

# New df which includes only subject fields
SubjectOnly <- filter(diss, caaCAT == "Subject")

# Figure 12
ggplot(SubjectOnly, aes(x=Year, y=SubjectReconciled)) +
   theme(axis.text.x = element_text(angle = 90), axis.title.y = element_blank()) +
  geom_count()

```