metabarcoding/02_metacoder_heat_trees.Rmd at main · Ednava/metabarcoding · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
---
title: "02 Metacoder heat trees"
author: "Daniel"
date: "19/09/2022"
output: github_document
#editor_options:
  #chunk_output_type: console
---

This script takes the previously merged and formatted data from mBrave (produced through script 01) and does preliminary analyses with Metacoder.
The idea of this script is to have a preliminary estimate of diversity within the metabarcoding samples. Due to the nature of mBrave's classification associated with BOLD databases, there are a number of BINs which have multiple, uncertain, identifications. I have (manually) cleaned the data further, now we are only including species name for the 1st BOLD BIN hit. This is normally either, the species form the BCI database (the first dataset used to classify through mBRAVE), or the species most frequently 'named' that name. E.g. Agelenopsis pennsylvanica, potteri, emertoni, katsoni, acutosa, is now Agelenopsis pennsylvanica. The file used is the final dataset with ~3500 BINs (finaldat_merged.csv).

```{r message=FALSE, warning=FALSE}
rm(list=ls()) #I always start scritps this way to make sure I have a clean R environment
library('tidyr')
library('dplyr')
library('metacoder')
dat<- read.csv('data/finaldat_merged.csv', header = TRUE)
sample <- read.csv('data/location_ctrl.csv')
obj <- parse_tax_data(dat,
                      class_cols = "classification",
                      class_sep = ";",
                      class_regex = "^([a-z]{0,1})_{0,2}(.*)$",
                      class_key = c("tax_rank" = "taxon_rank", "name" = "taxon_name"))
print(obj)

```
Above is the print out of a taxmap object. The R console output shows that there are 3,651 unique taxa and lists their ID, but this includes many NA's (unassigned taxonomies), Better to look at the tbl_df which shows a 'tibble' of 3,173 rows (assigned classifications) in 43 columns (40 samples + taxon_id, bin_uri and classification). Each row shows how many reads for that taxon was in each sample.

```{r}
obj$data$tax_data <- obj$data$tax_data[c("taxon_id","bin_uri", sample$sampleID)]
obj$data$tax_data <- zero_low_counts(obj, data = "tax_data", min_count = 10, cols= sample$sampleID, other_cols = TRUE)
#zeroing 5811 of 139800 counts with less than 10 reads (depending how stringent this can be changed in the min_count flag above
no_reads <- rowSums(obj$data$tax_data[, sample$sampleID]) == 0
sum(no_reads)
#there are 1259 taxon_id's with less than 10 reads which we will remove in the next step
obj <- filter_obs(obj, data = "tax_data", ! no_reads, drop_taxa = TRUE)
print(obj) #2236 taxa in 40 samples (it says 43 columns because of columns taxon, OTU_ID and bin_uri, the classification column is not really needed right now)
```
We now have cleaned data set (with our chosen minimum number of reads to >10)
We can add new 'tibbles' to obj$data such as proportion of total reads, or abundance
```{r}
obj$data$tax_props <- calc_obs_props(obj, "tax_data", cols= sample$sampleID, other_cols = TRUE)
obj$data$tax_abund <- calc_taxon_abund(obj, "tax_data", cols = sample$sampleID)
print(obj)

```
Notice we now have 3 tibbles, one for counts (tax_data - this could be changed with something like: >names(obj$data) <- "otu_counts"), one for proportion (tax_props) and one for abundance (tax_abund). We can, and will, add more refined data sets afterwards, all linked to the original parsing so we never loose the information of what taxonid actually means.

we can now view distribution of reads in the samples:
```{r HistogramReads}
hist(colSums(obj$data$tax_data[ , sample$sampleID]))
```

Sampling is quite uneven with one samples with < 100,000 reads. We will rarefy our data to simulate even number of reads per sample (18914 is the minimum depth) keep in mind that this discards a lot of data - we could probably instead remove some of the samples with the lowest reads instead.
We can add the rarefied data set as a separate data frame as we did above with tax_props and tax_abund.
If you want, you can print(obj$data$tax_rarefied) to see what it looks like. We can also make a rarefaction curve with metacoder, but look at script 03_diversity_and_ordination where we do rarefaction curves with ggplot instead.

```{r RarefactionCurves}
obj$data$tax_rarefied <- rarefy_obs(obj, "tax_data", cols = sample$sampleID, other_cols = TRUE)
library(vegan)
rarecurve(t(obj$data$tax_data[, sample$sampleID]), step = 1000,
          sample = min(colSums(obj$data$tax_data[, sample$sampleID])),
          col = "blue", cex = 0.8)
```

Below an OPTIONAL chunk to visualize a very rudimentary heat tree for all counts for metabaroded data.
It is meant for illustration purposes only and its not very informative, it also takes a long time to run. See figure allcounts.pdf in the project.

```{r allcounts}
set.seed(99) # This makes the plot appear the same each time it is run
heat_tree(obj,
          node_label = taxon_names,
          node_size = n_obs,
          node_color = n_obs,
          node_size_axis_label = "BIN count",
          node_color_axis_label = "Samples with reads",
          node_label_size_range = c( 0.005, 0.03),
          layout = "davidson-harel", # The primary layout algorithm
          initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations

# this takes > 30 minutes to run
```

Far more informative would be to see the differences between seasons in this case. We can easily calculate the number of samples that have reads for each taxon according to different groups. We can also compare statistically using Wilcoxon Rank Sum test, significant differences between taxa samples in wet or dry season.

```{r warning=FALSE}
obj$data$tax_occ <- calc_n_samples(obj, "tax_abund", groups = sample$SEASON, cols = sample$sampleID)
print(obj$data$tax_occ)

obj$data$diff_table <- compare_groups(obj, data = "tax_abund",
                                      cols = sample$sampleID, # What columns of sample data to use
                                      groups = sample$SEASON) # What category each sample is assigned to
obj <- mutate_obs(obj, "diff_table",
                  wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))
obj$data$diff_table$log2_median_ratio[obj$data$diff_table$wilcox_p_value > 0.05] <- 0 #get significant values for group comparison

```

And we can plot these differences in a metacoder heat tree.

```{r quickHeatTreeMEANDIFFAll}
heat_tree(obj,
          node_label = taxon_names,
          node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
          node_color = log2_median_ratio, # A column from `obj$data$diff_table`
          node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
          node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
          node_color_digits = 1,
          node_size_axis_label = "BIN count",
          node_color_axis_label = "Mean difference in sample proportion",
          node_label_size_range = c( 0.005, 0.03),
          layout = "davidson-harel", # The primary layout algorithm
          initial_layout = "reingold-tilford")

#>30 minutes to run
```

To understand the colouring scheme, read this paragraph carefully from the metacoder tutorial:

### What color corresponds to each group depends on the order they were given in the compare_groups function. Since “leaf” is “treatment_1” in the “diff_table”, and “log2_median_ratio” is defined as “log2(treatment_1 / treatment_2)”, when a taxon has more counts in leaf samples, the ratio is positive, therefore taxa more abundant in leafs are colored magenta in this case.

their code has 'node_color_range = c("cyan", "gray", "magenta")###

If we look at print(obj$data$diff_table) above the plot, we can see that in our case, treatment_1 is 'WET'. The log2 median ratio is defined as"log2(wet / dry). When a taxon has more counts in the wet season, the ratio is positive, therefore taxa more abundant in the wet season are coloured 'steelblue' in our case (not magenta).


But more interesting for us, to separate them according to focal groups. In orange for dry season, in blue for wet season. The following are the differences between seasons using wilcox_p_vlue function we ran on the diff_table above.

```{r HeatTreesSTATDIFFBetweenSeasonsFocalSpecies}
obj %>%
  metacoder::filter_taxa(taxon_names == "Coleoptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names =="Lepidoptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names == "Hemiptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names == "Hymenoptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names == "Diptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")
obj %>%
  metacoder::filter_taxa(taxon_names == "Trichoptera",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

obj %>%
  metacoder::filter_taxa(taxon_names == "Blattodea",#here is to fliter the figure by groups
              subtaxa = TRUE) %>%
  heat_tree(node_label = taxon_names,
            node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
            node_color = log2_median_ratio, # A column from `obj$data$diff_table`
            node_color_trans = 'linear',
            node_color_interval = c(-0.5, 0.5), # The range of `mean_diff` to display
            node_color_range = c("goldenrod", "gray", "steelblue"), # The color palette used
            node_color_digits = 1,
            node_size_axis_label = "BIN count",
            node_color_axis_label = "log 2 ratio of median counts",
            node_label_size_range = c( 0.005, 0.03),
            layout = "davidson-harel", # The primary layout algorithm
            initial_layout = "reingold-tilford")

```
Now we can see some clear differences between seasons. As expected, the wet season has higher diversity compared to the dry season, and here we can see where the differences are concentrated (branches in blue indicate that those taxa are significantly more abundant in the wet season when compared to the dry). Gray branches indicate no significant difference between number of reads for that taxa in wet or dry season.

These are visually interesting results but we still need to compare these data statistically through diversity and ordination analyses which will be covered in the following script.

```{r save_filtered_taxmap_object}

saveRDS(object = obj, file = "data/taxmap_object.rds") #this will save the clean, filtered taxmap object for future uses

```