demo-R-iris/iris.Rmd at master · phively/demo-R-iris · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
---
title: "R Markdown demo with iris dataset"
author: "Paul Hively"
date: "March 29, 2016"
output: html_document
---

Experimenting with R Markdown. The source is available [here](https://github.com/phively/demo-R-iris/blob/master/iris.Rmd). We'll use the iris dataset:

```{r}
## Read in and summarize the data
data(iris)
str(iris)
summary(iris)
```

We see that there are 150 total observations of anatomical characteristics of three iris species. A visualization using `ggplot2`:

```{r correlations, echo=F}
library(ggplot2) #needs to be installed if unavailable
library(grid)
library(gridExtra) #lay out graphics in a matrix

# ggplot objects
g1 <- ggplot(data=iris, aes(y=Sepal.Width, x=Sepal.Length, color=Species)) + geom_point(alpha=.5)
g2 <- ggplot(data=iris, aes(y=Petal.Length, x=Sepal.Length, color=Species)) + geom_point(alpha=.5)
g3 <- ggplot(data=iris, aes(y=Petal.Width, x=Sepal.Length, color=Species)) + geom_point(alpha=.5)
g4 <- ggplot(data=iris, aes(y=Petal.Length, x=Sepal.Width, color=Species)) + geom_point(alpha=.5)
g5 <- ggplot(data=iris, aes(y=Petal.Width, x=Sepal.Width, color=Species)) + geom_point(alpha=.5)
g6 <- ggplot(data=iris, aes(y=Petal.Width, x=Petal.Length, color=Species)) + geom_point(alpha=.5)
gbl <- ggplot(data=iris) + theme(panel.background = element_rect(fill="white"))

# Share a legend between ggplot objects; see https://github.com/hadley/ggplot2/wiki/Share-a-legend-between-two-ggplot2-graphs
grid_arrange_shared_legend <- function(...) {
    plots <- list(...)
    g <- ggplotGrob(plots[[1]] + theme(legend.position="bottom"))$grobs
    legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
    lheight <- sum(legend$height)
    grid.arrange(
        do.call(arrangeGrob, lapply(plots, function(x)
            x + theme(legend.position="none"))),
        legend,
        ncol = 1,
        heights = unit.c(unit(1, "npc") - lheight, lheight))
}

# Arrange in a grid
grid_arrange_shared_legend(g1, gbl, gbl, g2, g4, gbl, g3, g5, g6)
```

Observations:

  - The *setosa* species is not like the others; it can be distinguished just on the basis of petal size
  - It doesn't look like *versicolor* and *virginica* can be completely separated

Try the linear discriminant as implemented in `MASS`.

$$h(x)=\operatorname{argmax}\Big[\hat{\mu}_{c}^{T}\Sigma^{-1}x-\frac{1}{2}\hat{\mu}_{c}^{T}\hat\Sigma^{-1}{\mu}_{c}\Big]$$

```{r}
library(MASS)
(iris.lda <- lda(Species ~ ., data=iris)) #enclosing in () makes R print the output
# Confusion matrix
table(Predicted=predict(iris.lda, iris[,1:4])$class,
      Actual=iris$Species)
```

Not half bad. But from the scatter plots we can get away with just petal length and width:

```{r}
(iris.lda <- lda(Species ~ Petal.Length + Petal.Width, data=iris))
# Confusion matrix
table(Predicted=predict(iris.lda, iris[,1:4])$class,
      Actual=iris$Species)
```

Not quite as clean, but cutting in half the number of features to be measured is (probably?) a win. Let's visualize the (approximate) decision boundaries:

```{r centered-boundaries, echo=F}
library(ggplot2)

# Function to manually calculate the LDA coefficients (is there a way to do this from lda()?)
cf <- function(x1, y1, x2, y2){
  center = c((x1+x2)/2, (y1+y2)/2)
  slope = -1/((y2-y1)/(x2-x1))
  intercept = center[2] - slope*center[1]
  return(list(center=center, slope=slope, intercept=intercept))
}

# Coefficients
beta1 <- cf(iris.lda$means["setosa",2], iris.lda$means["setosa",1], iris.lda$means["versicolor",2], iris.lda$means["versicolor",1])
beta2 <- cf(iris.lda$means["versicolor",2], iris.lda$means["versicolor",1], iris.lda$means["virginica",2], iris.lda$means["virginica",1])
# Plot
ggplot(data=iris, aes(x=Petal.Width, y=Petal.Length, color=Species)) + geom_point(alpha=.5) +
  # Connect the center of classes 1 and 2
  geom_segment(aes(x=iris.lda$means["setosa",2], xend=iris.lda$means["versicolor",2], y=iris.lda$means["setosa",1],  yend=iris.lda$means["versicolor",1]), linetype="dotted") +
  # Connect the center of classes 2 and 3
  geom_segment(aes(x=iris.lda$means["versicolor",2], xend=iris.lda$means["virginica",2], y=iris.lda$means["versicolor",1], yend=iris.lda$means["virginica",1]), linetype="dotted") +
  # Decision boundaries
  geom_abline(intercept=beta1$intercept, slope=beta1$slope, alpha=.25) + #class 1-2 boundary
  geom_abline(intercept=beta2$intercept, slope=beta2$slope, alpha=.25) + #class 2-3 boundary
  coord_fixed() #fix the aspect ratio
```

The darker points indicate overplotting, so the visual count of errors does line up with the confusion matrix above.

To show the actual boundaries I'd need to account for the relative covariances of the classes; not sure how to grab this from the model object but it can be done by hand. Doesn't seem worthwhile for this example, though, so I'll just trust `MASS` is doing its job, right?