online_learning_data_analysis_toolkit/classification_algorithms.Rmd at master · elenase/online_learning_data_analysis_toolkit · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
---
title: "Classification & Clustering"
output: html_document
runtime: shiny
---

***
## General Information


> Classification is a major step in creating maps. Behind every classification mechanism (that might be done automatically by most GIS Software) a complex mathematical algorithm is used.


> Regarding to which classification algorithm you choose the product will look totally different as the data breaks between the classes change. To decide which classification is the best you should understand what happens behind.

## Understanding Classification in ArcGIS

> [Explanation of automated classification methods in ArcGIS](https://pro.arcgis.com/en/pro-app/help/mapping/symbols-and-styles/data-classification-methods.htm)

## Unsupervised Classification Algorithms
> Explanation

  * [Clustering: Basics](https://www.youtube.com/watch?v=6R16reLVl3I)
  * [K-Means](https://www.youtube.com/watch?v=_aWzGGNrcic)


    + PRO: Entire feature space can be classified
    + CONTRA: Description of cluster determined by radial shape, starting points must be selected as cluster center

  * [Mean Shift](https://www.youtube.com/watch?v=kmaQAsotT9s)
  * [Hierarchical Clustering](https://www.youtube.com/watch?v=OcoE7JlbXvY)

  Hierarchical clustering creates a [Dendrogram](http://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Hierarchical_Clustering-Dendrograms.pdf) which is a tree diagram. It shows the euklidian distance between data points. The Dendrogram can be used to decide how many clusters you want to create. You can choose between different possibilities to define the distance between clusters:


      + Single link = Minimal euklidian distance
      + Average link = Average euklidian distance
      + Complete link = Maximum euklidian distance

  * [Mixture of Gaussian](https://www.youtube.com/watch?v=qMTuMa86NzU)

## Supervised Classification Algorithms
> Explanation

  * [Naive Bayes](https://www.youtube.com/watch?v=OqmJhPQYRc8)
PROBLEM: Classes often overlap
IDEA of Naive Bayes Theorem: Probability maximisation of correct classification of each data point
GIVEN PARAMETER:

    + A-priori probability that each data point is realated to a specific class
    + Feature probability that one data point within a specific class has a specific feature vector

  * [Maximum Likelihood Classification: Used if a-priori probability isn't given](https://www.youtube.com/watch?v=RPtYRm2tboA)
  * [K-nn = K-nearest neighbor](https://www.youtube.com/watch?v=k_7gMp5wh5A)

A [Voronoi Diagramm](http://www.pi6.fernuni-hagen.de/downloads/publ/tr198.pdf) decribes areas, that are nearest to a certain data point (within a defined distance)
1. For the k-nn classification you choose an uneven number of neighbors=k
2. The algorithm starts at a random point
3. This starting point gets the value that most appears in the "neighborhood" (for that reason k should always be uneven)


    + PRO: Easy concept
    + CONTRA: Distance parameter often difficult

  * [Support Vektor Machines (SVM)](https://www.youtube.com/watch?v=1NxnPkZM9bc)

    + PRO: Unlinear classification possible, less parameter required, no a-priori knowledge required, applicable for large feature space
    + CONTRA: Slow classification

  * [AdaBoost](https://www.youtube.com/watch?v=ix6IvwbVpw0)
is a combination of multiple weak classifiers to build one strong classifier