-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathMCA_notes.txt
More file actions
55 lines (49 loc) · 2.79 KB
/
MCA_notes.txt
File metadata and controls
55 lines (49 loc) · 2.79 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
Notes on MCA
In order to reduce the data load, we need to find an efficient way of doing MCA over LOTS of texts.
This should be alright, because MCA is just CA over the indicator or Burt matrix of the data (http://statmath.wu.ac.at/courses/CAandRelMeth/CARME5.pdf).
The Burt matrix is a JxJ matrix with all levels of all variables = J (does this mean there has to be a level for each occurrence AND each non-occurrence?)
It should be possible to continuously update this matrix as new texts are processed.
Existing implementations of MCA in Python do not seem to directly work, but we can dynamically update the Burt matrix and then do CA on it.
The code below (https://github.com/MaxHalford/prince#correspondence-analysis-ca) does just that.
I have used a dummy Burt matrix with large numbers in the cells (hundreds of billions), and that does not seem to cause computational problems.
According Wikipedia (https://en.wikipedia.org/wiki/Multiple_correspondence_analysis):
"Analyzing the Burt table is a more natural generalization of simple correspondence analysis, and individuals or the means of groups of individuals can be added as supplementary points to the graphical display."
import pandas as pd
pd.set_option('display.float_format', lambda x: '{:.6f}'.format(x))
X = pd.DataFrame(
data=[
[1190000000000,0,0,0,0,270000000000,280000000000,300000000000,220000000000,120000000000],
[0,3220000000000,0,0,0,380000000000,740000000000,840000000000,960000000000,300000000000],
[0,0,2040000000000,0,0,30000000000,480000000000,630000000000,730000000000,170000000000],
[0,0,0,1780000000000,0,30000000000,210000000000,230000000000,790000000000,520000000000],
[0,0,0,0,480000000000,0,30000000000,50000000000,110000000000,290000000000],
[270000000000,280000000000,30000000000,30000000000,0,710000000000,0,0,0,0],
[280000000000,740000000000,480000000000,210000000000,30000000000,0,1740000000000,0,0,0],
[300000000000,840000000000,630000000000,230000000000,50000000000,0,0,2030000000000,0,0],
[220000000000,960000000000,730000000000,790000000000,110000000000,0,0,0,2810000000000,0],
[120000000000,300000000000,170000000000,520000000000,290000000000,0,0,0,0,1400000000000]
],
columns=pd.Series(['V1-1', 'V1-2', 'V1-3', 'V1-4', 'V1-5', 'V2-1', 'V2-2', 'V2-3', 'V2-4', 'V2-5']),
index=pd.Series(['V1-1', 'V1-2', 'V1-3', 'V1-4', 'V1-5', 'V2-1', 'V2-2', 'V2-3', 'V2-4', 'V2-5'])
)
import prince
ca = prince.CA(
n_components=2,
n_iter=3,
copy=True,
check_input=True,
engine='auto',
random_state=42
)
X.columns.rename('Variables', inplace=True)
X.index.rename('Variables', inplace=True)
ca = ca.fit(X)
ax = ca.plot_coordinates(
X=X,
ax=None,
figsize=(6, 6),
x_component=0,
y_component=1,
show_row_labels=True,
show_col_labels=True
)