-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathindex-1.html
More file actions
254 lines (254 loc) · 13.3 KB
/
index-1.html
File metadata and controls
254 lines (254 loc) · 13.3 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><title>알 수 없음 </title></head><body>
<h1 id="gam-smoothing-vs-gps-models">GAM smoothing vs GPS models</h1>
<h2 id="overview">Overview</h2>
<p>There are two parts that are covered:
<em> Weekly average of PM<sub>10</sub> Measurement using GAM
</em> Understanding GPS data</p>
<h2 id="part-1-pm10-measurement-using-gam">Part 1: PM<sub>10</sub> Measurement using GAM</h2>
<p>According to the Datacamp course, a generalized additive model (GAM) is "a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions."</p>
<h3 id="1-personal-rationale">1. Personal rationale</h3>
<p>It is worthwhile to use this approach because the smoothing functions of geo-environmental elements (e.g. DEM, XY coordinates, land use, buildings) might explain the complexity of pollution flow. In R CRAN, there are two packages called called <strong>MAPGAM</strong> and <strong>mgcv</strong>. MAPGAM (Mapping Smoothed Effect Estimates from Individual-Level Data) uses GAM for <em>mapping</em> odds ratios, other effect estimates using individual-level data such as case-control study data, while mgcv focuses on the mathematical perspective of GAM. After trying both, I decided to use mgcv for its miscellaneous applications e.g. convert to raster, analyse GIS files without pre-processing effort (MAPGAM does need some effort to bring shapefiles).</p>
<h3 id="2-gam-formula">2. GAM Formula</h3>
<p>Wiki says...</p>
<blockquote>
<p>The model relates a univariate response variable, Y, to some predictor variables, xi. An exponential family distribution is specified for Y (for example normal, binomial or Poisson distributions) along with a link function g (for example the identity or log functions) relating the expected value of Y to the predictor variables via a structure such as <br><br>
<a data-flickr-embed="true" href="https://www.flickr.com/photos/139313330@N06/46291402691/in/dateposted-public/" title="gam"><img src="https://farm5.staticflickr.com/4823/46291402691_0e11701b66_z.jpg" width="567" height="32" alt="gam"></a> <br><br>
The functions fi may be functions with a specified parametric form (for example a polynomial, or an un-penalized regression spline of a variable) or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means. So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for f1(x1), and then use a factor model for f2(x2). This flexibility to allow non-parametric fits with relaxed assumptions on the actual relationship between response and predictor, provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability.</p>
</blockquote>
<p>https://en.wikipedia.org/wiki/Generalized_additive_model</p>
<h3 id="3-data-preparation">3. Data preparation</h3>
<p>Here is the weekly Average data shown in a data table. Click to enlarge image <br>
<a data-flickr-embed="true" href="https://www.flickr.com/photos/139313330@N06/46291336301/in/dateposted-public/" title="WeeklyAvPM10"><img src="https://farm5.staticflickr.com/4855/46291336301_36c9e8fb51_z.jpg" width="640" height="208" alt="WeeklyAvPM10"></a>
<br/><br/></p>
<p>Note that <code>pm2013_XX</code> refers to: <br>
<em> <code>pm2013_30</code>: 2013 July week 4
</em> <code>pm2013_31</code>: 2013 August week 1
<em> <code>pm2013_32</code>: 2013 August week 2
</em> <code>pm2013_33</code>: 2013 August week 3
<em> <code>pm2013_34</code>: 2013 August week 4
</em> <code>pm2013_35</code>: 2013 September week 1
<em> <code>pm2013_36</code>: 2013 September week 2
</em> <code>pm2013_37</code>: 2013 September week 3
<em> <code>pm2013_38</code>: 2013 September week 4
</em> <code>pm2013_39</code>: 2013 September week 5</p>
<p>Here is a map of 46 stations considered in this study
<a data-flickr-embed="true" href="https://www.flickr.com/photos/139313330@N06/32419692788/in/dateposted-public/" title="Seoul Pollution Station"><img src="https://farm5.staticflickr.com/4816/32419692788_74eab3dbb3_z.jpg" width="400" height="370" alt="Seoul Pollution Station"></a></p>
<h3 id="4-gam-calculation">4. GAM calculation</h3>
<p>I tested the GAM equation by using <code>gam()</code> function. The basic approach is to put the smoothing argument <code>s()</code> to the relevant variables. In my case, I put the <code>s(X,Y)</code> XY coordinates together as an interaction factor, and add a DEM smoother <code>s(DEM)</code>. <br></p>
<p>The <code>"method = REML"</code> is a mathematical option that is used as default. From stackoverflow, REML is an acronym of REstricted Maximum Likelihood. Here is one of the online user's comments:</p>
<blockquote>
<blockquote>
<p>Indeed, the REML estimator of a variance component is usually (approximately) unbiased, while the ML estimator is negatively biased. However, the ML estimator usually has lower mean-squared error (MSE) than the REML estimator. So, if you want to be right on average, go with REML, but you pay for this with larger variability in the estimates. If you want to be closer to the true value on average, go with ML, but you pay for this with negative bias.</p>
</blockquote>
</blockquote>
<p>Since we have only 46 stations, I will leave the method as <code>REML</code>.
<code>r
jul.w4 <- gam(pm2013_30 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
aug.w1 <- gam(pm2013_31 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
aug.w2 <- gam(pm2013_32 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
aug.w3 <- gam(pm2013_33 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
aug.w4 <- gam(pm2013_34 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
sep.w1 <- gam(pm2013_35 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
sep.w2 <- gam(pm2013_36 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
sep.w3 <- gam(pm2013_37 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
sep.w4 <- gam(pm2013_38 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")
sep.w5 <- gam(pm2013_39 ~ s(X, Y) + s(DEM), data = aq.pm10, method = "REML")</code></p>
<p>Here is the R<sup>2</sup> of each model:
<em> July w4: .25
</em> August w1: .23
<em> August w2: .204
</em> August w3: .077
<em> August w4: .232
</em> September w1: .16
<em> September w2: .292
</em> September w3: .208
<em> September w4: .346
</em> September w5: .327</p>
<p>The default model used gaussian distribution for model fitting. However, we can change the options as mentioned below:
<em> <code>binomial(link = "logit")</code>
</em> <code>gaussian(link = "identity")</code>
<em> <code>Gamma(link = "inverse")</code>
</em> <code>inverse.gaussian(link = "1/mu^2")</code>
<em> <code>poisson(link = "log")</code>
</em> <code>quasi(link = "identity", variance = "constant")</code>
<em> <code>quasibinomial(link = "logit")</code>
</em> <code>quasipoisson(link = "log")</code></p>
<p>All options need further investigation, but so far, <code>inverse.gaussian(link = "1/mu^2")</code> got the highest R<sup>2</sup> value. For example, September w5 was increased to .392.</p>
<h3 id="5-gam-on-a-spatial-grid">5. GAM on a spatial grid</h3>
<p>Now we got all the GAM models ready. But how will the model be translated to a map? The answer is to build a grid. As a start, I used 200 x 200 x 200 grid area of X, Y, Z(DEM).</p>
<p>According to Simon Wood's book "Generalized Additive Model 2nd Edition 2017", he noted that the model containing more than one smoothing function(e.g. f(x), f(y)) are only estimable within the additive constant. I will come back to that later, but so far, adding DEM into the model increased the R<sup>2</sup>. Click to enlarge image </p>
<p><a data-flickr-embed="true" href="https://www.flickr.com/photos/139313330@N06/46241500612/in/dateposted-public/" title="Rplot05"><img src="https://farm5.staticflickr.com/4876/46241500612_0c59c97c39_z.jpg" width="640" height="267" alt="Rplot05"></a></p>
<h2 id="part-2-understanding-gps-data">Part 2: Understanding GPS data</h2>
<p>Here is an example of the GPS data. As can be seen below, there are 13 columns in total and all measurements are taken every minute.</p>
<table>
<thead>
<tr>
<th align="center">period</th>
<th align="center">who</th>
<th align="center">group</th>
<th align="center">loc1</th>
<th align="center">loc1_1</th>
<th align="center">time</th>
<th align="center">pm10</th>
<th align="center">pm2_5</th>
<th align="center">pm1</th>
<th align="center">long</th>
<th align="center">lat</th>
<th align="center">X</th>
<th align="left">Y</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">2</td>
<td align="center">1</td>
<td align="center">2</td>
<td align="center">1</td>
<td align="center">32</td>
<td align="center">2013-07-25 20</td>
<td align="center">57</td>
<td align="center">104.6</td>
<td align="center">97</td>
<td align="center">89.7</td>
<td align="center">126.9822844</td>
<td align="center">37.31476048</td>
<td align="left">198429.6307</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">1</td>
<td align="center">2</td>
<td align="center">1</td>
<td align="center">32</td>
<td align="center">2013-07-25 20</td>
<td align="center">48</td>
<td align="center">106.6</td>
<td align="center">83</td>
<td align="center">78.2</td>
<td align="center">126.9822908</td>
<td align="center">37.31488669</td>
<td align="left">198430.1956</td>
</tr>
</tbody>
</table>
<p>Here is a brief code info. I removed data from the pilot study.
<em> Period
* Pilot: 1
* Summer: 2
* Winter: 3
</em> who: researcher 1-5
<em> group: cluster 1-9
</em> loc1(location category)
* indoors(no movement): 1
* outdoors(no movement): 2
* movement: 3
* loc1_1(specific location)
* Restaurant: 10
* Coffee shop: 11
* Bbq grill: 12
* Bar: 13
* Office: 14
* Traditional market: 15
* Superstore: 16
* Department store: 17
* Shopping complex: 18
* Other shops(including convenience store): 19
* Workplace: 20
* Bank: 21
* School: 22
* Academy(educational): 23
* Bookshop: 24
* Rest centre for seniors: 25
* Stroll: 26
* Walking: 27
* Bus: 28
* Subway: 29
* Taxi: 30
* Personal vehicle: 31
* At home: 32
* Missing data: 999</p>
<h3 id="plots-and-stats">Plots and Stats</h3>
<p>This GPS track plot is derived from one researcher who commutes from Suwon to Seoul (1.5 hour distance).
Click to enlarge image.<br>
<a data-flickr-embed="true" href="https://www.flickr.com/photos/139313330@N06/45569159254/in/dateposted-public/" title="������1"><img src="https://farm5.staticflickr.com/4902/45569159254_2d8acb57a9_z.jpg" width="640" height="240" alt="������1"></a></p>
<ul>
<li>July (count by minutes)
<code>PM10 < 50 < 100 < 150 < 200 > 200
0 147 113 40 0 0
1 115 96 28 0 1
2 120 102 18 0 0
3 140 87 12 0 1
4 122 60 58 0 0
5 124 59 57 0 0
6 121 92 27 0 0
7 112 68 60 0 0
8 88 89 63 0 0
9 96 121 19 2 2
10 85 148 5 0 0
11 63 151 7 3 0
12 116 90 20 1 0
13 81 69 9 0 0
14 26 126 6 0 0
15 60 118 2 0 0
16 69 67 0 0 0
17 120 60 0 0 0
18 155 37 1 0 0
19 167 10 2 0 1
20 140 66 3 0 0
21 119 121 25 5 0
22 136 117 47 0 0
23 161 116 23 0 0</code></li>
<li>August (count by minutes)
<code>PM10 < 50 < 100 < 150 < 200 > 200
0 134 4 8 31 3
1 190 1 0 15 0
2 180 0 0 0 0
3 197 1 0 0 1
4 198 2 0 24 16
5 180 0 39 20 1
6 159 9 54 16 2
7 136 25 6 62 0
8 169 18 19 34 0
9 124 64 0 0 0
10 120 60 0 0 0
11 110 63 1 0 0
12 81 95 0 0 1
13 156 47 4 0 0
14 176 42 4 1 1
15 236 4 0 0 0
16 221 17 1 1 0
17 222 13 2 1 0
18 236 4 0 0 0
19 165 33 10 1 12
20 159 18 2 1 0
21 141 9 16 24 19
22 120 0 6 10 44
23 120 0 9 18 33</code></li>
<li>September (count by minutes)
<code>PM10 < 50 < 100 < 150 < 200 > 200
0 134 4 8 31 3
1 190 1 0 15 0
2 180 0 0 0 0
3 197 1 0 0 1
4 198 2 0 24 16
5 180 0 39 20 1
6 159 9 54 16 2
7 136 25 6 62 0
8 169 18 19 34 0
9 124 64 0 0 0
10 120 60 0 0 0
11 110 63 1 0 0
12 81 95 0 0 1
13 156 47 4 0 0
14 176 42 4 1 1
15 236 4 0 0 0
16 221 17 1 1 0
17 222 13 2 1 0
18 236 4 0 0 0
19 165 33 10 1 12
20 159 18 2 1 0
21 141 9 16 24 19
22 120 0 6 10 44
23 120 0 9 18 33</code></li>
</ul>
</body></html>