\documentclass[11pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{lmodern}
\usepackage{microtype}
\usepackage{amsmath,amssymb,amsthm,mathtools,bm}
\usepackage[a4paper,margin=1in]{geometry}
\usepackage{enumitem}
\usepackage{hyperref}
\usepackage{booktabs}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\be}{{\bm{\eta}}}
\newcommand{\bx}{\bm{x}}
\newcommand{\bT}{\bm{T}}
\newcommand{\bNabla}{\bm{\nabla}}
\begin{document}
\section{Gradient descent, natural gradient descent, mirror gradient descent, and curved families}
We work in the ambient exponential family
\[
p(\bx\mid \be)=h(\bx)\exp\!\big(\langle \be,\bT(\bx)\rangle-A(\be)\big),
\qquad
\be\in\mathcal N\subset\R^d,
\]
with dual coordinates
\[
\be^\ast=\bNabla A(\be)\in\mathcal N^\ast,
\qquad
\be=\bNabla A^\ast(\be^\ast).
\]
For data \(\bx_1,\dots,\bx_N\), define
\[
\bar{\bT}=\frac{1}{N}\sum_{n=1}^N \bT(\bx_n),
\]
and consider the negative log-likelihood
\[
L(\be)=A(\be)-\langle \be,\bar{\bT}\rangle.
\]
Its gradient is
\[
\bNabla L(\be)=\be^\ast-\bar{\bT},
\]
so the optimum is characterized by moment matching,
\[
\be^\ast=\bar{\bT}.
\]
\subsection{Gradient descent}
Ordinary gradient descent (GD) updates \(\be\) by subtracting the gradient in the chosen coordinates:
\[
\be^i_{t+1}
=
\be^i_t-\alpha_t\,\delta^{ij}\partial_jL(\be_t).
\]
Equivalently,
\[
\be_{t+1}
=
\arg\min_{\be}
\left\{
\langle \bNabla_\be L(\be_t),\be-\be_t\rangle
+\frac{1}{2\alpha_t}\|\be-\be_t\|^2
\right\}.
\]
This uses the Euclidean metric \(\delta_{ij}\) as an external choice. In index language, \(\partial_jL\) is naturally a covector, and the update inserts \(\delta^{ij}\) to force it into a vector. Ordinary GD is therefore coordinate-dependent.
For the exponential-family likelihood this becomes
\[
\be^i_{t+1}
=
\be^i_t-\alpha_t\,\delta^{ij}\bigl(\be^\ast_{t,j}-\bar T_j\bigr).
\]
It does not reflect the underlying geometry of the family.
\subsection{Natural gradient descent}
The exponential family carries the Fisher metric
\[
g_{ij}(\be)=\partial_i\partial_jA(\be),
\qquad
g^{ij}(\be^\ast)=\partial^{\ast i}\partial^{\ast j}A^\ast(\be^\ast).
\]
Natural gradient descent (NGD) replaces the Euclidean metric by this intrinsic one:
\[
\be^i_{t+1}
=
\be^i_t-\alpha_t\,g^{ij}(\be_t)\,\partial_jL(\be_t).
\]
Equivalently,
\[
\be_{t+1}
=
\arg\min_{\be}
\left\{
\langle \bNabla_\be L(\be_t),\be-\be_t\rangle
+\frac{1}{2\alpha_t}
(\be-\be_t)^i g_{ij}(\be_t)(\be-\be_t)^j
\right\}.
\]
The metric is now intrinsic, but the step is still a local linear update in the primal coordinates. The natural-gradient direction is reparametrization invariant, while the finite-step update remains a local discretization.
For the exponential-family likelihood,
\[
\be^i_{t+1}
=
\be^i_t-\alpha_t\,g^{ij}(\be_t)\bigl(\be^\ast_{t,j}-\bar T_j\bigr).
\]
Here the tangent plane of the statistical manifold serves as a guide, but the linear approximation is not guaranteed to be accurate, nor to respect parameter bounds.
\subsection{Mirror gradient descent}
Mirror gradient descent (MGD) replaces the Euclidean quadratic penalty by the Bregman divergence generated by \(A\),
\[
D_A(\be'\|\be)=A(\be')-A(\be)-\langle \bNabla A(\be),\be'-\be\rangle.
\]
The update is
\[
\be_{t+1}
=
\arg\min_{\be}
\left\{
\langle \bNabla_\be L(\be_t),\be-\be_t\rangle
+\frac{1}{\alpha_t}D_A(\be\|\be_t)
\right\}.
\]
The first-order condition induced by the Bregman divergence is
\[
\bNabla A(\be_{t+1})=\bNabla A(\be_t)-\alpha_t\,\bNabla_\be L(\be_t),
\]
that is,
\[
\be^\ast_{t+1}=\be^\ast_t-\alpha_t\,\bNabla_\be L(\be_t).
\]
Thus MGD induces a linear update in the dual coordinates \(\be^\ast\). It uses the covector \(\partial_iL\) with its correct index position, without inserting \(\delta^{ij}\) or \(g^{ij}\) by hand. Of course, the dual mean coordinates could have been chosen without any information-geometric background; what we have obtained, however, is an answer to the question that ordinary GD cannot answer, namely which coordinate system to choose. In the exponential family the dual coordinates are flat, so geodesic projections reduce to linear equations; non-trivial problems arise through the embedding curvature of curved exponential (sub-)families.
For the exponential-family likelihood one obtains
\[
\be^\ast_{t+1}
=
\be^\ast_t-\alpha_t(\be^\ast_t-\bar{\bT})
=
(1-\alpha_t)\be^\ast_t+\alpha_t\bar{\bT}.
\]
Hence MGD moves along the \(m\)-geodesic joining \(\be_t^\ast\) and \(\bar{\bT}\), and the next primal iterate is
\[
\be_{t+1}=\bNabla A^\ast(\be^\ast_{t+1}).
\]
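Before moving on to the boosted versions, the three update rules can be compared concretely. The following Python sketch (illustrative only, not part of the derivation) instantiates them for the one-dimensional Poisson family, where \(A(\eta)=e^\eta\), \(\eta^\ast=\mu=e^\eta\), and \(g(\eta)=e^\eta\); all three iterations converge to the moment-matching optimum, but only MGD is exactly linear in the dual coordinate.
\begin{verbatim}
import numpy as np

# Illustrative 1-D Poisson family: A(eta) = exp(eta), so eta* = mu = exp(eta)
# and the Fisher metric is g(eta) = exp(eta).
T_bar = 3.0                  # mean sufficient statistic of the data
alpha = 0.3                  # constant learning rate
eta_gd, eta_ngd, mu_mgd = 0.0, 0.0, 1.0

for t in range(50):
    # GD: subtract the Euclidean gradient eta* - T_bar in the natural coordinate.
    eta_gd -= alpha * (np.exp(eta_gd) - T_bar)
    # NGD: precondition the same gradient with g(eta)^{-1} = exp(-eta).
    eta_ngd -= alpha * np.exp(-eta_ngd) * (np.exp(eta_ngd) - T_bar)
    # MGD: exact linear interpolation toward T_bar in the dual coordinate.
    mu_mgd = (1.0 - alpha) * mu_mgd + alpha * T_bar

# All three dual iterates approach T_bar = 3.0.
print(np.exp(eta_gd), np.exp(eta_ngd), mu_mgd)
\end{verbatim}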
\subsection{Boosting of MGD}
We now formulate a boosting algorithm that approximates mirror gradient descent by updating the dual coordinate \(\eta^\ast(x)\) directly. The construction separates three levels:
\begin{enumerate}[leftmargin=2em]
\item the population objective under the true data-generating distribution \(P\),
\item the exact MGD flow in the dual coordinate \(\eta^\ast\),
\item a weighted empirical tree algorithm based on unbiased pseudo-responses.
\end{enumerate}
Throughout, expectations under the true distribution are written as \(\mathbb E[\cdot]\), while weighted empirical averages are
\[
\mathbb E_n^w[g]
:=
\frac{1}{W}\sum_{i=1}^n w_i\,g_i,
\qquad
W:=\sum_{i=1}^n w_i,
\qquad
w_i\ge 0.
\]
\paragraph{Population optimum.}
Consider a conditional exponential family
\[
p(y\mid x,\eta(x))
=
h(y)\exp\!\big(\langle \eta(x),T(y)\rangle-A(\eta(x))\big),
\]
with dual coordinate
\[
\eta^\ast(x)=\nabla A(\eta(x)).
\]
The population negative log-likelihood is
\[
\mathcal L[\eta]
=
\mathbb E\!\left[A(\eta(X))-\langle \eta(X),T(Y)\rangle-\log h(Y)\right].
\]
For an infinitely expressive model, the minimizer is pointwise and satisfies
\[
\eta^\ast(x)=\mathbb E[T(Y)\mid X=x].
\]
Thus the optimal dual coordinate is the conditional mean of the sufficient statistic under the true distribution.
\paragraph{Population MGD.}
At iteration \(t\), bias the local objective by the Bregman divergence to the previous iterate:
\[
\eta_{t+1}(x)
=
\arg\min_{\eta}
\left\{
A(\eta)-\langle \eta,\eta^\ast(x)\rangle
+\frac{1}{\alpha}D_A(\eta\|\eta_t(x))
\right\},
\]
with constant learning rate \(\alpha\in(0,1]\). The first-order condition gives
\[
\eta^\ast_{t+1}(x)
=
(1-\alpha)\eta^\ast_t(x)+\alpha\,\eta^\ast(x).
\]
Equivalently,
\[
\eta^\ast_{t+1}(x)-\eta^\ast_t(x)
=
\alpha\big(\eta^\ast(x)-\eta^\ast_t(x)\big).
\]
So the exact mirror-descent direction in dual coordinates is
\[
r_t(x):=\eta^\ast(x)-\eta^\ast_t(x).
\]
\paragraph{Residual sufficient statistic as an unbiased target.}
Since
\[
\eta^\ast(x)=\mathbb E[T(Y)\mid X=x],
\]
the residual sufficient statistic of the current fitted model \(\hat\eta_t\),
\[
R_t:=T(Y)-\hat\eta_t^\ast(X)
\]
satisfies
\[
\mathbb E[R_t\mid X=x]
=
\eta^\ast(x)-\hat\eta_t^\ast(x).
\]
Thus \(R_t\) is an unbiased estimator of the exact dual MGD direction.
This is the key observation: one can approximate the population MGD step by learning the conditional mean of \(R_t\) and updating \(\hat\eta_t^\ast\) additively.
\paragraph{Infinitely expressive regression class.}
If the regression class is rich enough to fit the conditional mean exactly,
\[
f_{t+1}(x)=\mathbb E[R_t\mid X=x]
=\eta^\ast(x)-\hat\eta_t^\ast(x),
\]
then the additive update
\[
\hat\eta^\ast_{t+1}(x)=\hat\eta^\ast_t(x)+\alpha\,f_{t+1}(x)
\]
becomes
\[
\hat\eta^\ast_{t+1}(x)
=
(1-\alpha)\hat\eta^\ast_t(x)+\alpha\,\eta^\ast(x),
\]
which is exactly the population MGD recursion. Hence the construction is exact in the infinitely expressive limit.
\paragraph{Weighted empirical boosting algorithm.}
Let \((x_i,y_i,w_i)\) be a weighted sample. Initialize
\[
\hat\eta_0^\ast(x)=m_0,
\]
with \(m_0\in\mathcal N^\ast\) any admissible baseline (often \(m_0=0\) when \(0\in\mathcal N^\ast\)). Define the pseudo-response at round \(t\) by
\[
R_i^{(t)}:=T(y_i)-\hat\eta_t^\ast(x_i).
\]
Then
\[
\mathbb E[R_i^{(t)}\mid x_i]
=
\eta^\ast(x_i)-\hat\eta_t^\ast(x_i).
\]
Fit a tree \(f_{t+1}\) to the pairs \((x_i,R_i^{(t)})\) with event weights \(w_i\), using any regression criterion whose population minimizer is the conditional mean. The simplest choice is weighted least squares,
\[
f_{t+1}
=
\arg\min_{f\in\mathcal F_{\rm tree}}
\sum_{i=1}^n w_i\,\|R_i^{(t)}-f(x_i)\|^2.
\]
For a leaf \(J\), the optimal constant prediction is the weighted mean residual
\[
f_{t+1,J}
=
\frac{\sum_{i\in J} w_i\,R_i^{(t)}}{\sum_{i\in J} w_i}.
\]
Update
\[
\hat\eta_{t+1}^\ast(x)=\hat\eta_t^\ast(x)+\alpha\,f_{t+1}(x),
\]
or, eventwise,
\[
R_i^{(t+1)}=R_i^{(t)}-\alpha\,f_{t+1}(x_i).
\]
After \(B\) rounds,
\[
\hat\eta_B^\ast(x)=m_0+\sum_{b=1}^B \alpha\,f_b(x),
\qquad
\hat\eta_B(x)=\nabla A^\ast(\hat\eta_B^\ast(x)).
\]
\paragraph{Why this is not natural gradient descent in disguise.}
This construction does \emph{not} reintroduce natural gradient descent through the back door. It starts from the exact mirror-descent recursion
\[
\eta^\ast_{t+1}
=
(1-\alpha)\eta^\ast_t+\alpha\,\eta^\ast
\]
in the distinguished coordinate system \(\eta^\ast\) selected by the Bregman geometry of the family. The residual target
\[
T(Y)-\hat\eta_t^\ast(X)
\]
is an unbiased estimator of the \emph{dual} MGD direction itself. It is not, in general, the first term in a Taylor expansion of the exact nonlinear leafwise problem in the primal coordinate \(\eta\), nor does it amount to Fisher-preconditioned gradient descent in \(\eta\). Only in the infinitesimal-step regime do all first-order schemes coincide locally. Here the learning rate is applied directly in the dual coordinate \(\eta^\ast\), and the update rule is globally valid throughout the exponential family because it is induced by the mirror map \(\eta^\ast=\nabla A(\eta)\).
\paragraph{Split criterion from leaf predictions: empirical algorithm.}
The fitted leaf values immediately define the split criterion. Let \(J\) be a parent node and \(J=L\cup R\) a candidate split. Define
\[
W_C:=\sum_{i\in C} w_i,
\qquad
\bar R_C^{(t)}
:=
\frac{1}{W_C}\sum_{i\in C} w_i\,R_i^{(t)},
\qquad C\in\{J,L,R\}.
\]
Then the optimal constant predictions are exactly
\[
f_{t+1,C}=\bar R_C^{(t)}.
\]
For weighted least squares, the reduction in within-node residual loss is
\[
\mathrm{Gain}^{\rm reg}_t(J\to L,R)
=
W_L\|\bar R_L^{(t)}\|^2
+
W_R\|\bar R_R^{(t)}\|^2
-
W_J\|\bar R_J^{(t)}\|^2.
\]
So the split can be chosen using the child leaf predictions alone. This is the direct vector-valued analogue of the usual CART variance reduction criterion.
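As a direct transcription, the following Python sketch evaluates this gain for one candidate split; the function name and array layout are illustrative assumptions.
\begin{verbatim}
import numpy as np

def reg_gain(R, w, left_mask):
    """Weighted vector-valued CART gain for splitting node J into L and R:
    Gain = W_L ||Rbar_L||^2 + W_R ||Rbar_R||^2 - W_J ||Rbar_J||^2.
    R: (n, d) residuals in the node, w: (n,) weights, left_mask: (n,) bool."""
    def term(sel):
        W_C = w[sel].sum()
        R_bar = (w[sel][:, None] * R[sel]).sum(axis=0) / W_C
        return W_C * (R_bar ** 2).sum()
    all_mask = np.ones(len(w), dtype=bool)
    return term(left_mask) + term(~left_mask) - term(all_mask)
\end{verbatim}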
\paragraph{Split criterion in the infinitely expressive limit.}
At population level, define
\[
r_C^{(t)}
:=
\mathbb E[R_t\mid X\in C]
=
\mathbb E[T(Y)-\eta_t^\ast(X)\mid X\in C],
\qquad
p_C:=P_X(X\in C).
\]
Then the best constant predictor on \(C\) is \(r_C^{(t)}\), and the population split gain is
\[
\mathrm{Gain}^{\rm reg}_{P,t}(J\to L,R)
=
p_L\|r_L^{(t)}\|^2
+
p_R\|r_R^{(t)}\|^2
-
p_J\|r_J^{(t)}\|^2.
\]
Thus the empirical split rule above consistently estimates the corresponding population improvement in the dual MGD direction.
\paragraph{Entropy and the exact leafwise NLL criterion.}
There is a second, complementary viewpoint. If one fits a tree \emph{exactly} by weighted negative log-likelihood on a fixed partition \(\mathcal P=\{J\}\), then in each leaf
\[
\eta_J^\ast
=
\frac{\sum_{i\in J} w_i\,T(y_i)}{\sum_{i\in J} w_i}.
\]
Inserting this back gives, up to the carrier term,
\[
\mathbb E_n^w\!\big[A(\eta(X))-\langle \eta(X),T(Y)\rangle\big]
=
-\sum_{J\in\mathcal P}\frac{W_J}{W}\,A^\ast(\eta_J^\ast).
\]
At population level the analogous expression is
\[
-\sum_{J\in\mathcal P} P_X(J)\,A^\ast\!\big(\mathbb E[T(Y)\mid X\in J]\big).
\]
So, for an exactly fitted tree, the optimized negative log-likelihood is a phase-space average of \(-A^\ast(\eta^\ast)\). This is the generalized entropy story: \(-A^\ast(\eta^\ast)\) is the non-carrier part of the entropy in the exponential family.
Accordingly, an exact leafwise split gain is
\[
\mathrm{Gain}^{\rm ent}(J\to L,R)
=
\frac{W_L}{W}A^\ast(\eta_L^\ast)
+
\frac{W_R}{W}A^\ast(\eta_R^\ast)
-
\frac{W_J}{W}A^\ast(\eta_J^\ast)
\]
empirically, and
\[
\mathrm{Gain}^{\rm ent}_P(J\to L,R)
=
p_L A^\ast(\eta_L^\ast)
+
p_R A^\ast(\eta_R^\ast)
-
p_J A^\ast(\eta_J^\ast)
\]
at population level, where
\[
\eta_C^\ast=\mathbb E[T(Y)\mid X\in C].
\]
This entropy gain is exact for a fully refit tree on a given partition and, in particular, for the first tree when no previous prediction is present. The residual-boosting algorithm above should be understood as a stagewise approximation to the global MGD flow, with \(\mathrm{Gain}^{\rm reg}\) as the practical split rule and \(\mathrm{Gain}^{\rm ent}\) as the exact refit criterion for a standalone tree or partition.
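For concreteness, in the Poisson family \(A^\ast(\mu)=\mu\log\mu-\mu\), so the entropy gain can be evaluated directly from weighted leaf means, as in the following sketch (the Poisson choice and all names are illustrative assumptions).
\begin{verbatim}
import numpy as np

def A_star_poisson(mu):
    # Convex conjugate of A(eta) = exp(eta): A*(mu) = mu*log(mu) - mu.
    return mu * np.log(mu) - mu

def ent_gain(y, w, left_mask):
    """Exact leafwise entropy gain
    (W_L/W) A*(eta_L*) + (W_R/W) A*(eta_R*) - (W_J/W) A*(eta_J*),
    with eta_C* the weighted mean of T(y) = y over C."""
    W = w.sum()
    def term(sel):
        W_C = w[sel].sum()
        eta_star_C = (w[sel] * y[sel]).sum() / W_C
        return (W_C / W) * A_star_poisson(eta_star_C)
    all_mask = np.ones(len(w), dtype=bool)
    return term(left_mask) + term(~left_mask) - term(all_mask)
\end{verbatim}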
\paragraph{Monitoring during training.}
The natural global quantity to monitor for the boosted model itself is the weighted empirical negative log-likelihood
\[
\widehat{\mathcal L}_n^w[\hat\eta_t]
=
\frac{1}{W}\sum_{i=1}^n
w_i\Big[A(\hat\eta_t(x_i))-\langle \hat\eta_t(x_i),T(y_i)\rangle-\log h(y_i)\Big].
\]
This is the empirical proxy for the population risk \(\mathcal L[\eta_t]\), and it directly measures predictive performance.
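In code, this monitor needs only the log-partition function and the carrier term; a minimal Python sketch, with argument names and array layout as illustrative assumptions:
\begin{verbatim}
import numpy as np

def weighted_nll(eta_hat, T_y, w, A, log_h):
    """Weighted empirical NLL.  eta_hat, T_y: (n, d) arrays of natural
    parameters and sufficient statistics; A maps (n, d) -> (n,);
    log_h: (n,) precomputed log carrier terms; w: (n,) weights."""
    per_event = A(eta_hat) - np.sum(eta_hat * T_y, axis=-1) - log_h
    return np.sum(w * per_event) / np.sum(w)
\end{verbatim}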
The entropy story provides an additional geometric diagnostic. For a fully refit partition \(\mathcal P\), the optimized NLL is, up to the carrier term,
\[
-\sum_{J\in\mathcal P}\frac{W_J}{W}A^\ast(\eta_J^\ast).
\]
Thus one may also monitor the corresponding generalized empirical entropy
\[
\widehat{\mathcal H}_{A,n}^w(\mathcal P)
:=
-\sum_{J\in\mathcal P}\frac{W_J}{W}A^\ast(\eta_J^\ast).
\]
For a boosted model this entropy quantity is no longer itself the training objective unless the leaves are fully refit, but it remains useful as an interpretable summary of how much conditional structure the partition captures.
\paragraph{Implementation summary for coding.}
Assume the following objects are available:
\[
T(y),\qquad
\nabla A^\ast(\cdot)\ \text{to map }\eta^\ast\mapsto \eta,\qquad
\{(x_i,y_i,w_i)\}_{i=1}^n,\qquad
\alpha,\qquad
\mathcal F_{\rm tree}.
\]
Then the MGD boosting loop is:
\[
\hat\eta_0^\ast(x)=m_0,
\]
for \(t=0,1,\dots\),
\[
R_i^{(t)}=T(y_i)-\hat\eta_t^\ast(x_i),
\]
fit a weighted regression tree to the targets \(R_i^{(t)}\),
\[
f_{t+1}
=
\arg\min_{f\in\mathcal F_{\rm tree}}
\sum_i w_i\,\|R_i^{(t)}-f(x_i)\|^2,
\]
with leaf values
\[
f_{t+1,J}=\frac{\sum_{i\in J}w_i\,R_i^{(t)}}{\sum_{i\in J}w_i},
\]
update the prediction in the \emph{dual} coordinate,
\[
\hat\eta_{t+1}^\ast(x)=\hat\eta_t^\ast(x)+\alpha f_{t+1}(x),
\]
optionally update residuals in place,
\[
R_i^{(t+1)}=R_i^{(t)}-\alpha f_{t+1}(x_i),
\]
and recover the natural coordinate only when needed,
\[
\hat\eta_{t+1}(x)=\nabla A^\ast(\hat\eta_{t+1}^\ast(x)).
\]
To choose splits in a node \(J\), use
\[
\mathrm{Gain}^{\rm reg}_t(J\to L,R)
=
W_L\|\bar R_L^{(t)}\|^2+W_R\|\bar R_R^{(t)}\|^2-W_J\|\bar R_J^{(t)}\|^2.
\]
To monitor training globally, use
\[
\widehat{\mathcal L}_n^w[\hat\eta_t]
=
\frac{1}{W}\sum_i
w_i\Big[A(\hat\eta_t(x_i))-\langle \hat\eta_t(x_i),T(y_i)\rangle-\log h(y_i)\Big].
\]
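The loop translates almost line by line into code. The following Python sketch assumes scikit-learn's \texttt{DecisionTreeRegressor} as the weak learner (an assumption, not a requirement); vector-valued sufficient statistics are handled as multi-output regression, and all names are illustrative.
\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mgd_boost(X, T_y, w, alpha=0.1, n_rounds=100, max_depth=3, m0=None):
    """MGD boosting in the dual coordinate.  X: (n, p) features,
    T_y: (n, d) sufficient statistics T(y_i), w: (n,) event weights."""
    n, d = T_y.shape
    m0 = np.zeros(d) if m0 is None else np.asarray(m0, dtype=float)
    R = T_y - m0                      # dual residuals R_i^(0) = T(y_i) - m0
    trees = []
    for t in range(n_rounds):
        f = DecisionTreeRegressor(max_depth=max_depth)
        f.fit(X, R, sample_weight=w)  # weighted LS: leaf value = weighted mean
        trees.append(f)
        R -= alpha * f.predict(X).reshape(n, d)  # exact in-place residual update
    return m0, trees

def predict_eta_star(m0, trees, X, alpha=0.1):
    """Dual prediction eta*_B(x) = m0 + alpha * sum_b f_b(x); mapping back to
    the natural coordinate via grad A* is left to the caller."""
    out = np.tile(m0, (X.shape[0], 1))
    for f in trees:
        out += alpha * f.predict(X).reshape(X.shape[0], -1)
    return out
\end{verbatim}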
\subsection{Boosting of NGD}
Natural-gradient descent suggests a second stagewise boosting construction, now based on updating the natural coordinate \(\eta(x)\) itself. In contrast to MGD, this construction is local: the Fisher metric is evaluated at the current iterate and used to precondition the sufficient-statistic residual.
We again consider the conditional exponential family
\[
p(y\mid x,\eta(x))
=
h(y)\exp\!\big(\langle \eta(x),T(y)\rangle-A(\eta(x))\big),
\]
with dual coordinate
\[
\eta^\ast(x)=\nabla A(\eta(x)).
\]
The population negative log-likelihood is
\[
\mathcal L_P[\eta]
=
\mathbb E_P\!\left[A(\eta(X))-\langle \eta(X),T(Y)\rangle-\log h(Y)\right].
\]
Its pointwise Euclidean gradient in the natural parameter is
\[
\nabla_\eta \mathcal L_P(x)
=
\eta^\ast(x)-\mathbb E_P[T(Y)\mid X=x].
\]
The Fisher metric of the family is
\[
g(\eta)=\nabla^2 A(\eta).
\]
Hence the negative natural-gradient direction is
\[
v_t(x)
=
-\,g(\eta_t(x))^{-1}\nabla_\eta \mathcal L_P(x)
=
g(\eta_t(x))^{-1}\Big(\mathbb E_P[T(Y)\mid X=x]-\eta_t^\ast(x)\Big).
\]
The population NGD Euler step is therefore
\[
\eta_{t+1}(x)=\eta_t(x)+\alpha\,v_t(x).
\]
\paragraph{Residual sufficient statistic as an unbiased natural-gradient target.}
Define
\[
U_t:=g(\hat\eta_t(X))^{-1}\Big(T(Y)-\hat\eta_t^\ast(X)\Big).
\]
Then
\[
\mathbb E_P[U_t\mid X=x]
=
g(\hat\eta_t(x))^{-1}\Big(\mathbb E_P[T(Y)\mid X=x]-\hat\eta_t^\ast(x)\Big)
=
v_t(x).
\]
Thus \(U_t\) is an unbiased estimator of the population NGD direction.
If the regression class is infinitely expressive and can fit the conditional mean exactly, then
\[
f_{t+1}(x)=\mathbb E_P[U_t\mid X=x]=v_t(x)
\]
and the stagewise update
\[
\hat\eta_{t+1}(x)=\hat\eta_t(x)+\alpha\,f_{t+1}(x)
\]
reproduces the exact population NGD step.
\paragraph{Weighted empirical algorithm.}
Let \((x_i,y_i,w_i)\) be a weighted sample and define
\[
W:=\sum_{i=1}^n w_i,
\qquad
\hat\eta_t^\ast(x_i)=\nabla A(\hat\eta_t(x_i)),
\qquad
g_i^{(t)}:=g(\hat\eta_t(x_i)).
\]
The empirical NGD pseudo-response is
\[
U_i^{(t)}
=
\bigl(g_i^{(t)}\bigr)^{-1}\Big(T(y_i)-\hat\eta_t^\ast(x_i)\Big).
\]
Fit a regression tree \(f_{t+1}\) to the pairs \((x_i,U_i^{(t)})\) with event weights \(w_i\). The simplest choice is weighted least squares:
\[
f_{t+1}
=
\arg\min_{f\in\mathcal F_{\rm tree}}
\sum_{i=1}^n w_i\,\|U_i^{(t)}-f(x_i)\|^2.
\]
For a leaf \(J\), the optimal constant leaf value is
\[
f_{t+1,J}
=
\frac{\sum_{i\in J} w_i\,U_i^{(t)}}{\sum_{i\in J} w_i}.
\]
The update is then
\[
\hat\eta_{t+1}(x)=\hat\eta_t(x)+\alpha\,f_{t+1}(x).
\]
The corresponding dual prediction must then be recomputed through the Legendre map:
\[
\hat\eta_{t+1}^\ast(x)=\nabla A(\hat\eta_{t+1}(x)).
\]
In contrast to MGD, the residual shift is therefore not linear in the fitted tree. The exact updated sufficient-statistic residual is
\[
R_{t+1}=T(Y)-\hat\eta_{t+1}^\ast(X)
=
T(Y)-\nabla A\bigl(\hat\eta_t(X)+\alpha f_{t+1}(X)\bigr),
\]
and the next unbiased NGD target is obtained by recomputing
\[
U_{t+1}=g(\hat\eta_{t+1}(X))^{-1}R_{t+1}.
\]
\paragraph{Leafwise split criterion.}
Define
\[
W_C:=\sum_{i\in C}w_i,
\qquad
\bar U_C^{(t)}
:=
\frac{\sum_{i\in C}w_i\,U_i^{(t)}}{W_C},
\qquad
C\in\{J,L,R\}.
\]
Then the usual weighted regression gain is
\[
\mathrm{Gain}^{\rm NGD}_t(J\to L,R)
=
W_L\|\bar U_L^{(t)}\|^2
+
W_R\|\bar U_R^{(t)}\|^2
-
W_J\|\bar U_J^{(t)}\|^2.
\]
Thus NGD boosting uses the same tree machinery as ordinary weighted regression, but on a pseudo-response that has been preconditioned by the inverse Fisher metric.
\paragraph{Local versus global geometry.}
MGD becomes linear in the global dual coordinate \(\eta^\ast\), while NGD remains a local tangent-plane approximation in the natural coordinate \(\eta\). Indeed,
\[
\hat\eta_{t+1}^\ast
=
\nabla A(\hat\eta_t+\alpha v_t)
=
\hat\eta_t^\ast+\alpha\,g(\hat\eta_t)v_t+O(\alpha^2)
=
\hat\eta_t^\ast+\alpha\big(\eta^\ast-\hat\eta_t^\ast\big)+O(\alpha^2),
\]
so NGD and MGD agree to first order in \(\alpha\), but differ at finite step size. In this sense, MGD is globally adapted to the exponential-family geometry, whereas NGD is its local quadratic approximation.
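This first-order agreement is easy to verify numerically, again for the illustrative Poisson family: the dual image of one NGD step and the corresponding MGD step differ by a gap that shrinks quadratically in \(\alpha\).
\begin{verbatim}
import numpy as np

# Poisson family: grad A = exp, g(eta) = exp(eta).  One step from eta = 0
# toward T_bar = 3; the dual-coordinate gap between NGD and MGD is O(alpha^2).
eta, T_bar = 0.0, 3.0
mu = np.exp(eta)
for alpha in (0.2, 0.1, 0.05, 0.025):
    ngd_dual = np.exp(eta + alpha * (T_bar - mu) / mu)  # grad A(eta + alpha v)
    mgd_dual = (1.0 - alpha) * mu + alpha * T_bar
    print(alpha, ngd_dual - mgd_dual)  # halving alpha quarters the gap
\end{verbatim}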
\paragraph{Monitoring during training.}
As for MGD, the natural global quantity to monitor is the weighted empirical negative log-likelihood
\[
\widehat{\mathcal L}_n^w[\hat\eta_t]
=
\frac{1}{W}\sum_{i=1}^n
w_i\Big[A(\hat\eta_t(x_i))-\langle \hat\eta_t(x_i),T(y_i)\rangle-\log h(y_i)\Big].
\]
This directly measures predictive performance. One may again compare it to the exact partition-level entropy criterion
\[
\widehat{\mathcal H}_{A,n}^w(\mathcal P)
=
-\sum_{J\in\mathcal P}\frac{W_J}{W}A^\ast(\eta_J^\ast),
\qquad
\eta_J^\ast=\frac{\sum_{i\in J}w_i\,T(y_i)}{\sum_{i\in J}w_i},
\]
but for NGD boosting this entropy is an interpretive diagnostic, not the stagewise training objective.
\paragraph{Implementation summary for coding.}
Assume the following objects are available:
\[
T(y),\qquad
\nabla A(\cdot)\ \text{to map }\eta\mapsto \eta^\ast,\qquad
\nabla^2 A(\cdot)\ \text{to compute }g(\eta),\qquad
\{(x_i,y_i,w_i)\}_{i=1}^n,\qquad
\alpha,\qquad
\mathcal F_{\rm tree}.
\]
Then the NGD boosting loop is:
initialize a natural-coordinate model \(\hat\eta_0(x)\), compute
\[
\hat\eta_0^\ast(x_i)=\nabla A(\hat\eta_0(x_i)),
\qquad
g_i^{(0)}=\nabla^2 A(\hat\eta_0(x_i)),
\]
and for \(t=0,1,\dots\),
\[
U_i^{(t)}=\bigl(g_i^{(t)}\bigr)^{-1}\bigl(T(y_i)-\hat\eta_t^\ast(x_i)\bigr),
\]
fit a weighted regression tree to the targets \(U_i^{(t)}\),
\[
f_{t+1}
=
\arg\min_{f\in\mathcal F_{\rm tree}}
\sum_i w_i\,\|U_i^{(t)}-f(x_i)\|^2,
\]
with leaf values
\[
f_{t+1,J}=\frac{\sum_{i\in J}w_i\,U_i^{(t)}}{\sum_{i\in J}w_i},
\]
update the prediction in the \emph{natural} coordinate,
\[
\hat\eta_{t+1}(x)=\hat\eta_t(x)+\alpha f_{t+1}(x),
\]
then recompute
\[
\hat\eta_{t+1}^\ast(x_i)=\nabla A(\hat\eta_{t+1}(x_i)),
\qquad
g_i^{(t+1)}=\nabla^2 A(\hat\eta_{t+1}(x_i)).
\]
To choose splits in a node \(J\), use
\[
\mathrm{Gain}^{\rm NGD}_t(J\to L,R)
=
W_L\|\bar U_L^{(t)}\|^2+W_R\|\bar U_R^{(t)}\|^2-W_J\|\bar U_J^{(t)}\|^2.
\]
To monitor training globally, use
\[
\widehat{\mathcal L}_n^w[\hat\eta_t]
=
\frac{1}{W}\sum_i
w_i\Big[A(\hat\eta_t(x_i))-\langle \hat\eta_t(x_i),T(y_i)\rangle-\log h(y_i)\Big].
\]
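One round of this loop, in the same hedged Python style as the MGD sketch above; \texttt{grad\_A} and \texttt{hess\_A} are user-supplied callables mapping pointwise \(\eta\mapsto\eta^\ast\) and \(\eta\mapsto g(\eta)\), and all names are illustrative.
\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ngd_boost_round(X, T_y, w, eta_hat, grad_A, hess_A,
                    alpha=0.1, max_depth=3):
    """One NGD boosting round in the natural coordinate.
    eta_hat: (n, d) current natural-parameter predictions at the sample
    points; grad_A: (n, d) -> (n, d); hess_A: (n, d) -> (n, d, d)."""
    n, d = T_y.shape
    eta_star = grad_A(eta_hat)                # recompute dual predictions
    g = hess_A(eta_hat)                       # pointwise Fisher metrics
    resid = T_y - eta_star
    U = np.linalg.solve(g, resid[..., None])[..., 0]  # U_i = g_i^{-1} resid_i
    f = DecisionTreeRegressor(max_depth=max_depth)
    f.fit(X, U, sample_weight=w)
    eta_hat = eta_hat + alpha * f.predict(X).reshape(n, d)
    return eta_hat, f
\end{verbatim}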
\paragraph{Algorithmic simplification: recovering MGD from an NGD implementation.}
If an NGD implementation already exists, then the tree-fitting machinery can be reused almost verbatim for MGD. What changes is the coordinate bookkeeping. In NGD, one forms
\[
U_i^{(t)}=\bigl(g_i^{(t)}\bigr)^{-1}\bigl(T(y_i)-\hat\eta_t^\ast(x_i)\bigr),
\]
fits a tree in the natural coordinate, updates \(\hat\eta_t\), and recomputes \(\hat\eta_t^\ast=\nabla A(\hat\eta_t)\). To obtain MGD instead, one drops the Fisher-preconditioning factor \(g_i^{-1}\), works directly with
\[
R_i^{(t)}=T(y_i)-\hat\eta_t^\ast(x_i),
\]
fits the tree in the \emph{dual} coordinate, updates
\[
\hat\eta_{t+1}^\ast=\hat\eta_t^\ast+\alpha f_{t+1},
\]
and maps back only when needed through
\[
\hat\eta_{t+1}=\nabla A^\ast(\hat\eta_{t+1}^\ast).
\]
Equivalently: starting from an NGD code path, MGD is obtained by removing Fisher preconditioning, changing the fitted coordinate from \(\eta\) to \(\eta^\ast\), and replacing the nonlinear residual recomputation by the exact additive dual-residual update
\[
R_i^{(t+1)}=R_i^{(t)}-\alpha f_{t+1}(x_i).
\]
So MGD is not literally the same algorithm as NGD, but algorithmically it is the simpler special case once the exponential-family maps are available.
\subsection{Primal--dual pairs and boosted realizations}
The discussion above shows that, at the level of pointwise optimization in an unrestricted function space, natural-gradient descent and mirror-gradient descent come in two primal--dual pairs. Boosting changes this picture because the model class is no longer closed under the nonlinear Legendre map \(\eta^\ast=\nabla A(\eta)\). This yields four practically distinct boosted algorithms: two \emph{matched} constructions, where the additive tree expansion is written in the coordinate in which the corresponding update law is simple, and two \emph{crossed} constructions, where the update law is represented in the other coordinate.
\subsubsection{Two pointwise update laws and their primal--dual pairings}
Let \(F(\eta)\) be a smooth objective in the natural coordinate, and let
\[
\widetilde F(\eta^\ast):=F(\nabla A^\ast(\eta^\ast))
\]
be its reparametrization in the dual coordinate. Since
\[
g(\eta)=\nabla^2 A(\eta),
\qquad
g^\ast(\eta^\ast)=\nabla^2 A^\ast(\eta^\ast)=g(\eta)^{-1},
\]
the chain rule gives
\[
\nabla_{\eta^\ast}\widetilde F(\eta^\ast)=g(\eta)^{-1}\nabla_\eta F(\eta).
\]
There are then two distinct pointwise update laws.
\paragraph{The \(M\)-law.}
Mirror descent in the primal coordinate \(\eta\) with mirror potential \(A\) gives
\[
\eta^\ast_{+}
=
\eta^\ast-\alpha\,\nabla_\eta F(\eta).
\]
Equivalently, natural gradient descent in the dual coordinate \(\eta^\ast\) gives
\[
\eta^\ast_{+}
=
\eta^\ast-\alpha\,(g^\ast(\eta^\ast))^{-1}\nabla_{\eta^\ast}\widetilde F(\eta^\ast)
=
\eta^\ast-\alpha\,\nabla_\eta F(\eta).
\]
Thus
\[
\boxed{\text{primal MGD}=\text{dual NGD}.}
\]
This is the update law that is linear in the dual coordinate \(\eta^\ast\).
\paragraph{The \(N\)-law.}
Natural gradient descent in the primal coordinate \(\eta\) gives
\[
\eta_{+}
=
\eta-\alpha\,g(\eta)^{-1}\nabla_\eta F(\eta).
\]
Equivalently, mirror descent in the dual coordinate \(\eta^\ast\) with mirror potential \(A^\ast\) gives
\[
\eta_{+}
=
\eta-\alpha\,\nabla_{\eta^\ast}\widetilde F(\eta^\ast)
=
\eta-\alpha\,g(\eta)^{-1}\nabla_\eta F(\eta).
\]
Thus
\[
\boxed{\text{primal NGD}=\text{dual MGD}.}
\]
This is the update law that is linear in the natural coordinate \(\eta\).
\paragraph{Exponential-family negative log-likelihood.}
For
\[
F(\eta)=A(\eta)-\langle \eta,\bar T\rangle,
\qquad
\nabla_\eta F(\eta)=\eta^\ast-\bar T,
\]
the \(M\)-law becomes
\[
\eta^\ast_{+}=(1-\alpha)\eta^\ast+\alpha \bar T,
\]
while the \(N\)-law becomes
\[
\eta_{+}
=
\eta-\alpha\,g(\eta)^{-1}\bigl(\eta^\ast-\bar T\bigr).
\]
These are the two pointwise update laws that underlie all four boosted algorithms below.
\subsubsection{Matched and crossed boosted algorithms}
The earlier subsections introduced the two \emph{matched} boosted realizations:
MGD boosting for the \(M\)-law, written as an additive expansion in the dual coordinate \(\eta^\ast\), and NGD boosting for the \(N\)-law, written as an additive expansion in the natural coordinate \(\eta\).
The reason these are the natural constructions is that the corresponding update laws are simple in those coordinates:
\[
\text{\(M\)-law: linear in }\eta^\ast,
\qquad
\text{\(N\)-law: linear in }\eta.
\]
Boosting breaks the primal--dual equivalence because the model class is restricted. Indeed, an additive tree model in \(\eta\),
\[
\hat\eta_B(x)=\hat\eta_0(x)+\sum_{b=1}^B \alpha\,g_b(x),
\]
is in general not mapped by \(\nabla A\) to an additive tree model in \(\eta^\ast\), and conversely an additive tree model in \(\eta^\ast\),
\[
\hat\eta_B^\ast(x)=\hat\eta_0^\ast(x)+\sum_{b=1}^B \alpha\,f_b(x),
\]
is in general not mapped by \(\nabla A^\ast\) to an additive tree model in \(\eta\). Hence each pointwise update law admits two distinct boosted realizations.
\paragraph{Matched realization of the \(M\)-law: MGD boosting in \(\eta^\ast\).}
This is the construction developed above:
\[
\hat\eta^\ast_{t+1}(x)=\hat\eta^\ast_t(x)+\alpha f_{t+1}(x),
\]
with residual sufficient-statistic target
\[
R_t=T(Y)-\hat\eta_t^\ast(X),
\qquad
\mathbb E[R_t\mid X=x]=\eta^\ast(x)-\hat\eta_t^\ast(x).
\]
This realization is exact in the infinitely expressive limit and admits the exact in-place residual recursion
\[
R_{t+1}=R_t-\alpha f_{t+1}(X).
\]
\paragraph{Matched realization of the \(N\)-law: NGD boosting in \(\eta\).}
This is the second construction developed above:
\[
\hat\eta_{t+1}(x)=\hat\eta_t(x)+\alpha g_{t+1}(x),
\]
with Fisher-preconditioned pseudo-response
\[
U_t=g(\hat\eta_t(X))^{-1}\bigl(T(Y)-\hat\eta_t^\ast(X)\bigr),
\qquad
\mathbb E[U_t\mid X=x]
=
g(\hat\eta_t(x))^{-1}\bigl(\eta^\ast(x)-\hat\eta_t^\ast(x)\bigr).
\]
This realization is exact in the infinitely expressive limit for the \(N\)-law, but its residual transport is nonlinear because after the primal update one must recompute
\[
\hat\eta^\ast_{t+1}(x)=\nabla A(\hat\eta_{t+1}(x)).
\]
\paragraph{Crossed realization of the \(M\)-law: additive trees in \(\eta\).}
Here the ideal pointwise target is still the \(M\)-law
\[
\tilde\eta^\ast_{t+1}(x)
=
(1-\alpha)\hat\eta_t^\ast(x)+\alpha\,\eta^\ast(x),
\]
but the boosted model is represented additively in the natural coordinate:
\[
\hat\eta_{t+1}(x)=\hat\eta_t(x)+\alpha g_{t+1}(x).
\]
The corresponding ideal primal increment is therefore
\[
q_t^{(M,\eta)}(x)
:=
\frac{1}{\alpha}
\left[
\nabla A^\ast\!\Big((1-\alpha)\hat\eta_t^\ast(x)+\alpha\,\eta^\ast(x)\Big)
-
\hat\eta_t(x)
\right].
\]
If one could fit \(q_t^{(M,\eta)}(x)\) exactly, this would realize the \(M\)-law in a primal additive model. In general, however, there is no simple unbiased sample-level pseudo-response for \(q_t^{(M,\eta)}\), precisely because the map \(\nabla A^\ast\) is nonlinear. This is why the matched dual-coordinate realization is algorithmically much cleaner.
\paragraph{Crossed realization of the \(N\)-law: additive trees in \(\eta^\ast\).}
Here the ideal pointwise target is the \(N\)-law
\[
\tilde\eta_{t+1}(x)
=
\hat\eta_t(x)
+
\alpha\,g(\hat\eta_t(x))^{-1}\bigl(\eta^\ast(x)-\hat\eta_t^\ast(x)\bigr),
\]
but the boosted model is represented additively in the dual coordinate:
\[
\hat\eta^\ast_{t+1}(x)=\hat\eta_t^\ast(x)+\alpha f_{t+1}(x).
\]
The corresponding ideal dual increment is
\[
q_t^{(N,\eta^\ast)}(x)
:=
\frac{1}{\alpha}
\left[
\nabla A\!\Big(
\hat\eta_t(x)+\alpha\,g(\hat\eta_t(x))^{-1}\bigl(\eta^\ast(x)-\hat\eta_t^\ast(x)\bigr)
\Big)
-
\hat\eta_t^\ast(x)
\right].
\]
Again, if one could fit \(q_t^{(N,\eta^\ast)}(x)\) exactly, this would realize the \(N\)-law in a dual additive model. In general there is no simple unbiased pseudo-response because of the nonlinear map \(\nabla A\). This is the dual counterpart of the previous crossed construction.
\paragraph{Consequences.}
Thus the unrestricted pointwise duality collapses to four distinct boosted algorithms:
\begin{enumerate}[leftmargin=2em]
\item \(M\)-law in the dual coordinate \(\eta^\ast\) (matched; earlier MGD boosting),
\item \(N\)-law in the natural coordinate \(\eta\) (matched; earlier NGD boosting),
\item \(M\)-law in the natural coordinate \(\eta\) (crossed),
\item \(N\)-law in the dual coordinate \(\eta^\ast\) (crossed).
\end{enumerate}
The matched versions are the natural ones because they preserve the simple linear structure of the corresponding pointwise update laws. The crossed versions are meaningful, but they lose the simple unbiased pseudo-responses and exact residual recursions that made the matched algorithms attractive.
\subsubsection{Poisson example}
For the Poisson family,
\[
A(\eta)=e^\eta,
\qquad
\eta^\ast=\mu=e^\eta,
\qquad
\eta=\log\mu,
\qquad
g(\eta)=\mu.
\]
The two pointwise laws are
\[
\text{\(M\)-law: }\quad
\mu_{+}=(1-\alpha)\mu+\alpha \bar T,
\]
and
\[
\text{\(N\)-law: }\quad
\eta_{+}
=
\eta-\alpha\,\frac{\mu-\bar T}{\mu}.
\]
In Poisson language \(\bar T\) is just the target mean.
The four boosted realizations are then all distinct.
\paragraph{(i) \(M\)-law in \(\mu\): matched MGD boosting.}
The model is additive in \(\mu\):
\[
\hat\mu_{t+1}(x)=\hat\mu_t(x)+\alpha f_{t+1}(x),
\]