CMGD-Tree is a small playground for probabilistic tree boosting with:
- streamed toy data
- histogram-based trees
- CPU or GPU training
- CPU or GPU prediction
- family-side MGD and NGD updates
If you want one mental model for the codebase, use this:
- choose a family
- choose the input and output dimensions
- choose the tree size and number of boosting rounds
- run the demo
The main entry point is:
python fit_single_tree_hist_demo.py

Writeup:
Start with the default multi-dimensional Gaussian example:
python fit_single_tree_hist_demo.py

That runs the normal_identity family with:
- n_features = 32
- n_classes = 4
- shallow trees
- 2 boosting rounds
If you want to see the fitted trees:
python fit_single_tree_hist_demo.py --print-trees

If you want timing output:

python fit_single_tree_hist_demo.py --profile

If you want plots:

python fit_single_tree_hist_demo.py --plot

A more realistic first run is a slightly larger multi-dimensional Gaussian:
python fit_single_tree_hist_demo.py \
--print-trees \
--modify \
family normal_identity \
n_features 8 \
n_classes 4 \
max_depth 3 \
max_leaves 8 \
n_boost_rounds 20 \
learning_rate 0.2

Read that command like this:
- family normal_identity chooses the probabilistic model
- n_features 8 sets the input dimension
- n_classes 4 sets the output dimension
- max_depth 3 and max_leaves 8 make the trees more expressive
- n_boost_rounds 20 fits a larger ensemble
- learning_rate 0.2 makes boosting more conservative
That is usually the easiest place to start changing things.
There are three groups of settings.

The first group is the family. It answers: what distribution or statistical problem am I fitting?
Current families:
- normal_identity
- poisson
- poisson_ngd
- gamma
- negative_binomial
- heteroskedastic_normal
- heteroskedastic_normal_ngd
Typical first changes:
python fit_single_tree_hist_demo.py --modify family poisson
python fit_single_tree_hist_demo.py --modify family gamma
python fit_single_tree_hist_demo.py --modify family negative_binomial

The second group is the data shape. It answers: how many inputs, how many outputs, and how much data?
Use:
--modify \
n_features 8 \
n_classes 4 \
batch_size 65536 \
n_batches 12

Meaning:
- n_features: input dimension
- n_classes: output dimension of the fitted target statistics
- batch_size: events per streamed batch
- n_batches: number of streamed batches
The total number of training events is:
batch_size * n_batches
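To make the arithmetic concrete, the example values above work out as follows:

```python
# Total streamed training events for the example values above.
batch_size = 65536   # events per streamed batch
n_batches = 12       # number of streamed batches

total_events = batch_size * n_batches
print(total_events)  # 786432
```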
Example:
python fit_single_tree_hist_demo.py \
--modify n_features 16 n_classes 4 batch_size 32768 n_batches 24

The third group is the tree and boosting settings. It answers: how large should the trees be, and how aggressively should boosting update?
Use:
--modify \
max_depth 4 \
max_leaves 16 \
max_bin 64 \
n_boost_rounds 50 \
learning_rate 0.1

Meaning:
- max_depth: maximum tree depth
- max_leaves: maximum number of leaves
- max_bin: histogram resolution for split search
- n_boost_rounds: number of boosting iterations
- learning_rate: shrinkage per tree
Example:
python fit_single_tree_hist_demo.py \
--modify max_depth 4 max_leaves 16 n_boost_rounds 50 learning_rate 0.1

Run a Gaussian example with more dimensions:
python fit_single_tree_hist_demo.py \
--modify family normal_identity n_features 16 n_classes 8

Switch to Poisson:
python fit_single_tree_hist_demo.py \
--modify family poisson n_features 8 n_classes 4

Run the NGD Poisson example:
python fit_single_tree_hist_demo.py \
--modify family poisson_ngd n_features 8 n_classes 4

Run everything on CPU:
python fit_single_tree_hist_demo.py \
--modify training_backend cpu predict_method cpu cpu_predictor numba_parallel

Run GPU training with GPU prediction:
python fit_single_tree_hist_demo.py \
--modify training_backend gpu predict_method gpu

Run GPU training but keep prediction on CPU:
python fit_single_tree_hist_demo.py \
--modify training_backend gpu predict_method cpu cpu_predictor numba_parallel

Top-level options:
- --config path-or-name: load a complete example YAML, e.g. --config poisson
- --modify key value ...: override config values
- --profile: print timing and memory summaries
- --plot: write plots under ./plots/<training_id>/
- --print-trees: print the fitted trees
- --full-output: compatibility alias for --plot --print-trees
Example:
python fit_single_tree_hist_demo.py \
--plot \
--print-trees \
--modify family normal_identity n_features 8 n_classes 4 max_depth 3

The runnable examples now live in YAML:
- configs/default.yaml: internal fallback defaults
- configs/examples/*.yaml: complete user-facing example configs
Each example YAML has four top-level groups:
tree, dataset, training, and plot.
Most important keys:
- family: statistical model
- n_features: input dimension
- n_classes: output dimension
- batch_size, n_batches: training size
- max_depth, max_leaves: tree expressivity
- n_boost_rounds, learning_rate: boosting strength
- training_backend: auto, gpu, or cpu
- predict_method: gpu or cpu
- cpu_predictor: index, leaf_mask, numba, or numba_parallel
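Putting those keys into the four groups, an example config might look like the sketch below. The grouping of each key under tree/dataset/training and the cut_sample_rows value are guesses for illustration, not copied from the repository:

```yaml
tree:
  max_depth: 3
  max_leaves: 8
  max_bin: 64
dataset:
  family: normal_identity
  n_features: 8
  n_classes: 4
  batch_size: 65536
  n_batches: 12
  cut_sample_rows: 10000   # hypothetical value; rows used only to estimate cuts
training:
  n_boost_rounds: 20
  learning_rate: 0.2
  training_backend: auto
  predict_method: cpu
  cpu_predictor: numba_parallel
plot: {}
```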
Two keys that are useful but easy to misunderstand:
- cut_sample_rows only controls how many rows are used to estimate feature cuts
- max_bin controls how fine the histogram split search is
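A minimal NumPy sketch of the distinction, assuming a standard quantile-based cut estimation (this is not the project's implementation, just the idea): cut points are estimated from a row sample of size cut_sample_rows, then every row is binned into max_bin histogram bins against those cuts.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)   # one feature column from the stream

cut_sample_rows = 10_000       # rows used only to *estimate* the cut points
max_bin = 64                   # histogram resolution for split search

# Estimate max_bin - 1 interior cut points from sample quantiles.
sample = rng.choice(x, size=cut_sample_rows, replace=False)
cuts = np.quantile(sample, np.linspace(0, 1, max_bin + 1)[1:-1])

# All rows (not just the sample) are binned against those cuts.
bins = np.searchsorted(cuts, x)   # bin indices in 0 .. max_bin - 1
```

So a small cut_sample_rows only coarsens where the cut points land; the split search still sees every row, at max_bin resolution.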
Each example YAML is explicit and self-contained.
For example:
- configs/examples/heteroskedastic_normal.yaml uses the 2D heteroskedastic toy stream and scalar diagnostic plot mode
- configs/examples/gamma.yaml and configs/examples/negative_binomial.yaml use their matching toy streams and longer boosted runs
Prediction is an important runtime choice.
Use predict_method=gpu when:
- you are already training on GPU
- your batches are large
- you want cache updates to stay on device
Use predict_method=cpu when:
- you want easier inspection
- you want to compare CPU predictors
- you are running a CPU-only setup
The default CPU predictor is:
numba_parallel
The trainer always fits trees to a family-supplied pseudo-response.
For MGD, that pseudo-response is the plain residual-style target:
T(y) - eta*(x)
For NGD, the family can precondition that target with the Fisher information:
G(x)^{-1} (T(y) - eta*(x))
So the tree code stays the same, while the family changes the geometry.
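The two targets can be sketched with stand-in data. Here G is an arbitrary symmetric positive definite matrix per event, not a real family's Fisher information, and the names are illustrative rather than the project's API:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4                         # events and output dimension (n_classes)
T_y = rng.normal(size=(n, k))       # per-event sufficient statistics T(y)
eta_star = rng.normal(size=(n, k))  # current model output eta*(x)

# MGD pseudo-response: plain residual-style target T(y) - eta*(x)
mgd_target = T_y - eta_star

# NGD pseudo-response: precondition the residual with the per-event
# Fisher information, G(x)^{-1} (T(y) - eta*(x)).
A = rng.normal(size=(n, k, k))
G = A @ A.transpose(0, 2, 1) + np.eye(k)   # stand-in SPD matrices
ngd_target = np.linalg.solve(G, mgd_target[..., None])[..., 0]
```

Either way the tree only ever sees an (n, k) target array, which is why the tree code does not change between MGD and NGD families.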
If you extend the project:
- add a new statistical model in families/
- add a new toy generator or real loader in data_providers/
- add a new runnable setup in configs/examples/
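As a hypothetical sketch of the first step, a new family might look something like this. The class and method names here are invented for illustration; the actual base class and signatures in families/ may differ:

```python
import numpy as np

class LaplaceIdentity:
    """Hypothetical toy family: Laplace location model, identity link."""

    name = "laplace_identity"

    def sufficient_stat(self, y):
        # T(y): for a location model, just the observation itself.
        return y

    def pseudo_response(self, y, eta):
        # MGD target T(y) - eta*(x); an NGD variant would additionally
        # apply the inverse Fisher information to this residual.
        return self.sufficient_stat(y) - eta
```

With a family like this registered, the toy generator and example YAML are the remaining two steps.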
That is the intended workflow for new users as well:
- start from an existing example
- change the family
- change the dimensions
- change the tree and training settings
- run again