Experiments

We performed experiments with XGBDistribution across various datasets for probabilistic regression tasks. Comparisons were made with both NGBoost's NGBRegressor and a standard xgboost XGBRegressor (point estimates only).

Probabilistic regression

For probabilistic regression, XGBDistribution performs essentially identically to NGBRegressor within error bars, as measured by the negative log-likelihood (NLL) of a normal distribution.
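To make the NLL comparison concrete, here is a minimal sketch of fitting XGBDistribution and scoring its predicted normal distribution on a held-out set. It assumes that predict returns the per-sample loc and scale parameters of the fitted normal (as in recent versions of xgboost-distribution); the dataset and split are stand-ins rather than the benchmark setup described under Methodology below.

```python
from scipy.stats import norm
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

from xgboost_distribution import XGBDistribution

# Stand-in dataset and split; the benchmark datasets are listed in the table below.
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBDistribution(distribution="normal", max_depth=3)
model.fit(X_train, y_train)

# predict() returns per-sample distribution parameters; for a normal
# distribution these are the mean (loc) and standard deviation (scale).
preds = model.predict(X_test)
nll = -norm.logpdf(y_test, loc=preds.loc, scale=preds.scale).mean()
print(f"test NLL: {nll:.3f}")
```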

XGBDistribution is, however, substantially faster, typically by at least an order of magnitude. For example, on the MSD dataset, the combined fit and predict steps took roughly 19 minutes with XGBDistribution versus 6.7 hours with NGBRegressor:

| Dataset  | N       | XGBDistribution NLL | XGBDistribution Time | NGBRegressor NLL | NGBRegressor Time |
|----------|---------|---------------------|----------------------|------------------|-------------------|
| Boston   | 506     | 2.62(26)            | 0.067(1) s           | 2.55(24)         | 2.68(45) s        |
| Concrete | 1,030   | 3.14(21)            | 0.13(3) s            | 3.09(13)         | 5.79(59) s        |
| Energy   | 768     | 0.58(41)            | 0.15(3) s            | 0.62(28)         | 5.33(35) s        |
| Naval    | 11,934  | -5.11(6)            | 5.8(8) s             | -3.91(2)         | 43.6(5) s         |
| Power    | 9,568   | 2.77(11)            | 1.21(52) s           | 2.77(7)          | 14.9(3.1) s       |
| Protein  | 45,730  | 2.81(4)             | 12.2(4.0) s          | 2.91(1)          | 146.5(1.8) s      |
| Wine     | 1,588   | 0.98(15)            | 0.11(3) s            | 0.93(7)          | 4.85(99) s        |
| Yacht    | 308     | 0.89(1.1)           | 0.093(25) s          | 0.75(64)         | 4.95(50) s        |
| MSD      | 515,345 | 3.450(4)            | 18.9(1.5) min        | 3.526(4)         | 6.70(7) h         |

Point estimation

For point estimates, we compared XGBDistribution to both NGBRegressor and XGBRegressor, measured by root mean squared error (RMSE). As expected, XGBRegressor generally offers the best performance for this task. However, XGBDistribution incurs only small penalties in both accuracy and speed relative to XGBRegressor, making it a viable "drop-in" replacement when probabilistic predictions are needed.
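A natural point prediction from XGBDistribution is the mean (loc) of the predicted distribution, which makes the swap essentially a change of estimator class plus reading one attribute off the prediction. A rough sketch, with the same caveats as above about the predicted loc attribute and the stand-in dataset:

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from xgboost import XGBRegressor
from xgboost_distribution import XGBDistribution

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Point estimate from a plain XGBRegressor.
point_model = XGBRegressor(max_depth=3)
point_model.fit(X_train, y_train)
rmse_point = np.sqrt(mean_squared_error(y_test, point_model.predict(X_test)))

# Point estimate from XGBDistribution: the mean (loc) of the predicted normal.
dist_model = XGBDistribution(distribution="normal", max_depth=3)
dist_model.fit(X_train, y_train)
rmse_dist = np.sqrt(mean_squared_error(y_test, dist_model.predict(X_test).loc))

print(f"RMSE  XGBRegressor: {rmse_point:.3f}  XGBDistribution: {rmse_dist:.3f}")
```

The full comparison across datasets: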

| Dataset  | XGBDistribution RMSE | XGBDistribution Time | NGBRegressor RMSE | NGBRegressor Time | XGBRegressor RMSE | XGBRegressor Time |
|----------|----------------------|----------------------|-------------------|-------------------|-------------------|-------------------|
| Boston   | 3.41(69)             | 0.067(1) s           | 3.25(66)          | 2.68(45) s        | 3.27(65)          | 0.035(1) s        |
| Concrete | 5.41(74)             | 0.13(3) s            | 5.62(69)          | 5.79(59) s        | 4.38(70)          | 0.09(2) s         |
| Energy   | 0.45(7)              | 0.15(3) s            | 0.49(7)           | 5.33(35) s        | 0.40(6)           | 0.05(2) s         |
| Naval    | 0.0014(1)            | 5.8(8) s             | 0.0059(1)         | 43.6(5) s         | 0.00123(5)        | 1.93(7) s         |
| Power    | 3.79(24)             | 1.21(52) s           | 3.93(19)          | 14.9(3.1) s       | 3.31(22)          | 0.59(19) s        |
| Protein  | 4.35(12)             | 12.2(4.0) s          | 4.78(5)           | 146.5(1.8) s      | 4.09(7)           | 8.26(1.4) s       |
| Wine     | 0.63(4)              | 0.11(3) s            | 0.62(3)           | 4.85(99) s        | 0.62(3)           | 0.035(13) s       |
| Yacht    | 0.76(29)             | 0.093(25) s          | 0.74(28)          | 4.95(50) s        | 0.74(37)          | 0.047(35) s       |
| MSD      | 9.03(4)              | 18.9(1.5) min        | 9.47(4)           | 6.70(7) h         | 8.97(3)           | 16.3(1.7) min     |

Methodology

We used 10-fold cross-validation, with 10% of the data in each training fold held out as a validation set for early stopping. This process was repeated over 5 random seeds. For the MSD dataset, we used a single run of 5-fold cross-validation.

The negative log-likelihood (NLL) and root mean squared error (RMSE) were computed on each test fold; the values reported above are the mean and standard deviation of these metrics across folds and random seeds.
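A sketch of this protocol for a single estimator and the NLL metric is given below. The cross_validated_nll helper is purely illustrative, the early_stopping_rounds value is arbitrary and its placement as a constructor argument assumes a recent xgboost version (older versions pass it to fit), and the loc/scale attributes are as assumed above; the actual runs are defined in the experiments script.

```python
import numpy as np
from scipy.stats import norm
from sklearn.model_selection import KFold, train_test_split

from xgboost_distribution import XGBDistribution


def cross_validated_nll(X, y, n_splits=10, seeds=(0, 1, 2, 3, 4)):
    """k-fold CV repeated over random seeds; 10% of each training fold is
    held out as a validation set for early stopping."""
    scores = []
    for seed in seeds:
        kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, test_idx in kfold.split(X):
            X_fit, X_val, y_fit, y_val = train_test_split(
                X[train_idx], y[train_idx], test_size=0.1, random_state=seed
            )
            model = XGBDistribution(
                distribution="normal", max_depth=3, early_stopping_rounds=10
            )
            model.fit(X_fit, y_fit, eval_set=[(X_val, y_val)])

            # NLL of the predicted normal distribution on the test fold.
            preds = model.predict(X[test_idx])
            nll = -norm.logpdf(y[test_idx], loc=preds.loc, scale=preds.scale).mean()
            scores.append(nll)
    # Mean and standard deviation across folds and seeds, as reported above.
    return np.mean(scores), np.std(scores)
```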

For all estimators, we used default hyperparameters, except for setting max_depth=3 in XGBDistribution and XGBRegressor to match the default depth of NGBRegressor's tree base learners. In all experiments, XGBDistribution and NGBRegressor estimated normal distributions using natural gradients.
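The corresponding estimator configurations look roughly as follows (NGBRegressor uses a normal distribution and natural gradients by default; the distribution and natural_gradient arguments shown for XGBDistribution are assumed from the current xgboost-distribution API):

```python
from ngboost import NGBRegressor
from xgboost import XGBRegressor
from xgboost_distribution import XGBDistribution

# Default hyperparameters throughout, except max_depth=3 for the xgboost-based
# estimators to match the depth of NGBRegressor's default tree base learner.
xgb_dist = XGBDistribution(distribution="normal", natural_gradient=True, max_depth=3)
ngb_reg = NGBRegressor()  # normal distribution and natural gradient are defaults
xgb_reg = XGBRegressor(max_depth=3)
```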

Please see the experiments script for the full details.