xgboost-distribution

XGBoost for probabilistic prediction. Like NGBoost, but faster, and with the XGBoost scikit-learn API.

Installation

$ pip install xgboost-distribution

Dependencies:

python_requires = >=3.8

install_requires =
    scikit-learn
    xgboost>=2.0.0

Usage

XGBDistribution follows the XGBoost scikit-learn API, with an additional keyword argument specifying the distribution, which is fit via Maximum Likelihood Estimation:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

from xgboost_distribution import XGBDistribution


data = fetch_california_housing()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = XGBDistribution(
    distribution="normal",
    n_estimators=500,
    early_stopping_rounds=10
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)])

See the documentation for all available distributions.

After fitting, we can predict the parameters of the distribution:

preds = model.predict(X_test)
mean, std = preds.loc, preds.scale

Note that predict returns a namedtuple of numpy arrays, one for each parameter of the distribution. Parameter names follow the scipy stats conventions; see e.g. scipy.stats.norm for the normal distribution.
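
The predicted parameters can be turned into prediction intervals. The sketch below is one way to do that for the normal distribution, using only the Python standard library. The Predictions namedtuple and its values are hypothetical stand-ins for the output of model.predict, and prediction_intervals is an illustrative helper, not part of xgboost-distribution:

```python
from collections import namedtuple
from statistics import NormalDist

# Hypothetical stand-in for the namedtuple returned by model.predict();
# the values are made up for illustration.
Predictions = namedtuple("Predictions", ["loc", "scale"])
preds = Predictions(loc=[2.0, 1.5, 3.2], scale=[0.5, 0.3, 1.0])


def prediction_intervals(loc, scale, coverage=0.95):
    """Return (lower, upper) bounds of central intervals for normal predictions."""
    tail = (1.0 - coverage) / 2.0
    lower, upper = [], []
    for mu, sigma in zip(loc, scale):
        dist = NormalDist(mu=float(mu), sigma=float(sigma))
        lower.append(dist.inv_cdf(tail))        # lower quantile
        upper.append(dist.inv_cdf(1.0 - tail))  # upper quantile
    return lower, upper


lower, upper = prediction_intervals(preds.loc, preds.scale)
for mu, lo, hi in zip(preds.loc, lower, upper):
    print(f"mean={mu:.2f}  95% interval=[{lo:.2f}, {hi:.2f}]")
```

With scipy installed, scipy.stats.norm.ppf computes the same quantiles directly on the numpy arrays.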

NGBoost performance comparison

XGBDistribution follows the method shown in the NGBoost library, using natural gradients to estimate the parameters of the distribution.
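
To illustrate what a natural gradient is, here is a minimal sketch for a single observation under a normal distribution. It assumes the distribution is parametrized by (loc, log scale), for which the Fisher information matrix is diag(1 / scale**2, 2). The function names are hypothetical and this is not the library's actual implementation:

```python
import math


def normal_nll_gradient(y, loc, log_scale):
    """Ordinary gradient of the normal negative log-likelihood
    with respect to (loc, log_scale)."""
    var = math.exp(2.0 * log_scale)
    d_loc = (loc - y) / var
    d_log_scale = 1.0 - (y - loc) ** 2 / var
    return d_loc, d_log_scale


def normal_natural_gradient(y, loc, log_scale):
    """Natural gradient: the ordinary gradient premultiplied by the
    inverse Fisher information matrix."""
    var = math.exp(2.0 * log_scale)
    d_loc, d_log_scale = normal_nll_gradient(y, loc, log_scale)
    # Fisher information is diag(1 / var, 2), so its inverse is diag(var, 1 / 2).
    return var * d_loc, d_log_scale / 2.0


grad = normal_nll_gradient(y=3.0, loc=1.0, log_scale=math.log(2.0))
nat_grad = normal_natural_gradient(y=3.0, loc=1.0, log_scale=math.log(2.0))
print("gradient:", grad)
print("natural gradient:", nat_grad)
```

Note how the natural gradient for loc reduces to loc - y, independent of the current scale, which is part of what makes the updates better conditioned.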

We compared the performance of XGBDistribution and the NGBoost NGBRegressor on the California Housing dataset, estimating normal distributions. While the performance of the two models is fairly similar (measured by the negative log-likelihood of a normal distribution and the RMSE), XGBDistribution is around 15x faster (timed on both fit and predict steps).

Please see the experiments page for results across various datasets.
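
For reference, the two metrics named above can be computed from predicted normal parameters as in the sketch below. It assumes the negative log-likelihood is averaged per sample (the exact aggregation used in the experiments is not stated here), and the function names and sample values are made up for illustration:

```python
import math


def normal_nll(y_true, loc, scale):
    """Mean negative log-likelihood of y_true under N(loc, scale**2)."""
    total = 0.0
    for y, mu, sigma in zip(y_true, loc, scale):
        total += 0.5 * math.log(2.0 * math.pi * sigma**2)
        total += (y - mu) ** 2 / (2.0 * sigma**2)
    return total / len(y_true)


def rmse(y_true, loc):
    """Root mean squared error of the predicted means."""
    squared_errors = [(y - mu) ** 2 for y, mu in zip(y_true, loc)]
    return math.sqrt(sum(squared_errors) / len(y_true))


# Hypothetical targets and predictions.
y_true = [2.1, 1.4, 3.0]
loc = [2.0, 1.5, 3.2]
scale = [0.5, 0.3, 1.0]

print("NLL: ", normal_nll(y_true, loc, scale))
print("RMSE:", rmse(y_true, loc))
```

The NLL rewards well-calibrated scale predictions as well as accurate means, whereas the RMSE only looks at the means.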

Full XGBoost features

XGBDistribution offers the full set of XGBoost features available in the XGBoost scikit-learn API, allowing, for example, probabilistic regression with monotonic constraints.

Acknowledgements

This package would not exist without the excellent work from:

NGBoost - Which demonstrated how gradient boosting with natural gradients can be used to estimate parameters of distributions. Much of the gradient calculation code was adapted from there.
XGBoost - Which provides the gradient boosting algorithms used here. In particular, the sklearn APIs were taken as a blueprint.

Note

This project has been set up using PyScaffold 4.0.1. For details and usage information on PyScaffold see https://pyscaffold.org/.