xgboost_distribution.distributions package
Submodules
xgboost_distribution.distributions.base module
Distribution base class
- class xgboost_distribution.distributions.base.BaseDistribution[source]
Bases: ABC
Base class distribution for XGBDistribution.
Note that distributions are stateless, hence a distribution is just a collection of functions that operate on the data (y) and the outputs of xgboost (params).
- abstract gradient_and_hessian(y, params, natural_gradient=True)[source]
Compute the gradient and hessian of the distribution
- abstract property params
The parameter names of the distribution
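To make the stateless design concrete, here is a minimal standalone sketch that mirrors the two documented abstract members. `SketchDistribution` and `UnitNormal` are hypothetical names used only for illustration, and plain Python lists stand in for whatever array types the library actually passes; this is not the library's implementation.

```python
from abc import ABC, abstractmethod


class SketchDistribution(ABC):
    """Hypothetical stand-in mirroring the documented interface."""

    @property
    @abstractmethod
    def params(self):
        """The parameter names of the distribution."""

    @abstractmethod
    def gradient_and_hessian(self, y, params, natural_gradient=True):
        """Return (gradient, hessian) of -log f(y) w.r.t. the raw outputs."""


class UnitNormal(SketchDistribution):
    """Toy example: normal distribution with std fixed to 1, parameter loc."""

    @property
    def params(self):
        return ("loc",)

    def gradient_and_hessian(self, y, params, natural_gradient=True):
        # -log f(y) = (y - loc)^2 / 2 + const, so grad = loc - y and hess = 1.
        # The Fisher information is also 1 here, so the natural gradient
        # coincides with the ordinary gradient and the flag changes nothing.
        grad = [loc - yi for yi, loc in zip(y, params)]
        hess = [1.0 for _ in y]
        return grad, hess


dist = UnitNormal()  # no fitted state is stored on the instance
grad, hess = dist.gradient_and_hessian(y=[1.0, 2.0], params=[0.5, 2.5])
```

The instance holds no data: everything it needs arrives through `y` and `params` on each call.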
xgboost_distribution.distributions.exponential module
Exponential distribution
- class xgboost_distribution.distributions.exponential.Exponential[source]
Bases: BaseDistribution
Exponential distribution with log score
Definition:
f(x) = 1 / scale * e^(-x / scale)
We reparameterize scale -> log(scale) = a to ensure scale > 0.
Gradient:
d/da -log[f(x)] = d/da -log[1/e^a e^(-x / e^a)]
                = 1 - x e^-a
                = 1 - x / scale
The Fisher information is 1 / scale^2, which needs to be expressed in the reparameterized form:
1 / scale^2 = I ( d/d(scale) log(scale) )^2 = I ( 1/ scale )^2
Hence we find: I = 1
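The gradient above can be sanity-checked numerically. The sketch below compares the analytic expression with a central finite difference of -log f(x); the function names and the test point are illustrative assumptions, not part of the library.

```python
import math


def neg_log_likelihood(x, a):
    """-log f(x) for the exponential density with scale = e^a."""
    return a + x * math.exp(-a)


def gradient(x, a):
    """Analytic gradient from the derivation: 1 - x / scale."""
    return 1.0 - x / math.exp(a)


x, a, eps = 2.0, 0.3, 1e-6
numeric = (neg_log_likelihood(x, a + eps) - neg_log_likelihood(x, a - eps)) / (2 * eps)
analytic = gradient(x, a)

# With reparameterized Fisher information I = 1, dividing the gradient by I
# leaves it unchanged, so the natural gradient equals the ordinary gradient.
natural = analytic / 1.0
```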
- property params
The parameter names of the distribution
xgboost_distribution.distributions.laplace module
Laplace distribution
- class xgboost_distribution.distributions.laplace.Laplace[source]
Bases: BaseDistribution
Laplace distribution with log scoring
Definition:
f(x) = 1/2 * exp( -| (x - loc) / scale | ) / scale
We reparameterize:
loc -> loc = a
scale -> log(scale) = b
to ensure scale > 0.
Thus:
f(x) = 1/2 * exp( -| (x - a) / e^b | ) / e^b
We compute the gradients:
d/da -log[f(x)] = (a - x) / (scale * | a-x | )
d/db -log[f(x)] = 1 - | a-x | / scale
To second order:
d2/da2 -log[f(x)] = 2 δ(x-a) / scale
d2/db2 -log[f(x)] = | a-x | / scale
The Fisher information is:
I(loc) = 1 / scale^2
I(scale) = 1 / scale^2
which needs to be expressed in reparameterized form:
1 / scale^2 = I_r(loc) ( d/d(loc) loc )^2
            = I_r(loc)
1 / scale^2 = I_r(scale) ( d/d(scale) log(scale) )^2
            = I_r(scale) ( 1/ scale )^2
Hence:
I_r(loc) = 1 / scale^2
I_r(scale) = 1
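As an illustration of how these pieces combine, the following sketch evaluates the two gradients, checks them against finite differences, and divides by the reparameterized Fisher information to form a natural gradient. Function names and the test point are assumptions made for this example; the gradient in a is undefined at x == a, which the sketch does not handle.

```python
import math


def nll(x, a, b):
    """-log f(x) for the Laplace density with loc = a, scale = e^b."""
    return math.log(2.0) + b + abs(x - a) / math.exp(b)


def gradients(x, a, b):
    """Analytic gradients of -log f(x) with respect to a and b."""
    scale = math.exp(b)
    grad_a = (a - x) / (scale * abs(a - x))  # sign(a - x) / scale
    grad_b = 1.0 - abs(a - x) / scale
    return grad_a, grad_b


def natural_gradients(x, a, b):
    """Gradients divided by I_r(loc) = 1 / scale^2 and I_r(scale) = 1."""
    scale = math.exp(b)
    grad_a, grad_b = gradients(x, a, b)
    return grad_a * scale**2, grad_b / 1.0


x, a, b, eps = 1.5, 0.2, -0.4, 1e-6
grad_a, grad_b = gradients(x, a, b)
num_a = (nll(x, a + eps, b) - nll(x, a - eps, b)) / (2 * eps)
num_b = (nll(x, a, b + eps) - nll(x, a, b - eps)) / (2 * eps)
nat_a, nat_b = natural_gradients(x, a, b)
```

Note that the natural gradient in a reduces to sign(a - x) * scale, so its magnitude grows with the scale instead of shrinking with it.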
- property params
The parameter names of the distribution
xgboost_distribution.distributions.log_normal module
LogNormal distribution
- class xgboost_distribution.distributions.log_normal.LogNormal[source]
Bases: BaseDistribution
LogNormal distribution with log scoring.
Definition:
f(x) = exp( -[ (log(x) - log(scale)) / s ]^2 / 2 ) / (x s)
with parameters (scale, s).
We reparameterize:
s -> log(s) = a
scale -> log(scale) = b
Note that b essentially becomes the ‘loc’ of the distribution:
log(x/scale) / s = ( log(x) - log(scale) ) / s
which is analogous to the normal distribution’s
(x - loc) / scale
Hence we can re-use the computations in distribution.normal, exchanging:
y -> log(y)
scale -> s
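The substitution can be sketched as follows: a normal-distribution gradient routine is called with log(x) in place of y, b = log(scale) in the role of loc, and a = log(s) in the role of log(std). The helper names are hypothetical and the routine is a standalone reimplementation for illustration, not the library's code; a finite-difference check against the log-normal negative log-likelihood confirms the mapping.

```python
import math


def normal_gradients(y, loc, log_std):
    """Gradients of the normal -log f(y) with respect to (loc, log_std)."""
    var = math.exp(2.0 * log_std)
    return (loc - y) / var, 1.0 - (y - loc) ** 2 / var


def log_normal_gradients(x, a, b):
    """Gradients with respect to a = log(s) and b = log(scale)."""
    grad_loc, grad_log_std = normal_gradients(math.log(x), loc=b, log_std=a)
    return grad_log_std, grad_loc  # ordered as (d/da, d/db)


def log_normal_nll(x, a, b):
    """-log f(x) for the log-normal density, dropping the constant term."""
    z = (math.log(x) - b) / math.exp(a)
    return a + math.log(x) + 0.5 * z * z


x, a, b, eps = 2.5, -0.3, 0.4, 1e-6
grad_a, grad_b = log_normal_gradients(x, a, b)
num_a = (log_normal_nll(x, a + eps, b) - log_normal_nll(x, a - eps, b)) / (2 * eps)
num_b = (log_normal_nll(x, a, b + eps) - log_normal_nll(x, a, b - eps)) / (2 * eps)
```

The extra log(x) term in the log-normal likelihood does not depend on a or b, which is why the normal gradients carry over unchanged.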
- property params
The parameter names of the distribution
xgboost_distribution.distributions.negative_binomial module
Negative binomial distribution
- class xgboost_distribution.distributions.negative_binomial.NegativeBinomial[source]
Bases: BaseDistribution
Negative binomial distribution with log score
Definition:
f(k) = p^n (1 - p)^k binomial(n + k - 1, n - 1)
with parameters (n, p), where n >= 0 and 0 <= p <= 1
We reparameterize:
n -> log(n) = a         | e^a = n
p -> log(p/(1-p)) = b   | e^b = p / (1-p) | p = 1 / (1 + e^-b)
The gradients are:
d/da -log[f(k)] = -e^a [ digamma(k+e^a) - digamma(e^a) + log(p) ]
                = -n [ digamma(k+n) - digamma(n) + log(p) ]
d/db -log[f(k)] = (k e^b - e^a) / (e^b + 1)
                = (k - e^a e^-b) / (e^-b + 1)
                = p * (k - e^a e^-b)
                = p * (k - n e^-b)
The Fisher Information:
I(n) ~ p / [ n (p+1) ]
I(p) = n / [ p (1-p)^2 ]
where we used an approximation for I(n) presented here:
http://erepository.uonbi.ac.ke:8080/xmlui/handle/123456789/33803
In reparameterized form, we find I_r(n) and I_r(p):
p / [ n (p+1) ] = I_r(n) [ d/dn log(n) ]^2
                = I_r(n) ( 1/n )^2
-> I_r(n) = np / (p+1)
n / [ p (1-p)^2 ] = I_r(p) [ d/dp log(p/(1-p)) ]^2
                  = I_r(p) ( 1/ [ p (1-p) ] )^2
-> I_r(p) = [ p^2 (1-p)^2 n ] / [ p (1-p)^2 ] = np
Hence the reparameterized Fisher information:
[ np / (p+1), 0  ]
[ 0,          np ]
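The two gradient expressions above can be checked against finite differences of -log f(k). The sketch below uses only the standard library; `digamma` is a small helper written for this example (recurrence plus an asymptotic series), and all names and the test point are illustrative assumptions.

```python
import math


def digamma(x):
    """Digamma for x > 0 via recurrence plus an asymptotic expansion."""
    result = 0.0
    while x < 6.0:  # shift x upward until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def nll(k, a, b):
    """-log f(k) with n = e^a and p = 1 / (1 + e^-b)."""
    n = math.exp(a)
    p = 1.0 / (1.0 + math.exp(-b))
    log_binom = math.lgamma(n + k) - math.lgamma(k + 1) - math.lgamma(n)
    return -(n * math.log(p) + k * math.log(1.0 - p) + log_binom)


def gradients(k, a, b):
    """Analytic gradients of -log f(k) with respect to a and b."""
    n = math.exp(a)
    p = 1.0 / (1.0 + math.exp(-b))
    grad_a = -n * (digamma(k + n) - digamma(n) + math.log(p))
    grad_b = p * (k - n * math.exp(-b))
    return grad_a, grad_b


k, a, b, eps = 3, 0.7, -0.2, 1e-6
grad_a, grad_b = gradients(k, a, b)
num_a = (nll(k, a + eps, b) - nll(k, a - eps, b)) / (2 * eps)
num_b = (nll(k, a, b + eps) - nll(k, a, b - eps)) / (2 * eps)
```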
Ref:
- property params
The parameter names of the distribution
xgboost_distribution.distributions.normal module
Normal distribution
- class xgboost_distribution.distributions.normal.Normal[source]
Bases: BaseDistribution
Normal distribution with log scoring
Definition:
f(x) = exp( -[ (x-mean) / std ]^2 / 2 ) / std
We reparameterize:
a = mean       | a = mean
b = log(std)   | e^b = std
(Note: reparameterizing to log(std) ensures that std > 0, regardless of what the xgboost booster internally outputs, as std = e^b > 0.)
The gradients are:
d/da -log[f(x)] = e^(-2b) * (a-x) = (a-x) / var
d/db -log[f(x)] = 1 - e^(-2b) * (x-a)^2 = 1 - (x-a)^2 / var
as var = std^2 = e^(2b)
The Fisher Information (diagonal):
I(mean) = 1 / var
I(std) = 2 / var
In reparameterized form, we find I_r:
1 / var = I_r(mean) [ d/d(mean) mean ]^2
        = I_r(mean)
2 / var = I_r(std) [ d/d(std) log(std) ]^2
        = I_r(std) ( 1/std )^2
Hence the reparameterized Fisher information:
[ 1 / var, 0 ]
[ 0,       2 ]
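A short numerical sketch may help here. It checks the gradients of -log f(x) by finite differences (the derivative with respect to a comes out as (a - x) / var) and then applies the inverse of the diagonal reparameterized Fisher matrix to obtain a natural gradient. The function names and the test point are assumptions for illustration only.

```python
import math


def nll(x, a, b):
    """-log f(x) with mean = a and std = e^b, dropping the constant term."""
    return b + 0.5 * (x - a) ** 2 * math.exp(-2.0 * b)


def gradients(x, a, b):
    """Analytic gradients of -log f(x) with respect to a and b."""
    var = math.exp(2.0 * b)
    return (a - x) / var, 1.0 - (x - a) ** 2 / var


def natural_gradients(x, a, b):
    """Apply the inverse of diag(1 / var, 2) to the gradient vector."""
    var = math.exp(2.0 * b)
    grad_a, grad_b = gradients(x, a, b)
    return grad_a * var, grad_b / 2.0


x, a, b, eps = 1.8, 0.5, 0.2, 1e-6
grad_a, grad_b = gradients(x, a, b)
num_a = (nll(x, a + eps, b) - nll(x, a - eps, b)) / (2 * eps)
num_b = (nll(x, a, b + eps) - nll(x, a, b - eps)) / (2 * eps)
nat_a, nat_b = natural_gradients(x, a, b)
```

Because the matrix is diagonal, the solve reduces to elementwise division, and the natural gradient in a simplifies to a - x, independent of the variance.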
Ref:
- property params
The parameter names of the distribution
xgboost_distribution.distributions.poisson module
Poisson distribution
- class xgboost_distribution.distributions.poisson.Params(mu)
Bases: tuple
Create new instance of Params(mu,)
- mu
Alias for field number 0
- class xgboost_distribution.distributions.poisson.Poisson[source]
Bases: BaseDistribution
Poisson distribution with log score
Definition:
f(k) = e^(-mu) mu^k / k!
We reparameterize mu -> log(mu) = a to ensure mu > 0.
Gradient:
d/da -log[f(k)] = e^a - k = mu - k
The Fisher information = 1 / mu, which needs to be expressed in the reparameterized form:
1 / mu = I ( d/dmu log(mu) )^2 = I ( 1/ mu )^2
Hence we find: I = mu
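The result I = mu can be confirmed numerically, since the Fisher information in the reparameterized form is the expected squared gradient, E[(e^a - k)^2]. The sketch below sums this expectation over the Poisson probabilities, truncating at k = 200 where the remaining mass is negligible for the chosen mu. Names and the test value are illustrative assumptions.

```python
import math


def pmf(k, mu):
    """Poisson probability f(k) = e^(-mu) mu^k / k!, computed in log space."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def gradient(k, a):
    """Analytic gradient from the derivation: e^a - k = mu - k."""
    return math.exp(a) - k


a = 1.2
mu = math.exp(a)

# Expected squared gradient under the distribution itself.
fisher = sum(pmf(k, mu) * gradient(k, a) ** 2 for k in range(200))

# Natural gradient for one observation: gradient divided by I = mu.
k_obs = 5
natural = gradient(k_obs, a) / mu
```

Dividing by mu turns the gradient into 1 - k / mu, a relative rather than absolute error.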
- property params
The parameter names of the distribution
xgboost_distribution.distributions.utils module
Utility functions for distributions
- xgboost_distribution.distributions.utils.safe_exp(x)[source]
Like np.exp, but clipped to prevent overflow (in float32 world)
Ensures that:
- large numbers cannot hit infinity
- small numbers cannot hit precisely zero
NB: The limits are chosen such that we have some stability in subsequent computations. For example, the minimum returned value should be safe in a division with a numerator of size up to ~1e6.
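One way such clipping could work is sketched below for a single scalar. The limits of -70 and 70 are hypothetical values chosen for this example to satisfy the stated properties in float32; the library's actual limits are not given here, and `safe_exp_sketch` is not the library's function.

```python
import math

# Hypothetical clipping limits (assumptions, not the library's values).
MIN_EXPONENT = -70.0
MAX_EXPONENT = 70.0
FLOAT32_MAX = 3.4028235e38  # largest finite float32 value


def safe_exp_sketch(x):
    """Exponential of x after clipping x to [MIN_EXPONENT, MAX_EXPONENT]."""
    clipped = min(max(x, MIN_EXPONENT), MAX_EXPONENT)
    return math.exp(clipped)


largest = safe_exp_sketch(1000.0)    # would overflow without clipping
smallest = safe_exp_sketch(-1000.0)  # would underflow to zero without clipping
ratio = 1e6 / smallest               # division with a numerator of ~1e6
```

With these limits the largest result stays well below the float32 maximum, the smallest stays strictly positive, and the ratio remains finite in float32.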