xgboost_distribution.XGBDistribution
- class xgboost_distribution.XGBDistribution(*, distribution: str = None, natural_gradient: bool = True, objective: str = None, **kwargs: Any)[source]
Implementation of XGBoost to estimate distributions (in scikit-learn API). See /python/sklearn_estimator for more information.
- Parameters:
distribution ({'poisson', 'log-normal', 'laplace', 'normal', 'exponential', 'negative-binomial'}, default='normal') –
Which distribution to estimate. Available choices:
"exponential" - parameters: ('scale',)
"laplace" - parameters: ('loc', 'scale')
"log-normal" - parameters: ('scale', 's')
"negative-binomial" - parameters: ('n', 'p')
"normal" - parameters: ('loc', 'scale')
"poisson" - parameters: ('mu',)
Please see scipy.stats for a full description of the parameters of each distribution: https://docs.scipy.org/doc/scipy/reference/stats.html
Note that distributions are fit using Maximum Likelihood Estimation, which internally corresponds to minimising the negative log likelihood (NLL).
natural_gradient (bool, default=True) – Whether or not natural gradients should be used.
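As an illustration of the two points above (fitting by minimising the NLL, and the natural_gradient switch), here is a minimal sketch for the normal case in plain Python. It is not the library's implementation: the function names normal_nll and normal_gradients are hypothetical, and the sketch assumes the scale is modelled on the log scale, for which the Fisher information of the normal distribution is diag(1/scale^2, 2).

```python
import math


def normal_nll(y, loc, scale):
    """Negative log likelihood of one observation y under Normal(loc, scale)."""
    return (
        0.5 * math.log(2 * math.pi)
        + math.log(scale)
        + (y - loc) ** 2 / (2 * scale**2)
    )


def normal_gradients(y, loc, scale, natural_gradient=True):
    """Gradient of the NLL with respect to (loc, log(scale)).

    Assumption: the scale is parameterised as log(scale) so it stays positive.
    """
    var = scale**2
    grad_loc = (loc - y) / var
    grad_log_scale = 1.0 - (y - loc) ** 2 / var
    if natural_gradient:
        # Multiply by the inverse Fisher information, diag(var, 1/2).
        grad_loc *= var
        grad_log_scale /= 2.0
    return grad_loc, grad_log_scale


# Maximum likelihood estimates for a small sample: the sample mean and the
# (biased) sample standard deviation minimise the total NLL.
ys = [1.0, 2.0, 4.0, 5.0]
mle_loc = sum(ys) / len(ys)
mle_scale = math.sqrt(sum((v - mle_loc) ** 2 for v in ys) / len(ys))
total_nll = sum(normal_nll(v, mle_loc, mle_scale) for v in ys)
```

With natural_gradient=True the loc component reduces to loc - y, so the step for loc no longer depends on the current scale.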
n_estimators (Optional[int]) – Number of gradient boosted trees. Equivalent to number of boosting rounds.
max_depth (Optional[int]) – Maximum tree depth for base learners.
max_leaves – Maximum number of leaves; 0 indicates no limit.
max_bin – If using histogram-based algorithm, maximum number of bins per feature
grow_policy – Tree growing policy. 0: favor splitting at nodes closest to the root, i.e. grow depth-wise. 1: favor splitting at nodes with highest loss change.
learning_rate (Optional[float]) – Boosting learning rate (xgb’s “eta”)
verbosity (Optional[int]) – The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
objective (Union[str, Callable[[numpy.ndarray, numpy.ndarray], Tuple[numpy.ndarray, numpy.ndarray]], NoneType]) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
booster (Optional[str]) – Specify which booster to use: gbtree, gblinear or dart.
tree_method (Optional[str]) – Specify which tree method to use. Defaults to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the tree method section of the parameters document.
n_jobs (Optional[int]) – Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.
gamma (Optional[float]) – (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight (Optional[float]) – Minimum sum of instance weight(hessian) needed in a child.
max_delta_step (Optional[float]) – Maximum delta step we allow each tree’s weight estimation to be.
subsample (Optional[float]) – Subsample ratio of the training instance.
sampling_method –
Sampling method. Used only by the GPU version of the hist tree method.
uniform: select random training instances uniformly.
gradient_based: select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)
colsample_bytree (Optional[float]) – Subsample ratio of columns when constructing each tree.
colsample_bylevel (Optional[float]) – Subsample ratio of columns for each level.
colsample_bynode (Optional[float]) – Subsample ratio of columns for each split.
reg_alpha (Optional[float]) – L1 regularization term on weights (xgb’s alpha).
reg_lambda (Optional[float]) – L2 regularization term on weights (xgb’s lambda).
scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
base_score (Optional[float]) – The initial prediction score of all instances, global bias.
random_state (Optional[Union[numpy.random.RandomState, int]]) –
Random number seed.
Note
Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.
missing (float, default np.nan) – Value in the data that is to be treated as a missing value.
num_parallel_tree (Optional[int]) – Used for boosting random forest.
monotone_constraints (Optional[Union[Dict[str, int], str]]) – Constraint of variable monotonicity. See tutorial for more information.
interaction_constraints (Optional[Union[str, List[Tuple[str]]]]) – Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. [[0, 1], [2, 3, 4]], where each inner list is a group of indices of features that are allowed to interact with each other. See tutorial for more information.
importance_type (Optional[str]) –
The feature importance type for the feature_importances_ property:
For tree model, it’s either “gain”, “weight”, “cover”, “total_gain” or “total_cover”.
For linear model, only “weight” is defined and it’s the normalized coefficients without bias.
device (Optional[str]) –
New in version 2.0.0.
Device ordinal, available options are cpu, cuda, and gpu.
validate_parameters (Optional[bool]) – Give warnings for unknown parameter.
enable_categorical (bool) –
New in version 1.5.0.
Note
This parameter is experimental
Experimental support for categorical data. When enabled, cudf/pandas.DataFrame should be used to specify categorical data type. Also, JSON/UBJSON serialization format is required.
feature_types (Optional[FeatureTypes]) –
New in version 1.7.0.
Used for specifying feature types without constructing a dataframe. See DMatrix for details.
max_cat_to_onehot (Optional[int]) –
New in version 1.6.0.
Note
This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When the number of categories is less than the threshold, one-hot encoding is chosen; otherwise the categories will be partitioned into children nodes. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
max_cat_threshold (Optional[int]) –
New in version 1.7.0.
Note
This parameter is experimental
Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, enable_categorical needs to be set to have categorical feature support. See Categorical Data and Parameters for Categorical Feature for details.
multi_strategy (Optional[str]) –
New in version 2.0.0.
Note
This parameter is a work in progress.
The strategy used for training multi-target models, including multi-target regression and multi-class classification. See /tutorials/multioutput for more information.
one_output_per_tree: One model for each target.
multi_output_tree: Use multi-target trees.
eval_metric (Optional[Union[str, List[str], Callable]]) –
New in version 1.6.0.
Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See doc/parameter.rst), one of the metrics in sklearn.metrics, or any other user defined metric that looks like sklearn.metrics.
If custom objective is also provided, then custom metric should implement the corresponding reverse link function.
Unlike the scoring parameter commonly used in scikit-learn, when a callable object is provided, it’s assumed to be a cost function and by default XGBoost will minimize the result during early stopping.
For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see xgboost.callback.EarlyStopping.
See Custom Objective and Evaluation Metric for more.
Note
This parameter replaces eval_metric in fit() method. The old one receives un-transformed prediction regardless of whether custom objective is being used.
    import xgboost as xgb
    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error

    X, y = load_diabetes(return_X_y=True)
    reg = xgb.XGBRegressor(
        tree_method="hist",
        eval_metric=mean_absolute_error,
    )
    reg.fit(X, y, eval_set=[(X, y)])
early_stopping_rounds (Optional[int]) –
New in version 1.6.0.
Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in fit().
If early stopping occurs, the model will have two additional attributes: best_score and best_iteration. These are used by the predict() and apply() methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the iteration_range in these inference methods. In addition, other utilities like model plotting can also use the entire model.
If you prefer to discard the trees after best_iteration, consider using the callback function xgboost.callback.EarlyStopping.
If there’s more than one item in eval_set, the last entry will be used for early stopping. If there’s more than one metric in eval_metric, the last metric will be used for early stopping.
Note
This parameter replaces early_stopping_rounds in fit() method.
callbacks (Optional[List[TrainingCallback]]) –
List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.
Note
States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy.
    import xgboost as xgb

    # parameters_grid, custom_rates, X and y are assumed to be defined by the caller
    for params in parameters_grid:
        # be sure to (re)initialize the callbacks before each run
        callbacks = [xgb.callback.LearningRateScheduler(custom_rates)]
        reg = xgb.XGBRegressor(**params, callbacks=callbacks)
        reg.fit(X, y)
kwargs (dict, optional) –
Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here. Attempting to set a parameter via the constructor args and **kwargs dict simultaneously will result in a TypeError.
Note
**kwargs unsupported by scikit-learn
**kwargs is unsupported by scikit-learn. We do not guarantee that parameters passed via this argument will interact properly with scikit-learn.
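The early_stopping_rounds rule described in the parameter list above can be sketched in plain Python as follows. This is a simplified illustration, not XGBoost's code: the helper name run_with_early_stopping is hypothetical, the metric is assumed to be minimised, and scores stands in for the per-round validation metric computed on the last eval_set entry.

```python
def run_with_early_stopping(scores, early_stopping_rounds):
    """Return (best_iteration, best_score, rounds_run) for a minimised metric.

    Training stops once the metric has failed to improve for
    early_stopping_rounds consecutive rounds.
    """
    best_score = float("inf")
    best_iteration = -1
    for iteration, score in enumerate(scores):
        if score < best_score:
            # Improvement: remember it and reset the patience window.
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            # No improvement within the window: stop after this round.
            return best_iteration, best_score, iteration + 1
    return best_iteration, best_score, len(scores)


# Hypothetical validation scores, one per boosting round.
validation_scores = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]
result = run_with_early_stopping(validation_scores, early_stopping_rounds=3)
```

In this sketch the rounds after best_iteration are still counted in rounds_run, mirroring the note above that trees built after the best iteration remain part of the model.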
- __init__(*, distribution: str = None, natural_gradient: bool = True, objective: str = None, **kwargs: Any) None [source]
Methods
__init__(*[, distribution, ...])
apply(X[, iteration_range]) – Return the predicted leaf every tree for each sample.
evals_result() – Return the evaluation results.
fit(X, y, *[, sample_weight, eval_set, ...]) – Fit gradient boosting distribution model.
get_booster() – Get the underlying xgboost Booster of this model.
get_metadata_routing() – Get metadata routing of this object.
get_num_boosting_rounds() – Gets the number of xgboost boosting rounds.
get_params([deep]) – Get parameters.
get_xgb_params() – Get xgboost specific parameters.
load_model(fname) – Load the model from a file or bytearray.
predict(X[, validate_features, iteration_range]) – Predict all params of distribution of each X example.
save_model(fname) – Save the model to a file.
score(X, y[, sample_weight]) – Return the coefficient of determination of the prediction.
set_fit_request(*[, callbacks, ...]) – Request metadata passed to the fit method.
set_params(**params) – Set the parameters of this estimator.
set_predict_request(*[, iteration_range, ...]) – Request metadata passed to the predict method.
set_score_request(*[, sample_weight]) – Request metadata passed to the score method.
Attributes
best_iteration – The best iteration obtained by early stopping.
best_score – The best score obtained by early stopping.
coef_ – Coefficients property
feature_importances_ – Feature importances property, return depends on importance_type parameter.
feature_names_in_ – Names of features seen during fit().
intercept_ – Intercept (bias) property
n_features_in_ – Number of features seen during fit().
- fit(X: Any, y: Any, *, sample_weight: Any | None = None, eval_set: Sequence[Tuple[Any, Any]] | None = None, early_stopping_rounds: int | None = None, verbose: bool | int | None = True, xgb_model: Booster | str | XGBModel | None = None, sample_weight_eval_set: Sequence[Any] | None = None, feature_weights: Any | None = None, callbacks: Sequence[TrainingCallback] | None = None) XGBDistribution [source]
Fit gradient boosting distribution model.
Note that calling fit() multiple times will cause the model object to be re-fit from scratch. To resume training from a previous checkpoint, explicitly pass the xgb_model argument.
- Parameters:
X –
Feature matrix. See Supported data structures for various XGBoost functions for a list of supported types.
When the tree_method is set to hist, internally, the QuantileDMatrix will be used instead of the DMatrix for conserving memory. However, this has performance implications when the device of input data is not matched with algorithm. For instance, if the input is a numpy array on CPU but cuda is used for training, then the data is first processed on CPU then transferred to GPU.
y – Labels
sample_weight – instance weights
eval_set – A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed. Validation metrics will help us track the performance of the model.
early_stopping_rounds (int) –
Deprecated since version 1.6.0: Use early_stopping_rounds in __init__() or set_params() instead.
verbose – If verbose is True and an evaluation set is used, the evaluation metric measured on the validation set is printed to stdout at each boosting stage. If verbose is an integer, the evaluation metric is printed at each verbose boosting stage. The last boosting stage / the boosting stage found by using early_stopping_rounds is also printed.
xgb_model – File name of a stored XGBoost model, or a Booster instance, to be loaded before training (allows training continuation).
sample_weight_eval_set – A list of the form [L_1, L_2, …, L_n], where each L_i is an array like object storing instance weights for the i-th validation set.
feature_weights – Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a ValueError is thrown.
callbacks –
Deprecated since version 1.6.0: Use callbacks in __init__() or set_params() instead.
- load_model(fname: str | bytearray | PathLike) None [source]
Load the model from a file or bytearray. The path to the file can be local or a URI.
The model is loaded from XGBoost format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be loaded when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
    model.load_model("model.json")
    # or
    model.load_model("model.ubj")
- Parameters:
fname – Input file name or memory buffer (see also save_raw)
- predict(X: Any, validate_features: bool = True, iteration_range: Tuple[int, int] | None = None) Tuple[ndarray] [source]
Predict all params of distribution of each X example.
- Parameters:
X (ArrayLike) – Feature matrix.
validate_features (bool) – When this is True, validate that the Booster’s and data’s feature_names are identical. Otherwise, it is assumed that the feature_names are the same.
iteration_range – Specifies which layers of trees are used in prediction. For example, if a random forest is trained with 100 rounds, specifying iteration_range=(10, 20) means that only the forests built during rounds [10, 20) (a half-open interval) are used in this prediction.
- Returns:
predictions – A namedtuple of the distribution parameters. Each parameter is a numpy array of shape (n_samples,).
- Return type:
namedtuple
- save_model(fname: str | PathLike) None [source]
Save the model to a file.
The model is saved in an XGBoost internal format which is universal among the various XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as feature_names) will not be saved when using binary format. To save those attributes, use JSON/UBJ instead. See Model IO for more info.
    model.save_model("model.json")
    # or
    model.save_model("model.ubj")
- Parameters:
fname – Output file name
- set_fit_request(*, callbacks: bool | None | str = '$UNCHANGED$', early_stopping_rounds: bool | None | str = '$UNCHANGED$', eval_set: bool | None | str = '$UNCHANGED$', feature_weights: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$', sample_weight_eval_set: bool | None | str = '$UNCHANGED$', verbose: bool | None | str = '$UNCHANGED$', xgb_model: bool | None | str = '$UNCHANGED$') XGBDistribution
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
callbacks (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for callbacks parameter in fit.
early_stopping_rounds (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for early_stopping_rounds parameter in fit.
eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for eval_set parameter in fit.
feature_weights (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feature_weights parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
sample_weight_eval_set (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight_eval_set parameter in fit.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in fit.
xgb_model (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for xgb_model parameter in fit.
- Returns:
self – The updated object.
- Return type:
XGBDistribution
- set_predict_request(*, iteration_range: bool | None | str = '$UNCHANGED$', validate_features: bool | None | str = '$UNCHANGED$') XGBDistribution
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
iteration_range (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for iteration_range parameter in predict.
validate_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for validate_features parameter in predict.
- Returns:
self – The updated object.
- Return type:
XGBDistribution
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') XGBDistribution
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.