sklearn.ensemble.HistGradientBoostingRegressor

class sklearn.ensemble.HistGradientBoostingRegressor(loss='least_squares', *, learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=255, monotonic_cst=None, warm_start=False, early_stopping='auto', scoring='loss', validation_fraction=0.1, n_iter_no_change=10, tol=1e-07, verbose=0, random_state=None)

[source]

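Histogram-based Gradient Boosting Regression Tree.

This estimator is much faster than GradientBoostingRegressor for big datasets (n_samples >= 10 000).

This estimator has native support for missing values (NaNs). During training, the tree grower learns at each split point whether samples with missing values should go to the left or the right child, based on the potential gain. When predicting, samples with missing values are assigned to the left or right child accordingly. If no missing values were encountered for a given feature during training, samples with missing values are mapped to whichever child has the most samples.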
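The missing-value handling described above can be tried with a minimal sketch like the following. The synthetic data, the 20% missing rate and the variable names are assumptions made for illustration only. The `try`/`except` around the experimental import is also an assumption, added so the sketch still runs on releases that may not ship that module.

```python
import numpy as np

try:
    # Needed on the release documented here, where the estimator is experimental.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
except ImportError:
    # Assumption: other releases may not provide this module.
    pass
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 0.1 * rng.normal(size=200)

# Blank out roughly 20% of the first feature; no imputation step is applied.
X_missing = X.copy()
X_missing[rng.rand(200) < 0.2, 0] = np.nan

est = HistGradientBoostingRegressor(random_state=0).fit(X_missing, y)

# Prediction also accepts rows that contain NaN. The second feature never
# had missing values during training, so a NaN there follows the child
# with the most samples at each split.
X_new = np.array([[np.nan, 0.0], [1.0, np.nan]])
pred = est.predict(X_new)
print(pred)
```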

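This implementation is inspired by LightGBM.

Note: this estimator is still experimental for now: the predictions and the API might change without any deprecation cycle. To use it, you need to explicitly import enable_hist_gradient_boosting: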

>>> # explicitly require this experimental feature
>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> # now you can import normally from ensemble
>>> from sklearn.ensemble import HistGradientBoostingRegressor

Read more in the User Guide.

New in version 0.21.

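Parameter	Description
loss	{'least_squares', 'least_absolute_deviation', 'poisson'}, optional (default='least_squares') The loss function to use in the boosting process. Note that the "least squares" and "poisson" losses actually implement "half least squares loss" and "half poisson deviance" to simplify the computation of the gradient. Furthermore, the "poisson" loss internally uses a `log-link` and requires `y >= 0`.
learning_rate	float, optional (default=0.1) The learning rate, also known as shrinkage. It is used as a multiplicative factor for the leaf values. Use `1` for no shrinkage.
max_iter	int, optional (default=100) The maximum number of iterations of the boosting process, i.e. the maximum number of trees.
max_leaf_nodes	int or None, optional (default=31) The maximum number of leaves for each tree. Must be strictly greater than 1. If None, there is no maximum limit.
max_depth	int or None, optional (default=None) The maximum depth of each tree. The depth of a tree is the number of edges from the root to the deepest leaf. Depth is not constrained by default.
min_samples_leaf	int, optional (default=20) The minimum number of samples per leaf. For small datasets with fewer than a few hundred samples, it is recommended to lower this value, since otherwise only very shallow trees would be built.
l2_regularization	float, optional (default=0) The L2 regularization parameter. Use 0 for no regularization (default).
max_bins	int, optional (default=255) The maximum number of bins to use for non-missing values. Before training, each feature of the input array X is binned into integer-valued bins, which allows for a much faster training stage. Features with a small number of unique values may use fewer than `max_bins` bins. In addition to the `max_bins` bins, one more bin is always reserved for missing values. Must be no larger than 255.
monotonic_cst	array-like of int of shape (n_features), default=None Indicates the monotonic constraint to enforce on each feature. 1, -1 and 0 respectively correspond to a positive (increasing) constraint, a negative (decreasing) constraint and no constraint. Read more in the User Guide.
warm_start	bool, optional (default=False) When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble. For results to be valid, the estimator should be re-trained on the same data only. See the Glossary.
early_stopping	'auto' or bool (default='auto') If 'auto', early stopping is enabled if the sample size is larger than 10000. If True, early stopping is enabled, otherwise early stopping is disabled.
scoring	str or callable or None, optional (default='loss') Scoring parameter to use for early stopping. It can be a single string (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions). If None, the estimator's default scorer is used. If scoring='loss', early stopping is checked with respect to the loss value. Only used if early stopping is performed.
validation_fraction	int or float or None, optional (default=0.1) Proportion (or absolute size) of training data to set aside as validation data for early stopping. If None, early stopping is done on the training data. Only used if early stopping is performed.
n_iter_no_change	int, optional (default=10) Used to determine when to "early stop". The fitting process is stopped when none of the last `n_iter_no_change` scores are better than the `n_iter_no_change - 1`-th-to-last one, up to some tolerance. Only used if early stopping is performed.
tol	float or None, optional (default=1e-7) The absolute tolerance to use when comparing scores during early stopping. The higher the tolerance, the more likely we are to stop early: a higher tolerance means that it is harder for subsequent iterations to be considered an improvement upon the reference score.
verbose	int, optional (default=0) The verbosity level. If not zero, print some information about the fitting process.
random_state	int, np.random.RandomState instance or None, optional (default=None) Pseudo-random number generator to control the subsampling in the binning process, and the train/validation data split if early stopping is enabled. Pass an int for reproducible output across multiple function calls. See the Glossary.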
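As an illustration of the `monotonic_cst` parameter, the sketch below constrains the prediction to increase with the first feature and decrease with the second. The synthetic data, the sweep grids and all variable names are assumptions chosen for the example. The guarded experimental import is an assumption for releases that may not ship that module.

```python
import numpy as np

try:
    # Needed on the release documented here, where the estimator is experimental.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
except ImportError:
    # Assumption: other releases may not provide this module.
    pass
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(500, 2))
# The target rises with feature 0 and falls with feature 1, plus noise.
y = 2 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=500)

# 1 = increasing constraint on feature 0, -1 = decreasing on feature 1.
est = HistGradientBoostingRegressor(monotonic_cst=[1, -1], random_state=0)
est.fit(X, y)

# Sweep one feature at a time while holding the other fixed at 0.
grid = np.linspace(-1, 1, 50)
pred_f0 = est.predict(np.column_stack([grid, np.zeros(50)]))
pred_f1 = est.predict(np.column_stack([np.zeros(50), grid]))

print(np.all(np.diff(pred_f0) >= 0), np.all(np.diff(pred_f1) <= 0))
```

Despite the noise in the target, the swept predictions should never move against the requested direction.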

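Attribute	Description
n_iter_	int The number of iterations as selected by early stopping, depending on the `early_stopping` parameter. Otherwise it corresponds to `max_iter`.
n_trees_per_iteration_	int The number of trees built at each iteration. For regressors, this is always 1.
train_score_	ndarray, shape (n_iter_+1,) The scores at each iteration on the training data. The first entry is the score of the ensemble before the first iteration. Scores are computed according to the scoring parameter. If scoring is not 'loss', scores are computed on a subset of at most 10 000 samples. Empty if no early stopping.
validation_score_	ndarray, shape (n_iter_+1,) The scores at each iteration on the held-out validation data. The first entry is the score of the ensemble before the first iteration. Scores are computed according to the scoring parameter. Empty if no early stopping or if `validation_fraction` is None.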
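The following sketch shows how the early-stopping parameters relate to the fitted attributes listed above. The dataset from `make_regression` and the specific parameter values are assumptions picked for illustration. The guarded experimental import is an assumption for releases that may not ship that module.

```python
try:
    # Needed on the release documented here, where the estimator is experimental.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
except ImportError:
    # Assumption: other releases may not provide this module.
    pass
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

# Hold out 20% of the training data and stop once 5 consecutive
# iterations fail to improve the validation loss (up to `tol`).
est = HistGradientBoostingRegressor(
    max_iter=200,
    early_stopping=True,
    validation_fraction=0.2,
    n_iter_no_change=5,
    random_state=0,
).fit(X, y)

print(est.n_iter_, est.n_trees_per_iteration_)
print(est.train_score_.shape, est.validation_score_.shape)

# Without early stopping, every one of the max_iter iterations is run.
est_full = HistGradientBoostingRegressor(
    max_iter=20, early_stopping=False, random_state=0
).fit(X, y)
print(est_full.n_iter_)
```

When early stopping triggers, `n_iter_` is smaller than `max_iter`, and both score arrays hold one entry per iteration plus the initial score.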

>>> # To use this experimental feature, we need to explicitly ask for it:
>>> from sklearn.experimental import enable_hist_gradient_boosting  # noqa
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> from sklearn.datasets import load_diabetes
>>> X, y = load_diabetes(return_X_y=True)
>>> est = HistGradientBoostingRegressor().fit(X, y)
>>> est.score(X, y)
0.92...

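Methods

Method	Description
`fit`(X, y[, sample_weight])	Fit the gradient boosting model.
`get_params`([deep])	Get parameters for this estimator.
`predict`(X)	Predict values for X.
`score`(X, y[, sample_weight])	Return the coefficient of determination R^2 of the prediction.
`set_params`(**params)	Set the parameters of this estimator.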

__init__(loss='least_squares', *, learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_bins=255, monotonic_cst=None, warm_start=False, early_stopping='auto', scoring='loss', validation_fraction=0.1, n_iter_no_change=10, tol=1e-07, verbose=0, random_state=None)

[source]

Initialize self. See help(type(self)) for accurate signature.

fit(X, y, sample_weight=None)

[source]

Fit the gradient boosting model.

Parameter	Description
X	array-like of shape (n_samples, n_features) The input samples.
y	array-like of shape (n_samples,) Target values.
sample_weight	array-like of shape (n_samples,) default=None Weights of training data.

Returns	Description
self	object
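A minimal sketch of calling `fit` with `sample_weight` follows. The data and the weighting scheme (the second half of the rows counted three times as much) are hypothetical. The guarded experimental import is an assumption for releases that may not ship that module.

```python
import numpy as np

try:
    # Needed on the release documented here, where the estimator is experimental.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
except ImportError:
    # Assumption: other releases may not provide this module.
    pass
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=300)

# Hypothetical weights: one weight per training row, shape (n_samples,).
sample_weight = np.ones(300)
sample_weight[150:] = 3.0

est = HistGradientBoostingRegressor(random_state=0)
# fit returns the estimator itself, so calls can be chained.
fitted = est.fit(X, y, sample_weight=sample_weight)
print(fitted is est)
```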

get_params(deep=True)

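[source]

Get parameters for this estimator.

Parameter	Description
deep	bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns	Description
params	mapping of string to any Parameter names mapped to their values.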
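For instance, the returned mapping can be inspected as in this short sketch. The chosen parameter values are arbitrary. The guarded experimental import is an assumption for releases that may not ship that module.

```python
try:
    # Needed on the release documented here, where the estimator is experimental.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
except ImportError:
    # Assumption: other releases may not provide this module.
    pass
from sklearn.ensemble import HistGradientBoostingRegressor

est = HistGradientBoostingRegressor(learning_rate=0.05, max_iter=50)

# Constructor arguments come back keyed by name, including untouched defaults.
params = est.get_params()
print(params["learning_rate"], params["max_iter"], params["max_leaf_nodes"])
```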

predict(X)

[source]

Predict values for X.

Parameter	Description
X	array-like, shape (n_samples, n_features) The input samples.

Returns	Description
y	ndarray, shape (n_samples,) The predicted values.

score(X, y, sample_weight=None)

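Return the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.

Parameter	Description
X	array-like of shape (n_samples, n_features) Test samples. For some estimators this may instead be a precomputed kernel matrix or a list of generic objects, with shape = (n_samples, n_samples_fitting), where n_samples_fitting is the number of samples used in the fitting for the estimator.
y	array-like of shape (n_samples,) or (n_samples, n_outputs) True values for X.
sample_weight	array-like of shape (n_samples,), default=None Sample weights.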
Returns	Description
score	float R^2 of self.predict(X) wrt. y.
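The R^2 definition can be checked by hand against `score`, as in the sketch below. It reuses the bundled diabetes dataset from the example earlier on this page; the variable names are illustrative. The guarded experimental import is an assumption for releases that may not ship that module.

```python
import numpy as np

try:
    # Needed on the release documented here, where the estimator is experimental.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
except ImportError:
    # Assumption: other releases may not provide this module.
    pass
from sklearn.datasets import load_diabetes
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = load_diabetes(return_X_y=True)
est = HistGradientBoostingRegressor(random_state=0).fit(X, y)
y_pred = est.predict(X)

u = ((y - y_pred) ** 2).sum()    # residual sum of squares
v = ((y - y.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1 - u / v
r2_method = est.score(X, y)

# A constant model predicting y.mean() has u == v, hence R^2 == 0.
r2_constant = 1 - ((y - y.mean()) ** 2).sum() / v

print(np.isclose(r2_manual, r2_method), r2_constant)
```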

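Notes:

The R^2 score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 on, to stay consistent with the default value of r2_score. This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).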

set_params(**params)

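[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.

Parameter	Description
**params	dict Estimator parameters.

Returns	Description
self	object Estimator instance.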
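A short sketch of both uses follows: setting a parameter directly on the estimator, and setting it through a pipeline, where a nested parameter is addressed as the step name, a double underscore, and the parameter name. The pipeline layout and the step names `scale` and `hgb` are hypothetical. The guarded experimental import is an assumption for releases that may not ship that module.

```python
try:
    # Needed on the release documented here, where the estimator is experimental.
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
except ImportError:
    # Assumption: other releases may not provide this module.
    pass
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simple estimator: parameters are addressed by their plain names.
est = HistGradientBoostingRegressor().set_params(max_iter=50)

# Nested object: "hgb" is a hypothetical step name chosen for this example.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("hgb", HistGradientBoostingRegressor()),
])
returned = pipe.set_params(hgb__learning_rate=0.05, hgb__max_leaf_nodes=15)

print(est.max_iter, pipe.named_steps["hgb"].learning_rate)
```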

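Examples using sklearn.ensemble.HistGradientBoostingRegressor

Monotonic Constraints

Gradient Boosting regression

Partial Dependence Plots

Release Highlights for scikit-learn 0.23

Poisson regression and non-normal loss

Combine predictors using stacking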