sklearn.tree.DecisionTreeClassifier

class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)

[source]

A decision tree classifier.

Read more in the User Guide.

Parameter	Description
criterion	{"gini", "entropy"}, default="gini" The function used to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.
splitter	{"best", "random"}, default="best" The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.
max_depth	int, default=None The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than `min_samples_split` samples.
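min_samples_split	int or float, default=2 The minimum number of samples required to split an internal node: · If int, `min_samples_split` is taken as the minimum number. · If float, `min_samples_split` is a fraction and `ceil(min_samples_split * n_samples)` is the minimum number of samples for each split. Changed in version 0.18: added float values for fractions.
min_samples_leaf	int or float, default=1 The minimum number of samples required to be at a leaf node. A split point at any depth is only considered if it leaves at least `min_samples_leaf` training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. · If int, `min_samples_leaf` is taken as the minimum number. · If float, `min_samples_leaf` is a fraction and `ceil(min_samples_leaf * n_samples)` is the minimum number of samples for each node. Changed in version 0.18: added float values for fractions.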
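min_weight_fraction_leaf	float, default=0.0 The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when `sample_weight` is not provided.
max_features	int, float or {"auto", "sqrt", "log2"}, default=None The number of features to consider when looking for the best split: - If int, consider `max_features` features at each split. - If float, `max_features` is a fraction and `int(max_features * n_features)` features are considered at each split. - If "auto", then `max_features = sqrt(n_features)`. - If "sqrt", then `max_features = sqrt(n_features)`. - If "log2", then `max_features = log2(n_features)`. - If None, then `max_features = n_features`. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than `max_features` features.
random_state	int, RandomState instance, default=None Controls the randomness of the estimator. The features are always randomly permuted at each split, even if `splitter` is set to "best". When `max_features < n_features`, the algorithm selects `max_features` features at random at each split before finding the best split among them. However, the best split found may vary across runs even if `max_features = n_features`. That is the case when the improvement of the criterion is identical for several splits and one of them has to be selected at random. To obtain deterministic behaviour during fitting, `random_state` has to be fixed to an integer. See the Glossary for details.
max_leaf_nodes	int, default=None Grow a tree with `max_leaf_nodes` in best-first fashion. Best nodes are defined by the relative reduction in impurity. If None, the number of leaf nodes is unlimited.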
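min_impurity_decrease	float, default=0.0 A node will be split if the split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is: `N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)` where `N` is the total number of samples, `N_t` is the number of samples at the current node, `N_t_L` is the number of samples in the left child, and `N_t_R` is the number of samples in the right child. `N`, `N_t`, `N_t_R` and `N_t_L` all refer to the weighted sum if `sample_weight` is passed. New in version 0.19.
min_impurity_split	float, default=0 Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf. Deprecated since version 0.19: `min_impurity_split` has been deprecated in favor of `min_impurity_decrease` in 0.19. The default value of `min_impurity_split` changed from `1e-7` to `0` in 0.23 and it will be removed in 0.25. Use `min_impurity_decrease` instead.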
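class_weight	dict, list of dict or "balanced", default=None Weights associated with classes in the form `{class_label: weight}`. If None, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y. Note that for multi-output (including multilabel) problems, weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification, weights should be `[{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}]` instead of `[{1: 1}, {2: 5}, {3: 1}, {4: 1}]`. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as `n_samples / (n_classes * np.bincount(y))`. For multi-output, the weights of each column of y will be multiplied. Note that these weights will be multiplied with `sample_weight` (passed through the `fit` method) if `sample_weight` is specified.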
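presort	deprecated, default="deprecated" This parameter is deprecated and will be removed in v0.24. Deprecated since version 0.22.
ccp_alpha	non-negative float, default=0.0 Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than `ccp_alpha` will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.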
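
The weighted impurity decrease used by `min_impurity_decrease` can be checked by hand. The following is a minimal sketch that only mirrors the formula in the table; the helper names `gini` and `weighted_impurity_decrease` and the toy label arrays are illustrative assumptions, not part of the scikit-learn API.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def weighted_impurity_decrease(parent, left, right, n_total):
    """N_t / N * (impurity - N_t_R / N_t * right_impurity
                            - N_t_L / N_t * left_impurity)"""
    n_t, n_l, n_r = len(parent), len(left), len(right)
    return (n_t / n_total) * (
        gini(parent) - (n_r / n_t) * gini(right) - (n_l / n_t) * gini(left)
    )


# Hypothetical node holding 8 of the 10 training samples (unweighted case).
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = parent[:4]   # only class 0, so the left child is pure
right = parent[4:]  # only class 1, so the right child is pure

decrease = weighted_impurity_decrease(parent, left, right, n_total=10)
print(decrease)
```

Here the parent impurity is 0.5 and both children are pure, so the decrease is 0.8 * 0.5 = 0.4; a split like this would be accepted for any `min_impurity_decrease` up to 0.4.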

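Attribute	Description
classes_	ndarray of shape (n_classes,) or list of ndarray The class labels (single-output problem), or a list of arrays of class labels (multi-output problem).
feature_importances_	ndarray of shape (n_features,) The feature importances.
max_features_	int The inferred value of `max_features`.
n_classes_	int or list of int The number of classes (single-output problem), or a list containing the number of classes for each output (multi-output problem).
n_features_	int The number of features when `fit` is performed.
n_outputs_	int The number of outputs when `fit` is performed.
tree_	Tree The underlying Tree object. Refer to `help(sklearn.tree._tree.Tree)` for the attributes of the Tree object, and see Understanding the decision tree structure for basic usage of these attributes.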
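
As an illustration of the attributes above, the sketch below fits a classifier on the iris data set and prints a few of them. The data set and the `random_state` value are arbitrary choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print(clf.classes_)          # class labels seen during fit
print(clf.n_classes_)        # number of classes
print(clf.n_outputs_)        # number of outputs
print(clf.max_features_)     # inferred value of max_features
print(clf.tree_.node_count)  # number of nodes in the underlying Tree
```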

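See also

DecisionTreeRegressor A decision tree regressor.

Notes

The default values of the parameters controlling the size of the tree (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees, which can be very large on some data sets. To reduce memory consumption, the complexity and size of the tree should be controlled by setting those parameter values.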
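
The effect of the size-controlling parameters can be seen by comparing node counts. This is a small sketch on the iris data set; the particular values `max_depth=3` and `min_samples_leaf=5` are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Default parameters: the tree grows until the leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Constrained tree: limited depth and a minimum leaf size.
small = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X, y)

print(full.tree_.node_count, full.get_depth())
print(small.tree_.node_count, small.get_depth())
```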

Examples

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
        0.93...,  0.93...,  1.     ,  0.93...,  1.      ])

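Methods

Method	Description
`apply`(X[, check_input])	Return the index of the leaf that each sample is predicted as.
`cost_complexity_pruning_path`(X, y[, …])	Compute the pruning path during Minimal Cost-Complexity Pruning.
`decision_path`(X[, check_input])	Return the decision path in the tree.
`fit`(X, y[, sample_weight, check_input, …])	Build a decision tree classifier from the training set (X, y).
`get_depth`()	Return the depth of the decision tree.
`get_n_leaves`()	Return the number of leaves of the decision tree.
`get_params`([deep])	Get the parameters of this estimator.
`predict`(X[, check_input])	Predict the class or regression value for X.
`predict_log_proba`(X)	Predict class log-probabilities of the input samples X.
`predict_proba`(X[, check_input])	Predict class probabilities of the input samples X.
`score`(X, y[, sample_weight])	Return the mean accuracy on the given test data and labels.
`set_params`(**params)	Set the parameters of this estimator.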

__init__(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort='deprecated', ccp_alpha=0.0)

[source]

Initialize self. See help(type(self)) for the accurate signature.

apply(X, check_input=True)

[source]

Return the index of the leaf that each sample is predicted as.

New in version 0.17.

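Parameter	Description
X	{array-like, sparse matrix} of shape (n_samples, n_features) The input samples. Internally, it will be converted to `dtype=np.float32` and, if a sparse matrix is provided, to a sparse `csr_matrix`.
check_input	bool, default=True Allows several input checks to be bypassed. Do not use this parameter unless you know what you are doing.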

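Return value	Description
X_leaves	array-like of shape (n_samples,) For each data point x in X, the index of the leaf x ends up in. Leaves are numbered within `[0; self.tree_.node_count)`, possibly with gaps in the numbering.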
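
A short sketch of `apply` on the iris data set follows. The check that every returned index is a leaf relies on the underlying `tree_.children_left` array, where leaves are marked with -1; the data set and `max_depth=2` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

leaf_ids = clf.apply(X)  # one leaf index per sample

# A node is a leaf when it has no left child (children_left == -1).
is_leaf = clf.tree_.children_left[leaf_ids] == -1

print(leaf_ids[:5])
print(np.unique(leaf_ids))  # the leaf indices actually reached; gaps are expected
```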

cost_complexity_pruning_path(X, y, sample_weight=None)

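[source]

Compute the pruning path during Minimal Cost-Complexity Pruning.

See Minimal Cost-Complexity Pruning for details on the pruning process.

Parameter	Description
X	{array-like, sparse matrix} of shape (n_samples, n_features) The training input samples. Internally, it will be converted to `dtype=np.float32` and, if a sparse matrix is provided, to a sparse `csc_matrix`.
y	array-like of shape (n_samples,) or (n_samples, n_outputs) The target values (class labels) as integers or strings.
sample_weight	array-like of shape (n_samples,), default=None Sample weights. If None, samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.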

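Return value	Description
ccp_path	`Bunch` Dictionary-like object with the following attributes.
ccp_alphas	ndarray Effective alphas of the subtrees during pruning.
impurities	ndarray Sum of the impurities of the subtree leaves for the corresponding alpha value in `ccp_alphas`.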
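
One possible way to use the pruning path is to refit one tree per effective alpha and compare their sizes. The sketch below assumes the iris data set; clipping the alphas at zero is a precaution added here against floating-point round-off, not something the method requires.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X, y)

# Precaution (assumption of this sketch): keep alphas non-negative.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)
impurities = path.impurities

# Refit one tree per effective alpha; larger alphas prune more aggressively.
node_counts = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    .fit(X, y)
    .tree_.node_count
    for alpha in ccp_alphas
]

for alpha, impurity, n_nodes in zip(ccp_alphas, impurities, node_counts):
    print(f"alpha={alpha:.5f}  leaf impurity={impurity:.5f}  nodes={n_nodes}")
```

In practice the alpha would usually be selected on held-out data rather than on the training set.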

decision_path(X, check_input=True)

[source]

Return the decision path in the tree.

New in version 0.18.

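Parameter	Description
X	{array-like, sparse matrix} of shape (n_samples, n_features) The input samples. Internally, it will be converted to `dtype=np.float32` and, if a sparse matrix is provided, to a sparse `csr_matrix`.
check_input	bool, default=True Allows several input checks to be bypassed. Do not use this parameter unless you know what you are doing.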

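Return value	Description
indicator	sparse matrix of shape (n_samples, n_nodes) A node indicator CSR matrix in which non-zero elements indicate that the sample goes through the corresponding node.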
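
The indicator matrix can be read row by row to list the nodes a sample visits. This is a minimal sketch on the iris data set; `sample_id` and `max_depth=3` are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

indicator = clf.decision_path(X)  # CSR matrix of shape (n_samples, n_nodes)

sample_id = 0
# Column indices of the non-zero entries in this row are the visited nodes.
start, stop = indicator.indptr[sample_id], indicator.indptr[sample_id + 1]
node_index = indicator.indices[start:stop]

leaf_id = clf.apply(X)[sample_id]
print(node_index, leaf_id)
```

The path starts at the root (node 0) and ends at the leaf reported by `apply` for the same sample.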

property feature_importances_

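Return the feature importances.

The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Warning: impurity-based feature importances can be misleading for high-cardinality features (many unique values). See sklearn.inspection.permutation_importance as an alternative.

Return value	Description
feature_importances_	ndarray of shape (n_features,) Normalized total reduction of the criterion by feature (Gini importance).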
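
The two kinds of importance mentioned above can be compared side by side. The sketch below assumes the iris data set and computes the permutation importance on the training data purely for brevity; `n_repeats=10` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Impurity-based (Gini) importances; they are normalized to sum to 1.
impurity_importances = clf.feature_importances_

# Permutation importances: drop in score when a feature is shuffled.
perm = permutation_importance(clf, X, y, n_repeats=10, random_state=0)

print(impurity_importances)
print(perm.importances_mean)
```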

fit(X, y, sample_weight=None, check_input=True, X_idx_sorted=None)

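[source]

Build a decision tree classifier from the training set (X, y).

Parameter	Description
X	{array-like, sparse matrix} of shape (n_samples, n_features) The training input samples. Internally, it will be converted to `dtype=np.float32` and, if a sparse matrix is provided, to a sparse `csc_matrix`.
y	array-like of shape (n_samples,) or (n_samples, n_outputs) The target values (class labels) as integers or strings.
sample_weight	array-like of shape (n_samples,), default=None Sample weights. If None, samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.
check_input	bool, default=True Allows several input checks to be bypassed. Do not use this parameter unless you know what you are doing.
X_idx_sorted	array-like of shape (n_samples, n_features), default=None The indexes of the sorted training input samples. If many trees are grown on the same data set, this allows the ordering to be cached between trees. If None, the data will be sorted here. Do not use this parameter unless you know what you are doing.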

Return value	Description
self	DecisionTreeClassifier The fitted estimator.
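
A brief sketch of `fit` with `sample_weight` follows. The weighting scheme (class 2 counts three times as much) is a hypothetical choice for illustration; the check on `tree_.weighted_n_node_samples` simply shows that the root node sees the weighted total.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical weighting: samples of class 2 count three times as much.
sample_weight = np.where(y == 2, 3.0, 1.0)

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
fitted = clf.fit(X, y, sample_weight=sample_weight)  # fit returns the estimator itself

# The root node holds all samples, so its weighted count is the sum of weights.
print(clf.tree_.n_node_samples[0], clf.tree_.weighted_n_node_samples[0])
```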

get_depth()

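[source]

Return the depth of the decision tree.

The depth of a tree is the maximum distance between the root and any leaf.

Return value	Description
self.tree_.max_depth	int The maximum depth of the tree.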

get_n_leaves()

[source]

Return the number of leaves of the decision tree.

Return value	Description
self.tree_.n_leaves	int The number of leaves.
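
`get_depth` and `get_n_leaves` can be cross-checked against the underlying Tree object. This is a minimal sketch assuming the iris data set and `max_depth=3`; counting leaves through `children_left == -1` is an illustration, not the method's documented implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

depth = clf.get_depth()
n_leaves = clf.get_n_leaves()

# Cross-check: leaves are the nodes without a left child.
n_leaves_from_tree = int(np.sum(clf.tree_.children_left == -1))

print(depth, n_leaves, n_leaves_from_tree)
```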

get_params(deep=True)

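[source]

Get the parameters of this estimator.

Parameter	Description
deep	bool, default=True If True, return the parameters of this estimator and of contained subobjects that are estimators.

Return value	Description
params	mapping of string to any Parameter names mapped to their values.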
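
The sketch below reads the parameters back with `get_params` and then changes one with `set_params`; the specific values are arbitrary.

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=3, random_state=0)

params = clf.get_params()  # dict of constructor parameters
print(params["max_depth"], params["criterion"])

# set_params updates the estimator in place and returns it.
clf.set_params(max_depth=5)
print(clf.get_params()["max_depth"])
```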

predict(X, check_input=True)

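[source]

Predict the class or regression value for X.

For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value based on X is returned.

Parameter	Description
X	{array-like, sparse matrix} of shape (n_samples, n_features) The input samples. Internally, it will be converted to `dtype=np.float32` and, if a sparse matrix is provided, to a sparse `csr_matrix`.
check_input	bool, default=True Allows several input checks to be bypassed. Do not use this parameter unless you know what you are doing.

Return value	Description
y	array-like of shape (n_samples,) or (n_samples, n_outputs) The predicted classes, or the predicted values.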

predict_log_proba(X)

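[source]

Predict class log-probabilities of the input samples X.

Parameter	Description
X	{array-like, sparse matrix} of shape (n_samples, n_features) The input samples. Internally, it will be converted to `dtype=np.float32` and, if a sparse matrix is provided, to a sparse `csr_matrix`.

Return value	Description
proba	ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1 The class log-probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.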

predict_proba(X, check_input=True)

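[source]

Predict class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same class in a leaf.

Parameter	Description
X	{array-like, sparse matrix} of shape (n_samples, n_features) The input samples. Internally, it will be converted to `dtype=np.float32` and, if a sparse matrix is provided, to a sparse `csr_matrix`.
check_input	bool, default=True Allows several input checks to be bypassed. Do not use this parameter unless you know what you are doing.

Return value	Description
proba	ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1 The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.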
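
The relationship between `predict`, `predict_proba` and `predict_log_proba` can be illustrated as follows. This sketch assumes the iris data set and a shallow tree; the `np.errstate` guard is added here because pure leaves give zero probabilities, whose logarithm is `-inf`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

proba = clf.predict_proba(X)  # class fractions in the leaf each sample reaches
pred = clf.predict(X)         # predicted class labels

# Silence the divide-by-zero warning for probabilities equal to zero.
with np.errstate(divide="ignore"):
    log_proba = clf.predict_log_proba(X)

# Columns follow the order of clf.classes_.
pred_from_proba = clf.classes_[proba.argmax(axis=1)]

print(proba[:3])
print(pred[:3], pred_from_proba[:3])
```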

score(X, y, sample_weight=None)

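[source]

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that, for each sample, every label be predicted correctly.

Parameter	Description
X	array-like of shape (n_samples, n_features) Test samples.
y	array-like of shape (n_samples,) or (n_samples, n_outputs) True labels for X.
sample_weight	array-like of shape (n_samples,), default=None Sample weights.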
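
For a single-output problem, `score` should agree with `sklearn.metrics.accuracy_score` applied to the predictions. The following is a small sketch on a held-out split of the iris data set; the split and `random_state` values are assumptions of the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

mean_accuracy = clf.score(X_test, y_test)
reference = accuracy_score(y_test, clf.predict(X_test))

print(mean_accuracy, reference)
```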