sklearn.cluster.OPTICS

class sklearn.cluster.OPTICS(*, min_samples=5, max_eps=inf, metric='minkowski', p=2, metric_params=None, cluster_method='xi', eps=None, xi=0.05, predecessor_correction=True, min_cluster_size=None, algorithm='auto', leaf_size=30, n_jobs=None)

[source]

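Estimate clustering structure from vector array.

OPTICS (Ordering Points To Identify the Clustering Structure) is closely related to DBSCAN: it finds core samples of high density and expands clusters from them [R2c55e37003fe-1]. Unlike DBSCAN, it keeps the cluster hierarchy for a variable neighborhood radius. It is better suited for use on large datasets than the current sklearn implementation of DBSCAN.

Clusters are then extracted using a DBSCAN-like method (cluster_method = 'dbscan') or the automatic technique proposed in [R2c55e37003fe-1] (cluster_method = 'xi').

This implementation deviates from the original OPTICS by first performing k-nearest-neighbor searches on all points to identify core sizes, and then computing only the distances to unprocessed points when constructing the cluster order. Note that no heap is used to manage the expansion candidates, so the time complexity is O(n^2).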
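As a rough illustration of the two extraction modes described above, the sketch below fits OPTICS once (using the default 'xi' extraction) and then derives DBSCAN-style labels from the stored reachability data with `sklearn.cluster.cluster_optics_dbscan`. The synthetic two-blob data, the random seed, and the `eps=2.0` threshold are assumptions chosen only for this demo.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Assumed toy data: two well-separated Gaussian blobs of 30 points each.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(30, 2)),
    rng.normal(5.0, 0.3, size=(30, 2)),
])

# Fit once; labels_ comes from the default cluster_method='xi'.
optics = OPTICS(min_samples=5).fit(X)
labels_xi = optics.labels_

# Reuse the computed ordering to extract DBSCAN-like clusters at a chosen eps.
labels_db = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=2.0,  # assumed threshold, larger than a blob but smaller than the gap
)
```

Because the reachability ordering is computed only once, several `eps` values can be tried this way without refitting the estimator.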

Read more in the User Guide.

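Parameters	Description
min_samples	int > 1 or float between 0 and 1 (default=5) The number of samples in a neighborhood for a point to be considered a core point. Also, up and down steep regions cannot have more than `min_samples` consecutive non-steep points. Expressed as an absolute number or as a fraction of the number of samples (rounded to be at least 2).
max_eps	float, optional (default=np.inf) The maximum distance between two samples for one to be considered as in the neighborhood of the other. The default value of `np.inf` identifies clusters across all scales; reducing `max_eps` results in shorter run times.
metric	str or callable, optional (default='minkowski') Metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used. If metric is a callable function, it is called on each pair of instances (rows) and the resulting value is recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for SciPy's metrics, but is less efficient than passing the metric name as a string. If metric is 'precomputed', X is assumed to be a distance matrix and must be square. Valid values for metric are: from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']; from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']. See the documentation for scipy.spatial.distance for details on these metrics.
p	int, optional (default=2) Parameter for the Minkowski metric from `sklearn.metrics.pairwise_distances`. When p=1, this is equivalent to using the Manhattan distance (L1); when p=2, it is equivalent to using the Euclidean distance (L2). For arbitrary p, minkowski_distance (l_p) is used.
metric_params	dict, optional (default=None) Additional keyword arguments for the metric function.
cluster_method	str, optional (default='xi') The extraction method used to extract clusters from the computed reachability and ordering. Possible values are 'xi' and 'dbscan'.
eps	float, optional (default=None) The maximum distance between two samples for one to be considered as in the neighborhood of the other. By default it takes the same value as `max_eps`. Used only when `cluster_method='dbscan'`.
xi	float, between 0 and 1, optional (default=0.05) Determines the minimum steepness on the reachability plot that constitutes a cluster boundary. For example, an upwards point in the reachability plot is defined by the ratio from one point to its successor being at most 1-xi. Used only when `cluster_method='xi'`.
predecessor_correction	bool, optional (default=True) Correct clusters according to the predecessors calculated by OPTICS [R2c55e37003fe-2]. This parameter has minimal effect on most datasets. Used only when `cluster_method='xi'`.
min_cluster_size	int > 1 or float between 0 and 1 (default=None) Minimum number of samples in an OPTICS cluster, expressed as an absolute number or as a fraction of the number of samples (rounded to be at least 2). If None, the value of `min_samples` is used instead. Used only when `cluster_method='xi'`.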
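algorithm	{'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto' Algorithm used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. - 'ball_tree' will use `BallTree` - 'kd_tree' will use `KDTree` - 'brute' will use a brute-force search - 'auto' will attempt to decide the most appropriate algorithm based on the values passed to the `fit` method (default).
leaf_size	int, optional (default=30) Leaf size passed to `BallTree` or `KDTree`.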
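n_jobs	int or None, optional (default=None) The number of parallel jobs to run for the neighbors search. `None` means 1 unless in a `joblib.parallel_backend` context. `-1` means using all processors. See the Glossary for more details.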
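The `metric` options above can be sketched as follows: one estimator is given a precomputed square distance matrix, the other is given the raw features plus a metric name. The toy array and the choice of the Manhattan metric are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics import pairwise_distances

X = np.array([[1, 2], [2, 5], [3, 6],
              [8, 7], [8, 8], [7, 3]])

# Square (n_samples, n_samples) matrix of Manhattan distances.
D = pairwise_distances(X, metric="manhattan")

# With metric="precomputed", the distance matrix is passed in place of X.
labels_pre = OPTICS(min_samples=2, metric="precomputed").fit_predict(D)

# Passing the metric name lets OPTICS compute the distances itself.
labels_named = OPTICS(min_samples=2, metric="manhattan").fit_predict(X)
```

Passing a metric name is generally the simpler route; a precomputed matrix is mainly useful when the distances come from elsewhere or are expensive to recompute.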

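Attributes	Description
labels_	array, shape (n_samples,) Cluster labels for each point in the dataset given to fit(). Noisy samples and points that are not included in a leaf cluster of `cluster_hierarchy_` are labeled -1.
reachability_	array, shape (n_samples,) Reachability distance of each sample, indexed by object order. Use `clust.reachability_[clust.ordering_]` to access it in cluster order.
ordering_	array, shape (n_samples,) The cluster-ordered list of sample indices.
core_distances_	array, shape (n_samples,) Distance at which each sample becomes a core point, indexed by object order. Points that will never be core points have a distance of inf. Use `clust.core_distances_[clust.ordering_]` to access it in cluster order.
predecessor_	array, shape (n_samples,) The point that each sample was reached from, indexed by object order. Seed points have a predecessor of -1.
cluster_hierarchy_	array, shape (n_clusters, 2) The list of clusters, one `[start, end]` pair per row, with both indices inclusive. The clusters are sorted by `(end, -start)` (ascending), so that larger clusters encompassing smaller clusters come after those smaller ones. Since the labels do not reflect the hierarchy, usually `len(cluster_hierarchy_) > len(np.unique(optics.labels_))`. Note that these indices refer to `ordering_`, i.e. `X[ordering_][start:end + 1]` forms a cluster. Only available when `cluster_method='xi'`.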
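A minimal sketch of how the fitted attributes relate to each other, assuming the same toy data as the example further down the page: per-sample arrays are indexed by object order and are re-indexed through `ordering_` to obtain cluster order, while each row of `cluster_hierarchy_` is an inclusive slice of `ordering_`.

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[1, 2], [2, 5], [3, 6],
              [8, 7], [8, 8], [7, 3]])
clust = OPTICS(min_samples=2).fit(X)

# Re-index per-sample arrays from object order into cluster order.
reach_in_order = clust.reachability_[clust.ordering_]
core_in_order = clust.core_distances_[clust.ordering_]

# Each hierarchy row is an inclusive [start, end] range over ordering_.
members_per_cluster = []
for start, end in clust.cluster_hierarchy_:
    members = clust.ordering_[start:end + 1]  # sample indices in this cluster
    members_per_cluster.append(members)
    print(start, end, members)
```

Plotting `reach_in_order` against its position gives the reachability plot on which the 'xi' extraction operates.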

See also

DBSCAN

A similar clustering for a specified neighborhood radius (eps). Optimized for run time.

References

[R2c55e37003fe-1] Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. “OPTICS: ordering points to identify the clustering structure.” ACM SIGMOD Record 28, no. 2 (1999): 49-60.

[R2c55e37003fe-2] Schubert, Erich, Michael Gertz. “Improving the Cluster Structure Extracted from OPTICS Plots.” Proc. of the Conference “Lernen, Wissen, Daten, Analysen” (LWDA) (2018): 318-329.

Examples

>>> from sklearn.cluster import OPTICS
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 5], [3, 6],
...               [8, 7], [8, 8], [7, 3]])
>>> clustering = OPTICS(min_samples=2).fit(X)
>>> clustering.labels_
array([0, 0, 0, 1, 1, 1])

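Methods

Method	Description
`fit`(self, X[, y])	Perform OPTICS clustering.
`fit_predict`(self, X[, y])	Perform clustering on X and return cluster labels.
`get_params`(self[, deep])	Get parameters for this estimator.
`set_params`(self, **params)	Set the parameters of this estimator.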

__init__(self, *, min_samples=5, max_eps=inf, metric='minkowski', p=2, metric_params=None, cluster_method='xi', eps=None, xi=0.05, predecessor_correction=True, min_cluster_size=None, algorithm='auto', leaf_size=30, n_jobs=None)

[source]

Initialize self. See help(type(self)) for the accurate signature.

fit(self, X, y=None)

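[source]

Perform OPTICS clustering.

Extracts an ordered list of points and reachability distances, and performs initial clustering using the max_eps distance specified when the OPTICS object was instantiated.

Parameters	Description
X	array, shape (n_samples, n_features), or (n_samples, n_samples) if metric='precomputed' A feature array, or an array of distances between samples if metric='precomputed'.
y	ignored Ignored.

Returns	Description
self	instance of OPTICS The instance.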

fit_predict(self, X, y=None)

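[source]

Perform clustering on X and return cluster labels.

Parameters	Description
X	array-like of shape (n_samples, n_features) Input data.
y	Ignored Not used, present here for API consistency by convention.

Returns	Description
labels	ndarray of shape (n_samples,) Cluster labels.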
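As a small sketch of the relationship between the two methods, `fit_predict` can be read as a shortcut for calling `fit` and then reading `labels_`. The toy data and `min_samples=2` are assumptions taken from the example earlier on this page.

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[1, 2], [2, 5], [3, 6],
              [8, 7], [8, 8], [7, 3]])

# One call: cluster X and return the labels directly.
labels_a = OPTICS(min_samples=2).fit_predict(X)

# Two steps: fit the estimator, then read the labels_ attribute.
labels_b = OPTICS(min_samples=2).fit(X).labels_
```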

get_params(self, deep=True)

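[source]

Get parameters for this estimator.

Parameters	Description
deep	bool, default=True If True, returns the parameters for this estimator and for contained subobjects that are estimators.

Returns	Description
params	mapping of string to any Parameter names mapped to their values.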
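A short sketch of what the returned mapping looks like; the specific values passed to the constructor here are arbitrary assumptions.

```python
from sklearn.cluster import OPTICS

est = OPTICS(min_samples=10, xi=0.1)

# The mapping holds every constructor parameter, including untouched defaults.
params = est.get_params()
print(params["min_samples"], params["xi"], params["cluster_method"])
```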

set_params(self, **params)

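[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that each component of a nested object can be updated.

Parameters	Description
**params	dict Estimator parameters.

Returns	Description
self	object Estimator instance.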
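The `<component>__<parameter>` form can be sketched with a small pipeline. The step names "scale" and "optics" and the use of `StandardScaler` are hypothetical choices for this illustration, not something this page prescribes.

```python
from sklearn.cluster import OPTICS
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simple estimator: parameters are set by name, and the estimator is returned.
est = OPTICS()
returned = est.set_params(min_samples=3)

# Nested object: prefix the parameter with the (hypothetical) step name.
pipe = Pipeline([("scale", StandardScaler()), ("optics", OPTICS())])
pipe.set_params(optics__min_samples=2, optics__xi=0.1)
```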

Examples using sklearn.cluster.OPTICS

Demo of OPTICS clustering algorithm

Comparing different clustering algorithms on toy datasets