sklearn.cluster.dbscan

sklearn.cluster.dbscan(X, eps=0.5, *, min_samples=5, metric='minkowski', metric_params=None, algorithm='auto', leaf_size=30, p=2, sample_weight=None, n_jobs=None)

[source]

Perform DBSCAN clustering from a vector array or distance matrix.

Read more in the User Guide.

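Parameter	Description
X	{array-like, sparse (CSR) matrix} of shape (n_samples, n_features) or (n_samples, n_samples) A feature array, or an array of distances between samples if metric='precomputed'.
eps	float, default=0.5 The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances between points within a cluster. It is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
min_samples	int, default=5 The number of samples (or total weight) in a neighborhood for a point to be considered a core point. This includes the point itself.
metric	str or callable, default='minkowski' The metric to use when calculating the distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by `sklearn.metrics.pairwise_distances` for its metric parameter. If metric is 'precomputed', X is assumed to be a distance matrix and must be square. X may be a sparse graph (see Glossary), in which case only "nonzero" elements may be considered neighbors for DBSCAN.
metric_params	dict, default=None Additional keyword arguments for the metric function.
algorithm	{'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto' The algorithm used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See the NearestNeighbors module documentation for details.
leaf_size	int, default=30 Leaf size passed to `BallTree` or `KDTree`. This can affect the speed of construction and querying, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
p	float, default=2 The power of the Minkowski metric used to calculate the distance between points.
sample_weight	array-like of shape (n_samples,), default=None Weight of each sample, such that a sample with a weight of at least `min_samples` is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute and default to 1.
n_jobs	int or None, default=None The number of parallel jobs to run for the neighbors search. `None` means 1 unless in a `joblib.parallel_backend` context. `-1` means using all processors. See Glossary for more details.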

Returns	Description
core_samples	ndarray of shape (n_core_samples,) Indices of core samples.
labels	ndarray of shape (n_samples,) Cluster labels for each point. Noisy samples are given the label -1.
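
As a minimal sketch of how the parameters and return values fit together, the snippet below runs `dbscan` on a small toy array. The data and the `min_samples=3` setting are made up for illustration and are not taken from the scikit-learn examples.

```python
import numpy as np
from sklearn.cluster import dbscan

# Toy data (assumed for illustration): two tight groups and one distant outlier.
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])

# The function returns a tuple: indices of core samples and one label per point.
core_samples, labels = dbscan(X, eps=0.5, min_samples=3)

print(core_samples)  # indices of the six points that have >= 3 samples within eps
print(labels)        # the outlier at index 6 is labelled -1 (noise)
```

Every point in the two groups has all three group members (itself included) within `eps`, so all six are core samples; the isolated point has only itself in its neighborhood and is reported as noise.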

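Notes

For an example, see examples/cluster/plot_dbscan.py.

This implementation bulk-computes all neighborhood queries, which increases the memory complexity to O(n.d), where d is the average number of neighbors, whereas the original DBSCAN has memory complexity O(n). Depending on the `algorithm`, querying these nearest neighborhoods may incur an even higher memory complexity.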

One way to avoid the query complexity is to pre-compute sparse neighborhoods in chunks using `NearestNeighbors.radius_neighbors_graph` with `mode='distance'`, and then use `metric='precomputed'` here.
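
The chunked pre-computation can be sketched roughly as follows. The toy data, the chunk size of 3, and the variable names are assumptions chosen for illustration; in practice the chunk size would be tuned to the available memory.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import dbscan
from sklearn.neighbors import NearestNeighbors

# Toy data (assumed for illustration): two tight groups and one distant outlier.
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])
eps = 0.5
chunk_size = 3  # hypothetical value; pick one that fits your memory budget

# Index the full data once, then query it one block of rows at a time.
nn = NearestNeighbors(radius=eps).fit(X)
blocks = [
    nn.radius_neighbors_graph(X[start:start + chunk_size], mode="distance")
    for start in range(0, X.shape[0], chunk_size)
]

# Stack the row blocks into one sparse (n_samples, n_samples) distance graph.
graph = sparse.vstack(blocks, format="csr")

# Only the stored entries of the sparse graph are treated as candidate neighbors.
core_samples, labels = dbscan(graph, eps=eps, min_samples=3, metric="precomputed")
print(labels)
```

The radius used to build the graph must be at least as large as the `eps` passed to `dbscan`, since pairs farther apart than the radius are absent from the graph and can never be neighbors.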

Another way to reduce memory and computation time is to remove (near-)duplicate points and use `sample_weight` instead.
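
A rough sketch of the de-duplication idea, restricted to exact duplicates: `numpy.unique` collapses repeated rows, and the repeat counts are passed as `sample_weight`. The toy data and parameter values are assumptions for illustration; handling near-duplicates would need an extra rounding or binning step that is not shown here.

```python
import numpy as np
from sklearn.cluster import dbscan

# Toy data (assumed for illustration): the first point appears three times.
X = np.array([
    [0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
    [0.3, 0.0],
    [5.0, 5.0],
])

# Collapse exact duplicate rows; counts records how often each unique row occurred.
X_unique, inverse, counts = np.unique(
    X, axis=0, return_inverse=True, return_counts=True
)
inverse = np.ravel(inverse)  # keep a 1-D index array regardless of NumPy version

# Cluster the unique rows, letting each one count as many times as it occurred.
_, labels_unique = dbscan(X_unique, eps=0.5, min_samples=4, sample_weight=counts)

# Map the labels of the unique rows back onto the original rows.
labels = labels_unique[inverse]
print(labels)
```

Without the weights, the three unique rows could never reach `min_samples=4` and every point would be labelled noise; with the weights, the neighborhood of the duplicated point carries a total weight of 4, which matches what clustering the full array produces.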

References

Ester, M., H. P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.