Clustering text documents using k-means
This is an example showing how scikit-learn can be used to cluster documents by topics using a bag-of-words approach. This example uses a scipy.sparse matrix to store the features instead of standard numpy arrays.
Two feature extraction methods can be used in this example:
TfidfVectorizer uses an in-memory vocabulary (a python dict) to map the most frequent words to feature indices and hence compute a word occurrence frequency (sparse) matrix. The word frequencies are then reweighted using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.
HashingVectorizer hashes word occurrences to a fixed-dimensional space, possibly with collisions. The word count vectors are then normalized so that each has an l2-norm equal to one (projected onto the euclidean unit sphere), which seems to be important for k-means to work in high-dimensional space.
HashingVectorizer does not provide IDF weighting, as this is a stateless model (the fit method does nothing). When IDF weighting is needed, it can be added by pipelining its output to a TfidfTransformer instance.
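As a minimal sketch of the two feature extraction routes (the toy corpus and variable names below are illustrative only and not part of this example):
# Sketch (illustrative only): comparing the two feature extraction routes
# on a tiny toy corpus.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

toy_corpus = ["the cat sat on the mat", "the dog chased the cat"]

# In-memory vocabulary with IDF weighting built in.
X_tfidf = TfidfVectorizer().fit_transform(toy_corpus)

# Stateless hashing; IDF weighting added by pipelining into TfidfTransformer.
hashing_tfidf = make_pipeline(
    HashingVectorizer(n_features=2 ** 10, alternate_sign=False, norm=None),
    TfidfTransformer())
X_hashed = hashing_tfidf.fit_transform(toy_corpus)

print(X_tfidf.shape)   # (2, size of the learned vocabulary)
print(X_hashed.shape)  # (2, 1024): fixed dimensionality, independent of the corpus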
Two algorithms are demonstrated, namely ordinary k-means and its more scalable cousin minibatch k-means.
Additionally, latent semantic analysis can also be used to reduce dimensionality and discover latent patterns in the data.
It can be noted that k-means (and minibatch k-means) are very sensitive to feature scaling and that, in this case, the IDF weighting helps improve the quality of the clustering by quite a lot as measured against the "ground truth" provided by the class label assignments of the 20 newsgroups dataset.
This improvement is not visible in the Silhouette Coefficient, which is small for both, as this measure seems to suffer from the phenomenon called "Concentration of Measure" or "Curse of Dimensionality" for high-dimensional datasets such as text data. Other measures, such as V-measure and Adjusted Rand Index, are information-theoretic evaluation scores: as they are based only on cluster assignments rather than distances, they are not affected by the curse of dimensionality.
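As a quick sketch of that last point (the label vectors below are made up and not taken from this example), V-measure and Adjusted Rand Index can be computed from two label assignments alone, with no feature vectors or distances involved:
# Sketch (made-up labels): V-measure and ARI need only the two assignments.
from sklearn import metrics

labels_true = [0, 0, 1, 1, 2, 2]  # hypothetical ground-truth classes
labels_pred = [1, 1, 0, 0, 2, 2]  # hypothetical cluster assignments
print(metrics.v_measure_score(labels_true, labels_pred))      # 1.0, up to label permutation
print(metrics.adjusted_rand_score(labels_true, labels_pred))  # 1.0
# The Silhouette Coefficient, by contrast, also needs the feature vectors
# themselves, e.g. metrics.silhouette_score(X, labels_pred).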
Note: as k-means is optimizing a non-convex objective function, it will likely end up in a local optimum. Several runs with independent random inits might be necessary to get a good convergence.
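A minimal sketch of running several independent random inits (on random data that is only meant as an illustration): KMeans accepts an n_init parameter and internally keeps the run with the lowest inertia. Note that the script below uses n_init=1 for speed.
# Sketch (random data, illustration only): several independent random inits.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 50)

km_1 = KMeans(n_clusters=4, init='random', n_init=1, random_state=0).fit(X_demo)
km_10 = KMeans(n_clusters=4, init='random', n_init=10, random_state=0).fit(X_demo)
print("inertia with a single init:    %.3f" % km_1.inertia_)
print("inertia with best of 10 inits: %.3f" % km_10.inertia_)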
# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
# Lars Buitinck
# License: BSD 3 clause
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
import logging
from optparse import OptionParser
import sys
from time import time
import numpy as np
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
# Parse commandline arguments
op = OptionParser()
op.add_option("--lsa",
dest="n_components", type="int",
help="Preprocess documents with latent semantic analysis.")
op.add_option("--no-minibatch",
action="store_false", dest="minibatch", default=True,
help="Use ordinary k-means algorithm (in batch mode).")
op.add_option("--no-idf",
action="store_false", dest="use_idf", default=True,
help="Disable Inverse Document Frequency feature weighting.")
op.add_option("--use-hashing",
action="store_true", default=False,
help="Use a hashing feature vectorizer")
op.add_option("--n-features", type=int, default=10000,
help="Maximum number of features (dimensions)"
" to extract from text.")
op.add_option("--verbose",
action="store_true", dest="verbose", default=False,
help="Print progress reports inside k-means algorithm.")
print(__doc__)
op.print_help()
def is_interactive():
    return not hasattr(sys.modules['__main__'], '__file__')
# Work-around for Jupyter notebook and IPython console
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)
# #############################################################################
# Load some categories from the training set
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space',
]
# Uncomment the following line to use a larger set (more than 11k documents)
# categories = None
print("Loading 20 newsgroups dataset for categories:")
print(categories)
dataset = fetch_20newsgroups(subset='all', categories=categories,
shuffle=True, random_state=42)
print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print()
labels = dataset.target
true_k = np.unique(labels).shape[0]
print("Extracting features from the training dataset "
"using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    if opts.use_idf:
        # Perform an IDF normalization on the output of HashingVectorizer
        hasher = HashingVectorizer(n_features=opts.n_features,
                                   stop_words='english', alternate_sign=False,
                                   norm=None)
        vectorizer = make_pipeline(hasher, TfidfTransformer())
    else:
        vectorizer = HashingVectorizer(n_features=opts.n_features,
                                       stop_words='english',
                                       alternate_sign=False, norm='l2')
else:
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=opts.use_idf)
X = vectorizer.fit_transform(dataset.data)
print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()
if opts.n_components:
    print("Performing dimensionality reduction using LSA")
    t0 = time()
    # Vectorizer results are normalized, which makes KMeans behave as
    # spherical k-means for better results. Since LSA/SVD results are
    # not normalized, we have to redo the normalization.
    svd = TruncatedSVD(opts.n_components)
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)
    X = lsa.fit_transform(X)
    print("done in %fs" % (time() - t0))
    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))
    print()
# #############################################################################
# Do the actual clustering
if opts.minibatch:
    km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000, batch_size=1000, verbose=opts.verbose)
else:
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                verbose=opts.verbose)
print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
% metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, km.labels_, sample_size=1000))
print()
if not opts.use_hashing:
    print("Top terms per cluster:")
    if opts.n_components:
        original_space_centroids = svd.inverse_transform(km.cluster_centers_)
        order_centroids = original_space_centroids.argsort()[:, ::-1]
    else:
        order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names()
    for i in range(true_k):
        print("Cluster %d:" % i, end='')
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind], end='')
        print()
Output:
Usage: plot_document_clustering.py [options]
Options:
-h, --help show this help message and exit
--lsa=N_COMPONENTS Preprocess documents with latent semantic analysis.
--no-minibatch Use ordinary k-means algorithm (in batch mode).
--no-idf Disable Inverse Document Frequency feature weighting.
--use-hashing Use a hashing feature vectorizer
--n-features=N_FEATURES
Maximum number of features (dimensions) to extract
from text.
--verbose Print progress reports inside k-means algorithm.
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
3387 documents
4 categories
Extracting features from the training dataset using a sparse vectorizer
done in 0.913811s
n_samples: 3387, n_features: 10000
Clustering sparse data with MiniBatchKMeans(batch_size=1000, init_size=1000, n_clusters=4, n_init=1,
verbose=False)
done in 0.082s
Homogeneity: 0.412
Completeness: 0.491
V-measure: 0.448
Adjusted Rand-Index: 0.289
Silhouette Coefficient: 0.006
Top terms per cluster:
Cluster 0: graphics image file thanks files 3d university format gif software
Cluster 1: space nasa henry access digex toronto gov pat alaska shuttle
Cluster 2: com god article don people just sandvik university know think
Cluster 3: sgi keith livesey morality jon solntze wpd caltech objective moral
Total running time of the script: (0 minutes 1.376 seconds)