sklearn.decomposition.LatentDirichletAllocation

class sklearn.decomposition.LatentDirichletAllocation(n_components=10, *, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)

[source]

Latent Dirichlet Allocation with online variational Bayes algorithm.

New in version 0.17.

Read more in the User Guide.

Parameter	Description
n_components	int, optional (default=10) Number of topics. Changed in version 0.19: `n_topics` was renamed to `n_components`.
doc_topic_prior	float, optional (default=None) Prior of the document topic distribution `theta`. If the value is None, it defaults to `1 / n_components`. In [1], this is called `alpha`.
topic_word_prior	float, optional (default=None) Prior of the topic word distribution `beta`. If the value is None, it defaults to `1 / n_components`. In [1], this is called `eta`.
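learning_method	'batch' | 'online', default='batch' Method used to update `components_`. Only used in the `fit` method. In general, if the data size is large, the online update will be much faster than the batch update. Valid options: 'batch': batch variational Bayes method. Uses all training data in each EM update; the old `components_` is overwritten in each iteration. 'online': online variational Bayes method. In each EM update, uses a mini-batch of training data to update the `components_` variable incrementally. The learning rate is controlled by the `learning_decay` and `learning_offset` parameters. Changed in version 0.20: the default learning method is now 'batch'.
learning_decay	float, optional (default=0.7) A parameter that controls the learning rate in the online learning method. The value should be in (0.5, 1.0] to guarantee asymptotic convergence. When the value is 0.0 and `batch_size` is `n_samples`, the update method is the same as batch learning. In the literature, this is called kappa.
learning_offset	float, optional (default=10.) A (positive) parameter that downweights early iterations in online learning. It should be greater than 1.0. In the literature, this is called tau_0.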
max_iter	integer, optional (default=10) The maximum number of iterations.
batch_size	int, optional (default=128) Number of documents to use in each EM iteration. Only used in online learning.
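evaluate_every	int, optional (default=-1) How often to evaluate perplexity. Only used in the `fit` method. Set it to 0 or a negative number to not evaluate perplexity during training at all. Evaluating perplexity can help you check convergence during training, but it also increases total training time. Evaluating perplexity in every iteration might increase training time up to two-fold.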
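total_samples	int, optional (default=1e6) Total number of documents. Only used in the `partial_fit` method.
perp_tol	float, optional (default=1e-1) Perplexity tolerance in batch learning. Only used when `evaluate_every` is greater than 0.
mean_change_tol	float, optional (default=1e-3) Stopping tolerance for updating the document topic distribution in the E-step.
max_doc_update_iter	int (default=100) Maximum number of iterations for updating the document topic distribution in the E-step.
n_jobs	int or None, optional (default=None) The number of jobs to use in the E-step. None means 1 unless in a `joblib.parallel_backend` context. `-1` means using all processors. See the Glossary for more details.
verbose	int, optional (default=0) Verbosity level.
random_state	int, RandomState instance, default=None Pass an int for reproducible results across multiple function calls. See the Glossary.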
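
The roles of `learning_decay` (kappa) and `learning_offset` (tau_0) can be illustrated with the step-size schedule described in [1], where mini-batch number t receives the weight (tau_0 + t)^(-kappa). The following is a minimal sketch of that formula only. The helper name `online_step_size` is hypothetical and not part of scikit-learn, and the estimator's internal update may differ in detail.

```python
import numpy as np

def online_step_size(t, learning_offset=10.0, learning_decay=0.7):
    """Weight for mini-batch number t: (tau_0 + t) ** (-kappa)."""
    return np.power(learning_offset + t, -learning_decay)

# Weights for the first five mini-batches with the default parameters.
steps = np.arange(0, 5)
weights = online_step_size(steps)

# With learning_decay=0.0 every weight is 1.0, so each update fully
# replaces the previous estimate (the batch-like case noted above).
flat_weights = online_step_size(steps, learning_decay=0.0)
```

A larger `learning_offset` makes the first weights smaller, so early mini-batches influence the model less.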

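Attribute	Description
components_	array, [n_components, n_features] Variational parameters for the topic word distribution. Since the complete conditional for the topic word distribution is a Dirichlet, `components_[i, j]` can be viewed as a pseudocount that represents the number of times word `j` was assigned to topic `i`. After normalization it can also be viewed as the distribution over the words for each topic: `model.components_ / model.components_.sum(axis=1)[:, np.newaxis]`.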
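n_batch_iter_	int Number of iterations of the EM step.
n_iter_	int Number of passes over the dataset.
bound_	float Final perplexity score on the training set.
doc_topic_prior_	float Prior of the document topic distribution theta. If the value is None, it is 1 / n_components.
topic_word_prior_	float Prior of the topic word distribution beta. If the value is None, it is 1 / n_components.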
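
The normalization of `components_` described above can be sketched as follows. The synthetic count matrix from `make_multilabel_classification` stands in for real document-word counts, and the variable names `topic_word` and `top_features` are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)
model = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# Each row of components_ holds pseudocounts for one topic; dividing by the
# row sum turns it into a probability distribution over the vocabulary.
topic_word = model.components_ / model.components_.sum(axis=1)[:, np.newaxis]

# Column indices of the three highest-probability features for each topic.
top_features = np.argsort(topic_word, axis=1)[:, ::-1][:, :3]
```

With a real vocabulary, the indices in `top_features` would be mapped back to words to label each topic.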

References:

[1] “Online Learning for Latent Dirichlet Allocation”, Matthew D. Hoffman, David M. Blei, Francis Bach, 2010
[2] “Stochastic Variational Inference”, Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley, 2013
[3] Matthew D. Hoffman’s onlineldavb code. Link: https://github.com/blei-lab/onlineldavb

Examples:

>>> from sklearn.decomposition import LatentDirichletAllocation
>>> from sklearn.datasets import make_multilabel_classification
>>> # This produces a feature matrix of token counts, similar to what
>>> # CountVectorizer would produce on text.
>>> X, _ = make_multilabel_classification(random_state=0)
>>> lda = LatentDirichletAllocation(n_components=5,
...     random_state=0)
>>> lda.fit(X)
LatentDirichletAllocation(...)
>>> # get topics for some given samples:
>>> lda.transform(X[-2:])
array([[0.00360392, 0.25499205, 0.0036211 , 0.64236448, 0.09541846],
       [0.15297572, 0.00362644, 0.44412786, 0.39568399, 0.003586  ]])
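
The example above uses synthetic counts in place of text. As a hedged sketch of the text case it alludes to, the following builds the count matrix with `CountVectorizer` and fits the model on it. The four-sentence corpus is invented for this illustration, and two topics is an arbitrary choice.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this illustration.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]

# Sparse document-word count matrix, one row per document.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda_text = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda_text.fit_transform(counts)  # shape (n_docs, n_components)
```

On a corpus this small the topics are not meaningful; the sketch only shows how the pieces connect.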

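Methods:

Method	Description
`fit`(self, X[, y])	Learn model for the data X with the variational Bayes method.
`fit_transform`(self, X[, y])	Fit to data, then transform it.
`get_params`(self[, deep])	Get parameters for this estimator.
`partial_fit`(self, X[, y])	Online VB with mini-batch update.
`perplexity`(self, X[, sub_sampling])	Calculate approximate perplexity for data X.
`score`(self, X[, y])	Calculate approximate log-likelihood as score.
`set_params`(self, **params)	Set the parameters of this estimator.
`transform`(self, X)	Transform data X according to the fitted model.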

__init__(n_components=10, *, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)

[source]

Initialize self. See help(type(self)) for accurate signature.

fit(self, X, y=None)

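[source]

Learn model for the data X with the variational Bayes method.

When `learning_method` is 'online', use mini-batch update. Otherwise, use batch update.

Parameter	Description
X	array-like or sparse matrix, shape=(n_samples, n_features) Document word matrix.
y	Ignored

Returns	Description
self	The fitted estimator.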
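
The difference between the two update modes of `fit` can be sketched by fitting the same data both ways and comparing the iteration counters. This is an illustrative comparison on synthetic counts. The values `batch_size=25` and `max_iter=5` are arbitrary choices for the sketch.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)  # 100 documents

# Batch: every EM update uses all documents.
lda_batch = LatentDirichletAllocation(
    n_components=5, learning_method="batch", max_iter=5, random_state=0
).fit(X)

# Online: every EM update uses a mini-batch of `batch_size` documents.
lda_online = LatentDirichletAllocation(
    n_components=5, learning_method="online", batch_size=25,
    max_iter=5, random_state=0,
).fit(X)

print(lda_batch.n_iter_, lda_online.n_iter_)              # passes over the data
print(lda_batch.n_batch_iter_, lda_online.n_batch_iter_)  # EM update counters
```

The online model performs several EM updates per pass over the data, so its `n_batch_iter_` grows faster than that of the batch model.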

fit_transform(self, X, y=None, **fit_params)

[source]

Fit to data, then transform it.

Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameter	Description
X	{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)
y	ndarray of shape (n_samples,), default=None Target values.
**fit_params	dict Additional fit parameters.

Returns	Description
X_new	ndarray of shape (n_samples, n_features_new) Transformed array.
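
As a sketch of what "fit, then transform" means for this estimator, the following compares a single `fit_transform` call with separate `fit` and `transform` calls on a second estimator built with the same `random_state`. It assumes the default single-process setting, and the names `theta_a` and `theta_b` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)

# One call: fit the model and return the document topic distribution.
theta_a = LatentDirichletAllocation(
    n_components=5, random_state=0
).fit_transform(X)

# Two calls on a separately constructed estimator with the same seed.
lda_b = LatentDirichletAllocation(n_components=5, random_state=0)
theta_b = lda_b.fit(X).transform(X)

same_result = np.allclose(theta_a, theta_b)
```

Here `n_features_new` equals `n_components`, so both arrays have shape (100, 5).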

get_params(self, deep=True)

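[source]

Get parameters for this estimator.

Parameter	Description
deep	bool, default=True If True, return the parameters for this estimator and for contained subobjects that are estimators.

Returns	Description
params	mapping of string to any Parameter names mapped to their values.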

partial_fit(self, X, y=None)

[source]

Online VB with mini-batch update.

Parameter	Description
X	array-like or sparse matrix, shape=(n_samples, n_features) Document word matrix.
y	Ignored

Returns	Description
self	The fitted estimator.
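
A minimal sketch of incremental training with `partial_fit`, assuming the documents arrive in chunks. The chunk size of 20 is arbitrary, and the 100 rows of the synthetic matrix stand in for a stream.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)

# total_samples tells the online update how many documents the full stream
# holds; here the stream is just the 100 rows of X.
lda_stream = LatentDirichletAllocation(
    n_components=5, total_samples=X.shape[0], random_state=0
)

# Feed the documents in chunks of 20, as if they arrived over time.
for start in range(0, X.shape[0], 20):
    lda_stream.partial_fit(X[start:start + 20])

doc_topics = lda_stream.transform(X[:2])
```

Every chunk must have the same number of feature columns, because `components_` is sized on the first call.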

perplexity(self, X, sub_sampling=False)

[source]

Calculate approximate perplexity for data X.

Perplexity is defined as exp(-1. * log-likelihood per word).

Changed in version 0.19: the doc_topic_distr argument has been deprecated and is ignored because the user no longer has access to the unnormalized distribution.

Parameter	Description
X	array-like or sparse matrix, [n_samples, n_features] Document word matrix.
sub_sampling	bool Do sub-sampling or not.

Returns	Description
score	float Perplexity score.
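
The definition above can be checked numerically. The sketch below assumes that `score(X)` returns the same approximate log-likelihood bound that `perplexity` uses internally, and that the total word count is `X.sum()`. Under those assumptions the two values should agree.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

perplexity = lda.perplexity(X)

# Same quantity from the definition: exp(-1 * log-likelihood per word),
# using score(X) as the approximate log-likelihood and X.sum() as the
# total number of words.
per_word_bound = lda.score(X) / X.sum()
manual_perplexity = np.exp(-1.0 * per_word_bound)
```

Lower perplexity corresponds to a higher per-word log-likelihood.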

score(self, X, y=None)

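[source]

Calculate approximate log-likelihood as score.

Parameter	Description
X	array-like or sparse matrix, shape=(n_samples, n_features) Document word matrix.
y	Ignored

Returns	Description
score	float Use approximate bound as score.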
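
One possible use of `score` is comparing candidate values of `n_components` on held-out documents. The following is an illustrative sketch, not a recommended model-selection procedure. The candidate values, the 75/25 split, and the name `held_out_scores` are all choices made for this example.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

X, _ = make_multilabel_classification(random_state=0)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# A higher held-out score means a better approximate log-likelihood.
held_out_scores = {}
for k in (2, 5, 10):
    model = LatentDirichletAllocation(n_components=k, random_state=0)
    held_out_scores[k] = model.fit(X_train).score(X_test)

best_k = max(held_out_scores, key=held_out_scores.get)
```

Since the score is only an approximate bound, small differences between candidates should be read with caution.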

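set_params(self, **params)

[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form `<component>__<parameter>` so that it is possible to update each component of a nested object.

Parameter	Description
**params	dict Estimator parameters.

Returns	Description
self	object Estimator instance.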
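
The nested parameter form can be sketched with a two-step `Pipeline`. The step names "counts" and "lda" are arbitrary labels chosen for this example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# The step names "counts" and "lda" are arbitrary labels chosen here.
pipe = Pipeline([
    ("counts", CountVectorizer()),
    ("lda", LatentDirichletAllocation(random_state=0)),
])

# <component>__<parameter>: update two parameters of the "lda" step.
pipe.set_params(lda__n_components=3, lda__learning_method="online")

# get_params on the step reflects the new values.
lda_params = pipe.named_steps["lda"].get_params()
```

The same double-underscore names are what grid-search utilities expect when tuning a step inside a pipeline.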

transform(self, X)

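[source]

Transform data X according to the fitted model.

Changed in version 0.18: doc_topic_distr is now normalized.

Parameter	Description
X	array-like or sparse matrix, shape=(n_samples, n_features) Document word matrix.

Returns	Description
doc_topic_distr	shape=(n_samples, n_components) Document topic distribution for X.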
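
Because the returned distribution is normalized, each row sums to 1 and can be read as topic proportions for one document. A short sketch on synthetic counts, with `dominant_topic` as an illustrative name:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

doc_topic_distr = lda.transform(X)

# Each row is a probability distribution over the five topics.
row_sums = doc_topic_distr.sum(axis=1)

# Index of the most probable topic for each document.
dominant_topic = doc_topic_distr.argmax(axis=1)
```

Taking the argmax discards the mixture information, so it is best treated as a summary, not a hard clustering.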

Examples using sklearn.decomposition.LatentDirichletAllocation

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation