sklearn.feature_extraction.text.CountVectorizer

class sklearn.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

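[source]

Convert a collection of text documents to a matrix of token counts.

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection, then the number of features will be equal to the vocabulary size found by analyzing the data.

Read more in the User Guide.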

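Parameter	Description
input	string {'filename', 'file', 'content'}, default='content' If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze. If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory. Otherwise the input is expected to be a sequence of items that can be of type string or byte.
encoding	string, default='utf-8' If bytes or files are given to analyze, this encoding is used to decode them.
decode_error	{'strict', 'ignore', 'replace'}, default='strict' Instruction on what to do if a byte sequence given to analyze contains characters not of the given `encoding`. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.
strip_accents	{'ascii', 'unicode'}, default=None Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing. Both 'ascii' and 'unicode' use NFKD normalization from `unicodedata.normalize`.
lowercase	bool, default=True Convert all characters to lowercase before tokenizing.
preprocessor	callable, default=None Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable.
tokenizer	callable, default=None Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.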
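stop_words	string {'english'}, list, default=None If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative (see Using stop words). If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if `analyzer == 'word'`. If None, no stop words will be used. `max_df` can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on the intra-corpus document frequency of terms.
token_pattern	string Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
ngram_range	tuple (min_n, max_n), default=(1, 1) The lower and upper boundary of the range of n-values for the different word n-grams or char n-grams to be extracted. All values of n such that `min_n <= n <= max_n` will be used. For example, an `ngram_range` of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
analyzer	string, {'word', 'char', 'char_wb'} or callable, default='word' Whether the features should be made of word n-grams or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input. Changed in version 0.21: since v0.21, if input is 'filename' or 'file', the data is first read from the file and then passed to the given callable analyzer.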
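max_df	float in range [0.0, 1.0] or int, default=1.0 When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
min_df	float in range [0.0, 1.0] or int, default=1 When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not `None`.
max_features	int, default=None If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not `None`.
vocabulary	Mapping or iterable, default=None Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.
binary	bool, default=False If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
dtype	type, default=np.int64 Type of the matrix returned by `fit_transform()` or `transform()`.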
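
As a rough illustration of how the document-frequency thresholds and the binary flag change the result, the sketch below fits a few vectorizers on the same four-sentence corpus used in the Examples section. The variable names are illustrative only, and the vocabulary is read from the `vocabulary_` attribute so that the sketch does not depend on a particular feature-name accessor.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# min_df=2 (an integer) keeps only terms that occur in at least 2 documents.
frequent = CountVectorizer(min_df=2).fit(corpus)
frequent_terms = sorted(frequent.vocabulary_)

# max_df=0.75 (a float) drops terms that occur in more than 75% of documents.
rare = CountVectorizer(max_df=0.75).fit(corpus)
rare_terms = sorted(rare.vocabulary_)

# binary=True clips every non-zero count to 1.
counts = CountVectorizer().fit_transform(corpus)
flags = CountVectorizer(binary=True).fit_transform(corpus)

print(frequent_terms)             # ['document', 'first', 'is', 'the', 'this']
print(rare_terms)                 # ['and', 'document', 'first', 'one', 'second', 'third']
print(counts.max(), flags.max())  # 2 1
```

Here 'is', 'the' and 'this' appear in all four documents, so a `max_df` below 1.0 removes them much like a corpus-specific stop word list would.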

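Attributes

Attribute	Description
vocabulary_	dict A mapping of terms to feature indices.
fixed_vocabulary_	boolean True if a fixed vocabulary of term-to-index mapping is provided by the user.
stop_words_	set Terms that were ignored because they either: occurred in too many documents (max_df), occurred in too few documents (min_df), or were cut off by feature selection (max_features). This is only available if no vocabulary was given.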
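
The difference between a learned and a user-supplied vocabulary can be sketched as follows. The two-term vocabulary is an arbitrary choice made for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Vocabulary learned from the data.
learned = CountVectorizer()
learned.fit(corpus)

# Vocabulary supplied up front (illustrative choice of terms); only these
# two terms are counted, in the order given.
fixed = CountVectorizer(vocabulary=["document", "first"])
X_fixed = fixed.fit_transform(corpus)

print(learned.fixed_vocabulary_)   # False
print(fixed.fixed_vocabulary_)     # True
print(fixed.vocabulary_)           # {'document': 0, 'first': 1}
print(X_fixed.toarray().tolist())  # [[1, 1], [2, 0], [0, 0], [1, 1]]
```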

See also:

HashingVectorizer, TfidfVectorizer

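Notes

The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.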
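
A minimal sketch of the advice above, assuming the fitted vectorizer only needs to transform documents after it is reloaded. The `hasattr` guard is a defensive assumption of this sketch, so that the code still runs if the installed release does not expose the attribute.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer(min_df=2).fit(corpus)
expected = vectorizer.transform(corpus).toarray()

# Drop the introspection-only attribute before serializing.
# Guarded with hasattr (assumption: it may be absent in some releases).
if hasattr(vectorizer, "stop_words_"):
    vectorizer.stop_words_ = None

payload = pickle.dumps(vectorizer)
restored = pickle.loads(payload)

# The restored vectorizer still produces the same counts.
same_counts = bool((restored.transform(corpus).toarray() == expected).all())
print(same_counts)  # True
```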

Examples

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
>>> vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
>>> X2 = vectorizer2.fit_transform(corpus)
>>> print(vectorizer2.get_feature_names())
['and this', 'document is', 'first document', 'is the', 'is this',
'second document', 'the first', 'the second', 'the third', 'third one',
'this document', 'this is', 'this the']
>>> print(X2.toarray())
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]

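Methods

Method	Description
`build_analyzer`()	Return a callable that handles preprocessing, tokenization and n-grams generation.
`build_preprocessor`()	Return a function to preprocess the text before tokenization.
`build_tokenizer`()	Return a function that splits a string into a sequence of tokens.
`decode`(doc)	Decode the input into a string of unicode symbols.
`fit`(raw_documents[, y])	Learn a vocabulary dictionary of all tokens in the raw documents.
`fit_transform`(raw_documents[, y])	Learn the vocabulary dictionary and return the document-term matrix.
`get_feature_names`()	Array mapping from feature integer indices to feature name.
`get_params`([deep])	Get parameters for this estimator.
`get_stop_words`()	Build or fetch the effective stop words list.
`inverse_transform`(X)	Return terms per document with nonzero entries in X.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(raw_documents)	Transform documents to a document-term matrix.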

__init__(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

[source]

Initialize self. See help(type(self)) for accurate signature.

build_analyzer()

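[source]

Return a callable that handles preprocessing, tokenization and n-grams generation.

Returns	Description
analyzer	callable A function to handle preprocessing, tokenization and n-grams generation.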
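
The returned callable can be applied to a single string to inspect the features the vectorizer would extract. The sketch below assumes word unigrams and bigrams; the input sentence is taken from the Examples section.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams and bigrams with the default word analyzer.
vectorizer = CountVectorizer(ngram_range=(1, 2))
analyze = vectorizer.build_analyzer()

features = analyze("This is the first document.")
print(features)
# ['this', 'is', 'the', 'first', 'document',
#  'this is', 'is the', 'the first', 'first document']
```

The text is lowercased and stripped of punctuation before the n-grams are built, which is why the trailing period does not appear in any feature.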

build_preprocessor()

[source]

Return a function to preprocess the text before tokenization.

Returns	Description
preprocessor	callable A function to preprocess the text before tokenization.
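
As a small sketch of what the preprocessing stage does, the example below combines the default lowercasing with `strip_accents='unicode'`. The sample string is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Lowercasing is on by default; accent stripping is enabled explicitly.
preprocess = CountVectorizer(strip_accents="unicode").build_preprocessor()

cleaned = preprocess("Café NOËL")
print(cleaned)  # cafe noel
```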

build_tokenizer()

[source]

Return a function that splits a string into a sequence of tokens.

Returns	Description
tokenizer	callable A function to split a string into a sequence of tokens.
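
The effect of the default `token_pattern` is easiest to see on a short string. In the sketch below, the second pattern is an illustrative alternative (not a library default) that also keeps single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default pattern: tokens of two or more word characters.
tokenize = CountVectorizer().build_tokenizer()
tokens = tokenize("I have a 2nd cat!")
print(tokens)  # ['have', '2nd', 'cat']

# Illustrative alternative pattern that keeps one-character tokens too.
tokenize_all = CountVectorizer(token_pattern=r"(?u)\b\w+\b").build_tokenizer()
all_tokens = tokenize_all("I have a 2nd cat!")
print(all_tokens)  # ['I', 'have', 'a', '2nd', 'cat']
```

Note that the tokenizer itself does not lowercase; that happens in the preprocessing step.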

decode(doc)

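[source]

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.

Parameter	Description
doc	str The string to decode.

Returns	Description
doc	str A string of unicode symbols.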
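
The sketch below shows how the `encoding` and `decode_error` parameters influence decoding when bytes are passed in. The byte string is an arbitrary Latin-1 example that is not valid UTF-8.

```python
from sklearn.feature_extraction.text import CountVectorizer

raw = b"caf\xe9"  # 'café' encoded as Latin-1; invalid as UTF-8

# Decode with the matching encoding.
latin = CountVectorizer(encoding="latin-1")
text = latin.decode(raw)
print(text)  # café

# Default encoding (utf-8), but undecodable bytes are silently dropped.
lenient = CountVectorizer(decode_error="ignore")
stripped = lenient.decode(raw)
print(stripped)  # caf

# Default settings ('utf-8', 'strict') raise on the invalid byte.
try:
    CountVectorizer().decode(raw)
    raised = False
except UnicodeDecodeError:
    raised = True
print(raised)  # True
```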

fit(raw_documents, y=None)

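[source]

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameter	Description
raw_documents	iterable An iterable which yields either str, unicode or file objects.

Returns	Description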
self

fit_transform(raw_documents, y=None)

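[source]

Learn the vocabulary dictionary and return the document-term matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameter	Description
raw_documents	iterable An iterable which yields either str, unicode or file objects.

Returns	Description
X	sparse matrix of shape (n_samples, n_features) Document-term matrix.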
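
The stated equivalence can be checked with a short sketch that compares the one-step and two-step results on the corpus from the Examples section.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

one_step = CountVectorizer().fit_transform(corpus)
two_step = CountVectorizer().fit(corpus).transform(corpus)

# Compare the dense forms; fine for a toy corpus, avoid on large data.
same = bool((one_step.toarray() == two_step.toarray()).all())

print(one_step.shape)  # (4, 9)
print(same)            # True
```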

get_feature_names()

[source]

Array mapping from feature integer indices to feature name.

Returns	Description
feature_names	list A list of feature names.

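get_params(deep=True)

[source]

Get parameters for this estimator.

Parameter	Description
deep	bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns	Description
params	mapping of string to any Parameter names mapped to their values.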

get_stop_words()

[source]

Build or fetch the effective stop words list.

Returns	Description
stop_words	list or None A list of stop words.
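
The three cases of the `stop_words` parameter can be sketched as below. In the releases this sketch was written against, the returned collection behaves like a set rather than a list, so the example sorts it or tests membership instead of relying on a particular container type.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Custom list: the same words come back as a set-like collection.
custom = CountVectorizer(stop_words=["the", "is"])
custom_stop_words = custom.get_stop_words()
print(sorted(custom_stop_words))  # ['is', 'the']

# No stop words configured.
no_stop_words = CountVectorizer().get_stop_words()
print(no_stop_words)  # None

# Built-in English list.
english = CountVectorizer(stop_words="english").get_stop_words()
print("the" in english)  # True
```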

inverse_transform(X)

[source]

Return terms per document with nonzero entries in X.

Parameter	Description
X	{array-like, sparse matrix} of shape (n_samples, n_features) Document-term matrix.

Returns	Description
X_inv	list of arrays of shape (n_samples,) List of arrays of terms.
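
A brief sketch of mapping a document-term matrix back to terms. The terms are sorted before printing because this sketch makes no assumption about the order within each returned array.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# One array of terms per document; counts are not recovered, only presence.
terms_per_doc = vectorizer.inverse_transform(X)
first_doc_terms = sorted(terms_per_doc[0].tolist())

print(len(terms_per_doc))  # 4
print(first_doc_terms)     # ['document', 'first', 'is', 'the', 'this']
```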

set_params(**params)

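[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> (with a double underscore), so that it is possible to update each component of a nested object.

Parameter	Description
**params	dict Estimator parameters.

Returns	Description
self	object Estimator instance.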
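
Both forms of parameter update can be sketched as follows. The pipeline and its step names ("vect", "clf") are hypothetical choices for this illustration; any names work as long as the same prefix is used before the double underscore.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Simple estimator: parameters are set by their plain names.
vectorizer = CountVectorizer()
vectorizer.set_params(ngram_range=(1, 2), min_df=2)
print(vectorizer.get_params()["ngram_range"])  # (1, 2)

# Nested object: <component>__<parameter>, with hypothetical step names.
pipeline = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
pipeline.set_params(vect__binary=True, clf__alpha=0.5)
print(pipeline.named_steps["vect"].binary)  # True
print(pipeline.named_steps["clf"].alpha)    # 0.5
```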

transform(raw_documents)

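[source]

Transform documents to a document-term matrix.

Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Parameter	Description
raw_documents	iterable An iterable which yields either str, unicode or file objects.

Returns	Description
X	sparse matrix of shape (n_samples, n_features) Document-term matrix.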
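
Because transform reuses an existing vocabulary, tokens that were not seen during fitting are simply ignored. The sketch below illustrates this with a made-up new sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer().fit(corpus)

# 'brand' and 'new' are not in the fitted vocabulary, and 'a' is too short
# for the default token pattern, so none of them contribute a column.
new_counts = vectorizer.transform(["This is a brand new document."])

print(new_counts.shape)               # (1, 9)
print(new_counts.toarray().tolist())  # [[0, 1, 0, 1, 0, 0, 0, 0, 1]]
```

The columns follow the alphabetical vocabulary shown in the Examples section, so the three non-zero entries correspond to 'document', 'is' and 'this'.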

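Examples using sklearn.feature_extraction.text.CountVectorizer

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation

Sample pipeline for text feature extraction and evaluation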