class sklearn.feature_extraction.text.HashingVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, non_negative=False, dtype=<class 'numpy.float64'>) [source]
Convert a collection of text documents to a matrix of token occurrences
It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected on the Euclidean unit sphere if norm='l2'.
This text vectorizer implementation uses the hashing trick to compute the mapping from token strings to feature integer indices, instead of building and storing a vocabulary.
This strategy has several advantages:

- it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
- it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
- it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

- there is no way to compute the inverse transform (from feature indices to string feature names), which can be a problem when trying to introspect which features are most important to a model
- there can be collisions: distinct tokens can be mapped to the same feature index; in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems); see the sketch after this list
- no IDF weighting, as this would render the transformer stateful
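To make the collision point concrete, here is a minimal sketch that forces collisions by choosing a tiny hash space; norm and alternate_sign are disabled so the raw bucket counts are visible, and the token strings are arbitrary examples:

    from sklearn.feature_extraction.text import HashingVectorizer

    vec = HashingVectorizer(n_features=4, norm=None, alternate_sign=False)
    X = vec.transform(["apple banana cherry date elderberry"])
    # Five distinct tokens hashed into four buckets: by the pigeonhole
    # principle at least two tokens share an index, so some count exceeds 1.
    print(X.toarray())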
The hash function employed is the signed 32-bit version of MurmurHash3.
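As a rough illustration of how a token becomes a column index, here is a minimal sketch using sklearn.utils.murmurhash3_32. The index/sign convention below mirrors the hasher's behavior as commonly described (absolute hash modulo n_features, with the hash's sign used when alternate_sign=True); treat it as an approximation rather than the library's authoritative implementation:

    from sklearn.utils import murmurhash3_32

    def hashed_index(token, n_features=2**20):
        # Signed 32-bit MurmurHash3 of the token (seed 0).
        h = murmurhash3_32(token, seed=0)
        # Column index: absolute hash value modulo the table size.
        # With alternate_sign=True, the sign of the hash sets the value's sign.
        return abs(h) % n_features, 1 if h >= 0 else -1

    print(hashed_index("document", n_features=2**4))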
Read more in the User Guide.
Parameters:

    input : string {'filename', 'file', 'content'}, default='content'
        If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze. If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory. Otherwise the input is expected to be the sequence of items that can be of type string or bytes.
    encoding : string, default='utf-8'
        If bytes or files are given to analyze, this encoding is used to decode.
    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.
    strip_accents : {'ascii', 'unicode', None}
        Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing.
    lowercase : boolean, default=True
        Convert all characters to lowercase before tokenizing.
    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
    stop_words : string {'english'}, list, or None (default)
        If 'english', a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
    token_pattern : string
        Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
    ngram_range : tuple (min_n, max_n), default=(1, 1)
        The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
    analyzer : string, {'word', 'char', 'char_wb'} or callable
        Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input.
    n_features : integer, default=(2 ** 20)
        The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.
    binary : boolean, default=False
        If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.
    alternate_sign : boolean, optional, default=True
        When True, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection.
    non_negative : boolean, optional, default=False
        When True, an absolute value is applied to the features matrix prior to returning it. When used in conjunction with alternate_sign=True, this significantly reduces the inner product preservation property. Deprecated since version 0.19: will be removed in 0.21.
    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().
See also: CountVectorizer, TfidfVectorizer
Examples

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(corpus)
>>> print(X.shape)
(4, 16)
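As a follow-up, a short sketch of what the default norm='l2' and alternate_sign=True imply for the values in X: each row has unit Euclidean norm, and some entries may be negative because the hash assigns alternating signs:

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer

    vec = HashingVectorizer(n_features=2**4)  # defaults: norm='l2', alternate_sign=True
    X = vec.transform(["This is the first document."])
    print(np.isclose(np.linalg.norm(X.toarray()), 1.0))  # True: row lies on the unit sphere
    print((X.toarray() < 0).any())  # may be True: alternate_sign flips some signs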
Methods

build_analyzer() | Return a callable that handles preprocessing and tokenization |
build_preprocessor() | Return a function to preprocess the text before tokenization |
build_tokenizer() | Return a function that splits a string into a sequence of tokens |
decode(doc) | Decode the input into a string of unicode symbols |
fit(X[, y]) | Does nothing: this transformer is stateless. |
fit_transform(X[, y]) | Transform a sequence of documents to a document-term matrix. |
get_params([deep]) | Get parameters for this estimator. |
get_stop_words() | Build or fetch the effective stop words list |
partial_fit(X[, y]) | Does nothing: this transformer is stateless. |
set_params(**params) | Set the parameters of this estimator. |
transform(X) | Transform a sequence of documents to a document-term matrix. |
__init__(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, non_negative=False, dtype=<class 'numpy.float64'>) [source]
build_analyzer() [source]
Return a callable that handles preprocessing and tokenization
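For illustration, a minimal sketch of calling the analyzer directly; the output reflects the default lowercasing and token_pattern (tokens of 2 or more word characters):

    from sklearn.feature_extraction.text import HashingVectorizer

    analyze = HashingVectorizer().build_analyzer()
    # Lowercases the text, then extracts tokens of 2+ word characters.
    print(analyze("Bi-grams? No, unigrams."))  # ['bi', 'grams', 'no', 'unigrams']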
build_preprocessor() [source]
Return a function to preprocess the text before tokenization
build_tokenizer() [source]
Return a function that splits a string into a sequence of tokens
decode(doc) [source]
Decode the input into a string of unicode symbols
The decoding strategy depends on the vectorizer parameters.
Parameters:

    doc : string
        The string to decode.
fit(X, y=None) [source]
Does nothing: this transformer is stateless.
Parameters:

    X : array-like, shape [n_samples, n_features]
        Training data.
fit_transform(X, y=None) [source]
Transform a sequence of documents to a document-term matrix.
Parameters:

    X : iterable over raw text documents, length = n_samples
        Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.

Returns:

    X : scipy.sparse matrix, shape = (n_samples, self.n_features)
        Document-term matrix.
get_params(deep=True) [source]
Get parameters for this estimator.
Parameters:

    deep : boolean, optional
        If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

    params : mapping of string to any
        Parameter names mapped to their values.
get_stop_words() [source]
Build or fetch the effective stop words list
partial_fit(X, y=None) [source]
Does nothing: this transformer is stateless.
This method is just there to mark the fact that this transformer can work in a streaming setup.
Parameters:

    X : array-like, shape [n_samples, n_features]
        Training data.
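Because neither fit nor partial_fit computes any state, documents can be vectorized batch by batch and fed to an incrementally trained estimator. A minimal out-of-core sketch, where the two small in-memory batches stand in for a real document stream:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vec = HashingVectorizer(n_features=2**18)
    clf = SGDClassifier()

    batches = [  # stand-in for a generator reading documents from disk
        (["good movie", "great film"], [1, 1]),
        (["bad plot", "awful acting"], [0, 0]),
    ]
    for texts, labels in batches:
        X = vec.transform(texts)  # stateless: no fitting required
        clf.partial_fit(X, labels, classes=[0, 1])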
set_params(**params) [source]
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns:

    self
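For example, a sketch of updating a nested parameter inside a pipeline; make_pipeline names each step after its lowercased class name, so the vectorizer step is reachable as 'hashingvectorizer':

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    pipe = make_pipeline(HashingVectorizer(), SGDClassifier())
    # <component>__<parameter>: reach into the vectorizer step by name.
    pipe.set_params(hashingvectorizer__n_features=2**10)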
transform(X) [source]
Transform a sequence of documents to a document-term matrix.
Parameters:

    X : iterable over raw text documents, length = n_samples
        Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.

Returns:

    X : scipy.sparse matrix, shape = (n_samples, self.n_features)
        Document-term matrix.