2025-01-29 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
In this article, the editor gives a detailed introduction to "Extracting, transforming and selecting features" in Spark MLlib. The content is detailed, the steps are clearly laid out, and the details are handled carefully; I hope this article helps resolve your doubts about the topic.
Table of Contents
Feature Extractors feature extraction
TF-IDF
Word2Vec
CountVectorizer
Feature Transformers feature transformation
Tokenizer word splitter
StopWordsRemover stop word cleanup
n-gram
Binarizer binarization
PCA principal component analysis
PolynomialExpansion polynomial extension
Discrete Cosine Transform (DCT)
StringIndexer string-index transformation
IndexToString Index-string Transformation
OneHotEncoder one-hot encoding
VectorIndexer indexing of vector features
Interaction
Normalizer p-norm normalization
StandardScaler column-wise standardization of the feature matrix (unit variance, optionally zero mean)
MinMaxScaler min-max normalization to [0, 1]
MaxAbsScaler absolute-value normalization to [-1, 1]
Bucketizer box separator
ElementwiseProduct Hadamard product
SQLTransformer SQL transform
VectorAssembler feature vector merging
QuantileDiscretizer quantile discretization
Imputer
Feature Selectors feature selection
VectorSlicer vector selection
RFormula R model formula
ChiSqSelector chi-square feature selection
Locality Sensitive Hashing locality-sensitive hashing (LSH)
Bucketed Random Projection for Euclidean Distance Euclidean distance bucket random projection
MinHash for Jaccard Distance MinHash for Jaccard distance
Feature Transformation feature transformation
Approximate Similarity Join approximate similarity join
Approximate Nearest Neighbor Search approximate nearest neighbor search
LSH Operations
LSH Algorithms
Feature Extractors
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t, d) is the number of times term t appears in document d, while document frequency DF(t, D) is the number of documents that contain term t. If we used term frequency alone to measure importance, it would be easy to overemphasize terms that occur very often but carry little information about the document, e.g. "a", "the", and "of". If a term appears very frequently across the whole corpus, it does not carry special information about any particular document. Inverse document frequency is a numerical measure of how much information a term provides:
IDF(t, D) = log( (|D| + 1) / (DF(t, D) + 1) )
where |D| is the total number of documents in the corpus. Because the logarithm is used, if a term appears in every document, its IDF value becomes 0. Adding 1 to the denominator avoids division by zero for terms that do not appear in the corpus. The TF-IDF measure is simply the product:
TFIDF(t, d, D) = TF(t, d) · IDF(t, D)
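As a concrete illustration of these formulas (plain Python, not Spark code), the following minimal sketch computes TF, the smoothed IDF, and their product over a toy corpus; the corpus and term names are invented for the example:

```python
import math

# Toy corpus: each document is a list of terms (invented example data).
corpus = [
    ["a", "b", "a"],
    ["b", "c"],
    ["a", "c", "c", "c"],
]

def tf(t, d):
    # TF(t, d): number of times term t appears in document d.
    return d.count(t)

def df(t, D):
    # DF(t, D): number of documents in D that contain term t.
    return sum(1 for d in D if t in d)

def idf(t, D):
    # IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), the smoothed form above.
    return math.log((len(D) + 1) / (df(t, D) + 1))

def tfidf(t, d, D):
    # TFIDF(t, d, D) = TF(t, d) * IDF(t, D).
    return tf(t, d) * idf(t, D)

# Term "a" in the first document: TF = 2, DF = 2, so IDF = log(4/3).
print(tfidf("a", corpus[0], corpus))
```

Note how a term that appears in every document gets IDF = log(|D|+1 over |D|+1) = 0, so it contributes nothing to the final score, exactly as the text describes.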
In MLlib, TF-IDF is divided into two parts: TF and IDF, which makes it more flexible.
TF: both HashingTF and CountVectorizer can be used to generate term frequency vectors.
HashingTF is a Transformer that takes sets of terms and converts them into fixed-length feature vectors. In text processing, a "set of terms" is typically a bag of words. HashingTF uses the hashing trick: each raw feature is mapped to an index by applying a hash function (MurmurHash3), and term frequencies are then computed from the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may be mapped to the same index after hashing. To reduce the chance of collision, we can increase the feature dimension, i.e. the number of buckets in the hash table. Because a simple modulo is used to convert the hash value into a column index, it is advisable to use a power of two as the feature dimension; otherwise the features will not be mapped evenly to the columns. The default feature dimension is 2^18 = 262,144. An optional binary toggle parameter controls the frequency counts: when set to true, all non-zero counts are set to 1. This is especially useful for discrete probabilistic models that work with binary counts rather than integer counts.
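A minimal sketch of the hashing trick just described, in plain Python rather than Spark. For determinism, CRC32 stands in for Spark's MurmurHash3, and the tiny power-of-two feature dimension is chosen only for illustration:

```python
import zlib

def hashing_tf(terms, num_features=16, binary=False):
    """Map terms to a fixed-length frequency vector via the hashing trick.

    CRC32 is used here as a stand-in for MurmurHash3. A simple modulo
    turns the hash into a column index, which is why a power-of-two
    num_features spreads terms more evenly across columns.
    """
    vec = [0.0] * num_features
    for t in terms:
        idx = zlib.crc32(t.encode("utf-8")) % num_features
        vec[idx] += 1.0  # accumulate term frequency at the hashed index
    if binary:
        # With the binary toggle, all non-zero counts collapse to 1.0.
        vec = [1.0 if v > 0 else 0.0 for v in vec]
    return vec

v = hashing_tf(["spark", "mllib", "spark"])
print(sum(v))  # total count equals the number of terms: 3.0
```

Distinct terms may still land in the same bucket (a hash collision); increasing num_features reduces that chance, mirroring the trade-off discussed above.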
CountVectorizer converts text documents into vectors of term counts. Refer to CountVectorizer for more details.
IDF: IDF is an Estimator whose fit() method is applied to a dataset to produce an IDFModel. The IDFModel takes feature vectors (generally created by HashingTF or CountVectorizer) and rescales each feature, reducing the weight of features that appear frequently in the corpus.
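To mirror the fit/transform split described above (an Estimator that produces a model, which then rescales vectors), here is a hedged pure-Python sketch, not the Spark API: "fitting" computes per-column document frequencies over term-frequency vectors, and "transforming" rescales a vector by the resulting IDF weights:

```python
import math

def fit_idf(tf_vectors):
    """'Fit' step: compute per-column IDF weights from TF vectors."""
    m = len(tf_vectors)       # number of documents
    n = len(tf_vectors[0])    # feature dimension
    # DF per column: how many documents have a non-zero count there.
    df = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(n)]
    return [math.log((m + 1) / (df[j] + 1)) for j in range(n)]

def transform_idf(tf_vector, idf_weights):
    """'Transform' step: down-weight features frequent in the corpus."""
    return [x * w for x, w in zip(tf_vector, idf_weights)]

# Invented TF vectors; column 1 is non-zero in every document.
tf_vectors = [[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 3.0]]
weights = fit_idf(tf_vectors)
print(transform_idf(tf_vectors[0], weights))
```

Because column 1 appears in every document, its IDF weight is log(4/4) = 0 and the feature is zeroed out after transformation, which is exactly the down-weighting behavior described above.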
Note: spark.ml does not provide a text segmentation tool. We refer users to the Stanford NLP Group and scalanlp/chalk.
This concludes the introduction to "Extracting, transforming and selecting features". To really master the material in this article, you still need to practice with it and use it yourself.