What is the use of Extracting, transforming and selecting features

In this article, the editor gives a detailed introduction to "What is the use of Extracting, transforming and selecting features". The content is detailed, the steps are clear, and the details are handled carefully. I hope this article helps you resolve your doubts.

Table of Contents

Feature Extractors

TF-IDF

Word2Vec

CountVectorizer

Feature Transformers

Tokenizer

StopWordsRemover

n-gram

Binarizer

PCA (principal component analysis)

PolynomialExpansion

Discrete Cosine Transform (DCT)

StringIndexer (string-to-index transformation)

IndexToString (index-to-string transformation)

OneHotEncoder (one-hot encoding)

VectorIndexer (indexing of vector types)

Interaction

Normalizer (p-norm normalization)

StandardScaler (column-wise standardization to zero mean and unit variance)

MinMaxScaler (min-max normalization to [0, 1])

MaxAbsScaler (absolute-value scaling to [-1, 1])

Bucketizer

ElementwiseProduct (Hadamard product)

SQLTransformer (SQL transformation)

VectorAssembler (feature vector assembly)

QuantileDiscretizer (quantile discretization)

Imputer

Feature Selectors

VectorSlicer

RFormula (R model formula)

ChiSqSelector (chi-squared feature selection)

Locality Sensitive Hashing (LSH)

Bucketed Random Projection for Euclidean Distance

MinHash for Jaccard Distance

Feature Transformation

Approximate Similarity Join

Approximate Nearest Neighbor Search

LSH Operations

LSH Algorithms

Feature Extractors

TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method, widely used in text mining, that reflects the importance of a term to a document in a corpus. Let t denote a term, d a document, and D the corpus. Term frequency TF(t, d) is the number of times term t appears in document d, while document frequency DF(t, D) is the number of documents that contain term t. If we used only term frequency to measure importance, it would be easy to overemphasize terms that occur very often but carry little information about the document, e.g. "a", "the", and "of". If a term appears very frequently across the entire corpus, it does not carry special information about any particular document. Inverse document frequency is a numerical measure of how much information a term provides:

IDF(t, D) = log( (|D| + 1) / (DF(t, D) + 1) )

where |D| is the total number of documents in the corpus. Because the log function is used, if a term appears in every document, its IDF value becomes 0. Adding 1 to the denominator (and numerator) avoids division by zero. The TF-IDF measure is simply the product of TF and IDF:

TFIDF(t, d, D) = TF(t, d) · IDF(t, D)
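For concreteness, a small worked example (the numbers are illustrative, not from the original): in a corpus of |D| = 10 documents, a term t that appears in DF(t, D) = 2 documents, and 3 times in document d, gets

IDF(t, D) = log( (10 + 1) / (2 + 1) ) = log(11/3) ≈ 1.30
TFIDF(t, d, D) = 3 × 1.30 ≈ 3.90

whereas a term appearing in all 10 documents gets IDF(t, D) = log(11/11) = 0 and is discounted entirely, no matter how often it occurs in d.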

In MLlib, TF-IDF is divided into two parts: TF and IDF, which makes it more flexible.

TF: both HashingTF and CountVectorizer can be used to generate term frequency vectors.

HashingTF is a Transformer that takes sets of terms and converts them into fixed-length feature vectors; in text processing, a "set of terms" is typically a bag of words. HashingTF uses the hashing trick: each raw feature is mapped to an index by applying a hash function (MurmurHash 3), and term frequencies are then calculated from the mapped indices. This approach avoids computing a global term-to-index map, which can be expensive for a large corpus, but it is subject to hash collisions, where different raw features may be mapped to the same index after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets in the hash table. Since a simple modulo is used to convert the hash value into a column index, it is advisable to use a power of two as the feature dimension; otherwise the features will not be mapped evenly onto the columns. The default feature dimension is 2^18 = 262,144. An optional binary toggle parameter controls the term frequency counts: when set to true, all non-zero frequency counts are set to 1, which is especially useful for discrete probabilistic models that model binary rather than integer counts.
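As a minimal sketch of the points above (the toy corpus and column names are illustrative, not from the original article), the following Scala snippet builds a HashingTF with a power-of-two dimension and the binary toggle enabled:

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("HashingTFSketch").getOrCreate()

// Toy corpus; ids and sentences are made up for illustration.
val df = spark.createDataFrame(Seq(
  (0, "a a b c"),
  (1, "b c d")
)).toDF("id", "sentence")

val words = new Tokenizer()
  .setInputCol("sentence").setOutputCol("words")
  .transform(df)

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 18) // power of two; 2^18 = 262,144 is also the default
  .setBinary(true)         // non-zero counts become 1 (binary bag-of-words)

hashingTF.transform(words).select("rawFeatures").show(truncate = false)
```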

CountVectorizer converts text documents into vectors of term counts. Refer to the CountVectorizer section for more details.
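For comparison, here is a minimal CountVectorizer sketch (toy data and column names are illustrative): unlike HashingTF, it learns an explicit vocabulary from the corpus at fit time.

```scala
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CountVectorizerSketch").getOrCreate()

// Toy pre-tokenized documents.
val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit() builds the vocabulary; vocabSize caps its size and minDF
// drops terms that appear in fewer than that many documents.
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

cvModel.transform(df).select("features").show(truncate = false)
```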

IDF: IDF is an Estimator; applying its fit() method to a dataset produces an IDFModel. The IDFModel takes feature vectors (generally created by HashingTF or CountVectorizer) and scales each feature, down-weighting terms that appear frequently across the corpus.

Note: spark.ml does not provide a text segmentation tool. We refer users to the Stanford NLP Group and scalanlp/chalk.
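Putting the two parts together, here is a sketch of the usual Tokenizer → HashingTF → IDF flow in Scala (the sentences, labels, and column names are illustrative toy data, not from the original article):

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("TfIdfSketch").getOrCreate()

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// TF stage: tokenize, then hash term counts into a fixed-length vector.
val wordsData = new Tokenizer()
  .setInputCol("sentence").setOutputCol("words")
  .transform(sentenceData)

val featurizedData = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures")
  .transform(wordsData)

// IDF stage: fit() scans the corpus for document frequencies and
// produces an IDFModel that down-weights frequently occurring terms.
val idfModel = new IDF()
  .setInputCol("rawFeatures").setOutputCol("features")
  .fit(featurizedData)

idfModel.transform(featurizedData).select("label", "features").show(truncate = false)
```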

That concludes this introduction to "What is the use of Extracting, transforming and selecting features". To truly master the material, you will still need to practice and apply it yourself. If you would like to read more articles like this one, please follow the industry information channel.
