
Example Analysis of Feature Engineering Algorithms in Python Machine Learning


This article presents an example analysis of feature engineering algorithms in Python machine learning. It is easy to understand and well organized, and I hope it helps resolve your doubts. Let me lead you through the topic step by step.

I. Overview of machine learning

Machine learning automatically analyzes data to obtain rules (models) and then uses those rules to make predictions about unknown data.

II. Composition of the dataset

1. Dataset storage

Historical data for machine learning is usually stored in CSV files.

Reasons for not using MySQL:

1. Reading is slow when the data volume is large.

2. The format does not meet the requirements of machine learning tools.

2. Available datasets

Kaggle: a big-data competition platform with 800,000 scientists, real data, and very large datasets.

Kaggle Web site: https://www.kaggle.com/datasets

UCI: about 360 datasets covering science, life, economics, and other fields, with data volumes in the hundreds of thousands.

UCI dataset Web site: http://archive.ics.uci.edu/ml/

Scikit-learn: small bundled datasets, convenient for learning.

Scikit-learn Web site: http://scikit-learn.org/stable/datasets/index.html#datasets

3. The structure of commonly used data sets

Feature values (the conditions used to determine the target value, such as the area of a house) + target value (the quantity to be predicted, such as the house price).

Some datasets can have no target values.
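As an illustration of this structure, here is a minimal sketch (assuming scikit-learn is installed) that loads the bundled iris dataset and shows its feature values and target values:

from sklearn.datasets import load_iris

# Load a small bundled dataset: data holds the feature values,
# target holds the target values.
iris = load_iris()
print(iris.data.shape)      # (150, 4): 150 samples, 4 feature values each
print(iris.target.shape)    # (150,): one target value per sample
print(iris.feature_names)   # names of the four features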

III. Feature Engineering

The process of "transforming the raw data into features that better represent the underlying problem to the prediction model" is called feature engineering; it can improve prediction accuracy on unknown data. If the features are poor, then even a good algorithm is unlikely to produce satisfactory results.

Pandas can be used for data reading and basic data processing.

Sklearn provides more powerful interfaces for processing features.
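For example, a minimal sketch of reading historical data with pandas (the file name houses.csv and the column names area and price are hypothetical):

import pandas as pd

df = pd.read_csv("houses.csv")   # read the historical CSV data
X = df[["area"]]                 # feature values (e.g. house area)
y = df["price"]                  # target value (e.g. house price)
print(X.head())
print(y.head())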

Feature extraction:

Feature extraction API: sklearn.feature_extraction

1. Dictionary data feature extraction

API: sklearn.feature_extraction.DictVectorizer

Dictionary data extraction converts the categorical entries in each dictionary into separate feature columns. If the input is an array that contains categorical features, it must first be converted into dictionary form and then extracted, as shown in the sketch below.
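A minimal sketch of DictVectorizer (the city/temperature records are made-up illustrative data):

from sklearn.feature_extraction import DictVectorizer

data = [{"city": "Beijing", "temperature": 100},
        {"city": "Shanghai", "temperature": 60},
        {"city": "Shenzhen", "temperature": 30}]

# Categorical entries become one-hot feature columns; numeric entries pass through.
dv = DictVectorizer(sparse=False)      # sparse=False returns a dense array
features = dv.fit_transform(data)
print(dv.get_feature_names_out())      # e.g. ['city=Beijing', 'city=Shanghai', 'city=Shenzhen', 'temperature']
print(features)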

2. Text feature extraction: word counts

Class: sklearn.feature_extraction.text.CountVectorizer

Usage:

1. Collect every word that appears in any of the documents, listing each word only once (the vocabulary).

2. For each document, count how many times each vocabulary word appears.

3. Single characters are not counted.

Note: this method does not support Chinese by default. Because the text is split only at spaces and punctuation, a run of Chinese characters is treated as one token, and a single Chinese character is not counted at all. Chinese text should therefore be segmented into space-separated words first, for example with jieba (install with pip install jieba; use jieba.cut("I am a programmer")).
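A minimal sketch of CountVectorizer, including jieba segmentation for Chinese (the example sentences are made up; the Chinese string simply means "I am a programmer"):

import jieba
from sklearn.feature_extraction.text import CountVectorizer

docs = ["life is short, i like python",
        "life is too long, i dislike python"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)
print(cv.get_feature_names_out())   # the vocabulary: each word listed once
print(counts.toarray())             # per-document counts of each word

# Chinese text must be segmented into space-separated words first.
zh = " ".join(jieba.cut("我是一个程序员"))
print(CountVectorizer().fit_transform([zh]).toarray())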

3. Text feature extraction: tf-idf

The CountVectorizer above cannot down-weight neutral words such as "tomorrow", "noon", or "because", so the tf-idf method can be used instead.

tf (term frequency): the word frequency, computed the same way as in CountVectorizer.

idf (inverse document frequency): log(total number of documents / number of documents containing the word).

tf * idf measures the importance of a word to a document.

Class: sklearn.feature_extraction.text.TfidfVectorizer
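A minimal sketch of TfidfVectorizer on the same kind of made-up sentences:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["life is short, i like python",
        "life is too long, i dislike python"]

# Each word is weighted by tf * idf instead of a raw count,
# so words that appear in most documents are down-weighted.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(weights.toarray())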

4. Feature preprocessing: normalization

Feature preprocessing: converting the data, via specific statistical methods, into the form the algorithm requires.

Feature preprocessing API: sklearn.preprocessing

Normalization API: sklearn.preprocessing.MinMaxScaler

Normalization is used when multiple features are equally important but their value ranges differ widely. However, normalization is easily affected by outliers, so it is not very robust and suits only traditional, accurate, small-data scenarios.
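A minimal sketch of MinMaxScaler on made-up data; each feature column is rescaled into the requested range using its column minimum and maximum:

from sklearn.preprocessing import MinMaxScaler

data = [[90, 2, 10],
        [60, 4, 15],
        [75, 3, 13]]

scaler = MinMaxScaler(feature_range=(0, 1))   # default range is [0, 1]
print(scaler.fit_transform(data))             # each column now lies in [0, 1]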

5. Feature preprocessing: standardization

Transforms the original data so that each feature has a mean of 0 and a standard deviation of 1.

Standardization API: sklearn.preprocessing.StandardScaler

Standardization suits modern, noisy, big-data scenarios and is relatively stable when there are enough samples.
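A minimal sketch of StandardScaler on made-up data, verifying that each column ends up with mean 0 and standard deviation 1:

from sklearn.preprocessing import StandardScaler

data = [[1.0, -1.0, 3.0],
        [2.0, 4.0, 2.0],
        [4.0, 6.0, -1.0]]

scaler = StandardScaler()
scaled = scaler.fit_transform(data)
print(scaled.mean(axis=0))   # approximately 0 for every column
print(scaled.std(axis=0))    # approximately 1 for every column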

6. Feature preprocessing: missing value processing

Imputation: fill missing values with the mean or median of the column (filling is usually done by column).

API: sklearn.impute.SimpleImputer

The missing-value marker in the data defaults to np.nan.
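A minimal sketch of SimpleImputer on made-up data containing an np.nan entry:

import numpy as np
from sklearn.impute import SimpleImputer

data = [[1, 2],
        [np.nan, 3],
        [7, 6]]

# Replace np.nan with the mean of its column; strategy="median" is also available.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imputer.fit_transform(data))   # the nan becomes 4.0, the mean of column 0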

That is the entire content of "Example Analysis of Feature Engineering Algorithms in Python Machine Learning". Thank you for reading! I hope the shared content has been helpful; if you want to learn more, welcome to follow the industry information channel.
