
Example Analysis of Python Machine Learning Algorithms and Data Dimensionality Reduction


This article walks through Python machine learning algorithms and data dimensionality reduction with worked examples. It is fairly detailed and should be a useful reference; interested readers are encouraged to read it through!

I. Data dimensionality reduction

In machine learning, the dimension is the number of features, and dimensionality reduction means reducing that number. The two main dimensionality reduction methods are feature selection and principal component analysis.

1. Feature selection

Feature selection is a good way to reduce dimensionality when the following situations occur:

① Redundancy: some features are highly correlated with each other, which wastes computational resources.

② Noise: some features negatively affect the prediction results.

Main feature selection methods: filter (VarianceThreshold) and embedded (regularization, decision trees).

Filter:

sklearn Feature Selection API

sklearn.feature_selection.VarianceThreshold

Note: there is no single best variance threshold; choose it according to the actual effect on your data.
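
As an illustration, here is a minimal sketch of the filter method; the toy matrix and the threshold value are assumptions for demonstration only:

from sklearn.feature_selection import VarianceThreshold

# Toy data: the first and last columns are constant (variance 0).
X = [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]

# threshold=0.0 (the default) drops features whose variance is not above 0.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
print(X_reduced)  # the two constant columns are removed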

2. Principal Component Analysis (PCA)

API: sklearn.decomposition.PCA

Principal component analysis reduces the dimensionality of the original data while losing as little information as possible. When the number of features reaches the hundreds, PCA should be considered. It can be used to reduce the number of features before regression or cluster analysis.

PCA syntax:

The n_components parameter is usually given as a decimal between 0 and 1, representing how much information to retain; for example, 0.95 means 95% of the information is kept. Values between 0.9 and 0.95 are typical.
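
A minimal sketch of the syntax, assuming a small toy matrix; n_components=0.95 keeps enough principal components to retain 95% of the information:

from sklearn.decomposition import PCA

data = [[2, 8, 4, 5],
        [6, 3, 0, 8],
        [5, 4, 9, 1]]

# Keep enough components to retain 95% of the variance.
pca = PCA(n_components=0.95)
result = pca.fit_transform(data)
print(result)                          # the dimension-reduced data
print(pca.explained_variance_ratio_)  # information retained per component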

3. A typical dimensionality reduction workflow

For example, suppose we want to study the relationship between users and purchase categories. The data is stored in several CSV tables, and the two required fields, "users" and "purchase categories", live in different tables. The following procedure can be used:

1. Inspect the keys of each table and merge tables that share a key, using pd.merge(table1, table2, on=key), where both tables contain the same key column. After several merges, the two target fields end up in a single table.

2. Build a cross table with users as rows and item categories as columns: pd.crosstab(merged["users"], merged["item_category"]).

3. Reduce the dimensionality of the cross table with PCA(n_components=0.9), which keeps 90% of the useful information while greatly reducing the number of features, and output the reduced data (see the sketch below).
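
A sketch of the whole procedure; the file names (orders.csv, products.csv) and column names (user_id, product_id, category_id) are assumptions standing in for the real tables:

import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical tables: orders links users to products,
# products links products to categories.
orders = pd.read_csv("orders.csv")      # columns: user_id, product_id
products = pd.read_csv("products.csv")  # columns: product_id, category_id

# Step 1: merge on the shared key so users and categories are in one table.
merged = pd.merge(orders, products, on="product_id")

# Step 2: cross table with users as rows and item categories as columns.
cross = pd.crosstab(merged["user_id"], merged["category_id"])

# Step 3: reduce dimensionality while retaining 90% of the information.
pca = PCA(n_components=0.9)
reduced = pca.fit_transform(cross)
print(reduced.shape)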

II. Machine Learning Development Process

1. Classification of machine learning algorithms

Data Type:

Discrete: values cannot be subdivided within an interval; typical of classification problems.

Continuous: values can be subdivided within an interval; typical of prediction (regression) problems.

Algorithm classification:

Algorithms are generally divided into two categories, supervised learning and unsupervised learning.

① Supervised learning uses feature values plus target values; its algorithms fall into two sub-categories, classification algorithms and regression algorithms.

Classification algorithms: k-nearest neighbor algorithm, Bayesian classification, decision tree and random forest, logistic regression, neural network

Regression algorithms: linear regression, ridge regression

② Unsupervised learning has only feature values; the typical algorithm is clustering: k-means.

2. Machine Learning Development Process

Machine learning development starts with data, which may come from several sources: the company's own data, data from partners, or purchased data.

The specific development process is as follows:

① Clarify the actual problem: based on the data type of the target value, decide what kind of model to build, and determine whether it is a classification problem or a prediction problem.

② Basic data processing: use pandas to handle missing values, merge tables, and so on.

③ Feature engineering: process the data's features (important).

④ Find a suitable algorithm to make predictions.

⑤ Model evaluation: judge the effect. If qualified, deploy the model, typically as an API; if not, change the algorithm, tune parameters, or revisit the feature engineering.

Use of sklearn datasets:

The dataset is usually split before use, with about 75% of the data taken as the training set and 25% as the test set. A 0.8/0.2 split is also possible, but 0.75/0.25 is the most common.

sklearn dataset split API: sklearn.model_selection.train_test_split

The sklearn dataset APIs cover loading small built-in datasets (the return type is a Bunch object), splitting a dataset, fetching large datasets for classification, and loading regression datasets, as sketched below.
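
A sketch of these APIs; the particular datasets (iris, 20 newsgroups, diabetes) are stand-ins for whatever data the task actually needs:

from sklearn.datasets import load_iris, fetch_20newsgroups, load_diabetes
from sklearn.model_selection import train_test_split

# Small built-in classification dataset; the return type is a Bunch,
# a dict-like object with .data, .target, .DESCR, etc.
iris = load_iris()
print(iris.data.shape, iris.target.shape)

# Split the dataset: 25% held out as the test set.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25)

# Large classification dataset (downloaded on first use).
news = fetch_20newsgroups(subset="all")

# An example regression dataset.
diabetes = load_diabetes()
print(diabetes.data.shape)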

III. Transformers and Estimators

1. Transformers

The fit_transform method used in data processing can actually be split into the fit method and the transform method.

fit_transform() = fit() + transform()

If fit_transform() is called directly, the mean and standard deviation of the input data are computed and immediately used to transform that same data, producing the final output.

If you split it apart:

fit(): takes the input data and computes its mean, standard deviation, etc., without doing anything further.

transform(): uses the statistics computed by fit to transform data.

In other words, you can use fit() to learn a standard from one piece of data, and then apply that standard to other data with transform().
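
A minimal sketch with StandardScaler, assuming tiny toy arrays:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit(): learn the mean and standard deviation of this data, nothing more.
scaler.fit([[1., 2.], [3., 4.], [5., 6.]])
print(scaler.mean_)  # [3. 4.]

# transform(): apply the statistics learned by fit() to (possibly other) data.
print(scaler.transform([[1., 2.], [3., 4.]]))

# fit_transform() would do both steps on the same data in one call.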

2. Estimators

An estimator is the API of an already-implemented algorithm; it can be called directly to take in data, make predictions, and so on.

Estimator workflow:

1. Call fit(x_train, y_train) and input the training set

2. Feed in the test set (x_test, y_test) and call different interfaces to get different results:

API ①: y_predict = predict(x_test) returns the algorithm's predicted values for y.

API ②: score(x_test, y_test) returns the prediction accuracy. A sketch of the workflow follows.
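
This sketch uses k-nearest neighbors on the iris dataset as a stand-in for any sklearn estimator:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25)

knn = KNeighborsClassifier()
knn.fit(x_train, y_train)          # 1. feed in the training set

y_predict = knn.predict(x_test)    # API ①: predicted values for y
score = knn.score(x_test, y_test)  # API ②: prediction accuracy
print(y_predict[:5], score)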

The above is "python machine learning algorithm and data dimensionality reduction example analysis" all the content of this article, thank you for reading! Hope to share the content to help everyone, more relevant knowledge, welcome to pay attention to the industry information channel!
