Many newcomers are unclear about how to approach big data preprocessing. To help with that, this article explains the topic in detail; readers who need it are welcome to follow along, and hopefully you will gain something from it.
Data analysis generally follows two main threads:
The first thread is the data level.
The second thread is the business level.
General steps for data analysis:
Generate Data-> Collect Data-> Store Data-> Extract Data-> Preprocess Data-> Analyze Data-> Visualize Data-> Interpret Data Reports
I. Necessity of data preprocessing
At present, most data mining research focuses on algorithms and neglects data handling, yet data preprocessing is very important for data mining. Mature algorithms place certain requirements on the data sets they process: for example, good data integrity, little redundancy, and low correlation between attributes.
Data preprocessing is therefore an essential part of data mining. For mining algorithms to discover useful knowledge, they must be given clean, accurate, and concise data; however, the data collected in real application systems is usually "dirty".
II. Common problems in the data
Incomplete: missing data values; missing important attributes; only aggregated data is available.
Noisy: contains errors or outliers, e.g. salary = -100.
Inconsistent: differences in coding or naming, e.g. grades previously recorded as "1, 2, 3" and now as "A, B, C"; inconsistencies between duplicate records.
III. Reasons for data problems
Causes of incomplete data:
Values were not available when the data was collected.
Different considerations between the time of data collection and the time of data analysis.
Human, hardware, or software problems.
Causes of noisy data (incorrect values):
Problems with data collection instruments.
Human or computer errors at data entry.
Errors in data transmission.
Causes of data inconsistency:
Different data sources.
Violation of functional dependencies.
IV. Importance of preprocessing
Without high-quality data, there will be no high-quality mining results.
High-quality decisions must rely on high-quality data
For example, duplicate or missing values will produce incorrect or misleading statistics
Data warehouses require consistent integration of high-quality data
PS: Data preprocessing accounts for the largest share of the workload in the data analysis process.
V. Conventional methods of data preprocessing
1 Data cleaning
Remove noise and extraneous data
2 Data integration
Combine data from multiple data sources into a consistent data store
3 Data transformation
Transform raw data into a form suitable for data mining
4 Data reduction
Main methods include data cube aggregation, dimensionality reduction, data compression, numerosity reduction, discretization, and concept hierarchy generation.
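As a minimal sketch of data integration and transformation with pandas (the tables and column names here are hypothetical examples, not from the article):

import numpy as np
import pandas as pd

# Hypothetical data standing in for two separate sources.
orders = pd.DataFrame({"user_id": [1, 2, 3], "amount": [120.0, 85.5, 300.0]})
users = pd.DataFrame({"user_id": [1, 2, 4], "region": ["north", "south", "east"]})

# Data integration: combine both sources into one consistent table.
merged = orders.merge(users, on="user_id", how="left")

# Data transformation: derive a log-scaled column better suited to mining skewed amounts.
merged["log_amount"] = np.log1p(merged["amount"])
print(merged)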
How the preprocessing stage is handled in practice in data analysis:
Analysis at the data level:
Data preprocessing targets null values, missing values, outliers, etc. -> The main processing methods are deletion and filling (generally filling with the median, mean, etc.).
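A minimal pandas sketch of the deletion and filling approaches described above (the DataFrame and its columns are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame({"age": [25, np.nan, 31, 47], "income": [3200.0, 4100.0, np.nan, 5800.0]})

# Deletion: drop rows that contain any missing value.
dropped = df.dropna()

# Filling: replace missing values with each column's median (the mean works the same way).
filled = df.fillna(df.median(numeric_only=True))
print(dropped)
print(filled)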
The logic of preprocessing: the general order is as follows
1. Outliers: if a value deviates from the mean by more than 2 standard deviations, I treat it as an outlier. -> Outliers can also be spotted in a box plot, for example the box plot in datahoop. (A code sketch of steps 1-4 follows after this list.)
You can also leave outliers untreated, but then you should explain why; it mainly depends on their proportion and the actual business situation. Remember one important thing about real data: whatever exists, exists for a reason.
2. Data standardization: scaling the data. First construct the new variables, then standardize them, so that differences in measurement scale do not distort the data-model algorithms.
3. Measurement scale: the magnitude of a variable's scale matters; large fluctuations in an independent variable will affect most data-model algorithms, so the data needs to be standardized. Standardization maps all the data into a common range. -> Z-score formula: z = (x - mean) / standard deviation, where x is the original value of the independent variable.
4. Collinearity: the goal is to reduce dimensionality; collinearity is assessed with the correlation coefficient matrix.
A correlation coefficient below 0.3 is considered weak; between 0.7 and 0.9 it is considered strong.
Before running an algorithm, be sure to look at the correlations.
There are generally two ways to reduce correlation: 1. increase the sample size; 2. construct new variables (increment and ratio methods) -> reduce dimensionality (factor analysis and principal component analysis).
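A minimal sketch of steps 1-4 with pandas and NumPy, using the 2-standard-deviation rule and the z-score formula given above (the data and column names are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical data; x3 is deliberately constructed to be collinear with x1.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(50, 10, 200), "x2": rng.normal(0, 1, 200)})
df["x3"] = 0.9 * df["x1"] + rng.normal(0, 1, 200)

# Step 1: flag values that deviate from the column mean by more than 2 standard deviations.
outlier_mask = (df - df.mean()).abs() > 2 * df.std()
print(outlier_mask.sum())  # number of flagged outliers per column

# Steps 2-3: z-score standardization, z = (x - mean) / standard deviation.
z_scored = (df - df.mean()) / df.std()

# Step 4: check collinearity with the correlation coefficient matrix.
print(df.corr().round(2))  # x1 and x3 should show a strong correlation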
Difference between Principal Component Analysis and Factor Analysis:
Principal component analysis: PCA tries to recombine many correlated original indicators into a new set of uncorrelated composite indicators that replace the original ones. These composite indicators are the principal components. The few principal components retained should preserve as much of the information in the original variables as possible while remaining mutually independent.
Factor analysis is a multivariate statistical method that studies how to decompose many original variables into a few factor variables with as little information loss as possible, and how to make those factor variables highly interpretable.
Factor analysis: rather than recombining the original variables, factor analysis decomposes them into two parts: common factors and specific (unique) factors. Concretely, it looks for a few relatively independent, domain-meaningful factors that cannot be measured directly but that govern the directly measured indicators, so that the state of each factor can be inferred indirectly from those measurements.
Factor analysis explains only part of the variation (the common variance), whereas principal component analysis can explain all of the variation.
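A minimal scikit-learn sketch contrasting the two methods (the data is hypothetical, and standardizing before decomposition is an assumption consistent with the standardization step above):

import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Hypothetical data with some shared structure between columns 0 and 3.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 3] = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=200)
X_std = StandardScaler().fit_transform(X)

# PCA: recombines the variables into uncorrelated components; all components together explain all the variance.
pca = PCA(n_components=2).fit(X_std)
print("PCA explained variance ratio:", pca.explained_variance_ratio_.round(2))

# Factor analysis: models common factors plus a unique (specific) variance for each variable.
fa = FactorAnalysis(n_components=2).fit(X_std)
print("FA loadings:\n", fa.components_.round(2))
print("FA unique variances:", fa.noise_variance_.round(2))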
The preprocessing approach must be complete, and the reason for each preprocessing decision should be stated.
For example, Excel 2016 and above have built-in data analysis functions, such as removing null values.