Many newcomers are unclear about how to approach big data preprocessing. To help with that, this article explains the topic in detail; readers who need it are welcome to follow along, and hopefully you will gain something from it.
Data analysis generally follows two main threads:
The first thread is the data level.
The second thread is the business level.
General steps for data analysis:
Generate Data-> Collect Data-> Store Data-> Extract Data-> Preprocess Data-> Analyze Data-> Visualize Data-> Interpret Data Reports
I. Necessity of data preprocessing
At present, most data mining research focuses on algorithms and neglects data handling, yet data preprocessing is very important for data mining. Mature algorithms place certain requirements on the data sets they process: for example, good data integrity, little redundancy, and low correlation between attributes.
Data preprocessing is therefore an essential part of data mining. For mining algorithms to discover useful knowledge, they must be given clean, accurate, and concise data; however, the data collected in real application systems is usually "dirty".
II. Common problems in the data
Incomplete: missing data values; missing important attributes; only aggregated data is available.
Noisy: contains errors or outliers, e.g. salary = -100.
Inconsistent: differences in coding or naming, e.g. grades previously recorded as "1, 2, 3" and now as "A, B, C"; inconsistencies between duplicate records.
III. Reasons for data problems
Causes of incomplete data:
Values were not available when the data was collected.
Different considerations between the time of data collection and the time of data analysis.
Human, hardware, or software problems.
Causes of noisy data (incorrect values):
Problems with data collection instruments.
Human or computer errors at data entry.
Errors in data transmission.
Causes of data inconsistency:
Different data sources.
Violation of functional dependencies.
IV. Importance of preprocessing
Without high-quality data, there will be no high-quality mining results.
High-quality decisions must rely on high-quality data
For example, duplicate or missing values will produce incorrect or misleading statistics
Data warehouses require consistent integration of high-quality data
PS: Data preprocessing accounts for the largest share of the workload in the data analysis process.
V. Conventional methods of data preprocessing
1 Data cleaning
Remove noise and extraneous data
2 Data integration
Combine data from multiple data sources into a consistent data store
3 Data transformation
Transform raw data into a form suitable for data mining
4 Data reduction
Main methods include data cube aggregation, dimensionality reduction, data compression, numerosity reduction, discretization, and concept hierarchy generation.
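As a minimal sketch of data integration and transformation with pandas (the tables and column names here are hypothetical examples, not from the article):

import numpy as np
import pandas as pd

# Hypothetical data standing in for two separate sources.
orders = pd.DataFrame({"user_id": [1, 2, 3], "amount": [120.0, 85.5, 300.0]})
users = pd.DataFrame({"user_id": [1, 2, 4], "region": ["north", "south", "east"]})

# Data integration: combine both sources into one consistent table.
merged = orders.merge(users, on="user_id", how="left")

# Data transformation: derive a log-scaled column better suited to mining skewed amounts.
merged["log_amount"] = np.log1p(merged["amount"])
print(merged)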
How the preprocessing stage is handled in practice in data analysis:
Analysis at the data level:
Data preprocessing targets null values, missing values, outliers, etc. -> The main processing methods are deletion and filling (generally filling with the median, mean, etc.).
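A minimal pandas sketch of the deletion and filling approaches described above (the DataFrame and its columns are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame({"age": [25, np.nan, 31, 47], "income": [3200.0, 4100.0, np.nan, 5800.0]})

# Deletion: drop rows that contain any missing value.
dropped = df.dropna()

# Filling: replace missing values with each column's median (the mean works the same way).
filled = df.fillna(df.median(numeric_only=True))
print(dropped)
print(filled)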
The logic of preprocessing: the general order is as follows
1. Outliers: if a value deviates from the mean by more than 2 standard deviations, I treat it as an outlier. -> Outliers can also be spotted in a box plot, for example the box plot in datahoop. (A code sketch of steps 1-4 follows after this list.)
You can also leave outliers untreated, but then you should explain why; it mainly depends on their proportion and the actual business situation. Remember one important thing about real data: whatever exists, exists for a reason.
2. Data standardization: scaling the data. First construct the new variables, then standardize them, so that differences in measurement scale do not distort the data-model algorithms.
3. Measurement scale: the magnitude of a variable's scale matters; large fluctuations in an independent variable will affect most data-model algorithms, so the data needs to be standardized. Standardization maps all the data into a common range. -> Z-score formula: z = (x - mean) / standard deviation, where x is the original value of the independent variable.
4. Collinearity: the goal is to reduce dimensionality; collinearity is assessed with the correlation coefficient matrix.
A correlation coefficient below 0.3 is considered weak; between 0.7 and 0.9 it is considered strong.
Before running an algorithm, be sure to look at the correlations.
There are generally two ways to reduce correlation: 1. increase the sample size; 2. construct new variables (increment and ratio methods) -> reduce dimensionality (factor analysis and principal component analysis).
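A minimal sketch of steps 1-4 with pandas and NumPy, using the 2-standard-deviation rule and the z-score formula given above (the data and column names are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical data; x3 is deliberately constructed to be collinear with x1.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(50, 10, 200), "x2": rng.normal(0, 1, 200)})
df["x3"] = 0.9 * df["x1"] + rng.normal(0, 1, 200)

# Step 1: flag values that deviate from the column mean by more than 2 standard deviations.
outlier_mask = (df - df.mean()).abs() > 2 * df.std()
print(outlier_mask.sum())  # number of flagged outliers per column

# Steps 2-3: z-score standardization, z = (x - mean) / standard deviation.
z_scored = (df - df.mean()) / df.std()

# Step 4: check collinearity with the correlation coefficient matrix.
print(df.corr().round(2))  # x1 and x3 should show a strong correlation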
Difference between Principal Component Analysis and Factor Analysis:
Principal component analysis: PCA tries to recombine many correlated original indicators into a new set of uncorrelated composite indicators that replace the original ones. These composite indicators are the principal components. The few principal components retained should preserve as much of the information in the original variables as possible while remaining mutually independent.
Factor analysis is a multivariate statistical method that studies how to decompose many original variables into a few factor variables with as little information loss as possible, and how to make those factor variables highly interpretable.
Factor analysis: rather than recombining the original variables, factor analysis decomposes them into two parts: common factors and specific (unique) factors. Concretely, it looks for a few relatively independent, domain-meaningful factors that cannot be measured directly but that govern the directly measured indicators, so that the state of each factor can be inferred indirectly from those measurements.
Factor analysis explains only part of the variation (the common variance), whereas principal component analysis can explain all of the variation.
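A minimal scikit-learn sketch contrasting the two methods (the data is hypothetical, and standardizing before decomposition is an assumption consistent with the standardization step above):

import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Hypothetical data with some shared structure between columns 0 and 3.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 3] = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=200)
X_std = StandardScaler().fit_transform(X)

# PCA: recombines the variables into uncorrelated components; all components together explain all the variance.
pca = PCA(n_components=2).fit(X_std)
print("PCA explained variance ratio:", pca.explained_variance_ratio_.round(2))

# Factor analysis: models common factors plus a unique (specific) variance for each variable.
fa = FactorAnalysis(n_components=2).fit(X_std)
print("FA loadings:\n", fa.components_.round(2))
print("FA unique variances:", fa.noise_variance_.round(2))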
The preprocessing approach must be complete, and the reason for each preprocessing decision should be stated.
For example, Excel 2016 and above have built-in data analysis functions, such as removing null values.