2025-01-30 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Every data analysis technique operates on the data held in a data source.
That data may be incomplete (for example, some attribute values are uncertain or missing), noisy, inconsistent (for example, the same attribute has different names in different tables), and measured in different units or scales.
Analyzing such unprocessed data directly can give inaccurate results and is often inefficient.
Preprocessing methods such as cleaning, integration, transformation, and reduction are therefore needed to improve data quality, and with it the efficiency and quality of the analysis.
This article introduces the main preprocessing techniques: data cleaning, integration, transformation, and reduction.
Data cleaning removes noise, resolves inconsistencies, and handles incomplete data.
Noise can be removed by smoothing the data and by identifying outliers.
Binning: sort the data, distribute the values into bins using an equal-depth or equal-width rule, and then replace every value in a bin with the bin's mean, its median, or the nearest bin boundary (smoothing by bin means, by bin medians, or by bin boundaries).
Suppose an attribute takes the values 18, 12, 3, 9, 7, 6, 15, 21, 16, and binning is used to smooth the data and remove noise, with equal-depth bins of depth 3 and smoothing by bin means.
First, sort the attribute values: 3, 6, 7, 9, 12, 15, 16, 18, 21.
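The remaining steps of this example can be sketched in Python. This is a minimal illustration, not the article's own code: the function names `equal_depth_bins` and `smooth` are hypothetical, and the rule that a value equally close to both boundaries goes to the lower one is an assumption, since the text does not say how ties are broken.

```python
from statistics import mean, median


def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items.

    The last bin is shorter when len(values) is not a multiple of depth.
    """
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth(bins, rule):
    """Replace every value in each bin according to the smoothing rule."""
    smoothed = []
    for b in bins:
        if rule == "mean":
            smoothed.append([mean(b)] * len(b))
        elif rule == "median":
            smoothed.append([median(b)] * len(b))
        elif rule == "boundary":
            low, high = b[0], b[-1]
            # Assumption: ties go to the lower boundary.
            smoothed.append([low if v - low <= high - v else high for v in b])
        else:
            raise ValueError(f"unknown smoothing rule: {rule}")
    return smoothed


values = [18, 12, 3, 9, 7, 6, 15, 21, 16]
bins = equal_depth_bins(values, depth=3)
by_mean = smooth(bins, "mean")
by_median = smooth(bins, "median")
by_boundary = smooth(bins, "boundary")
print(bins)
print(by_mean)
```

For the values above, the bins are (3, 6, 7), (9, 12, 15), and (16, 18, 21), and smoothing by bin means replaces their values with roughly 5.33, 12, and 18.33 respectively.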
Incomplete data can be handled in the following ways:
1) fill missing values with a global constant
2) fill with the attribute mean
3) fill with the attribute mean of similar samples (those in the same class)
4) fill with the most likely value, which requires a prediction algorithm to estimate that value for the given sample
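The first three strategies can be sketched as follows. The function names, the sample values, and the use of `None` to mark a missing value are all assumptions made for illustration; the fourth strategy is left out because it depends on whichever prediction algorithm is chosen.

```python
from statistics import mean


def fill_constant(values, constant):
    """Strategy 1: replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]


def fill_mean(values):
    """Strategy 2: replace missing values with the mean of the known values."""
    avg = mean(v for v in values if v is not None)
    return [avg if v is None else v for v in values]


def fill_class_mean(values, labels):
    """Strategy 3: replace missing values with the mean of the same class.

    Raises KeyError if a class has no known value to average.
    """
    known = {}
    for v, label in zip(values, labels):
        if v is not None:
            known.setdefault(label, []).append(v)
    class_avg = {label: mean(vs) for label, vs in known.items()}
    return [class_avg[label] if v is None else v
            for v, label in zip(values, labels)]


# Hypothetical attribute values with two missing entries, plus class labels.
income = [30, None, 50, 70, None, 90]
group = ["a", "a", "a", "b", "b", "b"]

filled_constant = fill_constant(income, -1)
filled_mean = fill_mean(income)
filled_class = fill_class_mean(income, group)
print(filled_constant, filled_mean, filled_class)
```

Filling by class mean keeps the imputed values closer to their neighbors than the global mean does: here the missing entries become 40 and 80 instead of 60 in both places.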
Data inconsistencies can be resolved using metadata (data that describes the data).
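As one possible illustration of using metadata, the sketch below resolves the naming inconsistency mentioned earlier, where the same attribute has different names in different tables. The `METADATA` mapping, the table names, and the column names are all hypothetical.

```python
# Hypothetical metadata: for each source table, map its local column
# names to one canonical attribute name.
METADATA = {
    "orders": {"cust_id": "customer_id", "amt": "amount"},
    "billing": {"customer_number": "customer_id", "amount": "amount"},
}


def to_canonical(table, row):
    """Rename the columns of one row using the metadata for its table.

    Columns that the metadata does not mention keep their original name.
    """
    mapping = METADATA[table]
    return {mapping.get(col, col): value for col, value in row.items()}


order_row = to_canonical("orders", {"cust_id": 7, "amt": 12.5})
billing_row = to_canonical("billing", {"customer_number": 7, "amount": 12.5})
print(order_row, billing_row)
```

After renaming, rows from both tables use the same attribute names and can be compared or merged directly.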
Data integration
Data integration combines data from multiple data sources and stores it in a single consistent data store, such as a data warehouse.
These data sources may include multiple databases, data cubes, or flat files.
Redundancy must be eliminated during integration: an attribute is redundant if it can be derived from other attributes, and inconsistent attribute naming can also introduce redundancy.
Redundancy can be detected by correlation analysis.
Correlation analysis computes a correlation coefficient rA,B between attributes A and B.
If rA,B > 0, A and B are positively correlated: the value of A increases as the value of B increases.
RA,B
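The formula for the coefficient is not reproduced in the text above. The sketch below assumes it is the usual Pearson product-moment correlation coefficient, which is the common choice for numeric attributes; the function name `pearson_r` and the sample data are hypothetical.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumed formula:
        r = sum((a_i - mean_a) * (b_i - mean_b))
            / sqrt(sum((a_i - mean_a)**2) * sum((b_i - mean_b)**2))
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / sqrt(var_a * var_b)


# Hypothetical attributes: price_with_tax is derived from price, so the
# two are strongly correlated and one of them is a redundancy candidate.
price = [10.0, 20.0, 30.0, 40.0]
price_with_tax = [11.0, 22.0, 33.0, 44.0]
discount = [4.0, 3.0, 2.0, 1.0]

r_redundant = pearson_r(price, price_with_tax)
r_negative = pearson_r(price, discount)
print(r_redundant, r_negative)
```

A coefficient close to 1 or -1 suggests that one attribute can be derived from the other and is a candidate for removal during integration.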