2025-02-28 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This continues the previous article. Once the data has been obtained, it cannot be analyzed or used directly because it contains a lot of invalid junk data, so it must be processed first. Data processing mainly consists of data cleaning, data extraction, data exchange, and data calculation.
Data cleaning
Data cleaning is the most critical step in the data value chain. Even with the best analysis, junk data produces erroneous results and can be seriously misleading.
Data cleaning means dealing with missing data and removing meaningless information: for example, deleting irrelevant and duplicate data from the original data set, smoothing noisy data, and filtering out data that has nothing to do with the analysis topic.
Treatment of repeated values
The steps are as follows:
1. Use the DataFrame duplicated method to return a Boolean Series that shows whether each row is a duplicate. Rows that are not duplicates are shown as False; duplicate rows are shown as True from their second occurrence onward.
2. Use the DataFrame drop_duplicates method to return a DataFrame with the duplicate rows removed.
The format of duplicated:
duplicated(subset=None, keep='first')
Both parameters are optional. If they are omitted, all columns are compared.
subset is a column label or a sequence of column labels used to identify duplicates. The default is all columns.
keep='first' marks every occurrence of the same data as a duplicate except the first, keep='last' marks every occurrence as a duplicate except the last, and keep=False marks all occurrences of the same data as duplicates.
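The effect of subset and keep is easiest to see side by side. The following is a minimal sketch on a small made-up DataFrame (the sample rows are an assumption for illustration, not the article's data):

```python
import pandas as pd

# Hypothetical sample data: rows 1 and 2 are identical,
# row 3 shares only the name with them.
df = pd.DataFrame({
    "name": ["xiaoqiang1", "xiaoqiang2", "xiaoqiang2", "xiaoqiang2"],
    "age": [26, 85, 85, 30],
})

# Default: compare all columns, mark every occurrence after the first.
dup_first = df.duplicated()

# Mark every occurrence except the last one.
dup_last = df.duplicated(keep="last")

# Mark all occurrences of duplicated data, including the first.
dup_all = df.duplicated(keep=False)

# Compare only the "name" column.
dup_by_name = df.duplicated(subset=["name"])

print(pd.DataFrame({
    "first": dup_first,
    "last": dup_last,
    "all": dup_all,
    "by_name": dup_by_name,
}))
```

Printing the four results as columns of one table makes it clear which rows each setting flags.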
The format of drop_duplicates:
drop_duplicates()
To compare only a particular column, pass its name in the parentheses, for example df.drop_duplicates(subset='name').
    from pandas import DataFrame
    from pandas import Series

    # Create the data
    # (the middle age value is garbled in the source; 85 is assumed here
    # so that the second and third rows are duplicates)
    df = DataFrame({
        'age': Series([26, 85, 85]),
        'name': Series(['xiaoqiang1', 'xiaoqiang2', 'xiaoqiang2'])
    })
    df

    # Determine whether there are duplicate rows
    df.duplicated()

    # Remove duplicate rows
    df.drop_duplicates()
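Restricting drop_duplicates to certain columns can be sketched as follows. The sample rows are made up for illustration, and reset_index is an optional extra step that the article does not mention:

```python
import pandas as pd

# Hypothetical sample data: rows 1 and 2 are identical,
# row 3 shares only the name with them.
df = pd.DataFrame({
    "name": ["xiaoqiang1", "xiaoqiang2", "xiaoqiang2", "xiaoqiang2"],
    "age": [26, 85, 85, 30],
})

# Drop rows that are duplicated across all columns (keeps rows 0, 1, 3).
unique_rows = df.drop_duplicates()

# Drop rows whose "name" has already appeared (keeps rows 0, 1).
unique_names = df.drop_duplicates(subset=["name"])

# Keep the last row for each name instead, then renumber the index.
latest_names = (
    df.drop_duplicates(subset=["name"], keep="last")
      .reset_index(drop=True)
)

print(unique_rows)
print(unique_names)
print(latest_names)
```

drop_duplicates returns a new DataFrame each time, so df itself still has all four rows afterwards.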
Treatment of missing values
The processing of missing values generally includes two steps, namely, the identification of missing data and the processing of missing data.
Identification of missing data
Pandas uses the floating-point value NaN to represent missing data in both floating-point and non-floating-point arrays, and provides the isnull and notnull functions to detect missing data.
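For readers who do not have the Excel file at hand, the same detection can be tried on an in-memory DataFrame. This is a minimal sketch; the column names and values are assumptions chosen only to include one missing number and one missing string:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "name": ["xiaoqiang1", "xiaoqiang2", None],
    "points": [90.0, np.nan, 75.0],
})

# True where a value is missing.
missing_mask = df.isnull()

# True where a value is present (the exact opposite of isnull).
present_mask = df.notnull()

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Only the rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_mask)
print(missing_per_column)
print(rows_with_missing)
```

Counting missing values per column before choosing a treatment is a common first step, since it shows whether dropping rows would discard much of the data.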
    # Identification of missing data
    from pandas import DataFrame
    from pandas import read_excel

    # Read a sheet that contains missing data
    # (adjust the path to wherever rz.xlsx is stored)
    df = read_excel('rz.xlsx', sheet_name='Sheet2')
    df

    # Identify missing data: NaN is displayed as True.
    # The notnull function does just the opposite.
    df.isnull()
The content of rz.xlsx is as follows
Processing of missing data
Missing data can be handled by filling in values, by deleting the affected rows, or by leaving it untouched. The options are explained here through code.
    # Continuing with the DataFrame from above
    # Remove the rows that contain empty values
    newdf = df.dropna()
    newdf

    # Replace NaN with '-'
    newdf2 = df.fillna('-')
    newdf2

    # Use the previous value instead of NaN
    # (older code wrote df.fillna(method='pad'), which is deprecated in recent pandas versions)
    newdf3 = df.ffill()
    newdf3

    # Use the following value instead of NaN
    # (older form: df.fillna(method='bfill'))
    newdf4 = df.bfill()
    newdf4

    # Pass in a dictionary to fill different columns with different values
    newdf5 = df.fillna({'points': 100})
    newdf5

    # Use the column mean instead of NaN; the mean is calculated
    # automatically for each numeric column that contains NaN
    newdf6 = df.fillna(df.mean(numeric_only=True))
    newdf6

    # strip() can also be used to remove specified characters from both ends
    # of the data; this is basic Python and is not demonstrated in this code
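The treatments above can also be tried without the Excel file. The following sketch uses a small made-up DataFrame (the names and values are assumptions), and adds a str.strip() call on a text column as one possible way to trim surrounding whitespace:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing score, and names with stray spaces.
df = pd.DataFrame({
    "name": [" xiaoqiang1 ", "xiaoqiang2", "xiaoqiang3 "],
    "points": [90.0, np.nan, 70.0],
})

# Delete every row that contains a missing value.
dropped = df.dropna()

# Fill each gap with the previous row's value.
forward = df.ffill()

# Fill each gap with the next row's value.
backward = df.bfill()

# Fill per column using a dictionary of column name -> fill value.
by_column = df.fillna({"points": 100})

# Fill numeric columns with their own mean (here (90 + 70) / 2 = 80).
by_mean = df.fillna(df.mean(numeric_only=True))

# Remove leading and trailing whitespace from a text column.
clean_names = df["name"].str.strip()

print(by_mean)
print(clean_names.tolist())
```

Which treatment fits depends on the data: forward and backward filling only make sense when row order carries meaning, while the mean is a reasonable default for unordered numeric columns.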