Big Data Analysis Based on Python: Data Processing (Code Practice)

Updated 2025-04-25. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Continuing from the previous article: once the data has been obtained, it cannot be analyzed or used directly, because it contains a lot of invalid junk data and must be processed first. Data processing mainly consists of data cleaning, data extraction, data exchange, and data calculation.

Data cleaning

Data cleaning is the most critical step in the data value chain. Even with the best analysis, junk data produces erroneous results and can be seriously misleading.

Data cleaning means handling missing data and removing meaningless information: deleting irrelevant and duplicate data from the original data set, smoothing noisy data, filtering out data unrelated to the analysis topic, and so on.

Treatment of repeated values

The steps are as follows:

1. Use the duplicated method of DataFrame to return a Boolean Series that shows whether each row is a duplicate. Rows that are not duplicates show False; with the default settings, the second and later occurrences of a duplicated row show True.

2. Use the drop_duplicates method of DataFrame to return a DataFrame with the duplicate rows removed.

The format of duplicated:

duplicated(subset=None, keep='first')

Both parameters are optional. If subset is omitted, all columns are compared.

subset: the column label, or sequence of column labels, used to identify duplicates. The default is all columns.

keep: 'first' marks every occurrence of identical data as a duplicate except the first; 'last' marks every occurrence except the last; False (the Boolean value, not a string) marks all of them as duplicates.
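The effect of keep and subset is easiest to see side by side. The following is a minimal sketch on made-up sample data (the values are assumptions chosen so that the last two rows are identical), not output from the article's own data set:

```python
from pandas import DataFrame

# Hypothetical sample data: the rows at index 1 and 2 are identical
df = DataFrame({
    'age': [26, 85, 85],
    'name': ['xiaoqiang1', 'xiaoqiang2', 'xiaoqiang2'],
})

# keep='first' (the default): later occurrences are flagged as duplicates
first_flags = df.duplicated(keep='first').tolist()

# keep='last': earlier occurrences are flagged, the last one is not
last_flags = df.duplicated(keep='last').tolist()

# keep=False: every row that has a duplicate anywhere is flagged
all_flags = df.duplicated(keep=False).tolist()

# subset limits the comparison to the listed columns
by_name = df.duplicated(subset=['name']).tolist()

print(first_flags)  # [False, False, True]
print(last_flags)   # [False, True, False]
print(all_flags)    # [False, True, True]
print(by_name)      # [False, False, True]
```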

The format of drop_duplicates:

drop_duplicates()

To compare only specific columns, pass the column name in the parentheses, for example drop_duplicates(subset='name').

    from pandas import DataFrame
    from pandas import Series

    # Create the data
    df = DataFrame({
        'age': Series([26, 85, 85]),
        'name': Series(['xiaoqiang1', 'xiaoqiang2', 'xiaoqiang2']),
    })
    df

    # Check whether there are duplicate rows
    df.duplicated()

    # Remove duplicate rows
    df.drop_duplicates()
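Restricting drop_duplicates to a single column, as mentioned above, changes which rows count as duplicates. Here is a small sketch under assumed data (the ages are invented so that two rows share a name but are not fully identical):

```python
from pandas import DataFrame

# Hypothetical sample data: two rows share a name but differ in age
df = DataFrame({
    'age': [26, 85, 90],
    'name': ['xiaoqiang1', 'xiaoqiang2', 'xiaoqiang2'],
})

# Comparing whole rows finds no duplicates, so nothing is removed
full = df.drop_duplicates()

# Comparing only the 'name' column drops the second 'xiaoqiang2' row
by_name = df.drop_duplicates(subset='name')

# keep='last' retains the last occurrence instead of the first
by_name_last = df.drop_duplicates(subset='name', keep='last')

print(len(full))                      # 3
print(by_name['age'].tolist())        # [26, 85]
print(by_name_last['age'].tolist())   # [26, 90]
```

Note that drop_duplicates returns a new DataFrame; df itself is left unchanged unless the result is assigned back.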

Treatment of missing values

The processing of missing values generally includes two steps, namely, the identification of missing data and the processing of missing data.

Identification of missing data

pandas uses the floating-point value NaN to represent missing data in both floating-point and non-floating-point arrays, and provides the isnull and notnull functions to detect missing data.
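As a self-contained illustration that needs no spreadsheet, the sketch below builds a small DataFrame by hand (the column names and values are assumptions for illustration only) and applies isnull and notnull to it:

```python
import numpy as np
from pandas import DataFrame

# Hypothetical data standing in for a spreadsheet with gaps
df = DataFrame({
    'name': ['xiaoqiang1', None, 'xiaoqiang3'],
    'points': [90.0, np.nan, 75.0],
})

missing = df.isnull()    # True where a value is missing
present = df.notnull()   # the exact opposite of isnull()

# Count the missing values in each column
missing_per_column = missing.sum().to_dict()

# Select the rows that contain at least one missing value
rows_with_gaps = df[missing.any(axis=1)]

print(missing)
print(missing_per_column)
print(rows_with_gaps.index.tolist())
```

Both None in a text column and NaN in a numeric column are reported as missing.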

    # Identification of missing data
    from pandas import read_excel

    # Read the data that has missing values (rz.xlsx; adjust the path as needed)
    df = read_excel('rz.xlsx', sheet_name='Sheet2')
    df

    # Identify missing data: NaN is shown as True. The notnull function does just the opposite
    df.isnull()

The content of rz.xlsx is as follows

Processing of missing data

Missing data can be handled by filling in values, deleting the affected rows, or leaving it untreated. Here the code itself serves as the explanation.

    # Continuing with the data above
    # Remove the rows that contain empty values
    newdf = df.dropna()
    newdf

    # Replace NaN with '-'
    newdf2 = df.fillna('-')
    newdf2

    # Replace NaN with the previous value (equivalent to the older fillna(method='pad'))
    newdf3 = df.ffill()
    newdf3

    # Replace NaN with the next value (equivalent to the older fillna(method='bfill'))
    newdf4 = df.bfill()
    newdf4

    # Pass in a dictionary to fill different columns with different values
    newdf5 = df.fillna({'points': 100})
    newdf5

    # Replace NaN with averages; the averages of the two columns that contain NaN are calculated automatically
    newdf6 = df.fillna(df.mean(numeric_only=True))
    newdf6

    # strip() can also be used to remove specified characters around the data; this is basic Python and is not demonstrated here
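Because the code above depends on rz.xlsx, here is a self-contained sketch of the same strategies, plus the strip step that was left undemonstrated. The column names 'bonus' and 'name' and all values are assumptions for illustration, and str.strip is used here as the pandas counterpart of Python's strip():

```python
import numpy as np
from pandas import DataFrame

# Hypothetical data; column names and values are illustrative only
df = DataFrame({
    'name': [' xiaoqiang1 ', 'xiaoqiang2', ' xiaoqiang3'],
    'points': [90.0, np.nan, 70.0],
    'bonus': [np.nan, 4.0, 6.0],
})

dropped = df.dropna()    # keep only the rows without any NaN
forward = df.ffill()     # copy the previous value downwards
backward = df.bfill()    # copy the next value upwards

# A different fill value for each column
per_column = df.fillna({'points': 100, 'bonus': 0})

# Fill each numeric column with its own mean
with_mean = df.fillna(df.mean(numeric_only=True))

# Remove surrounding whitespace from a text column
clean_names = df['name'].str.strip()

print(dropped.index.tolist())          # [2]
print(forward['points'].tolist())      # [90.0, 90.0, 70.0]
print(backward['bonus'].tolist())      # [4.0, 4.0, 6.0]
print(with_mean['points'].tolist())    # [90.0, 80.0, 70.0]
print(clean_names.tolist())
```

A forward fill cannot fill a gap in the first row, and a backward fill cannot fill one in the last row, so a leading or trailing NaN may remain after either call.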
