What is Python's method of cleaning data

2025-04-01 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article mainly introduces "What is Python's method of cleaning data". In day-to-day work, I believe many people have doubts about how to clean data with Python. The editor has consulted all kinds of material and sorted out simple, easy-to-follow methods, hoping to help you answer the question "What is Python's method of cleaning data". Now, please follow along and learn together!

This article takes the common but scattered methods used in real data-cleaning scenarios and organizes them along the four axes of add, delete, query and split, so as to lower the memorization cost and make them more efficient to learn and use.

The dataset

Level I traffic

Traffic Level | Delivery Area | Visitors | Payment Conversion Rate | Customer Unit Price | Payment Amount
Level I | Zone A | 44,300 | 11.78% | 58.79 | 306,887.83
Level I | Zone B | 30,612 | 13.85% | 86.64 | 367,338.10
Level I | Zone C | 18,389 | 2.50% | 0.28 | 129.58
Level I | Zone D | 4,509 | 10.73% | 64.12 | 31,035.14
Level I | Zone E | 3,769 | 5.73% | 92.91 | 20,068.20
Level I | Zone F | 2,424 | 22.07% | 89.33 | 47,791.60
Level I | Zone G | 2,412 | 8.21% | 56.04 | 11,096.42

Level II traffic

Traffic Level | Delivery Area | Visitors | Payment Conversion Rate | Customer Unit Price | Payment Amount
Level II | Zone A | 29,111 | 10.66% | 87.4 | 271,189.23
Level II | Zone B | 17,165 | 22.71% | 91.22 | 355,662.39
Level II | Zone C | 8,870 | 0.78% | 44.52 | 3,072.00

Level III traffic

Traffic Level | Delivery Area | Visitors | Payment Conversion Rate | Customer Unit Price | Payment Amount
Level III | Zone A | 45,059 | 13.66% | 90.11 | 554,561.22
Level III | Zone B | 2,133 | 10.83% | 74.48 | 17,204.50
Level III | Zone C | 899 | 9.90% | 92.99 | 8,276.50
Level III | Zone D | 31 | 0.00% | |
Level III | Zone E | 17 | 0.00% | |

The data for these three levels sit in different sheets of the same Excel workbook.

First, import the case dataset. Because the data is stored in different sheets of the same Excel workbook, we need to specify the sheet name and read them in separately:
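A minimal sketch of the import, assuming the workbook is called data.xlsx and its three sheets are named after the traffic levels (the file name and sheet names here are assumptions, not taken from the original):

import pandas as pd

# Read each traffic level from its own sheet of the same workbook
df1 = pd.read_excel('data.xlsx', sheet_name='Level I')
df2 = pd.read_excel('data.xlsx', sheet_name='Level II')
df3 = pd.read_excel('data.xlsx', sheet_name='Level III')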

Next, let's start the cleaning.

1 Add--Expand data dimensions

1. Multi-row merge

The data in these three sheets have exactly the same dimensions (the columns are identical), so it is very convenient to stack them vertically for analysis. When vertical merging is mentioned, concat shows up uninvited. Its move is simple and clear: pd.concat([table 1, table 2, table 3]). For data whose column fields are identical, we simply pass in the tables one after another:
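A sketch of the vertical merge, assuming the three sheets were read into df1, df2 and df3 as above (the variable names are assumptions):

# Stack the three traffic levels into one DataFrame, top to bottom
df = pd.concat([df1, df2, df3])
df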

concat goes on: "Actually, if you set my axis parameter to 1, I can merge horizontally too." At that point I rushed over to cover its mouth: it is enough for a strong player to do one thing well, and the horizontal merging will be handed over to merge!

Tips: many functions in pandas are very powerful and can do several different things, but for a newcomer, too many (or even overlapping) capabilities easily lead to confusion, so here each function is introduced with only one way of doing one thing.

2. Multi-column merge

Horizontal merging involves join logic, and to make it easier to understand, we construct some more representative datasets to practice on:

h2 = pd.DataFrame({'Chinese': [93, 80, 85, 76, 58],
                   'Mathematics': [87, 99, 95, 85, 70],
                   'English': [80, 85, 97, 65, 88]},
                  index=['Han Meimei', 'Li Lei', 'Li Hua', 'Wang Ming', 'Tie Dan'])
h2

h3 = pd.DataFrame({'Basketball': [93, 80, 85, 76],
                   'Dance': [87, 99, 95, 85]},
                  index=['Li Hua', 'Wang Ming', 'Tie Dan', 'Liu Qiang'])
h3

The two DataFrames are two score tables: h2 holds the Chinese, mathematics and English scores of 5 students, and h3 holds the basketball and dance scores of 4 students. Now we want to find the students that appear in both tables and merge their scores, which is what the merge method is for:

pd.merge(left=h2, right=h3, left_index=True, right_index=True, how='inner')

# You can also specify the join fields of the left and right tables explicitly
data = pd.merge(left=intopia_label, right=intopia_pr, how="inner", left_on="order_number", right_on="order_number")

Let's go through merge's parameters in detail. left and right correspond to the left table and the right table to be joined; here the Chinese/math/English score table is the left table, and the basketball and dance score table is the right table.

left_index and right_index are the parameters to use when joining on the index (the students' names are in the index of both tables); setting them to True means the tables' indexes are used as the join condition (or bridge). If name were instead an ordinary column of values and we needed to match on it, we would use left_on='name', right_on='name' to specify the matching column of the left table and of the right table separately.

how specifies the join type. 'inner' here means we match on the name index and return only the rows whose names appear in both tables at the same time. Below is a more detailed look at the other values of how: left, right and outer.

Left and right joins (left and right):

With a left join (left) or a right join (right), you can intuitively think of one table as the boss: whichever table is the boss gets all of its rows kept. Look at the left join first: the left table h2 is kept intact, and the right table is merged onto it; where a name matches, the data is returned normally, and where it does not (Han Meimei, Li Lei), null (NaN) values are returned. A right join listens to the right table instead, and the unmatched rows of the left table come back as nulls.

Outer join (outer):

The outer join is the compromise between the two tables: all of my rows are kept and all of yours are kept; whatever you have that I lack, and whatever I have that you lack, is filled in as empty (NaN). It behaves almost exactly like SQL's outer join.
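To see the difference on the score tables above, compare a left join and an outer join on the same index (an illustrative sketch of the how parameter):

# Left join: every student in h2 is kept; missing basketball/dance scores become NaN
pd.merge(left=h2, right=h3, left_index=True, right_index=True, how='left')

# Outer join: every student from either table is kept, and gaps are filled with NaN
pd.merge(left=h2, right=h3, left_index=True, right_index=True, how='outer')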

2 Delete--Drop nulls and duplicates

1. Null values

In some scenarios, missing (null) values in the source data are noise for the analysis and need to be deleted systematically. The df dataset we merged above happens to contain missing data:

To remove the nulls, a single dropna will do, as sketched after the explanation below.

By default, the dropna function deletes every row that contains a null value; that is, as long as any field in a row is null, the whole row is dropped. We can set the subset parameter, for example dropna(subset=['city']), to drop a row only when its city field is empty.
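A sketch of both usages on the merged dataset (the 'city' column is only the article's illustration and is not a column of this dataset):

# Drop every row that contains at least one null value
df.dropna()

# Drop a row only when the named column is null
df.dropna(subset=['city'])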

2. Deduplication

drop_duplicates() handles deduplication, but the case data is fairly clean and no two rows are exactly the same, so we have to make life harder for ourselves and add a few rows of duplicate values first:
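One possible way to manufacture duplicates for the demo, a sketch that simply re-appends the first two rows of df (which rows, and how many, is an arbitrary choice here):

# Append copies of the first two rows so the dataset now contains duplicates
df = pd.concat([df, df.head(2)])
df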

By default, the drop_duplicates method removes rows that are complete duplicates (rows whose every value is the same). If we want to deduplicate on specific columns, we can do so with the subset parameter; if we have the odd idea of deduplicating on the traffic-level column, for instance, we can.

We know there are three traffic levels; by specifying the subset parameter, we remove the rows where this field is duplicated and keep the first non-duplicated row of each level. To take it a step further: in the source data there are 7 rows for the Level I traffic channel, and the other fields differ from row to row. Deduplication keeps the first of those rows and deletes the remaining 6. But what if, while deduplicating, we wanted to delete the first 6 rows and keep the last row instead? The answer is simple: specify the keep parameter.

Setting keep to 'last' keeps the last row. If no value is passed for keep, it defaults to 'first', so the first row is kept and the other rows are deleted.
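A sketch of deduplicating on the traffic-level column, assuming the column is named 'Traffic Level' as in the tables above (the real column name may differ):

# Keep the first row of each traffic level (default keep='first')
df.drop_duplicates(subset='Traffic Level')

# Keep the last row of each traffic level instead
df.drop_duplicates(subset='Traffic Level', keep='last')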

3 Query--Select data by condition

Querying is not simply returning a few rows of data; it means viewing and selecting data based on conditions drawn from the actual business needs.

1. Index / filter by condition

loc's monologue: you read that right, I carry a lot of weight around here, so I am grabbing the front-row seat first to show my face and get acquainted.

Requirement: filter out the Level I channels with more than 10,000 visitors. loc, take it away:
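A sketch of the loc filter, assuming the columns are named 'Traffic Level' and 'Visitors' as in the tables above (adjust to your actual column names):

# Rows where the traffic level is Level I and the visitor count exceeds 10,000
df.loc[(df['Traffic Level'] == 'Level I') & (df['Visitors'] > 10000), :]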

2. Sorting

In many cases we need to sort in order to spot patterns in the data and to quickly filter out the TOP N items. For the case data, how do we sort by payment amount and pick out the TOP 3 channels?

The key to the problem is the sorting, and this is where the sort_values function comes in handy (see the sketch a little further down).

The whole operation is very simple. sort_values, as the name suggests, sorts by value. The first parameter to pass is the column parameter, i.e. which column's values we sort by; the ascending parameter decides the order: False sorts in descending order (large to small), and True sorts in ascending order.

After sorting, filtering out the TOP 3 channels is very simple:
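A sketch of the sort-then-take-the-head step, assuming the column is named 'Payment Amount' as in the tables above:

# Sort by payment amount from large to small, then keep the top 3 rows
top3 = df.sort_values('Payment Amount', ascending=False).head(3)
top3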

One more knowledge point: if you have been following along, you will have noticed that whether it is dropna deleting rows, drop_duplicates removing duplicates, or sort_values sorting, the source data does not change after the operation. That is because we did not set the inplace parameter of these functions; if you set inplace=True, the deletion, deduplication and sorting will take effect on the source data itself.

However, to avoid errors that cannot be undone, it is recommended that you assign the result of an operation to a new variable, such as new = df.dropna(), rather than setting inplace=True on the source data. This is similar in spirit to Scala and Spark.

4 Split--Grouping and binning

The general trend of the world is that what has long been united must divide, and data is no different. In the splitting department, we focus on grouping with groupby and binning with cut.

1. Grouping

In the case data there are three traffic levels in total, and each level has several delivery areas. If we want to total up the visitor numbers and payment amounts for each traffic level, we need grouping.

groupby is the grouping function. Its most important parameter is the column parameter, i.e. which column (or columns; multiple columns should go in a list) to group by; here we group by traffic level:
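A sketch of the bare grouping, assuming the column is named 'Traffic Level' (calling groupby on its own only builds a grouped object; it does not display the data):

# Group the merged data by traffic level; this returns a DataFrameGroupBy object
df.groupby('Traffic Level')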

As you can see, grouping by itself does not return any of the data we expect. To go further, we have to apply a calculation to the relevant fields of each group (common calculations include sum, max, min, mean and std):
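Continuing the sketch, apply sum to the grouped object (the column name is the same assumption as above):

# Sum the fields within each traffic-level group; every numeric column is summed
# (how non-numeric columns are handled depends on your pandas version)
df.groupby('Traffic Level').sum()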

Adding sum means that we first group by traffic level and then sum the fields within each group. Since no columns were specified for the summation, every numeric field gets summed. Here we only want the number of visitors and the payment amount for each traffic level, so we need to specify them:
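A sketch of restricting the summation to the two columns we care about (column names assumed from the tables above):

# Only sum the visitor count and payment amount within each group
df.groupby('Traffic Level')[['Visitors', 'Payment Amount']].sum()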

The traffic level, as the column we group by, is turned into the index by default. If we do not want it to become the index, we can pass as_index=False into groupby:
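And the same aggregation with the traffic level kept as an ordinary column rather than the index:

# as_index=False keeps 'Traffic Level' as a regular column in the result
df.groupby('Traffic Level', as_index=False)[['Visitors', 'Payment Amount']].sum()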

2. Cut (binning)

Cut (binning) operations are often used to classify and label a one-dimensional array, and the cut function completes this task efficiently. Its main parameters and usage are as follows:

pd.cut(x, bins, right, labels)

The first parameter x is the one-dimensional array we want to bin; it can be a list or a column of a DataFrame.

bins describes how to split: you can pass a custom list [a, b, c] to split on the intervals a-b and b-c, or pass a single number (such as 5) to split the data directly into 5 bins.

right can be set to True or False. True means each grouping interval includes its right edge and excludes its left edge; False means it includes the left edge and excludes the right.

labels is the labelling parameter: if we cut a column of data into 3 groups, we can give each group a label, such as ['low', 'medium', 'high'].

Don't be put off by the fiddly explanation; one example makes it all clear. Take the case data: every channel has a visitor count, and we now want to grade each channel's visitor level. Based on visitor count, channels are classed as garbage (fewer than 100 visitors), hundred-level, thousand-level or ten-thousand-level channels.

Because we want to classify traffic into hundred-, thousand- and ten-thousand-level bands, we pass those breakpoints into the bins parameter. The results show that when right is not set, the grouping intervals default to open on the left and closed on the right, whereas we want them closed on the left and open on the right (so that, for example, the lowest band covers visitor counts from 0 to 99); we therefore need to set right to False.
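An illustrative intermediate step, assuming the visitor-count column is named 'Visitors' as in the final line of code below:

# Bucket the visitor counts with right=False: each interval is closed on the left
# and open on the right, i.e. [0, 100), [100, 1000), [1000, 10000), [10000, 100000)
pd.cut(x=df['Visitors'], bins=[0, 100, 1000, 10000, 100000], right=False)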

Next, we label the binned data directly: visitor counts from 0 to 99 are labelled garbage, 100 to 999 are labelled hundred-level, and the thousand- and ten-thousand-level bands follow the same pattern. At the same time, the labelled data is attached to the source data as a new column:

df['category tags'] = pd.cut(x=df['Visitors'], bins=[0, 100, 1000, 10000, 100000], right=False, labels=['garbage', 'hundred', 'thousand', 'ten-thousand'])

Very efficient: one and a half lines of code handle the binning, the judging and the labelling in one go.

At this point, the study of "What is Python's method of cleaning data" is over. We hope it has resolved your doubts; theory and practice work best together, so go and try it out! If you want to keep learning more related knowledge, please keep following the site, and the editor will keep working to bring you more practical articles!
