2025-01-18 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article explains how to streamline data preprocessing in Python. The method introduced here is simple, fast and practical.
We know that real data is usually messy and needs a lot of preprocessing before it can be used. Pandas is one of the most widely used data analysis and processing libraries, and it provides a variety of methods for preprocessing raw data.
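    import numpy as np
    import pandas as pd

    # Note: the "id" and "A" lists were fused in the source text. The values
    # below are a reconstruction sized to match the 8 rows of "B" and "C".
    df = pd.DataFrame({
        "id": [100, 100, 101, 102, 103, 104, 105, 106],
        "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
        "B": [45, 56, 48, 47, 62, 112, 54, 49],
        "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5],
    })
    df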
In the data above, missing values are represented by NaN, the id column contains duplicate values, and the 112 in column B seems to be an outlier.
These are typical problems in real data. We will create a pipeline to deal with them. Each task needs its own function, so the first step is to write the functions that will be placed in the pipeline. Note that every function used in the pipeline must take a data frame as a parameter and return a data frame.
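The contract described above can be sketched with two small steps. Both functions (`lowercase_columns`, `drop_empty_rows`) and the `demo` frame are hypothetical names made up for this illustration; they are not part of the article's pipeline.

```python
import numpy as np
import pandas as pd

def lowercase_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical step: returns a new frame with lower-cased column names.
    return df.rename(columns=str.lower)

def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical step: removes rows in which every value is missing.
    return df.dropna(how="all")

demo = pd.DataFrame({"ID": [1.0, np.nan, 3.0], "Value": [10.0, np.nan, 30.0]})

# Each step takes a data frame and returns a data frame, so the calls chain.
result = demo.pipe(lowercase_columns).pipe(drop_empty_rows)
print(result)
```

A step that returned anything else (a Series, or None) would break the next `.pipe` call in the chain, which is why the return type matters.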
The first function handles missing values:

    def fill_missing_values(df):
        # Fill NaNs in each numeric column with that column's mean.
        for col in df.select_dtypes(include=["int", "float"]).columns:
            val = df[col].mean()
            # Assign the result back instead of calling fillna(inplace=True)
            # on df[col]; that chained form may not update df in recent pandas.
            df[col] = df[col].fillna(val)
        return df

I like to replace missing values in numeric columns with the column mean, but you can define a different rule for your scenario. As long as the function takes a data frame as a parameter and returns a data frame, it will work in the pipeline.
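As one example of a different rule, here is a minimal sketch that fills with the median instead of the mean. The function name `fill_missing_with_median` and the `demo` data are assumptions made for illustration; the median is simply one common alternative that is less sensitive to extreme values.

```python
import numpy as np
import pandas as pd

def fill_missing_with_median(df: pd.DataFrame) -> pd.DataFrame:
    # Alternative strategy (not the article's choice): use the column median.
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].median())
    return df

# Hypothetical data: one missing value next to an extreme value.
demo = pd.DataFrame({"A": [1.0, 2.0, np.nan, 100.0], "label": ["w", "x", "y", "z"]})

# Work on a copy so that demo itself keeps its NaN.
filled = demo.copy().pipe(fill_missing_with_median)
print(filled)
```

Because it has the same shape (data frame in, data frame out), it could be swapped into the pipeline in place of the mean-based function.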
The second function drops duplicate values:

    def drop_duplicates(df, column_name):
        # Keep the first row for each distinct value in column_name.
        df = df.drop_duplicates(subset=column_name)
        return df

It calls pandas' built-in drop_duplicates method, which removes rows with duplicate values in the given column.
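Which of the duplicated rows survives depends on the `keep` parameter of `DataFrame.drop_duplicates`. The short sketch below uses a made-up three-row frame to show the three options; the function above relies on the default, `keep="first"`.

```python
import pandas as pd

# Hypothetical data: id 100 appears twice.
demo = pd.DataFrame({"id": [100, 100, 101], "B": [45, 56, 48]})

first = demo.drop_duplicates(subset="id")               # default: keep the first row
last = demo.drop_duplicates(subset="id", keep="last")   # keep the last row instead
none = demo.drop_duplicates(subset="id", keep=False)    # drop every duplicated id

print(first["B"].tolist(), last["B"].tolist(), none["B"].tolist())
```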
The last function removes outliers:

    def remove_outliers(df, column_list):
        for col in column_list:
            avg = df[col].mean()
            std = df[col].std()
            low = avg - 2 * std
            high = avg + 2 * std
            # Keep rows within two standard deviations of the mean;
            # inclusive="both" is the current spelling of the older inclusive=True.
            df = df[df[col].between(low, high, inclusive="both")]
        return df
This function works as follows:
It takes a data frame and a list of columns
For each column in the list, it calculates the mean and standard deviation
It sets the lower limit to the mean minus two standard deviations and the upper limit to the mean plus two standard deviations
It deletes rows whose values fall outside the range defined by the lower and upper limits
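To make the limits concrete, the sketch below applies the same arithmetic to the values of column B on their own. It runs on the raw column, before any duplicates are dropped, so the exact limits inside the full pipeline will differ slightly.

```python
import pandas as pd

b = pd.Series([45, 56, 48, 47, 62, 112, 54, 49], name="B")

avg = b.mean()   # 59.125
std = b.std()    # sample standard deviation (ddof=1), the pandas default
low, high = avg - 2 * std, avg + 2 * std

# between() is True for values inside [low, high]; invert it to list outliers.
mask = b.between(low, high)
print(round(low, 2), round(high, 2))
print(b[~mask].tolist())   # [112]
```

The upper limit comes out a little above 103, so 112 is the only value that is removed.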
As with the previous functions, you can choose your own method of detecting outliers.
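One possible alternative is the interquartile range (IQR) rule. This is a sketch of a swapped-in technique, not the article's method: the function name `remove_outliers_iqr` and the multiplier `k=1.5` are assumptions chosen for illustration.

```python
import pandas as pd

def remove_outliers_iqr(df, column_list, k=1.5):
    # Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for each column.
    for col in column_list:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        df = df[df[col].between(q1 - k * iqr, q3 + k * iqr)]
    return df

# Column B from the sample data, used here on its own.
demo = pd.DataFrame({"B": [45, 56, 48, 47, 62, 112, 54, 49]})
cleaned = demo.pipe(remove_outliers_iqr, ["B"])
print(cleaned["B"].tolist())
```

Because quartiles are less affected by extreme values than the mean and standard deviation, this rule can behave differently on small or skewed samples.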
Create the pipeline
We now have three functions for data preprocessing tasks. The next step is to use them to create a pipeline.
    df_processed = (
        df.pipe(fill_missing_values)
          .pipe(drop_duplicates, "id")
          .pipe(remove_outliers, ["A", "B"])
    )

This pipeline executes the functions in the given order. Arguments for a function can be passed to pipe after the function name.
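Put differently, `df.pipe(func, arg)` calls `func(df, arg)`, so a chain of pipes is equivalent to nested function calls read from the inside out. The sketch below checks this on a small made-up frame; `add_constant` is a hypothetical step included only to show a keyword argument being forwarded.

```python
import pandas as pd

def drop_duplicates(df, column_name):
    return df.drop_duplicates(subset=column_name)

def add_constant(df, column_name, value=0):
    # Hypothetical step: adds a column filled with a constant value.
    return df.assign(**{column_name: value})

demo = pd.DataFrame({"id": [100, 100, 101], "B": [45, 56, 48]})

# Positional and keyword arguments after the function are forwarded to it.
piped = demo.pipe(drop_duplicates, "id").pipe(add_constant, "flag", value=1)

# The same computation written as nested calls.
nested = add_constant(drop_duplicates(demo, "id"), "flag", value=1)

print(piped.equals(nested))
```

The piped form reads top to bottom in execution order, which is the main reason to prefer it once there are more than two steps.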
One thing to mention here is that some functions in the pipeline modify the original data frame. Therefore, running the pipeline above will also update df.
One way to solve this problem is to use a copy of the original data frame in the pipeline. If you don't care about keeping the original data frame as it is, you can use it in the pipeline directly.
I will update the pipeline as follows:

    my_df = df.copy()
    df_processed = (
        my_df.pipe(fill_missing_values)
             .pipe(drop_duplicates, "id")
             .pipe(remove_outliers, ["A", "B"])
    )
Let's take a look at the original and processed data frames:
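The comparison can be reproduced with the self-contained sketch below. The "id" and "A" values are assumptions made for illustration (the "B" and "C" values follow the sample data), and the three functions are compact versions of the ones described in this article.

```python
import numpy as np
import pandas as pd

# Assumed sample data: the "id" and "A" values are illustrative.
df = pd.DataFrame({
    "id": [100, 100, 101, 102, 103, 104, 105, 106],
    "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
    "B": [45, 56, 48, 47, 62, 112, 54, 49],
    "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5],
})

def fill_missing_values(df):
    # Replace NaNs in numeric columns with the column mean (modifies df).
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].mean())
    return df

def drop_duplicates(df, column_name):
    return df.drop_duplicates(subset=column_name)

def remove_outliers(df, column_list):
    # Keep rows within two standard deviations of the mean for each column.
    for col in column_list:
        avg, std = df[col].mean(), df[col].std()
        df = df[df[col].between(avg - 2 * std, avg + 2 * std)]
    return df

my_df = df.copy()
df_processed = (
    my_df.pipe(fill_missing_values)
         .pipe(drop_duplicates, "id")
         .pipe(remove_outliers, ["A", "B"])
)

print(df)            # original: still has the NaNs, the duplicate id and 112
print(df_processed)  # processed: no NaNs, unique ids, 112 removed
```

With this data the original frame keeps all 8 rows, while the processed frame has 6: one row is dropped as a duplicate id and one as an outlier in column B. The copy, `my_df`, is the frame that gets its missing values filled.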
By now you should have a better understanding of how to streamline data preprocessing in Python. The best way to learn it is to try it on your own data.