This article walks through the ways of doing data preprocessing in Python. In real projects many people run into the same situations, so let the editor lead you through how to handle them. I hope you read carefully and come away with something!
When it comes to preprocessing, it is generally necessary to handle:

Missing numeric values
Missing categorical values
Numeric standardization
Turning categorical features into dummy variables
The pipeline idea
In data processing and machine learning, you will eventually notice that every project seems to follow the same "routine":
Preprocessing
Modeling
Training
Prediction
Preprocessing has its own routine too, but here we will not use the Pipeline function; instead we will use another one, FeatureUnion.
Of course, a single function cannot solve all the problems. Let's see which functions and coding styles can make our code look organized and "zhuang" (impressively polished) in style.
Import the data and get hands-on
Today we analyze the Titanic data, which I have downloaded and placed in the data folder under the project path.
import pandas as pd

file = 'data/titanic_train.csv'
raw_df = pd.read_csv(file)
Next comes the standard routine: preview with info() and head().
print(raw_df.info())
print(raw_df.head())
Let's briefly go over the fields of the dataset:
RangeIndex: 891 entries, 0 to 890: represents 891 samples
Columns: 12 columns in total
Divided by data type:
int64:
PassengerId: passenger ID
Survived: whether the passenger survived, 1 means survived
Pclass: passenger class
SibSp: number of siblings and spouses aboard
Parch: number of parents and children aboard
object:
Name: passenger name
Sex: gender
Ticket: ticket number
Cabin: cabin number
Embarked: boarding place
float64:
Age: age
Fare: ticket price
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In general, machine learning models cannot work directly with missing values or categorical data, so at a minimum we need to preprocess these two cases.
First, let's look at the missing values. This information is already in the info() output above; here we display it more explicitly.
# get null count for each column
nulls_per_column = raw_df.isnull().sum()
print(nulls_per_column)
The results are as follows:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
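If you prefer ratios to raw counts, a one-liner shows the percentage of missing values per column (a small optional sketch, not part of the main flow):

# share of missing values per column, in percent
print((raw_df.isnull().mean() * 100).round(1))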
You can see that Age (float64), Cabin (object), and Embarked (object) all have missing values.
The protagonists enter the stage (strategy and functions)
We can see above which columns have missing values. For some scenarios, such as quickly cleaning up a dataset, we will adopt only the following strategy:
For float columns, we generally replace missing values with the mean or median.
For object columns that are ordinal, i.e. strict categories such as (male, female) or (high, medium, low), we generally replace missing values with the mode.
For object columns that are nominal, i.e. with no hierarchical/strict category relationship, such as an ID, we replace missing values with a constant value.
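As a quick illustration, the whole strategy can be written with plain pandas on a throwaway copy (a minimal sketch; the column choices are just examples from this dataset, and the article's actual approach below uses sklearn instead):

# a minimal pandas-only sketch of the strategy, on a throwaway copy
df = raw_df.copy()
df['Age'] = df['Age'].fillna(df['Age'].median())                  # float -> median
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # strict category -> mode
df['Cabin'] = df['Cabin'].fillna('unknown')                       # ID-like nominal -> constant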
This article uses sklearn's preprocessing module, its pipeline module, and a third-party "rookie", the sklearn_pandas library. Let's briefly introduce what each function is for.
StandardScaler: used to standardize numeric columns.
LabelBinarizer: as the name implies, the category values are first turned into labels (numbers) and then binarized. It is equivalent to one-hot encoding, but LabelBinarizer only processes one column at a time.
FeatureUnion: used to merge different feature preprocessing procedures (functions) back together. Note that its input is not data but transformers, i.e. the preprocessing methods themselves.
SimpleImputer: sklearn's built-in preprocessing function, similar to fillna.
CategoricalImputer: a supplement from sklearn_pandas, since sklearn has no imputer dedicated to categorical data.
DataFrameMapper: the equivalent of building a different transformer for each column of a DataFrame.

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import FeatureUnion
from sklearn_pandas import CategoricalImputer
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer
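To make LabelBinarizer concrete, here is a tiny sketch (the values are made up for illustration, borrowing the Embarked codes):

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
print(lb.fit_transform(['S', 'C', 'Q', 'S']))
# classes_ are sorted as ['C' 'Q' 'S'], so each row becomes a one-hot vector:
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]]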
According to our strategy, we need to divide the columns into numeric and categorical ones. The idea is to check whether a column's dtype is object.
# split categorical columns and numerical columns
categorical_mask = (raw_df.dtypes == object)
categorical_cols = raw_df.columns[categorical_mask].tolist()
numeric_cols = raw_df.columns[~categorical_mask].tolist()
numeric_cols.remove('Survived')
print(f'categorical_cols are {categorical_cols}')
print(f'numeric_cols are {numeric_cols}')
Output:
categorical_cols are ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
numeric_cols are ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
Numerical data preprocessing
Now preprocess the numeric data. Here we use DataFrameMapper to create the transformer object, filling in the median for all numeric_cols.
numeric_fillna_mapper = DataFrameMapper(
    [([col], SimpleImputer(strategy="median")) for col in numeric_cols],
    input_df=True,
    df_out=True
)
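Two details worth noting here: wrapping the column name as [col] makes DataFrameMapper pass the transformer a 2-D column, which SimpleImputer requires (a bare col would pass a 1-D array), and the list comprehension simply produces one ([column], transformer) pair per column. A quick look at the pairs (an illustrative sketch):

# the comprehension expands into one ([column], transformer) pair per column
pairs = [([col], SimpleImputer(strategy="median")) for col in numeric_cols]
print(pairs[0])  # (['PassengerId'], SimpleImputer(strategy='median'))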
We can test the code to see what the transformed data looks like. Here you need to call the fit_transform method.
transformed = numeric_fillna_mapper.fit_transform(raw_df)
print(transformed.info())
The result is as follows. You can see that the transformed data contains only the columns we processed, and that every non-null count has reached 891, meaning nothing is missing anymore.
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    float64
 1   Pclass       891 non-null    float64
 2   Age          891 non-null    float64
 3   SibSp        891 non-null    float64
 4   Parch        891 non-null    float64
 5   Fare         891 non-null    float64
For the numeric features we want to fill in the missing values first and then standardize. So we just need to modify the mapper above to give each column a list of transformers; this list contains two steps: SimpleImputer and StandardScaler.
# fill nan with the median
# and then standardize the cols
numeric_fillna_standardize_mapper = DataFrameMapper(
    [([col], [SimpleImputer(strategy="median"), StandardScaler()]) for col in numeric_cols],
    input_df=True,
    df_out=True
)
fillna_standardized = numeric_fillna_standardize_mapper.fit_transform(raw_df)
print(fillna_standardized.head())
Preview the result of the transformation:
   PassengerId    Pclass       Age     SibSp     Parch      Fare
0    -1.730108  0.827377 -0.565736  0.432793 -0.473674 -0.502445
1    -1.726220 -1.566107  0.663861  0.432793 -0.473674  0.786845
2    -1.722332  0.827377 -0.258337 -0.474545 -0.473674 -0.488854
3    -1.718444 -1.566107  0.433312  0.432793 -0.473674  0.420730
4    -1.714556  0.827377  0.433312 -0.474545 -0.473674 -0.486337
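As a sanity check (a small sketch, not part of the original walkthrough), you can reproduce the standardized Age column by hand; StandardScaler computes z = (x - mean) / std using the population standard deviation (ddof=0):

# reproduce the standardized Age column manually
age = raw_df['Age'].fillna(raw_df['Age'].median())   # impute first, as the mapper does
age_z = (age - age.mean()) / age.std(ddof=0)         # z = (x - mean) / std
print(age_z.head())  # should match the Age column in the preview above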
In this way, we have completed the preprocessing of the numeric data. Similarly, we can preprocess the categorical data.
Categorical data preprocessing
In this dataset, Cabin and Embarked have missing values; both have a limited number of categories, so we can fill them with the most frequent value. But what if Name were missing? Names are generally unique, and even if there are a few duplicates, filling with the most frequent value makes no sense. So we would fill it with a constant value, such as "unknown".
As a template, our approach here covers two situations.
['Name','Cabin','Ticket'] are in fact ID-like, with almost no repetition, so we fill them with a constant value and then use LabelBinarizer to turn them into dummy variables. The remaining categorical columns we fill with the most frequent category, and then likewise use LabelBinarizer to turn them into dummy variables.
# Apply categorical imputer
constant_cols = ['Name', 'Cabin', 'Ticket']
frequency_cols = [_ for _ in categorical_cols if _ not in constant_cols]

categorical_fillna_freq_mapper = DataFrameMapper(
    [(col, [CategoricalImputer(), LabelBinarizer()]) for col in frequency_cols],
    input_df=True,
    df_out=True
)

categorical_fillna_constant_mapper = DataFrameMapper(
    [(col, [CategoricalImputer(strategy='constant', fill_value='unknown'), LabelBinarizer()]) for col in constant_cols],
    input_df=True,
    df_out=True
)
We also test the code:
transformed = categorical_fillna_freq_mapper.fit_transform(raw_df)
print(transformed.info())
transformed = categorical_fillna_constant_mapper.fit_transform(raw_df)
print(transformed.shape)
The results are as follows:
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Sex         891 non-null    int32
 1   Embarked_C  891 non-null    int32
 2   Embarked_Q  891 non-null    int32
 3   Embarked_S  891 non-null    int32
dtypes: int32(4)
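Notice that Sex stays a single column while Embarked expands into three: for a binary category, LabelBinarizer emits one 0/1 column rather than two. A tiny sketch (made-up values):

from sklearn.preprocessing import LabelBinarizer

print(LabelBinarizer().fit_transform(['male', 'female', 'male']))
# [[1]
#  [0]
#  [1]]   <- one column is enough for two classes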
And:
(891, 1720)
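Where do 1720 columns come from? After imputing with the constant 'unknown', Name, Ticket, and Cabin are each binarized into one column per distinct value. A quick check (a sketch; the exact counts assume the standard titanic_train.csv):

# count the distinct values that LabelBinarizer expands into columns
filled = raw_df[constant_cols].fillna('unknown')
print(filled.nunique())        # Name 891, Ticket 681, Cabin 148 (147 cabins + 'unknown')
print(filled.nunique().sum())  # 891 + 681 + 148 = 1720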
All the preprocessing steps in one FeatureUnion
Previously, we tested each preprocessing method (transformer, or mapper) one by one, and we saw that each result contains only the columns that method processed.
In practice, we can use FeatureUnion to combine all the processing steps we need (transformers or mappers) into a single pipeline, which makes everything clear at a glance.
Then call fit_transform to transform the original data so that our preprocessing looks more organized.
feature_union_1 = FeatureUnion([
    ("numeric_fillna_standardize", numeric_fillna_standardize_mapper),
    ("cat_freq", categorical_fillna_freq_mapper),
    ("cat_constant", categorical_fillna_constant_mapper)
])
df_1 = feature_union_1.fit_transform(raw_df)
print(df_1.shape)
print(raw_df.shape)
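Since FeatureUnion is itself a transformer, a natural next step (a sketch, not part of the article's code; the classifier is just an example choice) is to drop it into a full Pipeline, so preprocessing and modeling run as one object:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

model_pipeline = Pipeline([
    ("preprocess", feature_union_1),
    ("clf", LogisticRegression(max_iter=1000)),
])
model_pipeline.fit(raw_df, raw_df['Survived'])
print(model_pipeline.score(raw_df, raw_df['Survived']))  # training accuracy only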
This is the end of "what is the way of Python data preprocessing". Thank you for reading! If you want to learn more about the field, you can follow the site; the editor will keep putting out more high-quality practical articles for you.