2025-04-12 Update | From: SLTechnology News & Howtos > Development
Shulou(Shulou.com)06/02 Report--
Many newcomers are unclear about how Pandas functions can speed up data analysis and preprocessing. To help, this article explains the most useful ones in detail with worked examples; anyone who needs this can follow along, and I hope you gain something from it.
Foreword:
Pandas is the most widely used data analysis and manipulation library in Python. It provides many functions and methods to speed up the "data analysis" and "preprocessing" steps.
To illustrate, I will use a customer churn dataset as an example and share the functions and methods I use most often during data analysis.
The data is as follows:
import numpy as np
import pandas as pd

df = pd.read_csv("Churn_Modelling.csv")
print(df.shape)
df.columns
Result output:
(10000, 14)
Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited'], dtype='object')
1. Delete columns
df.drop(['RowNumber', 'CustomerId', 'Surname', 'CreditScore'], axis=1, inplace=True)
print(df[:2])
print(df.shape)
Result output:
  Geography  Gender  Age  Tenure  Balance  NumOfProducts  HasCrCard
0    France  Female   42       2      0.0              1          1

   IsActiveMember  EstimatedSalary  Exited
0               1        101348.88       1

(10000, 10)
Note: setting the axis parameter to 1 drops columns; setting it to 0 drops rows. Setting inplace=True applies the change to the DataFrame itself. We dropped four columns, so the column count fell from 14 to 10.
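As a quick sketch on a tiny made-up frame (not the churn dataset), the same drop can also be spelled with the columns= keyword, which avoids remembering axis=1:

```python
import pandas as pd

# Tiny illustrative DataFrame (not the churn dataset)
df_demo = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [101, 102],
    "Age": [42, 41],
})

# columns= is equivalent to passing the same labels with axis=1
df_demo = df_demo.drop(columns=["RowNumber", "CustomerId"])
print(df_demo.shape)  # (2, 1)
```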
2. Select specific columns
To read only some of the columns from the csv file, use the usecols parameter.
df_spec = pd.read_csv("Churn_Modelling.csv", usecols=['Gender', 'Age', 'Tenure', 'Balance'])
df_spec.head()

3. nrows
You can use the nrows parameter to create a DataFrame containing only the first 5000 rows of the csv file. The skiprows parameter, by contrast, skips rows at the start of the file: skiprows=5000 means the first 5000 rows are skipped when reading the csv file.
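To make skiprows concrete, here is a minimal sketch using an in-memory CSV as a stand-in for Churn_Modelling.csv; passing a range keeps the header row while skipping the first data rows:

```python
import pandas as pd
from io import StringIO

# 10 data rows under a header, standing in for the real file
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# range(1, 6) skips file rows 1-5, i.e. the first 5 data rows,
# while row 0 (the header) is kept
df_skip = pd.read_csv(StringIO(csv_text), skiprows=range(1, 6))
print(df_skip.shape)  # (5, 2)
```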
df_partial = pd.read_csv("Churn_Modelling.csv", nrows=5000)
print(df_partial.shape)

4. Sample
After creating the DataFrame, we may want a small sample to test with. We can use the n or frac parameter to set the sample size.
df = pd.read_csv("Churn_Modelling.csv", usecols=['Gender', 'Age', 'Tenure', 'Balance'])
df_sample = df.sample(n=1000)
df_sample2 = df.sample(frac=0.1)

5. Check for missing values
The isna function detects the missing values in the DataFrame. Combining isna with sum gives the number of missing values in each column.
df.isna().sum()

6. Add missing values using loc and iloc
For this example we add missing values using loc and iloc. The difference between them:
loc: selects by label
iloc: selects by position (integer index)
We first create 20 random indexes to select.
missing_index = np.random.randint(10000, size=20)
We will use loc to change some values to np.nan (missing values).
df.loc[missing_index, ['Balance', 'Geography']] = np.nan
The "Balance" and "Geography" columns now each contain 20 missing values. Let's do another example with iloc.
df.iloc[missing_index, -1] = np.nan

7. Fill in missing values
The fillna function is used to fill in the missing values. It offers a number of options. We can use specific values, aggregate functions (such as mean), or previous or next values.
avg = df['Balance'].mean()
df['Balance'].fillna(value=avg, inplace=True)
The method argument of the fillna function can be used to fill the missing value based on the previous or next value in the column (for example, method = "ffill"). It can be very useful for sequential data, such as time series.
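A minimal forward-fill sketch (newer pandas versions prefer the .ffill() method over fillna(method='ffill')):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: each NaN takes the last valid value before it
filled = s.ffill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
```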
8. Delete missing value
Another way to deal with missing values is to delete them. The following code deletes rows with any missing values.
df.dropna(axis=0, how='any', inplace=True)

9. Select rows according to conditions
In some cases, we need observations (that is, rows) that suit certain conditions.
france_churn = df[(df.Geography == 'France') & (df.Exited == 1)]
france_churn.Geography.value_counts()

10. Describe conditions with query
The query function provides a more flexible way to pass conditions: we describe them as strings.
df2 = df.query('80000 < Balance < 100000')

# Let's confirm the result by plotting a histogram of the Balance column.
df2['Balance'].plot(kind='hist', figsize=(8, 5))

11. Describe conditions with isin

A condition may involve multiple values. In that case it is better to use the isin method rather than writing out the values one by one.

df[df['Tenure'].isin([4, 6, 9, 10])][:3]

12. Groupby function
The Pandas Groupby function is a versatile and easy-to-use feature that helps you get an overview of the data. It makes it easier to browse datasets and reveal the basic relationships between variables.
We will work through several groupby examples. Let's start with something simple. The following code groups rows by the combination of Geography and Gender, then gives the average churn rate for each group.
df[['Geography', 'Gender', 'Exited']].groupby(['Geography', 'Gender']).mean()

13. Groupby combined with aggregate functions
The agg function allows multiple aggregate functions to be applied to a group, and the list of functions is passed as parameters.
df[['Geography', 'Gender', 'Exited']].groupby(['Geography', 'Gender']).agg(['mean', 'count'])

14. Apply different aggregate functions to different columns
df_summary = df[['Geography', 'Exited', 'Balance']].groupby('Geography').agg({'Exited': 'sum', 'Balance': 'mean'})
df_summary.rename(columns={'Exited': '# of churned customers', 'Balance': 'Average Balance of Customers'}, inplace=True)
In addition, the NamedAgg function allows you to rename columns in an aggregation
import pandas as pd

df_summary = df[['Geography', 'Exited', 'Balance']].groupby('Geography').agg(
    Number_of_churned_customers=pd.NamedAgg('Exited', 'sum'),
    Average_balance_of_customers=pd.NamedAgg('Balance', 'mean'))
print(df_summary)
15. Reset index
Have you noticed the format of the output above? We can change it by resetting the index.
print(df_summary.reset_index())
16. Reset and delete the original index
In some cases, we need to reset the index and also delete the original index.
df_new = df[['Geography', 'Exited', 'Balance']].sample(n=6).reset_index(drop=True)

17. Set a specific column as an index
We can set any column in the data frame as the index.
df_new.set_index('Geography')

18. Insert a new column
group = np.random.randint(10, size=6)
df_new['Group'] = group

19. where function
It replaces values in rows or columns based on a condition. The default replacement value is NaN, but we can also specify a different replacement value.
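A minimal sketch of where on a standalone Series: values that fail the condition are replaced by the second argument:

```python
import pandas as pd

s = pd.Series([2, 8, 4, 9])

# Keep values >= 5; everything else becomes 0
print(s.where(s >= 5, 0).tolist())  # [0, 8, 0, 9]
```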
df_new['Balance'] = df_new['Balance'].where(df_new['Group'] >= 6, 0)

20. Rank function
The rank function assigns a rank to each value. Let's create a column that ranks customers by their balance.
df_new['rank'] = df_new['Balance'].rank(method='first', ascending=False).astype('int')

21. Number of unique values in a column
This comes in handy when working with categorical variables, where we may need to check the number of unique categories. We can check the size of the series returned by value_counts, or use the nunique function.
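A quick illustrative comparison of the two approaches (made-up values):

```python
import pandas as pd

geo = pd.Series(["France", "Spain", "France", "Germany"])

print(geo.nunique())            # 3
print(geo.value_counts().size)  # 3 -- same count via value_counts
```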
df.Geography.nunique()

22. Memory usage
The memory_usage function returns the memory used by each column, in bytes.
df.memory_usage()
23. Data type conversion
By default, categorical data is stored with the object data type. However, this can lead to unnecessary memory usage, especially when a categorical variable has low cardinality.
Low cardinality means a column has very few unique values compared to its number of rows. For example, the Geography column has 3 unique values across 10000 rows.
We can save memory by changing its data type to Category.
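To see the savings, here is a quick comparison on illustrative data (3 unique strings repeated over 3000 rows):

```python
import pandas as pd

# Low-cardinality column: 3 unique strings over 3000 rows
geo = pd.Series(["France", "Spain", "Germany"] * 1000)

bytes_object = geo.memory_usage(deep=True)
bytes_category = geo.astype("category").memory_usage(deep=True)
print(bytes_category < bytes_object)  # True
```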
df['Geography'] = df['Geography'].astype('category')

24. Replace values
The replace function can be used to replace values in a DataFrame.
df['Geography'].replace({0: 'B1', 1: 'B2'})

25. Draw a histogram
Pandas is not a data visualization library, but it makes it very easy to create basic drawings.
I find it easier to create basic drawings using Pandas rather than using other data visualization libraries.
Let's create a histogram of the Balance column.

df['Balance'].plot(kind='hist', figsize=(10, 6), title='Customer Balance')

26. Reduce floating point decimals

Pandas may show too many decimal places for floating point numbers. We can easily adjust this.

27. Change display options
Instead of manually adjusting the display options each time, we can change the default display options for various parameters.
get_option: returns the current value of an option
set_option: changes an option. Let's change the display precision for decimals to 2.
pd.set_option("display.precision", 2)
Some other options that you may want to change include:
max_colwidth: the maximum number of characters displayed per column
max_columns: the maximum number of columns to display
max_rows: the maximum number of rows to display
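A short sketch of reading, changing, and restoring one of these options:

```python
import pandas as pd

pd.set_option("display.max_rows", 20)
print(pd.get_option("display.max_rows"))  # 20

# reset_option restores the library default
pd.reset_option("display.max_rows")
```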
28. Calculate the percentage change in a column
pct_change is used to calculate the percentage change between values in a series. It is useful for computing changes in a time series or in an ordered array of elements.
ser = pd.Series([2, 4, 5, 5, 6, 72, 72])
ser.pct_change()

29. String-based filtering
We may need to filter rows based on text data (such as customer names). For this example, I have added a Names column to df_new.
df_new[df_new.Names.str.startswith('Mi')]
30. Set the DataFrame style
We can do this using the style property, which returns a Styler object offering many options for formatting and displaying DataFrames. For example, we can highlight minimum or maximum values.
It also allows custom style functions to be applied.
df_new.style.highlight_max(axis=0, color='darkgreen')