2025-04-12 Update | From: SLTechnology News & Howtos > Development
Shulou(Shulou.com)06/02 Report--
Many newcomers are unclear about how Pandas functions can speed up data analysis and preprocessing. To help, this article explains the most useful ones in detail with worked examples; anyone who needs this can follow along, and I hope you gain something from it.
Foreword:
Pandas is the most widely used data analysis and manipulation library in Python. It provides many functions and methods to speed up the "data analysis" and "preprocessing" steps.
To illustrate, I will use a customer churn dataset as an example and share the functions and methods I use most often during data analysis.
The data is as follows:
import numpy as np
import pandas as pd

df = pd.read_csv("Churn_Modelling.csv")
print(df.shape)
df.columns
Result output:
(10000, 14)
Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited'], dtype='object')
1. Delete columns
df.drop(['RowNumber', 'CustomerId', 'Surname', 'CreditScore'], axis=1, inplace=True)
print(df[:2])
print(df.shape)
Result output:
  Geography  Gender  Age  Tenure  Balance  NumOfProducts  HasCrCard
0    France  Female   42       2      0.0              1          1

   IsActiveMember  EstimatedSalary  Exited
0               1        101348.88       1

(10000, 10)
Note: setting the axis parameter to 1 drops columns; setting it to 0 drops rows. Setting inplace=True applies the change to the DataFrame itself. We dropped four columns, so the column count fell from 14 to 10.
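As a quick sketch on a tiny made-up frame (not the churn dataset), the same drop can also be spelled with the columns= keyword, which avoids remembering axis=1:

```python
import pandas as pd

# Tiny illustrative DataFrame (not the churn dataset)
df_demo = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [101, 102],
    "Age": [42, 41],
})

# columns= is equivalent to passing the same labels with axis=1
df_demo = df_demo.drop(columns=["RowNumber", "CustomerId"])
print(df_demo.shape)  # (2, 1)
```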
2. Select specific columns
To read only some of the columns from the csv file, use the usecols parameter.
df_spec = pd.read_csv("Churn_Modelling.csv", usecols=['Gender', 'Age', 'Tenure', 'Balance'])
df_spec.head()

3. nrows
You can use the nrows parameter to create a DataFrame containing only the first 5000 rows of the csv file. The skiprows parameter, by contrast, skips rows at the start of the file: skiprows=5000 means the first 5000 rows are skipped when reading the csv file.
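To make skiprows concrete, here is a minimal sketch using an in-memory CSV as a stand-in for Churn_Modelling.csv; passing a range keeps the header row while skipping the first data rows:

```python
import pandas as pd
from io import StringIO

# 10 data rows under a header, standing in for the real file
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# range(1, 6) skips file rows 1-5, i.e. the first 5 data rows,
# while row 0 (the header) is kept
df_skip = pd.read_csv(StringIO(csv_text), skiprows=range(1, 6))
print(df_skip.shape)  # (5, 2)
```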
df_partial = pd.read_csv("Churn_Modelling.csv", nrows=5000)
print(df_partial.shape)

4. Sample
After creating the DataFrame, we may want a small sample to test with. We can use the n or frac parameter to set the sample size.
df = pd.read_csv("Churn_Modelling.csv", usecols=['Gender', 'Age', 'Tenure', 'Balance'])
df_sample = df.sample(n=1000)
df_sample2 = df.sample(frac=0.1)

5. Check for missing values
The isna function detects the missing values in the DataFrame. Combining isna with sum gives the number of missing values in each column.
df.isna().sum()

6. Add missing values using loc and iloc
For this example we add missing values using loc and iloc. The difference between them:
loc: selects by label
iloc: selects by position (integer index)
We first create 20 random indexes to select.
missing_index = np.random.randint(10000, size=20)
We will use loc to change some values to np.nan (missing values).
df.loc[missing_index, ['Balance', 'Geography']] = np.nan
The "Balance" and "Geography" columns now each contain 20 missing values. Let's do another example with iloc.
df.iloc[missing_index, -1] = np.nan

7. Fill in missing values
The fillna function is used to fill in the missing values. It offers a number of options. We can use specific values, aggregate functions (such as mean), or previous or next values.
avg = df['Balance'].mean()
df['Balance'].fillna(value=avg, inplace=True)
The method argument of the fillna function can be used to fill the missing value based on the previous or next value in the column (for example, method = "ffill"). It can be very useful for sequential data, such as time series.
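A minimal forward-fill sketch (newer pandas versions prefer the .ffill() method over fillna(method='ffill')):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: each NaN takes the last valid value before it
filled = s.ffill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
```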
8. Delete missing value
Another way to deal with missing values is to delete them. The following code deletes rows with any missing values.
df.dropna(axis=0, how='any', inplace=True)

9. Select rows according to conditions
In some cases, we need observations (that is, rows) that suit certain conditions.
france_churn = df[(df.Geography == 'France') & (df.Exited == 1)]
france_churn.Geography.value_counts()

10. Describe conditions with query
The query function provides a more flexible way to pass conditions: we describe them as strings.
df2 = df.query('80000 < Balance < 100000')

# Let's confirm the result by plotting a histogram of the Balance column.
df2['Balance'].plot(kind='hist', figsize=(8, 5))

11. Describe conditions with isin

A condition may involve multiple values. In that case it is better to use the isin method rather than writing out the values one by one.

df[df['Tenure'].isin([4, 6, 9, 10])][:3]

12. Groupby function
The Pandas Groupby function is a versatile and easy-to-use feature that helps you get an overview of the data. It makes it easier to browse datasets and reveal the basic relationships between variables.
We will work through several groupby examples. Let's start with something simple. The following code groups rows by the combination of Geography and Gender, then gives the average churn rate for each group.
df[['Geography', 'Gender', 'Exited']].groupby(['Geography', 'Gender']).mean()

13. Groupby combined with aggregate functions
The agg function allows multiple aggregate functions to be applied to a group, and the list of functions is passed as parameters.
df[['Geography', 'Gender', 'Exited']].groupby(['Geography', 'Gender']).agg(['mean', 'count'])

14. Apply different aggregate functions to different columns
df_summary = df[['Geography', 'Exited', 'Balance']].groupby('Geography').agg({'Exited': 'sum', 'Balance': 'mean'})
df_summary.rename(columns={'Exited': '# of churned customers', 'Balance': 'Average Balance of Customers'}, inplace=True)
In addition, the NamedAgg function allows you to rename columns in an aggregation
import pandas as pd

df_summary = df[['Geography', 'Exited', 'Balance']].groupby('Geography').agg(
    Number_of_churned_customers=pd.NamedAgg('Exited', 'sum'),
    Average_balance_of_customers=pd.NamedAgg('Balance', 'mean'))
print(df_summary)
15. Reset index
Have you noticed the format of the output above? We can change it by resetting the index.
print(df_summary.reset_index())
16. Reset and delete the original index
In some cases, we need to reset the index and also delete the original index.
df_new = df[['Geography', 'Exited', 'Balance']].sample(n=6).reset_index(drop=True)

17. Set a specific column as an index
We can set any column in the data frame as the index.
df_new.set_index('Geography')

18. Insert a new column
group = np.random.randint(10, size=6)
df_new['Group'] = group

19. where function
It replaces values in rows or columns based on a condition. The default replacement value is NaN, but we can also specify a different replacement value.
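A minimal sketch of where on a standalone Series: values that fail the condition are replaced by the second argument:

```python
import pandas as pd

s = pd.Series([2, 8, 4, 9])

# Keep values >= 5; everything else becomes 0
print(s.where(s >= 5, 0).tolist())  # [0, 8, 0, 9]
```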
df_new['Balance'] = df_new['Balance'].where(df_new['Group'] >= 6, 0)

20. Rank function
The rank function assigns a rank to each value. Let's create a column that ranks customers by their balance.
df_new['rank'] = df_new['Balance'].rank(method='first', ascending=False).astype('int')

21. Number of unique values in a column
This comes in handy when working with categorical variables, where we may need to check the number of unique categories. We can check the size of the series returned by value_counts, or use the nunique function.
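A quick illustrative comparison of the two approaches (made-up values):

```python
import pandas as pd

geo = pd.Series(["France", "Spain", "France", "Germany"])

print(geo.nunique())            # 3
print(geo.value_counts().size)  # 3 -- same count via value_counts
```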
df.Geography.nunique()

22. Memory usage
The memory_usage function returns the memory used by each column, in bytes.
df.memory_usage()
23. Data type conversion
By default, categorical data is stored with the object data type. However, this can lead to unnecessary memory usage, especially when a categorical variable has low cardinality.
Low cardinality means a column has very few unique values compared to its number of rows. For example, the Geography column has 3 unique values across 10000 rows.
We can save memory by changing its data type to Category.
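To see the savings, here is a quick comparison on illustrative data (3 unique strings repeated over 3000 rows):

```python
import pandas as pd

# Low-cardinality column: 3 unique strings over 3000 rows
geo = pd.Series(["France", "Spain", "Germany"] * 1000)

bytes_object = geo.memory_usage(deep=True)
bytes_category = geo.astype("category").memory_usage(deep=True)
print(bytes_category < bytes_object)  # True
```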
df['Geography'] = df['Geography'].astype('category')

24. Replace values
The replace function can be used to replace values in a DataFrame.
df['Geography'].replace({0: 'B1', 1: 'B2'})

25. Draw a histogram
Pandas is not a data visualization library, but it makes it very easy to create basic drawings.
I find it easier to create basic drawings using Pandas rather than using other data visualization libraries.
Let's create a histogram of the Balance column.

df['Balance'].plot(kind='hist', figsize=(10, 6), title='Customer Balance')

26. Reduce floating point decimals

Pandas may show too many decimal places for floating point numbers. We can easily adjust this.

27. Change display options
Instead of manually adjusting the display options each time, we can change the default display options for various parameters.
get_option: returns the current value of an option
set_option: changes an option. Let's change the display precision for decimals to 2.
pd.set_option("display.precision", 2)
Some other options that you may want to change include:
max_colwidth: the maximum number of characters displayed per column
max_columns: the maximum number of columns to display
max_rows: the maximum number of rows to display
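A short sketch of reading, changing, and restoring one of these options:

```python
import pandas as pd

pd.set_option("display.max_rows", 20)
print(pd.get_option("display.max_rows"))  # 20

# reset_option restores the library default
pd.reset_option("display.max_rows")
```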
28. Calculate the percentage change in a column
pct_change is used to calculate the percentage change between values in a series. It is useful for computing changes in a time series or in an ordered array of elements.
ser = pd.Series([2, 4, 5, 5, 6, 72, 72])
ser.pct_change()

29. String-based filtering
We may need to filter rows based on text data (such as customer names). For this example, I have added a Names column to df_new.
df_new[df_new.Names.str.startswith('Mi')]
30. Set the DataFrame style
We can do this using the style property, which returns a Styler object offering many options for formatting and displaying DataFrames. For example, we can highlight minimum or maximum values.
It also allows custom style functions to be applied.
df_new.style.highlight_max(axis=0, color='darkgreen')