
How to use python for data analysis like Excel


In this article, the editor shares how to use Python for data analysis the way you would in Excel. I hope you will get something out of it after reading; let's explore it together.

Excel is the most commonly used tool in data analysis. By comparing Python's functions with Excel's, this article introduces how to complete the data processing and analysis you would normally do in Excel using Python. Data processing in Python is done with the pandas library. We have summarized the 36 most commonly used functions from the 1,787 pages of the official pandas documentation, and use them to show how to complete data generation and import, data cleaning and preprocessing, as well as the most common operations such as data classification, data filtering, grouping and summarizing, pivoting, and so on.

1. Data preprocessing

This part covers preprocessing: organizing the cleaned data for the statistics and analysis that follow. It mainly includes merging data tables, setting an index, sorting, grouping and marking data, and splitting values into columns.

Data table merging

The first step is to merge different data tables. Here we create a new data table, df1, and merge the two tables df and df1. Excel has no function that merges data tables directly, so this is usually done step by step with the VLOOKUP function. In Python, it can be done in one step with the merge function. Let's create a df1 data table to merge with the df data table.
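The df data table itself is not created in this excerpt. Below is a minimal sketch of what it might look like, assuming only the columns (id, date, city, age, category, price) that the later examples reference; the values are purely illustrative, so that the df1 example that follows has something to merge with.

import pandas as pd
import numpy as np

# hypothetical df table assumed by the examples below (illustrative values only)
df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('2013-01-02', periods=6),
                   "city": ['beijing', 'sh', 'guangzhou', 'shenzhen', 'shanghai', 'beijing'],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, 3299, 2133, 5433, 1630, 4432]})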

# create the df1 data table
df1 = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
                    "gender": ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
                    "pay": ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
                    "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})

Use the merge function to merge the two data tables in inner mode, matching the data common to both tables into a new data table named df_inner.

# match and merge the data tables, inner mode
df_inner = pd.merge(df, df1, how='inner')

In addition to the inner mode, there are also left, right and outer merge modes. The differences between these modes are explained and compared in detail in my other articles.

# other merge modes
df_left = pd.merge(df, df1, how='left')
df_right = pd.merge(df, df1, how='right')
df_outer = pd.merge(df, df1, how='outer')
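A quick way to see what each mode keeps is to compare the row counts of the four results. The sketch below only uses the four tables created above; the actual counts depend on how many id values df and df1 share.

# compare how many rows each merge mode returns
for name, table in [('inner', df_inner), ('left', df_left),
                    ('right', df_right), ('outer', df_outer)]:
    print(name, len(table))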

Set index column

After merging the data tables, we set an index column on the df_inner data table. The index has many uses, such as data extraction, summarizing, and data filtering.

The function to set the index is set_index.

# set the index column
df_inner.set_index('id')
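Note that set_index returns a new DataFrame rather than modifying df_inner in place, so the line above only displays the re-indexed table. To actually keep the new index, assign the result back; df_indexed below is just a hypothetical name for illustration.

# keep the re-indexed table by assigning the result (hypothetical variable name)
df_indexed = df_inner.set_index('id')
# or modify the original table in place
# df_inner.set_index('id', inplace=True)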

Sorting (by index, by value)

In Excel, you can sort a data table directly with the sort buttons under the Data tab, which is relatively simple. In Python, you use the sort_values function and the sort_index function to sort.

In python, you can sort the data table either by index or by the value of the specified column. First, we sort the data table by the age of the user in the age column.

The function used is sort_values.

# sort by the values of a specific column
df_inner.sort_values(by=['age'])

The sort_index function is used to sort the data table by the values of the index column.

# sort by the index column
df_inner.sort_index()

Data grouping

In Excel, you can group values by approximate matching with the VLOOKUP function, or by using a PivotTable. In Python, the where function is used to group the data.

The where function judges the data and groups it accordingly. In the code below, we check the value of the price column, put the rows that meet the condition into one group and the rest into another, and mark them with a group field.

# if the value of the price column is > 3000, the group column shows high, otherwise low
df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')
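To quickly check the result of the grouping, you can count how many rows fell into each group; this small verification step is a sketch and not part of the original walkthrough.

# count the rows in each group
df_inner['group'].value_counts()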

In addition to the where function, you can also group data by judging the values of multiple fields. In the code below, rows with a city column equal to beijing and a price column greater than or equal to 4000 are marked with 1.

# mark data that meets multiple conditions
df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign'] = 1

Splitting data into columns

The opposite of grouping is splitting a value apart; Excel provides the Text to Columns feature under the Data tab for this. In Python, the split function is used to achieve the separation.

The data in the category column of the data table contains two pieces of information: a numeric category id in front and a letter giving the size value behind, connected by a hyphen. We use the split function to split this field and merge the split result back into the original data table.

# split the values of the category field one by one and build a data table from the result;
# the index is df_inner's index and the column names are category and size
split = pd.DataFrame((x.split('-') for x in df_inner['category']),
                     index=df_inner.index, columns=['category', 'size'])

# match the split data table back to the original df_inner data table
df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)
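In newer pandas versions the same result can be reached more compactly with str.split and expand=True, which returns the pieces as DataFrame columns directly. The sketch below is an alternative to the two steps above, not the article's own code, and the column names category_1 and size are only illustrative.

# alternative to the two steps above: split on the hyphen and expand into columns
split = df_inner['category'].str.split('-', expand=True)
split.columns = ['category_1', 'size']
# attach the new columns back to the original table
df_inner = df_inner.join(split)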

2. Data extraction

This part covers data extraction, which is also the most common task in data analysis. It mainly uses three functions: loc, iloc and ix. The loc function extracts by label value, iloc extracts by position, and ix extracts by label and position at the same time. The use of each function is described below.

Extract by label (loc)

The loc function extracts data by the index label of the data table. The following code extracts the single row whose index is 3.

# extract a single row value by index
df_inner.loc[3]

id                           1004
date          2013-01-05 00:00:00
city                     shenzhen
category                    110-C
age                            32
price                        5433
gender                     female
m-point                        40
pay                             Y
group                        high
sign                          NaN
category_1                    110
size                            C
Name: 3, dtype: object

Use colons to limit the range of extracted data, preceded by the start tag value and followed by the end tag value. The following data rows are extracted from 0 to 5.

# extract a range of rows by index
df_inner.loc[0:5]

The reset_index function is used to restore the default index. Here we reset the index and then set the date field as the index of the data table, so that data can be extracted by date.

# reset the index
df_inner.reset_index()

# set the date column as the index
df_inner = df_inner.set_index('date')

Use a colon to limit the range of extracted data; a blank before the colon means the range starts from the beginning. The following extracts all data up to January 4, 2013.

# extract all data up to January 4, 2013
df_inner[:'2013-01-04']

Extract by location (iloc)

Use the iloc function to extract the data in the data table by location, where the number before and after the colon is no longer the label name of the index, but the location of the data, starting at 0.

# use iloc to extract a region of data by position
df_inner.iloc[:3, :2]

The iloc function can extract data not only by region but also position by position. The 0, 2 and 5 in the first pair of square brackets indicate the row positions, and the numbers in the second pair indicate the column positions.

# use iloc to extract data position by position
df_inner.iloc[[0, 2, 5], [4, 5]]

Extract by label and location (ix)

ix is a mixture of loc and iloc that can extract data both by index label and by position. In the code below, the rows are selected by the date index label and the columns are selected by position.

# use ix to extract data by a mixture of index label and position
df_inner.ix[:'2013-01-03', :4]
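Note that the ix indexer is deprecated and has been removed in newer pandas versions. On a current install, the same mixed selection can be written with loc by selecting rows by date label and picking the first four columns by position; this is a sketch of the equivalent, not the article's own code.

# equivalent without ix: rows by date label, first four columns by position
df_inner.loc[:'2013-01-03', df_inner.columns[:4]]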

Extract by condition (region and condition value)

In addition to extracting data by label and position, you can also extract data according to specific conditions. Below, the loc and isin functions are used together to extract data that meets the specified conditions.

Use the isin function to determine whether the value in city is beijing.

# determine whether the value of the city column is beijing
df_inner['city'].isin(['beijing'])

date
2013-01-02     True
2013-01-05    False
2013-01-07     True
2013-01-06    False
2013-01-03    False
2013-01-04    False
Name: city, dtype: bool

Nest the isin function inside the loc extraction function and keep the rows whose judgment result is True. Here we change the condition to whether the city value is beijing or shanghai; if so, the row is extracted.

# first determine whether the city column contains beijing or shanghai, then extract the data that meets the condition
df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]

Extracting part of a value can also do work similar to splitting data into columns, pulling a fixed piece out of a combined value.

category = df_inner['category']
0    100-A
3    110-C
5    130-F
4    210-A
1    100-B
2    210-A
Name: category, dtype: object

# extract the first three characters and generate a data table
pd.DataFrame(category.str[:3])

3. Data screening

This part covers data filtering, using "and", "or" and "not" conditions together with greater than, less than and equal to comparisons to filter data, and then counting and summing the results. This is similar to Excel's filter feature and its countifs and sumifs functions.

Filter by condition (and, or, not)

Excel provides a filter feature under the Data tab for filtering a data table by different criteria. In Python, the loc function together with filter conditions performs the filtering. Combined with the sum and count functions, it can also reproduce the sumif and countif functions in Excel.

Use the "and" condition to filter for age greater than 25 and city equal to beijing. After filtering, only one row meets the requirements.

# filter using the "and" condition
df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'),
             ['id', 'city', 'age', 'category', 'gender']]

Use the "or" condition for screening, older than 25 years old or beijing in the city. After screening, 6 pieces of data meet the requirements.

# filter using the "or" condition
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'),
             ['id', 'city', 'age', 'category', 'gender']].sort_values(['age'])

Add the price field and the sum function after the previous code to sum the values of the price field according to the filtered result, which is equivalent to the function of sumifs in excel.

# sum the price field of the filtered data
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'),
             ['id', 'city', 'age', 'category', 'gender', 'price']].sort_values(['age']).price.sum()

19796

Using the "not" condition to filter, the city is not equal to beijing. There are four pieces of data that meet the criteria. Sort the filter results by id column.

# filter df_inner.loc using the "not" condition [(df_inner ['city']! =' beijing'), ['id','city','age','category','gender']] .sort ([' id'])

Add the city column after the previous code and use the count function to count the rows, which is equivalent to the countifs function in Excel.

# count the filtered data by the city column
df_inner.loc[(df_inner['city'] != 'beijing'),
             ['id', 'city', 'age', 'category', 'gender']].sort_values(['id']).city.count()

4

Another way to filter is to use the query function. Here are the specific code and filter results.

# filter with the query function
df_inner.query('city == ["beijing", "shanghai"]')

Add the price field and sum function after the previous code. Summing the filtered price field is equivalent to the function of the sumifs function in excel.

# sum the price field of the query result
df_inner.query('city == ["beijing", "shanghai"]').price.sum()

12230

After reading this article, I believe you have some understanding of how to use Python for data analysis like Excel. If you want to learn more, welcome to follow the industry information channel, and thank you for reading!
