In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-17 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >
Share
Shulou(Shulou.com)06/02 Report--
This article shows you what the most commonly used Pandas functions are from Excel to Python, which are concise and easy to understand, which can definitely brighten your eyes. I hope you can get something through the detailed introduction of this article.
Data preprocessing
This chapter is mainly about data preprocessing, sorting out the cleaned data for later statistics and analysis. It mainly includes data table merging, sorting, numerical sorting, data grouping and marking and so on.
1. Data table merging
There is no function of merging data tables directly in Excel, so it can be realized step by step by VLOOKUP function.
In Python, you can do this at one time through the merge function.
# set up df1 data sheet df1=pd.DataFrame ({"id": [1001Perry 1002Power1004pence1005pyrr1008], "gender": ['male','female','male','female','male','female','male','female'], "pay": [' YYYYYYYYZHANGZHANGJIANZHANG], "m-point": [10pence12pence20pence40Phone30])
Use the merge function to merge two data tables in the form of inner, and set the
The data of the two data tables are matched together to generate a new data table. And named it
Df_inner .
# data table matching and df_inner=pd.merge (df,df1,how='inner')
There are also left,right and outer ways to merge.
Df_left=pd.merge (df,df1,how='left') df_right=pd.merge (df,df1,how='right') df_outer=pd.merge (df,df1,how='outer')
two。 Set index column
Index columns can be used for data extraction, summary, and data filtering.
# set index column df_inner.set_index ('id')
3. Sort (by index, by number)
In Excel, you can sort the data table directly through the sort button under the data directory.
Order
In Python, you need to use ort_values function and sort_index function to complete sorting.
# sort df_inner.sort_values by the value of a specific column (by= ['age'])
The Sort_index function is used to sort the data table by the value of the index column.
# sort df_inner.sort_index () by index column
4. Data grouping
In Excel, you can group values by approximately matching the VLOOKUP function, or by using the "data penetrating shi table".
Where function is used to judge and group data in Python.
# if the value of the price column > 3000 group' group column shows high, otherwise, it shows lowdf_inner ['group'] = np.where (df_inner [' price'] > 3000 dagger column)
You can also group the data after judging the values of multiple fields. In the following code, data with a city column equal to beijing and a price column greater than or equal to 4000 is marked as 1.
# grouping data with multiple conditions marked df_inner.loc [(df_inner ['city'] = =' beijing') & (df_inner ['price'] > = 4000),' sign'] = 1
5. Data disaggregation
The "breakdown" function is provided under the data catalog in Excel.
Use the split function in Python to implement that the data in the category column in the data table contains two pieces of information, with the number in front of the category id and the letter followed by the size value. The middle is connected by a hyphen. We use the split function to split this field and match the split data table back to the original data table.
# sort the values of the category field in turn, and create a data table. The index value is the index column of df_inner, and the column names are category and sizepd.DataFrame ((x.split ('-') for x in df_inner ['category']), index=df_inner.index,columns= [' category','size'])
# match the sorted data table with the original df_inner data table df_inner=pd.merge (df_inner,split,right_index=True, left_index=True)
Data extraction
1. Extract by label (loc)
# extract the values of a single row by index df_inner.loc [3] id 1004date 2013-01-05 00:00:00city shenzhencategory 110-Cage 32price 5433gender femalem-point 40pay Ygroup highsign NaNcategory_1 110size CName: 3, dtype: object
Use colons to limit the range of extracted data, which is preceded by the starting tag value followed by the ending tag value.
# extract region row values df_inner.loc by index [0:5]
The Reset_index function is used to restore the index. Here we re-change the date of the date field
Set to the index of the data table and extract the data by date.
# reset index df_inner.reset_index ()
# set the date to index df_inner=df_inner.set_index ('date')
# extract all data before 4 days df_inner [: '2013-01-04']
two。 Extract by location (iloc)
Use the iloc function to extract the data in the data table by location, here before and after the colon
Is no longer the label name of the index, but the location of the data, starting at 0.
# using iloc to extract data df_inner.iloc by location region [: 3Personalized 2]
The iloc function can extract data not only by region, but also by location.
# use iloc to extract data df_inner.iloc separately by location [[0JEI 2jue 5], [4J5]]
The number in the preceding square brackets indicates the location of the row of the data, and the number in the following square brackets indicates the position of the column.
3. Extract by label and location (ix)
Ix is a mixture of loc and iloc, which can be extracted by index tag and counted by location.
According to the extraction.
# using ix to extract data by index tag and location mixed df_inner.ix [: '2013-01-03 parallel Magazine 4]
4. To extract (regions and conditional values) by condition
Use loc and isin functions together to extract data according to the specified conditions
# determine whether the value of city column is beijingdf_inner ['city'] .isin ([' beijing']) date2013-01-02 True2013-01-05 False2013-01-07 True2013-01-06 False2013-01-03 False2013-01-04 FalseName: city, dtype: bool
Nest the isin function into the data extraction function of loc, and judge the result as Ture data.
Extract it. Here we change the judgment condition to whether the city value is beijing and shanghai. If so, extract this piece of data.
# first determine whether the city column contains beijing and shanghai, and then extract the data of the compound condition. Df_ inner.loco [DF _ inner ['city'] .isin ([' beijing','shanghai'])]
Data screening
Filter by condition (and, or, non)
The "filter" function is provided under the Excel data directory, which is used to press different bars on the data table.
Pieces for screening.
In Python, the loc function is used to complete the filtering function with the filter criteria. With sum and count functions, we can also realize the functions of sumif and countif functions in Excel.
Filter using the "and" condition if the age is greater than 25 years old and the city is
Beijing .
# filter df_inner.loc using the and condition [(df_inner ['age'] > 25) & (df_inner [' city'] = = 'beijing'), [' id','city','age','category','gender']] /
# filter df_inner.loc [(df_inner ['age'] > 25) using the "or" condition | (df_inner [' city'] = = 'beijing'), [' id','city','age','category','gender']] .sort (['age'])
# filter df_inner.loc using the "not" condition [(df_inner ['city']! =' beijing'), ['id','city','age','category','gender']] .sort ([' id'])
Add the city column after the previous code and use the count function to count. Equivalent to the function of the countifs function in Excel
# count the filtered data by city column df_inner.loc [(df_inner ['city']! =' beijing'), ['id','city','age','category','gender']] .sort ([' id']) .city.count ()
Another way to filter is to use the query function
# use the query function to filter df_inner.query ('city = = ["beijing", "shanghai"]')
Add the price field and sum function after the previous code. For the filtered price field
Summation is equivalent to the function of the sumifs function in Excel.
# Sum the filtered results by price df_inner.query ('city = = ["beijing", "shanghai"]') .price.sum () 12230 data summary
Subtotals and pivots are used in Excel to summarize data by specific dimensions, and the main functions used in Python are groupby and pivot_table.
1. Classified summary
# count and summarize all columns df_inner.groupby ('city') .count () /
# count a specific ID column df_inner.groupby ('city') [' id'] .count () citybeijing 2guangzhou 1shanghai 2shenzhen 1Name: id, dtype: int64# to count two fields df_inner.groupby (['city','size']) [' id'] .count () city sizebeijing A 1F 1guangzhou A 1shanghai A 1B 1shenzhen C 1Name: id, dtype: int64
The aggregated data can also be calculated according to multiple dimensions at the same time.
# summarize the city fields and calculate the sum and mean of price. Df_inner.groupby ('city') [' price'] .agg ([len,np.sum, np.mean])
two。 Data perspective
Achieve the same effect through pivot_table function in Python
# set city as the row field, size as the column field and price as the value field. Calculate the quantity and amount of price respectively and summarize them by row and column. Pd.pivot_table (df_inner,index= ["city"], values= ["price"], columns= ["size"], aggfunc= [len,np.sum], fill_value=0,margins=True)
Data statistics
1. Data sampling
Data sampling is provided in the data analysis function of Excel.
Python completes data sampling through sample function
# simple data sampling df_inner.sample (naugh3)
The Weights parameter is the weight of the sample. You can change the result of the sample by setting different weights.
# manually set the sampling weight weights= [0,0,0,0,0.5,0.5] df_inner.sample (nasty 2, weights=weights)
Parameter replace in the Sample function, which is used to set whether to put it back after sampling
# do not put back df_inner.sample (nasty 6, replace=False) # after sampling, put back df_inner.sample (nasty 6, replace=True)
two。 Descriptive statistics
The data can be described and counted by Describe in Python.
# Datasheet descriptive statistics df_inner.describe (). Round (2). T
3. Correlation analysis
In Python, the operation of correlation analysis is completed through the corr function, and the correlation coefficient is returned.
# correlation analysis df_inner ['price'] .corr (df_inner [' mmurpoint']) 0.7746655561708526 data table correlation analysis df_inner.corr ()
Data output
1. Write to Excel
# Export to Excel format df_inner.to_Excel ('Excel_to_Python.xlsx', sheet_name='bluewhale_cc')
two。 Write to csv
# output to CSV format df_inner.to_csv ('Excel_to_Python.csv') what are the most commonly used Pandas functions from Excel to Python? have you learned any knowledge or skills? If you want to learn more skills or enrich your knowledge reserve, you are welcome to follow the industry information channel.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.