2025-04-06 Update From: SLTechnology News&Howtos > Internet Technology
Shulou(Shulou.com)06/01 Report--
This article demonstrates how to use Pandas in Python. The material is meant to be easy to follow and to clear up common points of confusion; let's work through "how to use Pandas in python" together.
1. Series and DataFrame
Pandas is built on NumPy. For background on NumPy, please refer to my previous article, pre-machine learning (3): master common NumPy usage in 30 minutes.
Pandas is particularly well suited to tabular data, such as SQL tables and Excel sheets, ordered or unordered time series, and arbitrary matrix data with row and column labels.
Open Jupyter Notebook and import numpy and pandas to begin our tutorial:
```python
import numpy as np
import pandas as pd
```

1. Pandas.Series
A Series is a one-dimensional ndarray with an index. The index values need not be unique, but they must be hashable.
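A quick sketch of these index rules (the labels here are made up for illustration):

```python
import pandas as pd

# A Series index may contain repeated labels, but each label must be hashable.
s = pd.Series([1, 3, 5], index=['x', 'y', 'x'])

# Selecting a repeated label returns every matching element.
matches = s['x']
```

Here `matches` is itself a Series containing both values labeled `'x'`, and `s.index.is_unique` is False.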
```python
pd.Series([1, 3, 5, np.nan, 6, 8])
```
Output:
```
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
```
We can see that the default index values are 0, 1, 2, 3, 4, 5. Now add the index attribute and set the labels to 'c', 'a', 'i', 'yong', 'j', 'i':

```python
pd.Series([1, 3, 5, np.nan, 6, 8], index=['c', 'a', 'i', 'yong', 'j', 'i'])
```
The output is as follows, and we can see that index is repeatable.
```
c       1.0
a       3.0
i       5.0
yong    NaN
j       6.0
i       8.0
dtype: float64
```

2. Pandas.DataFrame
A DataFrame is a table structure with rows and columns. It can be understood as a dictionary of multiple Series objects that share one index.
```python
pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
             index=['i', 'ii', 'iii'], columns=['A', 'B', 'C'])
```
The output table is as follows, where index labels the rows and columns labels the columns.
| | A | B | C |
|:-|:-|:-|:-|
| i | 1 | 2 | 3 |
| ii | 4 | 5 | 6 |
| iii | 7 | 8 | 9 |

II. Common usage of Pandas

1. Accessing data
Prepare the data: randomly generate a two-dimensional array of six rows and four columns, with row labels from 2021-01-01 to 2021-01-06 and column labels A, B, C, D.
```python
import numpy as np
import pandas as pd
np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(6, 4),
                  index=pd.date_range('20210101', periods=6),
                  columns=list('ABCD'))
df
```
The display form is as follows:
| | A | B | C | D |
|:-|:-|:-|:-|:-|
| 2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 |
| 2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
| 2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 |
| 2021-01-04 | -0.33032 | -1.40384 | -0.93809 | 1.48804 |
| 2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 |
| 2021-01-06 | -0.816064 | 1.30197 | 0.656281 | -1.2718 |

1.1 head() and tail()
Look at the first few lines of the table:
```python
df.head(2)
```
The display form is as follows:
| | A | B | C | D |
|:-|:-|:-|:-|:-|
| 2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 |
| 2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
Look at the last few rows of the table:
```python
df.tail(3)
```
The display form is as follows:
| | A | B | C | D |
|:-|:-|:-|:-|:-|
| 2021-01-04 | -0.33032 | -1.40384 | -0.93809 | 1.48804 |
| 2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 |
| 2021-01-06 | -0.816064 | 1.30197 | 0.656281 | -1.2718 |

1.2 describe()
The describe method generates descriptive statistics for a DataFrame, giving a quick view of the dataset's distribution. Note that these statistics exclude NaN values.
```python
df.describe()
```
The display is as follows:
| | A | B | C | D |
|:-|:-|:-|:-|:-|
| count | 6 | 6 | 6 | 6 |
| mean | 0.0825402 | 0.0497552 | -0.181309 | 0.22896 |
| std | 0.551412 | 1.07834 | 0.933155 | 1.13114 |
| min | -0.816064 | -1.40384 | -1.64592 | -1.2718 |
| 25% | -0.18 | -0.553043 | -0.737194 | -0.587269 |
| 50% | 0.298188 | -0.134555 | 0.106933 | 0.287363 |
| 75% | 0.342885 | 0.987901 | 0.556601 | 1.16805 |
| max | 0.696541 | 1.30197 | 0.656281 | 1.48804 |
First, let's review the relevant formulas.

Mean:

$\bar x = \frac{\sum_{i=1}^{n} x_i}{n}$

Variance:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar x)^2}{n}$

Standard deviation (std):

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar x)^2}{n}}$

(Strictly speaking, pandas' std divides by n - 1 rather than n by default, i.e. it reports the sample standard deviation.)
Now let's explain the meaning of each statistic produced by describe, taking column A as an example.

count is the count: column A has 6 non-empty values.

mean is the average: the mean of the non-empty values in column A is 0.0825402.

std is the standard deviation: for column A it is 0.551412.

min is the minimum: the minimum of column A is -0.816064; 0% of the data is smaller than -0.816064.

25% is the first quartile: for column A it is -0.18, i.e. 25% of the data is smaller than -0.18.

50% is the median: for column A it is 0.298188, i.e. 50% of the data is smaller than 0.298188.

75% is the third quartile: for column A it is 0.342885, i.e. 75% of the data is smaller than 0.342885.

max is the maximum: for column A it is 0.696541; no value is larger than 0.696541.
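We can sanity-check these definitions against describe() directly (note that describe() reports the sample standard deviation, dividing by n - 1):

```python
import numpy as np
import pandas as pd

# Rebuild column A from the tutorial's random frame.
np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(6, 4),
                  index=pd.date_range('20210101', periods=6),
                  columns=list('ABCD'))

desc = df['A'].describe()
```

Each entry of `desc` matches the corresponding NumPy computation on the raw values.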
1.3 T
T is short for transpose: the rows and columns are swapped.
```python
df.T
```
The display form is as follows:
| | 2021-01-01 | 2021-01-02 | 2021-01-03 | 2021-01-04 | 2021-01-05 | 2021-01-06 |
|:-|:-|:-|:-|:-|:-|:-|
| A | 0.270961 | 0.696541 | 0.325415 | -0.33032 | 0.348708 | -0.816064 |
| B | -0.405463 | 0.136352 | -0.602236 | -1.40384 | 1.27175 | 1.30197 |
| C | 0.348373 | -1.64592 | -0.134508 | -0.93809 | 0.626011 | 0.656281 |
| D | 0.828572 | -0.69841 | 1.28121 | 1.48804 | -0.253845 | -1.2718 |

1.4 sort_values()
Specify a column to sort by; the following code sorts in ascending order by column C.
```python
df.sort_values(by='C')
```
The display form is as follows:
| | A | B | C | D |
|:-|:-|:-|:-|:-|
| 2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
| 2021-01-04 | -0.33032 | -1.40384 | -0.93809 | 1.48804 |
| 2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 |
| 2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 |
| 2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 |
| 2021-01-06 | -0.816064 | 1.30197 | 0.656281 | -1.2718 |

1.5 nlargest()
Select the n rows with the largest values in a given column. For example, df.nlargest(2, 'A') returns the 2 rows with the largest values in column A.

```python
df.nlargest(2, 'A')
```
The display form is as follows:
| | A | B | C | D |
|:-|:-|:-|:-|:-|
| 2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
| 2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 |

1.6 sample()
The sample method means to view random sample data.
df.sample(5) returns five random rows of data.
```python
df.sample(5)
```
The parameter frac stands for fraction. For example, frac=0.01 returns a random 1% sample of the data.
```python
df.sample(frac=0.01)
```

2. Selecting data

2.1 Selecting by label
We use df['A'] to select column A.
```python
df['A']
```
Output column A data, which is also a Series object:
```
2021-01-01    0.270961
2021-01-02    0.696541
2021-01-03    0.325415
2021-01-04   -0.330320
2021-01-05    0.348708
2021-01-06   -0.816064
Freq: D, Name: A, dtype: float64
```
df[0:3] produces the same result as df.head(3), but df[0:3] uses NumPy-style slice selection, which shows how well Pandas interoperates with NumPy.
```python
df[0:3]
```
The display table is as follows:

| | A | B | C | D |
|:-|:-|:-|:-|:-|
| 2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 |
| 2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
| 2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 |
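A quick check of this equivalence between positional slicing and head():

```python
import numpy as np
import pandas as pd

np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(6, 4),
                  index=pd.date_range('20210101', periods=6),
                  columns=list('ABCD'))

# df[0:3] is NumPy-style row slicing; head(3) is the Pandas idiom.
sliced = df[0:3]
headed = df.head(3)
```

The two frames are element-for-element identical.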
Specify row and column labels through the loc method.
```python
df.loc['2021-01-01':'2021-01-02', ['A', 'B']]
```
The display form is as follows:
| | A | B |
|:-|:-|:-|
| 2021-01-01 | 0.270961 | -0.405463 |
| 2021-01-02 | 0.696541 | 0.136352 |

2.2 Selecting by position
iloc differs from loc: loc specifies concrete labels, while iloc specifies index positions. df.iloc[3:5, 0:3] selects the rows at indexes 3 and 4 and the columns at indexes 0, 1, and 2; that is, the 4th and 5th rows and the 1st, 2nd, and 3rd columns. Note that index numbering starts at 0. The colon indicates an interval, with the left and right sides marking the start and end; for example, 3:5 is the half-open interval [3, 5), which does not include 5 itself.
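Before applying this to df, the difference can be checked on a tiny, made-up frame: loc slices include both endpoint labels, while iloc slices follow Python's half-open rule.

```python
import pandas as pd

df_small = pd.DataFrame({'v': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# loc slices are inclusive of BOTH endpoint labels...
by_label = df_small.loc['b':'d', 'v']
# ...while iloc slices are half-open [start, stop).
by_pos = df_small.iloc[1:3, 0]
```

`by_label` contains three values (20, 30, 40); `by_pos` contains only two (20, 30).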
```python
df.iloc[3:5, 0:3]
```
| | A | B | C |
|:-|:-|:-|:-|
| 2021-01-04 | -0.33032 | -1.40384 | -0.93809 |
| 2021-01-05 | 0.348708 | 1.27175 | 0.626011 |

```python
df.iloc[:, 1:3]
```
| | B | C |
|:-|:-|:-|
| 2021-01-01 | -0.405463 | 0.348373 |
| 2021-01-02 | 0.136352 | -1.64592 |
| 2021-01-03 | -0.602236 | -0.134508 |
| 2021-01-04 | -1.40384 | -0.93809 |
| 2021-01-05 | 1.27175 | 0.626011 |
| 2021-01-06 | 1.30197 | 0.656281 |

2.3 Boolean indexing
A DataFrame can be filtered by a condition: rows where the condition is True are returned, and rows where it is False are filtered out.
We set a filter to determine whether column An is greater than 0.
```python
filter = df['A'] > 0
filter
```
The output is as follows; note that the values for 2021-01-04 and 2021-01-06 are False.
```
2021-01-01     True
2021-01-02     True
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Freq: D, Name: A, dtype: bool
```
We view the dataset through the filter.
```python
df[filter]  # equivalent to df[df['A'] > 0]
```
Looking at the table, we can see that the rows of 2021-01-04 and 2021-01-06 have been filtered out.
| | A | B | C | D |
|:-|:-|:-|:-|:-|
| 2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 |
| 2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 |
| 2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 |
| 2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 |

3. Handling missing values
Prepare the data.
```python
df2 = df.copy()
df2.loc[df2.index[:3], 'E'] = 1.0
f_series = pd.Series({'2021-01-02': 1.0, '2021-01-03': 2.0, '2021-01-04': 3.0,
                      '2021-01-05': 4.0, '2021-01-06': 5.0})
f_series.index = pd.to_datetime(f_series.index)
df2['F'] = f_series
df2
```
The display form is as follows:
| | A | B | C | D | E | F |
|:-|:-|:-|:-|:-|:-|:-|
| 2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 | 1 | NaN |
| 2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 | 1 | 1 |
| 2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 | 1 | 2 |
| 2021-01-04 | -0.33032 | -1.40384 | -0.93809 | 1.48804 | NaN | 3 |
| 2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 | NaN | 4 |
| 2021-01-06 | -0.816064 | 1.30197 | 0.656281 | -1.2718 | NaN | 5 |

3.1 dropna()
Use the dropna method to clear NaN values. Note: dropna returns a new DataFrame and does not change the original one.
```python
df2.dropna(how='any')
```
The code above deletes a row if any of its values is empty.
| | A | B | C | D | E | F |
|:-|:-|:-|:-|:-|:-|:-|
| 2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 | 1 | 1 |
| 2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 | 1 | 2 |

3.2 fillna()
Use the fillna method to fill in NaN values.
```python
df2.fillna(df2.mean())
```
The code above fills the gaps with each column's mean. Like dropna, fillna does not update the original DataFrame in place; to do that, use df2 = df2.fillna(df2.mean()).
The display form is as follows:
| | A | B | C | D | E | F |
|:-|:-|:-|:-|:-|:-|:-|
| 2021-01-01 | 0.270961 | -0.405463 | 0.348373 | 0.828572 | 1 | 3 |
| 2021-01-02 | 0.696541 | 0.136352 | -1.64592 | -0.69841 | 1 | 1 |
| 2021-01-03 | 0.325415 | -0.602236 | -0.134508 | 1.28121 | 1 | 2 |
| 2021-01-04 | -0.33032 | -1.40384 | -0.93809 | 1.48804 | 1 | 3 |
| 2021-01-05 | 0.348708 | 1.27175 | 0.626011 | -0.253845 | 1 | 4 |
| 2021-01-06 | -0.816064 | 1.30197 | 0.656281 | -1.2718 | 1 | 5 |

4. Operations

4.1 agg()
Agg is the abbreviation of Aggregate, which means aggregation.
Common aggregation methods are as follows:
mean(): Compute mean of groups

sum(): Compute sum of group values

size(): Compute group sizes

count(): Compute count of group

std(): Standard deviation of groups

var(): Compute variance of groups

sem(): Standard error of the mean of groups

describe(): Generates descriptive statistics

first(): Compute first of group values

last(): Compute last of group values

nth(): Take nth value, or a subset if n is a list

min(): Compute min of group values

max(): Compute max of group values
```python
df.mean()
```
Returns the average of each column
```
A    0.082540
B    0.049755
C   -0.181309
D    0.228960
dtype: float64
```
You can view the row average by adding the parameter axis.
```python
df.mean(axis=1)
```
Output:
```
2021-01-01    0.260611
2021-01-02   -0.377860
2021-01-03    0.217470
2021-01-04   -0.296053
2021-01-05    0.498156
2021-01-06   -0.032404
Freq: D, dtype: float64
```
What if we want to view multiple aggregate statistics for a column?
At this point we can call the agg method:
```python
df.agg(['std', 'mean'])['A']
```
The returned result shows the standard deviation std and the mean mean:
```
std     0.551412
mean    0.082540
Name: A, dtype: float64
```
Apply different aggregate functions to different columns:
```python
df.agg({'A': ['max', 'mean'], 'B': ['mean', 'std', 'var']})
```
The returned result is as follows:
| | A | B |
|:-|:-|:-|
| max | 0.696541 | NaN |
| mean | 0.0825402 | 0.0497552 |
| std | NaN | 1.07834 |
| var | NaN | 1.16281 |

4.2 apply()
apply() applies a function along an axis of the DataFrame. For example, df.apply(np.sum) calls np.sum on each column and returns each column's sum.
```python
df.apply(np.sum)
```
The output is as follows:
```
A    0.495241
B    0.298531
C   -1.087857
D    1.373762
dtype: float64
```
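Like mean(), apply also accepts an axis argument; a short sketch on the same frame shows column-wise (the default) versus row-wise application:

```python
import numpy as np
import pandas as pd

np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(6, 4),
                  index=pd.date_range('20210101', periods=6),
                  columns=list('ABCD'))

col_sums = df.apply(np.sum)          # default axis=0: one value per column
row_sums = df.apply(np.sum, axis=1)  # axis=1: one value per row
```

Both contain the same grand total, just grouped differently.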
The apply method supports lambda expressions.
```python
df.apply(lambda n: n * 2)
```
| | A | B | C | D |
|:-|:-|:-|:-|:-|
| 2021-01-01 | 0.541923 | -0.810925 | 0.696746 | 1.65714 |
| 2021-01-02 | 1.39308 | 0.272704 | -3.29185 | -1.39682 |
| 2021-01-03 | 0.65083 | -1.20447 | -0.269016 | 2.56242 |
| 2021-01-04 | -0.66064 | -2.80768 | -1.87618 | 2.97607 |
| 2021-01-05 | 0.697417 | 2.5435 | 1.25202 | -0.50769 |
| 2021-01-06 | -1.63213 | 2.60393 | 1.31256 | -2.5436 |

4.3 value_counts()
The value_counts method is used to view the repeated statistics of each row and column. We regenerate some integer data to ensure that there is a certain amount of data duplication.
```python
np.random.seed(101)
df3 = pd.DataFrame(np.random.randint(0, 9, (6, 4)), columns=list('ABCD'))
df3
```
| | A | B | C | D |
|:-|:-|:-|:-|:-|
| 0 | 1 | 6 | 7 | 8 |
| 1 | 4 | 8 | 5 | 0 |
| 2 | 5 | 8 | 1 | 3 |
| 3 | 8 | 3 | 3 | 2 |
| 4 | 8 | 3 | 7 | 0 |
| 5 | 7 | 8 | 4 | 3 |
Call the value_counts () method.
```python
df3['A'].value_counts()
```
Looking at the output, we can see that there are two numbers 8 in column A, and the number of other numbers is 1.
```
8    2
7    1
5    1
4    1
1    1
Name: A, dtype: int64
```

4.4 str
Pandas has built-in string handling methods.
```python
names = pd.Series(['andrew', 'bobo', 'claire', 'david', '4'])
names.str.upper()
```
With the above code, we set all the strings in Series to uppercase.
```
0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         4
dtype: object
```
Initials capitalized:
```python
names.str.capitalize()
```
The output is:
```
0    Andrew
1      Bobo
2    Claire
3     David
4         4
dtype: object
```
To determine whether it is a number:
```python
names.str.isdigit()
```
The output is:
```
0    False
1    False
2    False
3    False
4     True
dtype: bool
```
String segmentation:
```python
tech_finance = ['GOOG,APPL,AMZN', 'JPM,BAC,GS']
tickers = pd.Series(tech_finance)
tickers.str.split(',').str[0:2]
```
Split the string with a comma, and the result is:
```
0    [GOOG, APPL]
1      [JPM, BAC]
dtype: object
```

5. Merging

5.1 concat()
Concat is used to concatenate data sets. Let's prepare the data first.
```python
data_one = {'Col1': ['A0', 'A1', 'A2', 'A3'], 'Col2': ['B0', 'B1', 'B2', 'B3']}
data_two = {'Col1': ['C0', 'C1', 'C2', 'C3'], 'Col2': ['D0', 'D1', 'D2', 'D3']}
one = pd.DataFrame(data_one)
two = pd.DataFrame(data_two)
```
Use the concat method to concatenate two datasets.
```python
pd.concat([one, two])
```
We get the table:

| | Col1 | Col2 |
|:-|:-|:-|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 0 | C0 | D0 |
| 1 | C1 | D1 |
| 2 | C2 | D2 |
| 3 | C3 | D3 |
5.2 merge ()
Merge is equivalent to the join method in a SQL operation, which is used to connect two datasets through some relationship.
```python
registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})
```
We join two tables according to name, and the connection mode is outer.
```python
pd.merge(left=registrations, right=logins, how='outer', on='name')
```
The returned result is:
| | reg_id | name | log_id |
|:-|:-|:-|:-|
| 0 | 1 | Andrew | 2 |
| 1 | 2 | Bobo | 4 |
| 2 | 3 | Claire | NaN |
| 3 | 4 | David | NaN |
| 4 | NaN | Xavier | 1 |
| 5 | NaN | Yolanda | 3 |
Note that how has four options: {'left', 'right', 'outer', 'inner'}. These determine which rows are kept when one side has no match (the missing side's columns become NaN). For example, how='left' keeps all rows of the left table; where the right table has no matching row, the right-side columns show NaN. Simply put, think of the left and right tables as two sets:

left keeps all rows of the left table, plus the intersection of the two tables.

right keeps all rows of the right table, plus the intersection of the two tables.

outer takes the union of the two tables.

inner takes the intersection of the two tables.
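The set picture above can be verified by counting the rows each join mode returns (the two tables are rebuilt here so the sketch is self-contained):

```python
import pandas as pd

registrations = pd.DataFrame({'reg_id': [1, 2, 3, 4],
                              'name': ['Andrew', 'Bobo', 'Claire', 'David']})
logins = pd.DataFrame({'log_id': [1, 2, 3, 4],
                       'name': ['Xavier', 'Andrew', 'Yolanda', 'Bobo']})

# inner = intersection (2 shared names), outer = union (4 + 4 - 2 = 6 names).
counts = {how: len(pd.merge(registrations, logins, how=how, on='name'))
          for how in ['inner', 'left', 'right', 'outer']}
```

Only Andrew and Bobo appear in both tables, so inner yields 2 rows and outer yields 6.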
6. Grouping GroupBy
Grouping in Pandas is very similar to the SQL statement SELECT Column1, Column2, mean(Column3), sum(Column4) FROM SomeTable GROUP BY Column1, Column2. It doesn't matter if you aren't familiar with SQL: grouping is the process of splitting table data by one or more columns, computing statistics per group, and combining the results.
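As a minimal sketch of that SQL analogy (the sales table here is made up for illustration):

```python
import pandas as pd

# Mirrors: SELECT city, SUM(amount) FROM sales GROUP BY city
sales = pd.DataFrame({'city': ['NY', 'SF', 'NY', 'SF'],
                      'amount': [100, 200, 300, 400]})

totals = sales.groupby('city')['amount'].sum()
```

Each city's rows are split out, summed, and combined into one result Series.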
Prepare the data.
```python
np.random.seed(20201212)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
```
As you can see, columns A and B contain many duplicate values. We can group by foo/bar or by one/two/three.
| | A | B | C | D |
|:-|:-|:-|:-|:-|
| 0 | foo | one | 0.270961 | 0.325415 |
| 1 | bar | one | -0.405463 | -0.602236 |
| 2 | foo | two | 0.348373 | -0.134508 |
| 3 | bar | three | 0.828572 | 1.28121 |
| 4 | foo | two | 0.696541 | -0.33032 |
| 5 | bar | two | 0.136352 | -1.40384 |
| 6 | foo | one | -1.64592 | -0.93809 |
| 7 | foo | three | -0.69841 | 1.48804 |

6.1 Single-column grouping
We use the groupby method to group the data in the table above.
```python
df.groupby('A')
```
Executing this code, you can see that groupby returns a DataFrameGroupBy object. We can't view it directly; we need to apply an aggregate function to it (see section 4.1 of this article).
Let's try the aggregate function sum.
```python
df.groupby('A').sum()
```
The display form is as follows:
| A | C | D |
|:-|:-|:-|
| bar | 0.559461 | -0.724868 |
| foo | -1.02846 | 0.410533 |

6.2 Multi-column grouping
The groupby method supports passing in multiple columns as parameters.
```python
df.groupby(['A', 'B']).sum()
```
After grouping, the results are as follows:
| A | B | C | D |
|:-|:-|:-|:-|
| bar | one | -0.405463 | -0.602236 |
| | three | 0.828572 | 1.28121 |
| | two | 0.136352 | -1.40384 |
| foo | one | -1.37496 | -0.612675 |
| | three | -0.69841 | 1.48804 |
| | two | 1.04491 | -0.464828 |

6.3 Applying aggregation methods
We can pass an array of aggregation methods to agg(). The following code groups by A and aggregates only the values of column C.
```python
df.groupby('A')['C'].agg([np.sum, np.mean, np.std])
```
You can see that the results of each aggregate function of bar group and foo group are as follows:
| A | sum | mean | std |
|:-|:-|:-|:-|
| bar | 0.559461 | 0.186487 | 0.618543 |
| foo | -1.02846 | -0.205692 | 0.957242 |

6.4 Different aggregations for different columns
The following code makes different aggregate statistics for C and D columns, sums C columns, and makes standard deviation statistics for D columns.
```python
df.groupby('A').agg({'C': 'sum', 'D': lambda x: np.std(x, ddof=1)})
```
The output is as follows:
| A | C | D |
|:-|:-|:-|
| bar | 0.559461 | 1.37837 |
| foo | -1.02846 | 0.907422 |

6.5 More
For more information about the groupby method, please refer to the official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
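As a closing sketch for grouping: newer pandas versions (0.25+) also support named aggregation, where you name the output columns explicitly (the names c_sum and d_std below are my own choices, not from the original article):

```python
import numpy as np
import pandas as pd

np.random.seed(20201212)
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

# Each keyword becomes an output column: (source column, aggregation).
summary = df.groupby('A').agg(c_sum=('C', 'sum'), d_std=('D', 'std'))
```

This avoids the nested-dict syntax while keeping per-column control over the aggregation.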
III. Advanced Pandas usage

1. Reshape
Reshape means reshaping the table. For complex tables, we need to transform them into a form that suits our analysis, for example computing separate statistics after grouping by certain attributes.
1.1 stack () and unstack ()
The stack method divides the table into two parts: index and data. The columns of the index are retained and the data is stacked.
Prepare the data.
```python
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
```
Based on the code above, we created a composite index.
```
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
```
We create a DataFrame with a composite index.
```python
np.random.seed(20201212)
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
```
The output is as follows:
| first | second | A | B |
|:-|:-|:-|:-|
| bar | one | 0.270961 | -0.405463 |
| | two | 0.348373 | 0.828572 |
| baz | one | 0.696541 | 0.136352 |
| | two | -1.64592 | -0.69841 |
| foo | one | 0.325415 | -0.602236 |
| | two | -0.134508 | 1.28121 |
| qux | one | -0.33032 | -1.40384 |
| | two | -0.93809 | 1.48804 |
We execute the stack method.
```python
stacked = df.stack()
stacked
```
The stacked (compressed) table is shown below. Note: the output you see in Jupyter Notebook/Lab may look slightly different; the output below has been adjusted to display well in Markdown.
```
first  second
bar    one     A    0.942502
bar    one     B    0.060742
bar    two     A    1.340975
bar    two     B   -1.712152
baz    one     A    1.899275
baz    one     B    1.237799
baz    two     A   -1.589069
baz    two     B    1.288342
foo    one     A   -0.326792
foo    one     B    1.576351
foo    two     A    1.526528
foo    two     B    1.410695
qux    one     A    0.420718
qux    one     B   -0.288002
qux    two     A    0.361586
qux    two     B    0.177352
dtype: float64
```
We execute unstack to expand the data.
```python
stacked.unstack()
```
This restores the original form.
| first | second | A | B |
|:-|:-|:-|:-|
| bar | one | 0.270961 | -0.405463 |
| | two | 0.348373 | 0.828572 |
| baz | one | 0.696541 | 0.136352 |
| | two | -1.64592 | -0.69841 |
| foo | one | 0.325415 | -0.602236 |
| | two | -0.134508 | 1.28121 |
| qux | one | -0.33032 | -1.40384 |
| | two | -0.93809 | 1.48804 |
We add the parameter level.
```python
stacked.unstack(level=0)  # try stacked.unstack(level=1) too
```
With level=0 you get the output below; try level=1 yourself to see what changes.
| second | | bar | baz | foo | qux |
|:-|:-|:-|:-|:-|:-|
| one | A | 0.942502 | 1.89927 | -0.326792 | 0.420718 |
| one | B | 0.060742 | 1.2378 | 1.57635 | -0.288002 |
| two | A | 1.34097 | -1.58907 | 1.52653 | 0.361586 |
| two | B | -1.71215 | 1.28834 | 1.4107 | 0.177352 |

1.2 pivot_table()
pivot_table creates a pivot table: a table format that dynamically rearranges and groups data.
We generate DataFrame without indexed columns.
```python
np.random.seed(99)
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df
```
The display form is as follows:
| | A | B | C | D | E |
|:-|:-|:-|:-|:-|:-|
| 0 | one | A | foo | -0.142359 | 0.0235001 |
| 1 | one | B | foo | 2.05722 | 0.456201 |
| 2 | two | C | foo | 0.283262 | 0.270493 |
| 3 | three | A | bar | 1.32981 | -1.43501 |
| 4 | one | B | bar | -0.154622 | 0.882817 |
| 5 | one | C | bar | -0.0690309 | -0.580082 |
| 6 | two | A | foo | 0.75518 | -0.501565 |
| 7 | three | B | foo | 0.825647 | 0.590953 |
| 8 | one | C | foo | -0.113069 | -0.731616 |
| 9 | one | A | bar | -2.36784 | 0.261755 |
| 10 | two | B | bar | -0.167049 | -0.855796 |
| 11 | three | C | bar | 0.685398 | -0.187526 |
By observing the data, we can obviously draw a conclusion that columns A, B and C have certain attribute meanings. We execute the pivot_table method.
```python
pd.pivot_table(df, values=['D', 'E'], index=['A', 'B'], columns=['C'])
```
The above code means that columns D and E are used as data columns, An and B are used as composite row indexes, and C's data values are used as column indexes.
| A | B | (D, bar) | (D, foo) | (E, bar) | (E, foo) |
|:-|:-|:-|:-|:-|:-|
| one | A | -2.36784 | -0.142359 | 0.261755 | 0.0235001 |
| | B | -0.154622 | 2.05722 | 0.882817 | 0.456201 |
| | C | -0.0690309 | -0.113069 | -0.580082 | -0.731616 |
| three | A | 1.32981 | NaN | -1.43501 | NaN |
| | B | NaN | 0.825647 | NaN | 0.590953 |
| | C | 0.685398 | NaN | -0.187526 | NaN |
| two | A | NaN | 0.75518 | NaN | -0.501565 |
| | B | -0.167049 | NaN | -0.855796 | NaN |
| | C | NaN | 0.283262 | NaN | 0.270493 |

2. Time series
Date_range is the generation date interval method that comes with Pandas. We execute the following code:
```python
rng = pd.date_range('1/1/2021', periods=100, freq='S')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
```
The date_range method starts at 0 seconds on January 1, 2021, and performs 100 time periods at an interval of 1 second. The output is as follows:
```
2021-01-01 00:00:00    475
2021-01-01 00:00:01    145
2021-01-01 00:00:02     13
2021-01-01 00:00:03    240
2021-01-01 00:00:04    183
                      ...
2021-01-01 00:01:35    413
2021-01-01 00:01:36    330
2021-01-01 00:01:37    272
2021-01-01 00:01:38    304
2021-01-01 00:01:39    151
Freq: S, Length: 100, dtype: int32
```
Let's try changing the parameter value of freq from S (second) to M (Month).
```python
rng = pd.date_range('1/1/2021', periods=100, freq='M')
pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
```
Output:
```
2021-01-31    311
2021-02-28    256
2021-03-31    327
2021-04-30    151
2021-05-31    484
             ...
2028-12-31    170
2029-01-31    492
2029-02-28    205
2029-03-31     90
2029-04-30    446
Freq: M, Length: 100, dtype: int32
```
We set up date generation on a quarterly basis.
```python
prng = pd.period_range('2018Q1', '2020Q4', freq='Q-NOV')
pd.Series(np.random.randn(len(prng)), prng)
```
Output all quarters from the first quarter of 2018 to the fourth quarter of 2020.
```
2018Q1    0.833025
2018Q2   -0.509514
2018Q3   -0.735542
2018Q4   -0.224403
2019Q1   -0.119709
2019Q2   -1.379413
2019Q3    0.871741
2019Q4    0.877493
2020Q1    0.577611
2020Q2   -0.365737
2020Q3   -0.473404
2020Q4    0.529800
Freq: Q-NOV, dtype: float64
```

3. Categorical data
Pandas has a special data type called "category" (dtype="category"). We can categorize data by converting certain columns to this type.
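Under the hood, a categorical column stores each distinct value once plus small integer codes, which is why it saves memory on repetitive string columns. A minimal sketch (the values mirror the raw_grade data used below):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'b', 'a', 'a', 'e'])
c = s.astype('category')

# The distinct values live in c.cat.categories;
# each element is represented by an integer code into that list.
```

`c.cat.categories` is `['a', 'b', 'e']` and `c.cat.codes` holds the positions `[0, 1, 1, 0, 0, 2]`.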
Prepare the data.
```python
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df
```
| | id | raw_grade |
|:-|:-|:-|
| 0 | 1 | a |
| 1 | 2 | b |
| 2 | 3 | b |
| 3 | 4 | a |
| 4 | 5 | a |
| 5 | 6 | e |
We add a new column, grade, and set its data type to category.
```python
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
```
We can see that the grade column has only three category values: a, b, e.
```
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
```
We replace a, b, e with very good, good, very bad in order.
```python
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
```
The table at this time is:
| | id | raw_grade | grade |
|:-|:-|:-|:-|
| 0 | 1 | a | very good |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
| 5 | 6 | e | very bad |
We sort the tables:
```python
df.sort_values(by="grade", ascending=False)
```
| | id | raw_grade | grade |
|:-|:-|:-|:-|
| 5 | 6 | e | very bad |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 0 | 1 | a | very good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
Check different types of quantities:
```python
df.groupby("grade").size()
```
The output of the above code is:
```
grade
very good    3
good         2
very bad     1
dtype: int64
```

4. IO
Pandas supports reading and writing data directly from files, such as CSV, JSON, EXCEL, and other file formats. The file formats supported by Pandas are as follows.
| Format Type | Data Description | Reader | Writer |
|:-|:-|:-|:-|
| text | CSV | read_csv | to_csv |
| text | Fixed-Width Text File | read_fwf | |
| text | JSON | read_json | to_json |
| text | HTML | read_html | to_html |
| text | Local clipboard | read_clipboard | to_clipboard |
| binary | MS Excel | read_excel | to_excel |
| binary | OpenDocument | read_excel | |
| binary | HDF5 Format | read_hdf | to_hdf |
| binary | Feather Format | read_feather | to_feather |
| binary | Parquet Format | read_parquet | to_parquet |
| binary | ORC Format | read_orc | |
| binary | Msgpack | read_msgpack | to_msgpack |
| binary | Stata | read_stata | to_stata |
| binary | SAS | read_sas | |
| binary | SPSS | read_spss | |
| binary | Python Pickle Format | read_pickle | to_pickle |
| SQL | SQL | read_sql | to_sql |
| SQL | Google BigQuery | read_gbq | to_gbq |
Let's just take the CSV file as an example. For other formats, please refer to the table above.
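Before reading from a remote URL, a local round-trip shows how the reader/writer pair works (the temp-file path and column names here are arbitrary):

```python
import os
import tempfile

import pandas as pd

# to_csv writes the frame to disk; read_csv reads it back.
df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
df.to_csv(path, index=False)   # index=False: don't write row labels as a column
df_back = pd.read_csv(path)
```

The frame read back is identical to the one written out.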
We import data from the CSV file. You don't have to pay special attention to the domain name and address of the URL below.
```python
df = pd.read_csv("http://blog.caiyongji.com/assets/housing.csv")
```
View the first five rows of data:
```python
df.head(5)
```
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 0 | -122.23 | 37.88 | 41 | 880 | 129 | 322 | 126 | 8.3252 | 452600 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21 | 7099 | 1106 | 2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52 | 1467 | 190 | 496 | 177 | 7.2574 | 352100 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52 | 1274 | 235 | 558 | 219 | 5.6431 | 341300 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52 | 1627 | 280 | 565 | 259 | 3.8462 | 342200 | NEAR BAY |

5. Plotting
Pandas' plotting is built on matplotlib, a powerful Python visualization library. This section gives only a brief introduction to the plotting methods Pandas supports; we will cover them in detail in the next article. To not miss the update, you are welcome to follow me.
```python
np.random.seed(999)
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
```
We directly call the plot method to display. Here are two things to pay attention to:
The plot method here is called through Pandas, not through matplotlib directly.
Python statements don't need a trailing semicolon. The semicolon here suppresses the textual output of the call so that only the rendered chart is displayed.
```python
df.plot();
```
```python
df.plot.bar();
```
```python
df.plot.bar(stacked=True);
```
That is all of "how to use Pandas in python". Thank you for reading! I hope this content has been helpful; if you want to learn more, you are welcome to follow the industry information channel!