This article introduces the main knowledge points of Python Pandas. Many people run into these situations in real work, so let's walk through how to handle them. I hope you read carefully and get something out of it!
Why learn Pandas?
So here is the question: NumPy already lets us process numeric data and, combined with matplotlib, covers many data analysis tasks. What, then, is the point of learning pandas?
NumPy handles numerical data well, but that is not enough: in many cases our data also contains strings, time series, and so on. For example, data we scrape with a crawler and store in a database.
That is where pandas comes in.
What is Pandas?
The name Pandas comes from "panel data".
Pandas is a powerful toolset for analyzing structured data. It is built on NumPy, provides high-level data structures and data manipulation tools, and is one of the key reasons Python is a powerful and efficient data analysis environment.
A powerful toolset for analyzing and manipulating large structured datasets
Built on NumPy, which provides high-performance matrix operations
Provides a large number of functions and methods for fast, convenient data handling
Applied to data mining and data analysis
Provides data cleaning functionality
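As a quick taste before the details, the two core data structures are Series (one-dimensional, labeled) and DataFrame (two-dimensional, with labeled rows and columns). A minimal sketch with illustrative values:
import pandas as pd
# a Series: values plus an index
s = pd.Series([1.5, 2.0, 3.2], index=['a', 'b', 'c'])
# a DataFrame: a table whose columns can mix dtypes
df = pd.DataFrame({'city': ['Lhasa', 'Beijing'], 'population': [0.9, 21.5]})
print(s)
print(df.dtypes)  # object for strings, float64 for numbers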
1. Index operation of Pandas
Index object Index
1. Indexes in Series and DataFrame are Index objects
Sample code:
print(type(ser_obj.index))
print(type(df_obj2.index))
print(df_obj2.index)
Running result:
Int64Index([0, 1, 2, 3], dtype='int64')

2. The index object is immutable, which ensures the safety of the data.
Sample code:
# index objects are immutable
df_obj2.index[0] = 2
Running result:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      1 # index objects are immutable
----> 2 df_obj2.index[0] = 2

/Users/Power/anaconda/lib/python3.6/site-packages/pandas/indexes/base.py in __setitem__(self, key, value)
-> 1404         raise TypeError("Index does not support mutable operations")

TypeError: Index does not support mutable operations

3. Common types of Index
Index: the generic index
Int64Index: integer index
MultiIndex: hierarchical index
DatetimeIndex: timestamp index
3.1 Series index
1. index specifies the row index names
Sample code:
ser_obj = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
print(ser_obj.head())
Running result:
a    0
b    1
c    2
d    3
e    4
dtype: int64

2. Row index
ser_obj['label'], ser_obj[pos]
Sample code:
# row index
print(ser_obj['b'])
print(ser_obj[2])
Running result:
1
2

3. Slice index
ser_obj[2:4], ser_obj['label1':'label3']
Note: when slicing by index name, the end label is included.
Sample code:
# slice index
print(ser_obj[1:3])
print(ser_obj['b':'d'])
Running result:
b    1
c    2
dtype: int64
b    1
c    2
d    3
dtype: int64

4. Discontinuous index
ser_obj[['label1', 'label2', 'label3']]
Sample code:
# discontinuous index
print(ser_obj[[0, 2, 4]])
print(ser_obj[['a', 'e']])
Running result:
a    0
c    2
e    4
dtype: int64
a    0
e    4
dtype: int64

5. Boolean index
Sample code:
# boolean index
ser_bool = ser_obj > 2
print(ser_bool)
print(ser_obj[ser_bool])
print(ser_obj[ser_obj > 2])
Running result:
a    False
b    False
c    False
d     True
e     True
dtype: bool
d    3
e    4
dtype: int64
d    3
e    4
dtype: int64

3.2 DataFrame index
1. columns specifies the column index names
Sample code:
import numpy as np
df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
print(df_obj.head())
Running result:
          a         b         c         d
0 -0.241678  0.621589  0.843546 -0.383105
1 -0.526918 -0.485325  1.124420 -0.653144
2 -1.074163  0.939324 -0.309822 -0.209149
3 -0.716816  1.844654 -2.123637 -1.323484
4  0.368212 -0.910324  0.064703  0.486016
2. Column index
df_obj['label']
Sample code:
# column index
print(df_obj['a'])  # returns a Series
Running result:
0   -0.241678
1   -0.526918
2   -1.074163
3   -0.716816
4    0.368212
Name: a, dtype: float64

3. Discontinuous index
df_obj[['label1', 'label2']]
Sample code:
# discontinuous index
print(df_obj[['a', 'c']])
Running result:
          a         c
0 -0.241678  0.843546
1 -0.526918  1.124420
2 -1.074163 -0.309822
3 -0.716816 -2.123637
4  0.368212  0.064703

4. Advanced indexing: label, position, and mixed
Pandas has three advanced indexers.
1. loc label index
A DataFrame's rows cannot be sliced with plain brackets directly; slice them through loc instead.
loc indexes by label, i.e. by our custom index names.
Sample code:
# label index loc
# Series
print(ser_obj['b':'d'])
print(ser_obj.loc['b':'d'])
# DataFrame
print(df_obj['a'])
# the first argument indexes rows, the second indexes columns
print(df_obj.loc[0:2, 'a'])
Running result:
b    1
c    2
d    3
dtype: int64
b    1
c    2
d    3
dtype: int64
0   -0.241678
1   -0.526918
2   -1.074163
3   -0.716816
4    0.368212
Name: a, dtype: float64
0   -0.241678
1   -0.526918
2   -1.074163
Name: a, dtype: float64

2. iloc position index
It works like loc, except that it indexes by integer position.
Sample code:
# integer position index iloc
# Series
print(ser_obj[1:3])
print(ser_obj.iloc[1:3])
# DataFrame
print(df_obj.iloc[0:2, 0])  # compare with df_obj.loc[0:2, 'a']
Running result:
b    1
c    2
dtype: int64
b    1
c    2
dtype: int64
0   -0.241678
1   -0.526918
Name: a, dtype: float64

3. ix label-and-position hybrid index
ix combines the two: you can index either by position number or by custom label, as the situation requires.
If the index mixes numbers and letters, this method is not recommended, since it easily causes positioning confusion. (Note: ix is deprecated in modern pandas; prefer loc and iloc.)
Sample code:
# mixed index ix
# Series
print(ser_obj.ix[1:3])
print(ser_obj.ix['b':'c'])
# DataFrame
print(df_obj.loc[0:2, 'a'])
print(df_obj.ix[0:2, 0])
Running result:
b    1
c    2
dtype: int64
b    1
c    2
dtype: int64
0   -0.241678
1   -0.526918
2   -1.074163
Name: a, dtype: float64
Note:
DataFrame indexing can be understood as indexing into an ndarray.
Label slices include the end position.
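To make that last note concrete, a small sketch contrasting the two, reusing the ser_obj defined above:
# label slice: the end label 'd' IS included
print(ser_obj.loc['b':'d'])   # b, c, d
# position slice: the end position 3 is NOT included
print(ser_obj.iloc[1:3])      # positions 1 and 2 only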
2. Alignment of Pandas
Alignment is an important part of data cleaning in pandas: operations are computed over aligned indexes, positions that do not align are filled with NaN, and the NaN values can then be filled afterwards.
2.1 Alignment of Series
1. Series aligns by row index
Sample code:
s1 = pd.Series(range(10, 20), index=range(10))
s2 = pd.Series(range(20, 25), index=range(5))
print('s1:')
print(s1)
print('')
print('s2:')
print(s2)
Running result:
s1:
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64
s2:
0    20
1    21
2    22
3    23
4    24
dtype: int64

2. Alignment operation of Series
Sample code:
# Series alignment
s1 + s2
Running result:
0    30.0
1    32.0
2    34.0
3    36.0
4    38.0
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
dtype: float64

2.2 Alignment of DataFrame
1. DataFrame aligns by both row index and column index
Sample code:
df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
print('df1:')
print(df1)
print('')
print('df2:')
print(df2)
Running result:
df1:
     a    b
0  1.0  1.0
1  1.0  1.0
df2:
     a    b    c
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0

2. Alignment operation of DataFrame
Sample code:
# DataFrame alignment
df1 + df2
Running result:
     a    b   c
0  2.0  2.0 NaN
1  2.0  2.0 NaN
2  NaN  NaN NaN

2.3 Filling unaligned data for an operation: fill_value
When using add, sub, div, or mul, pass fill_value to specify a fill value; unaligned positions then take part in the operation using that value.
Sample code:
print(s1)
print(s2)
s1.add(s2, fill_value=-1)
print(df1)
print(df2)
df1.sub(df2, fill_value=2.)
Running result:
# print(s1)
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64
# print(s2)
0    20
1    21
2    22
3    23
4    24
dtype: int64
# s1.add(s2, fill_value=-1)
0    30.0
1    32.0
2    34.0
3    36.0
4    38.0
5    14.0
6    15.0
7    16.0
8    17.0
9    18.0
dtype: float64
# print(df1)
     a    b
0  1.0  1.0
1  1.0  1.0
# print(df2)
     a    b    c
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0
# df1.sub(df2, fill_value=2.)
     a    b    c
0  0.0  0.0  1.0
1  0.0  0.0  1.0
2  1.0  1.0  1.0
Table of arithmetic methods:
Method                Description
add, radd             addition (+)
sub, rsub             subtraction (-)
div, rdiv             division (/)
floordiv, rfloordiv   floor division (//)
mul, rmul             multiplication (*)
pow, rpow             power (**)
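The r-prefixed variants swap the operands, which matters for non-commutative operations; a small sketch with illustrative values:
s = pd.Series([1., 2., 4.])
print(s.div(2))    # s / 2  ->  0.5, 1.0, 2.0
print(s.rdiv(2))   # 2 / s  ->  2.0, 1.0, 0.5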
3. Function application in Pandas
3.1 apply and applymap
1. NumPy ufuncs can be used directly
Sample code:
# NumPy ufunc
df = pd.DataFrame(np.random.randn(5, 4) - 1)
print(df)
print(np.abs(df))
Running result:
          0         1         2         3
0 -0.062413  0.844813 -1.853721 -1.980717
1 -0.539628 -1.975173 -0.856597 -2.612406
2 -1.277081 -1.088457 -0.152189  0.530325
3 -1.356578 -1.996441  0.368822 -2.211478
4 -0.562777  0.518648 -2.007223  0.059411
          0         1         2         3
0  0.062413  0.844813  1.853721  1.980717
1  0.539628  1.975173  0.856597  2.612406
2  1.277081  1.088457  0.152189  0.530325
3  1.356578  1.996441  0.368822  2.211478
4  0.562777  0.518648  2.007223  0.059411

2. Apply a function to each column or row with apply
Sample code:
# apply a function to each column or row
print(df.apply(lambda x: x.max()))
Running result:
0   -0.062413
1    0.844813
2    0.368822
3    0.530325
dtype: float64
Note the axis direction: the default axis=0 applies the function to each column.
Sample code:
# specify the axis direction: axis=1 applies the function to each row
print(df.apply(lambda x: x.max(), axis=1))
Running result:
0    0.844813
1   -0.539628
2    0.530325
3    0.368822
4    0.518648
dtype: float64

3. Apply a function to each element with applymap
Sample code:
# apply a function to each element with applymap
f2 = lambda x: '%.2f' % x
print(df.applymap(f2))
Running result:
       0      1      2      3
0  -0.06   0.84  -1.85  -1.98
1  -0.54  -1.98  -0.86  -2.61
2  -1.28  -1.09  -0.15   0.53
3  -1.36  -2.00   0.37  -2.21
4  -0.56   0.52  -2.01   0.06

3.2 Sorting
1. Sort by index
sort_index()
Sorting is ascending by default; pass ascending=False for descending order.
Sample code:
# Series
s4 = pd.Series(range(10, 15), index=np.random.randint(5, size=5))
print(s4)
# sort by index
print(s4.sort_index())  # the random index here turned out to be [0, 3, 1, 3, 0]
Running result:
0    10
3    11
1    12
3    13
0    14
dtype: int64
0    10
0    14
1    12
3    11
3    13
dtype: int64
Pay attention to the axis direction when operating on DataFrame
Sample code:
# DataFrame
df4 = pd.DataFrame(np.random.randn(3, 5),
                   index=np.random.randint(3, size=3),
                   columns=np.random.randint(5, size=5))
print(df4)
df4_isort = df4.sort_index(axis=1, ascending=False)
print(df4_isort)  # the random columns here turned out to be [1, 4, 0, 1, 2]
Running result:
          1         4         0         1         2
2 -0.416686 -0.161256  0.088802 -0.004294  1.164138
1 -0.671914  0.531256  0.303222 -0.509493 -0.342573
1  1.988321 -0.466987  2.787891 -1.105912  0.889082
          4         2         1         1         0
2 -0.161256  1.164138 -0.416686 -0.004294  0.088802
1  0.531256 -0.342573 -0.671914 -0.509493  0.303222
1 -0.466987  0.889082  1.988321 -1.105912  2.787891

2. Sort by value
sort_values(by='column_name')
by must name a unique column; an error is raised if several columns share that name.
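A sketch of that failure mode, with an illustrative frame:
dup = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'x'])
# dup.sort_values(by='x')  # raises ValueError because the column label 'x' is not unique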
Sample code:
# sort by value
df4_vsort = df4.sort_values(by=0, ascending=False)
print(df4_vsort)
Running result:
          1         4         0         1         2
1  1.988321 -0.466987  2.787891 -1.105912  0.889082
1 -0.671914  0.531256  0.303222 -0.509493 -0.342573
2 -0.416686 -0.161256  0.088802 -0.004294  1.164138

3.3 Handling missing data
Sample code:
df_data = pd.DataFrame([np.random.randn(3),
                        [1., 2., np.nan],
                        [np.nan, 4., np.nan],
                        [1., 2., 3.]])
print(df_data.head())
Running result:
          0         1         2
0 -0.281885 -0.786572  0.487126
1  1.000000  2.000000       NaN
2       NaN  4.000000       NaN
3  1.000000  2.000000  3.000000

1. Check for missing values: isnull()
Sample code:
# isnull
print(df_data.isnull())
Running result:
       0      1      2
0  False  False  False
1  False  False   True
2   True  False   True
3  False  False  False

2. Drop missing data: dropna()
Rows or columns containing NaN are dropped, depending on the axis direction. Sample code:
# dropna
print(df_data.dropna())        # drops rows by default
print(df_data.dropna(axis=1))  # axis=1 drops columns
Running result:
          0         1         2
0 -0.281885 -0.786572  0.487126
3  1.000000  2.000000  3.000000
          1
0 -0.786572
1  2.000000
2  4.000000
3  2.000000

3. Fill missing data: fillna()
Sample code:
# fillna
print(df_data.fillna(-100.))
Running result:
            0         1           2
0   -0.281885 -0.786572    0.487126
1    1.000000  2.000000 -100.000000
2 -100.000000  4.000000 -100.000000
3    1.000000  2.000000    3.000000

4. Hierarchical index (hierarchical indexing)
Let's create a Series whose index is a list made of two sub-lists: the first sub-list is the outer index and the second is the inner index.
Sample code:
import pandas as pd
import numpy as np

ser_obj = pd.Series(np.random.randn(12),
                    index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'],
                           [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
print(ser_obj)
Running result:
a  0    0.099174
   1   -0.310414
   2   -0.558047
b  0    1.742445
   1    1.152924
   2   -0.725332
c  0   -0.150638
   1    0.251660
   2    0.063387
d  0    1.080605
   1    0.567547
   2   -0.154148
dtype: float64

4.1 The MultiIndex object
Print the index type of this Series: it is a MultiIndex.
Print the index itself and you can see two pieces of information, levels and labels: levels lists the tags of each of the two levels, and labels records which tag is used at each position.
Sample code:
print(type(ser_obj.index))
print(ser_obj.index)
Running result:
<class 'pandas.indexes.multi.MultiIndex'>
MultiIndex(levels=[['a', 'b', 'c', 'd'], [0, 1, 2]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])

4.2 Selecting a subset
Data is retrieved by index. Since there are now two levels, data under an outer label can be fetched directly with that label.
To fetch data through the inner index, pass two elements: the first selects the outer index, the second selects the inner index.
1. Outer-level selection:
ser_obj['outer_label']
Sample code:
# outer-level selection
print(ser_obj['c'])
Running result:
0   -1.362096
1    1.558091
2   -0.452313
dtype: float64

2. Inner-level selection:
ser_obj[:, 'inner_label']
Sample code:
# inner-level selection
print(ser_obj[:, 2])
Running result:
a    0.826662
b    0.015426
c   -0.452313
d   -0.051063
dtype: float64
Hierarchical indexes are commonly used for grouping operations, pivot-table generation, and so on.
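For instance, a level can be aggregated over directly; a small sketch using the ser_obj above (groupby(level=...) is the idiomatic form in recent pandas):
# sum over the outer level: one value per label a, b, c, d
print(ser_obj.groupby(level=0).sum())
# sum over the inner level: one value per label 0, 1, 2
print(ser_obj.groupby(level=1).sum())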
4.3 Swapping level order: swaplevel()
swaplevel() exchanges the inner and outer indexes.
Sample code:
print(ser_obj.swaplevel())
Running result:
0  a    0.099174
1  a   -0.310414
2  a   -0.558047
0  b    1.742445
1  b    1.152924
2  b   -0.725332
0  c   -0.150638
1  c    0.251660
2  c    0.063387
0  d    1.080605
1  d    0.567547
2  d   -0.154148
dtype: float64

4.4 Swapping and sorting levels: sortlevel()
sortlevel() sorts by the outer index first, then the inner index, ascending by default.
Sample code:
# swap and sort levels
print(ser_obj.swaplevel().sortlevel())
Running result:
0  a    0.099174
   b    1.742445
   c   -0.150638
   d    1.080605
1  a   -0.310414
   b    1.152924
   c    0.251660
   d    0.567547
2  a   -0.558047
   b   -0.725332
   c    0.063387
   d   -0.154148
dtype: float64

5. Statistical calculation and description in Pandas
Sample code:
arr1 = np.random.rand(4, 3)
pd1 = pd.DataFrame(arr1, columns=list('ABC'), index=list('abcd'))
f = lambda x: '%.2f' % x
pd2 = pd1.applymap(f).astype(float)
pd2
Running result:
      A     B     C
a  0.87  0.26  0.67
b  0.69  0.89  0.17
c  0.94  0.33  0.04
d  0.35  0.46  0.29

5.1 Common statistical calculations: sum, mean, max, min, ...
axis=0 computes statistics by column, axis=1 by row.
skipna excludes missing values; the default is True.
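A quick sketch of the skipna behavior, with an illustrative frame containing a NaN:
df_na = pd.DataFrame({'A': [1., 2., np.nan], 'B': [4., 5., 6.]})
print(df_na.sum())               # skipna=True by default: A -> 3.0, B -> 15.0
print(df_na.sum(skipna=False))   # A -> NaN because of the missing value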
Sample code:
pd2.sum()                # sums each column by default, returning a Series
pd2.sum(axis='columns')  # sums across all columns of each row
pd2.idxmax()             # label index of each column's maximum; likewise, axis='columns' gives the label of each row's maximum
Running result:
A    2.85
B    1.94
C    1.17
dtype: float64
a    1.80
b    1.75
c    1.31
d    1.10
dtype: float64
A    c
B    b
C    a
dtype: object
5.2 Common statistical description: describe produces multiple summary statistics at once
Sample code:
pd2.describe()  # view the summary
Running result:
              A        B         C
count  4.000000  4.00000  4.000000
mean   0.712500  0.48500  0.292500
std    0.263613  0.28243  0.271585
min    0.350000  0.26000  0.040000
25%    0.605000  0.31250  0.137500
50%    0.780000  0.39500  0.230000
75%    0.887500  0.56750  0.385000
max    0.940000  0.89000  0.670000

pct_change: percent change relative to the previous amount
pd2.pct_change()  # row-over-row percent change; axis='columns' gives column-over-column change
          A         B         C
a       NaN       NaN       NaN
b -0.206897  2.423077 -0.746269
c  0.362319 -0.629213 -0.764706
d -0.627660  0.393939  6.250000

5.3 Other common statistical description methods
6. Data reading and storage
6.1 CSV files
Read a csv file: read_csv(filepath_or_buffer, usecols, encoding). filepath_or_buffer: the file path; usecols: the column names to read; encoding: the file encoding.
data = pd.read_csv('./data/food_rank.csv', encoding='utf8')
data.head()
                   name    num
0          buttered tea  219.0
1  highland barley wine   95.0
2               yoghurt   62.0
3                  ciba   16.0
4             pipa meat    2.0
# specify the column names to read
data = pd.read_csv('./data/food_rank.csv', usecols=['name'])
data.head()
                   name
0          buttered tea
1  highland barley wine
2               yoghurt
3                  ciba
4             pipa meat
# if the file path contains Chinese characters, pass engine='python'
data = pd.read_csv('./data/food_rank.csv', engine='python', encoding='utf8')
data.head()
Tip: avoid Chinese characters in file paths and file names.
Write to a csv file
DataFrame: to_csv(path_or_buf, sep, columns, header, index, na_rep, mode). path_or_buf: save path, default None; sep: delimiter, default ','; columns: which columns to keep, default None (all); header: whether to keep column names, default True; index: whether to keep the row index, default True; na_rep: string to substitute for missing values, default the empty string; mode: 'w' to overwrite (default), 'a' to append.
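A usage sketch, with an illustrative output file name and the data frame read above:
# write only the name column, without the row index
data.to_csv('out.csv', columns=['name'], index=False, na_rep='NA', encoding='utf8')
# append more rows to the same file without repeating the header
data.to_csv('out.csv', mode='a', header=False, index=False)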
Series: Series.to_csv(path=None, index=True, sep=',', na_rep='', header=False, mode='w', encoding=None)

6.2 Database interaction
pandas
sqlalchemy
pymysql
# import the necessary modules
import pandas as pd
from sqlalchemy import create_engine

# initialize the database connection
# user root, empty password, port 3306, database db2
engine = create_engine('mysql+pymysql://root:@localhost:3306/db2')
# query statement
sql = '''select * from class;'''
# two arguments: the sql statement and the database connection
df = pd.read_sql(sql, engine)
df
# build a new DataFrame
df = pd.DataFrame({'id': [1, 2, 3, 4], 'num': [34, 56, 78, 90]})
# or read one from a csv file
# df = pd.read_csv('ex1.csv')
# write it to the database
df.to_sql('df2', engine, index=False)
print("ok")
Then check the table in the database.
7. Data cleaning
7.1 Data cleaning and preparation
Data cleaning is a key step in data analysis and directly affects all subsequent processing.
Does the data need to be modified? Is there anything to change? How should the data be adjusted to suit the following analysis and mining?
Cleaning is an iterative process; in real projects these operations may need to be performed several times.
1. Handling missing data
fillna()
dropna()
2. duplicated() checks whether each row is a duplicate
duplicated(): returns a boolean Series indicating whether each row is a duplicate
Sample code:
import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
                       'data2': np.random.randint(0, 4, 8)})
print(df_obj)
print(df_obj.duplicated())
Running result:
# print(df_obj)
  data1  data2
0     a      3
1     a      2
2     a      3
3     a      3
4     b      1
5     b      0
6     b      3
7     b      0
# print(df_obj.duplicated())
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7     True
dtype: bool

2.4 drop_duplicates() filters duplicate rows
All columns are considered by default.
Specific columns can be named for the duplicate check.
Sample code:
print(df_obj.drop_duplicates())
print(df_obj.drop_duplicates('data2'))
Running result:
# print(df_obj.drop_duplicates())
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0
6     b      3
# print(df_obj.drop_duplicates('data2'))
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0

2.5 Data transformation with a function or mapping
map transforms each element according to the function passed in.
Sample code:
ser_obj = pd.Series(np.random.randint(0, 10, 10))
print(ser_obj)
print(ser_obj.map(lambda x: x ** 2))
Running result:
# print(ser_obj)
0    1
1    4
2    8
3    6
4    8
5    6
6    6
7    4
8    7
9    3
dtype: int64
# print(ser_obj.map(lambda x: x ** 2))
0     1
1    16
2    64
3    36
4    64
5    36
6    36
7    16
8    49
9     9
dtype: int64

2.6 Replacing values: replace substitutes by value
Sample code:
# replace one value with another
print(ser_obj.replace(1, -100))
# replace several values with one value
print(ser_obj.replace([6, 8], -100))
# replace several values with several values
print(ser_obj.replace([4, 7], [-100, -200]))
Running result:
# print(ser_obj.replace(1, -100))
0   -100
1      4
2      8
3      6
4      8
5      6
6      6
7      4
8      7
9      3
dtype: int64
# print(ser_obj.replace([6, 8], -100))
0      1
1      4
2   -100
3   -100
4   -100
5   -100
6   -100
7      4
8      7
9      3
dtype: int64
# print(ser_obj.replace([4, 7], [-100, -200]))
0      1
1   -100
2      8
3      6
4      8
5      6
6      6
7   -100
8   -200
9      3
dtype: int64

3. String manipulation
3.1 String methods
3.2 Regular expression methods
3.3 Pandas vectorized string functions
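The original only lists these three headings; as a brief illustration of the pandas vectorized string functions under Series.str, a sketch with illustrative values:
s = pd.Series([' Apple ', 'banana', None, 'Cherry'])
print(s.str.strip().str.lower())       # element-wise strip + lower; None passes through as NaN
print(s.str.contains('an', na=False))  # substring or regex matching per element
print(s.str.replace(r'[aeiou]', '*'))  # regex replacement (the default in older pandas; pass regex=True in recent versions)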
7.2 Data merging
1. Merging data (pd.merge)
Joins rows of different DataFrames based on one or more keys.
Similar to database join operations.
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
left: the left DataFrame in the merge
right: the right DataFrame in the merge
how: merge method, one of 'inner' (default), 'outer', 'left', 'right'
on: column names to join on; they must exist in both frames; the intersection of the column names of left and right is used as the join key
left_on: columns of the left DataFrame to use as join keys
right_on: columns of the right DataFrame to use as join keys
inner join (inner): uses the intersection of the keys of both tables
full join (outer): uses the union of the keys of both tables
left join (left): uses all keys of the left table
right join (right): uses all keys of the right table
Sample code:
import pandas as pd
import numpy as np

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on='key')  # specify the join key 'key'
Running result:
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
Sample code:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])  # specify multiple keys for the merge
Running result:
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
# specify a left join (left and right as defined above)
pd.merge(left, right, how='left', on=['key1', 'key2'])
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
# specify a right join
pd.merge(left, right, how='right', on=['key1', 'key2'])
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K1   K0   A2   B2   C1   D1
2   K1   K0   A2   B2   C2   D2
3   K2   K0  NaN  NaN   C3   D3
The default is "inner", that is, the keys in the result are intersecting
How: specify the connection method
"external join" (outer), and the key in the result is union
Sample code:
pd.merge(left, right, how='outer', on=['key1', 'key2'])
Running result:
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
1. Handling duplicate column names
Parameter suffixes: defaults to ('_x', '_y')
Sample code:
# handle duplicate column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data': np.random.randint(0, 10, 3)})
print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))
Running result:
   data_left key  data_right
0          9   b           1
1          5   b           1
2          1   b           1
3          2   a           8
4          2   a           8
5          5   a           8

2. Join by index
Parameter left_index=True or right_index=True
Sample code:
# join by index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'data2': np.random.randint(0, 10, 3)},
                       index=['a', 'b', 'd'])
print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))
Running result:
   data1 key  data2
0      3   b      6
1      4   b      6
6      8   b      6
2      6   a      0
4      3   a      0
5      0   a      0

2. Concatenating data (pd.concat)
Concatenates multiple objects along an axis.
1. NumPy's concatenate
np.concatenate
Sample code:
import numpy as np
import pandas as pd

arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))
print(arr1)
print(arr2)
print(np.concatenate([arr1, arr2]))          # default axis=0: stack vertically
print(np.concatenate([arr1, arr2], axis=1))  # axis=1: stack by columns
Running result:
# print(arr1)
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]]
# print(arr2)
[[6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]
# print(np.concatenate([arr1, arr2]))
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]
 [6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]
# print(np.concatenate([arr1, arr2], axis=1))
[[3 3 0 8 6 8 7 3]
 [2 0 3 1 1 6 8 7]
 [4 8 8 2 1 4 7 1]]

2. pd.concat
Note the axis direction; the default is axis=0.
join specifies the merge method, 'outer' by default.
When concatenating Series, check the row indexes for duplicates.
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=list('abc'), columns=['one', 'two'])
df2 = pd.DataFrame(np.arange(4).reshape(2, 2) + 5, index=list('ac'), columns=['three', 'four'])

pd.concat([df1, df2])  # default outer join, axis=0
   four  one  three  two
a   NaN  0.0    NaN  1.0
b   NaN  2.0    NaN  3.0
c   NaN  4.0    NaN  5.0
a   6.0  NaN    5.0  NaN
c   8.0  NaN    7.0  NaN

pd.concat([df1, df2], axis='columns')  # specify axis=1
   one  two  three  four
a    0    1    5.0   6.0
b    2    3    NaN   NaN
c    4    5    7.0   8.0

# likewise, the join method can be set to inner
pd.concat([df1, df2], axis=1, join='inner')
   one  two  three  four
a    0    1      5     6
c    4    5      7     8

7.3 Reshaping
1. stack
Pivots the column index into a row index, producing a hierarchical index.
DataFrame -> Series
Sample code:
import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0, 10, (5, 2)), columns=['data1', 'data2'])
print(df_obj)
stacked = df_obj.stack()
print(stacked)
Running result:
# print(df_obj)
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2
# print(stacked)
0  data1    7
   data2    9
1  data1    7
   data2    8
2  data1    8
   data2    9
3  data1    4
   data2    1
4  data1    1
   data2    2
dtype: int64

2. unstack
Expands a level of the hierarchical index back into columns.
Series -> DataFrame
Operates on the innermost level by default, i.e. level=-1.
Sample code:
# operates on the inner level by default
print(stacked.unstack())
# specify the index level to operate on with level
print(stacked.unstack(level=0))
Running result:
# print(stacked.unstack())
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2
# print(stacked.unstack(level=0))
       0  1  2  3  4
data1  7  7  8  4  1
data2  9  8  9  1  2

8. Grouping and aggregation
What is grouped aggregation? It is the split-apply-combine pattern: split the data into groups, apply a function to each group, and combine the results. (The original article illustrates this with a figure.)
groupby(by=None, as_index=True)
by: what to group by; determines the groups
as_index: for aggregated output, return an object indexed by the group labels; only applies to DataFrame
df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange', 'apple', 'banana'],
                    'color': ['red', 'yellow', 'yellow', 'cyan', 'cyan'],
                    'price': [8.5, 6.8, 5.6, 7.8, 6.4]})
# view the type
type(df1.groupby('fruit'))
# pandas.core.groupby.groupby.DataFrameGroupBy

# a GroupBy object is a sequence of 2-tuples of the group name and the data
# block, and supports iteration
for name, group in df1.groupby('fruit'):
    print(name)         # prints the group names: apple, banana, orange
    print(group)        # prints the data blocks
    print(type(group))  # each data block is a DataFrame

# select any single data block
dict(list(df1.groupby('fruit')))['apple']  # extract the data block of the apple group
   fruit color  price
0  apple   red    8.5
3  apple  cyan    7.8
Aggregation
Function     Description
count        number of non-NA values in the group
sum          sum of non-NA values
mean         mean of non-NA values
median       median of non-NA values
std, var     standard deviation and variance
min, max     minimum and maximum of non-NA values
prod         product of non-NA values
first, last  first and last non-NA values

# a GroupBy object has all the aggregation methods in the table above
# mean price per fruit
df1['price'].groupby(df1['fruit']).mean()
fruit
apple     8.15
banana    6.60
orange    5.60
Name: price, dtype: float64
# or
df1.groupby('fruit')['price'].mean()

# as_index=False: do not use the group labels as the index; regenerate the default index
df1.groupby('fruit', as_index=False)['price'].mean()
    fruit  price
0   apple   8.15
1  banana   6.60
2  orange   5.60

"""
Suppose we now need the price range (max - min) for each fruit:
1. The aggregation functions in the table above cannot meet this need.
2. We apply a custom aggregation function to the grouped object.
"""
# define a function
def diff_value(arr):
    return arr.max() - arr.min()

# pass the custom aggregation function to the agg or aggregate method
# (custom aggregation functions are much slower than the built-ins in the
# table, because of function-call overhead and data rearrangement)
df1.groupby('fruit')['price'].agg(diff_value)
fruit
apple     0.7
banana    0.4
orange    0.0
Name: price, dtype: float64
Group by a dict or Series (here people is assumed to be a DataFrame with columns 'a' through 'e'; the mapping values are reconstructed illustratively):
m = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red'}
people.groupby(m, axis=1).sum()
# a Series mapping works the same way
s1 = pd.Series(m)
people.groupby(s1, axis=1).sum()
Group by function:
people.groupby(len).sum()
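Since people is never defined in this excerpt, a self-contained sketch with illustrative data that makes both examples above runnable:
people = pd.DataFrame(np.random.randn(4, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim'])
# len is applied to each index label, so rows group by name length (3 or 5)
print(people.groupby(len).sum())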
9. Time series in Pandas
Time series data is an important form of structured data.
Anything observed or measured at many points in time forms a time series. Many time series have a fixed frequency, i.e. the data points occur at regular intervals according to some rule (for example, every 15 seconds).
Time series can also be irregular. What time series data means depends on the specific application scenario.
It mainly comes in the following forms:
Timestamp (timestamp): a specific moment.
Fixed period (period): e.g. January 2007, or the whole of 2010.
Interval (interval): represented by a start and an end timestamp. A period can be seen as a special case of an interval.
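A sketch of how these look in pandas, with illustrative dates:
ts = pd.Timestamp('2007-01-15 10:30')  # a specific moment
p = pd.Period('2007-01', freq='M')     # a fixed period: January 2007
# an interval via its start and end timestamps
span = pd.Timestamp('2010-12-31') - pd.Timestamp('2010-01-01')
print(ts, p, span)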
9.1 time and date data types and their tools
The Python standard library includes data types for date and time data, as well as calendar functionality. We will mainly use the datetime, time, and calendar modules.
datetime.datetime (often imported simply as datetime) is the most frequently used type:
In [10]: from datetime import datetime

In [11]: now = datetime.now()

In [12]: now
Out[12]: datetime.datetime(2017, 9, 25, 14, 5, 52, 72973)

In [13]: now.year, now.month, now.day
Out[13]: (2017, 9, 25)
datetime stores the date and the time down to the microsecond. timedelta represents the difference between two datetime objects:
In [14]: delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

In [15]: delta
Out[15]: datetime.timedelta(926, 56700)

In [16]: delta.days
Out[16]: 926

In [17]: delta.seconds
Out[17]: 56700
You can add (or subtract) one or more timedelta to the datetime object, which results in a new object:
In [18]: from datetime import timedelta

In [19]: start = datetime(2011, 1, 7)

In [20]: start + timedelta(12)
Out[20]: datetime.datetime(2011, 1, 19, 0, 0)

In [21]: start - 2 * timedelta(12)
Out[21]: datetime.datetime(2010, 12, 14, 0, 0)
9.2 conversion between string and datetime
Using str or the strftime method (passing a format string), datetime objects and pandas Timestamp objects (introduced later) can be formatted as strings:
In [22]: stamp = datetime(2011, 1, 3)

In [23]: str(stamp)
Out[23]: '2011-01-03 00:00:00'

In [24]: stamp.strftime('%Y-%m-%d')
Out[24]: '2011-01-03'
datetime.strptime converts strings to dates using the same format codes:
In [25]: value = '2011-01-03'

In [26]: datetime.strptime(value, '%Y-%m-%d')
Out[26]: datetime.datetime(2011, 1, 3, 0, 0)

In [27]: datestrs = ['7/6/2011', '8/6/2011']

In [28]: [datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
Out[28]: [datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
datetime.strptime is a good way to parse a date in a known format, but writing out a format spec every time is tedious, especially for common date formats.
In that case you can use the parser.parse method from the third-party dateutil package (installed automatically with pandas):
In [29]: from dateutil.parser import parse

In [30]: parse('2011-01-03')
Out[30]: datetime.datetime(2011, 1, 3, 0, 0)
Dateutil can parse almost all date representations that humans can understand:
In [31]: parse('Jan 31, 1997 10:45 PM')
Out[31]: datetime.datetime(1997, 1, 31, 22, 45)
In international formats the day commonly appears before the month; passing dayfirst=True handles this:
In [32]: parse('6/12/2011', dayfirst=True)
Out[32]: datetime.datetime(2011, 12, 6, 0, 0)
Pandas is usually used with arrays of dates, whether as a DataFrame axis index or as a column. Its to_datetime method parses many different date representations, and standard formats such as ISO 8601 are parsed very quickly:
In [33]: datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']

In [34]: pd.to_datetime(datestrs)
Out[34]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
It can also handle missing values (None, empty strings, etc.):
In [35]: idx = pd.to_datetime(datestrs + [None])

In [36]: idx
Out[36]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)

In [37]: idx[2]
Out[37]: NaT

In [38]: pd.isnull(idx)
Out[38]: array([False, False,  True], dtype=bool)
NaT (Not a Time) is pandas' null value for timestamp data.
Time series basics
The most basic kind of time series in pandas is a Series indexed by timestamps (often represented by Python strings or datetime objects):
In [39]: from datetime import datetime

In [40]: dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
   ....:          datetime(2011, 1, 7), datetime(2011, 1, 8),
   ....:          datetime(2011, 1, 10), datetime(2011, 1, 12)]

In [41]: ts = pd.Series(np.random.randn(6), index=dates)

In [42]: ts
Out[42]:
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64
These datetime objects are actually placed in a DatetimeIndex:
In [43]: ts.index
Out[43]:
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
Like other Series, arithmetic operations between time series of different indexes are automatically aligned by date:
In [44]: ts + ts[::2]
Out[44]:
2011-01-02   -0.409415
2011-01-05         NaN
2011-01-07   -1.038877
2011-01-08         NaN
2011-01-10    3.931561
2011-01-12         NaN
dtype: float64
ts[::2] takes every second element.
9.3 Index, selection, subset construction
When selecting data by label, a time series behaves much like any other pandas.Series:
In [48]: stamp = ts.index[2]

In [49]: ts[stamp]
Out[49]: -0.51943871505673811
There is also a more convenient use: pass in a string that can be interpreted as a date:
In [50]: ts['1/10/2011']
Out[50]: 1.9657805725027142

In [51]: ts['20110110']
Out[51]: 1.9657805725027142

9.4 Date ranges, frequencies, and shifting
Raw time series in pandas are generally irregular, i.e. they have no fixed frequency. For many applications this does not matter, but analysis often needs a relatively fixed frequency such as daily, monthly, or every 15 minutes (even if that introduces missing values into the series).
Fortunately, pandas has a set of standard time series frequencies and tools for resampling, frequency inference, and generating a fixed frequency date range.
For example, we can convert the previous time series to a time series with a fixed frequency (daily) by calling resample:
In [72]: ts
Out[72]:
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64

In [73]: resampler = ts.resample('D')
The string "D" means every day.
Frequency conversion (or resampling) is a big topic. Here, I will show you how to use the basic frequency and its multiples.
Generate date range
Although it was not said explicitly before, you may have guessed that pandas.date_range generates a DatetimeIndex of a specified length at a specified frequency:
In [74]: index = pd.date_range('2012-04-01', '2012-06-01')

In [75]: index
Out[75]:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
By default, date_range produces a daily point in time. If you pass in only the start or end date, you have to pass in a number that represents a period of time:
In [76]: pd.date_range(start='2012-04-01', periods=20)
Out[76]:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')

In [77]: pd.date_range(end='2012-06-01', periods=20)
Out[77]:
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
The start and end dates define strict boundaries for the date index.
For example, if you want to generate a date index consisting of the last working day of each month, you can pass in the "BM" frequency (for business end of month), which will only include dates within the interval (or just on the boundary) that meet the frequency requirements:
In [78]: pd.date_range('2000-01-01', '2000-12-01', freq='BM')
Out[78]:
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')
Resampling and frequency conversion
Resampling refers to the process of converting a time series from one frequency to another.
Aggregating high-frequency data to a lower frequency is called downsampling, while converting low-frequency data to a higher frequency is called upsampling. Not all resampling falls into these two categories: for example, converting W-WED (weekly on Wednesday) to W-FRI is neither downsampling nor upsampling.
Pandas objects have a resample method, the workhorse for all kinds of frequency conversion. resample has an API similar to groupby: call resample to group the data, then call an aggregation function:
In [208]: rng = pd.date_range('2000-01-01', periods=100, freq='D')

In [209]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [210]: ts
Out[210]:
2000-01-01    0.631634
2000-01-02   -1.594313
2000-01-03   -1.519937
2000-01-04    1.108752
2000-01-05    1.255853
2000-01-06   -0.024330
2000-01-07   -2.047939
2000-01-08   -0.272657
2000-01-09   -1.692615
2000-01-10    1.423830
                ...
2000-04-01   -1.638806
2000-04-02    1.401227
2000-04-03    1.758539
2000-04-04    0.628932
2000-04-05   -0.423776
2000-04-06    0.789740
2000-04-07    0.937568
2000-04-08   -2.253294
2000-04-09   -1.772919
Freq: D, Length: 100, dtype: float64

In [211]: ts.resample('M').mean()
Out[211]:
2000-01-31   -0.165893
2000-02-29    0.078606
2000-03-31    0.223811
2000-04-30   -0.063643
Freq: M, dtype: float64

In [212]: ts.resample('M', kind='period').mean()
Out[212]:
2000-01   -0.165893
2000-02    0.078606
2000-03    0.223811
2000-04   -0.063643
Freq: M, dtype: float64
Resample is a flexible and efficient method for dealing with very large time series.
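Upsampling works the same way but needs a fill rule, since it introduces new timestamps; a small sketch reusing the ts above:
# upsample the daily data to 12-hour frequency, forward-filling the new slots
print(ts.resample('12H').ffill().head())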
This is the end of "what are the knowledge points of Python Pandas". Thank you for reading; I hope this overview of indexing, alignment, function application, merging, grouping, and time series helps you in practice!