2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article is a translation of "10 Minutes to pandas" from the official pandas website. It is a brief introduction to pandas; for a more detailed treatment, refer to the Cookbook. By convention, we import the required packages as follows:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import matplotlib.pyplot as plt
I. Creating objects
You can find detailed information on the contents of this section in the Data Structure Intro section.
1. You can create a Series by passing a list of values; pandas will create a default integer index:
In [4]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [5]: s
Out[5]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
2. Create a DataFrame by passing a numpy array, with a datetime index and labeled columns:
In [6]: dates = pd.date_range('20130101', periods=6)
In [7]: dates
Out[7]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [9]: df
Out[9]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
3. Create a DataFrame by passing a dictionary of objects that can be converted to series-like structures:
In [10]: df2 = pd.DataFrame({'A': 1.,
   ....:                     'B': pd.Timestamp('20130102'),
   ....:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ....:                     'D': np.array([3] * 4, dtype='int32'),
   ....:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ....:                     'F': 'foo'})

In [11]: df2
Out[11]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
4. View the data types of different columns:
In [12]: df2.dtypes
Out[12]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
5. If you are using IPython, Tab auto-completion will automatically identify all attributes as well as custom columns. The following is a subset of the attributes that will be completed:
In [13]: df2.<TAB>
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.clip_lower
df2.align              df2.clip_upper
df2.all                df2.columns
df2.any                df2.combine
df2.append             df2.combine_first
df2.apply              df2.compound
df2.applymap           df2.consolidate
df2.D

II. Viewing data
For more information, please see: Basics Section
1. View the top and bottom rows of the frame (5 rows by default):
In [14]: df.head()
Out[14]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

In [15]: df.tail(3)
Out[15]:
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

2. Display the index, the columns, and the underlying numpy data:
In [16]: df.index
Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [17]: df.columns
Out[17]: Index(['A', 'B', 'C', 'D'], dtype='object')

In [18]: df.values
Out[18]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])

3. Get a quick statistical summary of the data with the describe() function:
In [19]: df.describe()
Out[19]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804

4. Transpose the data:
In [20]: df.T
Out[20]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988

5. Sort by an axis:
axis=0 refers to the rows, i.e. the index; axis=1 refers to the columns. Note that an operation with axis=0 acts down each column (along the vertical axis), while axis=1 acts across each row (along the horizontal axis).
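As an illustration of this axis convention (a small sketch, not part of the original tutorial): summing with axis=0 collapses the index and yields one value per column, while axis=1 collapses the columns and yields one value per row.

```python
import pandas as pd

# A tiny frame just to illustrate the axis convention.
df_axis = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

col_sums = df_axis.sum(axis=0)  # collapse the index: one value per column
row_sums = df_axis.sum(axis=1)  # collapse the columns: one value per row

print(col_sums)  # A -> 3, B -> 30
print(row_sums)  # 0 -> 11, 1 -> 22
```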
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690

6. Sort by values:
In [22]: df.sort_values(by='B')
Out[22]:
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
III. Selection
Although standard Python/numpy expressions for selecting and setting are intuitive, for production code we recommend the optimized pandas data access methods: .at, .iat, .loc and .iloc (the older .ix accessor has since been removed from pandas). See Indexing and Selecting Data and MultiIndex / Advanced Indexing for details.
A commonly used query not mentioned in the original text: locate a cell by row number and column name, e.g. fetch the value of the pname field in the third row: df.iloc[2]['pname']. If you know the row label, you can use loc: df.loc[index, 'pname']. The older catch-all form df.ix[2]['pname'] (accepting either positions or names for both row and column) has been removed in recent pandas versions.
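Since .ix is gone from recent pandas, here is a minimal sketch of the equivalent lookups (the pname column, the values, and the row labels are all hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical frame with a 'pname' column and a labelled index.
people = pd.DataFrame({'pname': ['ann', 'bob', 'carl']}, index=['x', 'y', 'z'])

by_position = people.iloc[2]['pname']        # third row by position, then column by name
by_label = people.loc['z', 'pname']          # row label and column name
fully_positional = people.iloc[2, people.columns.get_loc('pname')]  # positions on both axes

print(by_position, by_label, fully_positional)  # carl carl carl
```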
(1) Getting
1. Selecting a single column returns a Series, equivalent to df.A:
In [23]: df['A']
Out[23]:
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64
2. Selecting via [] slices the rows:
In [24]: df[0:3]
Out[24]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

In [25]: df['20130102':'20130104']
Out[25]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
(2) Select by label
Read more and check out Selection by Label
1. Use a label to get a cross section:
In [26]: df.loc[dates[0]]
Out[26]:
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
2. Select on multiple axes through labels
In [27]: df.loc[:, ['A', 'B']]
Out[27]:
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648
3. Label slicing (both endpoints are included):
In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
4. Reduce the dimension of the returned object
In [29]: df.loc['20130102', ['A', 'B']]
Out[29]:
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64
5. Get a scalar
In [30]: df.loc[dates[0], 'A']
Out[30]: 0.46911229990718628
6. Quickly access a scalar (equivalent to the previous method)
In [31]: df.at[dates[0], 'A']
Out[31]: 0.46911229990718628
(3) Selection by position
1. Use iloc to select via the position of a passed integer (a row number, not a label); a row is selected:
In [32]: df.iloc[3]
Out[32]:
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64
2. Integer slicing, similar to numpy/python:
In [33]: df.iloc[3:5, 0:2]
Out[33]:
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
3. By lists of integer positions, similar to numpy/python:
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232
4. Slice the row
In [35]: df.iloc[1:3, :]
Out[35]:
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
5. Slice the column
In [36]: df.iloc[:, 1:3]
Out[36]:
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427
6. Get a specific value
In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858
7. Quickly access a scalar (equivalent to the previous method)
In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858
(4) Boolean indexing
1. Use the value of a separate column to select the data:
In [39]: df[df.A > 0]
Out[39]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2. Get all the data in the DataFrame that meets a condition:
In [40]: df[df > 0]
Out[40]:
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
3. Use the isin() method to filter.
Searching in the index is the most basic query. For example, to check whether there is data for the day '2013-01-01':
if len(df.query('index == "{0}"'.format('2013-01-01'))) > 0:
In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2
Out[43]:
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]:
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
(5) Setting
Modify column values conditionally:
list(df['colName'].apply(lambda x: 1 if x > np.mean(df['colName']) else 0))  # 1 if greater than the column mean, else 0
1. Set a new column:
In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
In [46]: s1
Out[46]:
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [47]: df['F'] = s1
2. Set a new value through the label:
In [48]: df.at[dates[0], 'A'] = 0
3. Set the new value through the location:
In [49]: df.iat[0, 1] = 0
4. Set a new set of values with a numpy array:
In [50]: df.loc[:, 'D'] = np.array([5] * len(df))
5. The results of the above operations are as follows:
In [51]: df
Out[51]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0
6. Set the new value through the where operation:
In [52]: df2 = df.copy()
In [53]: df2[df2 > 0] = -df2
In [54]: df2
Out[54]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059 -5  NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0

IV. Missing data
pandas uses np.nan to represent missing data; it is excluded from computations by default. See the Missing Data Section for details.
1. The reindex () method can change / add / delete the index on the specified axis, which returns a copy of the original data:
In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [57]: df1
Out[57]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  NaN  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  NaN
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  NaN
2. Remove the row that contains the missing value:
In [58]: df1.dropna(how='any')
Out[58]:
                   A         B         C  D    F    E
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
3. Fill in the missing values:
In [59]: df1.fillna(value=5)
Out[59]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  5.0  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  5.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  5.0
4. Get the boolean mask of where the values are NaN:
In [60]: pd.isna(df1)
Out[60]:
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

V. Operations
For more information, please refer to Basic Section On Binary Ops.
(1) Statistics (operations generally exclude missing values)
1. Perform descriptive statistics:
In [61]: df.mean()
Out[61]:
A   -0.004474
B   -0.383981
C   -0.687758
D    5.000000
F    3.000000
dtype: float64
2. Do the same on other axes:
In [62]: df.mean(1)
Out[62]:
2013-01-01    0.872735
2013-01-02    1.431621
2013-01-03    0.707731
2013-01-04    1.395042
2013-01-05    1.883656
2013-01-06    1.592306
Freq: D, dtype: float64
3. Operate with objects that have different dimensionality and need alignment; pandas automatically broadcasts along the specified dimension:
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [64]: s
Out[64]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [65]: df.sub(s, axis='index')
Out[65]:
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -1.861849 -3.104569 -1.494929  4.0  1.0
2013-01-04 -2.278445 -3.706771 -4.039575  2.0  0.0
2013-01-05 -5.424972 -4.432980 -4.723768  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN
(2) Apply
1. Apply functions to the data:
In [66]: df.apply(np.cumsum)
Out[66]:
                   A         B         C   D     F
2013-01-01  0.000000  0.000000 -1.509059   5   NaN
2013-01-02  1.212112 -0.173215 -1.389850  10   1.0
2013-01-03  0.350263 -2.277784 -1.884779  15   3.0
2013-01-04  1.071818 -2.984555 -2.924354  20   6.0
2013-01-05  0.646846 -2.417535 -2.648122  25  10.0
2013-01-06 -0.026844 -2.303886 -4.126549  30  15.0

In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64
(3) Histogramming
For more information, please refer to Histogramming and Discretization
In [68]: s = pd.Series(np.random.randint(0, 7, size=10))
In [69]: s
Out[69]:
0    4
1    2
2    1
3    2
4    6
5    4
6    4
7    6
8    4
9    4
dtype: int64

In [70]: s.value_counts()
Out[70]:
4    5
6    2
2    2
1    1
dtype: int64
(4) String methods
The Series object is equipped with a set of string handling methods in its str property, which can be easily applied to each element in the array, as shown in the following code. For more details, please refer to Vectorized String Methods.
In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [72]: s.str.lower()
Out[72]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

VI. Merging
pandas provides a large number of methods to easily combine Series, DataFrame, and Panel objects according to various logical relationships. For more information, see the Merging section.
(1) Concat
Insert a dictionary into the table as a new column (the dict keys are aligned with the index): df['column name'] = pd.Series(my_dict)
Delete a column: del df['column name']
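A minimal sketch of both notes above (the column names and the dictionary are illustrative): assigning a pd.Series built from a dict aligns the dict keys against the frame's index, and del drops a column in place.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])

# Insert a dict as a new column: keys are aligned against the index,
# missing keys produce NaN.
d = {'x': 10, 'z': 30}
frame['B'] = pd.Series(d)

# Delete a column in place.
del frame['A']

print(frame)
```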
In [73]: df = pd.DataFrame(np.random.randn(10, 4))
In [74]: df
Out[74]:
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495

# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]
In [76]: pd.concat(pieces)
Out[76]:
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495
(2) Join
Join is similar to SQL type merging. For more information, please see: Database style joining
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [79]: left
Out[79]:
   key  lval
0  foo     1
1  foo     2

In [80]: right
Out[80]:
   key  rval
0  foo     4
1  foo     5

In [81]: pd.merge(left, right, on='key')
Out[81]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
Another example that can be shown:
In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [84]: left
Out[84]:
   key  lval
0  foo     1
1  bar     2

In [85]: right
Out[85]:
   key  rval
0  foo     4
1  bar     5

In [86]: pd.merge(left, right, on='key')
Out[86]:
   key  lval  rval
0  foo     1     4
1  bar     2     5
(3) Append
Append connects a line to a DataFrame. For more information, see Appending:
In [87]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
In [88]: df
Out[88]:
          A         B         C         D
0  1.346061  1.511763  1.627081 -0.990582
1 -0.441652  1.211526  0.268520  0.024580
2 -1.577585  0.396823 -0.105381 -0.532532
3  1.453749  1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346  0.339969 -0.693205
5 -0.339355  0.593616  0.884345  1.591431
6  0.141809  0.220390  0.435589  0.192451
7 -0.096701  0.803351  1.715071 -0.708758

In [89]: s = df.iloc[3]
In [90]: df.append(s, ignore_index=True)
Out[90]:
          A         B         C         D
0  1.346061  1.511763  1.627081 -0.990582
1 -0.441652  1.211526  0.268520  0.024580
2 -1.577585  0.396823 -0.105381 -0.532532
3  1.453749  1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346  0.339969 -0.693205
5 -0.339355  0.593616  0.884345  1.591431
6  0.141809  0.220390  0.435589  0.192451
7 -0.096701  0.803351  1.715071 -0.708758
8  1.453749  1.208843 -0.080952 -0.264610

VII. Grouping
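Note that DataFrame.append was later deprecated and removed in pandas 2.0. On newer versions, the same row-append can be sketched with pd.concat, wrapping the row back into a one-row frame (the frame and values below are illustrative):

```python
import numpy as np
import pandas as pd

base = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))
row = base.iloc[1]

# Equivalent of the removed base.append(row, ignore_index=True):
# transpose the row Series into a one-row frame, then concatenate.
appended = pd.concat([base, row.to_frame().T], ignore_index=True)

print(appended)
```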
For "group by" operations, we usually refer to one or more of the following steps:
- Splitting: dividing the data into groups according to some criteria
- Applying: applying a function to each group independently
- Combining: combining the results into a data structure
For more information, please see: Grouping section
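The three steps can also be spelled out explicitly (a didactic sketch, not part of the original; groupby performs all of this in a single call):

```python
import pandas as pd

data = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                     'C': [1, 2, 3, 4]})

# Splitting: form the groups defined by column A.
groups = dict(tuple(data.groupby('A')))

# Applying: compute something (here the sum of C) for each group.
partial = {name: grp['C'].sum() for name, grp in groups.items()}

# Combining: assemble the per-group results into one Series.
combined = pd.Series(partial)

print(combined)  # same numbers as data.groupby('A')['C'].sum()
```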
In [91]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                    'B': ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                    'C': np.random.randn(8),
   ....:                    'D': np.random.randn(8)})

In [92]: df
Out[92]:
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033
1. Group, then apply the sum function to each group:
In [93]: df.groupby('A').sum()
Out[93]:
            C        D
A
bar -2.802588  2.42611
foo  3.146492 -0.63958
2. Group multiple columns to form a hierarchical index, and then execute the function:
In [94]: df.groupby(['A', 'B']).sum()
Out[94]:
                  C         D
A   B
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434

VIII. Reshaping
Please refer to Hierarchical Indexing and Reshaping for details.
(1) Stack
In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ....:                      'foo', 'foo', 'qux', 'qux'],
   ....:                     ['one', 'two', 'one', 'two',
   ....:                      'one', 'two', 'one', 'two']]))

In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [98]: df2 = df[:4]
In [99]: df2
Out[99]:
                     A         B
first second
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230
The stack() method "compresses" a level in the DataFrame's columns:
In [100]: stacked = df2.stack()
In [101]: stacked
Out[101]:
first  second
bar    one     A    0.029399
               B   -0.542108
       two     A    0.282696
               B   -0.087302
baz    one     A   -1.575170
               B    1.771208
       two     A    0.816482
               B    1.100230
dtype: float64
With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
In [102]: stacked.unstack()
Out[102]:
                     A         B
first second
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230

In [103]: stacked.unstack(1)
Out[103]:
second        one       two
first
bar   A  0.029399  0.282696
      B -0.542108 -0.087302
baz   A -1.575170  0.816482
      B  1.771208  1.100230

In [104]: stacked.unstack(0)
Out[104]:
first          bar       baz
second
one   A   0.029399 -1.575170
      B  -0.542108  1.771208
two   A   0.282696  0.816482
      B  -0.087302  1.100230
(2) Pivot tables. For more information, please refer to Pivot Tables.
In [105]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B': ['A', 'B', 'C'] * 4,
   .....:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D': np.random.randn(12),
   .....:                    'E': np.random.randn(12)})

In [106]: df
Out[106]:
        A  B    C         D         E
0     one  A  foo  1.418757 -0.179666
1     one  B  foo -1.879024  1.291836
2     two  C  foo  0.536826 -0.009614
3   three  A  bar  1.006160  0.392149
4     one  B  bar -0.029716  0.264599
5     one  C  bar -1.146178 -0.057409
6     two  A  foo  0.100900 -1.425638
7   three  B  foo -1.035018  1.024098
8     one  C  foo  0.314665 -0.106062
9     one  A  bar -0.773723  1.824375
10    two  B  bar -1.170653  0.595974
11  three  C  bar  0.648740  1.167115
You can easily generate a PivotTable from this data:
In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]:
C             bar       foo
A     B
one   A -0.773723  1.418757
      B -0.029716 -1.879024
      C -1.146178  0.314665
three A  1.006160       NaN
      B       NaN -1.035018
      C  0.648740       NaN
two   A       NaN  0.100900
      B -1.170653       NaN
      C       NaN  0.536826

IX. Time series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is common in, but not limited to, financial applications. See the Time Series section.
In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01    25083
Freq: 5T, dtype: int64
Time zone representation
In [111]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [112]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [113]: ts
Out[113]:
2012-03-06    0.464000
2012-03-07    0.227371
2012-03-08   -0.496922
2012-03-09    0.306389
2012-03-10   -2.290613
Freq: D, dtype: float64

In [114]: ts_utc = ts.tz_localize('UTC')
In [115]: ts_utc
Out[115]:
2012-03-06 00:00:00+00:00    0.464000
2012-03-07 00:00:00+00:00    0.227371
2012-03-08 00:00:00+00:00   -0.496922
2012-03-09 00:00:00+00:00    0.306389
2012-03-10 00:00:00+00:00   -2.290613
Freq: D, dtype: float64
Switch to another time zone
In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]:
2012-03-05 19:00:00-05:00    0.464000
2012-03-06 19:00:00-05:00    0.227371
2012-03-07 19:00:00-05:00   -0.496922
2012-03-08 19:00:00-05:00    0.306389
2012-03-09 19:00:00-05:00   -2.290613
Freq: D, dtype: float64
Converting between time span representations:
In [117]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [118]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [119]: ts
Out[119]:
2012-01-31   -1.134623
2012-02-29   -1.561819
2012-03-31   -0.260838
2012-04-30    0.281957
2012-05-31    1.523962
Freq: M, dtype: float64

In [120]: ps = ts.to_period()
In [121]: ps
Out[121]:
2012-01   -1.134623
2012-02   -1.561819
2012-03   -0.260838
2012-04    0.281957
2012-05    1.523962
Freq: M, dtype: float64

In [122]: ps.to_timestamp()
Out[122]:
2012-01-01   -1.134623
2012-02-01   -1.561819
2012-03-01   -0.260838
2012-04-01    0.281957
2012-05-01    1.523962
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the first day of the month following each quarter end:
In [123]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
In [124]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [125]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
In [126]: ts.head()
Out[126]:
1990-03-01 09:00   -0.902937
1990-06-01 09:00    0.068159
1990-09-01 09:00   -0.057873
1990-12-01 09:00   -0.368204
1991-03-01 09:00   -1.144073
Freq: H, dtype: float64

X. Categoricals
Starting from version 0.15, pandas can support data of type Categorical in DataFrame. For more information, see categorical introduction and API documentation.
In [127]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
   .....:                    "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
1. Convert the original grade to Categorical data type:
In [128]: df["grade"] = df["raw_grade"].astype("category")
In [129]: df["grade"]
Out[129]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
2. Rename Categorical type data to something more meaningful:
In [130]: df["grade"].cat.categories = ["very good", "good", "very bad"]
3. Reorder the categories and, at the same time, add the missing categories:
In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [132]: df["grade"]
Out[132]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

4. Sorting is per the order of the categories, not lexical order:
In [133]: df.sort_values(by="grade")
Out[133]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
5. Grouping by a categorical column also shows empty categories:
In [134]: df.groupby("grade").size()
Out[134]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

XI. Plotting
For more information, please see Plotting docs.
In [135]: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
In [136]: ts = ts.cumsum()
In [137]: ts.plot()
Out[137]:
For DataFrame, plot is an easy way to draw all columns and their labels:
In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
   .....:                   columns=['A', 'B', 'C', 'D'])

In [139]: df = df.cumsum()
In [140]: plt.figure(); df.plot(); plt.legend(loc='best')
Out[140]:
XII. Import and save data
(1) CSV, refer to: Writing to a csv file
1. Write to csv file:
In [141]: df.to_csv('foo.csv')
2. Read from csv file:
In [142]: pd.read_csv('foo.csv')
Out[142]:
     Unnamed: 0          A          B         C          D
0    2000-01-01   0.266457  -0.399641 -0.219582   1.186860
1    2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2    2000-01-03  -1.734933   0.530468  2.060811  -0.515536
3    2000-01-04  -1.555121   1.452620  0.239859  -1.156896
4    2000-01-05   0.578117   0.511371  0.103552  -2.428202
5    2000-01-06   0.478344   0.449933 -0.741620  -1.962409
6    2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
..          ...        ...        ...       ...        ...
993  2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
994  2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
995  2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
996  2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
997  2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
998  2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
999  2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 5 columns]
(2) HDF5, refer to: HDFStores
1. Write to HDF5 storage:
In [143]: df.to_hdf('foo.h5', 'df')
2. Read from HDF5 storage:
In [144]: pd.read_hdf('foo.h5', 'df')
Out[144]:
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2000-01-03  -1.734933   0.530468  2.060811  -0.515536
2000-01-04  -1.555121   1.452620  0.239859  -1.156896
2000-01-05   0.578117   0.511371  0.103552  -2.428202
2000-01-06   0.478344   0.449933 -0.741620  -1.962409
2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
...               ...        ...       ...        ...
2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 4 columns]
(3) Excel, reference: MS Excel
1. Write to excel file:
In [145]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
2. Read from excel file:
In [146]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[146]:
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2000-01-03  -1.734933   0.530468  2.060811  -0.515536
2000-01-04  -1.555121   1.452620  0.239859  -1.156896
2000-01-05   0.578117   0.511371  0.103552  -2.428202
2000-01-06   0.478344   0.449933 -0.741620  -1.962409
2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
...               ...        ...       ...        ...
2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 4 columns]