How to use pandas


In this article, the editor shares how to use pandas. Most people are not very familiar with it, so this article is shared for your reference; I hope you learn a lot from it. Let's get started!

I. Generate a data table

1. Import the pandas library first. Since the numpy library is usually used alongside it, we import both up front:

import numpy as np
import pandas as pd

2. Import CSV or xlsx files:

df = pd.read_csv('name.csv', header=1)   # read_csv already returns a DataFrame, no pd.DataFrame() wrapper needed
df = pd.read_excel('name.xlsx')

3. Create a data table with pandas:

df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])

II. View information about the data table

1. View the dimensions:

df.shape

2. Basic information about the data table (dimensions, column names, data types, memory usage, etc.):

df.info()

3. The data type of each column:

df.dtypes

4. The data type of a single column:

df['B'].dtype

5. Check for null values:

df.isnull()

6. Check a single column for null values:

df['B'].isnull()

7. View the unique values of a column:

df['B'].unique()

8. View the values of the data table:

df.values

9. Check the column names:

df.columns

10. View the first and last rows of data:

df.head()    # first 5 rows by default; pass a count, e.g. df.head(10), for the first 10
df.tail()    # last 5 rows by default; pass a count, e.g. df.tail(10), for the last 10

III. Data table cleaning

1. Fill null values with the number 0:

df.fillna(value=0)

2. Fill NA in the price column with that column's mean:

df['price'].fillna(df['price'].mean())

3. Strip whitespace from the city field:

df['city'] = df['city'].map(str.strip)

4. Case conversion:

df['city'] = df['city'].str.lower()

5. Change the data format:

df['price'].astype('int')   # raises an error if NaN values remain, so fill them first

6. Change the column name:

df.rename(columns={'category': 'category-size'})

7. Drop duplicate values that appear later, keeping the first occurrence:

df['city'].drop_duplicates()

8. Drop duplicate values that appear first, keeping the last occurrence instead:

df['city'].drop_duplicates(keep='last')

9. Data replacement:

df['city'].replace('sh', 'shanghai')

IV. Data preprocessing

df1 = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
                    "gender": ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
                    "pay": ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
                    "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})

1. Data table merging

df_inner = pd.merge(df, df1, how='inner')   # inner join on the shared id column (intersection)
df_left = pd.merge(df, df1, how='left')     # left join
df_right = pd.merge(df, df1, how='right')   # right join
df_outer = pd.merge(df, df1, how='outer')   # outer join (union)

2. Set the index column

df_inner.set_index('id')

3. Sort by the value of a specific column:

df_inner.sort_values(by=['age'])

4. Sort by index column:

df_inner.sort_index()

5. If the value of the price column is > 3000, the group column shows "high", otherwise "low":

df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')

6. Group and mark the data with multiple conditions:

df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign'] = 1

7. Split the values of the category field into separate columns and create a data table named split; its index is df_inner's index, and its columns are named category and size:

split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])

8. Merge the split data table back onto the original df_inner data table:

df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)
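
As a side note (not part of the original steps), recent pandas versions can do the same split in one call with the vectorized string accessor; a minimal sketch equivalent to steps 7 and 8, assuming the df_inner table built above:

# Alternative to steps 7-8: split category with str.split(expand=True)
split = df_inner['category'].str.split('-', expand=True)
split.columns = ['category', 'size']
df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)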

V. Data extraction

Three functions are mainly used: loc, iloc, and ix. loc extracts by label value, iloc extracts by position, and ix can extract by both label and position. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so in current pandas use loc and iloc instead (see the sketch after step 8 below).

1. Extract the value of a single row by index

df_inner.loc[3]

2. Extract a range of rows by position:

df_inner.iloc[0:5]

3. Reset the index

df_inner.reset_index()

4. Set the date as the index

df_inner = df_inner.set_index('date')

5. Extract all data before January 4th:

df_inner[:'2013-01-04']

6. Use iloc to extract data by position range:

df_inner.iloc[:3, :2]   # the numbers around the colons are positions starting from 0, not index labels: first 3 rows, first 2 columns

7. Use iloc to extract data at scattered positions:

df_inner.iloc[[0, 2, 5], [4, 5]]   # extract rows 0, 2, 5 and columns 4, 5

8. Use ix to extract data by index label and position:

df_inner.ix[:'2013-01-03', :4]   # data before 2013-01-03, first four columns
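
Since ix no longer exists in current pandas, here is a minimal loc-based sketch of the same selection, assuming the date index set in step 4 above:

# loc equivalent of the ix call: label slice on the date index,
# with the first four columns selected by name
df_inner.loc[:'2013-01-03', df_inner.columns[:4]]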

9. Determine whether the values of the city column equal beijing:

df_inner['city'].isin(['beijing'])

10. Determine whether the city column contains beijing or shanghai, then extract the matching rows:

df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]

11. Extract the first three characters of the category column and generate a data table:

pd.DataFrame(df_inner['category'].str[:3])

VI. Data filtering

Use "and", "or", and "not" conditions combined with greater-than, less-than, and equality comparisons to filter the data, then count and sum the results.

1. Use "and" for filtering

df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']]

2. Use "OR" to filter

df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['age'])   # sort_values replaces the removed .sort()

3. Use the "not" condition for filtering

df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id'])

4. Count the filtered data by the city column:

df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id']).city.count()

5. Use the query function to filter

df_inner.query('city == ["beijing", "shanghai"]')

6. Sum the filtered results by price:

df_inner.query('city == ["beijing", "shanghai"]').price.sum()

VII. Data summary

The main functions are groupby and pivot_table (a pivot_table sketch follows the groupby steps below).

1. Count and summarize all the columns

df_inner.groupby('city').count()

2. Count the id field by city

df_inner.groupby('city')['id'].count()

3. Summarize and count by two fields:

df_inner.groupby(['city', 'size'])['id'].count()

4. Summarize by the city field and calculate the count, total, and mean of price:

df_inner.groupby('city')['price'].agg([len, np.sum, np.mean])
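
As mentioned at the top of this section, pivot_table can produce the same kind of summary; a minimal sketch roughly equivalent to the groupby call in step 4 (note that 'count' skips NaN values, unlike len):

# pivot_table equivalent: count, sum, and mean of price per city
pd.pivot_table(df_inner, index='city', values='price', aggfunc=['count', 'sum', 'mean'])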

VIII. Data statistics

Data sampling, calculating standard deviation, covariance and correlation coefficient

1. Simple data sampling

df_inner.sample(n=3)

2. Set the sampling weight manually

weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)

3. Sampling without replacement:

df_inner.sample(n=6, replace=False)

4. Sampling with replacement:

df_inner.sample(n=6, replace=True)

5. Descriptive statistics of data sheet

df_inner.describe().round(2).T   # round sets the number of decimal places displayed; T transposes

6. Calculate the standard deviation of the column

df_inner['price'].std()

7. Calculate the covariance between the two fields

df_inner['price'].cov(df_inner['m-point'])

8. Covariance between all fields in the data table

df_inner.cov()

9. Correlation analysis of the two fields

df_inner['price'].corr(df_inner['m-point'])   # the correlation coefficient lies between -1 and 1: close to 1 means positive correlation, close to -1 negative, 0 uncorrelated

10. Correlation analysis of data table

df_inner.corr()

IX. Data output

The analyzed data can be exported to xlsx format and csv format.

1. Write to Excel

df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')

2. Write to CSV

df_inner.to_csv('excel_to_python.csv')

That is all the content of this article, "How to use pandas". Thank you for reading! I hope the content shared here helps you; if you want to learn more, welcome to follow the industry information channel!
