SLTechnology News & Howtos, Shulou (Shulou.com) — 06/02 report; updated 2025-04-03.
This article walks through the basics of using pandas for everyday data analysis. I hope you find it useful.
I. Generate a data table
1. Import the pandas library first; the numpy library is usually needed as well, so import both up front:
import numpy as np
import pandas as pd
2. Import a CSV or xlsx file:
df = pd.DataFrame(pd.read_csv('name.csv', header=1))
df = pd.DataFrame(pd.read_excel('name.xlsx'))
3. Create a data table with pandas:
df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])
II. View information about the data table
1. View the dimensions:
df.shape
2. Basic information about the data table (dimensions, column names, data types, memory usage, etc.):
df.info()
3. The data type of each column:
df.dtypes
4. The data type of a single column:
df['B'].dtype
5. Check for null values:
df.isnull()
6. Check a single column for null values:
df['B'].isnull()
7. View the unique values of a column:
df['B'].unique()
8. View the values of the data table:
df.values
9. View the column names:
df.columns
10. View the first and last rows of data:
df.head()  # first 5 rows by default; pass a number for more, e.g. df.head(10)
df.tail()  # last 5 rows by default
III. Data table cleaning
1. Fill null values with the number 0:
df.fillna(value=0)
2. Fill NA with the mean of the price column:
df['price'].fillna(df['price'].mean())
3. Strip surrounding whitespace from the city field:
df['city'] = df['city'].map(str.strip)
4. Convert case:
df['city'] = df['city'].str.lower()
5. Change the data type:
df['price'].astype('int')
6. Rename a column:
df.rename(columns={'category': 'category-size'})
7. Drop duplicate values, keeping the first occurrence:
df['city'].drop_duplicates()
8. Drop duplicate values, keeping the last occurrence:
df['city'].drop_duplicates(keep='last')
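A minimal sketch of how the keep parameter controls which duplicate survives, using a throwaway Series rather than the df above:

```python
import pandas as pd

s = pd.Series(['beijing', 'sh', 'beijing', 'shenzhen'])

# keep='first' (the default) keeps the first occurrence of each value
first_kept = s.drop_duplicates()            # indices 0, 1, 3 remain

# keep='last' keeps the last occurrence of each value instead
last_kept = s.drop_duplicates(keep='last')  # indices 1, 2, 3 remain
```

Note that the original row order is preserved either way; only which copy of each duplicate survives changes.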
9. Replace data:
df['city'].replace('sh', 'shanghai')
IV. Data preprocessing
df1 = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
                    "gender": ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
                    "pay": ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'],
                    "m-point": [10, 12, 20, 40, 40, 40, 30, 20]})
1. Merge data tables:
df_inner = pd.merge(df, df1, how='inner')  # match and merge; intersection
df_left = pd.merge(df, df1, how='left')
df_right = pd.merge(df, df1, how='right')
df_outer = pd.merge(df, df1, how='outer')  # union
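How the four how values differ can be seen on two tiny toy frames (illustrative data, not the df and df1 above):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'x': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'y': ['B', 'C', 'D']})

inner = pd.merge(left, right, on='id', how='inner')   # ids present in both: 2, 3
outer = pd.merge(left, right, on='id', how='outer')   # all ids from both: 1, 2, 3, 4
left_m = pd.merge(left, right, on='id', how='left')   # all left ids: 1, 2, 3
```

Rows that have no match on the other side get NaN in the missing columns for 'outer', 'left', and 'right' merges.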
2. Set an index column:
df_inner.set_index('id')
3. Sort by the values of a specific column:
df_inner.sort_values(by=['age'])
4. Sort by the index column:
df_inner.sort_index()
5. If the value of the price column is > 3000, the group column shows "high", otherwise "low":
df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')
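np.where picks one of two values per row depending on a condition; a toy sketch (illustrative data, not the df_inner above):

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({'price': [1200, 4500, 2800]})

# 'high' where the condition holds, 'low' where it does not
prices['group'] = np.where(prices['price'] > 3000, 'high', 'low')
```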
6. Tag data that matches multiple conditions:
df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign'] = 1
7. Split the values of the category field and create a new data table named split; its index is df_inner's index and its columns are category and size:
split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])
8. Merge the split data table back into the original df_inner table by index:
df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)
V. Data extraction
Three functions are mainly used: loc, iloc, and ix. loc extracts by label, iloc extracts by position, and ix can extract by label and position at the same time (note that ix is deprecated and removed in recent pandas versions).
1. Extract a single row by index label:
df_inner.loc[3]
2. Extract a range of rows by position:
df_inner.iloc[0:5]
3. Reset the index:
df_inner.reset_index()
4. Set the date as the index:
df_inner = df_inner.set_index('date')
5. Extract all data before the 4th:
df_inner[:'2013-01-04']
6. Use iloc to extract data by position range:
df_inner.iloc[:3, :2]  # the numbers before and after the colon are positions starting at 0, not index labels: the first three rows and first two columns
7. Use iloc to pick out scattered positions:
df_inner.iloc[[0, 2, 5], [4, 5]]  # rows 0, 2 and 5; columns 4 and 5
8. Use ix to extract data by index label and position (deprecated in newer pandas):
df_inner.ix[:'2013-01-03', :4]  # data before 2013-01-03, first four columns
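Since .ix was removed in pandas 1.0, the same mixed label-plus-position selection can be done by chaining loc and iloc. A minimal sketch on a small date-indexed toy frame (not df_inner):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                  index=pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-03']))

# label slice on rows first, then positional slice on columns
subset = df.loc[:'2013-01-02'].iloc[:, :2]
```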
9. Determine whether the values of the city column are beijing:
df_inner['city'].isin(['beijing'])
10. Determine whether the city column contains beijing or shanghai, then extract the matching rows:
df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]
11. Extract the first three characters and generate a data table:
pd.DataFrame(df_inner['category'].str[:3])
VI. Data filtering
Use AND, OR and NOT conditions together with greater-than, less-than and equals comparisons to filter the data, then count and sum it.
1. Filter with "and":
df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']]
2. Filter with "or":
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['age'])
3. Filter with "not":
df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id'])
4. Count the filtered data by the city column:
df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id']).city.count()
5. Filter with the query function:
df_inner.query('city == ["beijing", "shanghai"]')
6. Sum the price column over the filtered results:
df_inner.query('city == ["beijing", "shanghai"]').price.sum()
VII. Data summarization
The main functions are groupby and pivot_table.
1. Count and summarize all columns:
df_inner.groupby('city').count()
2. Count the id field by city:
df_inner.groupby('city')['id'].count()
3. Summarize and count by two fields:
df_inner.groupby(['city', 'size'])['id'].count()
4. Group by the city field and calculate the count, total and mean of price:
df_inner.groupby('city')['price'].agg([len, np.sum, np.mean])
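pivot_table is mentioned above but not demonstrated; a minimal sketch on toy data (the column names here are illustrative, not from df_inner):

```python
import pandas as pd

sales = pd.DataFrame({'city': ['beijing', 'beijing', 'shanghai', 'shanghai'],
                      'size': ['A', 'B', 'A', 'B'],
                      'price': [100.0, 200.0, 300.0, 400.0]})

# rows = city, columns = size, each cell = mean price for that combination
pivot = pd.pivot_table(sales, index='city', columns='size',
                       values='price', aggfunc='mean')
```

Compared with groupby, pivot_table spreads one grouping key across the columns, giving a spreadsheet-style cross table.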
VIII. Data statistics
Data sampling, and calculating standard deviation, covariance and correlation coefficients.
1. Simple random sampling:
df_inner.sample(n=3)
2. Set sampling weights manually:
weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)
3. Sample without replacement:
df_inner.sample(n=6, replace=False)
4. Sample with replacement:
df_inner.sample(n=6, replace=True)
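The difference replace makes: with replacement the same row can be drawn more than once, so you may sample more rows than the frame contains. A toy sketch (not df_inner; random_state is fixed only for reproducibility):

```python
import pandas as pd

small = pd.DataFrame({'x': [1, 2, 3]})

# with replacement: n may exceed the number of rows
with_replacement = small.sample(n=10, replace=True, random_state=0)

# without replacement: n can be at most len(small)
without_replacement = small.sample(n=3, replace=False, random_state=0)
```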
5. Descriptive statistics of the data table:
df_inner.describe().round(2).T  # round sets the number of decimal places shown; T transposes
6. Calculate the standard deviation of a column:
df_inner['price'].std()
7. Calculate the covariance between two fields:
df_inner['price'].cov(df_inner['m-point'])
8. Covariance between all fields in the data table:
df_inner.cov()
9. Correlation between two fields:
df_inner['price'].corr(df_inner['m-point'])  # the correlation coefficient lies between -1 and 1: close to 1 is positively correlated, close to -1 is negatively correlated, 0 is uncorrelated
10. Correlation analysis of the whole data table:
df_inner.corr()
IX. Data output
The analyzed data can be exported to xlsx or csv format.
1. Write to Excel:
df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')
2. Write to CSV:
df_inner.to_csv('excel_to_python.csv')
That covers the basics of how to use pandas. Thank you for reading, and I hope you found it helpful!