In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >
Share
Shulou(Shulou.com)06/02 Report--
This article mainly introduces how to install and use Pandas, with a certain reference value, interested friends can refer to, I hope you can learn a lot after reading this article, the following let the editor take you to understand it.
1. Installation
It is recommended to use Anaconda for data analysis, with its own pandas numy and many libraries.
Anaconda installation address: https://www.continuum.io/downloads
After installation, you can enter commands at the beginning of conda in terminal (similar to pip), such as list to view installed packages, and to perform common actions such as install update.
2. Import
Most of the time, pandas is called pd by the user import. According to the author's description, pandas is the abbreviation of panel data (not a panda).
Import pandas as pd
3. DataFrame
DataFrame is the most common format used by Pandas to process data, a two-dimensional table with rows, columns, and values. Similar to table in a database or worksheet in excel. If there is a DataFrame called df, df.columns can get all the column tags, just as df.index can get all the row tags.
4. Read data 4.1 read data from excel
Raw = pd.read_excel ('% s% s.xlsx'% path% filename, sheetname='Data', skiprows= 1)
The data is read into a DataFrame called raw, sheetname can specify to read a worksheet, and skiprow can skip the initial N rows of data.
4.2 read data from csv
Raw = pd.read_csv ('% s% s.csv'% path% filename)
5. Add, delete, modify and check 5.1 add and delete columns
New column, located in the last column
Raw ['new column name'] = 'string'
Add a column in the middle, using df.insert ()
Df.insert (location, 'column name', value)
For example, insert a [city] column named city with a value of source_data in the second column of raw df (index is not a column)
Raw.insert (1 source source data ['data1'])
Delete column
Del raw ['column name']
5.2 change the column name
5.2.1 change all column names at once
Cols = ['name_1',' name_2', 'name_3']
Raw= raw [cols]
5.2.2 modify a column name
Using df.rename (), note that if there are multiple old_name columns in the df, they will all be renamed to new_name
Df=df.rename (columns = {'old_name':'new_name'})
5.3 change to index
If a column is set to index, the original index will be deleted
Raw = raw.set_index ('column_name')
Reset_index (), the new index is an incrementing integer sequence starting with 0, and the original index becomes a new column.
Raw = raw.reset_index ()
If you don't need the value in the original index, you can add drop = True:
Raw = raw.reset_index (drop=True)
5.4 Edit value (calculated value)
5.4.1 four operations
Raw ['data1'] = raw [' data1'] * 100
Raw ['data2'] = (raw [' data1'] + raw ['data3']) / raw [' data4']
Raw ['total'] = raw.sum (axis=1)
5.5 check column
5.5.1 filter a column that contains a value (all data in raw df where GEO CODE is CN)
Raw = raw [raw ['GEO_CODE'] = =' CN']
5.5.2 Multi-conditional filtering
Raw = raw [(raw ['GEO_CODE'] = =' CN') & (raw ['METRIC'] = =' Conversion Rate')]
5.5.3 filter multiple columns
Required_key = ['User_ID','SEO visits','SEA visits','Conversion Rate']
Raw = raw [raw ['METRIC_KEY'] .isin (required_key)]
5.6 weight removal
5.6.1 to reuse drop_duplicates (), there are two main parameters:
Subset needs to be deduplicated
Keep, keep the first (keep = 'first') or and the last (keep =' last') in case of duplicate values
Df = df.drop_duplicates (subset = 'column_name', keep =' last')
5.6.2 because when removing weight, the reserved value is simply the first or last one, so it needs to be paired with sort_values () to ensure that the left value is what you want. Sort_values () defaults to ascending ascending, from small to large.
Df = df.sort_values (by='column_name')
Df = df.drop_duplicates (subset = 'column_name', keep =' last')
6 Excel function-related PivotTable of 6.1 Excel
Pd.pivot_table ()
There are three main parameters, index, columns,value, and aggfunc
Index is the equivalent of a line label
Columns is equivalent to column label
Value is equivalent to calculating values, and aggfunc is used to calculate count/mean/average.
Note that value cannot use values that index and columns have already used, unlike excel.
Pivot= pd.pivot_table (raw, values = 'Response ID', index= [' Country'], columns= ['NPS category'], aggfunc=np.size)
Aggfunc currently uses counted np.size summary np.sum average np.average np.mean median np.median
6.2 DataFrame Convergence (vlookup or hookup)
Because excel's formula is in a cell, and DataFrame generally deals with row or column data at one time, assigning values to a row / column based on other row / column data references is equivalent to table fusion.
Two main methods are used:
Pd.merge ()
Pd.concat ()
For details, please see the next section.
6.3 pd.merge ()
Pd.merge () is very similar to the operation of join in the database, with rich parameters:
Merged_df = pd.merge (left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes= ('_ x','_ y'), copy=True, indicator=False)
Merge can provide several merging methods commonly used in relational databases, and null values are filled with NaN:
The following is a detailed explanation of several common parameters:
Parameter on
The parameter of pd.merge (df1, df2, on = 'xxx') # on is used to determine the common column of the two tables.
On here is equivalent to the positioning of lookup value in vlookup.
Parameter merge
The parameters of pd.merge (df1, df2, how= 'xxx') # how are used to determine the merge method.
The correspondence between Merge method and SQL join is as follows:
Merge methodSQL Join NameDescriptionleftLEFT OUTER JOIN uses only the key of the left table rightRIGHT OUTER JOIN uses only the key of the right table outerFULL OUTER JOIN uses the union key of two tables inerINNER JOIN uses the key of the intersection of two tables
If you use pd.merge to implement vlookup, the index of the two happens to be the common value, as long as
Pd.merge (main_data,to_lookup_data,on = 'left') is OK.
Parameter left_on right_on
To bu input
Parameter left_index right_index
To bu input
6.4 pd.concat ()
If two DataFrame column are the same, they are spliced up and down (add rows of data)
Pd.concat ([df1,df2])
If two DataFrame index are the same, they are spliced left and right (add data columns)
Pd.concat ([df1,df2], axis = 1)
If there is more than one DataFrame with the same column:
Dfs = [df1,df2,df3,df4]
Result = pd.concat (dfs)
For more details on pd.merge () and pd.concat (), please refer to the official website:
Http://pandas.pydata.org/pandas-docs/stable/merging.html
7. Data output
Suppose there is now a DataFrame named raw that needs to be exported to the root directory of disk C.
7.1Exporting to csv
DataFrame comes with to_csv () function. Note that encoding parameter is recommended if Chinese is recommended, and index= False parameter can be added if index is not needed.
Raw.to_csv ('C:\ File_name.csv', encoding = 'utf-8', index = False)
7.2Exporting to Excel
Use Excel Writer that comes with pandas to generate excel in 2010 format
From pandas import ExcelWriter
Path ='C:\'
Writer = ExcelWriter ('% sFile_name.xlsx'% path) # specify the Excel file name
Raw.to_excel (writer, sheet_name = 'worksheet_name') # specify the worksheet name
Writer.save ()
7.3 output to database
What if you want to save the database? RDBS and NOSQL
Mysql
MongoDB
To be input..
8. Use datetime for time-related operations
Using datetime in python can also achieve the same function as the date function commonly used in excel.
8.1 create the current point in time as an object
Import datetime
Now = datetime.datetime.now ()
Today = datetime.datetime.today ()
8.2 displacement of time
Start_date = dt.date (today.year-2,today.month-1,today.day)
End_date = dt.date (today.year,today.month-3,today.day+1)
If the month / date exceeds the limit, an error will be reported.
So you may need to write a loop to output these dates
Date_list = [] while start_date < end_date: if start_date.month < 12: date_list.append (start_date.strftime ('% Ymuri% m')) start_date = datetime.date (start_date.year,start_date.month + 1 Mae startdate.day) else: date_list.append (start_date.strftime ('% Ymurc% m')) start_date = datetime.date (start_date.year+1 Start_date.month-11,start_date.day) 8.3 Reformatting
As shown above, you can adjust the format of time with strftime (), and there are many options that can be adjusted, refer to:
Http://www.runoob.com/python/att-time-strftime.html
Year representation of% y double digits (00-99)
The year of% Y four digits is expressed (000-9999)
% m month (01-12)
One day in% d months (0-31)
% H 24-hour hours (0-23)
% I 12-hour hours (01-12)
% M minutes (00059)
% s seconds (00-59)
% a Local simplified week name
% A Local full week name
% b locally simplified month name
% B Local full month name
% c corresponding local date and time representations
% j one day of the year (001-366)
% p Local A.M. Or P.M. The equivalent symbol of
% U week of the year (00-53) Sunday is the beginning of the week
% w week (0-6), Sunday is the beginning of the week
% W week of the year (00-53) Monday is the beginning of the week
% x local corresponding date representation
% X local corresponding time representation
Name of Z current time zone
% number itself
Calculation of 8.4 weeks calculation of 8.5 working days
There is a convenient function in Excel called networkdays, which gives the start date, end date and holiday to calculate the number of working days between two days. However, pandas or datetime did not support this requirement well, so I found this module: business_calendar
Https://pypi.python.org/pypi/business_calendar/
8.5.1 calculate the number of working days between dates
For example, if you ask for a few working days from February 1 to 29, the conditions are known:
Work from Monday to Friday
From February 8th to 12th is a holiday
Date1 = datetime.datetime (2016, 1, 31) # Note: if you write February 1st, that day is not included. So on January 31st, date2 = datetime.datetime (2016jue 2jue 29) cal = Calendar (workdays = [MO, TU, WE, TH, FR], holidays= ['2016-02-08] bsday = cal.busdaycount (date1, date2) print (bsday) thank you for reading this article carefully. I hope the article "how to install and use Pandas" shared by the editor will be helpful to you. At the same time, I also hope that you will support and pay attention to the industry information channel. More related knowledge is waiting for you to learn!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un