How to install and use Pandas 07/11 Update SLTechnology News&Howtos

How to install and use Pandas

2025-07-11 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/02 Report--

This article mainly introduces how to install and use Pandas, with a certain reference value, interested friends can refer to, I hope you can learn a lot after reading this article, the following let the editor take you to understand it.

1. Installation

It is recommended to use Anaconda for data analysis, with its own pandas numy and many libraries.

Anaconda installation address: https://www.continuum.io/downloads

After installation, you can enter commands at the beginning of conda in terminal (similar to pip), such as list to view installed packages, and to perform common actions such as install update.

2. Import

Most of the time, pandas is called pd by the user import. According to the author's description, pandas is the abbreviation of panel data (not a panda).

Import pandas as pd

3. DataFrame

DataFrame is the most common format used by Pandas to process data, a two-dimensional table with rows, columns, and values. Similar to table in a database or worksheet in excel. If there is a DataFrame called df, df.columns can get all the column tags, just as df.index can get all the row tags.

4. Read data 4.1 read data from excel

Raw = pd.read_excel ('% s% s.xlsx'% path% filename, sheetname='Data', skiprows= 1)

The data is read into a DataFrame called raw, sheetname can specify to read a worksheet, and skiprow can skip the initial N rows of data.

4.2 read data from csv

Raw = pd.read_csv ('% s% s.csv'% path% filename)

5. Add, delete, modify and check 5.1 add and delete columns

New column, located in the last column

Raw ['new column name'] = 'string'

Add a column in the middle, using df.insert ()

Df.insert (location, 'column name', value)

For example, insert a [city] column named city with a value of source_data in the second column of raw df (index is not a column)

Raw.insert (1 source source data ['data1'])

Delete column

Del raw ['column name']

5.2 change the column name

5.2.1 change all column names at once

Cols = ['name_1',' name_2', 'name_3']

Raw= raw [cols]

5.2.2 modify a column name

Using df.rename (), note that if there are multiple old_name columns in the df, they will all be renamed to new_name

Df=df.rename (columns = {'old_name':'new_name'})

5.3 change to index

If a column is set to index, the original index will be deleted

Raw = raw.set_index ('column_name')

Reset_index (), the new index is an incrementing integer sequence starting with 0, and the original index becomes a new column.

Raw = raw.reset_index ()

If you don't need the value in the original index, you can add drop = True:

Raw = raw.reset_index (drop=True)

5.4 Edit value (calculated value)

5.4.1 four operations

Raw ['data1'] = raw [' data1'] * 100

Raw ['data2'] = (raw [' data1'] + raw ['data3']) / raw [' data4']

Raw ['total'] = raw.sum (axis=1)

5.5 check column

5.5.1 filter a column that contains a value (all data in raw df where GEO CODE is CN)

Raw = raw [raw ['GEO_CODE'] = =' CN']

5.5.2 Multi-conditional filtering

Raw = raw [(raw ['GEO_CODE'] = =' CN') & (raw ['METRIC'] = =' Conversion Rate')]

5.5.3 filter multiple columns

Required_key = ['User_ID','SEO visits','SEA visits','Conversion Rate']

Raw = raw [raw ['METRIC_KEY'] .isin (required_key)]

5.6 weight removal

5.6.1 to reuse drop_duplicates (), there are two main parameters:

Subset needs to be deduplicated

Keep, keep the first (keep = 'first') or and the last (keep =' last') in case of duplicate values

Df = df.drop_duplicates (subset = 'column_name', keep =' last')

5.6.2 because when removing weight, the reserved value is simply the first or last one, so it needs to be paired with sort_values () to ensure that the left value is what you want. Sort_values () defaults to ascending ascending, from small to large.

Df = df.sort_values (by='column_name')

Df = df.drop_duplicates (subset = 'column_name', keep =' last')

6 Excel function-related PivotTable of 6.1 Excel

Pd.pivot_table ()

There are three main parameters, index, columns,value, and aggfunc

Index is the equivalent of a line label

Columns is equivalent to column label

Value is equivalent to calculating values, and aggfunc is used to calculate count/mean/average.

Note that value cannot use values that index and columns have already used, unlike excel.

Pivot= pd.pivot_table (raw, values = 'Response ID', index= [' Country'], columns= ['NPS category'], aggfunc=np.size)

Aggfunc currently uses counted np.size summary np.sum average np.average np.mean median np.median

6.2 DataFrame Convergence (vlookup or hookup)

Because excel's formula is in a cell, and DataFrame generally deals with row or column data at one time, assigning values to a row / column based on other row / column data references is equivalent to table fusion.

Two main methods are used:

Pd.merge ()

Pd.concat ()

For details, please see the next section.

6.3 pd.merge ()

Pd.merge () is very similar to the operation of join in the database, with rich parameters:

Merged_df = pd.merge (left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes= ('_ x','_ y'), copy=True, indicator=False)

Merge can provide several merging methods commonly used in relational databases, and null values are filled with NaN:

The following is a detailed explanation of several common parameters:

Parameter on

The parameter of pd.merge (df1, df2, on = 'xxx') # on is used to determine the common column of the two tables.

On here is equivalent to the positioning of lookup value in vlookup.

Parameter merge

The parameters of pd.merge (df1, df2, how= 'xxx') # how are used to determine the merge method.

The correspondence between Merge method and SQL join is as follows:

Merge methodSQL Join NameDescriptionleftLEFT OUTER JOIN uses only the key of the left table rightRIGHT OUTER JOIN uses only the key of the right table outerFULL OUTER JOIN uses the union key of two tables inerINNER JOIN uses the key of the intersection of two tables

If you use pd.merge to implement vlookup, the index of the two happens to be the common value, as long as

Pd.merge (main_data,to_lookup_data,on = 'left') is OK.

Parameter left_on right_on

To bu input

Parameter left_index right_index

To bu input

6.4 pd.concat ()

If two DataFrame column are the same, they are spliced up and down (add rows of data)

Pd.concat ([df1,df2])

If two DataFrame index are the same, they are spliced left and right (add data columns)

Pd.concat ([df1,df2], axis = 1)

If there is more than one DataFrame with the same column:

Dfs = [df1,df2,df3,df4]

Result = pd.concat (dfs)

For more details on pd.merge () and pd.concat (), please refer to the official website:

Http://pandas.pydata.org/pandas-docs/stable/merging.html

7. Data output

Suppose there is now a DataFrame named raw that needs to be exported to the root directory of disk C.

7.1Exporting to csv

DataFrame comes with to_csv () function. Note that encoding parameter is recommended if Chinese is recommended, and index= False parameter can be added if index is not needed.

Raw.to_csv ('C:\ File_name.csv', encoding = 'utf-8', index = False)

7.2Exporting to Excel

Use Excel Writer that comes with pandas to generate excel in 2010 format

From pandas import ExcelWriter

Path ='C:\'

Writer = ExcelWriter ('% sFile_name.xlsx'% path) # specify the Excel file name

Raw.to_excel (writer, sheet_name = 'worksheet_name') # specify the worksheet name

Writer.save ()

7.3 output to database

What if you want to save the database? RDBS and NOSQL

Mysql

MongoDB

To be input..

8. Use datetime for time-related operations

Using datetime in python can also achieve the same function as the date function commonly used in excel.

8.1 create the current point in time as an object

Import datetime

Now = datetime.datetime.now ()

Today = datetime.datetime.today ()

8.2 displacement of time

Start_date = dt.date (today.year-2,today.month-1,today.day)

End_date = dt.date (today.year,today.month-3,today.day+1)

If the month / date exceeds the limit, an error will be reported.

So you may need to write a loop to output these dates

Date_list = [] while start_date < end_date: if start_date.month < 12: date_list.append (start_date.strftime ('% Ymuri% m')) start_date = datetime.date (start_date.year,start_date.month + 1 Mae startdate.day) else: date_list.append (start_date.strftime ('% Ymurc% m')) start_date = datetime.date (start_date.year+1 Start_date.month-11,start_date.day) 8.3 Reformatting

As shown above, you can adjust the format of time with strftime (), and there are many options that can be adjusted, refer to:

Http://www.runoob.com/python/att-time-strftime.html

Year representation of% y double digits (00-99)

The year of% Y four digits is expressed (000-9999)

% m month (01-12)

One day in% d months (0-31)

% H 24-hour hours (0-23)

% I 12-hour hours (01-12)

% M minutes (00059)

% s seconds (00-59)

% a Local simplified week name

% A Local full week name

% b locally simplified month name

% B Local full month name

% c corresponding local date and time representations

% j one day of the year (001-366)

% p Local A.M. Or P.M. The equivalent symbol of

% U week of the year (00-53) Sunday is the beginning of the week

% w week (0-6), Sunday is the beginning of the week

% W week of the year (00-53) Monday is the beginning of the week

% x local corresponding date representation

% X local corresponding time representation

Name of Z current time zone

% number itself

Calculation of 8.4 weeks calculation of 8.5 working days

There is a convenient function in Excel called networkdays, which gives the start date, end date and holiday to calculate the number of working days between two days. However, pandas or datetime did not support this requirement well, so I found this module: business_calendar

Https://pypi.python.org/pypi/business_calendar/

8.5.1 calculate the number of working days between dates

For example, if you ask for a few working days from February 1 to 29, the conditions are known:

Work from Monday to Friday

From February 8th to 12th is a holiday

Date1 = datetime.datetime (2016, 1, 31) # Note: if you write February 1st, that day is not included. So on January 31st, date2 = datetime.datetime (2016jue 2jue 29) cal = Calendar (workdays = [MO, TU, WE, TH, FR], holidays= ['2016-02-08] bsday = cal.busdaycount (date1, date2) print (bsday) thank you for reading this article carefully. I hope the article "how to install and use Pandas" shared by the editor will be helpful to you. At the same time, I also hope that you will support and pay attention to the industry information channel. More related knowledge is waiting for you to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.