This article introduces the functions and usage of Pandas. In real cases many people run into situations they do not know how to handle, so let the editor walk you through them. I hope you read it carefully and get something out of it!
1. Important preface
Doing data analysis, I have noticed a common problem at the entry stage: many students who come to the field out of interest quickly pick up basic Python syntax and then plunge straight into the classic "Python for Data Analysis". After chewing through it, they seem to know a little about everything, but in practice they do not know where to start and their skills are full of gaps.
2. Introduction to pandas
There is a saying that goes around the trade: an analyst who does not know Master Pan (Pandas) is a veteran in vain.
Pandas is a professional data analysis tool built on NumPy. It can handle all kinds of data sets flexibly and efficiently, and it is the workhorse for the case analysis later in this article. It provides two data structures, DataFrame and Series. Roughly speaking, a DataFrame can be understood as a table in Excel, and a Series as a single column of that table; all of the Pandas operations learned and used later are based on these tables and columns. (For the mapping between Pandas and Excel, Zhang Junhong's "Compare Excel, Easily Learn Python Data Analysis" is recommended.)
It should be emphasized that compared with Excel and SQL, Pandas only changes the way we call and process data; the core is still a series of processing steps applied to the source data. Before any formal processing, it is more important to think first and act later: clarify why you are analysing, and only process and analyse the data once the analysis approach is clear. That often gets twice the result with half the effort.
3. Create, read, and store
1. Create
What should we do if we want to construct the following table in Pandas?
Don't forget, the first step is always to import our library: import pandas as pd.
The most common way to construct a DataFrame is a dictionary plus lists. The statement is very simple: wrap everything in a dictionary, then type out each column heading and its corresponding column values in turn (the values must be lists). The order of the columns here does not matter:
import pandas as pd
import numpy as np

print(pd.__version__)

df1 = pd.DataFrame({'salary': [5000, 7000, 8000, 8500],
                    'performance score': [60, 84, 98, 91],
                    'remarks': ['failed', 'good', 'best', 'excellent']},
                   index=['Lao Wang', 'Xiao Liu', 'Xiao Zhao', 'Lao Gong'])
df1
This is what the DataFrame looks like in Jupyter Notebook, and it corresponds directly to what the same table would look like in Excel; we control the data through its columns, index and values.
PS: if we do not specify index when we create it, the system will automatically generate an index starting at 0.
2. Read
More often, we read existing file data directly into Pandas to work on. Here we introduce two very similar reading methods: one for files in CSV format, the other for files in Excel format (.xlsx and .xls suffixes).
Read the csv file:
df2 = pd.read_csv("/home/kg/liujinjie/phonebook/traffic exercise data.csv", engine="python")
df2.head()
engine is the parsing engine used to read the CSV file; specifying "python" generally avoids errors caused by Chinese characters and encodings. Reading an Excel file has much the same flavour:
df2 = pd.read_excel("/home/kg/liujinjie/phonebook/traffic exercise data.xls")
df2.head()
Very easy. In fact, read_csv and read_excel take a number of other parameters, such as header, sep and names, which you can explore further. In practice the data sources are usually fairly regular, so more often than not we just read them directly.
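As a minimal sketch of those parameters (the file name and column names here are made up purely for illustration):

# A hypothetical headerless, tab-separated export: sep sets the delimiter,
# header=None says there is no header row, and names supplies the column labels.
df_raw = pd.read_csv("some_export.txt", sep="\t", header=None,
                     names=["source details", "number of visitors", "conversion rate"])
df_raw.head()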
3. Storage
Storing data is just as simple and crude, and looks much the same:
df2.to_csv("/home/kg/liujinjie/phonebook/sowhat.csv")
df2.to_excel("/home/kg/liujinjie/phonebook/sowhat.xlsx")

4. Data source

The case data used in the rest of this article records traffic channels and looks like this:

Traffic source   Source details   Number of visitors   Payment conversion rate   Customer unit price
Level 1          -A               35188                9.98%                     54.3
Level 1          -B               28467                11.27%                    99.93
Level 1          -C               13747                2.54%                     0.08
Level 1          -D               5183                 2.47%                     37.15
Level 1          -E               4361                 4.31%                     91.73
Level 1          -F               4063                 11.57%                    65.09
Level 1          -G               2122                 10.27%                    86.45
Level 1          -H               2041                 7.06%                     44.07
Level 1          -I               1991                 6.52%                     104.57
Level 1          -J               1981                 5.75%                     75.93
Level 1          -K               1958                 14.71%                    85.03
Level 1          -L               1780                 13.15%                    98.87
Level 1          -M               1447                 1.04%                     80.07
Level 2          -A               39048                11.60%                    91.91
Level 2          -B               3316                 7.09%                     66.28
Level 2          -C               2043                 5.04%                     41.91
Level 3          -A               23140                9.69%                     83.75
Level 3          -B               14813                20.14%                    82.97
Level 4          -A               216                  1.85%                     94.25
Level 4          -B               31                   0.00%
Level 4          -C               17                   0.00%
Level 4          -D               3                    0.00%
4. Quickly getting to know the data
Here we take our case data as an example to quickly become familiar with viewing the first or last N rows, getting an overview of data formats, and looking at basic statistics.
1. Check the data: peek at the head and the tail
Often we just want an overview of the data content. df.head() shows the first five rows by default, and its counterpart df.tail() shows the last five rows. Both accept a value to control how many rows are shown, for example df.head(10) shows the first 10 rows.
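In code, on the case data df2 loaded earlier:

df2.head()    # first 5 rows by default
df2.tail()    # last 5 rows
df2.head(10)  # first 10 rows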
2. View the data formats
df.info() helps us figure out the data type of each column and where values are missing, all in one step:
3. Overview of statistical information
df.describe() quickly calculates the key statistical indicators of the numeric data, such as the mean, median, standard deviation, and so on.
We had five columns of data, so why are only two columns returned? Because this operation only applies to numeric columns. count is the number of non-null values in each column; mean, std, min and max are the column's mean, standard deviation, minimum and maximum; and 25%, 50% and 75% are the corresponding quantiles.
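A minimal sketch of both calls on the case data (the column name is the translated one from the table above, so treat it as an assumption about the real file):

df2.info()                            # data type and non-null count for every column
df2.describe()                        # count, mean, std, min, quartiles, max for numeric columns
df2['Number of visitors'].describe()  # the same statistics for a single column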
5. Basic handling of columns
Here we borrow the logic of the four basic SQL operations to sort out the basic ways of handling columns: add, delete, select, and change.
A friendly tip: when using Pandas, try to avoid thinking in terms of rows or Excel cells. Gradually develop column-wise thinking instead: every value in a column is of the same kind, so column-wise processing is blazingly fast, much like HBase.
1. Add:
Add a column with df['new column name'] = new column values, where the values can be assigned based on the original data:
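A minimal sketch, assuming the translated column names from the table above (the new column names are made up):

# a constant column
df2['extraction batch'] = 1
# a column computed from the original data
df2['visitors in 10k'] = df2['Number of visitors'] / 10000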
2. Delete:
We use the drop function to delete a column. axis=1 indicates that the operation is on columns; if inplace is True the source data is modified in place, otherwise the source data is left intact.
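A sketch of both variants, reusing the hypothetical column added above:

# return a new DataFrame without the column; df2 itself is unchanged
df2.drop('visitors in 10k', axis=1)
# modify df2 directly
df2.drop('visitors in 10k', axis=1, inplace=True)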
3. Select:
What if you want to select a single column? df['column name'] will do:
How do you select multiple columns? Pass in a list: df[['first column', 'second column', 'third column', ...]]
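On the case data this looks roughly like the following (column names are the translated ones, so treat them as an assumption):

df2['Source details']                          # a single column comes back as a Series
df2[['Source details', 'Number of visitors']]  # a list of names returns a DataFrame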
4. Change:
More complicated modifications that filter on specific conditions or rows will be covered in detail later with a case study. Here we only cover the simplest change: df['old column name'] = some value or some column of values, which overwrites the original column values.
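For example (again just a sketch on the translated column names):

# overwrite a whole column with a constant value
df2['extraction batch'] = 2
# overwrite a column with values computed from itself
df2['Customer unit price'] = df2['Customer unit price'].round(1)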
6. Common data types and operations
1. String
String type is one of the most commonly used formats. String operations in Pandas are almost the same as native string operations, except that you need to add .str before the operation.
Tip: when we used df2.info() earlier to view the data types, the non-numeric columns came back as object. We will not go into the deeper mechanics of how object differs from str; for everyday practical use it is enough to understand that object corresponds to the str format, int64 to int, and float64 to float.
In the case data, we notice that the source details column, perhaps due to a historical quirk of the exporting system, has a '-' symbol in front of every string. It is ugly and useless, so let's take it off:
Generally speaking, the cleaned column should then replace the original column:
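A minimal sketch, assuming the column is called 'Source details' as in the translated table:

# remove the '-' characters, then overwrite the original column with the cleaned values
df2['Source details'] = df2['Source details'].str.replace('-', '')
# or, to strip only a leading '-':
# df2['Source details'] = df2['Source details'].str.lstrip('-')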
2. Numerical type
For numeric data the most common operation is calculation, which falls into two cases: operations with a single value, and operations between columns of equal length. Taking the case data as an example, we know the number of visitors for each source. Suppose we now want to add 10000 visitors to every channel. How do we do it?
Just select the visitors column and add 10000; Pandas automatically adds 10000 to the value in every row. Other single-value operations (subtraction, multiplication, division) work the same way.
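In code (the column name is the translated one, an assumption about the real file):

# add 10000 visitors to every channel
df2['Number of visitors'] = df2['Number of visitors'] + 10000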
Operations between columns are just as concise. The source data contains the number of visitors, the payment conversion rate and the customer unit price, but in practice we care more about the sales contributed by each channel (sales = number of visitors x conversion rate x customer unit price).
The corresponding statement: df['sales'] = df['number of visitors'] * df['conversion rate'] * df['customer unit price']
But why does it throw an error?
The error is caused by mixing numeric and non-numeric data in a calculation: Pandas treats the conversion rate, with its % sign, as a string. We need to remove the percent sign first and then convert the column to floating-point data:
Note that this turns 9.98% into 9.98, so we also need to divide the payment conversion rate by 100 to restore the true ratio:
Then sales can be calculated by multiplying the three indicators:
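Putting the three steps together, a minimal sketch with the translated column names:

# 1. strip the % sign and convert the column to floating point
df2['Payment conversion rate'] = df2['Payment conversion rate'].str.replace('%', '').astype('float')
# 2. 9.98 becomes 0.0998, restoring the true ratio
df2['Payment conversion rate'] = df2['Payment conversion rate'] / 100
# 3. sales = number of visitors x conversion rate x customer unit price
df2['sales'] = df2['Number of visitors'] * df2['Payment conversion rate'] * df2['Customer unit price']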
3. Time type
Time series in Pandas is very deep water. Here we only cover the most basic, everyday time formats; readers interested in time series can look up the relevant material on their own for a deeper understanding.
Take the case data as an example. Our channel data was extracted on August 2, 2019, and channel data for other dates may be involved later, so we need to add a time column to tell them apart. The time format commonly used in Excel looks like 2019-8-3. Let's implement this in Pandas:
In real business, Pandas sometimes reads the date field of a file as a string. Here we first assign the string 2019-8-3 to a new date column, and then use the to_datetime() function to convert the string type into a time format:
Once converted to a time format (datetime64 in this case), we can process the data efficiently by treating it as time. For example, if I now want to know how many days remain from the extraction date until the end of the year ('2020-12-31'), I can just subtract directly (this function accepts either a sequence of time-formatted strings or a single string):
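A minimal sketch of these steps (the names of the new columns are made up for illustration):

# assign a date as a plain string, then convert the column to datetime64
df2['date'] = '2019-8-3'
df2['date'] = pd.to_datetime(df2['date'])
# days from the extraction date to the end of the year: plain subtraction on datetimes
df2['days to year end'] = pd.to_datetime('2020-12-31') - df2['date']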
Finally, let's take a quick review:
The first step is to understand what Pandas really is.
The second step is to learn how to create, read and store data.
The third step is to learn how to view the data quickly once you have it.
The fourth step, with a basic understanding of the data, is to perform simple column additions, deletions, selections and changes.
The fifth step, after mastering the basic operations, is to take a first look at the common data types in Pandas.
This concludes the introduction to the functions and usage of Pandas. Thank you for reading. If you want to learn more about the industry, you can follow this site; the editor will keep producing more high-quality practical articles for you!