How to use Pandas to index and select data

2025-07-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article explains how to index and select data with pandas, a topic many people are not very familiar with. The editor has summarized the following content to make it easier to understand, and I hope you get something useful out of it.

We will use the 2019 national sales data of new energy vehicles as demonstration data. The data is saved in a csv file, which readers can download from the GitHub repository at https://github.com/pythonlibrary/practice-pandas-skills.git.

This article uses two libraries, pandas and numpy, so make sure both are installed correctly. The working environment is Jupyter Notebook. If you need to learn how to set up the environment, you can read the article "Work Environments of Data Scientists (Part 1): virtualenv and Jupyter Notebook".

The code used for the demonstration in the article can also be found in the GitHub repository mentioned earlier, with the corresponding notebook source file named index_select_data.ipynb.

First, import pandas and numpy in the notebook. By convention, pandas is imported as pd and numpy as np.

import pandas as pd
import numpy as np

1. Import dataset

In this section, we will import the data file as a data object in pandas, and show some basic information for this data object, so that we can understand the data we are going to work on.

Our original data file is in csv format, so we can use the read_csv function provided by pandas to load it into a pandas DataFrame, and then use the head method of the DataFrame object to view the first two rows.

df = pd.read_csv('NEV_sales.csv')
df.head(2)

The head method takes an integer parameter giving the number of rows we want; the default is five. Here we get two rows. You can see that the index is numeric and that the columns contain the brand and the sales from January to December 2019.

For the later demonstrations, we will create a new dataset from df called df_brand_index. It differs from df in that it uses brand as the index and lists the sales from January to December 2019. We use the set_index method to do this. This time, we use the tail method, the counterpart of head, to see the difference between df_brand_index and df.

df_brand_index = df.set_index('brand')
df_brand_index.tail(2)

As you can see in the output, the first column name, brand, sits slightly lower than the other column names: it has become the index. Unlike the original index, it has an index name, brand, but it supports the same operations as any other index.
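The steps so far can be tried without the real file. The following is only a sketch: the CSV text, the brand names (BrandA and so on) and the numbers are made up for illustration and stand in for NEV_sales.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for NEV_sales.csv: made-up brands and numbers.
csv_text = """brand,2019-12,2019-11
BrandA,120,100
BrandB,80,90
BrandC,60,70
Total,260,260
"""

df = pd.read_csv(io.StringIO(csv_text))  # default integer index 0..3
df_brand_index = df.set_index('brand')   # new frame with 'brand' as the index

first_two = df.head(2)                   # first two rows of df
last_two = df_brand_index.tail(2)        # last two rows of df_brand_index

print(first_two)
print(last_two)
```

Note that set_index returns a new DataFrame here, so df itself still has its brand column and its numeric index.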

In addition, the output shows two interesting things. First, the brand in the last row is the total, which means the last row of our original data is the sum of all the other rows. In a real data analysis this would distort the results and should be excluded, but since this article focuses on pandas operations, it does not affect us. Second, the monthly sales of the Mercedes-Benz brand in 2019 are NaN. NaN is a special value in pandas that represents a missing value; in our data set, the actual sales volume is zero.
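If the missing values really do mean zero sales, one possible way to handle them is sketched here. The data is made up, and replacing NaN with 0 is an assumption that only makes sense when missing truly means "nothing sold".

```python
import numpy as np
import pandas as pd

# Made-up numbers; BrandB's sales are missing and show up as NaN.
sales = pd.DataFrame(
    {'2019-12': [120.0, np.nan], '2019-11': [100.0, np.nan]},
    index=pd.Index(['BrandA', 'BrandB'], name='brand'),
)

missing_per_column = sales.isna().sum()  # number of NaN cells in each column
sales_filled = sales.fillna(0)           # new frame with NaN replaced by 0

print(missing_per_column)
print(sales_filled)
```

fillna returns a new DataFrame, so the original sales frame still contains the NaN values.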

Next, let's take a simple look at our data from three aspects.

The first is the amount of data: how much data are we analyzing? The shape attribute of DataFrame tells us how many rows and columns the data has. The following code returns the shapes of df_brand_index and df.

df_brand_index.shape, df.shape

df has 77 rows and 15 columns in total, while df_brand_index has 77 rows and 14 columns: we converted the brand column to the index, so it has one fewer column.

Next come the data types: are there any unexpected data types in our data? We are analyzing sales of new energy vehicles, so we expect all the data to be numbers rather than strings or anything else. The dtypes attribute of DataFrame gives us this information.

df_brand_index.dtypes

No problem: all columns are numeric (their type is float64). If object appears as the type of any column, that column contains non-numeric content.
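To see what a problematic column looks like, here is a small sketch with made-up data in which one column contains a stray string. The list comprehension at the end is just one possible way to flag such columns.

```python
import pandas as pd

# Made-up data: the '2019-11' column contains a stray string.
df = pd.DataFrame({
    'brand': ['BrandA', 'BrandB', 'BrandC'],
    '2019-12': [120.0, 80.0, 60.0],
    '2019-11': [100.0, 'n/a', 70.0],
})
df_brand_index = df.set_index('brand')

print(df.shape, df_brand_index.shape)  # (rows, columns) of each frame
print(df_brand_index.dtypes)           # one dtype per column

# Collect the columns whose dtype is not numeric.
non_numeric = [
    name for name in df_brand_index.columns
    if not pd.api.types.is_numeric_dtype(df_brand_index[name])
]
print(non_numeric)
```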

Finally, there is the basic information of sales, such as maximum, minimum, average, etc., which can be obtained by using DataFrame's describe() method.

df_brand_index.describe()

For example, in November 2019 the average sales of new energy vehicles across all brands was 3831, with a maximum of 72795. Unreasonable, right? Remember that when we used the tail method earlier, we saw that the last row of the data is the total, so this maximum is actually the total.
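The effect of a total row on the statistics can be reproduced on a small scale. This sketch uses made-up numbers and assumes the summary row is labelled 'Total'; dropping it by label is one option, not necessarily what the original notebook does.

```python
import pandas as pd

# Made-up numbers; the last row is the sum of the other rows.
df_brand_index = pd.DataFrame(
    {'2019-11': [100.0, 90.0, 70.0, 260.0]},
    index=pd.Index(['BrandA', 'BrandB', 'BrandC', 'Total'], name='brand'),
)

with_total = df_brand_index.describe()

# Drop the summary row by its label before computing the statistics.
brands_only = df_brand_index.drop(index='Total')
without_total = brands_only.describe()

print(with_total.loc[['mean', 'max']])
print(without_total.loc[['mean', 'max']])
```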

2. Column selection

We try to avoid the term column index, because the English pandas documentation does not refer to columns as a column index but simply as columns; calling them column indexes may lead to ambiguity.

One or more columns can be selected from a DataFrame with square brackets []. Note that selecting multiple columns still returns a DataFrame, while selecting a single column returns a Series (another common pandas data structure, at the same level as DataFrame).

sr_brand_index = df_brand_index['2019-11']
sr_brand_index.head(2)

Above, we selected the sales data for November 2019. Looking at the result, the index name is still brand, but the values are displayed without a column header.

Below, we select the sales data for November and December 2019. The result keeps the index name and is still a DataFrame, now with two columns, one for each month.

df_brand_index[['2019-12', '2019-11']].head(2)
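The difference between the two forms of [] can be checked directly. This is a sketch with made-up data; the third selection (a list containing a single column name) is an extra case added for illustration.

```python
import pandas as pd

# Made-up numbers for two hypothetical brands.
df_brand_index = pd.DataFrame(
    {'2019-12': [120.0, 80.0], '2019-11': [100.0, 90.0]},
    index=pd.Index(['BrandA', 'BrandB'], name='brand'),
)

one_column = df_brand_index['2019-11']                # a label -> Series
two_columns = df_brand_index[['2019-12', '2019-11']]  # a list -> DataFrame
one_column_frame = df_brand_index[['2019-11']]        # a list of one -> DataFrame

print(type(one_column).__name__, one_column.name)
print(type(two_columns).__name__, list(two_columns.columns))
print(type(one_column_frame).__name__)
```

The Series keeps the column label as its name, which is why it appears at the bottom of the printed Series rather than as a column header.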

3. Row selection

In this section, we will select data by row. The two most commonly used and officially recommended methods for row selection in pandas are loc and iloc, which are easily confused. According to the official explanation, loc stands for location and selects by label, while the i in iloc stands for integer, that is, integer location, which selects by position. What does that mean? Look at the comparison below.

Numeric Index

First, we use the df DataFrame. Remember that this object uses numbers as its index. We use loc to get the rows with index labels 0 through 4:

df.loc[0:4]

We end up with five rows, with index labels from 0 to 4.

Then, let's use iloc to get the rows from position 0 to position 4:

df.iloc[0:4]

Here we see the difference: we only get four rows, with index from 0 to 3. Why? iloc selects by integer position, and its behavior is similar to slicing a built-in Python list, so the result covers the half-open interval [0, 4).
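The difference is easier to see when the integer labels do not coincide with the positions. The following sketch uses a made-up frame whose labels start at 2 (an assumption chosen purely to separate labels from positions).

```python
import pandas as pd

# Made-up data whose integer labels (2..7) differ from the positions (0..5).
df = pd.DataFrame({'sales': [10, 20, 30, 40, 50, 60]}, index=[2, 3, 4, 5, 6, 7])

by_label = df.loc[2:4]      # labels 2, 3, 4: the end label is included
by_position = df.iloc[2:4]  # positions 2 and 3: the end position is excluded

print(by_label.index.tolist())     # labels of the rows selected by loc
print(by_position.index.tolist())  # labels of the rows selected by iloc
```

With the same slice 2:4, loc returns three rows and iloc returns two different rows, because one counts labels and the other counts positions.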

At this point the reader may be a little confused. Don't worry: the following example should make clear why loc and iloc are so easily confused.

String Index

Let's use loc and iloc on df_brand_index to see the effect. Remember, Index in df_brand_index is the brand name.

If, as above, we use df_brand_index.loc[0:4] to try to get the first five rows, we get an exception.

Why? Because loc selects by label, and the index labels of this dataset are strings rather than numbers. The correct usage is:

df_brand_index.loc['Beijing':'Baojun']

With that in mind, the use of iloc is straightforward:

df_brand_index.iloc[0:4]
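The same comparison can be reproduced on a string index. In this sketch the brand names and numbers are made up, and the exception type is caught loosely: pandas typically raises a TypeError for an integer slice on a string index, but treat the exact type as an assumption.

```python
import pandas as pd

# Made-up numbers for four hypothetical brands.
df_brand_index = pd.DataFrame(
    {'2019-11': [100.0, 90.0, 70.0, 50.0]},
    index=pd.Index(['BrandA', 'BrandB', 'BrandC', 'BrandD'], name='brand'),
)

# Integer slice bounds are not labels in this index, so loc rejects them.
try:
    df_brand_index.loc[0:2]
    loc_error = None
except (TypeError, KeyError) as exc:
    loc_error = type(exc).__name__

by_label = df_brand_index.loc['BrandA':'BrandC']  # both end labels included
by_position = df_brand_index.iloc[0:3]            # positions 0, 1 and 2

print(loc_error)
print(by_label.equals(by_position))
```

To get the same rows, the iloc slice has to end one position past the last row wanted, while the loc slice names the last label itself.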

4. Row + column selection, find the element

To avoid confusion, we will continue to use df_brand_index for the demos. If we want the sales of the BAIC brand in November 2019, or the sales of the first five brands from October to December 2019, we need to select by row and column together. Pandas returns an appropriate data type according to the shape of the selection, as the following examples show.
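As a quick illustration of how the return type follows the shape of the selection, here is a sketch on made-up data. The brand names are hypothetical, and the columns are assumed to run from December back to October.

```python
import pandas as pd

# Made-up numbers; columns are ordered from December back to October.
df_brand_index = pd.DataFrame(
    {
        '2019-12': [120.0, 80.0, 60.0],
        '2019-11': [100.0, 90.0, 70.0],
        '2019-10': [95.0, 85.0, 75.0],
    },
    index=pd.Index(['BrandA', 'BrandB', 'BrandC'], name='brand'),
)

cell = df_brand_index.loc['BrandA', '2019-11']                      # single value
row = df_brand_index.loc['BrandA', '2019-12':'2019-10']             # Series
block = df_brand_index.loc['BrandA':'BrandB', '2019-12':'2019-10']  # DataFrame

# Position-based equivalents of the three selections above.
same_cell = df_brand_index.iloc[0, 1]
same_row = df_brand_index.iloc[0, 0:3]
same_block = df_brand_index.iloc[0:2, 0:3]

print(cell, type(row).__name__, type(block).__name__)
```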

Obtain the sales of BAIC in November 2019:

df_brand_index.loc['Beijing', '2019-11']

If the iloc method is used, the following method can get the equivalent result.

df_brand_index.iloc[0, 1]  # 1st column is 2019-11 in df_brand_index

Obtain the sales of the first five brands from October to December 2019:

df_brand_index.loc['Beijing':'Baojun', '2019-12':'2019-10']

Similarly, using the iloc method, the following method can also get an equivalent result

df_brand_index.iloc[0:5, 0:3]

5. Condition selection

Another common way to select and filter data is by the values of the elements. For example, we want the data for brands that sold more than 3000 units in both November and December 2019.

df_brand_index[(df_brand_index['2019-12'] > 3000) & (df_brand_index['2019-11'] > 3000)]

Here we use pandas's Boolean selection: a Boolean condition is supplied inside []. The condition (df_brand_index['2019-12'] > 3000) & (df_brand_index['2019-11'] > 3000) means that sales exceeded 3000 in both November and December. Note that the parentheses in the condition are essential: & binds more tightly than >, so without them pandas will throw an exception.
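One way to make the condition easier to read, and to sidestep the precedence problem, is to give each comparison a name first. This is a sketch with made-up numbers; the variable names and the | (or) variant are added for illustration.

```python
import pandas as pd

# Made-up numbers for three hypothetical brands.
df_brand_index = pd.DataFrame(
    {'2019-12': [5000.0, 3500.0, 1000.0], '2019-11': [4000.0, 2000.0, 6000.0]},
    index=pd.Index(['BrandA', 'BrandB', 'BrandC'], name='brand'),
)

dec_mask = df_brand_index['2019-12'] > 3000   # Boolean Series, one flag per row
nov_mask = df_brand_index['2019-11'] > 3000

both = df_brand_index[dec_mask & nov_mask]    # rows where both flags are True
either = df_brand_index[dec_mask | nov_mask]  # rows where at least one is True

print(list(both.index))
print(list(either.index))
```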

6. Find element location

Finally, real projects sometimes need to find where a value sits in the data. For example, we may want to know which brands had sales of 6046 in November 2019, or all brands and corresponding months with sales of 6046 in the entire DataFrame.

Find in a known column

If we want to know which brands had sales of 6046 in November 2019, we can use the conditional selection from section 5 to select the matching rows first, and then use the index attribute of the resulting DataFrame to get the corresponding brands.

df_brand_index[df_brand_index['2019-11'] == 6046].index
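A small sketch of this lookup on made-up data, including what comes back when nothing matches. The brand names and the value -1 used for the no-match case are hypothetical.

```python
import pandas as pd

# Made-up numbers; only BrandB sold exactly 6046 units in November.
df_brand_index = pd.DataFrame(
    {'2019-12': [120.0, 80.0, 60.0], '2019-11': [100.0, 6046.0, 70.0]},
    index=pd.Index(['BrandA', 'BrandB', 'BrandC'], name='brand'),
)

matches = df_brand_index[df_brand_index['2019-11'] == 6046].index
brands = matches.tolist()  # plain Python list of the matching index labels

# A value that does not occur gives an empty result rather than an error.
no_match = df_brand_index[df_brand_index['2019-11'] == -1].index.tolist()

print(brands)
print(no_match)
```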

Look in the entire DataFrame

If we want to know all the brands and corresponding months with sales of 6046 in the entire DataFrame, the selection methods shown so far are not enough, but we can do it quickly with the help of numpy.

numpy provides the where function, which returns the row and column positions of the elements in the input numpy array that meet a condition; the brand can then be looked up in the DataFrame by position.

DataFrame has a to_numpy method that converts the DataFrame to a numpy array. Combining the two gives us the positions of the row and the column.

idx = np.where(df_brand_index.to_numpy() == 6046)[0][0]
col = np.where(df_brand_index.to_numpy() == 6046)[1][0]
idx, col

The code above outputs (2, 1), where 2 is the row position and 1 is the column position. These positions can then be mapped back to the corresponding brand and month in df_brand_index.
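Mapping positions back to labels might look like the following sketch. It uses made-up data with two matching cells, calls np.where once instead of twice, and collects every match rather than only the first; the variable names rows, cols and hits are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up numbers; the value 6046 occurs in two different cells.
df_brand_index = pd.DataFrame(
    {'2019-12': [6046.0, 80.0, 60.0], '2019-11': [100.0, 90.0, 6046.0]},
    index=pd.Index(['BrandA', 'BrandB', 'BrandC'], name='brand'),
)

# np.where on a 2-D Boolean array returns (row positions, column positions).
rows, cols = np.where(df_brand_index.to_numpy() == 6046)

# Translate each position pair into (index label, column label).
hits = [
    (df_brand_index.index[r], df_brand_index.columns[c])
    for r, c in zip(rows, cols)
]

print(hits)
```

Taking only element [0] of each array, as in the original code, returns the first match in row-major order and silently ignores any others.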
