2025-03-29 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains the commonly used indexing methods of Pandas in detail. The topic is very practical, so it is shared here for reference, and hopefully you will take something away from it.
Preface
Two commonly used indexing methods are introduced in detail, scenario by scenario:
The first is the position-based (integer) index. Its examples are short and a rough understanding is enough: it is used occasionally in practice, but it applies less widely than the second.
The second is the name-based (label) index. This is the one to underline on the blackboard, because it becomes an important cornerstone of data cleaning and analysis later on.
First of all, a brief introduction to the case data of the exercise:
traffic source  source details  number of visitors  payment conversion rate  guest unit price
level 1         A               35188               9.98%                    54.3
level 1         B               28467               11.27%                   99.93
level 1         C               13747               2.54%                    0.08
level 1         D               5183                2.47%                    37.15
level 1         E               4361                4.31%                    91.73
level 1         F               4063                11.57%                   65.09
level 1         G               2122                10.27%                   86.45
level 1         H               2041                7.06%                    44.07
level 1         I               1991                16.5%                    104.57
level 1         J               1981                5.75%                    75.93
level 1         K               1958                14.71%                   85.03
level 1         L               1780                13.15%                   98.87
level 1         M               1447                1.04%                    80.07
level 2         A               39048               11.60%                   91.91
level 2         B               3316                7.09%                    66.28
level 2         C               2043                5.04%                    41.91
level 3         A               23140               9.69%                    83.75
level 3         B               14813               20.14%                   82.97
level 4         A               216                 1.85%                    94.25
level 4         B               31                  0.00%
level 4         C               17                  0.005%
level 4         D               3                   0.00%
Like the first dataset, it records the number of visitors, the payment conversion rate and the customer unit price (the guest unit price column) for each channel detail under the different traffic sources. The dataset is short (the complex case dataset will arrive, as promised, at the end of the basics series), but it is representative enough, so let's start putting our indexes to work.
1. Location-based (number) index
Let's take a look at how the index works:
df.iloc[row index, column index]
The first position is the row index: enter the positions of the rows we want to take.
The second position is the column index: enter the positions of the columns we want to take.
We fill in the row and column parameters according to the actual situation.
Scenario 1 (row selection)
Goal: select all rows where the traffic source is equal to level 1.
Idea: counting down the screen with a finger, the level 1 channels run from row 1 to row 13, which corresponds to row indexes 0 to 12. Python slices exclude the end position by default, so to select the rows with index 0 to 12 we have to enter 0:13. We want every column, so a bare colon (:) in the column position is enough.
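Scenario 1 can be sketched as follows. This is a minimal illustration rather than the author's original code: it assumes the case table has been typed into a DataFrame named df2, with the conversion rate stored as plain percent numbers and the level 4 unit prices that are missing from the table stored as NaN.

```python
import pandas as pd

nan = float('nan')  # stands in for unit prices that are missing from the table

# Case data as read from the table above (the numeric encoding is an assumption).
levels = ['level 1'] * 13 + ['level 2'] * 3 + ['level 3'] * 2 + ['level 4'] * 4
details = list('ABCDEFGHIJKLM') + list('ABC') + list('AB') + list('ABCD')
visitors = [35188, 28467, 13747, 5183, 4361, 4063, 2122, 2041, 1991, 1981, 1958,
            1780, 1447, 39048, 3316, 2043, 23140, 14813, 216, 31, 17, 3]
conversion = [9.98, 11.27, 2.54, 2.47, 4.31, 11.57, 10.27, 7.06, 16.5, 5.75, 14.71,
              13.15, 1.04, 11.60, 7.09, 5.04, 9.69, 20.14, 1.85, 0.0, 0.005, 0.0]
price = [54.3, 99.93, 0.08, 37.15, 91.73, 65.09, 86.45, 44.07, 104.57, 75.93, 85.03,
         98.87, 80.07, 91.91, 66.28, 41.91, 83.75, 82.97, 94.25, nan, nan, nan]
df2 = pd.DataFrame({
    'traffic source': levels,
    'source details': details,
    'number of visitors': visitors,
    'payment conversion rate': conversion,
    'guest unit price': price,
})

# Rows at positions 0..12 (the end position 13 is excluded), all columns.
level_one = df2.iloc[0:13, :]
print(level_one)
```

The slice 0:13 returns exactly the 13 level 1 rows, because iloc follows the usual Python rule that the end of a slice is not included.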
Scenario 2 (column selection)
Goal: we want to look at the traffic source and customer unit price of all channels.
Idea: all traffic channels means all rows, so we enter a colon (:) in the row position. For the columns, traffic source is the first column and customer unit price is the fifth, so the corresponding column indexes are 0 and 4.
Note that to select non-adjacent columns we must first put the positions in a list, here [0, 4]. For a contiguous selection no list is needed: just enter 0:5 (the column with index 0 through the column with index 4).
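The difference between a list of positions and a slice can be sketched like this. The small frame below uses a few rows taken from the case table and is only an assumption about how the data is stored; df2 and the column labels are illustrative names.

```python
import pandas as pd

# A few rows from the case table, enough to show column selection.
df2 = pd.DataFrame({
    'traffic source': ['level 1', 'level 1', 'level 2', 'level 3'],
    'source details': ['A', 'B', 'A', 'A'],
    'number of visitors': [35188, 28467, 39048, 23140],
    'payment conversion rate': [9.98, 11.27, 11.60, 9.69],
    'guest unit price': [54.3, 99.93, 91.91, 83.75],
})

# Non-adjacent columns: positions go in a list.
picked = df2.iloc[:, [0, 4]]

# Adjacent columns: a slice is enough (positions 0 through 4).
contiguous = df2.iloc[:, 0:5]

print(picked)
print(contiguous.columns.tolist())
```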
Scenario 3 (row and column cross selection)
Goal: we want to look at the traffic source, source details, number of visitors and payment conversion rate of the level 2 and level 3 traffic sources.
Idea: look at the rows first. The level 2 and level 3 channels occupy row indexes 13 to 17. Once again, a slice includes its start but not its end, so the row parameter we pass in is 13:18. For the columns we need traffic source, source details, number of visitors and payment conversion rate, that is, the first four columns, so we pass in 0:4.
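A minimal sketch of Scenario 3, under the same assumptions as before: df2 is a hand-typed copy of the case table, conversion rates are plain percent numbers, and the missing level 4 unit prices are NaN.

```python
import pandas as pd

nan = float('nan')  # stands in for unit prices that are missing from the table

# Case data as read from the table above (the numeric encoding is an assumption).
levels = ['level 1'] * 13 + ['level 2'] * 3 + ['level 3'] * 2 + ['level 4'] * 4
details = list('ABCDEFGHIJKLM') + list('ABC') + list('AB') + list('ABCD')
visitors = [35188, 28467, 13747, 5183, 4361, 4063, 2122, 2041, 1991, 1981, 1958,
            1780, 1447, 39048, 3316, 2043, 23140, 14813, 216, 31, 17, 3]
conversion = [9.98, 11.27, 2.54, 2.47, 4.31, 11.57, 10.27, 7.06, 16.5, 5.75, 14.71,
              13.15, 1.04, 11.60, 7.09, 5.04, 9.69, 20.14, 1.85, 0.0, 0.005, 0.0]
price = [54.3, 99.93, 0.08, 37.15, 91.73, 65.09, 86.45, 44.07, 104.57, 75.93, 85.03,
         98.87, 80.07, 91.91, 66.28, 41.91, 83.75, 82.97, 94.25, nan, nan, nan]
df2 = pd.DataFrame({
    'traffic source': levels,
    'source details': details,
    'number of visitors': visitors,
    'payment conversion rate': conversion,
    'guest unit price': price,
})

# Rows 13..17 (level 2 and level 3) crossed with the first four columns.
cross = df2.iloc[13:18, 0:4]
print(cross)
```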
2. Index based on name (label)
To make a side-by-side comparison easy, we reuse the three scenarios above.
Scenario 1: select all the rows of the level 1 channels.
Idea: this time we do not have to count positions one by one. To filter all the rows whose traffic source is level 1, we only need a comparison that checks which values in the traffic source column are equal to level 1.
The result is a column of True and False (Boolean) values, which here mean level 1 and not level 1 respectively. In the loc method, we can pass this Boolean result in the row position. Pandas returns the rows where the result is True (here the rows with index 0 to 12) and discards the rows where it is False.
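The Boolean filter described above might look like the following sketch. It assumes the case table is held in a DataFrame named df2 and that the level 1 rows are stored with the exact string 'level 1'; both are assumptions made for illustration.

```python
import pandas as pd

nan = float('nan')  # stands in for unit prices that are missing from the table

# Case data as read from the table above (the numeric encoding is an assumption).
levels = ['level 1'] * 13 + ['level 2'] * 3 + ['level 3'] * 2 + ['level 4'] * 4
details = list('ABCDEFGHIJKLM') + list('ABC') + list('AB') + list('ABCD')
visitors = [35188, 28467, 13747, 5183, 4361, 4063, 2122, 2041, 1991, 1981, 1958,
            1780, 1447, 39048, 3316, 2043, 23140, 14813, 216, 31, 17, 3]
conversion = [9.98, 11.27, 2.54, 2.47, 4.31, 11.57, 10.27, 7.06, 16.5, 5.75, 14.71,
              13.15, 1.04, 11.60, 7.09, 5.04, 9.69, 20.14, 1.85, 0.0, 0.005, 0.0]
price = [54.3, 99.93, 0.08, 37.15, 91.73, 65.09, 86.45, 44.07, 104.57, 75.93, 85.03,
         98.87, 80.07, 91.91, 66.28, 41.91, 83.75, 82.97, 94.25, nan, nan, nan]
df2 = pd.DataFrame({
    'traffic source': levels,
    'source details': details,
    'number of visitors': visitors,
    'payment conversion rate': conversion,
    'guest unit price': price,
})

# Element-wise comparison: one True/False value per row.
is_level_one = df2['traffic source'] == 'level 1'

# Pass the Boolean Series in the row position; keep every column.
level_one = df2.loc[is_level_one, :]
print(level_one)
```

The result matches the iloc version of the same scenario, but nobody had to count rows to get it.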
Scenario 2: we want to look at the traffic source and customer unit price of all channels.
Idea: all channels means all rows, so we enter a colon (:) in the row position. To extract the traffic source and customer unit price columns, enter their names in the column position. Since two columns are involved, the names have to be wrapped in a list.
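Selecting columns by name can be sketched as below. The frame holds a few rows from the case table; the name df2 and the exact column labels are assumptions for illustration.

```python
import pandas as pd

# A few rows from the case table, enough to show label-based column selection.
df2 = pd.DataFrame({
    'traffic source': ['level 1', 'level 1', 'level 2', 'level 3'],
    'source details': ['A', 'B', 'A', 'A'],
    'number of visitors': [35188, 28467, 39048, 23140],
    'payment conversion rate': [9.98, 11.27, 11.60, 9.69],
    'guest unit price': [54.3, 99.93, 91.91, 83.75],
})

# All rows (:) and two named columns, wrapped in a list.
source_and_price = df2.loc[:, ['traffic source', 'guest unit price']]
print(source_and_price)
```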
Scenario 3: we want to extract the traffic source, source details, number of visitors and payment conversion rate of the level 2 and level 3 traffic sources.
Idea: extract the rows with a condition, and extract the columns by passing in their names.
df2.loc[df2['traffic source'].isin(['level 2', 'level 3']), ['traffic source', 'source details', 'number of visitors', 'payment conversion rate']]
A quick plug for the isin function: it checks whether each value of a column (Series) is equal to any of the values in a list. For example, df2['traffic source'].isin(['level 2', 'level 3']) checks whether each value in the traffic source column is equal to 'level 2' or 'level 3'. If it matches either one, the result is True; otherwise it is False. Passing this Boolean result in the row position easily gives us the channels whose traffic source is level 2 or level 3.
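To see what isin returns on its own, here is a small sketch. The five rows are taken from the case table; df2 and the string values 'level 2' and 'level 3' are assumptions about how the data is stored.

```python
import pandas as pd

# One row per traffic source level, plus a second level 1 row.
df2 = pd.DataFrame({
    'traffic source': ['level 1', 'level 1', 'level 2', 'level 3', 'level 4'],
    'source details': ['A', 'B', 'A', 'A', 'A'],
    'number of visitors': [35188, 28467, 39048, 23140, 216],
})

# True where the value matches any entry of the list, False otherwise.
mask = df2['traffic source'].isin(['level 2', 'level 3'])
print(mask.tolist())

# The mask goes in the row position of loc.
selected = df2.loc[mask, ['traffic source', 'number of visitors']]
print(selected)
```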
Since loc applies to far more situations, it deserves some extra practice, so let's work through one more down-to-earth scenario.
Before introducing the scenario, let's take 30 seconds to see how to compute summary statistics for a column (Series) in Pandas. The details are as follows:
df2['number of visitors'].mean()    # mean
df2['number of visitors'].std()     # standard deviation
df2['number of visitors'].median()  # median
df2['number of visitors'].max()     # maximum
df2['number of visitors'].min()     # minimum
Just append a method call to the column and the mean, standard deviation and other statistics come out. With that understood, let's officially move on to Scenario 4.
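As a runnable illustration of these one-line statistics, the sketch below puts the visitor counts from the case table into a Series. The variable name visitors is hypothetical; the method calls are standard pandas Series methods.

```python
import pandas as pd

# Visitor counts as read from the case table.
visitors = pd.Series(
    [35188, 28467, 13747, 5183, 4361, 4063, 2122, 2041, 1991, 1981, 1958,
     1780, 1447, 39048, 3316, 2043, 23140, 14813, 216, 31, 17, 3],
    name='number of visitors',
)

print('mean  ', visitors.mean())
print('std   ', visitors.std())     # sample standard deviation (ddof=1)
print('median', visitors.median())
print('max   ', visitors.max())
print('min   ', visitors.min())
```

Note that std uses the sample standard deviation by default, which differs slightly from the population value.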
Scenario 4: for traffic channel data, what we should really focus on is high-quality channels. If we define a high-quality channel as one whose number of visitors, conversion rate and customer unit price are all above average, how do we find these channels?
Idea: a high-quality channel must meet three conditions at the same time: number of visitors, conversion rate and customer unit price all above their averages. That is the key to solving the problem. The first step is to look at the three averages.
Then check whether each metric column is greater than its average:
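The three averages can be computed in one call, as in this sketch. It assumes df2 is a hand-typed copy of the case table, with conversion rates as plain percent numbers and the missing level 4 unit prices as NaN (which mean skips by default), so the exact averages depend on those assumptions.

```python
import pandas as pd

nan = float('nan')  # stands in for unit prices that are missing from the table

# Case data as read from the table above (the numeric encoding is an assumption).
levels = ['level 1'] * 13 + ['level 2'] * 3 + ['level 3'] * 2 + ['level 4'] * 4
details = list('ABCDEFGHIJKLM') + list('ABC') + list('AB') + list('ABCD')
visitors = [35188, 28467, 13747, 5183, 4361, 4063, 2122, 2041, 1991, 1981, 1958,
            1780, 1447, 39048, 3316, 2043, 23140, 14813, 216, 31, 17, 3]
conversion = [9.98, 11.27, 2.54, 2.47, 4.31, 11.57, 10.27, 7.06, 16.5, 5.75, 14.71,
              13.15, 1.04, 11.60, 7.09, 5.04, 9.69, 20.14, 1.85, 0.0, 0.005, 0.0]
price = [54.3, 99.93, 0.08, 37.15, 91.73, 65.09, 86.45, 44.07, 104.57, 75.93, 85.03,
         98.87, 80.07, 91.91, 66.28, 41.91, 83.75, 82.97, 94.25, nan, nan, nan]
df2 = pd.DataFrame({
    'traffic source': levels,
    'source details': details,
    'number of visitors': visitors,
    'payment conversion rate': conversion,
    'guest unit price': price,
})

metrics = ['number of visitors', 'payment conversion rate', 'guest unit price']

# Column-wise means; NaN entries are skipped by default.
means = df2[metrics].mean()
print(means)
```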
df2['number of visitors'] > df2['number of visitors'].mean()
df2['payment conversion rate'] > df2['payment conversion rate'].mean()
df2['guest unit price'] > df2['guest unit price'].mean()
The three conditions must hold at the same time, so the relationship between them is "and". In pandas, conditions that must all hold are joined with the & symbol, and each condition needs its own parentheses because & binds more tightly than comparison operators such as >. If the relationship is "or" (any one condition is enough), join them with the | symbol:
(df2['number of visitors'] > df2['number of visitors'].mean()) & (df2['payment conversion rate'] > df2['payment conversion rate'].mean()) & (df2['guest unit price'] > df2['guest unit price'].mean())
After this combination, True means that the channel satisfies all three conditions: number of visitors, conversion rate and customer unit price above average. We then only need to pass this result in the row position of loc.
df2.loc[(df2['number of visitors'] > df2['number of visitors'].mean()) & (df2['payment conversion rate'] > df2['payment conversion rate'].mean()) & (df2['guest unit price'] > df2['guest unit price'].mean()), :]
At this stage, we have directly screened out four high-quality channels whose key indicators are all higher than the average.
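Scenario 4 can be run end to end with the sketch below. The data is a hand-typed copy of the case table under the same assumptions as before (percent numbers for the conversion rate, NaN for the missing level 4 unit prices), so it illustrates the technique rather than reproducing the author's exact session.

```python
import pandas as pd

nan = float('nan')  # stands in for unit prices that are missing from the table

# Case data as read from the table above (the numeric encoding is an assumption).
levels = ['level 1'] * 13 + ['level 2'] * 3 + ['level 3'] * 2 + ['level 4'] * 4
details = list('ABCDEFGHIJKLM') + list('ABC') + list('AB') + list('ABCD')
visitors = [35188, 28467, 13747, 5183, 4361, 4063, 2122, 2041, 1991, 1981, 1958,
            1780, 1447, 39048, 3316, 2043, 23140, 14813, 216, 31, 17, 3]
conversion = [9.98, 11.27, 2.54, 2.47, 4.31, 11.57, 10.27, 7.06, 16.5, 5.75, 14.71,
              13.15, 1.04, 11.60, 7.09, 5.04, 9.69, 20.14, 1.85, 0.0, 0.005, 0.0]
price = [54.3, 99.93, 0.08, 37.15, 91.73, 65.09, 86.45, 44.07, 104.57, 75.93, 85.03,
         98.87, 80.07, 91.91, 66.28, 41.91, 83.75, 82.97, 94.25, nan, nan, nan]
df2 = pd.DataFrame({
    'traffic source': levels,
    'source details': details,
    'number of visitors': visitors,
    'payment conversion rate': conversion,
    'guest unit price': price,
})

# One Boolean Series per condition; a NaN price compares as False.
many_visitors = df2['number of visitors'] > df2['number of visitors'].mean()
good_conversion = df2['payment conversion rate'] > df2['payment conversion rate'].mean()
high_price = df2['guest unit price'] > df2['guest unit price'].mean()

# Combine with & (all three must hold) and use the result as the row selector.
best = df2.loc[many_visitors & good_conversion & high_price, :]
print(best)
```

Naming each condition first keeps the final loc call short and makes it easy to swap & for | when only one condition needs to hold.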
3. Mixed use of numeric location and name location
Older code mixes the two with df.ix[rows, columns], but this method was deprecated in pandas 0.20 and removed in pandas 1.0, so use method 1 or method 2 instead.
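When rows are known by position and columns by name, the same mixed selection can still be written with loc or iloc alone. The sketch below is one possible way to do it, not a method taken from the article; df2 and the column labels are illustrative names, and the rows come from the case table.

```python
import pandas as pd

# A few rows from the case table.
df2 = pd.DataFrame({
    'traffic source': ['level 1', 'level 1', 'level 2', 'level 3'],
    'source details': ['A', 'B', 'A', 'A'],
    'number of visitors': [35188, 28467, 39048, 23140],
    'payment conversion rate': [9.98, 11.27, 11.60, 9.69],
    'guest unit price': [54.3, 99.93, 91.91, 83.75],
})

wanted = ['traffic source', 'guest unit price']

# Option A: turn row positions into labels, then use loc.
by_label = df2.loc[df2.index[0:3], wanted]

# Option B: turn column names into positions, then use iloc.
col_positions = df2.columns.get_indexer(wanted)
by_position = df2.iloc[0:3, col_positions]

print(by_label)
print(by_position)
```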
This concludes the article on "what are the commonly used indexing methods of Pandas?". I hope the content above is of some help to you. If you found the article useful, please share it so that more people can see it.