

What are the 11 Python Pandas tips?


Today I'd like to talk with you about 11 Python Pandas tips that many people may not know very well. To help you understand them better, I have summarized the following; I hope you get something out of this article.

You may already know some of the commands in this article, but perhaps you didn't realize they could be used this way.

Pandas is a widely used data analysis package in Python. There are many classic Pandas tutorials on the market, but this article introduces a few hidden cool tips that I'm sure will help you.

1. read_csv

This is the entry-level command for reading data. When you want to read a very large file, try adding the parameter nrows=5 to read in just a small portion before loading everything. This way you can avoid errors such as choosing the wrong delimiter (data is not necessarily comma-separated).
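As a minimal sketch (the file name data.csv is illustrative), previewing before a full load might look like this:

import pandas as pd

# Peek at the first 5 rows to verify the delimiter and column names
# before committing to a full load.
preview = pd.read_csv('data.csv', nrows=5)
print(preview)

df = pd.read_csv('data.csv')  # full load once the format looks right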

(Alternatively, on Linux you can use head to display the first five lines of any text file: head -n 5 data.txt.)

Next, you can extract the column names as a list with df.columns.tolist(). You can also add usecols=['c1', 'c2', …] to load only the columns you need. In addition, if you know the types of certain columns, you can add dtype={'c1': str, 'c2': int, …}, which speeds up loading. Another advantage of declaring types in advance is that if a column contains both strings and numbers and you declare it as a string, that column will not cause errors when it is used as a key to merge tables.
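Putting those parameters together, a hedged sketch (the file name and the column names 'c1' and 'c2' are placeholders) might be:

import pandas as pd

# Load only the needed columns and declare their types up front;
# declaring 'c1' as str keeps a mixed string/numeric key column
# from being mis-inferred before a merge.
df = pd.read_csv('data.csv',
                 usecols=['c1', 'c2'],
                 dtype={'c1': str, 'c2': int})

cols = df.columns.tolist()  # the column names as a plain list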

2. select_dtypes

If you have to preprocess data in Python, this command can save you time. After reading a table, the default data type of each column will be bool, int64, float64, object, category, timedelta64, or datetime64. To get an overview, first use:

df.dtypes.value_counts()

to see the data types present in your dataframe, then use:

df.select_dtypes(include=['float64', 'int64'])

to get a sub-dataframe consisting only of numeric columns.
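For instance, on a small made-up frame (a sketch, not from the original article), the two commands behave like this:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5], 'c': ['x', 'y']})

print(df.dtypes.value_counts())  # one column each of int64, float64, object
numeric = df.select_dtypes(include=['float64', 'int64'])
print(numeric.columns.tolist())  # ['a', 'b']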

3. copy

If you haven't heard of this one, I cannot overemphasize its importance. Enter the following commands:

import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 0], 'b': [1, 1, 1]})
df2 = df1
df2['a'] = df2['a'] + 1
df1.head()

You will find that df1 has changed. This is because df2 = df1 does not create a copy of df1 and assign it to df2; it sets a pointer to df1, so any change to df2 also affects df1. To fix this, you can either do this:

df2 = df1.copy()

You can also do this:

from copy import deepcopy
df2 = deepcopy(df1)
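To confirm the difference, a quick check (a small sketch) shows the copy leaving df1 untouched:

import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 0], 'b': [1, 1, 1]})
df2 = df1.copy()          # an independent copy, not a second reference
df2['a'] = df2['a'] + 1
print(df1['a'].tolist())  # [0, 0, 0] -- df1 is unchanged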

4. map

This cool command makes data transformation easy. First define a dictionary whose keys are the old values before conversion and whose values are the new values after conversion:

level_map = {1: 'high', 2: 'medium', 3: 'low'}
df['c_level'] = df['c'].map(level_map)

Several applicable scenarios: converting True/False to 1/0 (for modeling); defining levels; encoding with a user-defined dictionary. The first scenario is sketched below.
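A minimal sketch of the True/False-to-1/0 conversion (the column name 'flag' is illustrative):

import pandas as pd

df = pd.DataFrame({'flag': [True, False, True]})
df['flag'] = df['flag'].map({True: 1, False: 0})
print(df['flag'].tolist())  # [1, 0, 1]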

5. Do you need apply?

The apply function can be quite helpful when we want to generate a new column using existing columns as input.

import pandas as pd

def rule(x, y):
    if x == 'high' and y > 10:
        return 1
    else:
        return 0

df = pd.DataFrame({'c1': ['high', 'high', 'low', 'low'], 'c2': [0, 23, 17, 4]})
df['new'] = df.apply(lambda x: rule(x['c1'], x['c2']), axis=1)
df.head()

In the code above, we define a function with two input variables and use apply to act on columns 'c1' and 'c2'.

But apply is too slow in some cases. If you want to calculate the maximum of the 'c1' and 'c2' columns, you can certainly do this:

df['maximum'] = df.apply(lambda x: max(x['c1'], x['c2']), axis=1)

But you will find that apply is much slower than the following command:

df['maximum'] = df[['c1', 'c2']].max(axis=1)

Conclusion: if you can use a built-in function instead (they are generally faster), do not use apply. For example, to round the values of column 'c', use round(df['c'], 0) or df['c'].round(0) rather than the apply function above.
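A small sketch of the rounding comparison (sample values are made up):

import pandas as pd

df = pd.DataFrame({'c': [1.24, 2.67, 3.51]})

df['c_builtin'] = df['c'].round(0)                    # vectorized, fast
df['c_apply'] = df['c'].apply(lambda x: round(x, 0))  # same result, slower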

7. value_counts

This command is used to check the distribution of values. For example, to check which values appear in column 'c' and how often each appears, you can use:

df['c'].value_counts()

Here are some useful tips/parameters (a combined sketch follows the list):

normalize=True: show the relative frequency of each value instead of the raw counts.

dropna=False: include missing values in the statistics.

sort=False: do not sort the results by frequency.

df['c'].value_counts().reset_index(): convert the statistics table into a pandas dataframe so it can be processed further.
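The combined sketch (sample data is made up):

import pandas as pd

df = pd.DataFrame({'c': ['x', 'y', 'x', None, 'x']})

print(df['c'].value_counts())                # raw counts, missing values dropped
print(df['c'].value_counts(normalize=True))  # relative frequencies
print(df['c'].value_counts(dropna=False))    # keep the missing value in the count
counts_df = df['c'].value_counts().reset_index()  # counts as a regular dataframe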

8. Number of missing values

When building a model, we may want to remove rows that contain too many missing values (or are entirely missing). You can use .isnull() and .sum() to count the number of missing values in the specified columns:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 2, 3], 'c1': [0, 0, np.nan], 'c2': [np.nan, 1, 1]})
df = df[['id', 'c1', 'c2']]
df['num_nulls'] = df[['c1', 'c2']].isnull().sum(axis=1)
df.head()
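Continuing from the frame above, one hedged way to drop rows past a null threshold (the cutoff of 1 is illustrative):

df_clean = df[df['num_nulls'] <= 1]  # keep rows with at most one missing value
df_clean.head()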

9. Select rows with specific IDs

In SQL we can use SELECT * FROM … WHERE ID in ('A001', 'C022', …) to get the records with the specified IDs. If you want to do something similar in Pandas, you can use:

df_filter = df['ID'].isin(['A001', 'C022'])
df[df_filter]
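A self-contained sketch (the IDs and values are made up):

import pandas as pd

df = pd.DataFrame({'ID': ['A001', 'B005', 'C022'], 'value': [1, 2, 3]})
df_filter = df['ID'].isin(['A001', 'C022'])  # boolean mask, like SQL's IN
print(df[df_filter])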

10. Grouping based on quantile

Suppose you have a column of values and want to group them: the top 5% in group 1, 5-20% in group 2, 20-50% in group 3, and the bottom 50% in group 4. Of course you can use pandas.cut, but another option is:

import numpy as np

cut_points = [np.percentile(df['c'], p) for p in [50, 80, 95]]
df['group'] = 1
for i in range(3):
    df['group'] = df['group'] + (df['c'] < cut_points[i])
# or use '<=' to include the boundary values
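For comparison, the pandas.cut route mentioned above might look like this sketch (random data; bin edges mirror the 50/80/95 percentiles, and the labels are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'c': np.random.rand(100)})

cut_points = [np.percentile(df['c'], p) for p in [50, 80, 95]]
bins = [-np.inf] + cut_points + [np.inf]
# labels run 4..1 so that group 1 is the top 5%, matching the loop above
df['group'] = pd.cut(df['c'], bins=bins, labels=[4, 3, 2, 1])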
