What Are the Practical Tips for Pandas Data Analysis?

This article shares a few practical tips for Pandas data analysis. They are quite hands-on, so keep them as a reference and follow along.
Tip 1: how to use map to do feature engineering on certain columns?
Source data:
D = {"gender": ["male", "female", "male", "female"], "color": ["red", "green", "blue", "green"], "age": [25,30,15,32]} df = pd.DataFrame (d) df
On the gender column, use the map method to quickly complete the following mapping:
D = {"male": 0, "female": 1} df ["gender2"] = df ["gender"] .map (d)
Tip 2: clean data with replace and regular expressions
Pandas' strength is data analysis, and data cleaning is an essential part of it. Here is a quick cleaning tip: use the replace method with a regular expression on a column to quickly clean up its values.
Source data:
D = {"customer": ["A", "B", "C", "D"], "sales": [1100, "950.5RMB", "$400", "$1250.75"]} df = pd.DataFrame (d) df
Print the results:
  customer     sales
0        A      1100
1        B  950.5RMB
2        C      $400
3        D  $1250.75
Look at the sales column: the values are a mix of a plain integer, a float followed by RMB, and dollar-prefixed integer and float strings.

Our goal: strip the RMB and $ symbols and convert the column to float.

One line of code does it:
Df ["sales"] = df ["sales"] .replace ("[$, RMB]", "", regex = True)\ .astype ("float")
The regex substitution puts the characters to be removed into the character class "[$,RMB]" and replaces each of them with the empty string "". Finally, astype converts the column to float.
Print the results:
  customer    sales
0        A  1100.00
1        B   950.50
2        C   400.00
3        D  1250.75
If you are not sure it worked, check the type of each value:
Df ["sales"] .apply (type)
Print the results:
0    <class 'float'>
1    <class 'float'>
2    <class 'float'>
3    <class 'float'>
Name: sales, dtype: object
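As an aside, the same cleanup could also be done with the string accessor and pd.to_numeric. This is only a sketch of an alternative, not the article's method; the raw column is rebuilt here so the snippet runs on its own:

import pandas as pd

# Sketch: rebuild the raw column, strip the currency markers with str.replace,
# then let pd.to_numeric handle the conversion to float.
raw = pd.Series([1100, "950.5RMB", "$400", "$1250.75"], name="sales")
cleaned = pd.to_numeric(raw.astype(str).str.replace(r"[$,RMB]", "", regex=True))
print(cleaned)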
Tip 3: how to use melt to reshape data from wide to long?

Construct a DataFrame:
D = {\ "district_code": [12345, 56789, 101112, 131415], "apple": [5.2,2.4,4.2,3.6], "banana": [3.5,1.9,4.0,2.3], "orange": [12345, 7.5,6.4,3.9]} df = pd.DataFrame (d) df
Print the results:
   district_code  apple  banana  orange
0          12345    5.2     3.5     8.0
1          56789    2.4     1.9     7.5
2         101112    4.2     4.0     6.4
3         131415    3.6     2.3     3.9
Here 5.2 is the apple price for district 12345. The apple, banana, and orange columns each hold the price of one kind of fruit, so how do we merge these three columns into a single column?
Use pd.melt
The parameter values for this example are:
df = df.melt(id_vars="district_code",
             var_name="fruit_name",
             value_name="price")
df
Print the results:
    district_code fruit_name  price
0           12345      apple    5.2
1           56789      apple    2.4
2          101112      apple    4.2
3          131415      apple    3.6
4           12345     banana    3.5
5           56789     banana    1.9
6          101112     banana    4.0
7          131415     banana    2.3
8           12345     orange    8.0
9           56789     orange    7.5
10         101112     orange    6.4
11         131415     orange    3.9
The result above is the long-format DataFrame; the original df is the corresponding wide format.
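For reference (this step is not in the article), the long format can be pivoted back to the wide layout. A minimal sketch assuming the melted df above:

# Sketch: invert the melt, turning fruit_name values back into columns.
wide = df.pivot(index="district_code", columns="fruit_name", values="price").reset_index()
wide.columns.name = None  # drop the leftover "fruit_name" axis label
wide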
Tip 4: given year and day of year, how to convert them to a datetime?
Original DataFrame
D = {\ "year": [2019, 2019, 2020], "day_of_year": [350,365,1]} df = pd.DataFrame (d) df
Print the results:
   year  day_of_year
0  2019          350
1  2019          365
2  2020            1
Here is how to convert them to datetime.
Step 1: create an integer of the form YYYYDDD
Df ["int_number"] = df ["year"] * 1000 + df ["day_of_year"]
Print df results:
   year  day_of_year  int_number
0  2019          350     2019350
1  2019          365     2019365
2  2020            1     2020001
Step 2: to_datetime
Df ["date"] = pd.to_datetime (df ["int_number"], format = "% Y% j")
Note the %j directive in the format string "%Y%j": it parses the day of the year.
Print the results:
   year  day_of_year  int_number       date
0  2019          350     2019350 2019-12-16
1  2019          365     2019365 2019-12-31
2  2020            1     2020001 2020-01-01
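An alternative worth knowing (a sketch, not the article's method): the same dates can be built from strings by zero-padding day_of_year to three digits so that %j can parse it. This assumes the df above; date2 is a hypothetical column name:

# Sketch: string-based variant of the same conversion.
df["date2"] = pd.to_datetime(
    df["year"].astype(str) + df["day_of_year"].astype(str).str.zfill(3),
    format="%Y%j")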
Tip 5: how to group category values that occur rarely into "Others"?

This is another task we face in data cleaning and feature construction.
The following DataFrame:
D = {"name": ['Jone','Alica','Emily','Robert','Tomas','Zhang','Liu','Wang','Jack','Wsx','Guo'], "categories": ["A", "C", "A", "D", "A", "B", "B", "C", "A", "E", "F"]} df = pd.DataFrame (d) df
Results:
      name categories
0     Jone          A
1    Alica          C
2    Emily          A
3   Robert          D
4    Tomas          A
5    Zhang          B
6      Liu          B
7     Wang          C
8     Jack          A
9      Wsx          E
10     Guo          F
D, E, and F each appear only once among the categories, while A appears most often.
Step 1: count the frequency and normalize it
frequencies = df["categories"].value_counts(normalize=True)
frequencies
Results:
A    0.363636
B    0.181818
C    0.181818
F    0.090909
E    0.090909
D    0.090909
Name: categories, dtype: float64
Step 2: set a threshold and filter out the values whose frequency falls below it
threshold = 0.1
small_categories = frequencies[frequencies < threshold].index
small_categories
Results:
Index(['F', 'E', 'D'], dtype='object')
Step 3: replace the value
Df ["categories"] = df ["categories"]\ .replace (small_categories, "Others")
Replaced DataFrame:
      name categories
0     Jone          A
1    Alica          C
2    Emily          A
3   Robert     Others
4    Tomas          A
5    Zhang          B
6      Liu          B
7     Wang          C
8     Jack          A
9      Wsx     Others
10     Guo     Others
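Had Step 3 not already been run, the same result could be obtained with a boolean mask and where. A sketch of that alternative, assuming the df and small_categories from the steps above:

# Sketch: mask-based variant of Step 3's replacement.
mask = df["categories"].isin(small_categories)
df["categories"] = df["categories"].where(~mask, "Others")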
Thank you for reading! That is the end of this article on practical tips for Pandas data analysis. I hope it helps you pick up something new; if you found it useful, feel free to share it.