This article mainly explains how to multiply two DataFrame columns to construct a new feature. The method introduced here is simple, fast and practical, so interested readers may wish to take a look. Let's learn how to combine two dataframe columns into new features!
Suppose we want to build a new feature b. The goal is to flag the values of column a that lie between 4 and 6: b is True where the condition holds and False otherwise.
Then the code is as follows:

import pandas as pd

lists = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
lists['b'] = (lists['a'] > 4) & (lists['a'] < 6)
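For readers on pandas 1.3 or newer, the same mask can also be written with Series.between. This is a small illustrative sketch using the toy frame above, not part of the original post:

# equivalent one-liner on pandas >= 1.3 (inclusive='neither' makes both ends exclusive)
lists['b'] = lists['a'].between(4, 6, inclusive='neither')
print(lists.loc[lists['b'], 'a'].tolist())  # [5]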
Addendum: multiply two dataframe columns and write the result to a new column
Look at the code:

df3["new"] = df3["rate"] * df3["duration"]

Here "new" is the name of the new column, and "rate" and "duration" are the two columns being multiplied.
Addition, subtraction and division all work the same way!
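As a quick self-contained check, here is a minimal sketch with made-up values; the df3 name and the rate/duration columns mirror the snippet above, while the numbers are invented for illustration:

import pandas as pd

# hypothetical data, just to demonstrate the element-wise arithmetic
df3 = pd.DataFrame({'rate': [0.5, 2.0, 1.5], 'duration': [10, 3, 4]})
df3['new'] = df3['rate'] * df3['duration']    # multiplication
df3['total'] = df3['rate'] + df3['duration']  # addition works the same way
print(df3['new'].tolist())   # [5.0, 6.0, 6.0]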
Supplement: operations for deriving new features from a DataFrame
1. Derive the values of a column into new features

Derive the values of the LBL1 feature into new one-hot features:

piao = df_train_log.LBL1.value_counts().index
# first construct a temporary df
df_tmp = pd.DataFrame({'USRID': df_train_log.drop_duplicates('USRID').USRID.values})
# initialise every new feature column to 0; it is set to 1 later if the user has that value
for i in piao:
    df_tmp['PIAO_' + i] = 0
# each USRID of the original data has multiple records, so use grouped statistics
group = df_train_log.groupby(['USRID'])
for k in group.groups.keys():
    t = group.get_group(k)
    id = t.USRID.value_counts().index[0]
    tmp_list = t.LBL1.value_counts().index
    for j in tmp_list:
        df_tmp['PIAO_' + j].loc[df_tmp.USRID == id] = 1

2. Grouped statistics: keep only the most frequent item

import numpy as np

group = df_train_log.groupby(['USRID'])
lt = []
list_max_lbl1 = []
list_max_lbl2 = []
list_max_lbl3 = []
for k in group.groups.keys():
    t = group.get_group(k)
    # find the item that appears most often
    argmx = np.argmax(t['EVT_LBL'].value_counts())
    lbl1_max = np.argmax(t['LBL1'].value_counts())
    lbl2_max = np.argmax(t['LBL2'].value_counts())
    lbl3_max = np.argmax(t['LBL3'].value_counts())
    list_max_lbl1.append(lbl1_max)
    list_max_lbl2.append(lbl2_max)
    list_max_lbl3.append(lbl3_max)
    # keep only the rows with the most frequent item
    c = t[t['EVT_LBL'] == argmx].drop_duplicates('EVT_LBL')
    # put them into the list
    lt.append(c)
# construct a new df
df_train_log_new = pd.concat(lt)
# and construct three more features: the most frequent value of LBL1, LBL2 and LBL3 respectively
df_train_log_new['LBL1_MAX'] = list_max_lbl1
df_train_log_new['LBL2_MAX'] = list_max_lbl2
df_train_log_new['LBL3_MAX'] = list_max_lbl3

3. Derive one-hot features for whether activity occurred on a certain day

# create a temporary df; Wednesday, Saturday and Sunday default to 0
df_day = pd.DataFrame({'USRID': df_train_log.drop_duplicates('USRID').USRID.values})
df_day['weekday_3'] = 0
df_day['weekday_6'] = 0
df_day['weekday_7'] = 0
# grouped statistics: set to 1 if the day is present, otherwise leave it at 0
group = df_train_log.groupby(['USRID'])
for k in group.groups.keys():
    t = group.get_group(k)
    id = t.USRID.value_counts().index[0]
    tmp_list = t.occ_dayofweek.value_counts().index
    for j in tmp_list:
        if j == 3:
            df_day['weekday_3'].loc[df_tmp.USRID == id] = 1
        elif j == 6:
            df_day['weekday_6'].loc[df_tmp.USRID == id] = 1
        elif j == 7:
            df_day['weekday_7'].loc[df_tmp.USRID == id] = 1

4. Count how many seconds the user stays in the APP and on how many days the APP was viewed

import datetime
import time

# first convert each date into a timestamp and store it in a new feature
tmp_list = []
for i in df_train_log.OCC_TIM:
    d = datetime.datetime.strptime(str(i), "%Y-%m-%d %H:%M:%S")
    evt_time = time.mktime(d.timetuple())
    tmp_list.append(evt_time)
df_train_log['time'] = tmp_list
# subtract each row from the next one to get the dwell time in the app
df_train_log['diff_time'] = df_train_log.time - df_train_log.time.shift(1)
# construct a new DataFrame and group it to get the number of days the app was viewed
df_time = pd.DataFrame({'USRID': df_train_log.drop_duplicates('USRID').USRID.values})
# number of viewing days
df_time['days'] = 0
group = df_train_log.groupby(['USRID'])
for k in group.groups.keys():
    t = group.get_group(k)
    id = set(t.USRID).pop()
    df_time['days'].loc[df_time.USRID == id] = len(t.occ_day.value_counts().index)
# finally, remove abnormal time differences (for example, a gap that spans two days makes no sense)
# and the NaN rows produced by shift(); the original also applied an upper bound on diff_time,
# but its value is cut off in the source text, so only the lower bound is shown
df_train_log = df_train_log[df_train_log.diff_time > 0]
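As a side note, the grouped Python loops above can usually be replaced by vectorized pandas operations. The following is a minimal sketch, assuming the same df_train_log with USRID, LBL1 and occ_day columns; it is not the original author's code, just an equivalent formulation of steps 1 and 4:

import pandas as pd

# one-hot "has this USRID ever seen this LBL1 value", equivalent to the PIAO_ loop in step 1
one_hot = (pd.get_dummies(df_train_log['LBL1'], prefix='PIAO')
             .groupby(df_train_log['USRID']).max()
             .astype(int)
             .reset_index())

# number of distinct days each USRID viewed the app, equivalent to the df_time loop in step 4
days = (df_train_log.groupby('USRID')['occ_day']
          .nunique()
          .rename('days')
          .reset_index())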