2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
1. Calculate the year-over-year growth ratio for the same month
esProc:
A1  =now()
A2  =file("C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\sales.csv").import@t()
A3  =A2.groups(year(ORDERDATE):y,month(ORDERDATE):m;sum(AMOUNT):x)
A4  =A3.sort(m)
A5  =A4.derive(if(m==m[-1],x/x[-1]-1):yoy)
A6  =interval@ms(A1,now())
A3: group by the year and month of ORDERDATE, name the two columns y and m, and calculate the total sales amount x of each group.
The group() function groups without aggregating, whereas groups() groups and aggregates in one step.
A4: sort by month m.
A5: add a column named yoy. If the month equals the month of the previous row, calculate the growth ratio x/x[-1]-1 and assign it; otherwise assign null.
Python:
import time
import numpy as np
import pandas as pd
s = time.time()
sales = pd.read_csv("C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\sales.csv", sep='\t')
# Parse the order date and derive the year and month columns
sales['ORDERDATE'] = pd.to_datetime(sales['ORDERDATE'])
sales['y'] = sales['ORDERDATE'].dt.year
sales['m'] = sales['ORDERDATE'].dt.month
# Total AMOUNT per (year, month), then sort so that equal months are adjacent
sales_g = sales[['y', 'm', 'AMOUNT']].groupby(by=['y', 'm'], as_index=False)
amount_df = sales_g.sum().sort_values(['m', 'y'])
yoy = np.zeros(amount_df.values.shape[0])
# Growth ratio relative to the previous row
yoy = (amount_df['AMOUNT'] - amount_df['AMOUNT'].shift(1)) / amount_df['AMOUNT'].shift(1)
# The previous row belongs to a different month: no ratio
yoy[amount_df['m'].shift(1) != amount_df['m']] = np.nan
amount_df['yoy'] = yoy
print(amount_df)
e = time.time()
print(e - s)
pd.to_datetime() converts the column to datetime format. The newly added y and m columns hold the year and month. df.groupby(by, as_index) groups by one or more fields; with as_index=False the group labels are kept as ordinary columns instead of becoming the index. df.sort_values() sorts the aggregated dataframe by month and then by year. A new array is created to store the growth ratio. df.shift(1) moves the column down by one row, so each row lines up with the value of the previous row; the growth ratio is (value of the current row - value of the previous row) / value of the previous row. Where the month differs from that of the previous row, the ratio is set to nan. Finally, the array is assigned to the newly added column yoy of the dataframe.
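The shift-and-mask step described above can also be written with a grouped shift, so that a row is only ever compared with the previous row of the same month. The following is a minimal sketch, not the article's code: the function name same_month_yoy and the small demo table are assumptions made for illustration, while the column names ORDERDATE and AMOUNT follow the listing above.

```python
import pandas as pd

def same_month_yoy(sales: pd.DataFrame) -> pd.DataFrame:
    """Total AMOUNT per (year, month) plus growth versus the same month one row earlier."""
    df = sales.copy()
    df["ORDERDATE"] = pd.to_datetime(df["ORDERDATE"])
    df["y"] = df["ORDERDATE"].dt.year
    df["m"] = df["ORDERDATE"].dt.month
    totals = (df.groupby(["y", "m"], as_index=False)["AMOUNT"].sum()
                .sort_values(["m", "y"])
                .reset_index(drop=True))
    # Shifting inside each month group yields NaN for the first year of a month,
    # so no separate "month changed" mask is needed.
    previous = totals.groupby("m")["AMOUNT"].shift(1)
    totals["yoy"] = totals["AMOUNT"] / previous - 1
    return totals

# Hypothetical demo data (not from sales.csv)
demo = pd.DataFrame({
    "ORDERDATE": ["2014-01-10", "2014-01-20", "2015-01-05", "2014-02-01", "2015-02-01"],
    "AMOUNT": [100.0, 100.0, 300.0, 50.0, 25.0],
})
result = same_month_yoy(demo)
print(result)
```

Like the listing above, this compares each row with the previous available year of the same month; it does not check that the two years are consecutive.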
Results:
[Figures: esProc result, Python result]
Time consumed:
esProc  0.007
Python  0.030
2. Calculate the top n customers who account for half of sales in 1998
esProc:
A1   =now()
A2   =file("E:\\esproc\\esProc\\demo\\zh\\txt\\Contract.txt").import@t()
A3   =file("E:\\esproc\\esProc\\demo\\zh\\txt\\Client.txt").import@t().keys(ID).index()
A4   =A2.select(year(SellDate)==1998)
A5   =A4.groups(Client;sum(Amount):Amount)
A6   =A5.sort(-Amount)
A7   =A5.sum(Amount)/2
A8   =0
A9   =A6.pselect((A8=A8+Amount)>=A7)
A10  =A3.find@k(A6.to(A9).(Client)).(Name)
A11  =interval@ms(A1,now())
A3: T.keys(Ki,…) defines Ki,… as the key of the in-memory table T. T.index(n) builds an index table of length n on the key of table sequence T; the index table is cleared when n is 0 or when the key is reset, and the length is chosen automatically when n is omitted. If data is looked up by key many times, building the index first makes the lookups more efficient. Indexing assumes that the primary key of each record is unique; otherwise an error occurs.
A4: select the 1998 transaction records.
A5: group by Client and calculate the total Amount of each group.
A6: sort by Amount in descending order.
A9: accumulate Amount row by row and find the position at which the running total reaches half of the total amount.
A10: A.find(k) finds the member whose primary key equals k in the table sequence or record sequence A, and uses the index table if one exists. With @k, when the parameter k is a sequence it is treated as a sequence of key values, and the members of A corresponding to those key values are returned. Here it returns the Name field of the members whose key ID equals A6.to(A9).(Client).
Python:
import time
import pandas as pd
import numpy as np
s = time.time()
contract_info = pd.read_csv('E:\\esproc\\esProc\\demo\\zh\\txt\\Contract.txt', sep='\t')
client_info = pd.read_csv('E:\\esproc\\esProc\\demo\\zh\\txt\\Client.txt', sep='\t')
contract_info['SellDate'] = pd.to_datetime(contract_info['SellDate'])
# Keep only the 1998 contracts
contract_info_1998 = contract_info[contract_info['SellDate'].dt.year == 1998]
contract_1998_g = contract_info_1998[['Client', 'Amount']].groupby(by='Client', as_index=False)
# Total per client, largest first, with a fresh 0..n-1 index
contract_sort = contract_1998_g.sum().sort_values(by='Amount', ascending=False).reset_index(drop=True)
half_amount = contract_sort['Amount'].sum() / 2
sm = 0
# Accumulate until the running total reaches half of all sales
for i in range(len(contract_sort['Amount'])):
    sm += contract_sort['Amount'][i]
    if sm >= half_amount:
        break
n_client = contract_sort['Client'].loc[:i]
client_info = client_info.set_index('ID')
print(client_info.loc[n_client]['Name'])
e = time.time()
print(e - s)
Select the records for 1998.
Only the Client and Amount fields of the contract data are needed here, so only these two fields are selected and then grouped by Client. df.sort_values(by, ascending) sorts the totals in descending order of Amount.
df.reset_index(drop) rebuilds the index; drop=True discards the original index, otherwise the original index is inserted as a column.
Compute half of the total transaction amount, loop over the Amount field, and find the position where the running sum becomes greater than or equal to that half. Take the values of the Client field from position 0 to that position to form a Series.
Use this Series to look up the Name values of the corresponding rows in client_info.
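The accumulation loop described above can also be expressed with a cumulative sum. This is a minimal sketch under stated assumptions rather than the article's method: the function name top_clients_for_half and the demo rows are hypothetical, and only the Client and Amount column names come from the listing above.

```python
import pandas as pd

def top_clients_for_half(contracts: pd.DataFrame) -> list:
    """Return the largest clients whose combined Amount first reaches half of the total."""
    totals = (contracts.groupby("Client")["Amount"].sum()
                       .sort_values(ascending=False))
    half = totals.sum() / 2
    running = totals.cumsum()
    # argmax on a boolean array gives the position of the first True value
    n = int((running >= half).to_numpy().argmax()) + 1
    return totals.index[:n].tolist()

# Hypothetical demo data (not from Contract.txt)
demo = pd.DataFrame({
    "Client": ["A", "B", "A", "C", "D"],
    "Amount": [40, 30, 20, 5, 5],
})
clients = top_clients_for_half(demo)
print(clients)
```

The returned client identifiers could then be passed to client_info.loc[...] exactly as in the listing above to obtain the names.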
Results:
[Figures: esProc result, Python result]
Time consumed:
esProc  0.007
Python  0.016
3. Find the salespeople who ranked in the top eight by sales in every month of 1995
esProc:
A1  =now()
A2  =file("E:\\esproc\\esProc\\demo\\zh\\txt\\SalesRecord.txt").import@t()
A3  =A2.groups(clerk_name:name,month(sale_date):month;sum(sale_amt):amount)
A4  =A3.group(month)
A5  =A4.(~.sort(-amount).to(8))
A6  =A5.isect(~.(name))
A7  =interval@ms(A1,now())
A3: group by clerk_name and by the month month(sale_date), name these two fields name and month, calculate the sum of sale_amt in each group, and name it amount.
A4: group by month, without aggregating.
A5: sort each group by amount in descending order and take the top 8.
A6: A.isect() takes a sequence A whose members are themselves sequences and returns a new sequence made up of the members common to all of the subsequences. Here it computes the intersection of the name lists of all months.
Python:
import time
import pandas as pd
import numpy as np
s = time.time()
sale_rec = pd.read_csv('E:\\esproc\\esProc\\demo\\zh\\txt\\SalesRecord.txt', sep='\t')
sale_rec['sale_date'] = pd.to_datetime(sale_rec['sale_date'])
sale_rec['m'] = sale_rec['sale_date'].dt.month
# Sum only sale_amt; summing the datetime column would fail in recent pandas versions
sale_rec_g = sale_rec.groupby(by=['clerk_name', 'm'], as_index=False)['sale_amt'].sum()
sale_month_g = sale_rec_g.groupby('m', as_index=False)
# Start from the set of all clerks and intersect it with each month's top 8
sale_set = set(sale_rec['clerk_name'].drop_duplicates().values.tolist())
for index, group in sale_month_g:
    group_topn = group.sort_values(by='sale_amt', ascending=False)[:8]
    sale_set = sale_set.intersection(set(group_topn['clerk_name'].values.tolist()))
print(list(sale_set))
e = time.time()
print(e - s)
Add a new column m holding the month.
Group by clerk_name and m and compute the sum of sale_amt.
Group the result by m.
Initialize a set that contains every clerk_name.
Loop over the groups: intersect the current set with the clerk_name values of each group's top eight and assign the result back to the set. After the last group, the set holds the intersection of all the monthly top-eight sets.
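The steps above can be condensed with DataFrame.nlargest. This is a minimal sketch, not the article's code: the function name always_in_top_n, the parameter n and the demo rows are assumptions for illustration; the columns clerk_name, m and sale_amt follow the listing above.

```python
import pandas as pd

def always_in_top_n(sales: pd.DataFrame, n: int) -> list:
    """Clerks who are among the n best sellers in every month present in the data."""
    monthly = sales.groupby(["clerk_name", "m"], as_index=False)["sale_amt"].sum()
    common = None
    for _, group in monthly.groupby("m"):
        top = set(group.nlargest(n, "sale_amt")["clerk_name"])
        # The first month seeds the set; later months narrow it down
        common = top if common is None else common & top
    return sorted(common or [])

# Hypothetical demo data (not from SalesRecord.txt)
demo = pd.DataFrame({
    "clerk_name": ["A", "B", "C", "A", "B", "C"],
    "m":          [1,   1,   1,   2,   2,   2],
    "sale_amt":   [10,  8,   1,   9,   2,   7],
})
winners = always_in_top_n(demo, 2)
print(winners)
```

Seeding the set from the first month avoids building the set of all clerks up front; the result is the same intersection.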
Results:
        Result        Time consumed
esProc  Jenny,Steven  0.010
Python  Jenny,Steven  0.023
4. Find the modified records
esProc:
A1   =now()
A2   =file("C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\old.csv").import@t()
A3   =file("C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\new.csv").import@t()
A4   =A2.sort(userName,date)
A5   =A3.sort(userName,date)
A6   =[A5,A4].merge@d(userName,date)
A7   =[A4,A5].merge@d(userName,date)
A8   =[A5,A4].merge@d(userName,date,saleValue,saleCount)
A9   =[A8,A6].merge@d(userName,date)
A10  =interval@ms(A1,now())
A4, A5: sort by userName and date.
A6: A.merge(xi,…) merges the tables A(1),…,A(n), which are assumed to be ordered by [xi,…], on the specified fields xi. If xi is omitted, the primary key is used; if xi is omitted and A has no primary key, the merge is based on r.v(). The A(i) must have the same structure. The @d option removes the members of A(2)&…A(n) from A(1) and returns the remaining records, that is, the difference. Here the difference between the new table and the old table gives the newly added records.
A7: the difference between the old table and the new table, that is, the records deleted from the old table.
A8: with all fields as xi, this gives every changed record in the new table, both newly added and modified.
A9: the modified records are obtained as the difference between all changed records and the newly added records.
Python:
import time
import pandas as pd
import numpy as np
s = time.time()
old = pd.read_csv('C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\old.csv', sep='\t')
new = pd.read_csv('C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\new.csv', sep='\t')
# Rows of old with no match in new were deleted
old_delet_rec = pd.merge(old, new, how='left', on=['userName', 'date'])
delet_rec = old_delet_rec[np.isnan(old_delet_rec['saleValue_y'])][['userName', 'date', 'saleValue_x', 'saleCount_x']]
print('delet_rec')
print(delet_rec)
# Rows of new with no match in old were added
new_add_rec = pd.merge(old, new, how='right', on=['userName', 'date'])
new_rec = new_add_rec[np.isnan(new_add_rec['saleValue_x'])][['userName', 'date', 'saleValue_y', 'saleCount_y']]
print('new_rec')
print(new_rec)
# Rows that differ between the tables but share the same key were modified
all_rec = pd.concat([old, new])
all_update_rec = all_rec.drop_duplicates(keep=False)
update_rec = all_update_rec[all_update_rec[['userName', 'date']].duplicated()]
print('update_rec')
print(update_rec)
e = time.time()
print(e - s)
First, pd.merge(old, new, how='left') left-joins the old table with the new table; the rows whose new-table columns are nan are the rows deleted from the old table. Because both tables use the same field names, pandas adds the default suffixes _x and _y, and the deleted records are the first four fields of the merged rows.
In the same way, a right join gives the rows newly added in the new table.
pd.concat([df1, df2]) stacks the old table and the new table vertically, and df.drop_duplicates(keep=False) deletes every row that appears more than once, leaving all records that differ between the two tables. Among these, the rows whose ['userName', 'date'] values are duplicated are the modified records.
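The three comparisons described above can also be derived from a single outer join, using the indicator option of DataFrame.merge. This is a sketch under stated assumptions, not the article's method: the function name diff_tables, the suffixes _old and _new, and the demo rows are illustrative; the column names follow the listing above.

```python
import pandas as pd

def diff_tables(old: pd.DataFrame, new: pd.DataFrame, keys: list):
    """Return the key columns of deleted, added and modified rows."""
    merged = old.merge(new, on=keys, how="outer",
                       suffixes=("_old", "_new"), indicator=True)
    deleted = merged.loc[merged["_merge"] == "left_only", keys]
    added = merged.loc[merged["_merge"] == "right_only", keys]
    both = merged[merged["_merge"] == "both"]
    # A row is modified when any non-key column differs between old and new.
    # Note: NaN != NaN, so rows with missing values on both sides count as modified.
    changed = pd.Series(False, index=both.index)
    for col in [c for c in old.columns if c not in keys]:
        changed |= both[col + "_old"] != both[col + "_new"]
    modified = both.loc[changed, keys]
    return deleted, added, modified

# Hypothetical demo data (not from old.csv / new.csv)
cols = ["userName", "date", "saleValue", "saleCount"]
old_demo = pd.DataFrame([["u1", "d1", 10, 1], ["u2", "d1", 20, 2], ["u3", "d1", 30, 3]], columns=cols)
new_demo = pd.DataFrame([["u1", "d1", 10, 1], ["u2", "d1", 25, 2], ["u4", "d1", 40, 4]], columns=cols)
deleted, added, modified = diff_tables(old_demo, new_demo, ["userName", "date"])
print(deleted, added, modified, sep="\n")
```

The _merge column produced by indicator=True records whether each key was found in the left table only, the right table only, or both, which removes the need to test individual columns for nan.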
Results:
[Figures: esProc result; Python results for delet_rec, new_rec and update_rec]
Time consumed:
esProc  0.003
Python  0.038
5. Calculate the daily inventory status of each kind of goods over a specified period
Topic introduction: stocklog.csv has four fields: STOCKID (goods number), DATE (date, not continuous), QUANTITY (quantity moved in or out of the warehouse) and INDICATOR (a flag: empty means the goods entered the warehouse, ISSUE means they left it).
The data are as follows:
[Figure: sample rows of stocklog.csv]
Our purpose is to use this data to calculate the inventory status of every kind of goods on each day of a specified period, that is: STOCKID (goods number), DATE (date, continuous), OPEN (opening quantity), ENTER (quantity received that day), TOTAL (OPEN plus ENTER, the largest quantity held that day), ISSUE (quantity shipped that day) and CLOSE (closing quantity).
esProc:
A1   =now()
A2   =file("C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\stocklog.csv").import@t()
A3   =A2.select(DATE>=date(start) && DATE…
…
B7   >b=c=0
B8   =B6.new(A6.STOCKID:STOCKID,A5(#):DATE,c:OPEN,ENTER,(b=c+ENTER):TOTAL,ISSUE,(c=b-ISSUE):CLOSE)
B9   >B8.run(ENTER=ifn(ENTER,0),ISSUE=ifn(ISSUE,0))
B10  =@|B8
A11  =interval@ms(A1,now())
A3: select the data within the specified dates. start and end are grid parameters defined in advance (they can be set via the Program - Grid Parameters menu of esProc).
A4: group by STOCKID and DATE and aggregate each group at the same time. For the ISSUE field, if() returns QUANTITY when INDICATOR is ISSUE and 0 otherwise, and these values are summed within the group. For the ENTER field, if() returns 0 when INDICATOR is ISSUE and QUANTITY otherwise, and these values are summed within the group. The result is the total quantity of each item entering and leaving the warehouse on each day.
A5: periods() generates the sequence of dates.
A6: loop over the groups.
B6: P.align(A:x,y) aligns the records of P to A through the associated fields x and y: y is calculated for each record of P, and the record is aligned with the member of A whose x has the same value. When x and y are omitted, the current record of P is aligned with the members of A. Here the in-and-out records of the current item are aligned with the date sequence in A5.
B7: define two variables, b and c, with c serving as the initial value of the OPEN field.
B8: create a new table. STOCKID is the STOCKID of A6; the date sequence A5 is inserted in order as the new DATE field; c is the OPEN field; the ENTER field of B6 becomes the current ENTER field; b is assigned c+ENTER and used as the TOTAL field; the ISSUE field of B6 becomes the current ISSUE field; finally c is assigned b-ISSUE and used as the CLOSE field.
B9: ifn(valueExp1, valueExp2) checks whether valueExp1 is null; it returns valueExp2 if so and the value of valueExp1 otherwise. Here it fills nulls with 0.
B10: @ refers to the current cell, so each pass of the loop appends its result to this cell, which ends up holding the combined result.
Python:
import time
import pandas as pd
import numpy as np
s = time.time()
starttime = '2015-01-01'
endtime = '2015-12-31'
stock_data = pd.read_csv('C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\stocklog.csv', sep='\t')
stock_data['DATE'] = pd.to_datetime(stock_data['DATE'])
stock_data = stock_data[stock_data['DATE'] >= starttime]
stock_data = stock_data[stock_data['DATE'] …
…
Filtering on both date bounds in a single chained comparison reports an error, because the two comparisons cannot be evaluated at once, so the time range has to be cut in two steps.
Create two new fields, ENTER and ISSUE, and check whether INDICATOR is ISSUE. If so, assign the value of QUANTITY to ISSUE; if not, assign it to ENTER.
Take the four fields STOCKID, DATE, ENTER and ISSUE, group by STOCKID and DATE, and sum each group to get the daily in-and-out records of each kind of goods.
pd.date_range(starttime, endtime) generates the dates from starttime to endtime, and pd.DataFrame() turns them into a dataframe (date_df).
Group the data by STOCKID.
Create a new list that will collect the inventory status of each item.
Loop over the groups. Add a STOCKID column to date_df to get a dataframe with the two columns DATE and STOCKID, then use pd.merge(df1, df2, on, how) to join it with the group on STOCKID and DATE, which yields continuous dates.
df.fillna(0) replaces nan in the dataframe with 0.
Add three columns, OPEN, TOTAL and CLOSE, all initialized to 0.
Take the fields in the order ['STOCKID', 'DATE', 'OPEN', 'ENTER', 'TOTAL', 'ISSUE', 'CLOSE'] and get their values (of type numpy.ndarray).
Initialize open=0.
Loop over the rows of this array. The values of the OPEN, ENTER, TOTAL, ISSUE and CLOSE fields are value[2], value[3], value[4], value[5] and value[6]. Assign open to value[2], compute TOTAL=OPEN+ENTER and CLOSE=TOTAL-ISSUE, and then assign close to open so that it becomes value[2] of the next row.
Finally, convert the array back to a dataframe to get the inventory status of this item.
Append the inventory status of each item to the list created earlier.
Finally, pd.concat([df1, df2, …, dfn], ignore_index) merges these dataframes, ignoring the original indexes, to get the inventory status of all goods.
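The steps listed above can be sketched as follows. This is a reconstruction under stated assumptions, not the article's original listing: the function name daily_stock and the demo rows are hypothetical, and the row-by-row loop is replaced by a cumulative sum. The column names STOCKID, DATE, QUANTITY and INDICATOR come from the topic introduction.

```python
import pandas as pd

def daily_stock(log: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Daily OPEN/ENTER/TOTAL/ISSUE/CLOSE per STOCKID over a continuous date range."""
    df = log.copy()
    df["DATE"] = pd.to_datetime(df["DATE"])
    # Two boolean masks combined with & select the date range in one step
    df = df[(df["DATE"] >= start) & (df["DATE"] <= end)].copy()
    is_issue = df["INDICATOR"] == "ISSUE"
    df["ISSUE"] = df["QUANTITY"].where(is_issue, 0)
    df["ENTER"] = df["QUANTITY"].where(~is_issue, 0)
    daily = df.groupby(["STOCKID", "DATE"])[["ENTER", "ISSUE"]].sum()
    dates = pd.date_range(start, end, name="DATE")
    parts = []
    for stock_id, grp in daily.groupby(level="STOCKID"):
        # Reindexing on the full date range fills the missing days with 0
        g = grp.droplevel("STOCKID").reindex(dates, fill_value=0)
        g["CLOSE"] = (g["ENTER"] - g["ISSUE"]).cumsum()
        g["OPEN"] = g["CLOSE"].shift(1, fill_value=0)  # opening stock starts at 0
        g["TOTAL"] = g["OPEN"] + g["ENTER"]
        g.insert(0, "STOCKID", stock_id)
        parts.append(g.reset_index())
    cols = ["STOCKID", "DATE", "OPEN", "ENTER", "TOTAL", "ISSUE", "CLOSE"]
    return pd.concat(parts, ignore_index=True)[cols]

# Hypothetical demo data (not from stocklog.csv)
demo = pd.DataFrame({
    "STOCKID": ["S1", "S1", "S1"],
    "DATE": ["2015-01-01", "2015-01-01", "2015-01-03"],
    "QUANTITY": [10, 3, 4],
    "INDICATOR": [None, "ISSUE", "ISSUE"],
})
result = daily_stock(demo, "2015-01-01", "2015-01-03")
print(result)
```

Because CLOSE on each day equals the running total of ENTER minus ISSUE, and OPEN is the previous day's CLOSE, the identities TOTAL=OPEN+ENTER and CLOSE=TOTAL-ISSUE from the steps above hold without an explicit loop over rows.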
Results:
[Figures: esProc result, Python result]
Time consumed:
esProc  0.015
Python  0.089
6. Calculate the start and end dates of each person's duty shift
Topic introduction: the duty table records who is on duty. One person usually stays on duty for several working days before handing over to someone else. The data are as follows:
[Figure: sample rows of duty.csv]
Our aim is to calculate the start and end dates of each shift from the duty table.
esProc:
A1  =now()
A2  =file("C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\duty.csv").import@t()
A3  =A2.group@o(name)
A4  =A3.new(name,~.m(1).date:begin,~.m(-1).date:end)
A5  =interval@ms(A1,now())
This example is fairly simple.
A3: A.group(xi,…) groups the sequence or record sequence by equal values of one or more fields/expressions, and the result is a sequence of groups. @o means the data is not re-sorted when grouping; a new group starts only when the value changes.
A4: A.new(xi:Fi,…) generates a new table sequence with the same number of records as A, in which each record has the field values xi under the field names Fi. Here a new two-dimensional table is built from the groups in A3, where ~.m(1) takes the first row of each group and ~.m(-1) takes the last row of each group.
Python:
import time
import pandas as pd
import numpy as np
import random
s = time.time()
duty = pd.read_csv('C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\duty.csv', sep='\t')
name_rec = ''
start = 0
duty_list = []
for i in range(len(duty)):
    if name_rec == '':
        name_rec = duty['name'][i]
    # The name changed: close the run that ended on the previous row
    if name_rec != duty['name'][i]:
        begin = duty['date'].loc[start:i-1].values[0]
        end = duty['date'].loc[start:i-1].values[-1]
        duty_list.append([name_rec, begin, end])
        start = i
        name_rec = duty['name'][i]
# Close the final run
begin = duty['date'].loc[start:i].values[0]
end = duty['date'].loc[start:i].values[-1]
duty_list.append([name_rec, begin, end])
duty_b_e = pd.DataFrame(duty_list, columns=['name', 'begin', 'end'])
print(duty_b_e)
e = time.time()
print(e - s)
Description: the author did not find a pandas method for grouping without reordering, so this clumsy approach was used; and because the comparison throughout is with pandas, Python's built-in file I/O was not used for this problem. A brief walk through the code:
Initialize name_rec to hold the current value of the name field, start to hold the position where the current run begins, and duty_list to hold the final result.
Loop over the rows. On the first pass, assign the first name in the data to name_rec. On later passes, continue while the name equals name_rec. When it differs, take the date values at positions start to i-1, assign the first of them to begin and the last to end, append the three values name_rec, begin and end to duty_list, set start to i, update name_rec to the current name, and go on to the next pass.
Use pd.DataFrame() to generate the dataframe.
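Grouping consecutive rows without reordering can also be approximated in pandas by building a run identifier from shift() and cumsum(). This is a minimal sketch offered as an alternative, not the author's code: the function name duty_periods, the helper column name run and the demo rows are assumptions; the columns name and date follow the listing above.

```python
import pandas as pd

def duty_periods(duty: pd.DataFrame) -> pd.DataFrame:
    """First and last date of each consecutive run of the same name."""
    # A new run starts whenever the name differs from the previous row
    run_id = (duty["name"] != duty["name"].shift()).cumsum().rename("run")
    return (duty.groupby(run_id)
                .agg(name=("name", "first"),
                     begin=("date", "first"),
                     end=("date", "last"))
                .reset_index(drop=True))

# Hypothetical demo data (not from duty.csv)
demo = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Ann"],
    "date": ["2016-03-01", "2016-03-02", "2016-03-03", "2016-03-04"],
})
periods = duty_periods(demo)
print(periods)
```

Because the run identifier increases every time the name changes, a person who returns to duty later gets a separate row, which matches the behaviour of group@o in the esProc version.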
Results:
[Figures: esProc result, Python result]
Time consumed:
esProc  0.003
Python  0.020
7. Count the number of people at each grade in each event
Topic introduction: the sports table stores each person's grade (excellent, good, passing, failing) in various events (sprint, long-distance run, long jump, high jump, shot put). The data are as follows:
[Figure: sample rows of sports.csv]
Our aim is to count the number of people at each grade in each event.
esProc:
A1  =now()
A2  =file("C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\sports.csv").import@t()
A3  =[]
A4  for A2.fname().to(2,)    B4  =A2.group(${A4})
                             B5  =B4.new(A4:subject,~.${A4}:mark,~.count():count)
                             B6  >A3=A3|B5
A7  =A3.pivot(subject;mark,count)
A8  =interval@ms(A1,now())
A3: initialize an empty sequence to collect the statistics.
A4: the first field of the sports table is name, so it is skipped. Loop over the field of each event.
B4: group by the field of the current loop pass.
B5: create a new table with the field name as the value of the subject field, the grade of each group as the mark field, and the number of members in the group as the count field.
B6: append the result for each event to A3.
A7: A.pivot(g,…;F,V;Ni:N'i,…) groups by the fields/expressions g and, within each group, converts the data held in the field columns F and V into data with Ni and N'i as field columns, which transposes rows into columns. Ni defaults to the distinct field values in F, and N'i defaults to Ni. The result is a pivot table.
Python:
import time
import pandas as pd
import numpy as np
s = time.time()
sports = pd.read_csv('C:\\Users\\Sean\\Desktop\\kaggle_data\\music_project_data\\sports.csv', sep='\t')
subject_mark_cnt_list = []
# Skip the first column (name) and count the people per grade for each event
for col in sports.columns[1:]:
    sports_g = sports.groupby(by=col, as_index=False).count()[[col, 'name']]
    sports_g.rename(columns={'name': 'count', col: 'mark'}, inplace=True)
    sports_g['subject'] = col
    subject_mark_cnt_list.append(sports_g)
subject_mark_cnt = pd.concat(subject_mark_cnt_list, ignore_index=True)
subject_mark_count = pd.pivot_table(subject_mark_cnt[['subject', 'mark', 'count']], index=['subject'], columns='mark', values='count')
print(subject_mark_count)
e = time.time()
print(e - s)
Initialize subject_mark_cnt_list to collect the results of each loop pass.
Loop over all fields except the first.
df.groupby() groups by the current field and counts the members of each group; only the current col field and the name field are kept.
df.rename(columns={}) changes the column names of this dataframe.
Add a new column subject and assign it the current value of col.
Append this dataframe to subject_mark_cnt_list.
pd.concat() joins the dataframes in the list into a new dataframe.
pd.pivot_table(data, index, columns, values) turns it into a pivot table.
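The loop, concat and pivot steps above can also be sketched with melt and crosstab. This is an alternative under stated assumptions, not the article's code: the function name level_counts, the event column names and the demo rows are hypothetical; only the name column is taken from the listing above.

```python
import pandas as pd

def level_counts(sports: pd.DataFrame) -> pd.DataFrame:
    """Number of people per grade (columns) for each event (rows)."""
    # Reshape to one row per (person, event) pair
    long = sports.melt(id_vars="name", var_name="subject", value_name="mark")
    # Cross-tabulate events against grades
    return pd.crosstab(long["subject"], long["mark"])

# Hypothetical demo data (not from sports.csv)
demo = pd.DataFrame({
    "name": ["a", "b", "c"],
    "sprint": ["excellent", "good", "good"],
    "long_jump": ["good", "failing", "good"],
})
table = level_counts(demo)
print(table)
```

One difference from the pivot_table version: crosstab fills combinations that never occur with 0, whereas pivot_table leaves them as nan.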
Results:
[Figures: esProc result, Python result]
Time consumed:
esProc  0.004
Python  0.083
Summary: this section worked through some common problems found online, several of which compute field values dynamically and assign them on the fly. esProc supports this very well, which greatly simplifies the code. In the pandas versions shown here this is not available, which makes things more troublesome: the ~ of esProc stands for the current record and saves writing a loop statement (although it is still a loop underneath), while the Python code has to use explicit loops. In addition, the merge function in pandas has no direct difference (subtraction) operation, which makes the fourth example particularly cumbersome. A pandas dataframe is stored by column, so looping over it row by row is awkward.
Sales.csv
Contract.txt
Client.txt
SalesRecord.txt
Old.csv
New.csv
Stocklog.csv
Duty.csv
Sports.csv