In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-04-11 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/03 Report--
1. Small dimensional table
Using SPL for association, when the dimension table is small, it can be read into memory.
1.1 single field primary key
If the dimension table is a single-field primary key, you can use switch as a join. For example, three tables of orders, customers and employees are stored in the set file, and the table structure is as follows:
OrderCustomerEmployeeorderIDcustomerIDemployeeIDcustomerIDnamenameemployeeIDCitytitle. Country
……
Now associate the order table with the customer table and the employee table:
A1=file ("order.btx"). Cursor@b () 2=file ("customer.btx"). Import@b () 3=file ("employee.btx"). Import@b () 4=A1.switch (customerID,A2: customerID;employeeID,A3: employeeID) 5=A4.new (orderID, customerID.name, employeeID.name:name)
A1: the order table has a lot of data, so cursors are used.
A2: the customer table has less data, all loaded into memory, and indexed.
A3: similarly, employee tables also do dimension table memory.
A4: associate with switch, associate order table and customer table according to customer ID field, and associate order table and employee table according to employeeID field.
As you can see from A4, switch can handle multiple associated calculations at a time.
You can also use the serial number to locate when the primary key of the dimension table is the serial number.
A1=file ("order.btx"). Cursor@b () 2=file ("customer.btx"). Import@b (). Index () 3=file ("employee.btx"). Import@b () 4=A1.switch (customerID,A2: customerID; employeeID D Magazine A3 2=file #) 5=A4.new (orderID, customerID.name, employeeID.name:name)
A5: the employeeID field of the employee table is a natural number starting at 1, so you can do foreign key serialization.
If the primary key of the dimension table is not an ordinal value, you cannot directly use foreign key serialization for performance optimization. For example, the value of the customerID field is made up of letters. At this point, you can convert the primary key of the dimension table to a serial number and then use a foreign key to serialize it.
First, convert the customer ID of the customer table to the serial number:
A1=file ("customer.btx") .import@b () 2=A1.derive (#: newCustomerID) 3=file ("newAndOldCustomerID.btx") .export @ b (A2, newCustomerID, customerID) 4=file ("newCustomer.btx") .export @ b (A2, newCustomerID: customerID, name,city)
The serialized customers are saved to the set file newCustomer.btx. Among them, what is saved in newAndOldCustomerID.btx is the corresponding relationship between new and old customer ID.
Then serialize the customerID of the order table:
A1=file ("newAndOldCustomerID.btx"). Import@b () 2=file ("order.btx"). Cursor@b () 3=A2.switch (customerID,A1: customerID) 4=A3.run (customerID. NewCustomerID: customerID) 5=file ("newOrder.btx") .export @ ba (A4)
The serialized order is saved to the set file order .btx.
At this point, the customerID field can also be concatenated by serialization.
A1=connect ("demo") 2=file ("newOrder.btx"). Cursor@b () 3=file ("newCustomer.btx"). Import@b () 4=file ("employee.btx"). Import@b () 5=A2.switch (customerID,A3:#; employeeID,A4:#) 6=A5.new (orderID, customerID.name,employeeID.name:name) 1.2Multi-field primary key
Use join to join when the primary key of a dimension table is multiple fields. For example, there are student table (Students) and class table (Classes). The major number and class number of the student table are foreign key fields, pointing to the joint primary key (professional number and class number) of the class table, respectively. The table structure is as follows:
StudentClassstudentIdmajorIdnameclassIdmajorIdteacherclassId
Now we need to check the student's student number, name, major, class and head teacher:
A1=file ("student.btx"). Import@b () 2=file ("class.btx"). Import@b (). Keys (majorId,classId) 3=A1.join (majorId:classId,A2,teacher)
A2: import class data and set the primary keys to majorId and classId
The A3:join () function carries on the primary key association of the two fields and adds the head teacher information to the student information.
two。 Large dimensional table
If the dimension table cannot be loaded into memory, and the fact table can be loaded into memory, you can use the joinx function to associate it. At this time, the dimension table should be stored in an orderly manner according to the primary key, and the file or group table can be segmented, and the latter is more efficient.
For example, the return table and the product table are stored in the set file, and the table structure is as follows:
ReturnsProductorderIDproductIDproductIDnamepricepricequantitycategorydate.
Here, the return table is unordered for the associated field producID, and the product table is ordered according to the producID field. The calculation goal is to obtain the product category of each return record, and you need to associate the return form with the product table:
A1=file ("returns.btx") .import@b () 2=file ("product.btx") 3=A1.joinx@q (productID,A2:productID,categroy: categroy;) 4=A3.fetch
A1: load the return table into memory
A2: the file object that gives the product table
A3: use the joinx function for association.
If the fact table is also ordered for the associated fields, you can add @ c to speed up further. For example, to calculate the customer ID for each return record, associate the return form with the order form:
A1=file ("returns.btx") .import@b () 2=file ("order.btx") 3=A1.joinx@qc (orderID,A2:orderID,customerID: customerID;) 4=A3.fetch
In fact, the above two examples can be done at once:
A1=file ("returns.btx") .import@b () 2=file ("order.btx") 3=file ("product.btx") 4=A1.joinx@qc (orderID,A2:orderID,customerID:customerID; productID,A3:productID,category: category;) 5=A4.fetch
A4: two associations have been made. The return form is first associated with the order table, and then the result is associated with the product table. Here, the return table is ordered for the associated field orderID, so you can add @ c, but write it first.
Cursors can be used in cases where the fact table cannot be loaded into memory. For example, the table structure with order details is as follows:
OrderDetailsorderIDproductIDpricequantitydate...
The order details are stored in the group table. To calculate the types of products sold in a certain month, you need to associate the monthly data of the order schedule with the product table:
A1=file ("orderDetails.ctx") .create () .cursor (; year (date) = = 2018 years & month (date) = = 11) 2=file ("product.btx") 3=A1.joinx@q (productID,A2:productID,categroy: categroy;) 4=A3.fetch ()
A1: the order details for November 2018 still cannot be loaded into memory and are accessed using cursors.
A2: the file object that gives the product table
A3: use the joinx function for association.
Joinx supports situations where the fact table is a cursor, but it should not be too large, otherwise it will be less efficient.
3. Same dimensional master child table
Joinx can be used to implement orderly merge joins for primary and child tables that are sequentially stored by primary key. The order table and order schedule are already ordered for the order ID field, and the total consumption of each customer is calculated:
A1=file ("order.btx"). Cursor@b () 2=file ("orderDetail.btx"). Cursor@b () 3=joinx (A1 cursor@b ordermand orderlemen A2: detail,orderID) 4=A3.groups (order.customerID:customerID;sum (detail.price*detail.quantity): amount)
Orderly merging can also be used with cursor foreign keys, such as calculating the name of a customer whose total consumption is greater than 1000:
A1=file ("order.btx"). Cursor@b () 2=file ("orderDetail.btx"). Cursor@b () 3=file ("customer.btx"). Import@b () 4=A1.switch@i (customerID, A3: customerID) 5=joinx
= A5.groups (order.customerID.name:customer
Sum (detail.price*detail.quantity): amount) .select (amount > 100000)
If the master child table is not ordered by holding down the key, you must sort in advance before you can use this method.
4. Parallel merging
For primary subgroup tables of the same dimension, they can be associated in parallel. For example, in the two group tables, order and order details, the segmented fields are both order ID. Calculate the name of the customer whose total consumption is greater than 1000, and still use joinx for association:
A1=file ("order.ctx"). Create (). Cursor@m (;; 4) 2=file ("orderDetail.ctx"). Create () .cursor (;; A1) 3=joinx (A1) 3=joinx (A1) 4=A3.groups (order.customerID:customer; sum (detail.price*detail.quantity): amount) .select (amount > 100000)
A1: the multipath cursor to get the order. The number of cursors used here is 4.
A2: according to the value of the multipath cursor A1, the multipath cursor of the order details is obtained, and the number of cursor paths is also 4.
The actual test results show that the speed is about 2.5 times faster after using 4-thread parallelism.
It is important to note that the segmentation field is specified when the group table is generated:
A1=file ("order.ctx"). Create (# orderID,customerID,employeeID;orderID) 2=file ("orderDetail.ctx"). Create (# orderID,productID,price,quantity,date;orderID)
The appended data will be segmented by the orderID field, and records with the same value of orderID will not be divided into two segments.
5. Master and child table in one
The group table supports the master child table to be saved in the same file, and the performance of extracting the associated data will be higher after merging. For example, you can save the order as the main table to the group table file order.ctx, and then create a schedule on the main table, named detail, and save the order details to this schedule. At this time, you can calculate the names of customers whose total consumption is greater than 1000, just like this:
A1=file ("order.ctx") .create () .attach (detail) 2=A1.cursor (orderID,customerID,price,quantity) 3=A2.groups (customerID:customer; sum (price*quantity): amount) .select (amount > 100000)
A1: open the attached order details
A2: create schedule cursors
A3: group and calculate the total consumption.
This method also supports parallelism, as long as you modify A2 slightly:
A1=file ("order.ctx") .create () .attach (detail) 2=A1.cursor@m (orderID,customerID,price,quantity;;4) 3=A2.groups (customerID:customer; sum (price*quantity): amount) .select (amount > 100000)
A2: multipath Vernier for establishing schedule, the number of paths is 4
According to the actual test results, using schedules is faster than using joinx. The main sub-table is the relationship of 1Val N, and the greater the N is, the greater the read speed is; when the primary key of the primary table is multi-field, the read speed is improved more obviously.
When the order is 100 million, each corresponding to about 10 order details, the actual test results of this case:
Time consuming (seconds) 2 group tables joinx group tables combined with one set of tables in parallel (4 threads) 781602368
After using the master and child table, not only the speed is optimized, but also the space is smaller. The combined group table saves about 20% of the hard disk space than the two group tables.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.