Performance optimization of big data computing
Performance optimization of big data analysis ultimately optimizes one thing: for a given computing task (the data is fixed and the required result is fixed), obtain the result with the most economical scheme.
This most economical solution mainly considers three costs: time cost, hardware cost and software cost.
Time cost: the maximum tolerable time varies with the characteristics of the computing task. Tasks with real-time (T+0) requirements are demanding; if the result only comes out at T+1, it is meaningless.
Hardware cost: for a company, the available hardware resources generally do not change often; the machines and clusters are what they are. Even with cloud products, the gain is only more flexible scaling, not a lower cost.
Software cost: the labor cost of writing the algorithm plus the cost of the software environment. This cost is related to the first two: the coarser the program's control over the computation and the simpler its logic, the easier the program is to write and the lower the software cost, with the side effect of longer running time or the need for more expensive hardware.
Among these three factors, a computing task is generally the faster the better, though as long as it is not slower than the tolerable time it is still a meaningful calculation; hardware offers little flexibility, since the available resources are more or less fixed; so what remains is the software cost. Within the software cost, programmer salary is a major item, so whether there is a convenient software environment that lets programmers describe the computation efficiently becomes the key. The most typical example: in theory you can write any program in assembly, but for a routine calculation it is obviously not as easy as SQL or Java.
Speaking of SQL and Java, some maintainers of large computing centers may frown: the longer they use them, the more they feel the pain of changing requirements or optimizing algorithms. The algorithm's process is perfectly clear, yet it is hard to turn into a runnable program. The difficulty mainly comes from two aspects:
First, some basic data-manipulation methods are accumulated gradually by individuals without an overall, optimized design. These personal tools can raise personal development efficiency, but they are neither general nor comprehensive. This difficulty mainly shows up in UDFs implemented in high-level languages such as Java.
Second, and mainly a matter of mindset: we are used to SQL queries in production scenarios, so when performance problems appear in computing scenarios we naturally try to relieve them by optimizing SQL statements. In fact this can be a boiling-frog process: the deeper you go, the more likely a simple procedural problem turns into a huge, inseparable block of logic that in the end only the original author or a master dares to touch. As a veteran programmer, I heard through the grapevine when I entered the profession more than a decade ago that Oracle administrators, especially those capable of performance optimization, were paid far more than ordinary programmers. Evidently this problem already stood out ten years ago, when data volumes were still relatively small.
(Note: production scenarios and computing scenarios are usually hard to separate completely in the early stage of a software system. As data accumulates from production, computing requirements gradually grow, and the computing center and data warehouse gradually become independent. If this qualitative change brought on by quantitative change is not met with a change of thinking and new methods, the result is the boiled frog.)
To save readers' time, we first summarize the common means of performance optimization, so that users who need them can check the items one by one against their own practice.
1. Load and compute only the relevant data
Store data in columnar form.
Store frequently used fields and rarely used fields separately.
Use independent dimension tables to hold dimension attributes, reducing redundancy in the fact table.
Store data separately by fields that are commonly used as query conditions, such as year, gender, region, etc.
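As a rough illustration of the storage ideas above (not the aggregator's own storage format), the Python sketch below splits a row-oriented CSV into one file per column and then loads only the columns a calculation needs; the file names and field names are hypothetical.

```python
# Sketch: store each column in its own file so a query loads only what it needs.
# Assumes a hypothetical events.csv with columns: user_id,event_id,time,attrs
import csv
import pickle

def split_into_columns(src_csv, prefix):
    columns = {}
    with open(src_csv, newline="") as f:
        for row in csv.DictReader(f):
            for name, value in row.items():
                columns.setdefault(name, []).append(value)
    for name, values in columns.items():
        with open(f"{prefix}.{name}.col", "wb") as out:
            pickle.dump(values, out)          # one file per column

def load_columns(prefix, names):
    # Load only the columns the calculation actually uses.
    result = {}
    for name in names:
        with open(f"{prefix}.{name}.col", "rb") as f:
            result[name] = pickle.load(f)
    return result

# Usage: split_into_columns("events.csv", "events")
#        cols = load_columns("events", ["user_id", "event_id"])
```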
2. Simplify the data involved in computing
When used for analysis, lengthy identifiers can be sequence-numbered: replace TJ001235-078, TJ001235-079, … with 1, 2, …. This speeds up both loading and computation.
If a date-time is stored as a string in the familiar format (2011-03-08), it is slow to load and compute. Such a date can instead be stored as a numeric value like 110308, or as milliseconds relative to a chosen start time (for example, relative to the earliest data point 2010-01-01).
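A minimal Python sketch of both simplifications, using hypothetical raw values; the sequence-number mapping would be kept as a small dimension file so the original identifiers remain recoverable.

```python
# Sketch: replace long string keys with sequence numbers and dates with ms offsets.
from datetime import datetime

EPOCH = datetime(2010, 1, 1)                 # assumed start time for offsets

def build_seq_map(raw_ids):
    # "TJ001235-078" -> 1, "TJ001235-079" -> 2, ...
    return {raw: i + 1 for i, raw in enumerate(sorted(set(raw_ids)))}

def to_ms_offset(text_date):
    # "2011-03-08" -> milliseconds since 2010-01-01
    return int((datetime.strptime(text_date, "%Y-%m-%d") - EPOCH).total_seconds() * 1000)

ids = ["TJ001235-078", "TJ001235-079", "TJ001235-078"]
seq = build_seq_map(ids)
print([seq[x] for x in ids])                 # [1, 2, 1]
print(to_ms_offset("2011-03-08"))            # 37238400000 (431 days)
```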
3. Optimize the algorithm
Put the conditions that are cheap to evaluate first, e.g. a boolean check before a string search, so that unqualified data can be excluded with the least computation.
Reduce the number of passes over the event fact table. Concretely: during one traversal, feed the data to several independent operations at the same time (the channel, or pipeline, concept in the aggregator mentioned later) instead of traversing the data once per operation; when doing a JOIN, look up the fact-table rows against an in-memory dimension table rather than traversing the fact table once for each dimension-table row.
Use hash indexes, binary search, and direct addressing by sequence number to speed up lookups.
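The lookup techniques in the last item can be sketched in Python as follows; the user records and counts are made up for illustration.

```python
# Sketch: three ways to find a user record quickly, assuming users are known in advance.
from bisect import bisect_left

users = [(i, f"user-{i}") for i in range(1, 8001)]     # (user_seq, payload), sorted by key

by_hash = {uid: rec for uid, rec in users}             # hash index: O(1) average
sorted_keys = [uid for uid, _ in users]                # binary search: O(log n)
by_position = [None] + [rec for _, rec in users]       # direct addressing by sequence number: O(1)

def find_binary(uid):
    i = bisect_left(sorted_keys, uid)
    return users[i][1] if i < len(sorted_keys) and sorted_keys[i] == uid else None

assert by_hash[4321] == find_binary(4321) == by_position[4321]
```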
4. Parallel computing
Both data loading and computation can be parallelized. Based on the relative weight of loading versus computation in the task, judge whether the bottleneck is the disk or the CPU: a disk array suits parallel loading, and a multi-core CPU suits parallel computation.
For parallel tasks on a multi-machine cluster, the communication between the main program and the subprograms must be considered: complex computation should be completed as independently as possible on the node machines, because network transfer is slow and data exchange between nodes should be minimized.
Practical effect
With the above guidelines in mind, we now turn to the main topic, optimizing the funnel calculation, and look at the actual effect of each optimization.
1. Start directly, without any optimization
Data:
Program: (1-First.dfx in the attachment, together with its test data file, can be executed directly in the aggregator)
The logical details of the core code of funnel conversion calculation were described in detail in the previous article, so I won't repeat them here.
Results: (note: the later tests all use data sets that are multiples of this 1.18-million-row base.)
1.18 million records / 70MB / 8,000 users / 31 seconds
5.9 million records / 350MB / 40,000 users / 787 seconds
Analysis:
The data volume grew 5 times, but the elapsed time grew 26 times: performance degrades far worse than linearly. The reason is that the user list being searched grew 5 times and the number of records being analyzed also grew 5 times, so the work of looking up users grew roughly 5×5 = 25 times. Next, the following optimization is applied ↓
2. Keep the user list ordered by user ID and find users by binary search. Program: (2-BinarySearch.dfx)
B12 adds the @b option to find, so the lookup is done by binary search; in D13, removing insert's leading position parameter 0 means a new user is no longer appended at the end but inserted in primary-key order. (A Python sketch of the same idea follows the code grid below.)
A11: ……
A12: for A11
B12: >user=A10.find@b(A12.UserID)
B13: if user==null
C13: if A12.EventID==events(1)
D13: >A10.insert(, A12.UserID:UserID, 1:maxLen, [[A12.Time,1]]:seqs)
A14: ……
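The grid above is the article's SPL; below is a rough Python sketch of the same step-2 idea, keeping the user list sorted by user ID, finding users by binary search, and inserting new users in key order. The record shape [user_id, maxLen, seqs] follows the grid; everything else is assumed.

```python
# Sketch of step 2: ordered user list + binary-search lookup + in-order insert.
from bisect import bisect_left

user_list = []          # records shaped like the grid above: [user_id, maxLen, seqs]
user_keys = []          # parallel, sorted list of user IDs for binary search

def find_user(uid):
    i = bisect_left(user_keys, uid)
    if i < len(user_keys) and user_keys[i] == uid:
        return user_list[i]
    return None

def insert_user(uid, time_ms):
    # Only create a record when the funnel's first event is seen (as in C13/D13).
    record = [uid, 1, [[time_ms, 1]]]
    i = bisect_left(user_keys, uid)
    user_keys.insert(i, uid)
    user_list.insert(i, record)

# Usage: if find_user(42) is None: insert_user(42, 1000)
```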
Results:
1.18 million records / 70MB / 8,000 users / 10 seconds
5.9 million records / 350MB / 40,000 users / 47 seconds
Analysis:
After this optimization, the 1× data set takes about 1/3 of the previous time and the 5× data set about 1/16: a clear speedup. Looking further, the 5× data set is 350MB, and even a slow hard disk loads data at around 100MB/s, so if the CPU were fast enough the limit would be roughly 4 seconds. The observed 47 seconds shows that CPU time is still the dominant cost, and experience says we can continue optimizing ↓
3. Read cursor data in batches. Program: (3-BatchReadFromCursor.dfx)
Rows 12-17 are cut as a whole and moved one column to the right, and a loop that loads cursor data in batches is added in A12: it fetches 10,000 rows at a time from the cursor in A11, and B12 loops over the 10,000 rows just fetched.
A11: ……
A12: for A11,10000
B12: for A12
C12: ……
A13: ……
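A rough Python sketch of the batch-reading pattern, with a plain text file standing in for the SPL cursor; the batch size of 10,000 mirrors "for A11,10000".

```python
# Sketch of step 3: pull rows from a cursor-like reader in batches of 10,000
# instead of one at a time, then loop over each batch in memory.
from itertools import islice

def batches(reader, size=10000):
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            break
        yield chunk

def process(path):
    with open(path) as f:
        for chunk in batches(f, 10000):     # like "for A11,10000"
            for line in chunk:              # like "for A12"
                pass                        # funnel logic per record goes here
```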
Results:
1.18 million records / 70MB / 8,000 users / 4 seconds
5.9 million records / 350MB / 40,000 users / 10 seconds
59 million records / 3.5GB / 400,000 users / 132 seconds
118 million records / 7GB / 800,000 users / 327 seconds
Analysis:
After this optimization, the 1× data set takes about 2/5 of the previous time and the 5× data set about 1/5, and the newly tested 50× and 100× data sets scale roughly linearly with data volume. Note that the original data still contains fields that are not needed, and the fields that are used can be further simplified by sequence-numbering; the simplified file becomes several times smaller, which reduces the time spent reading from disk. The specific approach is as follows ↓
4. Streamline the data. Idea:
Look at the original data first. The user ID can be replaced by a sequence number starting from 1, which not only saves a little storage but also lets later computation locate a user directly by sequence number, eliminating search time. The time field duplicates the year/month/day fields, so the latter can be dropped, and the long-integer time can be further reduced to milliseconds relative to the start time 2017-01-01. We know in advance that there are only 10 events, so the event ID and event name can be extracted into a dimension table, and the fact table only needs the numbered event ID (1, 2, 3 … 10). The event attributes are in JSON format with few variants, so for a given event the attribute values can be stored as a sequence in which position identifies the attribute, which both shrinks storage and speeds up attribute access.
Besides simplifying the field values, we also change the storage format from text to the aggregator's binary format, which gives smaller files and faster loading. The simplified fact table contains only these sequence-numbered and offset fields.
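A rough Python sketch of this streamlining pass, under assumed field layouts and with pickle standing in for the aggregator's binary format; the raw formats and helper names are hypothetical.

```python
# Sketch of step 4: rewrite the fact table with sequence-numbered keys,
# millisecond offsets, and positional attribute values, then save in a binary format.
import json
import pickle
from datetime import datetime

EPOCH = datetime(2017, 1, 1)
user_seq = {}                      # raw user ID -> 1, 2, 3, ...
event_seq = {}                     # raw event ID -> 1..10
event_attr_names = {}              # event seq -> ordered list of attribute names

def simplify(raw_rows):
    out = []
    for user_id, event_id, time_str, attr_json in raw_rows:
        u = user_seq.setdefault(user_id, len(user_seq) + 1)
        e = event_seq.setdefault(event_id, len(event_seq) + 1)
        ms = int((datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S") - EPOCH)
                 .total_seconds() * 1000)
        attrs = json.loads(attr_json or "{}")
        names = event_attr_names.setdefault(e, sorted(attrs))
        out.append((u, e, ms, [attrs.get(n) for n in names]))   # positional attributes
    return out

rows = simplify([("U1001", "EV_PAY", "2017-01-02 08:30:00", '{"brand": "Apple"}')])
with open("fact.bin", "wb") as f:
    pickle.dump(rows, f)           # pickle stands in for the aggregator's .bin format
```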
Implementation:
Before simplifying the fact-table data, we first run genDims.dfx to generate the user table and the event table from the fact table; after it runs, user.bin and event.bin are produced:
A1: >beginTime=now()
A2: >fPath="e:/ldsj/demo/"
A3: =file(fPath+"src-11800.txt").cursor@t()
A4: =channel(A3)
A5: =A4.groups(#1:UserID)
A6: =A3.groups(#3:EventID, #4:EventName; iterate(~&#5.import@j().fname()):AttributeName)
A7: =file(fPath+"event.bin").export@b(A6)
A8: =file(fPath+"user.bin").export@b(A4.result())
A9: =interval@s(beginTime,now())
The dimension-table extraction itself can still be optimized. The conventional way to extract two dimension tables is to traverse the data once per table; reading that much data from disk is slow, while the computation after reading is tiny. Is there a way to serve several independent calculations in a single read? The answer is the channel (pipeline): define one more channel and you get one more concurrent operation. A4 defines a channel on the cursor A3, A5 defines a grouping over the channel A4, A6 defines another grouping directly on A3, A7 exports the result of A6, and A8 exports the result of the A4 channel. The two resulting dimension tables are the user table and the event table.
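A rough Python sketch of the single-pass idea: one traversal of the source feeds both the user table and the event table, the way the channel lets A5 and A6 share one read of A3. The CSV layout and field names are assumptions.

```python
# Sketch of the channel idea: one pass over the fact data feeds two
# independent aggregations (user table and event table) at the same time.
import csv
import json

def build_dims(src_csv):
    users = set()
    events = {}                                    # event_id -> (event_name, attr_names)
    with open(src_csv, newline="") as f:
        for row in csv.DictReader(f):              # single traversal of the source
            users.add(row["user_id"])              # "channel" 1: user table
            attrs = json.loads(row["attrs"] or "{}")
            name, names = events.get(row["event_id"], (row["event_name"], set()))
            events[row["event_id"]] = (name, names | attrs.keys())
    return sorted(users), events

# Usage: user_table, event_table = build_dims("src-11800.txt")
```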
The fact table is then simplified (toSeq.dfx) with the help of these two dimension tables. After simplification, the 6.8GB text file becomes a 1.9GB binary file, about 3.5 times smaller.
A1: >beginTime=now()
A2: >fPath="e:/ldsj/demo/"
A3: =file(fPath+"src-11800.txt").cursor@t()
A4: =file(fPath+"event.bin").import@b()
C4: =A4.(EventID)
D4: =A4.(AttributeName)
A5: =file(fPath+"user.bin").import@b()
C5: =A5.(UserID)
A7: func
B8: =A7.import@j()
B9: =[]
B10: for B7
C11: >B9.insert(0, eval("B8."+B10))
B12: return B9
A14: for A3,10000
B15: =A14.new(C5.pos@b(UserID):UserID, C4.pos@b(EventID):EventID, Time:Time, func(A7,EventAttribute,D4(C4.pos@b(EventID))):EventAttribute)
B16: =file(fPath+"src-11800.bin").export@ab(B15)
A17: =interval@s(beginTime,now())
A new construct appears in this code: rows 7-12 define a function that handles the JSON-format event attributes, and it is called for every row being condensed in B15. B16 appends each batch of 10,000 streamlined records to the same binary file.
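A Python sketch of what that helper does, under the assumption that the attribute-name list comes from the event dimension table: it turns the JSON attribute string into a list of values whose position identifies the attribute.

```python
# Sketch of the rows 7-12 helper: turn a JSON attribute string into a list of
# values ordered by the event's attribute-name list, so position encodes meaning.
import json

def attrs_to_sequence(attr_json, attr_names):
    record = json.loads(attr_json or "{}")
    return [record.get(name) for name in attr_names]

print(attrs_to_sequence('{"brand": "Apple", "price": 6000}', ["brand", "price"]))
# ['Apple', 6000]
```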
Program: (4-Reduced.dfx)
The following cells are modified relative to the previous program:
A3/A4: the time window is now expressed relative to 2017-01-01.
A6: the event sequence now uses sequence numbers.
A7: the attribute filter replaces the earlier, inefficient fuzzy string matching with exact value matching; A10 initializes a user sequence whose length is the number of users, so a position in the sequence represents a user's serial number.
C12: finds a user directly by serial number.
E13: stores a new user by serial number. (A Python sketch of this direct-indexing idea follows the code grid below.)
A2: ……
A3: >begin=interval@ms(date(2017,1,1), date(2017,1,1))
A4: >end=interval@ms(date(2017,1,1), date(2017,3,1))
A5: >dateWindow=10*24*60*60*1000
A6: >events=[3,4,6,7]
A7: >filter="if(EventID!=4 || (EventAttribute.len()>0 && EventAttribute(1)==\"Apple\");true)"
A9: /start execution of the funnel conversion calculation
A10: =to(802060).(null)
A11: =file(dataFile).cursor@b().select(Time>=begin && Time<end && events.pos(EventID)>0 && ${filter})
A12: for A11,10000
B12: for A12
C12: >user=A10(B12.UserID)
C13: if user==null
D13: if B12.EventID==events(1)
E13: >A10(B12.UserID)=[B12.UserID, 1, [[B12.Time,1]]]
A14: ……
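A rough Python sketch of the direct-indexing idea used in A10/C12/E13: users are numbered 1..N, so a preallocated list indexed by serial number replaces any search. The funnel update itself is only hinted at.

```python
# Sketch of step 4's lookup: users are numbered 1..N, so a preallocated list
# indexed by the serial number replaces binary search entirely.
USER_COUNT = 802060                  # matches =to(802060).(null) in A10

users = [None] * (USER_COUNT + 1)    # slot 0 unused; slot i belongs to user i

def touch(user_id, event_id, time_ms, first_event_id):
    rec = users[user_id]
    if rec is None:
        if event_id == first_event_id:             # only start a funnel on its first event
            users[user_id] = [user_id, 1, [[time_ms, 1]]]
        return
    # ... update rec for later funnel events here ...
```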
Results:
118 million records / 1.93GB / 800,000 users / 225 seconds
Analysis:
After this optimization, the 100× data set takes about 2/3 of the time of the previous step. Besides trimming the fields involved in the query, there is another way to effectively reduce the amount of data read; let's look at it next ↓
5. Split and store the data in advance, and load only the data involved in the calculation. Idea:
How to split the data depends on the characteristics of the queries. In this example the queried time period varies from run to run, so splitting by date is appropriate, whereas splitting by event ID would be pointless.
Program for splitting the data (splitData.dfx):
A4 fetches 100,000 rows at a time; B4 loops over the 60 days; C6 selects the rows belonging to each date, and C9 appends them to that date's file. (A Python sketch of the date split appears after the code grid below.)
A1: =dataFile=file("e:/ldsj/demo/src-11800.bin").cursor@b()
A2: >destFolder="e:/ldsj/demo/dates/"
A3: >oneDay=24*60*60*1000
A4: for A1,100000
B4: for 60
C4: >begin=long(B4-1)*oneDay
C5: >end=long(B4)*oneDay
C6: =A4.select(Time>=begin && Time<end)
C8: >fileName=string(date(long(date(2017,1,1))+begin),"yyyyMMdd")+".bin"
C9: =file(destFolder+fileName).export@ab(C6)
After execution, 59 daily data files are generated.
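A rough Python sketch of the date split, assuming records already carry times in milliseconds since 2017-01-01 as produced in step 4; file layout and record shape are assumptions.

```python
# Sketch of the date split: append each record to the file of its own day,
# so a later query loads only the days it needs. Times are ms since 2017-01-01.
import pickle
from collections import defaultdict
from datetime import date, timedelta

ONE_DAY_MS = 24 * 60 * 60 * 1000
START = date(2017, 1, 1)

def split_by_date(rows, dest_folder):
    buckets = defaultdict(list)
    for row in rows:                               # row = (user, event, time_ms, attrs)
        day = START + timedelta(days=row[2] // ONE_DAY_MS)
        buckets[day.strftime("%Y%m%d")].append(row)
    for name, rows_of_day in buckets.items():
        with open(f"{dest_folder}/{name}.bin", "ab") as f:   # append, like export@ab
            pickle.dump(rows_of_day, f)
```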
Program: (5-SplitData.dfx)
A2: the file to be analyzed becomes a directory.
A3/A4: the begin and end conditions change from a date field being queried to the date files being searched.
A11: sorts the date files in the directory, selects the date files to be analyzed, combines them into one cursor, and then filters the events.
A1: ……
A2: >fPath="e:/ldsj/demo/dates/"
A3: >begin="20170201.bin"
A4: >end="20170205.bin"
A5: ……
A11: =directory(fPath+"2017*").sort().select(~>=begin && ~<=end).(file(fPath+~).cursor@b()).conjx().select(events.pos(EventID)>0 && ${filter})
A12: ……
Results:
The target data is 2017-02-01 through 2017-02-05. A full scan of all the data takes 168 seconds; scanning only the 5 relevant files gives the same result in 7 seconds, a dramatic improvement. So far both reading and computing are single-threaded, so next we try parallel computing ↓
6. Parallel computing: single-threaded loading, multi-threaded calculation. Program: (6-mulit-calc.dfx)
Column B is added: B12 starts four threads to process the 100,000 rows loaded in A12, C12 splits them into four groups by UserID%4, and each group is handed to one of the four threads. (A Python sketch of this pattern follows the code grid below.)
A11: ……
A12: for A11,100000
B12: fork to(4)
C12: for A12.select(UserID%4==B12-1)
A13: ……
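A rough Python sketch of the single-loader, four-worker pattern; the funnel logic is only a placeholder, and note that CPU-bound Python code would normally use processes rather than threads.

```python
# Sketch of step 6: one loader feeds batches; each batch is split by
# user_id % 4 and handed to four workers (the article uses 4 SPL threads).
from concurrent.futures import ThreadPoolExecutor

def calc_part(rows):
    # funnel logic over one user group; returns a partial result
    return len(rows)                     # placeholder

def process_batches(batches, workers=4):
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batches:            # like "for A11,100000"
            groups = [[r for r in batch if r[0] % workers == k] for k in range(workers)]
            for part in pool.map(calc_part, groups):
                total += part
    return total
# Note: for CPU-bound Python code a process pool would be the realistic choice.
```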
Results:
118 million records / 1.93GB / 800,000 users / 4 threads / 100,000 rows per read / 262 seconds
118 million records / 1.93GB / 800,000 users / 4 threads / 400,000 rows per read / 161 seconds
118 million records / 1.93GB / 800,000 users / 4 threads / 800,000 rows per read / 233 seconds
118 million records / 1.93GB / 800,000 users / 4 threads / 4,000,000 rows per read / 256 seconds
Analysis:
The test machine has a single mechanical hard disk, so loading data is the bottleneck and the speedup from multithreading is limited. Even so, the batch size read per pass makes a noticeable difference: performance is best when 400,000 rows are processed per batch.
Preprocessing for multi-threaded loading: (splitDataByUserId.dfx)
Although four threads can all read the same full file at once, each thread is slowed down by reading the 3/4 of the data it does not need, so it is faster to split the file by UserID%4 in advance. C3 selects the rows for each ID%4 group, and C6 writes the selected rows to the corresponding split file. (A Python sketch of the pre-split follows the code grid below.)
A1: =file("e:/ldsj/demo/src-11800.bin").cursor@b()
A2: e:/ldsj/demo/users/
A3: for A1,100000
B3: for to(4)
C3: =A3.select(UserID%4==B3-1)
C4: if C3==null
D4: next
C5: ="src-11800-"+string(B3)+".bin"
C6: =file(A2+C5).export@ab(C3)
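A rough Python sketch of the pre-split by UserID%4, with pickle appends standing in for export@ab; paths and record shapes are assumptions.

```python
# Sketch of the pre-split: write four files, one per user_id % 4 group,
# so each worker later reads only its own quarter of the data.
import pickle

def split_by_user_mod(rows, dest_folder, parts=4):
    groups = [[] for _ in range(parts)]
    for row in rows:                              # row = (user_id, event_id, time, attrs)
        groups[row[0] % parts].append(row)
    for k, group in enumerate(groups, start=1):
        if group:
            with open(f"{dest_folder}/src-11800-{k}.bin", "ab") as f:
                pickle.dump(group, f)             # append, like export@ab
```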
Program: (6-mulit-read.dfx)
The multithreading is moved forward to A11, and each thread reads its own file and does its own calculation (B11).
A9: ……
A10: =to(802060).(null)
A11: fork to(4)
B11: =file(fPath+"src-11800-"+string(A11)+".bin").cursor@b().select(Time>=begin && Time<end && events.pos(EventID)>0 && ${filter})
B12: for B11,10000
C12: ……
A13: ……
Results:
118 million records / 1.93GB / 800,000 users / 4 threads / 113 seconds
Analysis:
Also limited by the data-loading speed of the single disk, the improvement is modest. With a cluster of several machines, each machine processing 1/4 of the data on its own hard disk, the disks work in parallel and the speed will improve substantially. Let's see how to implement multi-machine parallelism ↓
Multi-computer cluster parallel computing
How to deploy cluster computing and how to write the cluster's main program and subprograms is not the focus of this article; see the relevant documentation for details: http://doc.raqsoft.com.cn/esproc/tutorial/jqjs.html.
Main program: (6-multi-pc-main.dfx)
A3 calls the subroutine 6-multi-pc-sub.dfx through callx with a parameter sequence, so each subroutine knows which part of the data to process; B6 merges the returned results and accumulates them in cell A4. (A Python sketch of this dispatch-and-merge pattern follows the subroutine grid below.)
A1: >beginTime=now()
A2: ["127.0.0.1:8281","127.0.0.1:8282"]
A3: =callx("e:/ldsj/demo/6-multi-pc-sub.dfx", to(2), "e:/ldsj/demo/users/"; A2)
A5: for A3
B6: >A4=if(A4==null, A5, A4++A5)
A7: =interval@s(beginTime,now())
A3 holds the sequence of results returned by the node machines, and A4 holds the merged final result.
Node-machine subroutine: (6-multi-pc-sub.dfx)
Compared with the previous program, the multithreaded fork to(4) in A11 is removed; the node machine decides which split file to process from the taskSeq parameter passed by the main program (B11); A22 returns the result in A20 to the main program.
A9: ……
A10: =to(802060).(null)
B11: =file(fPath+"src-11800-"+string(taskSeq)+".bin").cursor@b().select(Time>=begin && Time<end && events.pos(EventID)>0 && ${filter})
B12: for B11,10000
C12: ……
A13: ……
A22: return A20
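A rough Python sketch of the dispatch-and-merge pattern: a local process pool stands in for callx across node machines, and the element-wise merge mirrors A4=if(A4==null,A5,A4++A5). The per-task result values are fake placeholders.

```python
# Sketch of the cluster pattern: dispatch one task per split file to workers
# (callx across node machines in the article; a local process pool here),
# then merge the partial funnel results element by element, like A4 ++ A5.
from concurrent.futures import ProcessPoolExecutor

def run_sub(task_seq):
    # stand-in for 6-multi-pc-sub.dfx: would process file src-11800-<task_seq>.bin
    return [task_seq, task_seq * 2, task_seq * 3]      # fake per-step funnel counts

def run_main(tasks=(1, 2, 3, 4)):
    merged = None
    with ProcessPoolExecutor() as pool:
        for partial in pool.map(run_sub, tasks):
            merged = partial if merged is None else [a + b for a, b in zip(merged, partial)]
    return merged

if __name__ == "__main__":
    print(run_main())          # [10, 20, 30]
```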
Results:
118 million records / 1.93GB / 800,000 users / each node machine processes 1/4 of the data / 38 seconds. The main program's merge time is negligible, so with the hard disks of four PCs loading data in parallel, the time comes down to 38 seconds.
The programs and test data can be downloaded from the Baidu net disk; install the aggregator, modify the file paths in the programs, and run them to see the effect.
Concluding remarks and outlook
Having seen so many optimization details, some readers may ask whether it is nitpicking to push this far with so much effort: shouldn't the database have built-in automatic optimization? There is a consensus that Oracle has done this in great detail, so that users need not worry about such things at all. Automatic performance optimization is indeed valuable, but in recent years, as data environments grow more complex and data volumes soar, more and more application scenarios call for finer manual control over the data. That raises the learning cost, but it also brings higher returns from the data. Moreover, this learning investment does not only solve performance problems; it also helps with the fundamental business need of describing complex calculations and organizing data, and those problems cannot be automated away, because "deciding what to do" has itself become complex. All a tool can do is provide a more convenient programming language that improves the efficiency of describing the solution and lets us face the problem squarely. No matter how intelligent computers become, they cannot make these decisions for people. Automatic and manual optimization are not opposites; they complement each other!
The optimization ideas above are things we programmers can think of in advance, and we can roughly choose effective methods according to the characteristics of the computing task. But a computer system is very complex: computing requirements differ, disk read/write speed fluctuates, network speed fluctuates, and the amount of CPU work is hard to estimate. So in real work we also have to rely on experience and choose optimization methods according to their actual measured effect.
Before SPL appeared, optimization methods were hard to implement and maintain, so intensive experimentation was difficult and optimization results were naturally rare; at the same time, without intensive hands-on "kneading" of data, optimization experience was hard to accumulate, which from another angle explains why senior data analysts are so expensive. The first group of people to use an efficient tool always benefits the most: the first to use bows and arrows, the first to use guns, the first to use tanks, and the first to use ×. You are among the first programmers to use SPL. It is an inevitable trend that, out of the large population of programmers, a profession of data programmers specializing in data processing and analysis will emerge with its own independent skills. Plan your career and choose your direction early, so that you can take that high ground.
Finally, there is still room for optimization beyond this result. If the data is compressed before storage, disk access time can be reduced further, and after proper sorting and column-oriented storage the data does compress well. Also, the cluster computation here is split into four subtasks; even with identical machine configurations their performance can differ, some running fast and some slow, and the overall finish time is decided by the slowest machine. If the task is split more finely, the load evens out and the overall speed improves further. We will discuss these topics in later articles.