A cheaper sorting implementation scheme based on Spark, with Spark-based performance tests


2025-04-21 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Sorting is a hard requirement for many log systems (for example, sorting in reverse chronological order). A big data system that cannot sort is essentially unusable. Whether the system is built on Hadoop, Spark, Impala or Hive, sorting is indispensable, and so is testing sort performance.

The annual Sort Benchmark ranking, known as the Olympics of computing, draws huge investments from the industry giants every year, which shows how much sorting speed matters. For most enterprises, however, hardware investment in the hundreds of millions is not worthwhile and is far beyond the project budget. Is there a cheaper way to achieve brute-force sorting in the big data field?

Here, we introduce you to a new cheap sorting method, which we call blockSort.

On 30 billion records (500 GB of data), just 4 virtual machines, each with 16 cores, 32 GB of memory and a gigabit network card, can complete a sort in 2 to 15 seconds (either across the full table or after applying any filter condition).

First, the basic idea is as follows:

1. Data is pre-divided into blocks according to value, for example large, medium and small blocks.

2. To find the largest values, you only need to look in the largest block.

3. The blocks are hierarchical. If a block still holds a large amount of data, you can continue the search in the sub-blocks beneath it, so the structure can be split into multiple layers.

4. With this method, even for trillion-trillion-level data (such as values of long type), the very worst case needs only 2048 file seeks to filter out the result.

As you can see, the principle is quite simple: even when the amount of data is very large, the number of sort and seek operations stays fixed.
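The steps above can be sketched in a few lines of Python. This is only a minimal single-level illustration of the block idea, not YDB's actual implementation: the function names `build_blocks` and `top_k`, the cut points in `boundaries`, and the sample values are all assumptions made for the example, and the multi-layer sub-block hierarchy and on-disk file seeks are left out.

```python
from bisect import bisect_right
from collections import defaultdict

def build_blocks(values, boundaries):
    """Pre-divide values into blocks by size.

    `boundaries` is a sorted list of cut points (hypothetical); block i holds
    the values v with boundaries[i-1] <= v < boundaries[i].
    """
    blocks = defaultdict(list)
    for v in values:
        blocks[bisect_right(boundaries, v)].append(v)
    return blocks

def top_k(blocks, k):
    """Return the k largest values, reading blocks from largest to smallest."""
    result = []
    for block_id in sorted(blocks, reverse=True):
        # Only the blocks that are actually needed get sorted;
        # lower blocks are never touched once k values are collected.
        result.extend(sorted(blocks[block_id], reverse=True))
        if len(result) >= k:
            break
    return result[:k]

values = [5, 93, 41, 77, 12, 68, 99, 3, 50, 84]
blocks = build_blocks(values, boundaries=[33, 66])  # small / medium / large
largest_three = top_k(blocks, 3)
print(largest_three)
```

In this sketch the three largest values come entirely from the "large" block, so the medium and small blocks are never read. The cost of a top-k query therefore depends on how many blocks are visited rather than on the total number of records.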

Second, here are the results of our earlier Spark-based performance test, for your reference.

YDB has an absolute advantage in sorting: whether over the full table or under any combination of filter conditions, it easily beats Spark on every storage format.

Test results (time in seconds)

Videos recorded during testing:

https://v.qq.com/x/page/q0371wjj8fb.html

https://v.qq.com/x/page/n0371l0ytji.html

Interested readers can also read the YDB programming guide at http://url.cn/42R4CG8, and can follow the guide to install Yanyun YDB and test it themselves.

Third, sorting is not the only area where YDB performs far better than Spark. The following comparisons are also worth a look.

1. Retrieval performance compared with Spark on txt files

Note: there is really nothing remarkable about the figure below. Because YDB relies on indexes, it does not need to brute-force scan the data the way Spark does, so its performance on scan-type queries is much higher than Spark's, and a 100-fold improvement is not surprising.

The following figure shows YDB's speedup relative to Spark on txt.

2. Comparison with the Parquet format (times in seconds)

3. Performance comparison with Oracle

Comparison with a traditional database is no longer very meaningful. Oracle is not suited to big data, and any big data tool far exceeds Oracle's performance.

4. Performance test of inspection and control scene

Fourth, how does YDB speed up Spark?

YDB is a real-time, multi-dimensional, interactive query, statistics and analysis engine built on the Hadoop distributed architecture. It delivers second-level response at the scale of trillions of records, with enterprise-grade stability and reliability.

YDB uses a fine-grained, precise index. Data is indexed as soon as it is imported, and the relevant data is then located efficiently through the index. YDB is deeply integrated with Spark: Spark analyzes and computes directly on the result set that YDB retrieves, which in the same scenario speeds Spark up a hundredfold.
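The "locate through the index, then compute on the result set" pattern can be illustrated with a small Python sketch. It is not YDB's index format or its Spark integration; the sample records, field names, and the helpers `build_index`, `scan_sum` and `indexed_sum` are all hypothetical and exist only to contrast a full scan with an index lookup.

```python
from collections import defaultdict

# Hypothetical sample records; field names are made up for illustration.
rows = [
    {"city": "beijing", "amount": 10},
    {"city": "shanghai", "amount": 25},
    {"city": "beijing", "amount": 7},
    {"city": "shenzhen", "amount": 40},
    {"city": "beijing", "amount": 3},
]

def build_index(rows, field):
    """Map each distinct value of `field` to the positions of the rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[field]].append(pos)
    return index

def scan_sum(rows, field, key, value_field):
    """Brute-force baseline: touch every row and test the filter."""
    return sum(r[value_field] for r in rows if r[field] == key)

def indexed_sum(rows, index, key, value_field):
    """Index path: jump straight to the matching rows, then aggregate them."""
    return sum(rows[pos][value_field] for pos in index.get(key, []))

city_index = build_index(rows, "city")  # built once, at import time
beijing_total = indexed_sum(rows, city_index, "beijing", "amount")
print(beijing_total)
```

Both paths return the same total, but the indexed path reads only the matching rows, whereas the scan reads all of them. The gap widens as the filter becomes more selective.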

Fifth, which users is YDB suitable for?

1. Users whose traditional relational database can no longer hold more data and whose query efficiency has been seriously affected.

2. Users currently running SOLR or ES for full-text search who find that these tools provide too few analysis functions to implement complex business logic, or that they become unstable at large data volumes, falling into a vicious circle of dropped shards and rebalancing from which the service cannot recover automatically, so that operations staff often have to get up in the middle of the night to restart the cluster.

3. Users who analyze massive data but find that the speed and response time of their existing offline computing platform cannot meet business requirements.

4. Users who need multi-dimensional, targeted analysis of user-profile and behavior data.

5. Users who need to retrieve large amounts of UGC (User Generated Content) data.

6. When you need fast, interactive queries over large data sets.

7. When you need to do data analysis, not just simple key-value pair storage.

8. When you want to analyze the data generated in real time.


Tags: spark, hadoop, hive, lucene, sorting, big data
