
Comparing Spark and Kylin for ad hoc queries over 1 billion rows of data

2025-03-04 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

The data volume is about 1 billion+ rows, and we need to support ad hoc queries: users can actively enter search criteria, such as a time range. Some preprocessing time is acceptable, and new data is added every day.

1 billion+ rows is still stressful for an ordinary RDBMS, and the data keeps growing daily, so we used Spark to accelerate the computation. Incremental updates will be covered in a later post.

The statement is as follows

select count(*) a from table_a where c = '20170101' group by a order by a

First, we ran the test while growing the data volume exponentially, starting with spark-local mode. The results are as follows.

Times are in seconds; row counts use the Chinese unit w (万, 10,000), so 800w = 8 million and 102400w ≈ 1.02 billion.

| Data-volume multiple | Rows in table_a | spark-local (i3, DDR3, 8 GB) | spark-local (i7, DDR4, 32 GB) | spark-yarn (5 GB memory, 3 cores, 3 instances) |
| --- | --- | --- | --- | --- |
| 1 | 800w | 0.512 | 0.4 | 0.771 |
| 2 | 1600w | 0.512 | 0.5 | 0.552 |
| 4 | 3200w | 0.794 | 0.68 | 0.579 |
| 8 | 6400w | 1.126 | 0.945 | 0.652 |
| 16 | 12800w | 1.98 | 1.4 | 0.922 |
| 32 | 25600w | 3.579 | 2.574 | 1.475 |
| 64 | 51200w | 6.928 | 5.001 | 3.384 |
| 128 | 102400w | 13.395 | 9.528 | 5.372 |
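As a quick sanity check on the scaling, one can divide each reported spark-local (i3) time by its data-volume multiple. The numbers below are copied from the benchmark above; the snippet itself is illustrative arithmetic, not part of the original benchmark:

```python
# spark-local (i3) times in seconds, keyed by data-volume multiple
# (values copied from the benchmark table above).
times = {1: 0.512, 2: 0.512, 4: 0.794, 8: 1.126,
         16: 1.98, 32: 3.579, 64: 6.928, 128: 13.395}

# Seconds of run time per unit of data (one unit = 800w rows).
# A roughly constant value at large multiples means the scan scales
# linearly; the higher per-unit cost at small multiples is fixed
# job-startup overhead that has not yet been amortized.
per_unit = {m: t / m for m, t in times.items()}
for m, v in sorted(per_unit.items()):
    print(f"{m:>3}x: {v:.3f} s per unit")
```

Per-unit cost falls from 0.512 s at 1x to roughly 0.105 s at 128x, i.e., the large runs are dominated by the linear scan rather than by overhead.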

We can see that single-core CPU performance also affects Spark performance, so when estimating the computing power of a Spark cluster we should look not only at how many cores and instances it has, but also at the performance of each core. The larger the data volume, the bigger the gap becomes.

Spark SQL has a characteristic: the second and third runs of the same statement are faster than the first. If you keep running the same statement, the time keeps dropping until it reaches a stable value, usually after 2-3 runs. At the 1-billion-row level in yarn mode, performance reaches about 3 seconds after 3 runs, which is very satisfactory for a small test cluster. On a production-sized Spark cluster the performance should be much better, and it is enough for ad hoc queries. My understanding is that Spark has some caching mechanism, but it does not cache much, which is different from the Kylin we discuss later: with Kylin, the second run of the same statement definitely comes back in under a second.
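The warm-up behaviour described above is easy to measure with a small harness that re-runs the same query until the elapsed time stops improving. This is a generic sketch: the FakeQuery stand-in (invented here) simulates a cache warming up, whereas in practice run_query would submit the real statement through spark-sql:

```python
import time

def measure_until_stable(run_query, max_runs=10, tolerance=0.2):
    """Re-run the same query until wall-clock time changes by less than
    `tolerance` (relative) between two consecutive runs."""
    elapsed = []
    for _ in range(max_runs):
        start = time.perf_counter()
        run_query()
        elapsed.append(time.perf_counter() - start)
        if len(elapsed) >= 2 and abs(elapsed[-1] - elapsed[-2]) <= tolerance * elapsed[-2]:
            break
    return elapsed

# Stand-in for a real query: each run is faster than the last until the
# "cache" is fully warm, mimicking the 2-3 run warm-up described above.
class FakeQuery:
    def __init__(self):
        self.cost = 0.03  # seconds for the first, cold run
    def __call__(self):
        time.sleep(self.cost)
        self.cost = max(self.cost / 2, 0.01)

runs = measure_until_stable(FakeQuery())
print([round(t, 3) for t in runs])
```

On the fake query this converges in about four runs; against a real cluster the same loop gives the "stable value" the post refers to.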

Next we tried Kylin, and the results were striking.

The official website of Kylin is as follows

http://kylin.apache.org/

Apache Kylin is an open-source distributed analytics engine that provides a SQL query interface and multidimensional analysis (OLAP) on Hadoop for very large datasets. It was originally developed by eBay Inc. and contributed to the open-source community. It can query huge Hive tables at sub-second latency.

Kylin is a top-level Apache project that originated in China. Support it.

Kylin works on top of Hive tables, so we first have to import our data into Hive.

I will not cover the installation of Kylin here; it is very simple and the official website has instructions. Just start the service and open port 7070.

To run computations with Kylin, we first need to create a Model. A Model corresponds to a Hive table, and a Cube is then built on top of the Model. Here we can simply create a default Cube.

On the Model page, we can see the cube we created as shown in the following figure.

We can see the cube size, number of records, last build time and other information in the list.

Once the cube has been built, it can be queried: when the build succeeds, the cube's status changes to READY, and at that point queries can be run against it.

We again use the previous 3-node cluster for performance testing.

As shown in the figure above, the query interface shows which project and cube the statement hit, how long it took, and other information.

The test results are as follows

Row counts again use the unit w (万, 10,000):

| Rows | Build time (full) | SQL time |
| --- | --- | --- |
| 800w | 6.5 min | 0.15 s |
| 12800w | 48 min | 0.15 s |
| 25600w | 90.17 min | 0.15 s |
| 51200w | 142.40 min | execution error |

Here, the cube build itself still succeeds at around 500 million rows, but running the SQL reports an error; I suspect I did not configure something properly.

As you can see, Kylin's query performance is excellent, and its query-time resource consumption is very low, unlike Spark's. Because of this, on clusters of equal capacity Kylin can support much higher concurrency and much better query latency than Spark. The costs are the long cube build times, a larger disk footprint, and, in my experience, SQL support that is not as complete as Spark SQL's. If you want to migrate existing RDBMS queries to a big-data cluster, Spark SQL adapts more easily.
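One way to weigh the build cost against the query-time gain is a break-even estimate. Treat this as order-of-magnitude only: the 48-minute build is the 128-million-row figure from the table above, while the 3-second figure is the warm spark-sql time at the billion-row scale reported earlier:

```python
kylin_build_s = 48 * 60   # one-off cube build at 12800w rows, in seconds
kylin_query_s = 0.15      # per-query time once the cube is READY
spark_query_s = 3.0       # warm spark-sql time reported above

# Kylin's total cost beats Spark's once
#   build + n * kylin_query  <  n * spark_query
break_even = kylin_build_s / (spark_query_s - kylin_query_s)
print(round(break_even))  # about 1011 queries
```

So on these rough numbers, Kylin pays for its build after roughly a thousand queries per build cycle; with fewer or sparser queries, Spark's zero build cost wins.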

Although Kylin also supports incremental builds, its data preparation takes much longer than Spark's (Spark supports incremental updates as well). If preprocessing time is acceptable, for example by scheduling the build after midnight, Kylin may be the better fit for this scenario. If, however, the data needs to change in near-real time, the dimensions are many, and the SQL is completely unpredictable, Spark SQL is probably more suitable. Which to choose depends on the application scenario.

© 2024 shulou.com SLNews company. All rights reserved.