How to use Alluxio to speed up data queries

2025-04-28 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01 report

The following describes Baidu's experience of running Alluxio in production and explains why Alluxio brings significant performance improvements. By using Alluxio to turn what were previously batch queries into interactive queries, Baidu can analyze data interactively, which has increased productivity and improved the user experience.

Business Challenge

Baidu is the largest search engine in China, which means that we have a lot of data. Managing data at this scale and quickly extracting useful information from it has always been a challenge.

For example, a query over a large amount of data often takes tens of minutes or even hours to complete, so a product manager may wait hours before starting the next query. Even more frustrating, changing the query means rerunning the entire process. About a year ago, we realized that we needed a dedicated query engine to solve these problems. We started by setting a target specification: the query engine must manage several PB of data and complete 95% of queries within 30 seconds.

We switched the query engine from Hive to Spark SQL (many use cases have demonstrated its latency advantage over Hadoop MapReduce), expecting Spark SQL to reduce the average query time to within a few minutes. However, it did not achieve the response time we expected. Although Spark SQL made the average query about four times faster, a query still took about 10 minutes to complete.

So we took a closer look and analyzed the details. It turned out that the bottleneck at this stage was no longer the CPU but the network. Because PB-scale data is distributed across multiple data centers, a query often has to transfer data from a remote data center to the data center where the computation runs, which is why users see long delays when running queries. Because storage nodes and compute nodes have different optimal hardware specifications, the solution is not as simple as moving the computation to the storage data center. We needed a memory-speed storage system, located on the compute nodes, to hold frequently used "hot" data.

Why choose Alluxio?

We needed a memory-speed storage system that not only provides high performance and reliability but can also manage several PB of data. We developed a query system that uses Spark SQL as its compute engine and Alluxio as the local memory-speed storage layer. We stress-tested it for one month using a standard internal Baidu query, which pulls 6 TB of data from a remote data center and then runs further analysis on it.

The results showed that Alluxio brings an excellent performance improvement. With Spark SQL alone, the average query took 100-150 seconds to complete. With Alluxio added, the average query time dropped to 10-15 seconds. Moreover, when all the data was stored on the local Alluxio node, a query took only about 5 seconds, up to 30 times faster than Spark SQL alone. Based on these results and on reliability considerations, we built a complete big data query system around Alluxio and Spark SQL.

Our system contains the following components:

Operation manager: a persistent Spark application that wraps Spark SQL. It accepts queries from the query UI and provides query parsing and query optimization functions.

View manager: manages cached metadata and processes query requests from the operation manager.

Alluxio: used as a memory-level storage system for commonly used data, providing computing locality.

Data warehouse: a remote data center that stores the data in an HDFS-based system.

The system processes a query as follows:

A query is submitted. The operation manager parses the query and asks the view manager whether the data is already in Alluxio.

If the data is already in Alluxio, the operation manager fetches it from Alluxio and runs the analysis on it.

If the data is not in Alluxio, the query is a cache miss. The operation manager requests the data directly from the data warehouse. At the same time, the view manager starts a separate job that requests the same data from the data warehouse and stores it in Alluxio, so that the next time the same query is submitted, the data is already in Alluxio.
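The flow above can be sketched in a few lines of Python. This is only an illustrative model, not Baidu's implementation: the class names `ViewManager` and `OperationManager`, their methods, and the use of plain dictionaries in place of Alluxio and the HDFS data warehouse are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


class ViewManager:
    """Hypothetical stand-in for the view manager: tracks what is cached."""

    def __init__(self, cache, warehouse):
        self.cache = cache          # dict standing in for Alluxio
        self.warehouse = warehouse  # dict standing in for the remote warehouse
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._pending = []

    def is_cached(self, key):
        return key in self.cache

    def load_async(self, key):
        # Separate job: copy the dataset from the warehouse into the cache.
        self._pending.append(self._pool.submit(self._load, key))

    def _load(self, key):
        self.cache[key] = self.warehouse[key]

    def wait_for_loads(self):
        # Block until all background loads started so far have finished.
        for job in self._pending:
            job.result()
        self._pending.clear()


class OperationManager:
    """Hypothetical stand-in for the operation manager."""

    def __init__(self, view_manager):
        self.vm = view_manager

    def run(self, key, analyze):
        if self.vm.is_cached(key):
            # Cache hit: read from the local cache.
            return analyze(self.vm.cache[key]), "alluxio"
        # Cache miss: start the background load, then answer this query
        # straight from the warehouse without waiting for the load.
        self.vm.load_async(key)
        return analyze(self.vm.warehouse[key]), "warehouse"


warehouse = {"clicks": [1, 2, 3]}
vm = ViewManager(cache={}, warehouse=warehouse)
om = OperationManager(vm)

first = om.run("clicks", sum)    # miss: served from the warehouse
vm.wait_for_loads()              # background job fills the cache
second = om.run("clicks", sum)   # hit: served from the cache
```

The point of the design is that a miss never waits for the cache to fill: the first query pays the remote-read cost once, and the background job makes later runs of the same query local.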

Benefits

After deploying the system, we measured its performance with a typical Baidu query. On the original Hive system, this query took more than 1000 seconds to execute. With Spark SQL alone, the time dropped to 150 seconds, and with Alluxio it dropped further to about 20 seconds. The query runs 50 times faster than on Hive and meets the interactive query requirement we set for the project. In other words, Alluxio let us convert a batch query that took about 15 minutes into an interactive query that takes less than 30 seconds.
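As a quick check of the arithmetic behind these figures, the speedups can be computed directly from the reported times. The function and variable names are only for illustration, and since the Hive time is reported as "more than 1000 seconds", the Hive-based ratios are lower bounds.

```python
def speedup(baseline_seconds, improved_seconds):
    """How many times faster the improved run is than the baseline."""
    return baseline_seconds / improved_seconds


# Query times in seconds, as reported in the text.
hive = 1000               # "more than 1000 seconds", so a lower bound
spark_sql = 150
spark_sql_alluxio = 20    # "about 20 seconds"

hive_to_spark = speedup(hive, spark_sql)                 # about 6.7x
hive_to_alluxio = speedup(hive, spark_sql_alluxio)       # 50x
spark_to_alluxio = speedup(spark_sql, spark_sql_alluxio) # 7.5x
```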

Over the past year, the system has been deployed on a cluster of more than 100 nodes. Alluxio stores and manages more than 2 PB of data there, using one of Alluxio's advanced features, tiered storage. This feature lets us use memory as the top tier, SSD as the second tier, and HDD as the last tier. By combining these storage media, we can provide more than 2 PB of storage space.
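To make the idea of tiered storage concrete, here is a toy Python model of a memory/SSD/HDD hierarchy. It is a sketch under simplifying assumptions, not Alluxio's code: the `Tier` and `TieredStore` classes are hypothetical, capacity is counted in blocks rather than bytes, eviction is plain least-recently-used, and every read promotes the block to the top tier. Alluxio's real eviction and promotion policies are configurable and more involved.

```python
from collections import OrderedDict


class Tier:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity     # max number of blocks (toy unit)
        self.blocks = OrderedDict()  # block_id -> data, least recent first


class TieredStore:
    def __init__(self, tiers):
        self.tiers = tiers           # fastest tier first

    def _place(self, block_id, data, level):
        if level >= len(self.tiers):
            return                   # fell off the last tier: block is dropped
        tier = self.tiers[level]
        tier.blocks[block_id] = data
        if len(tier.blocks) > tier.capacity:
            # Evict the least recently used block to the next tier down.
            victim_id, victim = tier.blocks.popitem(last=False)
            self._place(victim_id, victim, level + 1)

    def put(self, block_id, data):
        for tier in self.tiers:      # drop any stale copy first
            tier.blocks.pop(block_id, None)
        self._place(block_id, data, 0)

    def get(self, block_id):
        for tier in self.tiers:
            if block_id in tier.blocks:
                data = tier.blocks.pop(block_id)
                self._place(block_id, data, 0)  # promote hot data to the top
                return data
        return None

    def location(self, block_id):
        for tier in self.tiers:
            if block_id in tier.blocks:
                return tier.name
        return None


store = TieredStore([Tier("MEM", 1), Tier("SSD", 1), Tier("HDD", 1)])
for block in ("a", "b", "c"):
    store.put(block, block.upper())
# Now "c" is in MEM, "b" in SSD, and "a" in HDD.
```

Reading block "a" at this point would move it back into memory and push "c" and "b" one tier down each, which is the behavior that keeps frequently used data on the fastest medium.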

Beyond the improvement in query performance, what matters even more to us is the reliability of the whole system. Over the past year, Alluxio has been running steadily in our data infrastructure with few problems, which gives us a lot of confidence. We are therefore preparing for a large-scale deployment of Alluxio. As a first step, we verified Alluxio's scalability by deploying a cluster with 1000 Alluxio worker nodes. Over the past month, this cluster has been running steadily and providing more than 50 TB of memory. As far as we know, this is one of the largest Alluxio clusters in the world.

We have demonstrated that Alluxio greatly improves performance and is reliable and scalable. Next, we are gradually migrating various Baidu workloads to the Alluxio cluster. For example, to improve the performance of online image serving and online image analysis, we are working closely with the Alluxio community to develop a high-performance key-value store on top of Alluxio. With this approach, a single Alluxio storage system is enough: the key-value store handles online serving efficiently, and offline analysis can read the image data directly from Alluxio. This greatly reduces our development and operating costs.

As early users of Alluxio, we have verified its description as a "memory-centric distributed storage system that achieves reliable data sharing across cluster frameworks at memory speed". In addition to reliability and memory speed, Alluxio also offers a way to extend storage beyond memory so that capacity is sufficient.
