Today I will talk to you about the exploration and practice of Presto at Didi, a topic that many people may not know well. To help you understand it better, I have summarized the following content and hope you will get something out of this article.
1. Presto Introduction
▍ 1.1 Introduction
Presto is Facebook's open source MPP (Massively Parallel Processing) SQL engine. Its design draws on Volcano, a parallel database research project that proposed a model for parallel execution of SQL, and it is built for high-speed, real-time data analysis. Presto is a SQL compute engine that separates the compute layer from the storage layer: it does not store data itself, and it accesses various data sources (Storage) through the Connector SPI.
▍ 1.2 Architecture
Presto follows the common Master-Slave architecture: one Coordinator and multiple Workers. The Coordinator is responsible for parsing SQL statements, generating execution plans, and distributing execution tasks to Worker nodes; the Workers actually execute the query tasks. Presto provides a set of Connector interfaces for reading metadata and raw data, and ships with Connectors for a variety of data sources, such as Hive, MySQL, Kudu and Kafka. At the same time, Presto's extension mechanism allows custom Connectors, so custom data sources can be queried. If the Hive Connector is configured, a Hive Metastore service must be configured to provide Hive metadata to Presto, and the Worker nodes interact with HDFS through the Hive Connector to read raw data.
▍ 1.3 How Presto Achieves Low Latency
Presto is an interactive query engine, and what we care about most is how it achieves low-latency queries. The main reasons its performance stands out are:
- Fully in-memory parallel computing
- Pipelining
- Localized computing
- Dynamic compilation of execution plans
- Careful use of memory and data structures
- GC control
- No fault tolerance
2. Application of Presto in Didi
▍ 2.1 Business Scenarios
- Hive SQL query acceleration
- Ad-Hoc queries on the data platform
- Reports (BI reports, custom reports)
- Campaign marketing
- Data quality testing
- Asset management
- Fixed-form data products
▍ 2.2 Business scale
▍ 2.3 Business growth
▍ 2.4 Cluster deployment
Currently, our Presto deployments are divided into mixed clusters and high-performance clusters. As shown in the figure above, mixed clusters share HDFS with the large offline Hadoop cluster and are co-deployed with it. To prevent large queries in a cluster from affecting small queries, building a dedicated cluster per business would lead to too many clusters and high maintenance cost; instead, we achieve physical isolation within a cluster by specifying Labels (more on this later). High-performance clusters use separately deployed HDFS and can also access Druid, which gives Presto the ability to query both real-time and offline data.
▍ 2.5 Access Methods
We did secondary development on JDBC, Go, Python, CLI, R, NodeJS, HTTP and other access methods and integrated them with the company's internal permission system, so that business teams can access Presto conveniently and quickly from a variety of technology stacks.
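For example, programmatic access over JDBC looks roughly like the snippet below. This is a minimal illustration against a PrestoSQL 340 coordinator; the host, catalog, schema and user are placeholders, and real access at Didi goes through the Gateway and the internal permission system described here.

```java
// Minimal JDBC access example for a PrestoSQL 340 coordinator (io.prestosql:presto-jdbc:340).
// Host, catalog, schema and user below are placeholders, not actual Didi endpoints.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PrestoJdbcExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "analyst");                       // placeholder user

        String url = "jdbc:presto://coordinator.example.com:8080/hive/default";
        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));                   // prints 1
            }
        }
    }
}
```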
Query routing goes through a Gateway, which intelligently selects the appropriate engine. A query is first sent to Presto; if it fails, it is retried on Spark; if it still fails, it is finally sent to Hive. In the Gateway layer we have made some optimizations to distinguish large, medium and small queries: if the estimated query time is less than 3 minutes we consider the query suitable for Presto, and we classify query size using signals such as HBO (history-based optimization statistics) and the number of JOINs. See the architecture diagram:
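The fallback chain can be pictured with the following sketch. It is a hypothetical illustration of the routing idea, not Didi's actual Gateway code: the class names, the 3-minute threshold and the JOIN-count cutoff are assumptions used for demonstration.

```java
// Hypothetical sketch of the Gateway routing described above: small/medium queries go to
// Presto first, then fall back to Spark, then Hive; large queries skip Presto entirely.
import java.time.Duration;
import java.util.List;

public class QueryGateway {

    /** Simplified per-query statistics, e.g. derived from HBO (history-based optimization). */
    public record QueryProfile(Duration estimatedRuntime, int joinCount) {}

    /** Minimal engine abstraction; a real client would submit the SQL remotely. */
    public interface EngineClient {
        String execute(String sql) throws Exception;
    }

    private static final Duration PRESTO_THRESHOLD = Duration.ofMinutes(3);
    private static final int MAX_JOINS_FOR_PRESTO = 10;   // illustrative cutoff

    private final EngineClient presto;
    private final EngineClient spark;
    private final EngineClient hive;

    public QueryGateway(EngineClient presto, EngineClient spark, EngineClient hive) {
        this.presto = presto;
        this.spark = spark;
        this.hive = hive;
    }

    /** Route the query along the Presto -> Spark -> Hive fallback chain. */
    public String submit(String sql, QueryProfile profile) throws Exception {
        List<EngineClient> order = isSmallOrMedium(profile)
                ? List.of(presto, spark, hive)
                : List.of(spark, hive);                    // large queries skip Presto
        Exception last = null;
        for (EngineClient engine : order) {
            try {
                return engine.execute(sql);
            } catch (Exception e) {
                last = e;                                  // fall through to the next engine
            }
        }
        throw last;
    }

    private boolean isSmallOrMedium(QueryProfile p) {
        return p.estimatedRuntime().compareTo(PRESTO_THRESHOLD) < 0
                && p.joinCount() <= MAX_JOINS_FOR_PRESTO;
    }
}
```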
3. Engine iteration
We began to investigate Presto in September 2017, went through versions 0.192 and 0.215, and released a total of 56 versions. In early 2019 (0.215 being the version at which the community split), the Presto community split into two projects, PrestoDB and PrestoSQL, each of which set up its own foundation. We decided to upgrade to the latest version of PrestoSQL (version 340) because:
- The PrestoSQL community is more active; PRs and user questions are answered promptly.
- PrestoDB is mainly maintained by Facebook, and Facebook's internal needs drive its direction.
- PrestoDB's future direction is mainly ETL-related; we already have a Spark background, and our ETL workloads rely on Spark and Hive.
4. Engine improvement
In Didi, Presto is mainly used for Ad-Hoc queries and Hive SQL query acceleration. To help users migrate their SQL to the Presto engine as quickly as possible, and to improve Presto's query performance, we have done a lot of secondary development on Presto. At the same time, because we use the Gateway, a SQL query that fails on Presto will be forwarded to Spark and then Hive, so we do not use Presto's Spill-to-Disk feature. A purely in-memory SQL engine like this runs into many stability problems in practice, and we have accumulated a lot of experience solving them, which is described below.
▍ 4.1 Hive SQL compatibility
In the first half of 2018, when Presto had just been launched, many users inside Didi were reluctant to migrate their business, mainly because Presto uses ANSI SQL, which differs significantly from HiveQL; query results could be inconsistent and the migration cost was high. To help Hive users migrate smoothly, we made Presto compatible with Hive SQL. In the technology selection, we chose not to do SQL compatibility above Presto, i.e. not in the Gateway layer, mainly because the development effort would be large and the cost of developing and converting UDFs too high; in addition, the extra SQL parsing would hurt query performance and increase the number of Hive Metastore requests, and at that time the Hive Metastore was already under heavy pressure. Considering cost and stability, we finally chose to implement the compatibility in the Presto engine layer.
Main work:
- Implicit type conversion (a small illustrative sketch follows this list)
- Semantic compatibility
- Syntax compatibility
- Hive view support
- Parquet HDFS file reading support
- Support for a large number of UDFs
- Others
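As an illustration of the first item, the sketch below shows the general idea behind Hive-style implicit type conversion: Hive will compare a VARCHAR with a numeric value by coercing both sides (typically to DOUBLE), while ANSI-SQL Presto rejects the comparison, so a compatibility layer inserts the coercion during analysis. The class and the simplified coercion rules are illustrative assumptions, not the actual implementation.

```java
// Illustrative-only sketch of Hive-style implicit coercion for comparisons.
// The coercion table below is deliberately simplified.
import java.util.Optional;

public final class HiveStyleCoercion {

    private HiveStyleCoercion() {}

    /**
     * Returns the common type both sides of a comparison should be cast to,
     * mimicking a subset of Hive semantics, or empty if no coercion applies
     * (in which case the ANSI analysis error is kept).
     */
    public static Optional<String> commonComparisonType(String leftType, String rightType) {
        if (leftType.equals(rightType)) {
            return Optional.of(leftType);
        }
        boolean leftString = leftType.startsWith("varchar");
        boolean rightString = rightType.startsWith("varchar");
        boolean leftNumeric = isNumeric(leftType);
        boolean rightNumeric = isNumeric(rightType);
        // Hive compares string vs. numeric by converting both sides to DOUBLE.
        if ((leftString && rightNumeric) || (leftNumeric && rightString)) {
            return Optional.of("double");
        }
        if (leftNumeric && rightNumeric) {
            return Optional.of("double");   // simplified: widen numerics to double
        }
        return Optional.empty();
    }

    private static boolean isNumeric(String type) {
        return type.equals("tinyint") || type.equals("smallint") || type.equals("integer")
                || type.equals("bigint") || type.equals("real") || type.equals("double");
    }
}
```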
We iterated through three major versions of Hive SQL compatibility. Currently, the online SQL pass rate is 97-99%. After business migrated from Spark/Hive to Presto, query performance improved by 30%-50% on average, and even 10x in some scenarios, saving 80% of machine resources overall in the Ad-Hoc scenario. The following figure shows the SQL pass rate and the breakdown of failure reasons for the online Presto cluster; 'null' represents successfully executed SQL, and the other labels indicate error causes:
▍ 4.2 Physical Resource Isolation
As mentioned above, if business with high performance requirements shares a cluster with business running large queries, query performance is easily affected, and the only alternative would be separate clusters; but building a dedicated cluster per business leads to too many Presto clusters and high maintenance cost. Since we have not hit a bottleneck on the Presto Coordinator, large queries mainly affect Worker performance: for example, a single large SQL can saturate a Worker's CPU and slow down other business teams' queries. So we modified the scheduling module so that Presto supports dynamically tagging machines with Labels and dynamically scheduling onto machines with a specified Label. As shown in the following figure:
Different Labels are assigned to different businesses. The Labels specified by each business and the corresponding machine lists are defined in a configuration file; the Coordinator loads this configuration and maintains the cluster's Label information in memory, and it periodically reloads the configuration when the Label information in the file changes. During scheduling, the Worker machines matching the Label specified by the SQL are selected; for example, when Label A is specified, only Worker A and Worker B can be chosen. In this way, machines are physically isolated and the queries of businesses with high performance requirements are guaranteed. A minimal sketch of this Label-based scheduling appears below.
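The sketch is a hypothetical illustration only: the configuration shape, the class name and the idea of passing the Label as a session property are assumptions, while the real change lives inside Presto's scheduling module.

```java
// Hypothetical sketch of Label-based Worker selection on the Coordinator.
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class LabelAwareNodeSelector {

    /** label -> worker host names, periodically refreshed from a configuration file. */
    private final Map<String, Set<String>> labelToWorkers = new ConcurrentHashMap<>();

    /** Called periodically by the Coordinator when the config file changes. */
    public void reload(Map<String, Set<String>> latestConfig) {
        labelToWorkers.clear();
        labelToWorkers.putAll(latestConfig);
    }

    /**
     * Restrict the candidate workers for a query to the machines carrying the label
     * requested by the query (e.g. via a session property). Queries with no label,
     * or an unknown label, fall back to the full cluster.
     */
    public List<String> selectCandidates(List<String> allWorkers, String requestedLabel) {
        Set<String> allowed = requestedLabel == null ? null : labelToWorkers.get(requestedLabel);
        if (allowed == null || allowed.isEmpty()) {
            return allWorkers;
        }
        return allWorkers.stream()
                .filter(allowed::contains)
                .collect(Collectors.toList());
    }
}
```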
▍ 4.3 Druid Connector
There are some pain points in using Presto + HDFS:
- Real-time data cannot be queried; if there is a real-time data requirement, a separate real-time data pipeline has to be built, which increases system complexity.
- Latency is high and QPS is low.
- To achieve the best performance, Presto must be co-deployed with HDFS DataNodes, the DataNodes use high-end hardware, and a dedicated HDFS has to be built and maintained, which increases the operations burden.
So we implemented a Presto on Druid Connector in version 0.215. This plugin has the following advantages:
- Combining Druid's pre-aggregation, compute (filter, aggregation) and cache capabilities improves Presto performance (RT and QPS).
- Presto gains the ability to query Druid real-time data.
- Druid gets comprehensive SQL support, expanding the application scenarios of Druid data.
- Druid metadata is obtained and data is read directly from Druid Historical nodes, and Limit pushdown, Filter pushdown, Project pushdown and Agg pushdown are implemented (see the sketch after this list).
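To make the pushdowns concrete, the sketch below shows one way a pushed-down projection, filter and limit could be translated into a Druid native scan query. The JSON shape follows Druid's documented scan-query format, but the builder class and its inputs are illustrative assumptions rather than the actual connector code.

```java
// Illustrative sketch: turning pushed-down Project / Filter / Limit into a Druid
// native "scan" query that can be sent to Historical nodes.
import java.util.List;
import java.util.stream.Collectors;

public class DruidScanQueryBuilder {

    /** Simple equality predicate pushed down from Presto, e.g. city = 'beijing'. */
    public record EqualityFilter(String column, String value) {}

    public static String build(String dataSource, String interval,
                               List<String> projectedColumns,   // Project pushdown
                               EqualityFilter filter,           // Filter pushdown
                               long limit) {                    // Limit pushdown
        String columns = projectedColumns.stream()
                .map(c -> "\"" + c + "\"")
                .collect(Collectors.joining(", "));
        String filterJson = filter == null ? "null" : String.format(
                "{\"type\": \"selector\", \"dimension\": \"%s\", \"value\": \"%s\"}",
                filter.column(), filter.value());
        return String.format(
                "{\n" +
                "  \"queryType\": \"scan\",\n" +
                "  \"dataSource\": \"%s\",\n" +
                "  \"intervals\": [\"%s\"],\n" +
                "  \"columns\": [%s],\n" +
                "  \"filter\": %s,\n" +
                "  \"limit\": %d\n" +
                "}", dataSource, interval, columns, filterJson, limit);
    }
}
```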
In PrestoSQL version 340, the community also provides a Presto on Druid Connector, but it is implemented via JDBC and its drawbacks are obvious:
- A query cannot be divided into multiple Splits, so query performance is poor.
- Queries go to the Broker, which then queries Historical, adding one more network hop.
- In some scenarios, such as large Scan scenarios, this can cause the Broker to OOM.
- Project and Agg pushdown support is incomplete.
For details of the architecture, see the figure below:
After using Presto on Druid, the performance of some scenarios has been improved by a factor of 4 to 5.
▍ 4.4 Ease-of-Use Construction
In order to support several of the company's core data platforms, including the Data Dream platform, extraction tools, data exchange, feature acceleration, and a variety of long-tail individual users, we have done a lot of secondary development on Presto, including permission management and syntax support, to ensure fast business onboarding. Main work:
Permission management:
- Integrates with the company's internal Hadoop tenants, using the HDFS SIMPLE protocol for authentication and Ranger for authorization.
- Parses SQL so that Presto can pass column information downstream, providing four-tuple (user name + database name / table name / column name) authorization, and supports authorizing multiple tables at the same time (see the first sketch after these lists).
- Users specify their user name for authentication and authorization, while a shared big account is used to actually read and write HDFS data.
- Supports view and table-alias authorization.
Syntax extensions:
- Support ADD PARTITION.
- Support tables whose names start with numbers.
- Support fields whose names start with numbers.
Feature enhancements:
- When inserting data, the total number of inserted rows is written into HMS, giving the business side millisecond-level metadata awareness (see the second sketch below).
- Query progress is updated on a rolling basis, improving the user experience.
- Queries can specify a priority, giving users different levels of business priority control.
- The communication protocol was modified so that business parties can pass custom information, meeting log audit needs and more.
- The DeprecatedLzoTextInputFormat format is supported.
- Reading HDFS Parquet files by path is supported.
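A rough sketch of what the four-tuple authorization check could look like is shown below; the interface name and the batch API are illustrative assumptions, and a real implementation would delegate to Ranger through Presto's access-control SPI.

```java
// Hypothetical sketch of a four-tuple (user, database, table, column) authorization check.
import java.util.List;
import java.util.Map;

public interface ColumnLevelAccessControl {

    /** True if {@code user} may read {@code column} of {@code database.table}. */
    boolean canSelect(String user, String database, String table, String column);

    /**
     * Batch form: authorize several tables (and their referenced columns) in one call,
     * so a multi-table query needs only one round trip to the authorization service.
     */
    default boolean canSelectAll(String user, Map<String, List<String>> tableToColumns) {
        return tableToColumns.entrySet().stream()
                .allMatch(e -> {
                    String[] parts = e.getKey().split("\\.", 2);   // key format "db.table"
                    return e.getValue().stream()
                            .allMatch(col -> canSelect(user, parts[0], parts[1], col));
                });
    }
}
```

The inserted-row-count feature can be sketched as follows: after an INSERT finishes, the engine records the number of rows as a table parameter in HMS so that downstream systems gain metadata awareness within milliseconds. The parameter key "presto.insert.numRows" and the direct HiveMetaStoreClient usage are assumptions for illustration.

```java
// Hypothetical sketch of publishing the inserted row count to the Hive Metastore.
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class InsertRowCountPublisher {

    private final HiveMetaStoreClient client;

    public InsertRowCountPublisher(HiveConf conf) throws Exception {
        this.client = new HiveMetaStoreClient(conf);
    }

    /** Record how many rows a finished INSERT wrote into database.table. */
    public void publish(String database, String table, long insertedRows) throws Exception {
        Table t = client.getTable(database, table);
        t.getParameters().put("presto.insert.numRows", Long.toString(insertedRows)); // assumed key
        client.alter_table(database, table, t);
    }
}
```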
▍ 4.5 Stability Construction
Many stability problems are encountered when using Presto, such as Coordinator OOM and Worker Full GC. In order to solve these problems and make them easier to locate, we first built a monitoring system, which mainly includes:
- The log audit function is implemented as a Presto Plugin (a sketch follows this list).
- Engine metrics are collected via JMX and the monitoring information is written to Ganglia.
- Audit logs are collected into HDFS and ES and unified into the operations monitoring system; all metrics are sent to Kafka.
- Presto UI improvements: Worker information can be viewed.
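A minimal sketch of the log-audit plugin idea is shown below, assuming the PrestoSQL 340 event-listener SPI (io.prestosql.spi.eventlistener) and the standard Kafka producer API. The topic name and JSON payload are illustrative, and a real plugin would also provide an EventListenerFactory and register it through the Plugin SPI.

```java
// Minimal sketch of a log-audit EventListener that ships completed-query events to Kafka.
import io.prestosql.spi.eventlistener.EventListener;
import io.prestosql.spi.eventlistener.QueryCompletedEvent;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class KafkaAuditEventListener implements EventListener {

    private static final String TOPIC = "presto-query-audit";   // assumed topic name

    private final KafkaProducer<String, String> producer;

    public KafkaAuditEventListener(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    @Override
    public void queryCompleted(QueryCompletedEvent event) {
        String queryId = event.getMetadata().getQueryId();
        String user = event.getContext().getUser();
        String sql = event.getMetadata().getQuery();
        // A real plugin would serialize a richer record (state, CPU time, error code, ...).
        String payload = String.format("{\"queryId\":\"%s\",\"user\":\"%s\",\"sql\":%s}",
                queryId, user, quote(sql));
        producer.send(new ProducerRecord<>(TOPIC, queryId, payload));
    }

    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```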
These features make it convenient for us to locate stability problems in time, including metric inspection and SQL replay. As shown in the figure below, we can check the number of successful and failed SQL statements in a cluster, and we trigger alarms based on a defined query failure rate:
In the Presto community, stability has troubled many users, including Coordinator and Worker failures and query performance degrading after a cluster has been running for a while. We have accumulated a lot of experience solving these problems; the ideas and methods are described below.
By responsibility, Presto is divided into the Coordinator and Worker modules. The Coordinator is mainly responsible for SQL parsing, query plan generation, Split scheduling and query state management, so when the Coordinator hits an OOM or coredump, fetching metadata and generating Splits are the prime suspects. For memory problems we recommend using MAT to analyze the root cause. The following figure shows a MAT analysis revealing that the FileSystem Cache was enabled and the resulting memory leak caused the OOM.
Here we summarize the common problems and solutions of Coordinator:
- Using the HDFS FileSystem Cache causes a memory leak; the solution is to disable the FileSystem Cache, with Presto maintaining its own FileSystem cache afterwards (see the sketch after this list).
- Jetty causes an off-heap memory leak, with Gzip as the root cause; upgrading the Jetty version solves it.
- Too many Splits leave no ports available and TIME_WAIT is too high; adjusting TCP parameters solves it.
- JVM coredump showing "unable to create new native thread"; solved by modifying pid_max and max_map_count.
- A Presto kernel bug: too many failed SQL queries cause a Coordinator memory leak; fixed by the community.
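The FileSystem-cache item can be illustrated with a small, hypothetical sketch: Hadoop's FileSystem.get() caches instances keyed by scheme, authority and user, so a multi-tenant service that impersonates many distinct users accumulates entries that are never closed. Setting fs.hdfs.impl.disable.cache stops Hadoop's caching (the Presto-side replacement cache mentioned above is not shown here); the namenode URI and user names are placeholders.

```java
// Sketch of the Hadoop FileSystem cache behaviour behind the Coordinator memory leak.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

import java.net.URI;
import java.security.PrivilegedExceptionAction;

public class FileSystemCacheDemo {

    public static void main(String[] args) throws Exception {
        URI hdfs = URI.create("hdfs://nameservice1/");   // illustrative namenode URI

        Configuration cached = new Configuration();
        // Each distinct user adds a new entry to the static FileSystem cache.
        for (String user : new String[] {"alice", "bob", "carol"}) {
            UserGroupInformation ugi = UserGroupInformation.createRemoteUser(user);
            ugi.doAs((PrivilegedExceptionAction<FileSystem>) () -> FileSystem.get(hdfs, cached));
        }

        Configuration uncached = new Configuration();
        uncached.setBoolean("fs.hdfs.impl.disable.cache", true);   // bypass the cache
        try (FileSystem fs = FileSystem.get(hdfs, uncached)) {
            System.out.println("Uncached instance: " + fs);        // caller owns the lifecycle
        }
    }
}
```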
The Presto Worker, on the other hand, is mainly used for computation, and its performance bottlenecks are mainly memory and CPU. On the memory side, there are three ways to stay safe and find problems:
- Control business concurrency through Resource Groups to prevent serious overselling.
- Solve common memory problems, such as Young GC Exhausted, through JVM tuning.
- Make good use of MAT to find memory bottlenecks.
However, Presto Workers often run into slow-query problems. First, check whether Swap is enabled: when free memory is insufficient, swapping seriously hurts query performance. Second are CPU problems: to solve them, make good use of perf, profile why the CPU is not doing useful work and what it is mainly doing, and determine whether it is a GC problem or a JVM bug. As shown in the figure below, a JVM bug was triggered on an online Presto cluster, causing queries to slow down after running for a while and recover after a restart; the cause was found through perf and analysis of the JVM code, and it can be solved by tuning the JVM or upgrading the JVM version:
Here we also summarize the common problems and solutions of Worker:
- Sys load is too high, severely affecting business query performance. We studied JVM internals and solved it with parameters (-XX:PerMethodRecompilationCutoff=10000 and -XX:PerBytecodeRecompilationCutoff=10000); upgrading to the latest JVM also solves it.
- Worker queries hang. The cause is a bug in the HDFS client: when Presto and HDFS are co-deployed and the data and client are on the same machine, short-circuit reads keep waiting on a lock, causing queries to hang until timeout. The Hadoop community has fixed this.
- Overselling causes Worker Young GC Exhausted; we optimized GC parameters, e.g. setting -XX:G1ReservePercent=25 and -XX:InitiatingHeapOccupancyPercent=15.
- ORC files that are too large cause OOM when Presto reads the ORC Stripe Statistics; the solution is to limit the ProtoBuf packet size while helping the business side with reasonable data governance.
- We modified Presto's memory management logic and optimized the kill policy so that when memory is insufficient the Presto Worker does not OOM and only large queries are killed. A subsequent circuit-breaker mechanism will be based on the JVM, similar to ES circuit breakers: for example, when JVM memory usage reaches 95%, kill the largest SQL.
▍ 4.6 Engine Optimization and Investigation
As an Ad-Hoc engine, the faster Presto's query performance, the better the user experience. To improve query performance, we have done a lot of engine optimization work in the Presto on Hive scenario.
- JVM tuning on one business cluster changed reference processing (Ref Proc) from single-threaded to parallel execution, reducing ordinary queries from 30s-1 minute to 3-4s, a 10x+ performance improvement.
- ORC data optimization: adding a Bloom filter on specified string fields improved query performance by 20-30%.
- Data governance and small-file merging for some businesses reduced query time from 20s to 10s, doubling performance and keeping query performance stable.
- ORC format performance optimization reduced query time by 5%.
- Partition pruning optimization solved the problem of fetching metadata for all partitions and reduced pressure on HMS.
- Pushdown optimization: Limit, Filter, Project and Agg are pushed down to the storage layer.
In 2018, we also investigated a number of technical solutions to improve Presto query performance, including Presto on Alluxio and Presto on CarbonData, but both were abandoned because:
- Presto on Alluxio improved query performance by 35%, but the memory footprint was not proportional to the performance gain, so we abandoned it; businesses that are especially performance-sensitive can adopt it later.
- Presto on CarbonData was tested in August 2018; at that time CarbonData's stability was poor, there was no obvious performance advantage, and ORC was faster in some scenarios, so we did not continue to track and investigate it.
- Because Didi has a team dedicated to maintaining Druid, we integrated Presto on Druid, and performance in some scenarios improved by 4 to 5 times.
- In the future we will pay more attention to Presto on ClickHouse and Presto on Elasticsearch.
5. Summary
Through the above work, Presto at Didi has gradually been connected to the company's big data platforms and has become the company's preferred Ad-Hoc query engine and Hive SQL acceleration engine. The following figure shows the performance improvement of one product after it was connected:
As can be seen in the figure above, the platform began to connect to Presto around October 2018; query TP50 latency has improved by more than 10x, from 400s to 31s, and query time has remained stable as the number of tasks has gradually increased.
As for the high-performance cluster, we have done a lot of stability and performance optimization to ensure that the average query time is less than 2S. As shown in the following figure:
6. Outlook
The main application scenario for Presto is Ad-Hoc queries, so its peak load is during the daytime. The figure below shows the queries of the ride-hailing business from 12:00 noon to 16:00 in the afternoon; the average CPU utilization is above 40%.
However, looking at CPU utilization over the last month, the average is relatively low: the peak is between 10:00 and 18:00 during the day, there are basically no queries at night, and CPU utilization drops below 5%, as shown in the following figure:
Therefore, solving the problem of waste of resources at night is a difficult problem that we need to solve in the future.