2025-04-05 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
HAWQ, short for "Hadoop With Query," is a parallel SQL engine native to Hadoop. As an analytical database aimed at the enterprise, HAWQ has many notable features: it is fully compatible with ANSI SQL syntax, supports standard JDBC/ODBC connections, provides ACID transaction semantics, and delivers high performance. Its execution engine is more flexible than that of a traditional MPP database: nodes can be added or removed dynamically in seconds, multiple fault-tolerance mechanisms are built in, and multi-level resource and workload management is supported. HAWQ offers high-performance interactive queries over petabytes of data on Hadoop, works with the major BI tools for descriptive analytics, and ships machine-learning libraries for predictive analytics. At the time of writing, HAWQ is an Apache incubating project and is about to graduate to a top-level Apache project. HAWQ++, launched by Oushu Technology, a company founded by HAWQ's founding team, is an enhanced enterprise edition based on Apache HAWQ.
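As a small illustration of the ANSI-SQL compatibility and transaction support mentioned above, the sketch below uses invented table and column names, and the exact DDL options may differ across HAWQ versions:

```sql
-- Illustrative sketch only: names are invented; HAWQ++ specifics may differ.
BEGIN;
CREATE TABLE page_views (
    view_time timestamp,
    user_id   bigint,
    url       text
);
INSERT INTO page_views VALUES (now(), 42, '/index.html');
COMMIT;

-- Standard ANSI SQL, e.g. window functions, runs unchanged:
SELECT user_id,
       count(*) OVER (PARTITION BY user_id) AS views_per_user
FROM page_views;
```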
HAWQ++ architecture
HAWQ++ follows a typical master-slave architecture. There are several kinds of master node: the HAWQ++ master, the HDFS master (NameNode), and the YARN master (ResourceManager). At this stage the HAWQ++ metadata service is still embedded in the HAWQ++ master node; in the future it will become a standalone catalog service. Making metadata independent brings several benefits: on the one hand, HAWQ++ metadata can be merged with the Hadoop cluster's metadata; on the other hand, the master/slave distinction disappears, so any node can accept and process queries, which makes load balancing easier. Each slave node runs an HDFS DataNode, a YARN NodeManager, and a HAWQ++ segment, with YARN being an optional component; without YARN, HAWQ++ falls back to its own built-in resource manager. When executing a query, a HAWQ++ segment launches multiple QEs (Query Executors), which run inside resource containers. Under this architecture, nodes can join the cluster dynamically with no need to redistribute data: a newly joined node sends a heartbeat to the HAWQ++ master and can then receive queries.
Figure 1 HAWQ++ architecture
HAWQ++ internal architecture
Figure 2 shows the internal architecture of HAWQ++. The HAWQ++ master node contains the following important components: query parser, optimizer, resource broker, resource manager, HDFS metadata cache, fault-tolerance service, query dispatcher, and metadata service. A physical segment is installed on each slave node. When a query executes, the elastic execution engine starts multiple virtual segments to run it concurrently, and data is exchanged between nodes through the interconnect (a high-speed network). If a query starts 1,000 virtual segments, the query is divided evenly into 1,000 tasks that execute in parallel, so the number of virtual segments effectively indicates the query's degree of parallelism, which the elastic execution engine determines dynamically from the size of the query and current resource usage. Briefly, the components work as follows. The parser performs lexical and syntactic analysis to produce a parse tree; the analyzer performs semantic analysis on it to produce a query tree; the rewriter then rewrites the query tree into a list of query trees based on the rule system; finally the optimizer performs logical optimization and cost-based physical optimization to generate an optimized parallel plan. Through the resource broker, the resource manager dynamically requests and caches resources from a global resource manager such as YARN, returning them when they are no longer needed. The HDFS metadata cache is used to determine which parts of a table each segment scans: because computation and data are completely separated in HAWQ++, data-locality information is needed to dispatch computation to where the data resides, and having every query query the NameNode for block locations would make the NameNode a bottleneck, hence the cache.
The fault-tolerance service detects which nodes are available and which are not; unavailable machines are excluded from the resource pool. The query dispatcher sends the optimized plan to each node for execution and coordinates the whole query-execution process. The metadata service stores all of HAWQ++'s metadata, including database and table definitions as well as access privileges. The interconnect transfers data between nodes; by default it is based on UDP, which requires no connection setup and thus avoids TCP's limits on large numbers of concurrent connections. HAWQ++ accesses HDFS through the libhdfs3 module; libhdfs3 is a native C++ HDFS client interface which, compared with the JNI-based interface, is easier to deploy, consumes fewer resources, and performs better.
Figure 2 HAWQ++ internal architecture
HAWQ++ parallel optimizer
Next, let's look at the HAWQ++ parallel optimizer in detail, since in a database system the optimizer largely determines how well SQL executes. The HAWQ++ native optimizer is developed from the PostgreSQL optimizer; put simply, it inserts Motion operators into the serial plan that PostgreSQL would generate. A Motion represents data movement and is implemented underneath by the interconnect. Based on the inserted Motions, the plan is cut into a number of slices, and the same slice can execute in parallel on different nodes. There are three types of Motion: (1) Redistribute Motion, which redistributes data by hash key; (2) Broadcast Motion, which broadcasts data; and (3) Gather Motion, which gathers data together. The query plan on the left of figure 3 shows a case where no redistribution is needed, because tables lineitem and orders are both distributed on their join keys. If both tables were randomly distributed, the plan on the right would be generated instead, with one more Redistribute Motion node than the plan on the left. One might wonder: since HAWQ++ data lives on HDFS, and when HDFS nodes are added or removed a block on one DataNode may be rebalanced onto another DataNode, how can a hash-distributed table be hash-joined directly without a Redistribute Motion? The reason is that for hash-distributed tables HAWQ++ maintains a mapping between QEs and the files they write, so even if a block of a file is no longer local, that only affects whether the block is read locally or remotely; it has nothing to do with whether a Redistribute Motion is needed. Finally, cost-based physical optimization takes its input from statistics, so collecting table statistics ahead of time with the ANALYZE command helps the optimizer produce a better plan.
Figure 3 parallel query plan
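The distribution choice behind the two plans in figure 3 can be sketched in DDL. This is an illustrative sketch using the Greenplum-style syntax that Apache HAWQ inherits; exact options may differ in HAWQ++:

```sql
-- Hash-distribute both tables on the join key (illustrative sketch).
CREATE TABLE orders   (o_orderkey bigint, o_totalprice numeric)
    DISTRIBUTED BY (o_orderkey);
CREATE TABLE lineitem (l_orderkey bigint, l_quantity   numeric)
    DISTRIBUTED BY (l_orderkey);

-- Collect statistics so cost-based optimization has real input.
ANALYZE orders;
ANALYZE lineitem;

-- With matching distribution keys, the join plan needs no Redistribute
-- Motion; had the tables been randomly distributed, the plan would gain
-- a Redistribute Motion node.
EXPLAIN SELECT *
FROM   lineitem l
JOIN   orders   o ON l.l_orderkey = o.o_orderkey;
```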
HAWQ++ query processing flow
Figure 4 shows the processing flow for the query plan on the right of figure 3. On receiving a client connection, the HAWQ++ master node starts a QD (Query Dispatcher), which carries the query through lexical, syntactic, and semantic analysis, after which the optimizer generates a parallel plan. The QD then computes, from the amount of data the query touches and the current resource usage, combined with data-locality information, how many virtual segments (VSEGs) need to be started and on which segment nodes to start them. The dispatcher module then connects to those segment nodes over the libpq protocol to start the QEs, and serializes, compresses, and dispatches the parallel plan. A VSEG is a logical concept: in figure 4 it is any set of QE processes that execute slice1 and slice2, respectively. The set of QE processes running the same slice across all VSEGs is called a gang. Each QE takes its slice and builds a query-executor tree; each node of the tree is called an operator, with its own executor logic. The whole HAWQ++ execution process is pipelined, pulling data from top to bottom. Between gangs, slices exchange data through Motions, and finally all data is collected at the master node through a Gather Motion and returned to the client.
Figure 4 query processing flow
HAWQ++ resilient execution engine
The HAWQ++ elastic execution engine is a key technology that distinguishes it from traditional MPP databases. In a traditional MPP database such as Greenplum Database, the segment configuration is rigid, so executing a SQL query often has to mobilize every node in the cluster, wasting resources and limiting SQL concurrency; each node owns its own directories and data, which puts fairly strict availability requirements on every node and makes expansion complex. The elastic execution engine introduced by HAWQ++ completely separates storage from computation, so any number of virtual segments can be launched to execute a query. Each segment is stateless, with metadata and transaction management implemented on the master node, so a segment joining the cluster needs no state synchronization, and users can add or remove nodes dynamically as needed. For each individual query, the degree of parallelism is determined dynamically from user configuration, the characteristics of the SQL, and the real-time state of the database, and healthy, lightly loaded nodes are allocated. At the same time, I/O tasks are assigned to the parallel VSEGs according to the distribution of the table's data blocks, achieving the best possible local-read ratio and ensuring optimal SQL execution performance.
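Apache HAWQ exposes the per-query parallelism described above through server configuration parameters. The parameter names below are from Apache HAWQ and are assumed here to carry over to HAWQ++; treat this as an illustrative sketch:

```sql
-- Illustrative sketch: Apache HAWQ GUC names, assumed unchanged in HAWQ++.
-- Cap the number of virtual segments a single query may use cluster-wide:
SET hawq_rm_nvseg_perquery_limit = 64;
-- Cap the number of virtual segments per query on each segment host:
SET hawq_rm_nvseg_perquery_perseg_limit = 6;

-- Subsequent queries are then planned with a degree of parallelism
-- chosen dynamically within these limits.
SELECT count(*) FROM lineitem;
```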
HAWQ++ pluggable external storage
HAWQ++'s pluggable external storage is built on an enhanced external-table read/write framework. Through the new framework, HAWQ++ can access more kinds of external storage more efficiently: file systems such as S3 and Ceph, and file formats such as ORC and Parquet, are all pluggable. As with internal tables, HAWQ++ can dynamically adjust the read/write concurrency on external tables according to the amount of data a query touches and the cluster's resource utilization, and can select the best compute nodes according to the data distribution, so access to external tables is both optimized and flexibly controlled. Compared with PXF, Apache HAWQ's original external-data access scheme, pluggable external storage avoids multiple data conversions along the transmission path, replaces the external-proxy approach with native parallelism, gives users a simpler and more effective way to import and export data, and is several times faster.
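The article does not give the exact DDL of the new framework, so the sketch below borrows the Greenplum/Apache HAWQ external-table shape; the location string and format name are invented for the example and HAWQ++'s actual pluggable-storage syntax may differ:

```sql
-- Illustrative sketch only: the location protocol and format below are
-- hypothetical; HAWQ++'s actual pluggable-storage DDL may differ.
CREATE EXTERNAL TABLE sales_ext (
    sale_id bigint,
    amount  numeric
)
LOCATION ('s3://example-bucket/sales/')  -- pluggable file system (assumed)
FORMAT 'ORC';                            -- pluggable file format (assumed)

-- Queried like an internal table; read concurrency and node placement
-- are chosen dynamically by the engine.
SELECT count(*) FROM sales_ext;
```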
HAWQ++ Container Cloud support
HAWQ++ is the world's first MPP SQL engine that can run natively on a container-cloud platform. It is easy to move simple stateless applications, such as web servers, into containers, but moving a big-data platform into containers faces many technical challenges. HAWQ++ supports installation and deployment on mainstream Kubernetes CaaS platforms, with the HAWQ++ services running in Docker containers managed by the platform. Like other application clusters, a HAWQ++ cluster on Kubernetes can be deployed through the Dashboard user interface or the command line, and managed like any other CaaS platform application. Combining HAWQ++ with a cloud platform integrates applications and services, making elastic scaling, self-healing, and rolling upgrades easy, and bringing considerable convenience in resource management and automated operations.
Outlook for HAWQ++
HAWQ++ is still under active development; UPDATE and DELETE support will be added in the near future, making it a well-deserved leader among big-data management engines in the cloud era.