This article is about how to choose the right SQL engine for the right job. I hope you get something useful out of it; let's take a look.
We are all hungry for data, and not just more data: new types of data, too, so that we can best understand our products, customers, and markets. We want real-time insight into the freshest available data of every shape and size, structured and unstructured. We want to embrace a new generation of business and technical professionals with a genuine passion for data, and a new generation of technologies that can turn data into something relevant to our lives.
I can give an example of what I mean. About two years ago, data saved the life of a friend's daughter. At birth she was diagnosed with seven heart defects. Thanks to new technologies such as interactive 3D virtual modeling, smarter EKG analysis, modern bedside monitoring, and other improved data-driven medical procedures, she survived two open-heart surgeries and now leads a healthy life. Data saved her life. That is what motivates me every day to find new innovations and ways to get data to the people who need it most, as quickly as possible.

CDP (Cloudera Data Platform) was built from the ground up as an enterprise data cloud (EDC). An EDC offers a broad set of capabilities and can support many use cases on a single platform. With hybrid and multi-cloud deployments, CDP can run anywhere, from bare metal to public and private clouds. As we adopt more cloud solutions in central IT plans, hybrid and multi-cloud become the new normal. However, most mix-and-match environments create gaps in management, introducing new risks around security, traceability, and compliance. To address this, CDP provides advanced security and governance capabilities that democratize data without the risk of violating compliance and security policies.

CDW (Cloudera Data Warehouse) on CDP is a new service that lets you create self-service data warehouses for teams of business intelligence (BI) analysts. You can quickly provision a new data warehouse and share any dataset with a specific team or department. Remember when you could set up your own data warehouse without involving the infrastructure and platform teams? That never used to happen. CDW makes it possible. However, CDW exposes several SQL engines, which brings more choice and also more confusion. Let's explore the SQL engines available in CDW on CDP and discuss which SQL option is right for which use case. So many choices! Impala? Hive LLAP? Spark? When should you use each? Let's explore.

Impala SQL engine

Impala is a popular open source, scalable MPP engine in Cloudera Distribution Hadoop (CDH) and CDP. Impala has earned the market's trust for low-latency, highly interactive SQL queries. Impala is highly extensible, supporting not only Hadoop Distributed File System (HDFS) file formats such as Parquet, Optimized Row Columnar (ORC), JavaScript Object Notation (JSON), Avro, and text, but also native access to Kudu, Microsoft Azure Data Lake Storage (ADLS), and Amazon Simple Storage Service (S3). Impala has strong security through either Sentry or Ranger and is known to support more than 1,000 users on clusters over petabyte-size datasets.

Let's take a brief look at the overall Impala architecture. Impala uses the StateStore to check the health of the cluster. If an Impala node goes offline for any reason, the StateStore notifies all the other nodes so the unreachable node is avoided. The Impala catalog service manages the metadata for all SQL statements across all nodes in the cluster. The StateStore and the catalog service communicate with the Hive MetaStore for block and file locations, and then pass metadata to the worker nodes. When a query request arrives, it is routed to one of many query coordinators, where the request is compiled and planning begins. Plan fragments are returned, and the coordinator schedules their execution. Intermediate results are streamed between the Impala services and returned to the client.
This architecture is well suited when we need low-latency query responses from business intelligence data marts, as typically found in exploratory, ad hoc, self-service, and discovery use cases. In these cases, customers report sub-second to five-second response times for complex queries. For Internet of Things (IoT) data and related use cases, Impala, together with streaming solutions such as NiFi, Kafka, or Spark Streaming and an appropriate data store such as Kudu, can deliver end-to-end pipeline latency of under ten seconds. With native read/write capabilities to S3, ADLS, HDFS, Hive, HBase, and more, Impala is an excellent SQL engine for clusters of fewer than 1,000 nodes, with tables of 100 trillion rows or more, or datasets of 50 PB or larger.
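To make this concrete, below is a minimal sketch of issuing a low-latency query to an Impala coordinator from Python using the open source impyla client. It is only an illustration under assumptions: the coordinator host, database, and table names are hypothetical, and it presumes impyla is installed and a coordinator listens on the default port 21050.

```python
# A minimal sketch of querying Impala with impyla (assumptions: `pip install impyla`,
# a coordinator reachable on port 21050, and hypothetical database/table names).
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050, database="sales_mart")
cur = conn.cursor()

# The coordinator compiles the statement, distributes plan fragments to the
# worker nodes, and streams intermediate results back.
cur.execute("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders
    WHERE order_date >= '2020-01-01'
    GROUP BY region
    ORDER BY revenue DESC
    LIMIT 10
""")

for region, orders, revenue in cur.fetchall():
    print(region, orders, revenue)

cur.close()
conn.close()
```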
Hive LLAP

"Live Long And Process", also described as "Low-Latency Analytical Processing" (LLAP), is an execution engine under Hive that supports long-running processes by using the same resources for caching and processing. This execution engine gives us very low-latency SQL responses because there is no spin-up time for resources.

Most importantly, LLAP complies with and enforces security policies, so it is completely transparent to users and helps Hive workloads compete with even the most popular traditional data warehouse environments in use today. Hive LLAP offers the most mature SQL engine in the big data ecosystem. It is built for big data and gives users a highly scalable enterprise data warehouse (EDW) that supports heavy transformations, long-running queries, and brute-force-style SQL (with hundreds of joins). Hive supports materialized views, surrogate keys, and constraints to provide a SQL experience close to traditional relational systems, including built-in caching of query results and query data. Hive LLAP can reduce the load of repeated queries and deliver sub-second response times. Together with Kafka and Druid, Hive LLAP can support federated queries against HDFS and object storage as well as streaming and real-time data.

Hive LLAP is therefore very well suited as an enterprise data warehouse (EDW) solution, where we encounter many long-running queries that require heavy transformations, or multiple joins between tables in massive datasets. With the caching technology included in Hive LLAP, our customers have been able to join 330 billion records against 92 billion records, with or without a partition key, and get results back in seconds.
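As an illustration of the materialized-view and caching behavior described above, here is a hedged sketch that connects to HiveServer2 (fronting LLAP) with the PyHive client and precomputes a heavy join. The host, database, and table names are hypothetical, and it assumes `pip install "pyhive[hive]"`, Hive 3 with materialized views enabled, and transactional source tables.

```python
# A sketch of offloading a repeated, heavy join to a Hive materialized view over LLAP
# (assumptions: PyHive installed, HiveServer2 on port 10000, Hive 3 materialized views
# available; database and table names are hypothetical).
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, database="edw")
cur = conn.cursor()

# Precompute the expensive join once; LLAP caches the hot data and the optimizer
# can rewrite matching queries to use the materialized view.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS mv_sales_by_customer AS
    SELECT c.customer_id, c.segment, SUM(f.amount) AS lifetime_value
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
""")

# Subsequent BI-style queries benefit from the precomputed, cached result.
cur.execute("SELECT segment, AVG(lifetime_value) FROM mv_sales_by_customer GROUP BY segment")
print(cur.fetchall())

cur.close()
conn.close()
```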
Spark SQL

Spark is a general-purpose, high-performance data engine designed for distributed data processing across a wide variety of use cases. There are many Spark libraries for data science and machine learning that support higher-level programming models and speed up development. On top of Spark core sit Spark SQL, MLlib, Spark Streaming, and GraphX.

Spark SQL is a module for structured data processing that is natively compatible with many data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. Spark SQL is very efficient on semi-structured datasets and integrates natively with the Hive MetaStore and NoSQL stores such as HBase. Spark also works well through programming APIs in our favorite languages, such as Java, Python, R, and Scala. Spark is useful when you need to embed SQL queries alongside Spark programs in your data engineering workloads. Many of our users in the global top 100 enterprises run Spark to reduce the overall processing of streaming data workloads. Combined with MLlib, we see many customers favoring Spark for machine learning on data warehouse workloads. With high performance, low latency, and excellent integration with third-party tools, Spark SQL provides the best environment for switching back and forth between programming and SQL.
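A minimal PySpark sketch of the mixing described above: one session switches between SQL over tables registered in the Hive MetaStore and the DataFrame API. It is an illustration only; the table names are hypothetical, and it assumes a Spark environment with Hive MetaStore integration configured.

```python
# A sketch of mixing SQL and the DataFrame API in one Spark job
# (assumptions: PySpark available, Hive MetaStore integration configured,
# hypothetical table names).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("etl-orders")
    .enableHiveSupport()   # read tables registered in the Hive MetaStore
    .getOrCreate()
)

# Start with SQL against a warehouse table...
orders = spark.sql("SELECT customer_id, amount, order_date FROM edw.fact_sales")

# ...then continue in the DataFrame API for the engineering steps.
monthly = (
    orders
    .withColumn("month", F.date_trunc("month", F.col("order_date")))
    .groupBy("customer_id", "month")
    .agg(F.sum("amount").alias("monthly_spend"))
)

monthly.write.mode("overwrite").saveAsTable("edw.monthly_spend")
spark.stop()
```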
So which SQL engine is right for which use? Because you can mix and match against the same data in CDW on CDP, you can choose the right engine for each workload type, such as data engineering, traditional EDW, ad hoc analysis, BI dashboards, online analytical processing (OLAP), or online transaction processing (OLTP). The guidelines below summarize which engines and technologies fit each purpose.

If you are running an EDW that supports BI dashboards, Hive LLAP will serve you best. When you need ad hoc, self-service, and exploratory data marts, look at the advantages of Impala. If you are doing data engineering with long-running queries and no high concurrency, Spark SQL is a good choice. If you need high-concurrency support, look at Hive on Tez. For OLAP on time-series data, consider adding Druid to the mix, and if you are looking at OLTP that requires low latency and high concurrency, consider adding Phoenix.
The bottom line: there are many SQL engines in CDW on CDP, and that is deliberate. Offering choices is how you optimize for high concurrency over massive volumes of data without compromise. CDW on CDP provides a common data context and a shared data experience through a single layer of security, governance, traceability, and metadata, which lets you mix SQL engines over optimized storage. That gives you the freedom to use the SQL engine best optimized for your workload. That is how to choose the right SQL engine for the right job; I hope some of these points prove useful in your daily work.