How to understand the Development of big data query engine in distributed SQL

2025-04-19 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 Report

This article explains the development of distributed SQL big data query engines. The material is concise and practical, so let's walk through it.

Introduction

From a high-level perspective, many data and analytics solutions have been built the same way for years. In short, a set of integration processes loads all data into a central location, which then serves as the single source of truth for downstream data modeling and analytics use cases. In the early days, most of these central locations were expensive, inflexible, tightly coupled hardware/software systems; today they typically take advantage of cloud and distributed architectures, including the separation of compute and storage. Despite great technological progress in recent years, however, centralizing data is still widely regarded as the most obvious way to make effective use of it and to manage it properly.

Centralization

So what is wrong with this centralized approach? And what does it have to do with distributed query engines?

First of all, nothing is wrong with it. On the contrary: building a massive data warehouse or data lake that holds all data in a clean, fresh state in one place is usually the only way to ensure consistency, so that everyone works with the same definitions. Cloud data lake services in particular, such as Microsoft's Azure Data Lake Storage or Amazon Web Services' S3, have changed the picture here by amplifying the advantages of centralization, thanks to their very flexible and inexpensive way of storing large amounts of data of any type.

Caveats

However, centralizing all data is becoming harder and harder, for many reasons. The number of data sources keeps growing, and so does the variety of datasets needed to serve the increasing number of business domains that depend on this data. In general, business users increasingly want flexible access to data rather than static, pre-built datasets. The same holds for advanced analytics use cases, which often require methods to be applied to raw, untransformed data. In some cases, organizations are even prohibited from moving data by internal or external regulations. In other cases, pipelines built on top of the centralized data load it into further downstream systems to satisfy every analytics requirement, which in turn can lead to the same lock-in as traditional on-premises systems. And sometimes a use case simply does not justify the effort of centralizing the data, or the data is too large and takes too long to move. And so on.

So what should we do in this situation?

Federation

Today there are many options for analytics solutions and their data management. Not only are there different vendors with different offerings, but the variety of technologies is vast and the pace of progress is faster than ever. There is no clear winner, though all of these options no doubt help turn more raw data into something useful. What is clear, however, is a trend toward SQL-based distributed query engines to help deal with the data explosion. The product lineups of established data and analytics vendors, and their latest releases, confirm this: they all try to integrate cost-effective cloud storage seamlessly and to allow interactive SQL queries over it with one and the same query engine. They can thus close the gaps described above and let mature enterprises extend their big data capabilities while preserving organizational and platform stability, keeping the core source of truth intact.

Data virtualization

The basic idea behind a distributed query engine is essentially data virtualization: an abstraction layer that provides data access across different data sources. The difference from traditional data virtualization software (linked servers, DBLink, etc.) is that these engines scale out, querying relational and non-relational data together with much better query performance. "Distributed" therefore refers not only to the query itself but also to compute and storage. These engines are designed for heavy OLAP queries, and so are far less fragile and inconsistent in terms of performance.
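To make the idea of an abstraction layer over several sources concrete, here is a minimal sketch using only the Python standard library. Two SQLite databases stand in for two independent data sources (real engines federate systems like HDFS, Kafka, or relational databases); the table and column names are invented for illustration.

```python
import sqlite3

# Two logical "sources": the main database and an attached one.
con = sqlite3.connect(":memory:")
con.execute("ATTACH DATABASE ':memory:' AS crm")
con.execute("CREATE TABLE main.orders (id INTEGER, customer_id INTEGER, amount REAL)")
con.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
con.executemany("INSERT INTO main.orders VALUES (?, ?, ?)",
                [(1, 10, 99.5), (2, 11, 15.0), (3, 10, 42.0)])
con.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                [(10, "Acme"), (11, "Globex")])

# One SQL statement spans both sources; the abstraction layer hides
# where each table physically lives.
rows = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 141.5), ('Globex', 15.0)]
```

The point of the sketch is the query shape, not SQLite itself: the client writes one SQL statement, and the layer underneath resolves which source each table comes from.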

SQL on Hadoop

The technology originally used for this purpose was, and often still is, called SQL-on-Hadoop, and relies on MPP (massively parallel processing) engines. It lets you query and analyze data stored on HDFS (the Hadoop Distributed File System) in a familiar SQL-like language, hiding the complexity of MapReduce/Tez and making the data accessible to database developers. Hive was arguably the first SQL engine on Hadoop, and having proven itself over years of development, it is still widely used for batch data processing. Hive translates a SQL query into multiple stages and stores the intermediate results on disk. Meanwhile, other specialized tools such as Impala were developed natively in the Hadoop ecosystem, with HBase also supported as a data source. Compared to Hive, Impala leverages in-memory processing and caching, making it better suited to interactive analysis than to long-running batch jobs; another example in this category is Spark SQL. All of these require metadata definitions to be created up front, such as views or external tables, following the schema-on-read principle. These definitions are stored in a central repository such as the Hive metastore.
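The schema-on-read principle can be illustrated without a Hadoop cluster: the raw data stays untyped on storage, and a schema is applied only at read time, much as a Hive external table does. The sketch below uses the standard library only; the column names and types are invented for the example.

```python
import io
import json

# Raw, untyped JSON lines stand in for files sitting on HDFS.
raw = io.StringIO(
    '{"id": "1", "city": "Berlin", "temp": "21.5"}\n'
    '{"id": "2", "city": "Oslo", "temp": "12.0"}\n'
)

# The "external table" definition: column names and target types,
# declared separately from the data itself.
schema = {"id": int, "city": str, "temp": float}

def read_with_schema(lines, schema):
    # The schema is applied while reading, not while writing:
    # each raw record is cast to the declared types on the fly.
    for line in lines:
        record = json.loads(line)
        yield {col: cast(record[col]) for col, cast in schema.items()}

rows = list(read_with_schema(raw, schema))
print(rows[0])  # {'id': 1, 'city': 'Berlin', 'temp': 21.5}
```

Changing the schema dict changes how the same stored bytes are interpreted, which is exactly why schema-on-read suits raw, evolving data better than schema-on-write.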

SQL-on-Anything

As the technology evolved, the need arose for more openness: engines not strictly bound to Hadoop, but supporting many other kinds of databases in a loosely coupled way. A query engine can then offer plug-and-play discovery over large amounts of data without extensive prerequisites and preparation. In addition, standard ANSI SQL is provided as the interface, making the data easy to access for data analysts and developers. There is also no need to predefine a schema; some engines, such as Drill, can even derive it automatically at the original storage layer via query pushdown. Another pioneering tool in this field is Presto, which can even query live streaming data from Kafka and Redis. Presto is an in-memory distributed SQL query engine, developed at Facebook precisely for this requirement: interactive analysis across different data sets. For companies like Netflix, Twitter, Airbnb, or Uber this is critical to their daily business; they could not otherwise process and analyze data at petabyte scale. Presto works with many different BI tools, including Power BI, Looker, Tableau, Superset, or any other tool that speaks ODBC and JDBC. It was in this context that the name "SQL-on-Anything" was coined.
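The pushdown idea mentioned above can be shown with a toy model (this is not Presto's or Drill's actual planner, just the principle): instead of fetching everything from a source and filtering in the engine, the engine forwards the filter to the source, so far fewer rows cross the wire.

```python
class ToySource:
    """Stands in for a remote system that can evaluate simple filters itself."""
    def __init__(self, rows):
        self.rows = rows

    def scan(self, predicate=None):
        if predicate is None:
            return list(self.rows)
        return [r for r in self.rows if predicate(r)]

def run_query(source, predicate, pushdown):
    """Returns (result, number of rows transferred from source to engine)."""
    if pushdown:
        fetched = source.scan(predicate)          # filter at the source
        return fetched, len(fetched)
    fetched = source.scan()                       # fetch everything...
    return [r for r in fetched if predicate(r)], len(fetched)  # ...filter late

source = ToySource([{"id": i, "status": "open" if i % 10 == 0 else "closed"}
                    for i in range(1000)])
pred = lambda r: r["status"] == "open"

res_push, moved_push = run_query(source, pred, pushdown=True)
res_no, moved_no = run_query(source, pred, pushdown=False)
print(len(res_push), moved_push, moved_no)  # 100 100 1000
```

Both plans return the same result, but pushdown moves 100 rows instead of 1000; with selective predicates over remote storage, this difference dominates query latency.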

Data lake engine

Technically, the data lake engine approach is not very different; at its core, it is still data virtualization, merging data from different sources. Data lake engines usually differ by offering more functionality around data modeling, data transformation, data lineage, and data security. They are generally also more cloud-oriented and often ship with a rich user interface that brings a data self-service concept to non-technical users. This approach can take full advantage of data being concentrated in the public cloud, enabling interactive analysis at lower cost and, thanks to the price elasticity of the cloud, without lock-in risk. Data lake engines do not necessarily support more data sources, but being late arrivals, they can build on the latest technologies from scratch. For example, Databricks recently released SQL Analytics, powered by its Delta engine, which can query Delta Lake tables directly on the data lake. It also provides a SQL-native interface for data exploration, and dashboards can be shared between users. Another very promising tool, and one of my favorites in this area, is Dremio, which is essentially open source but backed by a company of the same name that offers a commercial enterprise edition with additional features.

In contrast to a traditional multi-tier architecture, Dremio builds a direct bridge between BI tools and the queried data source systems. The main technologies behind the scenes are Drill, Arrow, Calcite, and Parquet. This combination provides schema-free SQL over a variety of data sources and a columnar in-memory analytics execution engine with pushdown capabilities, which together make fast queries easy to achieve. Incidentally, Arrow is regarded as the de facto standard for in-memory columnar analytics.
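Why columnar in-memory layouts matter for analytics can be seen in a plain-Python sketch (Arrow itself stores typed, contiguous buffers; this only illustrates the access pattern, and the data is invented). An analytical query that touches two columns reads just those columns, never whole rows.

```python
# The same three records, first row-oriented, then pivoted into columns.
row_store = [
    {"user": "a", "country": "DE", "amount": 10.0},
    {"user": "b", "country": "NO", "amount": 5.0},
    {"user": "c", "country": "DE", "amount": 7.5},
]

col_store = {
    "user": ["a", "b", "c"],
    "country": ["DE", "NO", "DE"],
    "amount": [10.0, 5.0, 7.5],
}

# SELECT SUM(amount) WHERE country = 'DE' over the columnar layout:
# only the 'amount' and 'country' columns are scanned; 'user' is never read.
total = sum(a for a, c in zip(col_store["amount"], col_store["country"])
            if c == "DE")
print(total)  # 17.5
```

In a real engine the columns are contiguous typed buffers, so scans are cache-friendly and vectorizable; that, rather than the Python dict-of-lists shown here, is what makes Arrow-style execution fast.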

At this point, you should have a deeper understanding of the development of distributed SQL big data query engines. The best next step is to try these engines out in practice.
