2025-04-05 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/01 Report
Today I would like to talk with you about how to understand the data virtualization engine openLooKeng. Many people may not know much about it, so I have summarized the following overview; I hope you get something out of this article.
The current state and problems of big data analytics
The 21st century is the century of the information explosion. With the rapid development of IT, more and more applications continuously produce enormous volumes of data. Over the years, scientists and engineers have invented all kinds of data management systems to store and manage it: relational databases, NoSQL databases, document databases, key-value stores, object storage systems, and so on. While this variety makes it convenient for organizations to manage data, it also makes it hard to manage and fully exploit the data scattered across these systems. Whether it is PostgreSQL or MySQL among relational databases, or Hive and HBase in the Hadoop ecosystem, each of these widely used systems has its own SQL dialect. An analyst who wants to analyze the data held in one of them must first become proficient in its dialect; and to query jointly across different data sources, the application logic has to use a different client for each source. The resulting analysis pipeline has a complex architecture, many programming entry points, and difficult system integration. For data analysts working with massive data, this process is very painful.
To solve the joint-query problem posed by the data silos spread across multiple data sources, the industry has widely adopted data warehouses, which have developed rapidly in recent years. Through the Extract-Transform-Load (ETL) process, processed data is stored centrally in subject-oriented warehouses for data analysts and users. However, as data volumes grow further, the industry has gradually realized that transporting data into the warehouse is expensive. Beyond the hardware and software costs of the warehouse itself, the labor cost of maintaining and updating the entire ETL logic has become one of its major expenses. The ETL process is also cumbersome and time-consuming: to get the data they want, analysts have to compromise and conform to the warehouse's analysis model, and performing fast, exploratory business analysis remains a persistent difficulty.
To break the data silos among the various data management systems, subject-oriented data warehouses were invented for different business applications; but as business applications multiply, these subject warehouses themselves become new data silos. Must the heroic "dragon slayer" inevitably turn into a "dragon" over time? Is there a solution with a simple system architecture, a unified programming entry point, and good system integration? Perhaps it is time to go back to the starting point and look afresh at another paradigm for big data analytics.
The data virtualization engine openLooKeng: we do not move data, we are the "connector" of data
Looking back at the problems encountered with the data warehouse, it is easy to see why the warehouse "dragon slayer" gradually became a "dragon": its constant moving of data is the culprit behind the onerous, time-consuming, and expensive warehouse-building and analysis process. Since moving data causes these problems, let us return to the starting point of big data analytics and consider "the other road in the forest": the path openLooKeng is taking, which turns data movement into data connection.
Put succinctly, the openLooKeng data virtualization engine analyzes data by connecting to the various data source systems through a set of data source Connectors. When a user issues a query, openLooKeng uses each Connector to fetch the data in real time and performs high-performance computation on it, returning analysis results in seconds to minutes. This differs fundamentally from the previous approach, in which the data warehouse first processes the data through an ETL pipeline and only then hands it to the user.
Whereas data analysts previously had to learn a variety of SQL dialects, they now only need to be proficient in ANSI SQL 2003 syntax. As the middle layer, openLooKeng shields users from the differences among the data management systems' SQL implementations; the tedious dialect conversions are handled by openLooKeng itself. Freed from juggling dialects, users can focus on building high-value business query and analysis logic, and the intangible assets formed by this logic are often the core of an enterprise's business intelligence. openLooKeng helps users quickly build such high-value analysis logic on top of their existing technical architecture. Because no data needs to be moved, an analyst's query ideas can be verified quickly with openLooKeng, yielding faster results than the traditional warehouse analysis pipeline.
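To make the idea concrete, here is a toy Python sketch of what a federated query does conceptually: rows from two independent "sources" are joined at query time, without first copying either into a warehouse. This is not openLooKeng code; the table and column names are invented for illustration.

```python
# Toy illustration of query-time federation: two independent "data sources"
# (Python lists standing in for a Hive table and an Oracle table) are
# joined on demand, without first loading either into a warehouse.

hive_orders = [  # pretend this lives in a Hive catalog
    {"order_id": 1, "customer_id": 10, "amount": 250.0},
    {"order_id": 2, "customer_id": 11, "amount": 80.0},
]
oracle_customers = [  # pretend this lives in an Oracle catalog
    {"customer_id": 10, "name": "Alice"},
    {"customer_id": 11, "name": "Bob"},
]

def federated_join(orders, customers):
    """Join the two sources in the engine's own memory at query time."""
    by_id = {c["customer_id"]: c["name"] for c in customers}
    return [
        {"name": by_id[o["customer_id"]], "amount": o["amount"]}
        for o in orders
        if o["customer_id"] in by_id
    ]

result = federated_join(hive_orders, oracle_customers)
print(result)
```

In the real engine the two inputs would be streamed in through Connectors and the join executed by the distributed in-memory kernel, but the user-visible effect is the same: one query, several untouched sources.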
Let's take a higher-level view. Since openLooKeng can connect to relational databases, NoSQL databases, and other data management systems through Connectors, can openLooKeng itself serve as a data source behind a Connector? The answer is yes. When one openLooKeng cluster is exposed as a data source to another openLooKeng cluster, we gain the following benefit: previously, real-time federated queries across multiple data centers were essentially infeasible because of cross-region or cross-DC network bandwidth and latency constraints; now, openLooKeng cluster 1 can compute over its local data and pass only the results to openLooKeng cluster 2 for further analysis. This avoids transferring large volumes of raw data and thus sidesteps the network problems of cross-domain, cross-DC queries.
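A minimal sketch of why shipping results instead of raw rows helps across data centers. The row counts and JSON payload format here are invented purely to compare sizes; the real connector uses its own wire format.

```python
import json

# Pretend cluster 1 holds a large local fact table.
raw_rows = [
    {"region": "east" if i % 2 else "west", "sales": i % 100}
    for i in range(10_000)
]

# Naive federation would ship every raw row to the remote cluster.
raw_payload = json.dumps(raw_rows).encode()

# The cluster-to-cluster approach: cluster 1 aggregates locally and
# ships only the (much smaller) partial result to cluster 2.
agg = {}
for row in raw_rows:
    agg[row["region"]] = agg.get(row["region"], 0) + row["sales"]
agg_payload = json.dumps(agg).encode()

print(len(raw_payload), len(agg_payload))
assert len(agg_payload) < len(raw_payload) // 100
```

The bandwidth saved grows with the size of the local table, which is exactly why cross-DC federation becomes practical once only results cross the wide-area link.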
openLooKeng's unified SQL entry point and its rich southbound data source ecosystem solve, to a large extent, the problems of complex cross-source query architectures, excessive programming entry points, and poor system integration. They shift data from being "moved" to being "connected," helping users quickly realize the value of massive data.
Key features of openLooKeng
After reading the introduction above, you may be eager to know in which scenarios openLooKeng can solve the pain points of current business applications. Before turning to those scenarios, let's look at some of openLooKeng's key features, so that you can better understand why it suits them, and perhaps even discover further scenarios built on these capabilities.
An in-memory computing framework designed for massive data
From its inception, openLooKeng was designed for query and analysis tasks over TB- and even PB-scale data, with a natural affinity for the Hadoop file system. Its SQL-on-Hadoop distributed processing architecture follows the design principle of separating storage from compute, so compute and storage nodes can easily be scaled out horizontally. The openLooKeng kernel adopts an in-memory computing framework: all data processing runs as parallel, pipelined jobs in memory, delivering query response times from seconds to minutes.
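The phrase "parallel pipelined jobs in memory" can be illustrated with a toy Python generator pipeline (a single-threaded sketch, not the actual distributed kernel): each stage pulls rows from the previous one and streams them onward, so no intermediate result set is ever materialized in full.

```python
# Toy streaming pipeline: scan -> filter -> aggregate, row by row in
# memory, with no intermediate result materialized.

def scan(n):
    """Stand-in for a table scan producing rows one at a time."""
    for i in range(n):
        yield {"id": i, "value": i * 2}

def filter_stage(rows, threshold):
    """Stand-in for a filter operator in the pipeline."""
    for row in rows:
        if row["value"] >= threshold:
            yield row

def aggregate(rows):
    """Terminal operator that consumes the stream."""
    total = 0
    for row in rows:
        total += row["value"]
    return total

# Each stage pulls from the previous one; nothing is buffered in full.
result = aggregate(filter_stage(scan(1000), threshold=1000))
print(result)  # -> 749500
```

In the real engine the stages additionally run in parallel across workers; the streaming structure is what keeps latency in the seconds-to-minutes range.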
Support for ANSI SQL2003 syntax
openLooKeng supports ANSI SQL 2003 syntax. When users query through openLooKeng, whatever the underlying data source (an RDBMS, a NoSQL system, or another data management system), openLooKeng's Connector framework lets the data stay in its original source, achieving "zero data movement" queries.
Through openLooKeng's unified SQL entry point, the SQL dialects of the underlying data sources are shielded: users can retrieve data without caring which dialect each source speaks, which makes consuming data far more convenient.
A wide variety of data source Connectors
Mirroring the diversity of data management systems, openLooKeng has developed a wide range of data source Connectors: for RDBMSs (Oracle Connector, HANA Connector, etc.), for the Hadoop and NoSQL ecosystem (Hive Connector, HBase Connector, etc.), and for full-text search engines (Elasticsearch Connector, etc.). Through these Connectors, openLooKeng can easily fetch data from the sources and then carry out high-performance, memory-based federated computation on it.
A cross-domain, cross-DC DataCenter Connector
openLooKeng not only provides federated queries across multiple data sources; it extends cross-source querying further with a DataCenter Connector developed for cross-domain, cross-DC queries. With this Connector, you can connect to a remote openLooKeng cluster, enabling collaboration between different data centers. The key techniques are as follows:
Parallel data access: workers access the data source concurrently to improve throughput, and clients also fetch results from the server concurrently to speed up retrieval.
Data compression: data is compressed with the GZIP algorithm during serialization for transmission, reducing the volume of data sent over the network.
Cross-DC dynamic filtering: data is filtered before transfer to reduce the amount pulled from the remote end, stabilizing the network and improving query efficiency.
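The compression point is easy to demonstrate with Python's standard gzip module. Query results are often highly repetitive (few distinct strings, narrow value ranges), which is exactly the kind of payload GZIP shrinks well; the sample rows below are invented.

```python
import gzip
import json

# A repetitive result set, typical of analytical query output.
rows = [
    {"city": "Shenzhen", "status": "OK", "value": i % 10}
    for i in range(5000)
]

raw = json.dumps(rows).encode()
compressed = gzip.compress(raw)

print(len(raw), len(compressed))
# Repetitive payloads compress by well over an order of magnitude.
assert len(compressed) < len(raw) // 10
```

Compressing before the bytes cross the inter-DC link trades a little CPU on each side for a large reduction in wide-area traffic, which is usually the scarcer resource.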
High-performance query optimization techniques
Based on the in-memory computing framework, openLooKeng also uses many query optimization techniques to meet the needs of high-performance interactive queries.
Indexes
openLooKeng provides indexes based on Bitmap, Bloom filter, and min-max structures. An index is created over existing data and stored outside the data source; during query planning, the index information is used to filter out files that cannot match, reducing the volume of data that must be read and thereby speeding up the query.
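Here is a toy min-max index in Python to show the planning-time skip. The file names and statistics are invented; the point is only that a file whose [min, max] range cannot overlap the predicate is never read.

```python
# Toy min-max index: per-file min/max statistics let the planner skip
# files whose value range cannot possibly satisfy the predicate,
# before any data in those files is read.

file_stats = {
    "part-0": {"min": 0,   "max": 99},
    "part-1": {"min": 100, "max": 199},
    "part-2": {"min": 200, "max": 299},
}

def files_to_scan(lo, hi):
    """Keep only files whose [min, max] range overlaps [lo, hi]."""
    return [
        name for name, s in file_stats.items()
        if s["max"] >= lo and s["min"] <= hi
    ]

# A predicate like WHERE col BETWEEN 150 AND 160 touches one file of three.
print(files_to_scan(150, 160))  # -> ['part-1']
```

Bloom-filter and bitmap indexes answer a different question (is this exact value possibly present?) but serve the same purpose: shrinking the set of files handed to the scan operators.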
Cache
openLooKeng provides a rich set of caches, including a metadata cache, an execution plan cache, an ORC row data cache, and more. With these caches, repeated executions of the same or similar SQL see much faster response times.
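The execution-plan cache idea is just memoization keyed by SQL text. A hedged stand-in using Python's `functools.lru_cache` (the `plan` function below is a fake; real planning is far more involved):

```python
from functools import lru_cache

plan_count = 0  # how many times a plan was actually built

@lru_cache(maxsize=128)
def plan(sql: str) -> str:
    """Stand-in for expensive query planning, cached by SQL text."""
    global plan_count
    plan_count += 1
    return f"PLAN[{sql}]"

plan("SELECT 1")
plan("SELECT 1")   # identical text: served from the cache
plan("SELECT 2")   # new text: planned again
print(plan_count)  # -> 2
```

Metadata and row-data caches follow the same pattern at different layers: pay the cost once, then serve repeated requests from memory.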
Dynamic filtering
Dynamic filtering is an optimization that, at run time, takes the filtering results from one side of a join and applies them as a filter on the other side. openLooKeng provides dynamic filtering for a variety of data sources and also applies it in the DataCenter Connector, accelerating join queries in many scenarios.
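A toy sketch of the mechanism (invented data): the join keys actually present on the small build side are collected at run time and pushed down as a filter on the large probe side, so far fewer rows ever leave the scan.

```python
# Toy dynamic filtering: collect the join keys present on the small
# (build) side of the join at run time, then use that set to filter
# the large (probe) side's scan.

build_side = [{"id": 3}, {"id": 7}]                        # small dimension table
probe_side = [{"id": i, "v": i * i} for i in range(1000)]  # large fact table

keys = {row["id"] for row in build_side}  # filter known only at run time

scanned = [row for row in probe_side if row["id"] in keys]
print(len(probe_side), len(scanned))  # 1000 rows shrink to 2
```

Because the key set is only known once the build side has been read, this filter cannot be derived at plan time; it must be injected while the query runs, hence "dynamic."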
Operator pushdown
When openLooKeng connects through its Connector framework to a data source such as an RDBMS, which has strong computing capabilities of its own, better performance can generally be obtained by pushing operators down to the source for evaluation. openLooKeng currently supports operator pushdown for several data sources, including Oracle and HANA; notably, pushdown is also implemented in the DataCenter Connector, yielding faster query response.
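Predicate pushdown, the simplest form of operator pushdown, can be demonstrated with SQLite standing in for the RDBMS (a toy sketch; openLooKeng's connectors speak each source's native protocol, not Python's sqlite3):

```python
import sqlite3

# SQLite stands in for a remote RDBMS with its own compute capability.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, v INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(i, i % 10) for i in range(1000)]
)

# Without pushdown: pull every row, then filter inside the engine.
all_rows = conn.execute("SELECT id, v FROM t").fetchall()
filtered_in_engine = [r for r in all_rows if r[1] == 3]

# With pushdown: the source evaluates the predicate itself, so only
# matching rows are transferred.
pushed_down = conn.execute("SELECT id, v FROM t WHERE v = 3").fetchall()

print(len(all_rows), len(pushed_down))  # 1000 vs 100 rows transferred
assert filtered_in_engine == pushed_down
```

The results are identical; the difference is how many rows crossed the connector boundary, which is why pushdown matters most for remote and cross-DC sources.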
High availability features
HA: active-active (AA) coordinators
openLooKeng introduces a high-availability active-active (AA) mechanism for coordinators: multiple coordinators balance load among themselves, keeping openLooKeng available under high concurrency.
Auto-scaling
openLooKeng's auto-scaling feature supports gracefully retiring service nodes that are still executing tasks, as well as bringing inactive nodes back up to accept new work. By exposing "isolating" and "isolated" status interfaces to external resource managers (such as Yarn or Kubernetes), openLooKeng can flexibly scale its coordinator and worker nodes up and down.
Common application scenarios of openLooKeng
With the key features above in mind, you have probably already thought of many applications for openLooKeng. Let's look at its application scenarios in real business.
High-performance interactive query scenario
openLooKeng's memory-based computing framework combines parallel in-memory processing, indexing, caching, distributed pipelined execution, and other techniques for fast query and analysis, and can handle massive data at TB and even PB scale. Interactive analysis applications previously built on Hive, Spark, or even Impala can be upgraded to the openLooKeng query engine for faster query performance.
Cross-source heterogeneous query scenarios
As mentioned earlier, data management systems such as RDBMSs and NoSQL stores are widely used across customers' application systems, and more and more subject data warehouses, built on Hive or MPPDB, have been established to process this data. These databases and warehouses often become isolated data silos, leaving data analysts suffering from two problems:
Querying different data sources requires different connections or clients and different SQL dialects, which adds learning cost and complicates application development logic.
Without re-aggregating the data from the various sources, data from different systems cannot be queried jointly.
openLooKeng can federate queries across databases such as RDBMSs and NoSQL stores and warehouses such as Hive or MPPDB. With its cross-source heterogeneous query capability, data analysts can achieve minute-level or even second-level query and analysis over massive data.
Cross-domain and cross-DC query scenarios
In provincial-municipal, headquarters-branch, or other multi-level data center scenarios, users often need to query the data of municipal (branch) data centers from provincial (headquarters) ones. The main bottleneck for such cross-domain queries is the network between data centers (insufficient bandwidth, high latency, packet loss, etc.), which leads to long and unstable query times.
For this cross-domain scenario, openLooKeng designed the DataCenter Connector. By transmitting computed results rather than raw data between openLooKeng clusters, it avoids bulk network transfer and the problems caused by limited bandwidth and packet loss, alleviating the cross-domain, cross-DC query problem to a considerable degree. It is of high practical value in these scenarios.
Compute-storage separation scenarios
openLooKeng has no storage engine of its own; its data comes from a variety of heterogeneous data management systems, making it a typical compute-storage-separated system in which compute and storage resources can be scaled horizontally and independently. This architecture enables dynamic cluster node expansion and elastic resource scaling without interrupting the business, and suits scenarios that require separating compute from storage.
Rapid data exploration scenarios
As mentioned earlier, customers usually build dedicated data warehouses via ETL in order to query across multiple data sources, at the cost of labor, ETL time, and more. For customers who need to explore data quickly but do not want to build a dedicated warehouse, copying and loading data into a warehouse is time-consuming and laborious, and may still not yield the analysis they want.
openLooKeng can define a virtual data mart with standard syntax and connect to various data sources through its cross-source heterogeneous query capability, so the exploratory analysis tasks users need can be defined in the semantic layer of this virtual data mart. With this data virtualization capability, customers can quickly build exploration and analysis services over multiple data sources without constructing complex, dedicated warehouses, saving both labor and time. openLooKeng is an excellent choice when you want to explore data quickly to develop new business.
Having read the above, do you now have a better understanding of the data virtualization engine openLooKeng? Thank you for reading.