Shulou (Shulou.com) — SLTechnology News & Howtos, Internet Technology. Updated 2025-02-24.
This article explains in detail how Apache Hive 3 implements cross-database federated queries. It is shared as a reference; I hope you come away with a good understanding of the topic after reading it.
1. Using Apache Hive 3.0 & SQL for unified access
Apache Hive plays an important role in this environment. Specifically, it provides unified, efficient, out-of-the-box SQL access to all of the systems above. The benefits of this are enormous:
1. A single SQL dialect and API
2. Centralized security controls and audit trails
3. Unified governance
4. The ability to combine data from multiple data sources
5. Data independence
Setting up Apache Hive as the federation layer simplifies application development and data access: all data can be processed with SQL and accessed through the familiar JDBC/ODBC interfaces.
Most importantly, because all data access goes through Hive, Hive can provide unified security controls (table-, row-, and column-level access control), data lineage, and auditing when it interacts with different systems.
Finally, data independence is guaranteed: when all access is abstracted behind SQL tables or views, it becomes much easier to change the underlying storage or processing engine.
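As a minimal sketch of the data-independence point (all table and view names below are hypothetical): consumers query a view, so the table backing it can later be moved, say from a JDBC source to native Hive storage, without changing any consumer queries.

```sql
-- Consumers only ever reference the view; the backing table can change later.
CREATE VIEW sales_v AS
SELECT order_id, amount, order_date
FROM jdbc_sales;          -- hypothetical external (JDBC-backed) table

-- Consumer query: unaffected if jdbc_sales is later replaced by a Hive table.
SELECT order_date, SUM(amount)
FROM sales_v
GROUP BY order_date;
```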
2. Intelligent cost-based push-down
Suppose you execute a Hive query that accesses data in a back-end RDBMS through a JDBC connection. A naive approach treats the JDBC source as a "dumb" storage system: read all the raw data over JDBC and process it in Hive. This ignores the query capabilities of the RDBMS itself and pulls far too much data across the JDBC link, which easily leads to poor performance and system overload.
For this reason, Hive relies on its storage handler interface and the cost-based optimizer (CBO) powered by Apache Calcite to intelligently push computation down to other systems. In particular, Calcite provides rules that match a subset of operators in the logical representation of the query and generate an equivalent representation that the external system can execute. Hive pushes these computations to the external system during query planning and relies on Calcite to generate the query language that the external system supports. The storage handler implementation is responsible for sending the generated query to the external system, retrieving its results, and converting the incoming data into Hive's internal representation for further processing as needed.
This is not limited to SQL systems: Apache Hive can also federate queries to Apache Druid or Apache Kafka. As we described in a recent blog post, Druid handles filtering and aggregation of time-series data very efficiently, so when a query targets a data source stored in Druid, Hive can push filters and aggregations down to Druid by generating a JSON query and sending it to the REST API the engine exposes. Likewise, when querying data in Kafka, Hive can push down filters on partitions or offsets so that only the matching data is read from the topic.
https://hortonworks.com/blog/benchmark-update-apache-hive-druid-integration-hdp-3-0/
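A sketch of the Kafka case (topic name and broker address are hypothetical): Hive's Kafka storage handler exposes metadata columns such as `__partition`, `__offset`, and `__timestamp`, and filters on these columns can be pushed into the scan so only the matching slice of the topic is read.

```sql
CREATE EXTERNAL TABLE kafka_events (
  user_id STRING,
  event_type STRING
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "events",                   -- hypothetical topic
  "kafka.bootstrap.servers" = "broker1:9092"  -- hypothetical broker
);

-- Filters on the metadata columns can be pushed down,
-- so Hive reads only the matching partition/offset range.
SELECT user_id, event_type
FROM kafka_events
WHERE `__partition` = 0 AND `__offset` > 1000;
```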
3. Federation to JDBC sources
The combination of storage handlers and Calcite adapters is flexible and powerful, especially when Hive is used as a federation layer over different SQL systems.
Apache Hive 3 includes a new JDBC storage handler implementation that allows Calcite's JDBC adapter rules to selectively push computation down to the JDBC source, such as MySQL, PostgreSQL, Oracle, or Redshift. Hive can push down a variety of operations, including projections, filters, joins, aggregations, unions, sorts, and limits.
More importantly, Calcite can adjust its behavior to the system behind the JDBC source. For example, it can recognize that a function in a query's filter condition is not supported by the JDBC source; in that case, it evaluates that function in Hive while still pushing the remaining supported filters down to the JDBC system. In addition, Hive can generate SQL in different dialects, depending on the database behind the JDBC connection.
Another important feature is that the JDBC storage handler can split a query into multiple sub-queries that are sent to the execution engine in parallel, speeding up reads of large amounts of data from the source.
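A sketch of the parallel split feature, assuming an existing JDBC-backed table (the table and column names are hypothetical): the `hive.sql.numPartitions` and `hive.sql.partitionColumn` table properties ask the handler to split the read into range-based sub-queries on a numeric column.

```sql
-- Ask the JDBC storage handler to split reads of this table into
-- 4 parallel range-based sub-queries on the numeric column order_id.
ALTER TABLE jdbc_orders SET TBLPROPERTIES (
  "hive.sql.numPartitions" = "4",
  "hive.sql.partitionColumn" = "order_id"
);
```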
The following figure shows an example of Hive's new JDBC push-down feature; note how the generated SQL differs between MySQL and PostgreSQL.
4. Next steps: automatic metadata mapping
To query data in another system, a user only needs to create an external Hive table with the appropriate storage handler, along with some information about that external system. For example, to create a Hive table over the table 'item' in the PostgreSQL database 'organization', a user can use a statement like the following:
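The original table-building statement did not survive extraction; here is a minimal sketch of what it looks like with Hive 3's JDBC storage handler. The column schema, host, and credentials are hypothetical.

```sql
CREATE EXTERNAL TABLE item (
  item_id INT,      -- hypothetical schema
  name STRING,
  price DOUBLE
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "POSTGRES",
  "hive.sql.jdbc.driver" = "org.postgresql.Driver",
  "hive.sql.jdbc.url" = "jdbc:postgresql://db-host/organization",  -- hypothetical host
  "hive.sql.dbcp.username" = "hive",     -- hypothetical credentials
  "hive.sql.dbcp.password" = "secret",
  "hive.sql.table" = "item"
);
```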
Although this table-building statement is simple enough, our ambitions for this feature go much further.
As the example shows, the statement must also declare the schema of the table as it exists in PostgreSQL. HIVE-21060 proposes that, when an external table is configured through JDBC, Hive should discover the schema automatically instead of requiring it to be declared in the CREATE statement.
HIVE-21060:
https://jira.apache.org/jira/browse/HIVE-21060
HIVE-21059, on the other hand, tracks support for external catalogs. An external catalog lets you create a catalog in the metastore that points to an external database (for example, a MySQL instance), so that all of the tables under that catalog are automatically recognized and queryable by Hive.
HIVE-21059:
https://jira.apache.org/jira/browse/HIVE-21059
5. Summary
Federation brings a lot of flexibility to Hive, so let's take a look at what you can do with it:
1. Combining the best tools for the job: application developers can access multiple data processing systems through a single interface. There is no need to learn different query dialects or interfaces, and no need to worry about data security and governance; Hive takes care of all of it, which greatly simplifies integration work.
2. Transform & write-back: you can use Hive SQL to transform data stored outside Hive. For example, you can use Hive to query Kafka, transform the data, and write it back to Kafka.
3. Simplifying data loads: you can query other systems directly from Hive, clean and enrich the data, and merge it into Hive tables, all without deploying any other tools. This makes ETL very simple.
4. Querying across multiple systems: for ad-hoc queries or small reports, defining an ETL process and moving all the data to one storage location is unrealistic. With federation, you only need to write a single SQL query that spans multiple data sources.
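A sketch of the last point, assuming a JDBC-backed table `jdbc_orders` and a Kafka-backed table `kafka_events` have each been created with the storage handlers described above (both table names and the join condition are hypothetical):

```sql
-- One query spanning a JDBC source and a Kafka topic.
-- Hive plans the scans, pushes down what each source supports,
-- and performs the join itself.
SELECT o.order_id, o.amount, e.event_type
FROM jdbc_orders o                        -- hypothetical JDBC-backed table
JOIN kafka_events e                       -- hypothetical Kafka-backed table
  ON o.order_id = CAST(e.user_id AS INT);
```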
That concludes this look at how Apache Hive 3 implements cross-database federated queries. I hope the content above is helpful and teaches you something new. If you found the article useful, feel free to share it with others.