What to pay attention to when integrating Flink 1.10 with Hive

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

In this article, the editor shares some points that need attention when integrating Flink 1.10 with Hive. Most readers are probably not familiar with them, so this article is shared for your reference; I hope you learn a lot from reading it. Let's get to know it!

Flink has officially released version 1.10, which brings a lot of changes. For example:

Flink 1.10 also marks the completion of the Blink integration. With the production-level integration of Hive and full coverage of TPC-DS, Flink not only enhances the processing power of streaming SQL but also has mature batch processing capabilities. This blog post will introduce the main new features and optimizations in this upgrade, the important changes that deserve attention, and what to expect when using the new version.

One of the most important features is the introduction of a production-ready Hive integration.

Flink 1.9 introduced a preview version of the Hive integration. That version allows users to use SQL DDL to persist Flink-specific metadata to the Hive Metastore, to call UDFs defined in Hive, and to read and write Hive tables. Flink 1.10 further develops and refines this feature, resulting in a production-ready Hive integration that is fully compatible with major versions of Hive.

The problems the author encountered are classified and summarized below. If you run into all kinds of strange problems in a production environment, they may offer some clues:

Architecture design

When creating a runtime environment, Flink also creates a CatalogManager, which is used to manage different Catalog instances. This is how our Flink runtime environment accesses Hive:

The examples given on the official website are as follows:
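The official example did not survive extraction here, so the following is a minimal sketch based on the Flink 1.10 documentation. The catalog name, default database, configuration directory, and Hive version are placeholders you would adjust for your own cluster:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveCatalogExample {
    public static void main(String[] args) {
        // The Hive integration requires the Blink planner.
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inBatchMode()
                .build();
        TableEnvironment tableEnv = TableEnvironment.create(settings);

        String name = "myhive";                 // catalog name (placeholder)
        String defaultDatabase = "default";     // Hive database to use
        String hiveConfDir = "/opt/hive-conf";  // directory containing hive-site.xml
        String version = "2.3.4";               // Hive version (placeholder)

        // Register the HiveCatalog with the CatalogManager and make it current.
        HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version);
        tableEnv.registerCatalog(name, hive);
        tableEnv.useCatalog(name);

        // From here on, SQL statements resolve tables against the Hive Metastore.
    }
}
```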

Hive Catalog + Hive requires a configuration file

Both Hadoop and Spark rely on a hive-site.xml configuration file when connecting to Hive, and Flink likewise needs a configuration file when integrating with Hive: sql-client-hive.yaml. This file contains, among other things, the path to the Hive configuration files and the execution engine. The official website gives the following configuration example:
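The example from the official website was lost in extraction; the fragment below is a sketch in the shape of the Flink 1.10 SQL Client environment file, with placeholder paths and versions:

```yaml
catalogs:
  - name: myhive
    type: hive
    hive-conf-dir: /opt/hive-conf   # directory containing hive-site.xml (placeholder)
    hive-version: 2.3.4             # your Hive version (placeholder)

execution:
  planner: blink                    # the Hive integration requires the Blink planner
  type: batch
```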

The official website also gives a warning ⚠️ message as follows:

This means that a hive-site.xml file is required locally, and that the planner entry in sql-client-hive.yaml must be set to blink.

SQL CLI tool support

This tool is similar to an interactive conversation window and can be started with the sql-client.sh script, which runs as follows:
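The invocation shown on the official site was lost here; a typical way to start the CLI with a Hive-specific environment file looks like this (the file path is a placeholder, and the `-d` flag selects the defaults environment file in Flink 1.10):

```shell
# Start the Flink SQL CLI in embedded mode with the Hive environment file.
./bin/sql-client.sh embedded -d conf/sql-client-hive.yaml
```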

Note that the machine running the script must have the necessary environment variables set, such as HADOOP_CONF_DIR, HIVE_HOME, and HADOOP_CLASSPATH. These can simply be taken from the environment variables specified when the Hadoop cluster was set up.
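For example, the variables might be set like this before launching the CLI (all paths are placeholders for your own installation):

```shell
# Point Flink at the Hadoop and Hive installations (placeholder paths).
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/opt/hive
# The documented way to populate the classpath from an existing Hadoop install:
export HADOOP_CLASSPATH=`hadoop classpath`
```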

Necessary dependencies and version differences

Flink 1.10 supports integration with many versions of Hive, and different Hive versions require different jar packages. For more information, see: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/hive/#connecting-to-hive
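As an illustration, for Hive 2.3.4 with Flink 1.10 the required jars include roughly the following (version numbers are illustrative; check the linked documentation for the exact jars your Hive and Hadoop versions need):

```shell
# Copy the Hive connector and the hive-exec jar into Flink's lib directory.
cp flink-connector-hive_2.11-1.10.0.jar $FLINK_HOME/lib/
cp hive-exec-2.3.4.jar $FLINK_HOME/lib/
# Hadoop dependencies are also needed, e.g. via HADOOP_CLASSPATH or a shaded Hadoop jar.
```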

In addition, the official website also lists some caveats ⚠️ about the current Hive support:

(The original notes are in simple English and are not translated here.)

Advantages and disadvantages

This release mentions several major optimizations, including Projection Pushdown (read only the necessary columns), Limit Pushdown (a LIMIT in SQL reduces the amount of data read), partition pruning (read only the required partitions), and so on. Generally speaking, these are common SQL optimization techniques.
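The three optimizations can be seen in a single query; the table and column names below are hypothetical:

```sql
-- Assuming a Hive table `orders` partitioned by `dt`:
SELECT user_id, amount        -- projection pushdown: only these two columns are read
FROM orders
WHERE dt = '2020-02-01'       -- partition pruning: only this partition is scanned
LIMIT 100;                    -- limit pushdown: reading stops after 100 rows
```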

The current shortcomings mainly include:

Storage formats are not yet fully supported. The documentation states: "We have tested on the following table storage formats: text, csv, SequenceFile, ORC, and Parquet." Broader support will presumably arrive in a later release.

In addition, ACID tables and bucketed tables are not yet supported.

As the absolute core of the data warehouse ecosystem, Hive handles most of the offline ETL computation and data management, so we look forward to Flink supporting it fully in the future.

That is all the content of the article "What to pay attention to when integrating Flink 1.10 with Hive". Thank you for reading! I believe you now have a certain understanding of the topic, and I hope the shared content helps you. If you want to learn more, welcome to follow the industry information channel!


© 2024 shulou.com SLNews company. All rights reserved.
