Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to extend FlinkSQL to realize join of flow and dimension table

2025-02-27 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Share

Shulou(Shulou.com)06/01 Report--

How to extend FlinkSQL to realize the join of flow and dimension table. In view of this problem, this article introduces the corresponding analysis and solution in detail, hoping to help more partners who want to solve this problem to find a more simple and feasible method.

Extending FlinkSQL to realize join of flow and Dimension Table

FlinkStreamSQL is open source and 380+Star has been obtained on Github.

1. Why expand FlinkSQL?

Real-time computing requires complete SQLization

SQL is the most widely used language in data processing. It allows users to declare their business logic succinctly. It is common for big data batch computing to use SQL, but there are not many real-time computations that support SQL. In fact, using SQL to develop real-time tasks can greatly reduce the threshold of data development. In the kangaroo cloud stack-real-time computing module, we decided to achieve complete SQL.

Advantages of using SQL in data calculation

☑ declarative. Users only need to express what I want, as to how to calculate that is a matter of the system, users do not care.

☑ tuning automatically. The query optimizer can generate the most available execution plan for the user's SQL. Users do not need to know about it to automatically enjoy the performance improvements brought about by the optimizer.

☑ is easy to understand. Many people in different industries and different fields understand that the learning threshold of SQL,SQL is very low, and using SQL as a cross-team development language can greatly improve efficiency.

☑ is stable. SQL is a language with a history of several decades. It is a very stable language with few changes. So when we upgrade the version of the engine, or even replace it with another engine, we can upgrade it in a compatible and smooth manner.

Real-time computing also requires JOIN of streams and dimension tables.

In the world of real-time computing, it's not just the JOIN of stream and stream.

JOIN of streams and dimension tables is also required.

In the world of real-time computing, not only the JOIN of flow and flow, but also the JOIN of flow and dimension table are needed. Last year, during the development of version 3.0 of Kangaroo Cloud, the latest version, FlinkSQL in flink1.6, applied the advantages of SQL to the Flink engine, but did not support JOIN for streams and dimensional tables.

About FlinkSQL

FlinkSQL began to open stream computing service to Alibaba Group in July 2017. although it is a very young product, it has supported thousands of jobs by the Singles' Day. During the Singles' Day, the processing peak of Blink jobs reached 500 million per second, while the total processing peak of FlinkSQL jobs alone reached 300 million per second.

Reference link: https://yq.aliyun.com/articles/457438

Let's first explain what a dimension table is; a dimension table is a dynamic table, and the data stored in the table may not be changed, or it may be updated regularly, but the update frequency is not very frequent. In business development, the general dimension table data is stored in relational databases such as mysql,oracle, and may also be stored in nosql databases such as hbase,redis.

2. To implement the join of flow and dimension table with FlinkSQL, there are two steps:

1. Realize the function of dimensional table with Flink api

In order to realize the dimension table function, it is necessary to use the function of Flink Aysnc Flink Aysnc O, which is contributed to Apache Flink by Alibaba.

Async I Pot O is contributed to the community by Alibaba and introduced in version 1.2. the main purpose is to solve the problem that network delay becomes the bottleneck of the system when interacting with external systems. For a specific introduction, you can see this article: http://wuchong.me/blog/2017/05/17/flink-internals-async-io/

The api corresponding to Flink is the abstract class RichAsyncFunction. The open (initialization), asyncInvoke (data asynchronous invocation), close (stopped operations) methods in the implementation of this abstract class, and the most important thing is to implement the methods in asyncInvoke.

The join of streams and dimension tables will encounter two problems:

The first is performance. Because if the flow rate is very fast, every piece of data needs to go to the dimension table to do join, but the data of the dimension table exists in a third-party storage system. If you access a third-party storage system in real time, not only the performance of join will be poor, but every time you have to go to the network io; will also bring great pressure to the third-party storage system, which may fail the third-party storage system.

So the solution is to cache the data in the dimension table, which can be fully cached, which is mainly due to the small amount of data in the dimension table, and the other is the LRU cache, where the amount of data in the dimension table is relatively large.

Realization of LRU Dimension Table

The second problem is that the deferred data is so related to the previous dimension table data. This involves that the dimension table data needs to store snapshot data, so it is appropriate to use HBase as the dimension table in this scenario, because HBase naturally supports multiple versions of the data.

Realization of ALL Dimension Table

Second, the parsing stream and the SQL syntax of the dimension table join are transformed into the underlying FlinkAPI.

Because FlinkSQL has done most of the SQL scenarios, we can't parse all the syntax of SQL and convert it to the underlying FlinkAPI.

So what we do is parse the SQL syntax to find out if there is a dimension table in the join table, if there is a dimension table, then we will separate the statement of the dimension table of this join, generate a new DataStream with Flink's TableAPI and StreamAPi, and do join with this DataStream and other tables so that we can use SQL to realize the join syntax of the flow and dimension table.

The tool for SQL parsing is to use Apache calcite,Flink and this framework to do SQL parsing. So all grammars can be parsed.

1. DEMO SQL

2. Calcite parses the Insert into statement and splits the sub-statement

3. Calcite continues to parse select statements

Calcite continues to parse select statements

Calcite continues to parse select statements

This is the answer to the join question about how to expand the FlinkSQL implementation flow and dimension table. I hope the above content can be of some help to you. If you still have a lot of doubts to be solved, you can follow the industry information channel for more related knowledge.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Servers

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report