This article explains how to build a real-time data warehouse by combining TiDB and Flink, covering the basic concepts, several common architectures, representative user cases, and a docker-compose environment you can try yourself.
The concept of real-time data warehouse
The concept of the data warehouse was put forward by Bill Inmon in the 1990s: a subject-oriented, integrated, non-volatile collection of data that reflects historical change and is used to support management decisions. At that time, data warehouses collected data from data sources through message queues and computed results daily or weekly for reporting; this is also known as the offline data warehouse.
In the 21st century, with the development of computing technology and the growth of overall computing power, decision-making has gradually shifted from humans to computer algorithms, and new requirements have emerged, such as real-time recommendation and real-time monitoring and analysis. The corresponding decision cycle has shrunk from days to seconds. In these scenarios, the real-time data warehouse emerged.
There are three main real-time data warehouse architectures: the Lambda architecture, the Kappa architecture, and the real-time OLAP variant architecture:
The Lambda architecture builds a real-time layer on top of the offline data warehouse and uses a streaming engine to process the real-time data; the offline and real-time results are then merged and served together.
The Kappa architecture removes the offline data warehouse and produces all data in real time. This architecture unifies the computing engine and reduces development costs.
As real-time OLAP technology has improved, a new real-time architecture has been proposed, tentatively called the "real-time OLAP variant". Simply put, part of the computing pressure is shifted from the streaming engine to a real-time OLAP analysis engine, allowing more flexible real-time data warehouse calculations.
To sum up, the Lambda architecture requires maintaining both streaming and batch engines, so its development cost is higher than the other two. Compared with the Kappa architecture, the real-time OLAP variant architecture can perform more flexible calculations, but it relies on additional real-time OLAP computing resources. The Flink + TiDB real-time data warehouse solution introduced next belongs to the real-time OLAP variant architecture.
For a more detailed comparison of real-time data warehouses and these architectures, interested readers can refer to this article in the Flink Chinese community.
Flink + TiDB real-time data warehouse
Flink is a big data computing engine with low latency, high throughput, and unified stream-batch processing. It is widely used in scenarios with high real-time requirements and provides important features such as exactly-once semantics.
With the introduction of TiFlash, TiDB has become a true HTAP (Hybrid Transactional/Analytical Processing) database. In other words, in a real-time data warehouse architecture, TiDB can serve both as the business database at the data source, handling business queries, and as the real-time OLAP engine for analytical computation.
Combining the characteristics of Flink and TiDB, the advantages of Flink + TiDB become clear: first, speed is guaranteed, since both can increase computing power by scaling out horizontally; second, the learning and configuration cost is relatively low, because TiDB is compatible with the MySQL 5.7 protocol, and the latest version of Flink lets you write and submit tasks entirely through Flink SQL and its powerful connectors, saving users' learning costs.
For a Flink + TiDB real-time data warehouse, the following are several commonly used prototypes that can meet different needs and be extended in practice.
Using MySQL as the data source
Using the flink-connector-mysql-cdc provided by Ververica, Flink can generate dynamic tables directly from the MySQL binlog, acting both as the collection layer and as the stream computing layer for tasks such as streaming joins and pre-aggregation. Finally, Flink writes the computed results to TiDB through the JDBC connector.
The advantage of this architecture is that it is very simple and convenient: once the corresponding databases and tables exist in both MySQL and TiDB, the task can be registered and submitted by writing nothing but Flink SQL, as sketched below. Readers can try this architecture in the "Try in docker-compose" section at the end of this article.
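The following is a minimal sketch of such a job. It assumes a hypothetical `orders` table in MySQL and a pre-created table of the same shape in TiDB; the connector option names follow the flink-connector-mysql-cdc and Flink JDBC connector documentation, but the hostnames, credentials, and column names are placeholders.

```sql
-- Source: read the MySQL binlog of the (hypothetical) orders table as a dynamic table.
CREATE TABLE orders_src (
    order_id BIGINT,
    user_id  BIGINT,
    amount   DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname'  = 'mysql',
    'port'      = '3306',
    'username'  = 'root',
    'password'  = 'example',
    'database-name' = 'test',
    'table-name'    = 'orders'
);

-- Sink: write the change stream into TiDB through the JDBC connector
-- (TiDB is MySQL compatible, so a MySQL JDBC URL is used).
CREATE TABLE orders_sink (
    order_id BIGINT,
    user_id  BIGINT,
    amount   DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector'  = 'jdbc',
    'url'        = 'jdbc:mysql://tidb:4000/test',
    'table-name' = 'orders',
    'username'   = 'root',
    'password'   = ''
);

-- Submitting this INSERT registers a continuous synchronization job.
INSERT INTO orders_sink SELECT * FROM orders_src;
```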
Connecting Flink with Kafka
If the data has already been stored in Kafka by other means, Flink can easily consume it through the Flink Kafka Connector.
It is worth mentioning that if you want to store the change logs of MySQL or other data sources in Kafka for subsequent Flink processing, it is recommended to collect them with Canal or Debezium, because Flink 1.11 natively supports parsing the changelog formats of these two tools, so no additional parser needs to be implemented (a sketch follows).
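As an illustration, here is a minimal sketch of such a Kafka-backed source table, assuming Canal writes the binlog of a hypothetical `user_behavior` table to a Kafka topic of the same name; with Debezium as the collector, the format would be 'debezium-json' instead.

```sql
-- Changelog source: decode Canal's JSON change records from Kafka into a dynamic table.
CREATE TABLE user_behavior (
    user_id  BIGINT,
    item_id  BIGINT,
    behavior STRING,
    ts       TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic'     = 'user_behavior',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id'          = 'flink-consumer',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'canal-json'
);
```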
Using TiDB as the data source
TiCDC is a TiDB incremental data synchronization tool implemented by pulling TiKV change logs; it can output TiDB change data to a message queue, from which Flink can then consume it.
In version 4.0.7, Flink can be connected through the TiCDC Open Protocol. In later versions, TiCDC will support direct output in canal-json format for Flink to consume.
Cases and practice
The previous section introduced some basic architectures; explorations in practice are often more complex and interesting. This section introduces several representative and instructive user cases.
Xiaohongshu (Little Red Book)
Xiaohongshu is a lifestyle platform for young people, where users record their lives in the form of short videos, pictures and text, share their lifestyles, and interact around shared interests. By October 2019, Xiaohongshu's monthly active users had exceeded 100 million and continued to grow rapidly.
In Xiaohongshu's business architecture, both the data source and the data destination of Flink are TiDB, achieving an effect similar to a "materialized view":
The online business tables handle normal OLTP workloads.
The TiCDC cluster extracts real-time change data from TiDB and passes it to Kafka in the form of a changelog.
Flink reads the changelog from Kafka and performs calculations such as assembling wide tables or aggregation tables.
Flink writes the results back to wide tables in TiDB for subsequent analysis.
The whole process forms a closed loop around TiDB: the join work of subsequent analysis tasks is moved into Flink, and the pressure is relieved through stream computing. This solution currently supports Xiaohongshu's content audit, note tag recommendation, growth audit and other services, and has withstood high-throughput online traffic while running stably. A minimal sketch of the aggregation step is shown below.
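The sketch below assumes hypothetical table names: `notes_changelog` is a Kafka source table in canal-json format fed by TiCDC (registered as in the earlier examples), and `note_stats` is a JDBC sink table in TiDB whose primary key is `author_id`, so the JDBC connector keeps the summary row up to date as the changelog evolves.

```sql
-- Aggregate the note changelog into a per-author summary table in TiDB.
INSERT INTO note_stats
SELECT
    author_id,
    COUNT(*)   AS note_cnt,
    SUM(likes) AS total_likes
FROM notes_changelog
GROUP BY author_id;
```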
Shell Financial Services
Shell Financial Services has been deeply engaged in the housing scenario for many years and has accumulated rich big data on China's real estate market. Driven by financial technology, it uses AI algorithms to efficiently apply multi-dimensional massive data, improving the product experience and providing users with rich, customized financial services.
In the data services of Shell's data group, Flink real-time computing is used for a typical dimension table join:
First, Syncer (a lightweight MySQL-to-TiDB synchronization tool) collects dimension table data from the business data source and synchronizes it to TiDB.
Then, Canal collects the binlog of the flow (fact) tables on the business data source and stores it in a Kafka message queue.
Flink reads the change log of the flow tables from Kafka and performs a streaming join, looking up the dimension table in TiDB whenever dimension data is needed (see the lookup-join sketch after this list).
Finally, Flink writes the joined wide table into TiDB for data analysis services.
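The lookup join itself can be expressed roughly as below. The names are hypothetical: `orders_stream` is a Kafka changelog table that declares a processing-time attribute (`proc_time AS PROCTIME()`), `dim_product` is a JDBC table pointing at the dimension table synchronized into TiDB, and `wide_orders` is a JDBC sink table in TiDB.

```sql
-- Dimension table in TiDB, queried on demand during the lookup join.
CREATE TABLE dim_product (
    product_id   BIGINT,
    product_name STRING,
    category     STRING
) WITH (
    'connector'  = 'jdbc',
    'url'        = 'jdbc:mysql://tidb:4000/dim',
    'table-name' = 'product'
);

-- Widen each order event with dimension columns and land it in TiDB.
INSERT INTO wide_orders
SELECT o.order_id, o.amount, p.product_name, p.category
FROM orders_stream AS o
JOIN dim_product FOR SYSTEM_TIME AS OF o.proc_time AS p
    ON o.product_id = p.product_id;
```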
With this structure, the main tables in the data service can be joined and landed in real time, so the serving layer only needs to query a single table. This system has been adopted across the core business systems of Shell Financial Services; cross-system data acquisition is unified through the data group's data service, eliminating the need for business teams to develop APIs and in-memory data aggregation code.
PatSnap
PatSnap is a global patent search database integrating 130 million patent records and 170 million chemical structure records from 116 countries and regions since 1790. Users can search, browse and translate patents, generate Insights patent analysis reports for patent value analysis, citation analysis and legal search, and view 3D patent maps.
PatSnap uses Flink + TiDB to replace its original Segment + Redshift architecture.
In the original Segment + Redshift architecture, only an ODS layer was built, and the rules and schema of the written data were not under control. Complex ETL had to be written over the ODS layer to calculate various metrics according to business requirements. The volume of data in Redshift was large, calculations were slow (T+1 latency), and external service performance suffered.
After switching to the real-time data warehouse architecture based on Kinesis + Flink + TiDB, there is no longer a need to build an ODS layer. Flink, as a pre-computing unit, builds ETL jobs directly from the business, fully controlling the storage rules and customizing the schema; only the metrics the business cares about are cleaned and written into TiDB for subsequent analysis and queries, which greatly reduces the volume of data written. Metrics grouped by users/tenants, regions and business actions, combined with time windows at minute, hour and day granularity, are used to build DWD/DWS/ADS layers on TiDB that directly serve business statistics, rankings and other requirements. Upper-layer applications can use the pre-built data directly and gain second-level real-time capabilities. A sketch of such a windowed pre-aggregation follows.
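The sketch below uses hypothetical names: `user_events` is a stream-backed source table with an event-time attribute `ts` (and a watermark defined on it), and `ads_user_action_minute` is an ADS-layer table in TiDB registered through the JDBC connector. Only the metrics the business cares about are written, at one-minute granularity.

```sql
-- Pre-aggregate user actions per tenant, region and action into one-minute windows.
INSERT INTO ads_user_action_minute
SELECT
    tenant_id,
    region,
    action,
    TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
    COUNT(*) AS action_cnt
FROM user_events
GROUP BY tenant_id, region, action, TUMBLE(ts, INTERVAL '1' MINUTE);
```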
User experience: with the new architecture, the volume of ingested data, the storage rules and the computational complexity have all been greatly reduced. Data is processed in Flink jobs according to business requirements before being written into TiDB, and the full Redshift-based ODS layer with T+1 ETL is no longer needed. Through reasonable data layering, the real-time data warehouse built on TiDB is much simpler in architecture and easier to develop and maintain; the performance of data queries, updates and writes is greatly improved; different ad-hoc analysis needs can be met without waiting for a Redshift-style precompilation process; and scaling out is convenient and simple. This architecture is currently going live and is used inside PatSnap to analyze and track user behavior, serving company-level operations dashboards, user behavior analysis, tenant behavior analysis and other functions.
NetEase Interactive Entertainment
NetEase formally established its online games division in 2001. After nearly 20 years of development, it has become one of the seven largest game companies in the world. NetEase ranked second in App Annie's 2020 list of the top 52 global publishers.
In the application architecture of NetEase Interactive Entertainment's billing group, Flink is used on the one hand to write business data sources into TiDB in real time; on the other hand, TiDB serves as the analysis data source for subsequent real-time stream computing in the Flink cluster, which generates analysis reports. In addition, NetEase Interactive Entertainment has developed an internal Flink job management platform to manage the entire life cycle of jobs.
Zhihu
Zhihu is a comprehensive Chinese Internet content platform whose brand mission and North Star is "helping everyone obtain reliable answers efficiently". As of January 2019, Zhihu had more than 220 million users who had contributed a total of 130 million answers.
As a partner of PingCAP and a deep user of Flink, Zhihu developed a set of TiDB-Flink interaction tools and contributed them to the open source community as pingcap-incubator/TiBigData, which mainly includes the following features:
TiDB as a Flink Source Connector, used for batch synchronization of data.
TiDB as a Flink Sink Connector, implemented on top of JDBC.
A Flink TiDB Catalog, which allows TiDB tables to be used directly in Flink SQL without defining them again.
Try in docker-compose
To help readers get hands-on, we provide a docker-compose-based MySQL-Flink-TiDB test environment at https://github.com/LittleFall/flink-tidb-rdw for experimentation.
A simple tutorial for this scenario is provided in the Flink TiDB real-time data warehouse slides, including concept explanations, code examples, basic principles and some caveats, covering:
A first try at Flink SQL
Using Flink to import data from MySQL to TiDB
Dual-stream Join
Dimension table Join
After starting docker-compose, you can write and submit Flink tasks through the Flink SQL Client and observe task execution at localhost:8081.
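For example, the dual-stream join exercise boils down to a statement like the following. The table names are hypothetical: `base` and `stuff` would be registered as mysql-cdc source tables (as in the first sketch above) and `wide_stuff` as a JDBC sink table in TiDB.

```sql
-- Join two change streams and continuously maintain the resulting wide table in TiDB.
INSERT INTO wide_stuff
SELECT s.stuff_id, b.base_id, b.base_location, s.stuff_name
FROM stuff AS s
JOIN base AS b ON s.base_id = b.base_id;
```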