Is it possible for Flink State to replace the database? 04/10 Update SLTechnology News&Howtos

Is it possible for Flink State to replace the database?

2025-04-10 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

Stateful computing, as a guarantee of fault tolerance and data consistency, is one of the essential features of real-time computing. Popular real-time computing engines including Google Dataflow, Flink, Spark (Structure) Streaming and Kafka Streams provide support for built-in State respectively. The introduction of State enables real-time applications to store metadata and intermediate data without relying on external databases, and in some cases even directly use State to store result data, which makes the industry wonder: what is the relationship between State and Database? Is it possible to replace the database with State? On this topic, the Flink community began to explore relatively early. Generally speaking, the efforts of the Flink community can be divided into two lines: one is the ability to access State through the job query interface while the job is running, that is, the ability of QueryableState; to query and modify State offline through State's offline dump file (Savepoint), that is, the upcoming Savepoint Processor API. In Flink 1.2, released by QueryableState in 2017, Flink introduced the QueryableState feature to allow users to query the contents of job State through a specific client [1], which means that Flink applications can provide real-time access to computing results without relying on external storage outside the State storage medium.

Real-time data access is provided only through QueryableState. However, although QueryableState is idealized in imagination, it has been in the Beta version and can not be used in production environment because of many changes depending on the underlying architecture and limited functions. To solve this problem, some time ago, Yang Hua, an engineer at Tencent, put forward an improvement plan for QueryableState [2]. In the mailing list, the community discussed whether QueryableState could be used as a substitute for databases and had different views. Combined with my personal opinions, the author sorts out the main advantages and disadvantages of State as Database as follows. Advantages: lower data latency. In general, the calculation results of Flink applications need to be synchronized to the external database, such as timing trigger output window calculation results, and this synchronization usually brings a certain delay, resulting in the embarrassing situation that the calculation is real-time but the query is not real-time, and direct State can avoid this problem. Stronger data consistency guarantee. The consistency guarantees provided by Flink Connector or custom SinkFunction vary depending on the nature of the external storage. For example, for HBase,Flink that does not support multi-line transactions, Exactly-Once delivery can only be guaranteed through the idempotence of business logic. By contrast, State has a proper guarantee of Exactly-Once delivery. Save resources. By reducing the need to synchronize data to external storage, we can save the cost of serialization and network transmission, as well as the cost of the database. Disadvantages: insufficient SLA protection. Database technology has been very mature, in the availability, fault tolerance and operation and maintenance of a lot of accumulation, at this point, State is still in the primitive period. In addition, from the positioning point of view, Flink jobs have version iterative maintenance or automatically restart the down time caused by errors, which can not achieve the high availability of the database in data access. May lead to the instability of the homework. Unconsidered Ad-hoc Query may require scanning and returning exaggerated data, which puts a heavy load on the system and is likely to affect the normal execution of the job. Even a reasonable Query may affect the execution efficiency of the job in the case of a large number of concurrency. The amount of data stored should not be too large. State runtime is mainly stored in TaskManager local memory and disk. Excessive State will result in TaskManager OOM or insufficient disk space. In addition, a large State means a large checkpoint, which may cause checkpoint to time out and significantly extend the job recovery time. Only basic queries are supported. State can only query the simplest data structure and does not provide computing power such as functions like relational databases, nor does it support optimization techniques such as predicate push-down. It can only be read, not modified. State can only be modified by the job itself at run time. If you really want to modify the State, you can only use the Savepoint Processor API below. Generally speaking, the disadvantages of State instead of database far outweigh its advantages, but for some jobs that do not require high data availability, it is entirely reasonable to use State as a database. Due to the difference in positioning, it is difficult to see the possibility that Flink State can completely replace the database in a short time, but there is no doubt that State develops to the database in terms of data access characteristics. Savepoint Processor APISavepoint Processor API is a new feature recently proposed by the community (see FLIP-42 [3]), which is used to analyze and modify State's dump file Savepoint offline, or to build an initial Savepoint directly from the data. Savepoint Processor API belongs to the State Management of Flink State Evolution. If QueryableState is DSL, Flink State Evolution is DML, and Savepoint Processor API is the most important part of DML. The predecessor of Savepoint Processor API is the third-party Bravo project [4]. The main idea is to provide the ability to convert Savepoint to DataSet. The typical application is that Savepoint is read into DataSet, modified on DataSet, and then written as a new Savepoint. This is suitable for the following scenarios: analysis job State to study its patterns and rules to troubleshoot problems or audit the initial State modification Savepoint built for new applications, such as changing the maximum parallelism of the job and making huge Schema changes to fix problematic State

As the dump file of State, Savepoint can expose the function of data query and modification through Savepoint Processor API, similar to an offline database, but there are still many differences between the concept of State and the concept of typical relational data. FLIP-43 also makes an analogy and summary of these differences. First of all, Savepoint is the physical storage set of multiple operator state, and the state of different operator is independent, which is similar to the table between different namespace under the database. We can get the database corresponding to Savepoint, and a single operator corresponds to Namespace. DatabaseSavepointNamespaceUidTableState but as far as table is concerned, its corresponding concepts in Savepoint vary according to the type of State. There are three types of State: Operator State, Keyed State and Broadcast State, among which Operator State and Broadcast State belong to non-partitioned state, that is, state is not partitioned by key, while Keyed State belongs to partitioned state. For non-partitioned state, state is a line in table for each element of a table,state; for partitioned state, all state under the same operator correspond to a table. This table has a row key like a HBase, and then each specific state corresponds to a column in the table. For example, if there is a data stream of player scores and online time, we need to use Keyed State to record the score and game duration of the player's group, and use Operator State to record the total score and total time of the player. The input of the data flow over a period of time is as follows: user_iduser_nameuser_groupscore1001PaulA5,0001002CharlotteA3,6001003KateC2,0001004RobertB3900user_iduser_nameuser_grouptime1001PaulA1,8001002CharlotteA1,2001003KateC6001004RobertB2000 uses Keyed State, we register two MapState of group_score and group_time respectively to represent the total score and total duration of the group, and update the cumulative values of the two indicators to State according to the user_group keyby data flow, the table is as follows: user_groupgroup_scoregroup_timeA8,6003000C2,00600B3,9002000 relatively, if the total score and total duration are recorded with Operator State (parallelism is set to 1) We register total_score and total_time State and get two tables: total_score |-| 14500 | total_time5600 so far the corresponding relationship between Savepoint and Database should be relatively clear. For Savepoint, there are different StateBackend to determine how the State is persisted, which obviously corresponds to the storage engine of the database. In MySQL, we can change the storage engine with a simple one-line command ALTER TABLE xxx ENGINE = InnoDB;, and MySQL automatically does the tedious format conversion behind it. As for Savepoint, because the respective storage formats of StateBackend are not compatible, it is not convenient to switch StateBackend at present. To this end, the community created FLIP-41 [5] not long ago to further improve the operability of Savepoint. Summary State as Database is a major trend of the development of real-time computing, it is not to replace the use of database, but to learn from the experience of the database field to expand the State interface to make its mode of operation closer to our familiar database. For Flink, the external use of State can be divided into online real-time access and offline access and modification, which will be supported by Queryable State and Savepoint Processor API, respectively. The original link to this article is the original content of Yunqi community and may not be reproduced without permission.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.