2025-01-18 Update From: SLTechnology News&Howtos
This article looks at how data architectures have evolved toward Flink, from traditional centralized data infrastructure, through big data platforms, to stateful stream computing.
01 Traditional data architecture
As shown in figure 1-1, the defining feature of the traditional monolithic architecture (Monolithic Architecture) is centralized data storage. An enterprise may run many systems, such as a web business system, an order system, a CRM system, an ERP system and a monitoring system. The transactional data of these systems is mostly stored in a centralized relational database (DBMS), and the architecture is usually divided into a computing layer and a storage layer.
The storage layer handles data access for the enterprise's systems and is ultimately responsible for guaranteeing data consistency. This data reflects the current state of the business, such as the order transaction volume of a system, the number of active users on the website, and changes in each user's transaction volume. All update operations go through the same database.
▲ Figure 1-1 Traditional data architecture
The monolithic architecture is very efficient at first, but as time passes and the business grows, the system becomes very large and increasingly difficult to maintain and upgrade. The database is the single source of truth, and every application must access it to obtain its data, so any change to or problem with the database affects the entire business system.
Later, with the emergence of the microservices architecture (Microservices Architecture), enterprises gradually began to adopt microservices for their business systems. The core idea is that an application is composed of several small, independent microservices, each of which runs in its own process and can be developed and released independently. Different services can be built on different technology stacks according to business needs, and each can focus on a limited set of business functions.
▲ Figure 1-2 Microservices architecture
As shown in figure 1-2, the microservices architecture decomposes the system into independent service modules, each with its own database. This model solves the development problems of business systems, but it also brings a new problem: business transaction data is scattered across different systems, which makes centralized data management difficult.
For applications such as data analysis or data mining within the enterprise, data must be extracted from the different databases and periodically synchronized to a data warehouse, where it goes through extraction, transformation and loading (ETL) to build data marts and applications for the business systems to use.
02 Big data architecture
At first, data warehouses were mainly built on relational databases such as Oracle and MySQL. As enterprise data grew, however, relational databases could no longer support the storage and analysis of large-scale data sets, so more and more enterprises began to build big data platforms on Hadoop.
At the same time, many SQL-on-Hadoop solutions made it simple and efficient for enterprises to build different types of data applications on Hadoop, for example using Apache Hive for ETL processing and Apache Impala for real-time interactive queries.
The rise of big data technology enables enterprises to use their business data more flexibly and efficiently, extract more value from it, and apply the results of data analysis and mining to decision-making, marketing, management and other areas. But inevitably, as more and more new technologies are introduced, an enterprise's big data management platform may end up being built from many open source components.
For example, when building an enterprise data warehouse, data is usually synchronized periodically from the business systems to the big data platform and, after a series of ETL transformations, ends up in applications such as data marts. Some latency-sensitive applications, such as real-time report statistics, must display results with very low delay, so the industry proposed the Lambda architecture to handle different types of data.
As shown in figure 1-3, the big data platform contains a Batch Layer for batch computing and a Speed Layer for real-time computing. Batch computing and stream computing are integrated in one platform, for example with Hadoop MapReduce for batch processing and Apache Storm for real-time processing.
This architecture solves the problem of supporting different computing types to some extent, but using so many frameworks leads to high platform complexity and high operation and maintenance costs. It is also very difficult to manage different types of computing frameworks on one resource management platform. All in all, the Lambda architecture is an effective solution for building big data applications, but not a perfect one.
▲ Figure 1-3 Big data Lambda architecture
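The split between the two layers can be illustrated with a small sketch. This is not how Hadoop MapReduce or Apache Storm are programmed; it is a plain-Python illustration in which the names `batch_view`, `SpeedLayer` and `serve`, the event format and the cutoff value are all assumptions made for the example. It shows why two code paths have to be maintained and merged.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: recompute counts from all stored events up to the cutoff."""
    return Counter(key for ts, key in events if ts <= cutoff)


class SpeedLayer:
    """Speed layer: incrementally count only events newer than the batch cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.counts = Counter()

    def on_event(self, ts, key):
        if ts > self.cutoff:
            self.counts[key] += 1


def serve(batch, speed):
    """Serving step: merge the batch view with the real-time view."""
    return batch + speed.counts


# Hypothetical (timestamp, event type) pairs.
events = [(1, "order"), (2, "click"), (3, "order"), (4, "order"), (5, "click")]
cutoff = 3  # the last batch run covered timestamps <= 3

batch = batch_view(events, cutoff)
speed = SpeedLayer(cutoff)
for ts, key in events:
    speed.on_event(ts, key)

merged = serve(batch, speed)
print(dict(merged))
```

The same counting logic appears twice, once per layer, which is a small-scale version of the complexity and maintenance cost described above.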
Later, Apache Spark, a distributed in-memory processing framework, proposed handling streaming data by splitting it into micro-batches, so that batch computing and stream computing could be completed in a single computing framework.
However, because Spark itself is built on batch processing, it cannot handle native data streams perfectly and efficiently, so its support for stream computing is relatively weak. Spark can be seen, to some extent, as an upgrade and optimization of the Hadoop architecture.
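The micro-batch idea can be sketched in a few lines. This is an illustration only and does not use Spark's API; the function name `micro_batches`, the event format and the interval are assumptions. Events are grouped into fixed intervals and each group is processed as a small batch, so a result is only available once its interval has closed.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive batches of `interval` time units."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append(value)
    # Return (batch start time, values) pairs in time order.
    return [(index * interval, values) for index, values in sorted(batches.items())]


# Hypothetical (timestamp, value) pairs.
events = [(0, 5), (1, 7), (4, 1), (5, 2), (9, 3)]

# Each micro-batch is processed as an ordinary batch job, here a sum.
results = [(start, sum(values)) for start, values in micro_batches(events, interval=5)]
print(results)
```

In this model the batch interval is a lower bound on latency, which is one way to read the remark that a batch-based engine handles native streams less efficiently.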
03 Stateful stream computing architecture
Data is, by nature, produced as a series of real events. The architectures described above all violate this nature to some extent: they process business data after a certain delay, and only then compute accurate statistics from it.
In practice, because of the limitations of stream computing technology, it has been difficult to compute statistical results directly as the data is generated. Doing so places very high demands on the system, which must simultaneously achieve high performance, high throughput and low latency.
The stateful stream computing architecture (shown in figure 1-4) meets these needs to a certain extent. Enterprises maintain the state of all computations on top of real-time streaming data. The so-called state is the intermediate results produced during computation. Each time new data enters the streaming system, the computation builds on the intermediate state and finally produces correct statistical results.
The biggest advantage of stateful computing is that there is no need to fetch the raw data from external storage again and recompute everything, which can be very expensive. From another point of view, users no longer need to schedule and coordinate various batch computing tools to obtain statistics from the data warehouse and then persist them. All of this can be done with stream computing, which greatly reduces the system's dependence on other frameworks and cuts the time and hardware storage consumed during computation.
▲ Figure 1-4 Stateful computing architecture
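The difference between updating intermediate state and recomputing from raw data can be shown with a running average. This is a minimal sketch in plain Python rather than Flink code; the class `RunningAverage` and the sample values are assumptions for illustration.

```python
class RunningAverage:
    """Keeps intermediate state (count and total) instead of the raw history."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        # Constant work per event: only the small state is touched.
        self.count += 1
        self.total += value
        return self.total / self.count


def full_recompute(history):
    """Stateless alternative: re-read all raw data for every new result."""
    return sum(history) / len(history)


stream = [10.0, 20.0, 60.0]  # hypothetical measurements
state = RunningAverage()
history = []
incremental = []
recomputed = []
for value in stream:
    history.append(value)
    incremental.append(state.update(value))
    recomputed.append(full_recompute(history))

print(incremental)
print(recomputed)
```

Both approaches give the same answers, but the stateful version never has to go back to the stored raw data, which is the cost saving described above.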
If the results are equally consistent, real-time computing delivers them in a very short time while batch computing has to wait for some time, so most users will probably prefer stateful stream computing for big data processing.
04 Why Flink?
It can be seen that stateful stream computing will gradually become the architectural model enterprises use to build data platforms, and from the perspective of the open source community, currently only Apache Flink meets this need. By implementing the Google Dataflow stream computing model, Flink provides a real-time stream computing framework with high throughput, low latency and high performance.
At the same time, Flink supports highly fault-tolerant state management, which prevents state from being lost when the system fails during computation. Flink periodically persists state through Checkpoints, its distributed snapshot technology, so that correct results can still be computed after system downtime or other failures.
Flink has an advanced architecture, many excellent features and a complete programming interface, and it keeps introducing new features with each release. One example is Queryable State, which lets users read the state of a streaming job remotely, so data can be queried directly from a Flink streaming application without first being written to a database. A real-time interactive query service can thus read the latest results directly from Flink's state.
In the future, Flink will serve not only as a real-time stream processing framework but also as a real-time state storage engine, allowing more users to benefit from stateful computing.
The specific advantages of Flink are as follows.
1. Simultaneous support for high throughput, low latency and high performance
Flink is currently the only distributed stream processing framework in the open source community that combines high throughput, low latency and high performance. Apache Spark, for example, can only offer high throughput and high performance, mainly because Spark Streaming cannot guarantee low latency, while the stream computing framework Apache Storm supports low latency and high performance but cannot meet high-throughput requirements. Meeting all three goals at once is very important for a distributed stream computing framework.
2. Support for the concept of event time (Event Time)
Window computation plays an important role in stream computing, but at present most frameworks compute windows on system time (Processing Time), that is, the current time of the host at the moment an event reaches the computing framework for processing.
Flink supports window computation based on event time (Event Time) semantics, that is, the time at which the event was generated. This lets the streaming system compute accurate results even when events arrive out of order, preserving the original ordering of events and minimizing the influence of network transmission or hardware.
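The effect of the two time notions can be seen by assigning the same events to one-minute windows twice. The sketch below is plain Python, not Flink's windowing API, and it leaves out watermarks and late-data handling entirely; the function `tumbling_counts` and the sample timestamps are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_counts(events, size, time_of):
    """Count events per fixed-size window, keyed by window start time."""
    counts = defaultdict(int)
    for event in events:
        window_start = (time_of(event) // size) * size
        counts[window_start] += 1
    return dict(counts)


# Hypothetical events as (event_time, arrival_time) in seconds.
# The last event was generated at second 59 but arrived at second 63.
events = [(1, 2), (58, 59), (61, 62), (59, 63)]

by_event_time = tumbling_counts(events, 60, lambda e: e[0])
by_processing_time = tumbling_counts(events, 60, lambda e: e[1])

print(by_event_time)
print(by_processing_time)
```

With event time the delayed event is still counted in the first minute, where it actually happened. With processing time it is counted in the second minute, so the result depends on network delay rather than on the data.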
3. Support for stateful computing
Flink implements state management in version 1.4. The so-called state is the intermediate result data of an operator, kept in memory or in the file system during stream computation. When the next event enters the operator, the current result can be computed from the previous state, so results do not have to be recomputed from all the raw data every time. This greatly improves system performance and reduces the resources consumed by the computation.
For stream computing scenarios with large data volumes and very complex processing logic, stateful computing plays a very important role.
4. Support for highly flexible Window operations
In streaming applications the data is continuous, so aggregations over a bounded range are computed through windows, for example counting how many users clicked on a web page in the past minute. In this case, a window must be defined to collect the data from the last minute and compute over the data in that window.
Flink offers window operations based on Time, Count and Session, as well as data-driven windows. Windows can be customized with flexible trigger conditions to support complex streaming patterns, and users can define different window trigger mechanisms to meet different needs.
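To make the window types more concrete, here is a simplified sketch of count windows and session windows in plain Python. It does not use Flink's window API and does not claim to reproduce its exact semantics; the function names, the click timestamps and the gap value are assumptions for illustration.

```python
def count_windows(items, size):
    """Count window: emit a window each time `size` elements have arrived."""
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]


def session_windows(timestamps, gap):
    """Session window: group events separated by less than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # inactivity gap reached, start a new session
    return sessions


clicks = [1, 3, 4, 20, 22, 50]  # hypothetical click times in seconds

by_count = count_windows(clicks, size=2)
by_session = session_windows(clicks, gap=5)

print(by_count)
print(by_session)
```

A count window closes after a fixed number of elements regardless of time, while a session window has no fixed length and closes only after a period of inactivity, which suits user-activity data.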
5. Fault tolerance based on lightweight distributed snapshots (Snapshot)
Flink can run on thousands of nodes. It breaks a large computing job into small computing steps and distributes the tasks to parallel nodes for processing. During execution it can automatically detect data inconsistencies caused by errors in event processing, such as node downtime, network transmission problems, or restarts of the computing service caused by user upgrades or fixes.
In these cases, Checkpoints based on distributed snapshots persist the state of the running job. Once a task stops abnormally, Flink can automatically recover it from the Checkpoints, ensuring data consistency during processing.
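The recovery idea can be sketched on a single process. Flink's real mechanism coordinates snapshots across many parallel tasks, which this example does not attempt to model; the class `CountingJob`, its methods and the input log are assumptions for illustration. The key point is that the snapshot stores the operator state together with the input position it corresponds to.

```python
import copy


class CountingJob:
    """A toy job that counts keys read from a replayable input log."""

    def __init__(self):
        self.offset = 0   # position in the input processed so far
        self.counts = {}  # operator state: count per key

    def process(self, log, upto):
        while self.offset < upto:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self):
        # Snapshot the state and the matching input position together.
        return copy.deepcopy({"offset": self.offset, "counts": self.counts})

    @classmethod
    def restore(cls, snapshot):
        job = cls()
        job.offset = snapshot["offset"]
        job.counts = dict(snapshot["counts"])
        return job


log = ["a", "b", "a", "c", "a"]  # hypothetical replayable input

job = CountingJob()
job.process(log, upto=3)
snapshot = job.checkpoint()
job.process(log, upto=4)  # progress made after the checkpoint...
del job                   # ...is lost when the job crashes

# Recovery: restore the snapshot and replay the input from its offset.
recovered = CountingJob.restore(snapshot)
recovered.process(log, upto=len(log))
print(recovered.counts)
```

Because the state and the offset are saved together, replaying from the snapshot counts each input record exactly once, even though one record was processed twice across the crash.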
6. Independent memory management on top of the JVM
Memory management is an important part of every computing framework, and how data is managed in memory matters especially for compute-heavy workloads. Flink implements its own memory management mechanism to minimize the impact of JVM GC on the system.
In addition, Flink serializes all data objects into binary form and stores them in memory, deserializing them when needed. This reduces the size of stored data, uses memory more efficiently, and lowers the risk of performance degradation or task failures caused by GC. As a result, Flink is more stable than other distributed processing frameworks, and problems such as JVM GC do not affect the operation of the whole application.
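The idea of keeping records as bytes in one buffer, rather than as many separate objects, can be illustrated by analogy in Python with the standard `struct` module. This is not Flink's implementation, which runs on the JVM; the record layout, the names `RECORD`, `pack_records` and `read_record`, and the sample data are assumptions for illustration.

```python
import struct

# Hypothetical fixed record layout: int64 user id followed by float64 amount.
RECORD = struct.Struct("<qd")


def pack_records(records):
    """Serialize all records into one contiguous, preallocated buffer."""
    buf = bytearray(RECORD.size * len(records))
    for i, (user_id, amount) in enumerate(records):
        RECORD.pack_into(buf, i * RECORD.size, user_id, amount)
    return buf


def read_record(buf, index):
    """Deserialize a single record on demand, without touching the others."""
    return RECORD.unpack_from(buf, index * RECORD.size)


records = [(1, 9.5), (2, 20.0), (3, 0.25)]
buf = pack_records(records)

print(len(buf))
print(read_record(buf, 1))
```

The whole data set lives in a single buffer object of known size, so a garbage collector has one object to track instead of one per record, which is the kind of saving the paragraph above describes.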
7. Savepoints (Savepoint)
Streaming applications run around the clock with data arriving continuously, so stopping an application for a period of time, for example to upgrade the cluster version or perform maintenance that requires downtime, may lead to data loss or inaccurate results.
It is worth mentioning that Flink uses Savepoints to store a snapshot of the job's execution on a storage medium. When the job is restarted, it can resume directly from the saved Savepoint and restore its original computing state, so that it continues running from the state it had before the downtime. Savepoints let users manage and operate real-time streaming applications more easily.