This article provides an example-based analysis of the Lambda architecture and the Kappa architecture in big data processing. It goes into some detail and should serve as a useful reference; interested readers are encouraged to read on.
Typical Internet big data platform architecture
First, let's take a look at the architecture of a typical Internet big data platform, as shown in the following figure:
In this architecture diagram, the components that handle online business processing for users are marked in brown; this part belongs to the Internet online application. The other, blue parts are the big data components, which can be built from open source big data products or developed in-house.
As you can see, the big data platform can be divided, from top to bottom, into three parts: data acquisition, data processing, and data output and display.
Data acquisition
The data and logs generated by the application are synchronized into the big data system. Because the data sources differ, the data synchronization system here is really a combination of several related systems: database synchronization usually uses Sqoop, log synchronization can use Flume, and event-tracking data is formatted and transformed and then transmitted through a message queue such as Kafka.
The quality of data from different sources can vary greatly. Data in a database can often be imported into the big data system and used directly, while data produced by logs and crawlers needs to be cleaned and transformed before it can be used effectively.
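As a rough illustration of the message-queue path described above, here is a minimal sketch, assuming the kafka-python client, a local broker, and a hypothetical topic name user_events; it formats a raw click event and sends it into Kafka, while the Sqoop and Flume pipelines are configured separately and are not shown.

# A minimal sketch of the "collect, format, then push through Kafka" step.
# Assumptions: kafka-python is installed, a broker runs on localhost:9092,
# and the topic name "user_events" is hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # format events as JSON
)

raw_event = {"user_id": 42, "action": "click", "page": "/item/1001", "ts": 1718000000}
producer.send("user_events", value=raw_event)  # downstream consumers load this into HDFS
producer.flush()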
Data processing
This part is the core of big data storage and computation. The data imported by the data synchronization system is stored in HDFS. Computing tasks such as MapReduce, Hive and Spark read the data on HDFS, perform their computation, and write the results back to HDFS.
The computation performed by MapReduce, Hive, Spark and similar engines is called offline computing, and the data stored in HDFS is called offline data. Offline computation on a big data system usually runs against the full data set, for example mining product associations across all historical orders; at that scale the data volume is very large and a job takes a long time to run, which is why it is called offline computation.
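To make this concrete, here is a minimal PySpark sketch of such an offline job: it reads order data from HDFS, counts orders per product, and writes the result back to HDFS. The paths and the product_id column are assumptions for illustration.

# A minimal offline (batch) computation sketch with PySpark.
# The HDFS paths and the "product_id" column are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offline-order-stats").getOrCreate()

orders = spark.read.parquet("hdfs:///warehouse/orders")              # full historical data set
stats = orders.groupBy("product_id").count()                         # offline aggregation over all data
stats.write.mode("overwrite").parquet("hdfs:///output/order_stats")  # results go back to HDFS

spark.stop()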
In addition to offline computing, there are scenarios where the data volume is also large but the processing time must be short. For example, Taobao needs to count the number of orders generated per second for monitoring and publicity. This scenario is known as big data streaming computing and is usually handled by streaming engines such as Storm and Spark Streaming, which can finish in seconds or even milliseconds.
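Here is a hedged sketch of that streaming side, using Spark Structured Streaming (rather than the older DStream-based Spark Streaming mentioned above) to count orders per one-second window from a Kafka topic. The topic name "orders", the broker address, the use of the Kafka message timestamp as event time, and the availability of the Spark Kafka connector on the classpath are all assumptions.

# A minimal streaming sketch: count orders per 1-second window with Spark Structured Streaming.
# Requires the spark-sql-kafka connector package; topic and broker address are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("orders-per-second").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "orders")
          .load())

counts = (stream
          .withColumn("event_time", col("timestamp"))       # Kafka message timestamp as event time
          .groupBy(window(col("event_time"), "1 second"))    # 1-second tumbling windows
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")                                  # print windowed counts for monitoring
         .start())
query.awaitTermination()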
Data output and display
The data produced by the big data computation is still written to HDFS, but applications cannot serve users by reading HDFS directly, so the data in HDFS must be exported to a database. This export synchronization is relatively easy, because the computed data is already fairly well structured: with a little processing it can be exported to the database using a tool such as Sqoop.
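The export can also be done straight from Spark instead of Sqoop; the sketch below, with a hypothetical MySQL URL, table name and credentials (and assuming the MySQL JDBC driver is on the classpath), writes a computed result from HDFS into a relational database through Spark's JDBC writer.

# A minimal export sketch: copy a computed result from HDFS into a relational database.
# Uses Spark's JDBC writer rather than Sqoop; URL, credentials and table name are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("export-order-stats").getOrCreate()

stats = spark.read.parquet("hdfs:///output/order_stats")
(stats.write
 .format("jdbc")
 .option("url", "jdbc:mysql://db-host:3306/report")
 .option("dbtable", "order_stats")
 .option("user", "report_user")
 .option("password", "change_me")
 .mode("overwrite")
 .save())

spark.stop()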
At this point, applications can read the data directly from the database and present it to users in real time, for example showing each user the products recommended for them.
In addition to providing data for user-facing access, the big data platform also needs to provide various statistical reports to operations staff and decision makers. These reports are likewise written into the database and accessed by the corresponding back-office systems. Many operators and managers log in to the back-office data system as soon as they get to work each day to check the previous day's reports and see whether the business is normal. If the numbers are normal or even rising, they can relax a little; if the numbers have fallen, a restless and busy day is about to begin.
What ties the three parts above together is the task scheduling management system. Deciding when different data sets should start synchronizing, how the various MapReduce and Spark tasks should be scheduled so that resources are used sensibly and waiting times stay reasonable, and how urgent ad hoc jobs can be run as soon as possible: all of this is handled by the task scheduling management system.
The big data platform architecture described above is also known as the Lambda architecture, and it is the conventional prototype for building a big data platform. The Lambda architecture prototype is shown in the figure below.
Lambda architecture
The Lambda architecture (Lambda Architecture) is a big data processing architecture proposed by Twitter engineer Nathan Marz. The architecture grew out of Marz's experience with distributed data processing systems at BackType and Twitter.
The Lambda architecture enables developers to build large-scale distributed data processing systems. It offers good flexibility and scalability, as well as good fault tolerance against hardware failures and human errors.
The Lambda architecture consists of three layers of systems: the batch layer (Batch Layer), the speed processing layer (Speed Layer), and the service layer (Serving Layer) for responding to queries.
In the Lambda architecture, each layer has its own task.
The batch layer stores and manages the master data set (an immutable set of raw data) and the pre-computed batch views.
The batch layer uses a distributed processing system that can process large amounts of data to pre-calculate the results. It processes all the existing historical data to achieve the accuracy of the data. This means that it is recalculated based on the complete dataset, can fix any errors, and then update the existing data view. The output is usually stored in a read-only database, and updates completely replace the existing pre-calculated views.
The speed layer processes newly arriving data in real time.
The speed layer minimizes latency by providing a real-time view of the latest data. The data views generated by the speed layer may not be as accurate or complete as the final views generated by the batch layer, but they are available almost immediately after the data is received. When the same data is processed in the batch layer, the data in the speed layer can be replaced.
In essence, the speed layer makes up for the lag of the data view caused by the batch layer. For example, each task in the batch layer takes 1 hour to complete, and during that hour, we cannot get the data view given by the latest task in the batch layer. The speed layer makes up for the lag of 1 hour because it can process the data in real time and give the result.
All the results processed in the batch layer and the speed layer are output and stored in the service layer, which responds to the query by returning the pre-calculated data view or building the data view from the speed layer.
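As a toy illustration of how the serving layer answers a query, the sketch below merges a pre-computed batch view with a real-time speed-layer view. The in-memory dictionaries are stand-ins for whatever stores actually hold the views (for example HBase or Redis), and the key names are hypothetical.

# A toy serving-layer sketch: answer a count query by combining the batch view
# (accurate but lagging) with the speed view (fresh but covering only recent data).
# The in-memory dicts are stand-ins for real view stores such as HBase or Redis.
def query_order_count(product_id, batch_view, speed_view):
    return batch_view.get(product_id, 0) + speed_view.get(product_id, 0)

batch_view = {"p1001": 12850}   # counts up to the last completed batch run
speed_view = {"p1001": 37}      # counts for events that arrived since that run

print(query_order_count("p1001", batch_view, speed_view))  # 12887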
For example, recommendation systems such as advertising prediction generally use the Lambda architecture. A company able to deliver targeted advertising usually holds large amounts of historical data such as user characteristics, user behavior history and web page categories. A popular practice in the industry is to run Alternating Least Squares (ALS), a collaborative filtering (Collaborative Filtering) algorithm, in the batch layer. It can find the types of advertisements favored by users with similar characteristics, as well as advertisements similar to those a user has already shown interest in, and k-means can also be used to cluster the advertisement categories customers are interested in.
These are the results of the batch layer. In the speed layer, based on the types of web pages the user is browsing in real time, some top-K advertisements are looked up among the previously classified advertisements. Finally, the service layer combines the top-K advertisements from the speed layer with the similar, high click-through-rate advertisements classified by the batch layer, and chooses which ones to deliver to the user.
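A minimal sketch of that batch-layer ALS step with Spark MLlib follows. The input path, the user/ad/score column names, and the top-K value of 10 are assumptions for illustration.

# A minimal batch-layer sketch: train ALS (collaborative filtering) on historical
# interactions and pre-compute top-K ad recommendations per user.
# The input path and the user_id/ad_id/score column names are illustrative assumptions;
# user_id and ad_id are expected to be integer ids.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("batch-als").getOrCreate()

ratings = spark.read.parquet("hdfs:///warehouse/user_ad_interactions")  # user_id, ad_id, score

als = ALS(userCol="user_id", itemCol="ad_id", ratingCol="score",
          coldStartStrategy="drop")
model = als.fit(ratings)

top_k = model.recommendForAllUsers(10)   # batch view: 10 candidate ads per user
top_k.write.mode("overwrite").parquet("hdfs:///views/batch_ad_recommendations")

spark.stop()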
The deficiency of Lambda architecture
Although the Lambda architecture is flexible and applies to many scenarios, it also has a notable shortcoming in practice: it is very complex to maintain.
When using the Lambda architecture, architects need to maintain two complex distributed systems and ensure that they logically produce the same results to the service layer.
As we all know, programming in a distributed framework is actually very complex, especially when we optimize specifically for different frameworks. So almost every architect agrees that the Lambda architecture has a certain degree of complexity to maintain in practice.
So how to solve this problem? Let's think about it first, what is the root cause of this architecture's complexity to maintain?
The complexity of maintaining the Lambda architecture lies in maintaining two sets of system architectures at the same time: the batch layer and the speed layer. As we have said, the batch layer is added to the architecture because of the high accuracy of the results from the batch layer, while the speed layer is added because of its low latency when processing large-scale data.
Can we improve the architecture of one layer so that it has the characteristics of another layer of architecture?
For example, improve the system of the batch layer to make it have lower latency, or improve the system of the speed layer to make the data view it produces more accurate and closer to historical data?
Another architecture commonly used in large-scale data processing, the Kappa architecture, was born from this line of thinking.
Kappa architecture
The Kappa architecture is an architectural idea put forward by Jay Kreps, a former chief engineer at LinkedIn. Kreps is one of the authors of several well-known open source projects, including the streaming systems Apache Kafka and Apache Samza, and is now CEO of the big data company Confluent.
Kreps proposes an idea to improve the Lambda architecture:
Can we improve the speed layer of the Lambda architecture so that it can also guarantee the completeness and accuracy of the data? Can we improve the speed layer so that it not only processes real-time data, but can also reprocess previously processed historical data when the business logic is updated?
Based on his years of architectural experience, he found that we can make such improvements.
Stream processing platforms like Apache Kafka can retain data logs for very long periods, even permanently, and with this feature we can reprocess historical data inside the speed-layer architecture alone.
Let's take Apache Kafka as an example to describe the whole process of the new architecture.
The first step is to deploy Apache Kafka and set the retention period (Retention Period) for the data log. The retention period here refers to the time interval of historical data that you want to be able to reprocess.
For example, if you want to reprocess historical data for up to one year, you can set the retention period in Apache Kafka to 365 days. If you want to be able to handle all the historical data, you can set the retention period in Apache Kafka to "Forever".
Second, if we need to improve the existing logic algorithm, it means that we need to reprocess the historical data.
All we need to do is start a new job instance that reads the Kafka log from the beginning: it recomputes over the retained historical data and outputs the results to a new data view. We know that, underneath, Apache Kafka uses log offsets (Log Offset) to track which messages have been processed, so by setting the log offset to 0 the new job instance processes the historical data from scratch.
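A hedged sketch of this step with the kafka-python client follows; it assumes the topic "events" already has a long enough retention period, and the consumer group name and broker address are hypothetical.

# A minimal sketch of Kappa-style reprocessing with kafka-python.
# Assumes the topic "events" retains enough history (e.g. retention.ms set to
# 365 days, or -1 for "forever"); topic, group and broker address are assumptions.
from kafka import KafkaConsumer

# A new job instance re-reads the whole log from offset 0 and builds a new data view.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="recommendation-job-v2",   # a fresh group, so no committed offsets exist yet
    auto_offset_reset="earliest",       # therefore consumption starts from the beginning
    enable_auto_commit=False,
)

new_view = {}
for message in consumer:
    # Recompute the business logic here and write the results into the new data view,
    # e.g. new_view[key] = updated aggregate; once it catches up, switch readers over.
    pass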
Third, when the new data view has caught up with the old data view, our application can switch over to reading from the new data view.
The fourth step is to stop the old job instance and delete the old data view.
Unlike the Lambda architecture, the Kappa architecture removes the batch layer and retains only the speed layer. You only need to reprocess the data when the business logic changes or the code changes.
After talking about the Kappa architecture, I would like to emphasize that the Kappa architecture has its own shortcomings.
Because the Kappa architecture keeps only the speed layer and has no batch layer, data update errors may occur when processing large-scale data in the speed layer, and we then have to spend more time handling these errors and exceptions.
Also, in the Kappa architecture both batch and stream processing sit on the speed layer, which forces the architecture to use the same code base for the processing logic. The Kappa architecture is therefore not suitable for scenarios where the batch logic and the streaming logic genuinely differ.
That is the full content of "Example Analysis of the Lambda Architecture and the Kappa Architecture in Big Data Processing". Thank you for reading, and I hope the content shared here is helpful.