What is the principle of Kappa architecture?

2025-04-27 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains the principle of the Kappa architecture, compares it with the Lambda architecture, and walks through a case study of how to choose and apply it.

Lambda Architecture Review

The core idea of the Lambda architecture is to split a big data system into three layers: the Batch Layer, the Speed Layer and the Serving Layer. The Batch Layer is responsible for storing the data set and precomputing query results (Batch Views) over the full data set. The Speed Layer is mainly responsible for processing incremental data and generating Realtime Views. The Serving Layer responds to user queries: it merges the results of the Batch Views and the Realtime Views and returns the final result to the user. Figure 1 shows the overall architecture of Lambda:
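Independently of the figure, the way the three layers cooperate can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the "view" is a count per key, and the function names batch_view, realtime_view and serve are invented for this sketch rather than taken from any framework.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch Layer: precompute a view (here, a count per key) over all stored data."""
    return Counter(event["key"] for event in master_dataset)

def realtime_view(recent_events):
    """Speed Layer: compute the same view over data the batch run has not covered yet."""
    return Counter(event["key"] for event in recent_events)

def serve(key, batch, realtime):
    """Serving Layer: merge the Batch View and the Realtime View to answer a query."""
    return batch.get(key, 0) + realtime.get(key, 0)

master = [{"key": "a"}, {"key": "b"}, {"key": "a"}]   # historical data
recent = [{"key": "a"}, {"key": "c"}]                 # incremental data
bv = batch_view(master)
rv = realtime_view(recent)
print(serve("a", bv, rv))  # 3
```

Note that batch_view and realtime_view contain the same logic written twice, which is a small-scale version of the two code paths that Lambda requires.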

Kappa architecture

As mentioned above, in order to combine batch and real-time processing, the Lambda architecture uses two layers, the Batch Layer and the Speed Layer, for batch processing and real-time computing respectively. As a result, two sets of code must be maintained, one running on the batch system and one on the real-time system. Faced with this problem, one may ask: why not use a stream computing system to process the full data set and remove the Batch Layer?

One possible answer is that stream computing gives the impression of processing transient, streaming data whose original records are discarded once the results are saved, so it is not suitable for processing historical data. This answer is not entirely correct: it holds for Storm, the stream processor used in Lambda architecture implementations, but not for Spark, which appeared later.

Storm was open-sourced in 2011, and Spark became widely known only after 2012, so when Nathan Marz designed the Lambda architecture, no single framework handled both offline processing and real-time computing. With the development of Spark, this became possible: Spark itself can be used for batch processing, and Spark Streaming, built on Spark, can be used for real-time computing, so it is feasible to use one system for both.

The core idea of Kappa architecture includes the following three points:

Use Kafka or a similar distributed queuing system to store the data, retaining as many days of data as you need to be able to reprocess.

When a full recomputation is required, start a new stream computing instance that reads the data from the beginning, processes it, and writes the output to a new result store.

When the new instance has caught up, stop the old stream computing instance and delete the old results.
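The three points above can be sketched as follows. This is a minimal in-memory illustration, not Kafka code: the Log class stands in for a retained topic, and run_job, stores and serving are names invented for the sketch.

```python
class Log:
    """Stand-in for a retained Kafka topic: append-only, re-readable from offset 0."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read_from(self, offset=0):
        return iter(self._records[offset:])

def run_job(log, transform):
    """Start a stream job instance: read the log from the beginning
    and write aggregated output to a fresh result store."""
    result_store = {}
    for key, value in log.read_from(0):
        result_store[key] = result_store.get(key, 0) + transform(value)
    return result_store

log = Log()
for record in [("a", 1), ("b", 2), ("a", 3)]:
    log.append(record)

stores = {"v1": run_job(log, lambda v: v)}      # the job currently in production
serving = "v1"

stores["v2"] = run_job(log, lambda v: v * 10)   # full recomputation with new logic
serving = "v2"                                  # switch readers to the new results
del stores["v1"]                                # then delete the old results

print(stores[serving])  # {'a': 40, 'b': 20}
```

Until the switch, the old result store keeps serving queries, so the recomputation does not interrupt readers.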

The architecture diagram of Kappa is shown in Figure 2:

Compared with the Lambda architecture, the Kappa architecture reprocesses historical data only when necessary, and real-time computing and batch processing use the same code. Some people may object that stream processing cannot deliver the throughput needed for historical data, but this can be improved by increasing the parallelism of the new instance.
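The effect of parallelism on a replay can be illustrated with a small sketch. It assumes the retained log is split into partitions that can be processed independently and merged afterwards; replay_partition and replay are hypothetical names, and Python threads are used only to show the structure, not to claim real throughput gains.

```python
from concurrent.futures import ThreadPoolExecutor

def replay_partition(records):
    """Process one partition of the retained log; here, a sum per key."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals

def replay(partitions, parallelism):
    """Replay all partitions with up to `parallelism` workers and merge the partial results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for partial in pool.map(replay_partition, partitions):
            for key, value in partial.items():
                merged[key] = merged.get(key, 0) + value
    return merged

partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("c", 4), ("b", 5)],
]
result = replay(partitions, parallelism=3)
print(result)  # {'a': 4, 'b': 7, 'c': 4}
```

The result does not depend on the parallelism setting, which is what allows the degree of concurrency to be tuned purely for throughput.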

In the architecture diagram above, the new and old instances each use their own result store, which makes it easy to roll back at any time. Furthermore, if the output is something like an algorithm model, users can evaluate the new and old results side by side, for example by running an A/B test or by using a bandit algorithm to get the most out of both.
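As one possible reading of the bandit idea, the sketch below uses an epsilon-greedy strategy to decide whether a request is answered from the old or the new result store. The class, the arm names and the reward values are all assumptions made for illustration; the article does not specify which bandit algorithm is meant.

```python
import random

class EpsilonGreedy:
    """Epsilon-greedy bandit over named arms (e.g. the 'old' and 'new' result stores)."""
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in arms}
        self.reward = {arm: 0.0 for arm in arms}

    def mean(self, arm):
        """Average observed reward of an arm (0.0 if it has never been used)."""
        return self.reward[arm] / self.pulls[arm] if self.pulls[arm] else 0.0

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best arm so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.pulls, key=self.mean)

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.reward[arm] += reward

bandit = EpsilonGreedy(["old", "new"], epsilon=0.0)
bandit.update("old", 0.2)   # hypothetical quality score for answers from the old results
bandit.update("new", 0.8)   # hypothetical quality score for answers from the new results
print(bandit.choose())      # new
```

With epsilon set above zero, a share of traffic keeps going to the weaker store, so the comparison stays up to date as both stores change.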

Comparison of advantages and disadvantages

Contrast item | Lambda architecture | Kappa architecture
Data processing ability | Can handle very large-scale historical data | Historical data processing ability is limited
Machine cost | Batch processing and real-time computing both run all the time, so machine cost is high | Full computation runs only when necessary, so machine cost is relatively low
Storage overhead | Only one set of query results needs to be saved, so storage overhead is low | Results of both the new and old instances need to be stored, so storage overhead is relatively high
Development and testing difficulty | Two sets of code must be developed and tested, which is difficult | Only one framework is developed and tested, which is relatively easy
Operation and maintenance cost | Two systems must be maintained, so operation and maintenance cost is high | Only one framework needs to be maintained, so operation and maintenance cost is low

Table 1: Comparison of advantages and disadvantages of the Lambda and Kappa architectures

As the comparison table shows, the Kappa architecture has more advantages overall, and a growing number of vendors use it to build commercial projects.

First, the Lambda architecture not only requires maintaining two sets of code, one for the batch system and one for the real-time system, but also requires the batch (full) computation to run continuously, whereas the Kappa architecture performs a full computation only when needed.

Second, under the Kappa architecture several instances can be started to recompute the same data. When an algorithm model needs tuning, only one set of system parameters has to be changed, and the results from the new and old instances can be compared. Under the Lambda architecture, the algorithm model must be changed in both the stream computing system and the batch processing system, so parameter tuning is more complex.

Third, from the perspective of development, testing, and operation and maintenance, developers under the Kappa architecture only need to deal with one framework, so all three are less difficult. This is a very important advantage.

How to choose

From the comparison above, business requirements, development and testing difficulty, and operation and maintenance cost are the three main considerations when selecting a framework, while machine cost and storage overhead differ little between the two. The following therefore looks at how to choose between the two architectures from these three aspects.

Business requirements

Users need to choose the architecture according to their business needs. If the volume of historical data to be processed is large, for example several years of TB-scale data from a provincial intelligent transportation system, the Lambda architecture may be more appropriate. If the volume is small, for example the last 30 days of data from an e-commerce website, the Kappa architecture may be more appropriate.

Degree of difficulty in developing and testing

If the parameters of the algorithm model need to be tuned frequently, the Kappa architecture is more convenient. Another criterion is whether the algorithm you designed suits both batch processing and real-time computing: if the same code handles both well, you can choose the Kappa architecture. In some complex cases, however, real-time computing and batch processing produce different kinds of results. In some machine learning applications, for example, the prediction model is generated by batch processing and then handed to the real-time computing system for real-time analysis. In such cases the batch layer and the real-time layer cannot be merged, so the Lambda architecture should be chosen.

Operation and maintenance cost

The operation and maintenance cost of Kappa architecture is low, so it is more suitable for teams or enterprises with limited technical human resources.

StreamSQL and Lambda architecture

Transwarp StreamSQL is a stream computing engine built by Transwarp (Star Ring Technology) for enterprise users and is mainly used in real-time application scenarios, for example real-time early warning of market fluctuations in the financial industry or online analysis in banking. Its support for SQL and PL/SQL lets users implement complex business logic in SQL, which greatly lowers the threshold for developing streaming applications and makes it possible to develop both offline and real-time services from a single set of SQL programs.

Figure 3 shows a Kappa architecture system built with Kafka and StreamSQL, which addresses some shortcomings of the original Kappa architecture.

Every 100 ms, StreamSQL receives a batch of time series data from the Kafka message queue, for example data for times t0 to tn, where the data of t0 is (0, 1, 2, 3, 4) and the data of t1 is (5, 6, 7, 8, 9). The current batch is mapped to a two-dimensional relational table, transformed through SQL, and converted to in-memory columnar storage. The transformed data is written to Holodesk in real time and persisted to SSD, so that it can be retained permanently or for a window such as the last month. Applications can run statistical analysis on the columnar data in Holodesk through Inceptor SQL or the R language.
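The batching and row-to-column conversion described here can be sketched in Python. This is not StreamSQL or Holodesk code: the record layout (timestamp in milliseconds, value), the 20 ms spacing of the sample data, and the function names micro_batches and to_columns are assumptions chosen so that the two batches match the t0 and t1 example.

```python
from collections import defaultdict

BATCH_MS = 100  # batch interval mentioned in the text

def micro_batches(records, batch_ms=BATCH_MS):
    """Group (timestamp_ms, value) records into consecutive batch windows."""
    batches = defaultdict(list)
    for ts_ms, value in records:
        batches[ts_ms // batch_ms].append((ts_ms, value))
    return [batches[i] for i in sorted(batches)]

def to_columns(rows, names=("ts_ms", "value")):
    """Turn a batch of rows (a two-dimensional table) into column-oriented storage."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

records = [(0, 0), (20, 1), (40, 2), (60, 3), (80, 4),
           (100, 5), (120, 6), (140, 7), (160, 8), (180, 9)]

for batch in micro_batches(records):
    print(to_columns(batch))
# {'ts_ms': [0, 20, 40, 60, 80], 'value': [0, 1, 2, 3, 4]}
# {'ts_ms': [100, 120, 140, 160, 180], 'value': [5, 6, 7, 8, 9]}
```

Storing each batch by column rather than by row is what makes later statistical queries over a single field cheap to scan.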

StreamSQL's improvements to the Kappa architecture include the following:

As mentioned above, the original Kappa architecture stores historical data in Kafka or a similar distributed message queue. The drawback is that data can only be kept for a few days or months, and only in the form of a stream, so the capacity for processing historical data is limited. StreamSQL can write its output not only to Kafka but also to tables of various formats in Inceptor (TEXT table, ORC table, Holodesk table, HBase table) for longer-term storage, so it can serve business needs with larger data volumes.

StreamSQL supports joining stream data with Inceptor table data in both real-time computing and historical data analysis, which greatly enhances its ability to process historical data.

StreamSQL is compatible with the SQL standard and PL/SQL, so users can implement business logic in SQL, which greatly lowers the threshold for developing streaming applications.

StreamSQL also adds Application management. At runtime, Applications are isolated from one another and require permission verification, which improves the security and availability of the system.
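Of the improvements listed above, the join between stream data and table data is the easiest to picture in code. The following Python sketch shows the idea only: a dictionary stands in for a static table, and the table contents, field names and the function join_stream_with_table are hypothetical, not part of StreamSQL or Inceptor.

```python
def join_stream_with_table(stream, table, key):
    """Enrich each stream record with the matching row of a static table (inner join)."""
    for record in stream:
        row = table.get(record[key])
        if row is not None:
            yield {**record, **row}

# Hypothetical static table, standing in for a table of registered vehicles.
registered = {
    "A12345": {"owner": "Alice", "model": "sedan"},
    "B67890": {"owner": "Bob", "model": "truck"},
}

# Hypothetical stream of checkpoint sightings.
sightings = [
    {"license": "A12345", "location": "gate-1"},
    {"license": "Z00000", "location": "gate-2"},   # no match in the table, dropped
    {"license": "B67890", "location": "gate-3"},
]

enriched = list(join_stream_with_table(sightings, registered, "license"))
for row in enriched:
    print(row)
```

Because the table is looked up per record, the same function works whether the stream is live traffic or a replay of historical data.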

Kappa architecture case study

Next, we use StreamSQL as the stream processing engine to build an intelligent transportation system based on the Kappa architecture, and analyze in detail the data flow of one business scenario: real-time early warning of vehicles with cloned license plates.

After the front-end traffic checkpoints feed the monitored vehicle information into the Kafka distributed message queue, the bus sorts the data and distributes it to different service clusters, such as the real-time storage service cluster and the cluster that monitors vehicles without a valid annual inspection.

Suppose that part of the data is sent to the illegal vehicle monitoring service cluster, and that one of the services of this cluster is to detect cloned license plates, that is, the same plate number used on more than one vehicle. The previous section mentioned that the Kappa architecture makes it easy to tune the algorithm model, so let's look at how this is done.

First, suppose we create a UDF DetectCloneVehicle(param1, param2) that checks whether a license plate is cloned. The UDF takes two input parameters: when two cars with the same license plate are seen more than param1 km apart in a straight line within less than param2 minutes, the plate is regarded as cloned. The function returns one of two values: 1 if the plate is cloned, otherwise 0.
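The rule behind such a UDF can be sketched in Python. This is not the StreamSQL implementation, which the article does not show: the sketch assumes each sighting carries a (latitude, longitude) pair and a timestamp, passes the two sightings explicitly, and uses the haversine formula for straight-line distance. All names and sample values are illustrative.

```python
import math
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def detect_clone_vehicle(sighting_a, sighting_b, max_km, max_minutes):
    """Return 1 if two sightings of the same plate are more than max_km apart
    within less than max_minutes, otherwise 0."""
    if sighting_a["license"] != sighting_b["license"]:
        return 0
    distance = haversine_km(*sighting_a["location"], *sighting_b["location"])
    minutes = abs((sighting_b["time"] - sighting_a["time"]).total_seconds()) / 60
    return 1 if distance > max_km and minutes < max_minutes else 0

# Two hypothetical sightings of the same plate, about 33 km and one minute apart.
a = {"license": "A12345", "location": (31.00, 121.00), "time": datetime(2024, 1, 1, 8, 0, 0)}
b = {"license": "A12345", "location": (31.30, 121.00), "time": datetime(2024, 1, 1, 8, 1, 0)}
print(detect_clone_vehicle(a, b, max_km=20, max_minutes=2))  # 1
```

Keeping both thresholds as parameters is what allows a new job instance to be started with different values without changing the detection logic.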

Suppose our initial cloned-plate detection strategy is: if two cars with the same license plate are more than 20 kilometers apart in a straight line and are seen less than 2 minutes apart, the plate is judged to be cloned. The StreamSQL statements that start a Stream Job instance and analyze the stream under this policy are as follows:

CREATE STREAM vehicle_stream1 (license STRING, location STRING, time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("topic" = "fakeLicense", "kafka.zookeeper" = "172.16.1.128:2181", "timefield" = "time", "timeformat" = "yyyy-MM-dd HH-mm-ss.SSS");
CREATE TABLE clone_vehicle_result_app1 (license STRING, location STRING, time TIMESTAMP);
INSERT INTO clone_vehicle_result_app1
SELECT DetectCloneVehicle(20, 2) AS cloned
FROM vehicle_stream1
HAVING cloned > 0;

In practice, however, after taking real conditions into account (such as whether the straight-line distance is reasonable, and whether the area has more high-speed or low-speed road sections), we find that detection with these parameters is very inefficient. If the criterion for a cloned plate is changed to a straight-line distance of more than 10 kilometers and a time difference of less than 5 minutes, efficiency improves greatly. Now start a new Stream Job instance and execute the following StreamSQL statements:

CREATE STREAM vehicle_stream2 (license STRING, location STRING, time TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("topic" = "fakeLicense", "kafka.zookeeper" = "172.16.1.128:2181", "timefield" = "time", "timeformat" = "yyyy-MM-dd HH-mm-ss.SSS");
CREATE TABLE clone_vehicle_result_app2 (license STRING, location STRING, time TIMESTAMP);
INSERT INTO clone_vehicle_result_app2
SELECT DetectCloneVehicle(10, 5) AS cloned
FROM vehicle_stream2
HAVING cloned > 0;

This Stream Job is more efficient than the one started with the previous parameters, so we have completed one round of tuning the parameters of the UDF model. In real analysis work, improving business execution efficiency cannot rely only on the optimizations the system provides. Users also need to choose the most effective model parameters according to the architecture adopted, the problem being handled and the modeling method applied, together with the actual external constraints.
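One way to see the difference between the two parameter sets is to convert each into the lowest straight-line speed that would trigger the rule. The short calculation below is an interpretation added for illustration; the labels app1 and app2 follow the result table names used above, and the function name is hypothetical.

```python
def implied_min_speed_kmh(distance_km, minutes):
    """Lowest straight-line speed that triggers the rule
    'more than distance_km apart in less than `minutes` minutes'."""
    return distance_km * 60 / minutes

for label, (km, mins) in {"app1": (20, 2), "app2": (10, 5)}.items():
    print(label, implied_min_speed_kmh(km, mins))
# app1 600.0
# app2 120.0
```

Under this reading, the first rule only flags pairs of sightings that would require more than 600 km/h, while the second flags anything above 120 km/h, which may be one reason the second instance catches far more cases.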
