What does Kafka Stream mean?

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains in detail what Kafka Streams is. The editor finds it very practical and shares it here for your reference; I hope you take something away from it.

I. Brief Introduction

First of all, compared with mainstream engines such as Storm, Spark Streaming, and Flink, Kafka Streams has the advantage of being lightweight: you do not need to provision dedicated container resources, which makes it well suited to lightweight ETL scenarios. For example, most lightweight ETL operations, such as Filter, LookUp, and WriteStorage, can be performed with Kafka Streams. An ideal architecture pairs a lightweight computing framework such as Kafka Streams with Lambda functions to achieve a reliable, on-demand stream computing model.
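As a sketch of such a lightweight ETL job, the Java example below reads a topic, filters out empty records, and writes the survivors back to Kafka. The topic names `input-events` and `filtered-events` are hypothetical, and the `kafka-streams` client library plus a reachable broker are assumed:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FilterEtlApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filter-etl-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Filter step: keep only non-empty values, then write them to the output topic.
        KStream<String, String> source = builder.stream("input-events");
        source.filter((key, value) -> value != null && !value.isEmpty())
              .to("filtered-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Close the topology cleanly on JVM shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Note that this is an ordinary Java application; no dedicated processing cluster or container resources are involved.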

1. Basic Introduction to Kafka Streams

Kafka Streams is built on top of Kafka and provides a set of important stream-processing capabilities, such as correctly distinguishing between event time and processing time, handling late-arriving data, and efficient application state management.

Powerful functionality

- Highly scalable, elastic, and fault-tolerant
- Stateful and stateless processing
- Event-time-based Windows, Joins, and Aggregations

Lightweight

- No dedicated cluster required
- No external dependencies
- A library, not a framework

Fully integrated

- 100% compatible with Kafka versions
- Easy to integrate into existing applications
- No manual deployment handling (Kafka's multi-partition mechanism maps automatically onto multiple Kafka Streams instances)

Real-time performance

- Millisecond latency
- Not micro-batch processing
- Windows allow out-of-order data
- Late-arriving data is allowed

2. Characteristics

Simpler stream processing: Kafka Streams is designed as a lightweight library, just like Kafka's Producer and Consumer, so you can easily embed it in your own applications. The only additional requirement is that the stream processing code be packaged and deployed together with the application itself.

It has no external dependencies other than Apache Kafka and can be used in any Java application; there is no need to deploy an additional cluster for stream processing.

Kafka itself serves as the internal message transport and storage medium, so no other external component is needed for messaging. Kafka Streams uses Kafka's partition-level scalability to process data in an ordered and efficient manner, which yields high performance and high scalability while keeping operations simple. There is no need to understand and tune two different messaging layers (separate layers for moving data and for stream processing across different scalable media). Likewise, any improvement in Kafka's performance and reliability directly benefits Kafka Streams.

Kafka Streams integrates seamlessly into existing development, packaging, deployment, and operational practices. You are free to use your favorite tools, such as a Java application server, Puppet, Ansible, Mesos, YARN, or Docker, or even run your application manually on a single machine for verification.

Fault-tolerant local state is supported, which enables very efficient and fast stateful operations such as joins and windowed aggregations. Local state is backed up to Kafka, so if a machine fails, another machine can automatically restore that state and continue processing.

Processing one record at a time achieves low latency, which is also what distinguishes Kafka Streams from micro-batch-based streaming frameworks. In addition, the Kafka Streams API is very similar to Spark's, with many operators sharing the same semantics. The current version still has some issues with Scala support, but for developers proficient in Spark, writing a Kafka Streams job does not require much extra learning.

3. Stream Processing Concepts

(1) Stream

The stream is the most important concept in Kafka Streams: it represents an unbounded, continuously updating dataset. A stream is an ordered, replayable, and immutable sequence of fault-tolerant key-value pairs.

(2) Stream Processing Application (streaming application)

A stream processing application can be any program that uses the Kafka Streams library; in practice, it is the Java code we write.

(3) Processor Topology

The processor topology defines the computational logic with which the stream processing application processes data. In general, we write StreamsBuilder builder = new StreamsBuilder(); and the StreamsBuilder creates a processor topology for us internally. If you need to customize the topology, you can build it through the low-level Processor API or through the Kafka Streams DSL.
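As a minimal sketch (assuming the `kafka-streams` dependency; `text-input` and `text-output` are hypothetical topic names), you can let the StreamsBuilder assemble the topology and then inspect the result:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class TopologyExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");
        lines.mapValues(v -> v.toUpperCase()).to("text-output");

        // build() materializes the processor topology the builder assembled internally.
        Topology topology = builder.build();
        // describe() prints the source -> processor -> sink node graph,
        // without needing a running broker.
        System.out.println(topology.describe());
    }
}
```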

(4) Stream Processor

A stream processor is a node in the topology; it represents a processing step and performs a data transformation. A stream processor receives one input record at a time from upstream and produces one or more output records for the next stream processor. Kafka offers two ways to define a stream processor:

The DSL API, i.e., operators such as map and filter.

The Low-Level API (Processor API), which allows developers to define custom processors and connect them to state stores.

(5) Time

Some operators, such as window functions, are defined based on time boundaries.

Event time: the time at which the event or record was produced, i.e., when it was originally created at the source.

Processing time: the point in time at which the stream processing application processes the record, i.e., when the record passes through the stream processing system.

Ingestion time: the time at which the data record is saved by the Kafka broker into the corresponding partition of the Kafka topic. Like event time, it is a timestamp embedded in the data record, but ingestion time is assigned by the Kafka broker when the record is appended to the target topic.

The choice between event time and ingestion time is made through Kafka configuration (not Kafka Streams configuration). Starting with Kafka 0.10.x, a timestamp is automatically embedded in every Kafka message, and either event time or ingestion time can be selected via configuration, specified at the broker or topic level. The default timestamp extractor provided by Kafka Streams retrieves these embedded timestamps, so the effective time semantics of an application depend on how the embedded timestamps are configured. See the Developer Guide for further information.
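For example (property names per the Kafka broker/topic configuration reference; `my-topic` is a hypothetical topic name), the timestamp type can be set broker-wide in server.properties or overridden per topic:

```properties
# Broker-wide default: embed the producer-supplied (event) time in each record.
log.message.timestamp.type=CreateTime

# Per-topic override to ingestion time, i.e. the broker's append time:
#   kafka-configs.sh --bootstrap-server localhost:9092 --alter \
#       --entity-type topics --entity-name my-topic \
#       --add-config message.timestamp.type=LogAppendTime
```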

(6) Stateful Stream Processing

If each message is processed independently of the others, no state is required, for example in message transformation or filtering, and the processing topology is very simple. If state can be maintained, stream processing applies to many more scenarios, such as Join, GroupBy, or Aggregate, and the Kafka Streams DSL provides many such stateful operators.

(7) KTable and KStream

First of all, streams and tables are dual: a stream can be viewed as a table, and a table can be viewed as a stream. Kafka's log compaction feature takes advantage of this duality. To see the effect of log compaction, consider KStream and KTable in turn. If a KTable is stored in a Kafka topic, you should enable log compaction to save space. This approach is not safe for a KStream, however: once log compaction is turned on, Kafka removes older records for each key, which destroys the semantics of the data. Taking data replay as an example, you would suddenly get a count of 3 for alice instead of 4, because earlier records were deleted by log compaction. Therefore, log compaction is safe to use with a KTable but wrong to use with a KStream.

The simple form of a table is a collection of KV pairs.

Stream as table: a stream can be thought of as a table's changelog and can be turned into a real table by replaying the log.

Table as stream: a table can be thought of as a snapshot of a stream at a point in time, with each row holding the latest value for its key. A real stream can easily be formed by iterating over each key-value pair in the table.

(8) KStream (event stream)

Only the Kafka Streams DSL has the concept of a KStream. A KStream is an event stream: an abstraction of an unbounded dataset in which each record represents an independent piece of data. Explained in table terms, each data record is always interpreted as an Insert and is append-only, because no record can replace an existing row with the same key.
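The difference shows up directly in the DSL, as in this sketch (assuming the `kafka-streams` dependency; topic names are hypothetical): reading a topic as a KStream treats every record as an independent event, while reading it as a KTable keeps only the latest value per key.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class StreamVsTable {
    static void define(StreamsBuilder builder) {
        // Event stream: every record is an Insert; nothing is ever replaced.
        KStream<String, String> clicks = builder.stream("user-clicks");

        // Changelog stream: each record updates the previous value for its key,
        // so the table always reflects the latest value per key.
        KTable<String, String> profiles = builder.table("user-profiles");
    }
}
```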

(9) KTable (changelog stream)

Only the Kafka Streams DSL has the concept of a KTable. A KTable is a changelog stream: each data record represents an update, i.e., the result of applying a change to the previous value for that key. A KTable supports looking up values by key, which makes it usable in operations such as Join.

(10) Join

A join merges two streams on their keys to produce a new stream. Stream-based joins are usually window-based; otherwise all data would have to be retained and the state would grow indefinitely. The Kafka Streams DSL supports different kinds of joins, such as joins between two KStreams and joins between a KStream and a KTable.
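A windowed KStream-KStream join can be sketched as follows (assuming the `kafka-streams` dependency; topic names are hypothetical). On recent client versions, `JoinWindows.ofTimeDifferenceWithNoGrace(...)` replaces the older `JoinWindows.of(...)` used here:

```java
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class JoinExample {
    static void define(StreamsBuilder builder) {
        KStream<String, String> orders = builder.stream("orders");
        KStream<String, String> payments = builder.stream("payments");

        // Window-based join: records pair up only when their timestamps
        // fall within five minutes of each other, which bounds the state.
        KStream<String, String> enriched = orders.join(
                payments,
                (order, payment) -> order + "|" + payment,
                JoinWindows.of(Duration.ofMinutes(5)));

        enriched.to("orders-with-payments");
    }
}
```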

(11) Aggregations

Aggregation operations such as sum and count take an input stream and combine multiple input records into single output records, producing a new stream. Aggregations on a stream are usually window-based; otherwise, as with joins, the maintained state would grow indefinitely. The input of an aggregation can be a KStream or a KTable, but the output is always a KTable, so the output of Kafka Streams is continuously updated. When data arrives out of order, the result can still be corrected, because the output is a KTable and later values overwrite earlier ones.
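A count aggregation, for example, can be sketched like this (assuming the `kafka-streams` dependency; topic names are hypothetical):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class CountExample {
    static void define(StreamsBuilder builder) {
        // Count records per key; the result is a KTable, so late or out-of-order
        // records simply overwrite the previous count for their key.
        KTable<String, Long> counts = builder
                .stream("page-views", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .count();

        counts.toStream().to("page-view-counts",
                Produced.with(Serdes.String(), Serdes.Long()));
    }
}
```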

4. Kafka Streams Architecture

(The original article shows an architecture diagram here, which is not reproduced.)

(1) Topology

A processor topology, or simply a topology, defines the computational logic of a stream processing application, i.e., how input data is transformed into output data. The topology is a logical abstraction containing the user's stream processing code; at run time, the logical topology is instantiated and replicated in parallel within the application.

(2) Concurrency Model: Stream Partitions and Tasks

Each stream partition corresponds to a partition of a Kafka topic and is a totally ordered sequence of data records; the key of a record in a stream maps directly from the Kafka topic data. The message key is central to both Kafka and Kafka Streams, because it determines how data is routed to a particular partition. During execution, the number of partitions of the input stream determines the number of tasks. Each task is responsible for processing its assigned partitions, and Kafka Streams allocates a buffer for each assigned partition, providing a one-record-at-a-time processing mechanism on top of these buffers. Note that Kafka Streams is not a resource manager but a library that can run inside any stream processing application; multiple instances of the application can run on the same machine or be distributed across different nodes by a resource manager. The partitions assigned to a task never change; if an instance fails, its tasks are reassigned and restarted on another instance and continue consuming from the same partitions.

(3) Concurrency Model: Thread Model

Developers can configure the number of threads used for parallel processing in each application instance; each thread independently executes one or more tasks with its own topology. For example, a single thread may execute two tasks that correspond to two partitions of Topic1, or process two partitions of Topic2 at the same time; however, different partitions of the same topic must be processed by different tasks.
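The thread count is set through ordinary Kafka Streams configuration, as in this sketch (the application id and bootstrap server are placeholder values):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ThreadConfig {
    static Properties props() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Run four stream threads in this instance; tasks (one per input
        // partition) are distributed across the threads of all instances.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
        return props;
    }
}
```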

(4) State

Kafka Streams provides state stores that allow data to be saved and queried by stream processing applications. Each task has one or more built-in state stores that can be written to and queried through the API. These state stores can be RocksDB databases, in-memory hash maps, or other convenient data structures, and Kafka Streams provides fault tolerance and automatic recovery for this local state.
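Querying such a store from outside the topology can be sketched as follows (assuming the `kafka-streams` dependency, Kafka 2.5+ for `StoreQueryParameters`, and a topology that materialized a count as a store named "counts-store"; that name is a placeholder):

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StateQuery {
    // Looks up the current count for a key in the local "counts-store".
    static Long lookup(KafkaStreams streams, String key) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "counts-store", QueryableStoreTypes.keyValueStore()));
        return store.get(key); // null if the key has no entry yet
    }
}
```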

(5) Fault tolerance

Because Kafka partitions are themselves highly available and replicated, data saved to Kafka remains highly available even if stream processing fails. If a task fails, Kafka Streams restarts it on another instance, taking advantage of the Kafka consumer's failure-handling capabilities. The reliability of the local state stores relies on change logs: a replicated changelog Kafka topic is kept for each state store, and the changelog is partitioned so that each task has its own dedicated partition. If a task fails, Kafka Streams restarts it on another instance and uses the changelog topic to restore the task's latest state. If log compaction is enabled on the changelog topic, old data is safely cleaned up, preventing the changelog from growing indefinitely.

(6) Processing Reliability

Kafka implements at-least-once message processing: even if a failure occurs, no data is lost or left unprocessed, but some data may be processed more than once. For non-idempotent operations such as counting, at-least-once semantics can produce errors; Kafka Streams will support exactly-once processing semantics in future releases.

(7) Timestamp-Based Flow Control

Kafka Streams controls flow by synchronizing the timestamps across the records of all input streams. By default, Kafka Streams provides event-time processing semantics.

This concludes the article on what Kafka Streams means. I hope the content above was helpful and that you learned something from it. If you found the article useful, please share it so more people can see it.


© 2024 shulou.com SLNews company. All rights reserved.
