
What is the practice of Kafka application in the field of public safety?

2025-01-15 Update From: SLTechnology News&Howtos


In this issue, the editor brings you the practice of Kafka applications in the field of public safety. The article is rich in content and analyzes the topic from a professional point of view; I hope you gain something from reading it.

I. Preface

As the opening piece on the application of big data frameworks in the field of public security, this case starts from optimizing the most basic data architecture. The following chapters describe in detail the basic principles of Kafka, Kafka's enhanced components, the specific application scenarios of a Kafka-based Lambda architecture, and the corresponding research and development results.

The Lambda architecture was proposed by Nathan Marz, the author of Storm. Its aim is to design an architecture that satisfies the key requirements of a real-time big data system: high fault tolerance, low latency, and scalability.

The Lambda architecture integrates offline computing and real-time computing, follows principles such as immutability and read-write separation and isolation, and can incorporate various big data components such as Hadoop, Kafka, Storm, Spark, and HBase. The key question of a big data system is: how do you run real-time queries over an arbitrarily large data set? Once real-time computation is added to big data, the problem becomes considerably harder. The Lambda architecture solves it by decomposing the system into three layers: the Batch Layer, the Speed Layer, and the Serving Layer, as shown in the figure below.

Lambda architecture diagram

After entering the system, the data stream is sent to both the Batch Layer and the Speed Layer for processing. The Batch Layer stores the complete data set offline in an immutable model and continually recomputes the Batch Views that serve queries over the full data set. The Speed Layer processes the incremental real-time data stream and continually updates the corresponding Real-time Views. When a user issues a query, the Serving Layer merges the results from the Batch Views and the Real-time Views into the final result set.
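To make the Serving Layer's merge step concrete, here is a minimal Python sketch (not taken from the project): a batch view computed offline is combined with a real-time view maintained by the Speed Layer; the metric names and the query_count function are hypothetical.

```python
# Minimal sketch of the Lambda Serving Layer merge step (hypothetical data structures).
# batch_view: counts computed offline by the Batch Layer over the full, immutable data set.
# realtime_view: counts maintained by the Speed Layer for data that arrived after the last batch run.

batch_view = {"vehicle_pass": 1_200_000, "alarm_event": 4_300}   # Batch Views
realtime_view = {"vehicle_pass": 850, "alarm_event": 12}         # Real-time Views


def query_count(metric: str) -> int:
    """Serving Layer: merge the batch result with the real-time increment."""
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)


if __name__ == "__main__":
    print(query_count("vehicle_pass"))  # batch total + real-time delta
```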

II. Lambda architecture based on Kafka

2.1 A practical case of a big data platform in a province

Take the big data construction plan of a provincial department as an example: Kafka is used as the unified data flow channel (data pipeline) and is deployed at two levels, prefecture/city and provincial department. Data from prefectures and cities is first sent, via stream processing, to the prefecture/city Kafka; after standardization, it is then collected from the prefecture/city Kafka into the provincial Kafka.

Practice of big data platform in a certain province

2.2 The necessity of introducing Kafka

In a big data system, we often encounter a problem: the whole system is composed of various subsystems, and data needs to flow between them continuously with high performance and low latency. Traditional enterprise message systems are not well suited to large-scale data processing and easily lead to problems such as log data that is difficult to collect, messages that are easily lost, pipelines between Oracle instances that cannot be reused by other systems, data links that are easy to create but hard to scale, poor data quality, and so on. Kafka appeared in order to handle both online applications (messages) and offline applications (data files, logs). Kafka serves two purposes:

Reduce the complexity of system networking.

To reduce programming complexity, subsystems no longer negotiate interfaces with one another; each subsystem plugs into Kafka like a plug into a socket, and Kafka assumes the role of a high-speed data bus.

Traditional data architecture

With the introduction of Kafka, you can build a stream-centric data architecture in which Kafka serves as the global data pipeline. Each system sends data to or obtains data from this central pipeline. Applications or stream processors can tap into the pipeline and create new derived streams, which in turn can be consumed by various other systems.

Stream-centric data architecture

III. Technical analysis of Kafka

3.1 Features of Kafka

Kafka allows the right data to appear in the right place, in the right form. Kafka provides message queues: producers append data to the end of a queue, and multiple consumers read data from the queue in turn and then process it independently.

Kafka message queuing
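As a minimal sketch of this queue model, the following Python example uses the confluent-kafka client (one possible client library; the broker address, topic, and group id are placeholders) to append messages as a producer and read them back as a consumer.

```python
from confluent_kafka import Producer, Consumer

BROKERS = "localhost:9092"   # placeholder broker address
TOPIC = "demo-events"        # placeholder topic name

# Producer: append messages to the end of the queue (topic).
producer = Producer({"bootstrap.servers": BROKERS})
for i in range(3):
    producer.produce(TOPIC, key=str(i), value=f"event-{i}")
producer.flush()  # block until all messages are delivered

# Consumer: read messages from the queue in its own consumer group.
consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
for _ in range(3):
    msg = consumer.poll(5.0)
    if msg is None or msg.error():
        continue
    print(msg.key(), msg.value())  # each consumer group processes the stream independently
consumer.close()
```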

A distributed system that is easy to scale out. There can be multiple producers, brokers, and consumers, all distributed, and machines can be added without downtime.

Provides massive message processing in Pub/Sub mode. It is reported that Kafka can produce about 250,000 messages per second (50 MB) and process 550,000 messages per second (110 MB).

Store massive data streams in a highly fault-tolerant way.

Ensure the order of data flow and handle critical updates.

Provides long-term storage of messages by persisting them to disk, so it can be used both for bulk consumption, such as ETL, and for real-time applications. Data loss is prevented by persisting data to disk and by replication.

Ability to cache or persist data and support integration with batch systems such as Hadoop.

Provide low-latency data transmission and processing for real-time applications.

Supports both online and offline scenarios.

The state of message processing is maintained on the consumer side, not on the server side, and consumption can be rebalanced automatically on failure.

3.2 Analysis of Kafka principles

3.2.1 Overall Kafka architecture

Overall architecture of Kafka

The overall architecture of Kafka is very simple and explicitly distributed: there can be multiple producers, brokers (kafka), and consumers. Producers and consumers implement Kafka's registration interface; data is sent from producers to brokers, and brokers act as an intermediate cache and distribution point, delivering data to the consumers registered with the system. The role of the broker is similar to a cache, sitting between active data and the offline processing system. Communication between clients and servers is based on the TCP protocol, which is simple, high-performance, and independent of programming language.

Basic concepts:

Topic: the different categories of message sources (feeds of messages) handled by Kafka.

Partition: a physical grouping within a Topic. A topic can be divided into multiple partitions, each of which is an ordered queue. Each message in a partition is assigned an ordered id (offset).

Message: the basic unit of communication. Each producer can publish messages to a topic.

Producers: producers of messages and data. A process that publishes messages to a Kafka topic is called a producer.

Consumers: consumers of messages and data. A process that subscribes to topics and processes the messages published to them is called a consumer.

Broker: cache proxy. One or more servers in a Kafka cluster are collectively referred to as brokers.

3.2.2 Key technical points of Kafka

3.2.2.1 Zero-copy

In Kafka, there are two potential causes of inefficiency: too many network requests and too many byte copies. To improve efficiency, Kafka groups messages into batches, and each request sends a batch of messages to the corresponding consumer. In addition, to reduce byte copying, the sendfile system call is used.
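The effect of sendfile can be illustrated outside of Kafka: the sketch below is an OS-level illustration rather than Kafka code, using Python's os.sendfile to move file bytes to a socket inside the kernel and avoid the extra user-space copies a read/write loop would make; the function name and file path are hypothetical.

```python
import os
import socket


def send_log_segment(path: str, sock: socket.socket) -> None:
    """Copy a file to a connected socket with sendfile: the kernel moves bytes
    directly from the page cache to the socket buffer, skipping user-space copies."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent
```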

3.2.2.2 Exactly once message transfer

Kafka stores only the offset of the data each consumer has already processed. This has two advantages: the amount of state to store is small, and when a consumer fails, the restarted consumer only needs to resume processing from the most recently committed offset.
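A minimal sketch of this offset-based recovery, again using the confluent-kafka client with placeholder names: auto-commit is disabled and the offset is committed only after a record has been processed, so a restarted consumer resumes from the last committed offset.

```python
from confluent_kafka import Consumer


def handle(value: bytes) -> None:
    print(value)  # stand-in for real processing


consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "audit-processor",          # placeholder
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,            # commit offsets only after processing
})
consumer.subscribe(["demo-events"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        handle(msg.value())
        consumer.commit(message=msg, asynchronous=False)  # a restart resumes after this offset
finally:
    consumer.close()
```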

3.2.2.3 Push/pull

Producers push data to Kafka, and consumers pull data from Kafka.

3.2.2.4 Load balancing and fault tolerance

There is no load-balancing mechanism between producers and brokers. ZooKeeper is used for load balancing between brokers and consumers: all brokers and consumers register with ZooKeeper, which stores some of their metadata. If a broker or consumer changes, all other brokers and consumers are notified.

3.2.2.5 Partitioning

Kafka divides a topic into multiple partitions and distributes messages evenly across the partitions according to partition rules, thereby achieving load balancing and horizontal scaling. Multiple subscribers can consume data from one or more partitions at the same time, supporting massive data processing. Because messages are appended to partitions, sequential disk writes across multiple partitions are overall more efficient than random writes to memory, which is one of the important guarantees of Kafka's high throughput.

Kafka partition to achieve load balancing, horizontal expansion, high throughput
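To make the partition rule concrete, the following sketch (confluent-kafka client; topic, keys, and broker address are placeholders) publishes keyed messages and prints which partition each one landed in: records with the same key always go to the same partition, while different keys spread the load across partitions.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder


def report(err, msg):
    # Delivery callback: shows which partition each keyed record landed in.
    if err is None:
        print(f"key={msg.key()} -> partition {msg.partition()} offset {msg.offset()}")


# Messages with the same key (e.g. the same city code) go to the same partition;
# different keys are spread across partitions for load balancing.
for city in ["city-01", "city-02", "city-03", "city-01"]:
    producer.produce("vehicle-pass", key=city, value=b"record", on_delivery=report)

producer.flush()
```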

To ensure data reliability, each partition has a Leader node and several Follower nodes. When data is written to a partition, the Leader keeps a copy itself and replicates the data to each Follower. If a Follower fails, Kafka finds another Follower to fetch the data from the Leader. If the Leader fails, one of the Followers is promoted to Leader.

Kafka Partition to realize data Reliability
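As a small illustration of how this replication is configured, the sketch below (confluent-kafka AdminClient; topic name, partition count, and broker address are placeholders) creates a topic whose partitions each get one leader and two follower replicas.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder

# 6 partitions for parallelism; replication_factor=3 means each partition has
# one leader and two followers, so data survives the loss of a broker.
topic = NewTopic("vehicle-pass", num_partitions=6, replication_factor=3)

for name, future in admin.create_topics([topic]).items():
    try:
        future.result()  # raises if creation failed
        print(f"created {name}")
    except Exception as exc:
        print(f"failed to create {name}: {exc}")
```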

3.3 Technical selection of Kafka

3.3.1 Overview of Confluent Platform

Confluent Platform is a streaming data platform that can organize and manage data from different data sources and keep the system stable and efficient. Confluent Platform makes it easy to build real-time data pipelines and streaming applications by integrating data from multiple sources and locations into a central streaming platform. It simplifies the infrastructure needed to connect data sources to Kafka, build applications with Kafka, and secure, monitor, and manage Kafka.

Confluent Platform architecture

3.3.2 Kafka Connect

Kafka Connect makes it easier to create and manage data flow pipelines. It provides a simple model for building scalable, reliable data streams between Kafka and other systems. Through connectors, data can be imported into Kafka from other systems or exported from Kafka to other systems. Kafka Connect can ingest an entire database into Kafka topics, or inject server monitoring metrics into Kafka, after which the data can be processed like any other Kafka stream. Export jobs move data from Kafka topics into other storage systems, query systems, or offline analysis systems.
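As a hedged example of driving Kafka Connect through its REST interface, the sketch below posts a hypothetical JDBC source connector configuration to a Connect worker; the worker address, connection URL, and topic prefix are placeholders, and the exact properties depend on the connector plugin actually installed.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083/connectors"  # placeholder Connect worker address

# Hypothetical source connector: stream rows from a database table into Kafka topics.
config = {
    "name": "vehicle-db-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db-host:5432/traffic",  # placeholder
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "jdbc-",
        "tasks.max": "2",
    },
}

req = urllib.request.Request(
    CONNECT_URL,
    data=json.dumps(config).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```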

Kafka Connect features include:

A generic Kafka connector framework that provides a unified integration API

Supports both distributed mode and standalone mode

A REST interface for viewing and managing Kafka connectors

Automatic offset management, so developers do not have to handle this error-prone part themselves

Distributed and scalable

Stream / batch integration

How Kafka connect works

3.4 Kafka end-to-end audit

The open-source Chaperone framework is used to realize end-to-end auditing of Kafka. Its goal is to catch every message, count the volume of data within a given period of time, and accurately detect loss, delay, and duplication of data at each stage of the data pipeline.

Is there any data loss? If so, how much data has been lost, and where in the data pipeline was it lost?

What is the end-to-end latency? If messages are delayed, where does the delay start?

Is there any data duplication?

Chaperone architecture

Chaperone consists of four components: AuditLibrary, ChaperoneService, ChaperoneCollector, and WebService, which together collect data, perform calculations, automatically detect missing and delayed data, and display audit results. The audit process ensures that each message is audited only once and uses a consistent timestamp across tiers.

The audit process of the Chaperone modules is as follows:

Generate audit messages: ChaperoneService records state by periodically producing audit messages to a dedicated Kafka topic

Audit algorithm: AuditLibrary implements the audit algorithm, periodically aggregating and emitting statistics over time windows

Get audit results: ChaperoneCollector listens on the dedicated Kafka topic, fetches all audit messages, stores them in a database, and generates dashboards. The dashboards display data loss and message delay and show the status of each topic

Display the results accurately: WebService provides REST interfaces for querying the metrics collected by Chaperone; through these interfaces the amount of lost data can be calculated accurately.
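Chaperone itself is not reproduced here, but the core idea of the audit algorithm, counting messages per topic in fixed time windows keyed by a consistent timestamp so that counts at different tiers can be compared to reveal loss or delay, can be roughly sketched as follows (bucket size and field names are assumptions):

```python
from collections import defaultdict

BUCKET_SECONDS = 600  # assumed 10-minute audit window

# audit_counts[(topic, bucket_start)] -> number of messages observed in that window
audit_counts: dict[tuple[str, int], int] = defaultdict(int)


def record(topic: str, event_time: float) -> None:
    """Assign a message to its time bucket by event timestamp and count it."""
    bucket = int(event_time) - int(event_time) % BUCKET_SECONDS
    audit_counts[(topic, bucket)] += 1


def compare(tier_a: dict, tier_b: dict) -> None:
    """Report windows where the downstream tier saw fewer messages (possible loss)."""
    for key, upstream in tier_a.items():
        downstream = tier_b.get(key, 0)
        if downstream < upstream:
            topic, bucket = key
            print(f"{topic} @ {bucket}: lost {upstream - downstream} messages")
```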

IV. Introduction to Kafka application achievements

Based on these technical characteristics, Kafka has been maturely applied in the resource service platform project of a provincial department. It is mainly used to collect logs and massive data, to provide a large-scale message processing platform for data sharing among business systems, and to form the data pipeline between prefectures/cities and the provincial department.

Combining in-depth study of Kafka and its plug-ins, the Xindehui Big Data Research Institute has independently developed a lightweight FSP stream processing engine for portable data docking, efficient processing, and all kinds of stream data extension applications.

4.1 Log aggregation

Logs from multiple systems are aggregated through Kafka and made available for consumption by auditing or other monitoring systems. Log aggregation usually collects log files from servers and puts them in a centralized location (a file server or HDFS) for processing. Kafka, however, abstracts away the details of files and treats logs or events as a cleaner message stream. This gives Kafka lower processing latency and makes it easier to support multiple data sources and distributed data processing. Compared with log-centric systems such as Scribe or Flume, Kafka offers equally efficient performance, higher durability thanks to replication, and lower end-to-end latency.
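One hedged way to feed application logs into such a pipeline is a logging handler that publishes each record to a Kafka topic. The handler below is an illustrative sketch using the confluent-kafka client with placeholder broker and topic names, not the mechanism used in the project.

```python
import logging

from confluent_kafka import Producer


class KafkaLogHandler(logging.Handler):
    """Publish each log record to a Kafka topic instead of a local file."""

    def __init__(self, brokers: str, topic: str):
        super().__init__()
        self.topic = topic
        self.producer = Producer({"bootstrap.servers": brokers})

    def emit(self, record: logging.LogRecord) -> None:
        self.producer.produce(self.topic, value=self.format(record))
        self.producer.poll(0)  # serve delivery callbacks without blocking


logger = logging.getLogger("app")
logger.addHandler(KafkaLogHandler("localhost:9092", "app-logs"))  # placeholders
logger.warning("disk usage above threshold")  # ends up in the app-logs topic
```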

4.2 Message system

Decoupling between systems, data sharing among business systems, and business process driving are all handled through Kafka.

Compared with most messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message processing applications. Messaging workloads generally have relatively low throughput but require low end-to-end latency and often depend on the strong durability guarantees that Kafka provides. In this area, Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

4.3 Data pipeline

Kafka allows integration work to connect to a single shared pipeline instead of connecting every data producer to every data consumer.

Kafka provides the data pipeline for all kinds of data resources across multiple prefectures and cities: when integrating, there is no need to know the details of the original data source; when publishing data, there is no need to know which application will consume or load it; and when adding a new system, it only needs to be connected to the existing Kafka streaming data platform.

A case of Kafka data pipeline in a provincial department

4.4 ETL pipeline

Before Kafka is introduced, the ETL process has to create temporary databases and generate landing files many times, consuming memory when the files are produced and again whenever the temporary database is queried. Such a heavyweight architecture has no ability to process streaming data.

After Kafka is introduced, micro ETL is realized: through the Kafka stream processing engine, the ETL process is simplified, data processing is broken into finer-grained steps, and the target data is obtained with low latency.
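The micro ETL idea can be sketched as a consume-transform-produce loop (confluent-kafka client; topic names and the cleaning rule are placeholders): raw records are read from a source topic, standardized in memory with no temporary database or landing file, and written to a target topic with low latency.

```python
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "micro-etl",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["raw-records"])         # placeholder source topic
producer = Producer({"bootstrap.servers": "localhost:9092"})


def standardize(raw: bytes) -> bytes:
    """Placeholder cleaning rule: trim fields and normalize the plate number."""
    rec = json.loads(raw)
    rec["plate"] = rec.get("plate", "").strip().upper()
    return json.dumps(rec).encode("utf-8")


while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.produce("clean-records", value=standardize(msg.value()))  # placeholder target topic
    producer.poll(0)
```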

Advantages of Micro ETL:

Seamlessly connects to the stream processing engine to complete fast data ETL

Kafka builds a scalable and reliable data flow channel

Interactive low latency

Micro ETL to realize portable data processing flow

Comparison between traditional ETL and Micro ETL

4.5 FSP stream processing engine

4.5.1 FSP architecture

FSP architecture

Stream processing platform: a configurable management platform for stream data, the core processing engine, and the stream collection tools

Core processing engine: PIPELINEDB allows data streams to be manipulated with SQL and the results of those operations to be stored; the Kafka plug-in extends Kafka so that SQL-on-Kafka extension applications can be built over all kinds of stream data

Stream collection toolset: Kafkacat, together with sqluldr and copy, connects collected data to Kafka and realizes the collection of stream data.

4.5.2 Kafkacat

4.5.2.1 A tool for fetching and sending messages

Kafkacat is a non-JVM tool: fast, lightweight, statically compiled to less than 150 KB, and it provides metadata listings that show clusters / partitions / topics.

Kafkacat working mode

4.5.2.2 Generating GP external tables by loading data with the kafkacat command

Data docking between GP and Kafka is realized through Kafkacat: the kafkacat tool obtains data from GP and Kafka according to the external-table protocol and generates external tables, enabling parallel loading of data. Rows with data format errors are handled fault-tolerantly through the external table.

Kafkacat loads GP external table

V. Prospects for Kafka extension applications

Integrate NiFi with Kafka and place MiNiFi as a data collector next to the data source, forming an extensible, flowing streaming data processing production line.

Combination of Kafka and NiFi

5.1 NiFi introduction

NiFi is an easy-to-use, powerful, and reliable data processing and distribution system. Simply put, NiFi automates the management of data flow between systems. Through its integration with Kafka, it provides visual command and control, supports displaying and editing data flows, and enables end-to-end tracking of data flows.

NiFi features:

1. Visual command and control

A web-based user interface provides a seamless experience for designing, monitoring, and controlling data flows.

2. High scalability

NiFi provides a custom class-loader model that keeps the dependency constraints between extension components within a very limited scope, so when you create an extension component you no longer have to worry much about conflicts with other components. Data flow processors execute in a predictable and repeatable manner.

3. Data back pressure

NiFi caches all queued data and can apply back pressure when a queue reaches a specified limit or the data reaches a specified age.

4. Highly configurable

Fault tolerance against data loss with guaranteed delivery, low latency and high throughput, dynamic prioritization; flows can be modified at run time.

5. Security

Between systems, NiFi can encrypt data with two-way SSL, and it allows the use of shared keys and other mechanisms at the sending and receiving ends to encrypt and decrypt the data stream.

Between users and the system, NiFi allows two-way SSL authentication and provides a pluggable authorization model, so you can control each user's permissions (for example, read-only access, data flow manager, system administrator).

5.2 NiFi implements a distributed streaming platform for unified real-time data collection

Real-time data collector MiNiFi:

Realizes real-time collection of incremental data and stream data, rather than traditional scheduled collection, achieving finer-grained data acquisition.

Can support a variety of data sources, with strong applicability.

Realizes end-to-end data collection.

Distributed streaming platform NiFi:

Collects the data, forms the data stream, and automatically records, indexes, and tracks the data source.

Precise control of data flow

A single NiFi node can process about 100 megabits of data per second; building a NiFi cluster can scale this to G-level data per second.

NiFi distributed streaming platform

The above is the practice of Kafka applications in the field of public safety, as shared by the editor. If you happen to have similar doubts, you can refer to the above analysis. If you want to learn more, you are welcome to follow the industry information channel.
