How does Serverless connect the data flow upstream and downstream of Kafka? This article gives a detailed analysis of the question and a corresponding solution, in the hope of helping readers who want to solve this problem find a simple, feasible approach.
As a key component of big data architectures, CKafka acts as a data aggregation point, a traffic peak-shaving buffer and a message pipeline. A variety of excellent open source solutions exist for the data flowing upstream and downstream of CKafka, such as Logstash, Filebeat, Spark and Flink. This article introduces a new option: Serverless Function. Compared with the existing open source solutions, it performs very well in terms of learning cost, maintenance cost and capacity scaling.
Introduction to Tencent Cloud Kafka
Tencent Cloud Kafka is a cloud Kafka service built on the open source Kafka engine and suited to large-scale public cloud deployment. It is a distributed, highly reliable, high-throughput and highly scalable message queuing system designed for public cloud deployment, operation and maintenance. It is 100% compatible with the open source Kafka API, currently supports four major open source versions (0.9, 0.10, 1.1.1 and 2.4.2), and provides backward compatibility.
At present, Tencent Cloud Kafka maintains clusters of more than 4,000 nodes that carry over 9 trillion messages per day, with peak bandwidth above 800 GB/s and more than 20 PB of accumulated data. It is a reliable public cloud Kafka service that integrates tenant isolation, rate limiting, authentication, security, data monitoring and alerting, fast failover, cross-availability-zone disaster recovery and a series of other features.
What is data flow?
CKafka is a high-throughput, highly reliable message queuing engine that must handle large volumes of data flowing in and out; we call this process data flow. Many mature, rich open source solutions exist for handling data inflow and outflow, such as Logstash, Spark and Flink. From simple data dumping to complex data cleaning, filtering and aggregation, ready-made solutions are available.
As shown in the figure, in the upstream and downstream ecosystem around Kafka, CKafka sits in the middle layer, acting as a data aggregation point, traffic peak-shaving buffer and message pipeline. On the left and top is an overview of the components that write data in; on the right and bottom are the downstream stream processing solutions and persistent storage engines. Together these form the data flow ecosystem around Kafka.
A new solution for data flow: Serverless Function
The following figure is a schematic diagram of a typical data flow in stream computing, in which the flow itself is handled by a variety of open source solutions. From a purely functional and performance point of view, the open source solutions perform well.
From the point of view of learning cost, maintenance cost, monetary cost and capacity scaling, however, these open source solutions still fall short. Their main drawbacks are as follows:
Learning cost
The cost of tuning, maintenance and troubleshooting
The cost of capacity expansion and reduction
Take Logstash as an example: its entry threshold is not high, but advanced use carries a real cost, including mastering its many release versions, parameter tuning and fault handling, and follow-up maintenance (process availability, single-machine load) and so on. If we use a stream computing engine such as Spark or Flink instead, we gain distributed scheduling and real-time data processing capabilities, but the learning threshold and later cluster maintenance costs rise sharply.
Now let's look at how Serverless Function handles data flow. As shown in the figure, Serverless Function runs in the processing layer where data flows in and out, replacing the open source solutions. It implements data cleaning, filtering, aggregation, dumping and other capabilities as custom code, and offers low learning cost, zero maintenance burden, automatic scaling, pay-as-you-go billing and other excellent characteristics.
Next, let's look at how Serverless Function implements data flow, understand its underlying operating mechanism, and see where its advantages come from.
Data transfer based on Serverless Function
First, let's look at how to use Serverless Function to implement a Kafka-to-Elasticsearch data flow. The following shows how a function, triggered by events, implements low-cost data cleaning, filtering, formatting and dumping:
In a business error-log collection and analysis scenario, log information is collected on each machine and sent to the server side, which uses Kafka as the message middleware to provide reliable data buffering and traffic peak shaving. To keep data for the long term (months or years), the data is generally cleaned, formatted, filtered and aggregated before being stored in a backend distributed storage system such as HDFS, HBase or Elasticsearch.
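Before looking at the function itself, here is a minimal sketch of the upstream side of this scenario: a collector process publishing one raw error-log record to Kafka. It assumes the kafka-python client; the broker address and topic name are hypothetical placeholders, not values from this article.

import json
from kafka import KafkaProducer

# hypothetical CKafka access point; replace with your instance's address
producer = KafkaProducer(
    bootstrap_servers="172.16.16.10:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

# one raw error-log record in the source format shown below
errorLog = {
    "version": 1,
    "componentName": "trade",
    "timestamp": 1595944295,
    "eventId": 9128499,
    "returnValue": -1,
    "returnCode": 101103,
    "returnMessage": "return has no deal return error [error: missing * * c parameter] [seqId:u3Becr8iz*]",
    "data": []}

producer.send("error-log-topic", errorLog)  # topic name is a placeholder
producer.flush()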
The following code is divided into three parts: the message format of the data source, the target message format after processing, and the function code that implements the transformation.
Source data format:
{"version": 1, "componentName": "trade", "timestamp": 1595944295, "eventId": 9128499, "returnValue":-1, "returnCode": 101103, "returnMessage": "return has no deal return error [error: missing * * c parameter] [seqId:u3Becr8iz*]", "data": [] SeqId: "@ kibana-highlighted-field@u3Becr8iz@/kibana-highlighted-field@*"}
Target data format:
{"timestamp": "2020-07-28 21:51:35", "returnCode": 101103, "returnError": "return has no deal return error", "returnMessage": "error: missing * * c parameter", "requestId": "u3Becr8iz*"}
Function code:
The function converts data from the source format to the target format through cleaning, filtering and formatting, and dumps it to Elasticsearch. The logic is simple: when CKafka receives a message, it triggers the function. The function filters, reorganizes and formats the message with the convertAndFilter function, converting the source data to the target format, and finally stores the result in Elasticsearch.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import json
from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch import helpers

esServer = "http://172.16.16.53:9200"  # change to your ES server address + port, e.g. http://172.16.16.53:9200
esUsr = "elastic"                      # change to your ES user name, e.g. elastic
esPw = "PW123"                         # change to your ES password
esIndex = "pre1"                       # ES index to write to

# ... or specify common parameters as kwargs
es = Elasticsearch([esServer],
                   http_auth=(esUsr, esPw),
                   sniff_on_start=False,
                   sniff_on_connection_fail=False,
                   sniffer_timeout=None)

def convertAndFilter(sourceStr):
    target = {}
    source = json.loads(sourceStr)
    # filter out logs with returnCode == 0 (non-error logs)
    if source["returnCode"] == 0:
        return
    dateArray = datetime.fromtimestamp(source["timestamp"])
    target["timestamp"] = dateArray.strftime("%Y-%m-%d %H:%M:%S")
    target["returnCode"] = source["returnCode"]
    # split "... error [error: ...] [seqId:...]" into its bracketed parts
    message = source["returnMessage"].split("] [")
    errorInfo = message[0].split("[")
    target["returnError"] = errorInfo[0]
    target["returnMessage"] = errorInfo[1]
    target["requestId"] = message[1].replace("]", "").replace("seqId:", "")
    return target

def main_handler(event, context):
    # iterate over the event's Records field and convert each message;
    # event data structure: https://cloud.tencent.com/document/product/583/17530
    for record in event["Records"]:
        # get the msgBody carried by the CKafka trigger
        target = convertAndFilter(record["Ckafka"]["msgBody"])
        if target is None:  # message was filtered out
            continue
        action = {
            "_index": esIndex,
            "_source": {
                "msgBody": target
            }
        }
        helpers.bulk(es, [action])
    return "successful!"
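To sanity-check the snippet before deploying it, you can invoke main_handler locally with a hand-built event. This is a minimal sketch, assuming the CKafka trigger event shape documented at the link in the code above; the topic name is a made-up placeholder, and the run will attempt to write to the Elasticsearch address configured above.

if __name__ == "__main__":
    # build one fake trigger record wrapping the sample source message
    sampleEvent = {"Records": [{"Ckafka": {
        "topic": "error-log-topic",  # placeholder topic name
        "partition": 0,
        "offset": 0,
        "msgKey": "None",
        "msgBody": json.dumps({
            "version": 1, "componentName": "trade",
            "timestamp": 1595944295, "eventId": 9128499,
            "returnValue": -1, "returnCode": 101103,
            "returnMessage": "return has no deal return error [error: missing * * c parameter] [seqId:u3Becr8iz*]",
            "data": []})}}]}
    print(main_handler(sampleEvent, None))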
Reading this, you may notice that the code snippet looks like an ordinary single-machine script for processing a small amount of data: it just converts and dumps, very simply. In fact, from a micro point of view, many distributed systems do equally simple things. The distributed framework itself takes care of distributed scheduling, distributed execution, reliability, availability and so on; refined down to the execution unit, the work is essentially the same as the code snippet above.
From a macro point of view, Serverless Function does the same thing as distributed computing frameworks such as Spark and Flink: it schedules and runs basic execution units that carry the business logic. The difference is that with the open source solutions, users need to learn, use and maintain the running engine themselves, while with Serverless Function the platform does these things for them.
Next, let's look at how Serverless Function supports these capabilities underneath, and examine its underlying operating mechanism. As shown in the figure:
A function is submitted to the platform as a code snippet, and something must trigger it to run. At present there are three main trigger methods: event trigger, timer trigger and active invocation.
The example above uses an event trigger: when a message is committed to Kafka, it triggers the function to run. The Serverless scheduling platform then dispatches the underlying containers to execute the function's logic. The container concurrency is scheduled and calculated automatically by the system: when the Kafka source produces more data, concurrency rises, and when it produces less, concurrency falls. Because functions are billed by running time, when the source message volume is small, concurrency is low and total running time is short, so the monetary cost drops accordingly.
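For reference, each such invocation receives an event shaped roughly as follows. This sketch is based on the event data structure documented at the link in the code above; the topic name and message body are illustrative values, not output captured from a real trigger.

{
  "Records": [
    {
      "Ckafka": {
        "topic": "error-log-topic",
        "partition": 1,
        "offset": 36,
        "msgKey": "None",
        "msgBody": "{\"version\": 1, \"returnCode\": 101103, ...}"
      }
    }
  ]
}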
During function execution, users do not need to worry about the function's reliability, automatic scaling, scheduling, concurrency and so on. The only thing users need to cover is that the function code snippet runs correctly and is free of bugs. This greatly reduces the effort R&D staff have to invest.
It is worth mentioning that, in terms of development languages, the open source solutions support only their corresponding languages: Logstash embedded scripts use Ruby, and Spark mainly supports Java, Scala and Python. Serverless Function supports almost all common development languages in the industry, including but not limited to Java, Golang, Python, Node.js and PHP. This lets developers solve data flow problems in a language they are already familiar with, which in turn reduces the chance of coding errors and other problems.
The advantages of Serverless Function in data flow scenarios
Let's look at the main differences and advantages of Serverless Function compared with the open source solutions. As shown in figure 5, in non-real-time data flow scenarios Serverless Function has almost overwhelming advantages over the existing open source solutions. From a functional and performance point of view it fully satisfies batch (non-real-time) computing scenarios, yet compared with the open source schemes its learning cost is almost negligible, and its dynamic scaling and on-demand, millisecond-granularity billing are also very friendly to monetary cost.
To sum up in one sentence: with Serverless Function, you can write a small piece of code in a familiar language to link up the data flow in stream computing.
The prospect of Serverless Function in batch computing scenarios
With the development of computing, the field has gradually differentiated into batch computing, stream computing, interactive computing and graph computing. When an architect chooses between batch and stream computing for a business, the core is to use each as needed and strike a balance among latency, throughput, fault tolerance, cost and so on. From the user's point of view, batch processing provides an accurate batch view of the data, while stream processing provides a near-real-time view. In batch processing, and in the future convergence of batch and stream technologies, the Lambda architecture is the natural path of development.
With its on-demand use, automatic scaling and almost unlimited horizontal capacity, Serverless Function already provides a choice for batch processing at this stage, and it is worth looking forward to as batch and stream processing converge.
This concludes the practical analysis of how Serverless connects the data flow upstream and downstream of Kafka. I hope the above content is of some help to you. If you still have doubts to resolve, you can follow the industry information channel for more related knowledge.