Welcome to the world of BigData
We have entered the age of data: information drawn from data is closely tied to our lives and work. This article gives a brief introduction to real-time data processing with big data frameworks, covering:
The concept and significance of real-time data processing
What can real-time data processing do?
A brief introduction to the real-time data processing architecture
A real-time data processing code demonstration
The concept and significance of real-time data processing
What is real-time data processing? My personal understanding is: data generation -> real-time collection -> real-time caching/storage -> (quasi) real-time computation -> real-time persistence -> real-time display -> real-time analysis. Data moves down this pipeline with processing latencies measured in seconds or even milliseconds.
What is the significance of real-time data processing? Once we have the data, we can analyze it and use statistical methods to tease out the relationships hidden in complex data, such as trends, influencing factors, and causality. We can even build BI dashboards that visualize the useful information in the data and turn it into data stories.
What can real-time data processing do?
Real-time computation of data
What is real-time computation of data? The data we get from a source is often not in the shape we need, so we want to ETL it, join it with other data, and so on; that is where real-time computation comes in. The mainstream real-time computing frameworks today are Spark, Storm, and Flink.
Real-time persistence (landing) of data
Real-time persistence means that the source data, or the results computed from it, are stored as soon as they arrive. In the big data field, HDFS and ES are the recommended stores.
Real-time display and analysis of data
Once we have the data, we need to put its value to use. That value lies in the correlations within the data, in its relationship to history, and in its ability to predict the future. With real-time data we can not only drive live displays through a front-end framework, but also train algorithms on part of the data to predict future trends, and so on.
Example:
The Taobao Double 11 big screen: every year, Singles Day (November 11) is a shopping frenzy for Taobao users. On that day, Jack Ma puts up a huge electronic screen at Alibaba headquarters to show Taobao's performance as it happens: transaction volume, number of visitors, number of orders, order value, and so on. Behind that big screen is exactly the real-time data processing we are talking about. First, Alibaba's servers across the country collect logs from PCs, mobile devices, and other clients, with data collection tools deployed on those servers to report them. Next, because the data volume is huge, the data is buffered in a cache layer. The raw logs are then computed on in real time, for example to filter out the metrics listed above. Finally, the results are pushed through an interface (or by other means) to the front-end screen for real-time display.
A brief introduction to the real-time data processing architecture
Now we come to the main part of this article, starting with a data flow diagram:
[Data flow diagram: source -> Flume -> Kafka -> Spark -> ES]
For data collection, we choose Flume, a mainstream data collection tool.
For data caching, we use the distributed message queue Kafka.
For real-time computation, we choose the Spark compute engine.
For data storage, we choose the distributed store ES.
Everything after that refers to the visualization and data analysis done once the data is read back from ES.
The following is a brief introduction to each component:
Flume
Flume is a distributed data collection system with high reliability and high availability; it offers transaction management, restart after failure, aggregation, and transport. It processes data quickly and can be used in production environments.
The core concepts of flume are: event, agent, source, channel, sink
Event
Flume's data flow is built around events. An event is Flume's basic unit of data: it carries the log data together with header information. Events are generated outside the agent; when a source captures an event, it formats it and pushes it into a channel. Think of the channel as a buffer that holds the event until a sink has processed it. The sink is responsible for persisting the log or pushing the event on to another source.
Agent
The core of Flume is the agent. An agent is a Java process that runs on the log collection side: it receives logs, buffers them temporarily, and then sends them to their destination. Each machine runs one agent, and an agent can contain multiple sources, channels, and sinks.
Source
The source is the collection end of the data. It formats the captured data, wraps it into events, and pushes the events into a channel. Flume provides many built-in sources with support for Avro, log4j, syslog, and so on. If no built-in source fits the environment, Flume also supports custom sources.
Channel
The channel is the component that connects a source to a sink; you can think of it as a data buffer (a data queue). It can hold events in memory or persist them to local disk until the sink has finished processing them. The two most commonly used channels are MemoryChannel and FileChannel.
Sink
The sink takes events from the channel and sends the data elsewhere: to a file system, a database, Hadoop, Kafka, or the source of another agent.
Reliability and recoverability of flume
Reliability of flume: when a node fails, logs can be sent to other nodes without being lost. Flume guarantees this by writing received data to disk first and deleting it only after it has been transferred successfully; if a send fails, the data can be re-sent.
Recoverability of flume: recoverability depends on the channel (a FileChannel, for example, persists events to disk).
That description is rather abstract, so here are two diagrams from the official website:
Single agent collection data flow chart
Data flow chart of multiple agent collaborative processing
Kafka
Kafka is a high-throughput distributed publish-subscribe messaging system. In enterprises, Kafka is generally used as message middleware for buffering. It relies on ZooKeeper as its distributed coordination component.
The design goals of kafka:
Provide message persistence with constant-time access performance, even for data at the terabyte scale and beyond.
High throughput: even on very cheap machines, a single machine can handle 100,000 messages per second.
Support partitioning of messages across Kafka servers and distributed consumption, while guaranteeing the ordering of messages within each partition.
Both offline data processing and real-time data processing are supported.
Core concepts of kafka
Broker: a message-middleware processing node. One Kafka node is one broker, and multiple brokers form a Kafka cluster.
Topic: a topic. A Kafka cluster can handle the distribution of many topics at the same time.
Partition: a physical grouping of a topic. A topic can be divided into multiple partitions, and each partition is an ordered queue.
Offset: each partition consists of a series of ordered, immutable messages that are continuously appended to it. Each message in a partition has a sequential id called the offset, which uniquely identifies that message within the partition.
Producer: responsible for publishing messages to a Kafka broker.
Consumer: the message consumer; the client that reads messages from a Kafka broker.
Consumer group: each consumer belongs to a specific consumer group.
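To make these concepts concrete, here is a minimal sketch (not part of the original article) of a producer publishing one message, using the plain kafka-clients API from Scala; the broker address localhost:9092 and the topic name topic8 are assumptions for illustration:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object SimpleProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")                // broker(s) of the kafka cluster
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    // each record is appended to one partition of the topic and gets an offset there;
    // records that share a key always land in the same partition, preserving their order
    producer.send(new ProducerRecord[String, String]("topic8", "key1", "tom:china:25"))
    producer.close()
  }
}

A consumer in a given consumer group would then read these records back from the partitions by offset.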
Two diagrams from the official website:
Producer-broker-consumer
Partition diagram
Spark
Spark is a distributed computing framework, and in my view the most popular computing framework today.
Spark bills itself as a "one stack to rule them all" big data computing framework: it aims to solve every kind of computing task in the big data field with a single technology stack. Apache's official definition of Spark is a fast, general engine for large-scale data processing.
Spark composition
Spark Core for offline computation
Spark SQL for interactive queries
Spark Streaming and Structured Streaming for real-time stream computing
Spark MLlib for machine learning
Spark GraphX for graph computation
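As a minimal illustration of the Spark SQL component listed above (a sketch, not part of the original demo), the following runs an interactive-style query in local mode; the file people.json and its fields are hypothetical placeholders:

import org.apache.spark.sql.SparkSession

object SparkSqlQuickLook {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlQuickLook")
      .master("local[2]")
      .getOrCreate()

    // hypothetical JSON file with one record per line: {"name": ..., "country": ..., "age": ...}
    val df = spark.read.json("people.json")
    df.createOrReplaceTempView("people")

    // interactive query over the registered view
    spark.sql("SELECT country, COUNT(*) AS cnt FROM people GROUP BY country").show()

    spark.stop()
  }
}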
Characteristics of spark
Speed: Spark computes in memory (though some steps, such as shuffle, are disk-based).
Ease of development: Spark's RDD-based computing model is easier to understand and develop against than Hadoop's map-reduce model, making it simpler to implement all kinds of complex functionality (see the sketch after this list).
Versatility: the components Spark provides cover offline batch processing, interactive queries, streaming computation, machine learning, graph computation, and the other common tasks in the big data field in one stop.
Good integration with other technologies: for example, Hadoop's HDFS, Hive, and HBase handle storage, YARN handles resource scheduling, and Spark handles the big data computation.
Extremely active community: Spark is currently a top-level Apache project, a large number of outstanding engineers around the world are Spark committers, and many of the world's top IT companies use Spark at scale.
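Here is the sketch referred to above: a word count, the classic example that needs a full map-reduce job in Hadoop, written against Spark's RDD model (assumptions: local mode and a plain-text file named input.txt):

import org.apache.spark.{SparkConf, SparkContext}

object RddWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddWordCount").setMaster("local[2]"))

    val counts = sc.textFile("input.txt")      // read the file as an RDD of lines
      .flatMap(_.split("\\s+"))                // split each line into words
      .map(word => (word, 1))                  // pair every word with a count of 1
      .reduceByKey(_ + _)                      // sum the counts per word

    counts.take(10).foreach(println)           // print a few results
    sc.stop()
  }
}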
A Spark architecture diagram:
A real-time data processing code demonstration
Building the cluster environments
We need to set up a Flume cluster, a Kafka cluster, an ES cluster, and a ZooKeeper cluster. Since Spark runs in local mode in this example, there is no need to build a Spark cluster.
Configure the files that integrate the components
Once the clusters are built, configure each component's configuration file according to how the components connect to one another. The main piece is the Flume configuration, described below:
In our agent, the source is r1, the channel is c1, and the sink is k1. The source is my local nc (netcat): to collect logs you only need to open port 9999. The channel uses the memory type, and the sink writes to the Kafka topic topic8.
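Since the original configuration screenshot is not reproduced here, the following is a reconstruction of that kind of configuration (the agent name a1 and the Kafka broker address are assumptions):

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source listening on local port 9999
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 9999

# memory channel used as the buffer
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# kafka sink writing to the topic8 topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.topic = topic8

# wire source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1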
Start each cluster process
Start the ZooKeeper service; QuorumPeerMain is the ZooKeeper process.
Start the Kafka service.
Start the ES service.
Start the Flume service; Application is the Flume process.
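For reference, typical startup commands look roughly like the following (a sketch only; the exact scripts, paths, configuration file names, and the agent name a1 depend on your installation and are assumptions here):

zkServer.sh start                                         # zookeeper (QuorumPeerMain process)
kafka-server-start.sh -daemon config/server.properties    # kafka broker
elasticsearch -d                                          # elasticsearch node
flume-ng agent -n a1 -c conf -f conf/flume-kafka.conf     # flume agent (Application process)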
Create the corresponding table (index) in ES
Create the table that corresponds to the ES output. It has three fields, matching the case class in the code (the code is pasted further down).
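A sketch of creating that index follows (assuming a pre-7.x Elasticsearch that still uses mapping types, since the code writes to people/man; the field types are illustrative):

curl -XPUT "http://127.0.0.1:9200/people" -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "man": {
      "properties": {
        "name":    { "type": "keyword" },
        "country": { "type": "keyword" },
        "age":     { "type": "integer" }
      }
    }
  }
}'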
The code is as follows:
package run

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.spark.rdd.EsSpark

/**
 * @author wangjx
 * Reads test data from kafka for statistics; kafka maintains the offsets itself
 * (custom offset maintenance is recommended instead).
 */
object SparkStreamingAutoOffsetKafka {

  // sample class corresponding to the ES table
  case class people(name: String, country: String, age: Int)

  def main(args: Array[String]): Unit = {
    val logger = Logger.getLogger(this.getClass)

    // spark configuration
    val conf = new SparkConf().setAppName("SparkStreamingAutoOffsetKafka").setMaster("local[2]")
    conf.set("es.index.auto.create", "true")
    conf.set("es.nodes", "127.0.0.1")
    conf.set("es.port", "9200")

    // spark streaming initialization: one batch every 10 seconds (quasi real-time),
    // e.g. every 10 seconds compute statistics over roughly the last minute of data
    val ssc = new StreamingContext(conf, Seconds(10))
    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    // kafka parameters
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "XRV:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "exactly-once",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // kafka topic
    val topic = Set("kafka8")

    // get data from kafka
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topic, kafkaParams))

    // business logic: split each record into a people sample object
    val kafkaValue: DStream[String] = stream.flatMap(line => Some(line.value()))
    val peopleStream = kafkaValue
      .map(_.split(":"))
      .map(m => people(m(0), m(1), m(2).toInt))

    // store to ES
    peopleStream.foreachRDD(rdd => {
      EsSpark.saveToEs(rdd, "people/man")
    })

    // start the program
    ssc.start()
    ssc.awaitTermination()
  }
}
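For a quick end-to-end check (assuming the setup described above), you can connect to the Flume netcat source with nc localhost 9999 and type a record in the name:country:age format the code splits on, for example tom:china:25; within one 10-second batch the corresponding document should appear in the people/man index in ES.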