1. What is Flume?
Flume is a real-time log collection system developed by Cloudera that has been widely recognized and adopted in the industry. Its early releases are now collectively referred to as Flume OG (original generation) and belonged to Cloudera. As Flume's functionality expanded, the shortcomings of Flume OG became apparent: a bloated code base, poorly designed core components, and non-standard core configuration. In the last Flume OG release, 0.94.0, unstable log transmission was especially serious. To solve these problems, Cloudera completed FLUME-728 on October 22, 2011, a landmark change that refactored the core components, core configuration, and code architecture. The refactored version is collectively referred to as Flume NG (next generation). Another reason for the change was the move of Flume into Apache, with Cloudera Flume renamed Apache Flume.
Characteristics of Flume:
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive amounts of log data. It supports customizing the various data senders in a logging system to collect data, and at the same time it can do simple processing of the data and write it to a variety of data receivers (such as text files, HDFS, HBase, and so on).
Flume's data flow is carried end to end by Events. An Event is Flume's basic unit of data: it carries the log data (as a byte array) and header information. Events are generated by sources outside the Agent; when a Source captures an event, it applies format-specific processing and then pushes the event into one or more Channels. You can think of a Channel as a buffer that holds the event until a Sink has finished processing it. The Sink is responsible for persisting the log or pushing the event on to another Source.
Reliability of Flume:
When a node fails, logs can be delivered to other nodes without being lost. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the agent first writes the event to disk when the data is received and deletes it only after the transfer succeeds; if sending fails, the data can be re-sent), store-on-failure (the strategy also adopted by Scribe: when the data receiver crashes, the data is written locally and sending resumes once the receiver recovers), and best-effort (the data is sent to the receiver without any acknowledgment).
Recoverability of Flume:
Recovery also relies on the Channel. FileChannel is recommended: events are persisted in the local file system (at the cost of lower performance).
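As a rough illustration (the agent name and directories below are assumptions, not values from the article), a FileChannel that persists events to local disk is configured like this:

agent1.channels = ch1
agent1.channels.ch1.type = file
# Where checkpoint metadata is kept (adjust the path to your environment)
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
# One or more directories that hold the event data files
agent1.channels.ch1.dataDirs = /var/flume/data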
Some core concepts of Flume:
An Agent runs Flume in a JVM. Each machine typically runs one agent, but a single agent can contain multiple sources and sinks.
1. The Client produces the data and runs in a separate thread.
2. The Source collects data from the Client and passes it to the Channel.
3. The Sink collects data from the Channel and runs in a separate thread.
4. The Channel connects the sources and the sinks; it works somewhat like a queue.
5. Events can be log records, Avro objects, and so on.
Flume takes the agent as its smallest independent unit of operation; an agent is a single JVM. A single agent is made up of three major components: Source, Channel, and Sink.
It is worth noting that Flume ships with a large number of built-in Source, Channel, and Sink types. Different types of Sources, Channels, and Sinks can be combined freely, and the combination is driven entirely by the user's configuration file, which makes Flume very flexible. For example, a Channel can hold events temporarily in memory or persist them to the local hard disk, and a Sink can write logs to HDFS, HBase, or even to another Source. Flume also lets users build multi-level flows: multiple agents can work together, with support for fan-in, fan-out, contextual routing, and backup routes. This is where Flume really shines.
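As a minimal sketch of how such a combination is expressed in the configuration file (the agent name, log path, and HDFS address are illustrative assumptions), one agent with an exec source, a memory channel, and an HDFS sink could look like this:

# Name the components of agent a1
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail a hypothetical application log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory (use type "file" to trade speed for durability)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# Sink: write the events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1

Pointing the sink at Kafka or at another agent's Avro source instead is only a matter of changing this file, which is exactly the flexibility described above.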
2. Flume + Kafka + Spark Streaming application scenarios:
1. The Flume cluster collects business data from external systems and sends it to the Kafka cluster; the data is then handed to the Spark Streaming framework for computation and processing. After the stream processing is complete, the final result is sent back to Kafka for storage.
2. The Flume cluster collects business data from external systems and sends it to the Kafka cluster; the data is then handed to the Spark Streaming framework for computation and processing. After the stream processing is complete, the final result is sent back to Kafka for storage, and the result is displayed graphically through the Ganglia monitoring tool.
3. What we want to build: an interactive, 360-degree, 3D visualization UI on top of Spark Streaming. The Flume cluster collects business data from external systems and sends it to the Kafka cluster; the data is then handed to the Spark Streaming framework for computation and processing. After the stream processing is complete, the final result is sent back to Kafka for storage and also written to a database (MySQL) and in-memory middleware (Redis, MemSQL), while the result is displayed graphically through the Ganglia monitoring tool. A hedged sketch of the "write results to Redis" step follows this list.
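As a sketch of the "final result into Redis" step (the stream element type, Redis host, and key names are assumptions for illustration, not part of the original design), written in Scala with the Jedis client:

import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

// resultStream is assumed to be a DStream[(String, Long)] of (metric, count)
// pairs produced by the upstream Spark Streaming computation.
def publishToRedis(resultStream: DStream[(String, Long)]): Unit = {
  resultStream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One connection per partition; this code runs on the executors.
      val jedis = new Jedis("redis-host", 6379)
      records.foreach { case (metric, count) =>
        jedis.hset("streaming:results", metric, count.toString)
      }
      jedis.close()
    }
  }
}

Writing to MySQL or MemSQL would follow the same foreachRDD/foreachPartition pattern with a JDBC connection in place of Jedis.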
3. There are two ways for Spark Streaming to receive data from Kafka:
One is the receiver-based approach: Receivers are used to pull in the data, and they are implemented with Kafka's high-level consumer API. For all receivers, the received data is stored in Spark's executors and then processed by the jobs that Spark Streaming launches. However, with the default configuration this approach can lose data on failure. To guarantee zero data loss you can enable Spark Streaming's write-ahead log (WAL) feature, which saves the received data to a WAL (the WAL can be stored on HDFS), so that after a failure the data can be recovered from the WAL and nothing is lost.
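A minimal sketch of this receiver-based approach with the write-ahead log switched on, assuming the spark-streaming-kafka (0.8) artifact and made-up ZooKeeper, topic, and checkpoint names:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverBasedDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("FlumeKafkaReceiverDemo")
      // Enable the write-ahead log so received data survives a failure
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    // The WAL is written under the checkpoint directory (can live on HDFS)
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoint")

    val zkQuorum = "zk1:2181,zk2:2181,zk3:2181"   // Kafka's ZooKeeper quorum
    val groupId  = "log-consumer-group"           // high-level consumer group id
    val topics   = Map("flume-logs" -> 2)         // topic -> receiver thread count

    val lines = KafkaUtils
      .createStream(ssc, zkQuorum, groupId, topics, StorageLevel.MEMORY_AND_DISK_SER)
      .map(_._2)                                  // keep only the message value

    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}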
The other is the Direct API. Does this mean the data is produced and processed on two different machines? In fact it happens on the same machine: because the Driver and the Executor run on the same machine, that machine needs to be powerful enough.
The Flume cluster puts the collected data into the Kafka cluster, and Spark Streaming pulls the data from Kafka in real time, online, through the Direct API. Each batch reads its data by querying the latest offset of each Kafka topic+partition; even if a read fails, the failed data can be read again from those offsets, which guarantees the stability and data reliability of the application.
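A minimal sketch of the direct approach, again assuming the spark-streaming-kafka (0.8) artifact and placeholder broker addresses and topic name; the offset ranges of each batch are read back from the RDD so they could be saved and replayed after a failure:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectApiDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumeKafkaDirectDemo")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "kafka1:9092,kafka2:9092")
    val topics      = Set("flume-logs")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // Each batch maps to a set of topic+partition offset ranges; persisting
      // them lets a failed batch be read again from Kafka at the same offsets.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach(o =>
        println(s"${o.topic} [${o.partition}] ${o.fromOffset} -> ${o.untilOffset}"))
      println(s"records in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}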
A few reminders:
1. When the Flume cluster writes data into the Kafka cluster, the data may end up unevenly distributed, that is, some Kafka nodes hold a large amount of data while others hold very little. A custom algorithm can later be used to correct this storage imbalance; one possible form is sketched after these notes.
2. Using the Direct API in a production environment is highly recommended; later releases will further optimize the Direct API to reduce its latency.
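The "custom algorithm" mentioned in reminder 1 is not spelled out here; one possible form, sketched purely as an assumption, is a custom Kafka producer partitioner that spreads records round-robin instead of by key (registered with the producer through the partitioner.class property):

import java.util.concurrent.atomic.AtomicLong
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

// Ignores the record key and cycles through the partitions, so that no single
// Kafka node accumulates a disproportionate share of the data.
class RoundRobinPartitioner extends Partitioner {
  private val counter = new AtomicLong(0L)

  override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                         value: AnyRef, valueBytes: Array[Byte], cluster: Cluster): Int = {
    val numPartitions = cluster.partitionsForTopic(topic).size()
    (counter.getAndIncrement() % numPartitions).toInt
  }

  override def close(): Unit = ()

  override def configure(configs: java.util.Map[String, _]): Unit = ()
}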
Summary:
In a real production environment, Kafka sits at the core of distributed log collection.
Note:
Source: DT Big Data DreamWorks (IMF legendary action top-secret course)
For more exclusive content, please follow the WeChat official account: DT_Spark
If you are interested in big data and Spark, you can listen to the permanently free Spark open course offered by teacher Wang Jialin at 20:00 every evening, YY room number: 68917580