2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article explains what Apache Flume is, and in particular how its durable FileChannel works. The explanation is simple and clear, and easy to follow.
Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It provides tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
FileChannel is a persistent Flume channel that supports parallel writing to multiple disks and encryption.
Concepts
When using Flume, each flow has a Source, a Channel, and a Sink. A typical example is a webserver writing events over RPC to a Source (e.g. AvroSource), the Source writing those events to a MemoryChannel, and an HDFS Sink consuming the events and writing them to HDFS.
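As a concrete illustration, here is a sketch of an agent configuration wiring these pieces together. The agent name, port, and paths are hypothetical, and a FileChannel (the subject of this article) is shown in place of the MemoryChannel:

```properties
# Hypothetical agent "agent1": Avro RPC source -> file channel -> HDFS sink.
agent1.sources = avro-src
agent1.channels = file-ch
agent1.sinks = hdfs-sink

# Avro source listening for RPC events, e.g. from a webserver
agent1.sources.avro-src.type = avro
agent1.sources.avro-src.bind = 0.0.0.0
agent1.sources.avro-src.port = 4141
agent1.sources.avro-src.channels = file-ch

# Durable FileChannel; these directories should live on reliable disks
agent1.channels.file-ch.type = file
agent1.channels.file-ch.checkpointDir = /var/flume/checkpoint
agent1.channels.file-ch.dataDirs = /var/flume/data

# HDFS sink consuming events from the channel
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = file-ch
agent1.sinks.hdfs-sink.hdfs.path = /flume/events
```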
MemoryChannel provides high throughput, but it loses data when the machine loses power or the process crashes. This created an urgent need for a durable channel, and FileChannel was implemented in FLUME-1085. Its goal is to provide a reliable, high-throughput channel. FileChannel guarantees that once a transaction is committed, no data is lost on power failure or crash.
Importantly, FileChannel does not replicate any data itself; it relies entirely on the durability of the underlying disks. Users who depend on FileChannel for durability should keep this in mind when purchasing and configuring hardware: the underlying disks should use RAID, SAN, or something similar.
Many systems trade a small allowance of data loss for higher throughput. The Flume team decided on a different approach with FileChannel. Flume is a transactional system, and multiple events can be put or taken in a single transaction. The batch size is used to control throughput: with large batch sizes, Flume can move data at high throughput rates with no data loss. The batch size is controlled by the client. This approach is similar to the one used by a DBMS.
A Flume transaction consists of either puts or takes; a single transaction cannot contain both put and take operations. Each transaction implements both a put and a take method. A Source puts events onto the channel via put, and a Sink takes events off the channel via take.
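To make these semantics concrete, here is a minimal toy sketch in Python. This is not Flume's actual API, just an illustration of put-only/take-only transactions, batched commits, and rollback:

```python
# Toy channel: a transaction is either all puts or all takes, and puts
# become visible to takers only when the transaction commits.
class ToyChannel:
    def __init__(self):
        self.queue = []           # committed events, FIFO

    def begin(self):
        return _Txn(self)

class _Txn:
    def __init__(self, channel):
        self.channel = channel
        self.puts, self.takes = [], []

    def put(self, event):
        if self.takes:
            raise RuntimeError("a transaction cannot mix puts and takes")
        self.puts.append(event)

    def take(self):
        if self.puts:
            raise RuntimeError("a transaction cannot mix puts and takes")
        if not self.channel.queue:
            return None
        event = self.channel.queue.pop(0)
        self.takes.append(event)
        return event

    def commit(self):
        # Puts become visible only on commit; completed takes are forgotten.
        self.channel.queue.extend(self.puts)
        self.puts, self.takes = [], []

    def rollback(self):
        # Taken events go back onto the head of the queue; puts are discarded.
        self.channel.queue = self.takes + self.channel.queue
        self.puts, self.takes = [], []

# A source batches several puts into one transaction for throughput:
ch = ToyChannel()
tx = ch.begin()
for e in ["a", "b", "c"]:
    tx.put(e)
tx.commit()

# A sink takes in its own transaction; rollback returns the event.
tx2 = ch.begin()
first = tx2.take()      # "a"
tx2.rollback()          # "a" goes back to the head of the queue
```

The batch size here is simply the number of puts (or takes) grouped into one transaction before commit, which is exactly how the client trades latency for throughput.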
Design
FileChannel is based on an in-memory queue and a write-ahead log (WAL). Each transaction is written to the WAL according to its type (Take or Put), and the queue is modified accordingly. Each time a transaction is committed, fsync is called to ensure the events are durable in a disk file, and a pointer to each event is placed on the queue. The queue serves the same purpose as any other queue: it manages what has yet to be consumed by the sink. During a take, a pointer to an event is removed from the queue, and the event itself is read directly from the WAL. Given the large amounts of RAM available today, that read frequently comes from the operating system's file cache.
After a crash, replaying the WAL restores the queue to exactly its pre-crash state, with uncommitted transactions discarded. Replaying the whole WAL is time-consuming, so the queue itself is periodically written to disk; writing the queue to disk is called a checkpoint. After a crash, the queue is first loaded from the on-disk checkpoint file, and only the transactions committed after the last checkpoint are replayed, significantly reducing the amount of WAL that must be read.
For example, consider a channel holding two events, "a" and "b", written in a single Put transaction.
WAL entries contain three important attributes: the transaction ID, a sequence number, and the event data. Each transaction has a unique transaction ID, and each event has a unique sequence number. Transaction IDs are used simply to group events into the same transaction, while sequence numbers are used when replaying the log. In this example, the transaction ID is 1 and the sequence numbers are 1, 2, and 3.
When the queue is checkpointed to disk, the sequence number is incremented and saved to disk along with the queue. On restart, the queue is first loaded from disk, and then any WAL entry with a sequence number greater than the queue's is replayed. During a checkpoint, the queue is locked so that no Put or Take operation can change its state; allowing the queue to be modified during a checkpoint would leave the on-disk snapshot inconsistent with the actual queue.
In the example above, after transaction 1 commits, a checkpoint occurs, and the queue, with its events and sequence number 4, is saved to disk.
Then, in transaction 2, an event is taken from the queue.
If a crash occurs at this point, the queue is loaded from the checkpoint on restart. Since the checkpoint happened before transaction 2, both events "a" and "b" are loaded back into the queue, and any committed transaction with a sequence number greater than 4 is replayed. After replay, the "a" event is removed from the queue. The design described so far does not account for a checkpoint occurring while a Take or Put is still in progress, which can lead to data loss. Suppose the checkpoint occurs just after the take of "a": if a crash then follows, under the design above only event "b" is loaded into the queue, any WAL entry with a sequence number greater than 5 is replayed, and transaction 2 is rolled back, but the take of "a" is never replayed. Event "a" is lost. A similar situation exists for Puts. For this reason, when the queue is checkpointed, transactions that are still in progress are also written out, so that these cases can be handled correctly.
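The recovery path described above can be sketched as follows. This is a toy illustration, not Flume's implementation: it assumes a simplified WAL of (sequence number, transaction ID, operation, event) entries for committed transactions only, and it omits the in-flight-transaction handling just discussed:

```python
# Toy sketch of checkpoint + WAL replay: on restart, load the queue from
# the checkpoint, then replay only WAL entries whose sequence number is
# greater than the checkpoint's.
def recover(checkpoint, wal):
    """checkpoint: {"seq": int, "queue": [event, ...]}
    wal: list of (seq, txn_id, op, event), committed transactions only."""
    queue = list(checkpoint["queue"])
    for seq, txn_id, op, event in wal:
        if seq <= checkpoint["seq"]:
            continue                 # already reflected in the checkpoint
        if op == "put":
            queue.append(event)
        elif op == "take":
            queue.remove(event)
    return queue

# Transaction 1 put "a" and "b" (seqs 1-2; commit marker omitted),
# a checkpoint was taken at seq 4, then transaction 2 took "a" at seq 5.
wal = [(1, 1, "put", "a"), (2, 1, "put", "b"), (5, 2, "take", "a")]

# Crash after the checkpoint: only the seq-5 take is replayed.
checkpoint = {"seq": 4, "queue": ["a", "b"]}
print(recover(checkpoint, wal))      # ['b']
```

Replaying from an empty state (`{"seq": 0, "queue": []}`) yields the same result, which is why the checkpoint is purely an optimization to avoid reading the whole WAL.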
Implementation
FileChannel lives in the flume-file-channel module of the Flume project, in the package org.apache.flume.channel.file. The queue described above corresponds to the FlumeEventQueue class, and the WAL corresponds to the Log class. The queue itself is a circular array backed by a memory-mapped file; the WAL is a set of files read and written through the LogFile class and its subclasses. Thank you for reading. Hopefully this article has given you a deeper understanding of what Apache Flume is and how FileChannel works; specific usage still needs to be verified in practice.