This article introduces Apache Flume's FileChannel: what it is for, how it is designed, and how it is implemented.
Flume uses a simple, extensible data model that supports online analytic applications.
FileChannel is a durable Flume channel that supports writing to multiple disks in parallel, as well as encryption.
Overview
When using Flume, each flow has a Source, a Channel, and a Sink. A typical example: a web server sends events over RPC to a Source (for example, an Avro Source), the Source writes the events to a MemoryChannel, and an HDFS Sink consumes the events from the MemoryChannel and writes them to HDFS.
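As a minimal sketch, a single-agent configuration for such a pipeline might look like the following Flume properties file. The agent and component names (a1, r1, c1, k1) and the host, port, and HDFS path values are illustrative assumptions, not taken from the article:

# Name the components of agent a1 (names are hypothetical)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro Source: receives events from the web server over RPC
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# MemoryChannel: high throughput, but contents are lost on a crash or power failure
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS Sink: drains the channel and writes to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.channel = c1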
MemoryChannel provides high throughput, but loses data on a power outage or process crash, hence the need for a durable channel. The goal of FileChannel is to provide a reliable, high-throughput channel: once a transaction has been committed, FileChannel guarantees that the data will not be lost in the face of subsequent process crashes or power outages.
It is worth noting that FileChannel does not replicate the data itself; it is therefore only as reliable as the underlying disks. Users who choose FileChannel for its durability should take this into account when purchasing and configuring hardware: the underlying disks should be RAID, SAN, or similar.
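For reference, switching the channel in the earlier sketch from memory to a FileChannel is a configuration change; checkpointDir and dataDirs are the standard FileChannel properties, and the paths below are illustrative assumptions (ideally each data directory sits on a separate disk, backed by RAID or similar):

# FileChannel: durable channel backed by a checkpoint plus WAL data directories
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/disk1/flume/data,/mnt/disk2/flume/data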
Many systems trade a small risk of data loss for higher throughput (for example, by syncing from memory to disk only every few seconds). The Flume team took a different approach when implementing FileChannel. Flume is a transactional system: multiple events can be put or taken in a single transaction, and the batch size can be used to control throughput. With sufficiently large batches, Flume can move data at high throughput with no data loss. The batch size is entirely under the client's control. This will be familiar to users of an RDBMS.
A Flume transaction consists of either Puts or Takes, never both, together with a commit or a rollback. Every transaction exposes the Put and Take operations: Sources call Put to write events into a Channel, and Sinks call Take to fetch events from a Channel.
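The following Java sketch illustrates this pattern using Flume's public Channel/Transaction API (org.apache.flume): a batched Put on the Source side and a single Take on the Sink side. The channel argument is assumed to be an already-configured channel instance, and error handling is simplified for illustration.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.event.EventBuilder;

// Sketch of the transactional Put/Take pattern described above.
public class ChannelTxExample {

    // What a Source does: put a batch of events in a single transaction.
    // Larger batches mean fewer commits (and, for FileChannel, fewer fsyncs),
    // which is how batch size controls throughput.
    static void putBatch(Channel channel, String... messages) {
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            for (String m : messages) {
                Event event = EventBuilder.withBody(m.getBytes(StandardCharsets.UTF_8));
                channel.put(event);
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();    // nothing from this transaction becomes visible to Sinks
            throw e;
        } finally {
            tx.close();
        }
    }

    // What a Sink does: take one event in a transaction.
    static Event takeOne(Channel channel) {
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            Event event = channel.take();   // null if the channel is empty
            tx.commit();
            return event;
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            tx.close();
        }
    }
}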
Design
FileChannel is based on a write-ahead log (WAL) in addition to an in-memory queue. Each transaction is written to the WAL according to its type (Take or Put), and the queue is modified accordingly. Each time a transaction is committed, fsync is called on the appropriate file to guarantee the data is actually on disk, and a pointer to the event is placed on the queue. The queue serves the same purpose as in any other Channel: it manages what has not yet been consumed by a Sink. During a Take, the pointer is removed from the queue and the event is read directly from the WAL; given the large amounts of RAM available today, these reads usually come straight from the operating system's file cache.
After a crash, the WAL can be replayed to put the queue back into the state it was in before the crash, so committed transactions are not lost. Replaying the WAL can be time-consuming, so the queue itself is written to disk periodically; writing the queue to disk is called a checkpoint. After a crash, the queue is loaded from disk and only the transactions committed after the queue was saved are replayed, which greatly reduces the amount of WAL that must be read.
For example, suppose the channel contains two events, a and b (the original article illustrates this state with a figure that is not reproduced here).
The WAL contains three important items: the transaction ID, the sequence number, and the event data. Each transaction has a unique transaction ID, and each entry has a unique sequence number. The transaction ID is used only to group entries into transactions, while the sequence number is used when replaying the log. In the example above, the transaction ID is 1 and the sequence numbers are 1, 2, and 3.
When the queue is saved to disk (a checkpoint), the sequence number is also incremented and stored with it. On restart, the queue is first loaded from disk, and then every WAL entry whose sequence number is greater than the queue's is replayed. The channel is locked during a checkpoint so that no Put or Take can change its state; allowing the queue to be modified during a checkpoint would result in an inconsistent snapshot of the queue being stored on disk.
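To make the checkpoint-plus-replay interaction concrete, here is a purely conceptual Java sketch. The WalEntry and Checkpoint types and the recover method are invented for illustration; they are not the actual classes in org.apache.flume.channel.file (those are described in the Implementation section below).

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Conceptual recovery sketch: start from the checkpointed queue, then
// re-apply only committed WAL entries written after the checkpoint.
class RecoverySketch {

    enum Type { PUT, TAKE, COMMIT, ROLLBACK }

    // Hypothetical shapes for a WAL entry and a checkpoint image.
    record WalEntry(long txId, long seqNo, Type type, long eventPointer) { }
    record Checkpoint(long seqNo, List<Long> eventPointers) { }

    static Deque<Long> recover(Checkpoint checkpoint, List<WalEntry> wal) {
        // 1. Load the queue image that was saved at checkpoint time.
        Deque<Long> queue = new ArrayDeque<>(checkpoint.eventPointers());

        // 2. Replay only entries newer than the checkpoint, skipping
        //    transactions that never committed.
        for (WalEntry e : wal) {
            if (e.seqNo() <= checkpoint.seqNo() || !committed(e.txId(), wal)) {
                continue;
            }
            switch (e.type()) {
                case PUT  -> queue.addLast(e.eventPointer());   // pointer into the WAL
                case TAKE -> queue.remove(e.eventPointer());    // event was consumed
                default   -> { }   // COMMIT/ROLLBACK markers change nothing here
            }
        }
        return queue;
    }

    static boolean committed(long txId, List<WalEntry> wal) {
        return wal.stream()
                  .anyMatch(e -> e.txId() == txId && e.type() == Type.COMMIT);
    }
}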
In the example queue above, a checkpoint occurs after transaction 1 commits, so events a and b in the queue are saved to disk with a sequence number of 4.
Event a is then taken in transaction 2.
If a crash occurs, the queue checkpoint is read from disk. Note that because the checkpoint happened before transaction 2, events a and b are both present in that queue. The WAL is then read, and any committed transaction with a sequence number greater than 4 is applied, which removes a from the queue (the WAL entry for that operation is [transaction 2, sequence number 5, take a]).
The design described so far leaves out two cases: Takes and Puts that are in progress at the time of the checkpoint would be lost. Suppose a checkpoint occurs after a has been taken and transaction 2 is then rolled back. If a crash occurred at that point, then under the design above event b would be on the checkpointed queue and, on replay, any WAL entries with a sequence number greater than 5 would be applied. The rollback of transaction 2 would be replayed, but the Take of transaction 2 would not be; as a result, a would never be placed back on the queue, resulting in data loss. A similar scenario plays out with Puts. For this reason, when a queue checkpoint occurs, transactions that are still in progress are also written out, so that these situations can be handled correctly.
Implementation
FileChannel lives in the flume-file-channel module of the Flume project, in the org.apache.flume.channel.file package. The queue described above is implemented by the FlumeEventQueue class, and the WAL by the Log class. The queue itself is a circular array backed by a memory-mapped file, while the WAL is a set of files written and read using the LogFile class and its subclasses.
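To illustrate what "a circular array backed by a memory-mapped file" means, here is a purely conceptual Java sketch. It is not the actual FlumeEventQueue, which additionally tracks headers, counts, and in-flight transactions; the class name, file layout, and methods below are assumptions for illustration only.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel.MapMode;

// Conceptual fixed-capacity circular queue of event pointers (longs)
// backed by a memory-mapped file.
class MappedCircularQueue {
    private final MappedByteBuffer buf;
    private final int capacity;
    private int head;
    private int size;

    MappedCircularQueue(String path, int capacity) throws Exception {
        this.capacity = capacity;
        try (RandomAccessFile f = new RandomAccessFile(path, "rw")) {
            buf = f.getChannel().map(MapMode.READ_WRITE, 0, (long) capacity * Long.BYTES);
        }
    }

    // Append a pointer at the tail of the circular array.
    void addLast(long pointer) {
        if (size == capacity) throw new IllegalStateException("queue full");
        buf.putLong(((head + size) % capacity) * Long.BYTES, pointer);
        size++;
    }

    // Remove and return the pointer at the head of the circular array.
    long removeFirst() {
        if (size == 0) throw new IllegalStateException("queue empty");
        long pointer = buf.getLong(head * Long.BYTES);
        head = (head + 1) % capacity;
        size--;
        return pointer;
    }

    // Force the mapped pages to disk, analogous to a checkpoint write.
    void checkpoint() {
        buf.force();
    }
}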