This article explains the overall architecture of Flume. The content is straightforward and easy to follow, and it should help clear up any doubts you have about how Flume is put together.
1. Flume introduction
Flume is an open-source project from Cloudera for the distributed, reliable collection, aggregation, and movement of large volumes of log data into storage. It provides reliable message delivery through a transaction mechanism, supports horizontal scaling with a built-in load-balancing mechanism, and ships with a set of ready-made components that can be used directly.
Common Flume application scenarios: logs -> Flume -> real-time computing (e.g. Kafka + Storm); logs -> Flume -> offline computing (e.g. HDFS, HBase); logs -> Flume -> ElasticSearch.
2. Overall structure
Flume is built from three main components: Source, Channel, and Sink; data flows through them in that order:
1. The Source is responsible for log ingestion, for example pulling data in from files, the network, Kafka, and other data sources. Data can enter in two ways: polling (pull) and event-driven (push).
2. The Channel is responsible for aggregating and temporarily buffering data, for example in memory, local files, a database, or Kafka. Log data does not stay in the pipeline for long; it is consumed by the Sink soon after it arrives.
3. The Sink is responsible for delivering data to storage, for example taking logs from the Channel and writing them directly to HDFS, HBase, Kafka, ElasticSearch, and so on; systems such as Hadoop, Storm, or ElasticSearch can then analyze or query the data.
An Agent hosts these three components together; the Source and Sink run asynchronously and do not block each other.
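As a concrete illustration, here is a minimal sketch of an agent configuration that wires the three components together. The agent name a1, the tail command, and the HDFS path are placeholders invented for this example, not values from the article:

    # one agent named a1 with one Source, one Channel, one Sink
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: tail the Nginx access log (exec source)
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/nginx/access.log
    a1.sources.r1.channels = c1

    # Channel: buffer events in memory until the Sink drains them
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 100

    # Sink: write events from the Channel into date-partitioned HDFS directories
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/nginx/%Y-%m-%d
    a1.sinks.k1.hdfs.useLocalTimeStamp = true

Started with something like flume-ng agent --conf conf --conf-file nginx-agent.conf --name a1, the Source pushes events into the Channel and the Sink drains them asynchronously, matching the flow described above.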
Suppose we are collecting and indexing Nginx access logs. Inside the Agent, events are processed as follows:
1. Logs collected by the Source are passed to the ChannelProcessor component, which first runs them through Interceptors. If you have worked with Servlets, the concept is similar to a Servlet Filter (see "Servlet 3.1 Specification Translation - Filter"). An Interceptor can drop events or modify their contents (a configuration sketch follows this list).
2. After interception, the event is handed to the ChannelSelector. Two selectors are provided by default: replicating and multiplexing. The replicating selector copies each event to multiple Channels, while the multiplexing selector routes an event to the matching Channel according to the configured selection conditions. Writing to multiple Channels can fail, and there are two ways to handle a failure: retry later or ignore it. Retries typically use an exponentially growing wait.
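Continuing the hypothetical a1 agent above, a sketch of this step: a regex_filter Interceptor drops unwanted lines and a multiplexing ChannelSelector routes events to c1 or c2 by the value of a logType header; the header name, regular expression, and channel names are illustrative assumptions:

    # Interceptor: drop health-check requests before they reach any Channel
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = regex_filter
    a1.sources.r1.interceptors.i1.regex = .*healthcheck.*
    a1.sources.r1.interceptors.i1.excludeEvents = true

    # ChannelSelector: replicating is the default; multiplexing routes by header value
    a1.sources.r1.channels = c1 c2
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = logType
    a1.sources.r1.selector.mapping.access = c1
    a1.sources.r1.selector.mapping.error = c2
    a1.sources.r1.selector.default = c1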
As mentioned earlier, the Source produces events into the Channel and the Sink consumes them from the Channel; the two are completely asynchronous, so a Sink only needs to watch the Channels it is attached to.
At this point we can filter or modify events at the Source and replicate or route an event to multiple Channels. On the Sink side, writes can fail as well, and Flume provides the following policies by default:
The failover policy assigns a priority to each Sink in a group. If one fails, events are routed to the Sink with the next-highest priority. A Sink is considered failed as soon as it throws an exception; it is then removed from the pool of live Sinks and retried after an exponentially growing wait, which by default starts at 1 second and is capped at 30 seconds.
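A sketch of the failover policy expressed as a sink group configuration; the sink names and priority values are illustrative, and maxpenalty caps the retry backoff at the 30-second default mentioned above:

    # two Sinks in one group; k1 is preferred, k2 takes over when k1 fails
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    # upper bound (in ms) on the exponential backoff before a failed Sink is retried
    a1.sinkgroups.g1.processor.maxpenalty = 30000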
Flume also provides a load balancing strategy:
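For comparison, a sketch of the load-balancing sink processor, which spreads events across the Sinks in the group instead of ranking them; round_robin and random are the built-in selectors, and backoff temporarily sidelines a Sink that throws:

    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = load_balance
    a1.sinkgroups.g1.processor.selector = round_robin
    # back off exponentially from failing Sinks instead of retrying them immediately
    a1.sinkgroups.g1.processor.backoff = true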
A typical production deployment is organized into layers:
1. First is the log collection layer: the Agent is deployed on the same machine as the application and collects access logs such as Nginx's, then forwards them to the collection/aggregation layer over RPC. This layer should do nothing more than pick logs up quickly and pass them on.
2. The collection/aggregation layer collects and aggregates the logs and can add fault tolerance, such as failover or load balancing, to improve reliability; a file Channel can also be enabled at this layer to act as a data buffer.
3. The collection/aggregation layer then filters or modifies the data and stores or forwards it, for example writing it to HDFS, or feeding it into Kafka so that Storm can process it in real time (a configuration sketch of this two-tier layout follows).
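A minimal two-tier sketch under the assumptions above: a collection-tier agent ships events over Avro RPC to an aggregation-tier agent, which buffers them in a file Channel before writing to HDFS. Hostnames, ports, and directories are placeholders:

    # collection tier (runs next to the application): forward events via Avro RPC
    app.sinks.k1.type = avro
    app.sinks.k1.channel = c1
    app.sinks.k1.hostname = collector.example.com
    app.sinks.k1.port = 4545

    # aggregation tier: receive Avro events, buffer on disk, write to HDFS
    coll.sources = r1
    coll.channels = c1
    coll.sinks = k1
    coll.sources.r1.type = avro
    coll.sources.r1.bind = 0.0.0.0
    coll.sources.r1.port = 4545
    coll.sources.r1.channels = c1
    coll.channels.c1.type = file
    coll.channels.c1.checkpointDir = /data/flume/checkpoint
    coll.channels.c1.dataDirs = /data/flume/data
    coll.sinks.k1.type = hdfs
    coll.sinks.k1.channel = c1
    coll.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/nginx/%Y-%m-%d
    coll.sinks.k1.hdfs.useLocalTimeStamp = true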
That is all of "What is the overall architecture of Flume?". Thank you for reading! I hope it has given you a clearer picture of how Flume is put together.