What is the Flume architecture like?

2025-07-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article introduces the Flume architecture: its layers, its core concepts, and the source, channel, and sink components in detail.

Flume introduction

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive volumes of log data. It supports custom data senders in the logging system for collecting data, and it can apply simple processing to the data and write it to various data receivers (such as text files, HDFS, and HBase). Official website: flume.apache.org/

Flume architecture

Flume uses a layered architecture with three tiers: agent, collector, and storage. Users can add their own agents, collectors, or storage as needed. Both the agent and the collector are built from a source and a sink, where the source is where data comes from and the sink is where data goes. (The three large tiers mirror the three small components inside each node, so Flume can equally be described as source, channel, and sink.)

Flume Core Concept

At the large scale:

Agent runs Flume in a JVM. Each machine runs one agent, but an agent can contain multiple sources and sinks.

Collector gathers data from multiple agents and loads it into storage.

Storage stores collected data.

Client produces data and runs in a separate thread.

At the small scale:

Source collects data from clients and passes it to Channel.

Sink takes data from the Channel and runs in a separate thread.

Channel connects sources and sinks, and behaves somewhat like a queue.

Event is the basic unit of data in Flume; it can be a log record, an Avro object, and so on.
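The small-scale components can be pictured as a producer, a queue, and a consumer. The following is a minimal Python sketch of that idea only, not Flume's own implementation (Flume itself runs on the JVM). The names Event, run_source, run_sink, and run_pipeline are hypothetical and chosen purely for illustration.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical stand-in for a Flume event: a body plus optional headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


_STOP = object()  # sentinel telling the sink that the source has finished


def run_source(lines, channel):
    """Source: wrap each client line in an Event and put it on the channel."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8")))
    channel.put(_STOP)


def run_sink(channel, storage):
    """Sink: take events off the channel and append their bodies to storage."""
    while True:
        item = channel.get()
        if item is _STOP:
            break
        storage.append(item.body.decode("utf-8"))


def run_pipeline(lines, capacity=100):
    """Wire source -> channel -> sink, with the sink in its own thread."""
    channel = queue.Queue(maxsize=capacity)  # the channel acts as a bounded queue
    storage = []
    sink_thread = threading.Thread(target=run_sink, args=(channel, storage))
    sink_thread.start()
    run_source(lines, channel)
    sink_thread.join()
    return storage


if __name__ == "__main__":
    print(run_pipeline(["GET /index.html 200", "GET /missing 404"]))
```

Because the queue is bounded, a slow sink makes the source block on put rather than lose events, which is one reason a queue is a reasonable mental model for a channel.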

Flume component details

Source collects data from clients and passes it to Channel.

Clients read from the original data source and hand the data to Flume. Flume supports Avro, log4j, syslog, and HTTP POST (with a JSON body). Applications can talk directly to existing sources such as AvroSource and SyslogTcpSource. You can also write your own source to connect an application over IPC or RPC. Both Avro and Thrift are supported (NettyAvroRpcClient and ThriftRpcClient each implement the RpcClient interface), and Avro is the default RPC protocol. For code-level details of client-side data access, refer to the official manual.

The least invasive approach is to read the log files that the program already writes. This gives near-seamless integration without any changes to the existing program.

Two sources can read files directly:

ExecSource: Continuously outputs the latest data by running a Linux command such as tail -F filename. The file name must be specified. ExecSource can collect logs in real time, but if Flume is not running or the command fails, log data is not collected, so the completeness of the log data cannot be guaranteed.

SpoolSource: Monitors a configured directory for new files and reads data from them. Two points need attention: a file copied into the spool directory must not be opened for editing afterwards, and the spool directory must not contain subdirectories.

SpoolSource does not collect data in real time, but if log files are split every minute, it comes close to real time.

If your application cannot split its log files every minute, the two collection methods can be combined. In practice, SpoolSource works well together with log4j: set log4j's file-splitting mechanism to roll once a minute, and copy each rolled file into the spool directory being monitored.

Log4j has a TimeRolling plugin that can split log4j files into the spool directory, which essentially achieves real-time monitoring. After Flume finishes transferring a file, it changes the file's suffix to .COMPLETED (the suffix can also be set in the configuration file).
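The spool-directory behavior described above (read each finished file once, then mark it with a suffix) can be sketched in a few lines of Python. This is an illustrative sketch under simplifying assumptions, not Flume's implementation: the function name drain_spool_dir is hypothetical, and the sketch ignores channels, transactions, and partially written files.

```python
import os
import tempfile


def drain_spool_dir(spool_dir, suffix=".COMPLETED"):
    """Read every unprocessed file in spool_dir once, then mark it as done.

    Returns the collected lines. Subdirectories and files that already
    carry the suffix are skipped.
    """
    collected = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path) or name.endswith(suffix):
            continue
        with open(path, encoding="utf-8") as handle:
            collected.extend(line.rstrip("\n") for line in handle)
        # Renaming marks the file as fully ingested so it is not read again.
        os.rename(path, path + suffix)
    return collected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spool:
        with open(os.path.join(spool, "app.log.1"), "w", encoding="utf-8") as f:
            f.write("first line\nsecond line\n")
        print(drain_spool_dir(spool))
        print(sorted(os.listdir(spool)))
        print(drain_spool_dir(spool))  # a second pass finds nothing new
```

The sketch also shows why files must not be edited after they land in the directory: a file is read in one pass and then renamed, so later appends would never be picked up.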

Channel connects sources and sinks, and behaves somewhat like a queue.

Several channels are currently available: Memory Channel, JDBC Channel, File Channel, and Pseudo Transaction Channel. The first three are the most common.

MemoryChannel offers high throughput, but it cannot guarantee data integrity.

According to the recommendations in the official documentation, FileChannel should be used in place of MemoryChannel.

FileChannel guarantees data integrity and consistency. When configuring FileChannel, it is recommended to put the FileChannel directory and the directory holding the program's log files on different disks to improve efficiency.

File Channel is a persistent channel: it persists every event to disk. Even if the Java VM crashes, the operating system crashes or restarts, or an event fails to reach the next agent in the pipeline, no data is lost. Memory Channel is a volatile channel because it keeps all events in memory. If the Java process dies, any events held in memory are lost. Memory Channel is also limited by the amount of RAM, and this is where File Channel has the advantage: as long as there is enough disk space, it can store all event data on disk.
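The durability difference between the two channels can be illustrated with a small Python sketch. The classes MemoryChannelSketch and FileChannelSketch are hypothetical and heavily simplified: they only cover putting events and recovering them after a restart, and they leave out takes, transactions, and checkpoints, which a real channel needs.

```python
import json
import os
import tempfile
from collections import deque


class MemoryChannelSketch:
    """Events live only in process memory and vanish with the process."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def pending(self):
        return list(self._events)


class FileChannelSketch:
    """Events are appended to a file on disk and reloaded on startup."""

    def __init__(self, data_file):
        self._data_file = data_file
        self._events = deque()
        if os.path.exists(data_file):
            # Recover events written before a crash or restart.
            with open(data_file, encoding="utf-8") as handle:
                self._events.extend(json.loads(line) for line in handle)

    def put(self, event):
        with open(self._data_file, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(event) + "\n")
            handle.flush()
            os.fsync(handle.fileno())  # force the write to disk before returning
        self._events.append(event)

    def pending(self):
        return list(self._events)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        path = os.path.join(data_dir, "channel.log")
        FileChannelSketch(path).put("event-1")
        # A new instance stands in for the process after a restart.
        print(FileChannelSketch(path).pending())
    MemoryChannelSketch().put("event-1")
    print(MemoryChannelSketch().pending())  # a fresh instance has nothing to recover
```

The fsync on every put is what buys durability, and it is also why a file-backed channel is slower than an in-memory one.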

Sink takes data from the Channel and runs in a separate thread.

A sink can store data in a file system, a database, or Hadoop. When the volume of log data is small, the data can be kept in the file system and saved at a set time interval. When the volume is large, the log data can be stored in Hadoop to make later analysis easier. For example, with collectorSink("fsdir","fsfileprefix",rollmillis), the data is aggregated by the collector and sent to HDFS, where fsdir is the HDFS directory and fsfileprefix is the file prefix.
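Saving data at a set time interval with a file prefix can be sketched as a sink that starts a new output file once the interval has elapsed. The Python sketch below is illustrative only: the class RollingFileSinkSketch, its parameters, and the file naming scheme are assumptions made for this example, it writes to a local directory rather than HDFS, and reading rollmillis as a roll interval in milliseconds is itself an assumption based on the parameter name.

```python
import os
import tempfile
import time


class RollingFileSinkSketch:
    """Write events to <directory>/<prefix>-<n>.log, rolling by elapsed time."""

    def __init__(self, directory, prefix, roll_millis, clock=time.monotonic):
        self._directory = directory
        self._prefix = prefix
        self._roll_seconds = roll_millis / 1000.0
        self._clock = clock  # injectable so the roll logic can be exercised
        self._index = 0
        self._opened_at = None

    def _current_path(self):
        return os.path.join(self._directory, f"{self._prefix}-{self._index}.log")

    def write(self, event):
        now = self._clock()
        if self._opened_at is None:
            self._opened_at = now
        elif now - self._opened_at >= self._roll_seconds:
            # The interval has elapsed: start a new file.
            self._index += 1
            self._opened_at = now
        with open(self._current_path(), "a", encoding="utf-8") as handle:
            handle.write(event + "\n")


if __name__ == "__main__":
    fake_times = iter([0.0, 0.5, 1.2, 1.3])  # seconds, supplied by a fake clock
    with tempfile.TemporaryDirectory() as out_dir:
        sink = RollingFileSinkSketch(out_dir, "events", roll_millis=1000,
                                     clock=lambda: next(fake_times))
        for event in ["a", "b", "c", "d"]:
            sink.write(event)
        print(sorted(os.listdir(out_dir)))
```

Injecting the clock keeps the example deterministic; with the default clock the sink rolls on real elapsed time.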
