This article mainly introduces what Flume is. It should have some reference value for interested readers; I hope you learn something useful from it as the editor walks you through it.
Flume is a highly available and reliable open-source distributed system for collecting massive amounts of log data, provided by Cloudera. Log data can be transferred through Flume to a storage destination. "Log" here is a general term that covers files, operation records and many other kinds of data.
1. Basic knowledge of Flume
1. Data flow model
The core task of Flume is to collect data from a data source and send it to a destination. To ensure that transmission succeeds, the data is cached before it is sent to the destination, and the cached copy is deleted only after the data has actually arrived at the destination.
The basic unit of data transmitted by Flume is the Event; for a text file this is usually a single line of a record, and it is also the basic unit of a transaction. An Event flows from Source to Channel and then to Sink; it is essentially a byte array that can carry headers. The Event is the smallest complete unit of the data stream, traveling from an external data source to an external destination.
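To make the Event structure concrete, here is a minimal sketch (purely illustrative: the agent name a1, the port 44444 and the posted values are assumptions, not part of this article's example) that uses an HTTP source, whose default JSON handler accepts an event as explicit headers plus a body:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# HTTP source listening on an assumed port
a1.sources.r1.type = http
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
# logger sink prints received events, convenient for inspection
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

# an Event is a set of headers plus a byte-array body; with the default JSON
# handler one can be posted as, for example:
# curl -X POST -d '[{"headers": {"host": "node1"}, "body": "one line of log"}]' http://localhost:44444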
2. Core components
The core of a running Flume instance is the Agent. An Agent is a complete data-collection tool with three core components: source, channel and sink. With these components, Events can flow from one place to another, as shown in figure 1-1, or through any combination of multi-level agent links, as shown in figure 1-2. (A minimal single-agent configuration is sketched after the component list below.)
Figure 1-1 flume data flow model
Figure 1-2 Multi-level agent connection model
1) Source: responsible for collecting logs; it can handle log data of various types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, legacy, custom sources, etc.
▶ Exec Source: continuously outputs the latest data by running a Linux command such as tail -F filename (the filename must be specified). Exec Source can collect logs in real time, but when Flume is not running or the command fails, log data cannot be collected and the integrity of the log data cannot be guaranteed.
▶ Spool Source: monitors new files in the configured directory and reads the data in those files. Two points to note: files copied into the spool directory must not be opened for editing afterwards, and the spool directory must not contain subdirectories.
2) Channel: responsible for temporary data storage; data can be kept in memory, jdbc, file, custom stores, etc. The stored data is not deleted until the sink has sent it successfully.
▶ Memory Channel: achieves high throughput, but cannot guarantee data integrity. Memory Channel is an unstable channel because it keeps all events in memory; if the Java process dies, any events held in memory are lost. In addition, the available space is limited by the size of RAM, which distinguishes it from File Channel.
▶ File Channel: ensures the integrity and consistency of the data. When configuring a File Channel, it is recommended to place the directory used by the File Channel and the directory holding the program's log files on different disks to improve efficiency. File Channel is a persistent channel: it persists all events to disk, so even if the Java virtual machine fails, the operating system crashes or restarts, or an event is not successfully passed to the next agent in the pipeline, no data is lost.
3) Sink: responsible for sending data to the destination, including hdfs, logger, avro, thrift, ipc, file, null, hbase, solr, custom sinks, etc.
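As a minimal sketch of how these three components are wired together (the agent name a1 and the tailed file path are assumptions for illustration, not the example used later in this article), an exec source running tail -F can feed a memory channel that a logger sink drains:

# illustrative agent: exec source -> memory channel -> logger sink
a1.sources = s1
a1.channels = c1
a1.sinks = k1

# exec source tailing an assumed log file
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /var/log/app.log
a1.sources.s1.channels = c1

# memory channel: fast, but events are lost if the process dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# logger sink: writes events to Flume's own log, useful for quick testing
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1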
3. Reliability
Flume uses a transactional approach to ensure reliability throughout the transmission of an Event. A Sink may remove an Event from its Channel only after the Event has been stored in the Channel of the next-hop agent or written to the external data destination. In this way, Events in the data stream, whether within one agent or across multiple agents, are handled reliably, because the transaction guarantees that an Event has been stored successfully before it is removed. However, different Channel implementations provide different degrees of recoverability, and therefore protect Events to different degrees: for example, a file channel saves events on local disk as a backup, while a memory channel keeps events in an in-memory queue, which is fast but cannot be recovered if lost.
II. Installation and use of Flume
1. Installation
Download a Flume release (this experiment uses apache-flume-1.5.2-bin.tar.gz) from the official website (http://flume.apache.org/download.html), extract it to the /usr/local directory, enter the flume-xx/conf directory, rename the environment template with mv flume-env.sh.template flume-env.sh, and then configure the JAVA_HOME path in flume-env.sh.
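The same steps condensed into shell commands (a sketch; the extraction path and the JAVA_HOME value are assumptions to be adapted to the local environment):

# extract the downloaded release into /usr/local
cd /usr/local
tar -zxvf apache-flume-1.5.2-bin.tar.gz
cd apache-flume-1.5.2-bin/conf

# create flume-env.sh from the shipped template and point it at the local JDK
mv flume-env.sh.template flume-env.sh
echo 'export JAVA_HOME=/usr/local/jdk' >> flume-env.sh    # JDK path is an assumption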
2. An example
In this example the Source is a Spooling Directory and the Sink writes to HDFS. The agent monitors the files in the /root/logs directory; as soon as a new file appears, its contents flow through the agent to the hdfs://cluster1/flume/%Y%m%d path on HDFS (if cluster1 cannot be resolved here, copy the Hadoop configuration files core-site.xml and hdfs-site.xml into Flume's conf directory).
Create a new test directory under the flume directory and, inside it, a new file named example with the following content:
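A hedged sketch of that preparation (the Hadoop and Flume installation paths are assumptions; adjust them to the local layout):

# let Flume resolve the HDFS nameservice "cluster1" by copying the Hadoop client configuration
cp /usr/local/hadoop/etc/hadoop/core-site.xml /usr/local/flume/conf/
cp /usr/local/hadoop/etc/hadoop/hdfs-site.xml /usr/local/flume/conf/

# create the directory that the spooling-directory source will monitor
mkdir -p /root/logs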
# define the agent name and the names of its source, channel and sink
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

# define the source
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /home/logs
agent1.sources.source1.fileHeader = false

# define an interceptor that adds a timestamp to each message
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

# define the channel
# channel data could also be kept in memory (but memory is easily lost), for example:
# agent1.channels.c1.type = memory
# agent1.channels.c1.capacity = 10000
# agent1.channels.c1.transactionCapacity = 100
# here a file channel is configured instead
agent1.channels.channel1.type = file
# checkpoint (backup) path
agent1.channels.channel1.checkpointDir = /root/flume_bak
# data save path
agent1.channels.channel1.dataDirs = /root/flume_tmp

# define the sink
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://cluster1/flume/%Y%m%d
agent1.sinks.sink1.hdfs.fileType = DataStream
# prefix of the file name stored on HDFS, format: 20140116-filename
agent1.sinks.sink1.hdfs.filePrefix = %Y-%m-%d
# do not roll files based on the number of events
agent1.sinks.sink1.hdfs.rollCount = 0
# generate a new file on HDFS when the file reaches 128 MB
agent1.sinks.sink1.hdfs.rollSize = 134217728
# generate a new file on HDFS every 60 seconds
agent1.sinks.sink1.hdfs.rollInterval = 60

# assemble source, channel and sink
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
To run the example, enter the /usr/local/flume directory and execute: bin/flume-ng agent -n agent1 -c conf -f test/example -Dflume.root.logger=DEBUG,console
Here -n specifies the agent name, -c specifies the configuration-file directory, -f specifies the configuration file, and -Dflume.root.logger=DEBUG,console sets the log level and sends the log output to the console.
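To verify the pipeline end to end, one quick check (a sketch; the test file name and the exact date directory are assumptions) is to drop a file into the monitored directory and then list the target HDFS path:

# place a test file into the spool directory monitored by the source
echo "hello flume" > /root/logs/test.log

# once rollInterval or rollSize triggers, the collected file appears under the date directory
hdfs dfs -ls /flume/$(date +%Y%m%d)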
Thank you for reading this article carefully. I hope this introduction to what Flume is has been helpful to you.