This article mainly introduces the use of Flume and should serve as a useful reference; I hope interested readers learn a lot from it. Let me take you through it below.
Flume is a collection tool. It can be understood as a data mover: it reads data locally and then transfers it to the HDFS distributed file system, bridging the files of the two systems. This is similar to how Sqoop bridges databases, transferring a MySQL database into HDFS or HBase.
Flume is equivalent to a pipe through which data flows: the input end is the source (a local directory, which Flume monitors), and the output end is the sink (an HDFS directory on the distributed file system).
Spooling Directory Source considerations:
Flume monitors a local directory, and the files in that directory must not change after they are placed there. You cannot point Flume directly at a log that a server such as nginx keeps writing to, because that file is being written continuously. Instead, let nginx roll its logs (for example, producing a new file every 10 minutes) and then mv or cp each completed file into the monitored directory. Flume cannot track a file whose contents keep changing; it can only detect changes in the number of files in the directory, and when a new file appears it picks up that log.
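For example, a spooldir source definition might look like the following sketch in a Flume 1.x properties file (the agent and component names, and the path, are placeholders):

    # Hypothetical spooldir source; names and path are illustrative.
    agent.sources.spool-src.type = spooldir
    agent.sources.spool-src.spoolDir = /data/logs/incoming
    # Files must be complete and immutable once they land here:
    # roll the nginx log first, then mv/cp the finished file in.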
Over the last two days I took a closer look at the Flume centralized logging system (version 1.3.x). In my opinion, Flume is still a very good log collection system, with a simple and easy-to-use design. It is an open-source project written in Java, so custom functionality can be developed for it. The machine running Flume must have JDK 6.0 or above installed, and Flume currently ships startup scripts only for Linux, not for Windows.
Flume is mainly composed of three important components (a configuration sketch follows this list):
Source: collects the log data, packages it into transactions and events, and puts it into the Channel.
Channel: provides a queue, acting as a simple buffer for the data supplied by the Source.
Sink: takes data from the Channel and writes it to a file system or database, or forwards it to a remote server.
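To make the wiring concrete, here is a minimal sketch of a complete agent in the Flume 1.x properties format, tying one Source to one Sink through one Channel. The agent name a1 and all paths are assumptions for illustration:

    # Name the components of agent a1
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: read completed files from a spooled local directory
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/flume-spool
    a1.sources.r1.channels = c1

    # Channel: in-memory queue buffering events between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Sink: write events to HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
    a1.sinks.k1.channel = c1

On Linux, such an agent would typically be started with something like: flume-ng agent --conf conf --conf-file example.conf --name a1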
The approach that requires the smallest change to an existing application is to read the log files the application already writes; this integrates seamlessly, with no modification to the application at all.
There are two Source types for reading files directly:
ExecSource: runs a Linux command that continuously outputs the latest data, such as tail -F filename (the filename must be specified); see the sketch after this list.
SpoolSource: monitors the configured directory for new files and reads the data in them.
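As a sketch, an ExecSource that tails a log file could be declared like this (the component names and file path are hypothetical):

    # Hypothetical exec source tailing a file
    agent.sources.tail-src.type = exec
    agent.sources.tail-src.command = tail -F /var/log/app/app.log
    # Note: if the tail process dies or Flume restarts, events produced
    # in the gap are lost; exec offers no delivery guarantee.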
There are two points to note:
1. Files copied to the spool directory can no longer be opened for editing.
2. The spool directory must not contain subdirectories. In practice, SpoolSource can be combined with log4j: configure log4j to roll its log file once per minute and copy each rolled file into the spool directory. Log4j has a time-rolling plug-in that can move the split files into the spool directory automatically, which achieves near-real-time monitoring. After a file has been transferred, Flume renames it with the suffix .COMPLETED (the suffix can also be changed in the configuration file, as sketched below).
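For instance, the suffix can be overridden on the spooldir source roughly as follows (names and path are placeholders):

    agent.sources.spool-src.type = spooldir
    agent.sources.spool-src.spoolDir = /data/logs/incoming
    # Rename finished files with a custom suffix instead of the
    # default .COMPLETED:
    agent.sources.spool-src.fileSuffix = .DONE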
Comparison of ExecSource and SpoolSource:
ExecSource can collect logs in real time, but if Flume is not running or the command fails, log data is lost and its integrity cannot be guaranteed.
Although SpoolSource cannot collect data in real time, splitting files at one-minute granularity approaches real time. If the application cannot roll its log files by the minute, the two collection methods can be used in combination.
Several Channel implementations are available: MemoryChannel, JDBC Channel, MemoryRecoverChannel, and FileChannel.
MemoryChannel achieves high throughput but cannot guarantee the integrity of the data.
The official documentation recommends replacing MemoryRecoverChannel with FileChannel.
FileChannel ensures the integrity and consistency of data.
When configuring FileChannel, it is recommended to put the FileChannel directory and the directory holding the program's log files on different disks, in order to improve efficiency, as sketched below.
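A FileChannel configured along those lines might look like this sketch, assuming /disk2 is a separate disk from the one holding the application's own logs:

    agent.channels.file-ch.type = file
    # Keep checkpoint and data directories on a different disk
    # from the application's log files (paths are illustrative):
    agent.channels.file-ch.checkpointDir = /disk2/flume/checkpoint
    agent.channels.file-ch.dataDirs = /disk2/flume/data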
When storing data, the Sink can write to a file system, a database, or Hadoop (HDFS, HBase). When the log volume is small, the data can be stored in the file system, with a fixed time interval for saving it. When the log volume is large, the data can be stored in Hadoop to facilitate later analysis.
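As an illustration of the time-interval idea, an HDFS sink can be told to roll its output file on a timer. The sketch below (names, path, and values are assumptions) closes the current file and starts a new one every 10 minutes:

    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
    # Roll purely on time: every 600 seconds, never on size or event count
    agent.sinks.hdfs-sink.hdfs.rollInterval = 600
    agent.sinks.hdfs-sink.hdfs.rollSize = 0
    agent.sinks.hdfs-sink.hdfs.rollCount = 0
    # Needed because the path uses time escapes like %Y-%m-%d:
    agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true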