This article explains how to use the Flume log collection framework. The method described here is simple, fast, and practical.
Contents: installation and deployment of the Flume log collection framework, the Flume operation mechanism, collecting static files to HDFS, collecting dynamic log files to HDFS, and cascading two agents.
Flume log collection framework
In a complete offline big-data processing system, besides the hdfs+mapreduce+hive analysis core, auxiliary systems such as data collection, result-data export, and task scheduling are indispensable. Each of these auxiliary tasks has a convenient open-source framework in the hadoop ecosystem.
1 Flume introduction
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive amounts of log data. Flume can collect source data in many forms such as files, folders, socket packets, and kafka, and can sink the collected data to many external storage systems such as HDFS, HBase, Hive, and Kafka.
General collection requirements can be met with simple flume configuration.
Flume can also be customized and extended for special scenarios, so it can be applied to most everyday data collection scenarios.
2 Flume operation mechanism
The core role in the Flume distributed system is the agent; a flume collection system is formed by connecting agents one after another. Each agent acts as a data courier and has three components:
Source: acquisition component for interfacing with data sources to obtain data
Sink: sinking component for passing data to the next level of agent or to the final storage system
Channel: transport channel component for passing data from source to sink
A single agent collects data:
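To make the three components concrete, here is a minimal sketch of a single-agent configuration, using a netcat source feeding a logger sink through a memory channel (names and values are illustrative and not part of the collection schemes below):

# minimal example: netcat source -> memory channel -> logger sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# the source listens for text lines on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# the channel buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# the sink writes every event to the agent's log
a1.sinks.k1.type = logger

# wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1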
Cascading of multiple agents:
3 Installation and deployment of Flume
1 Download the installation package apache-flume-1.9.0-bin.tar.gz and extract it.
2 Add JAVA_HOME to flume-env.sh in the conf folder:
export JAVA_HOME=/usr/local/bigdata/java/jdk1.8.0_211
3 Add a configuration file for the collection scheme according to your requirements; the file name can be anything. See the examples below for details.
4 Start flume.
In the test environment:
$ bin/flume-ng agent -c conf/ -f ./dir-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console
Command description:
-c: specifies flume's own configuration directory; it does not need to be modified.
-f: specifies your own configuration file, here the dir-hdfs.conf in the current directory.
-n: specifies which agent in the configuration file to run, using the name defined there.
-Dflume.root.logger: prints INFO-level logs to the console. This is for testing only; later the logs will be written to a log file.
In production, flume should be started in the background:
nohup bin/flume-ng agent -c ./conf -f ./dir-hdfs.conf -n agent1 1>/dev/null 2>&1 &
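To confirm that the background agent is running, a quick check such as the following can be used (a sketch; logs/flume.log assumes the default log4j.properties shipped with flume):

# check that the agent process is alive
ps -ef | grep flume-ng

# follow flume's own log file
tail -f logs/flume.log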
4 Collecting static files to HDFS
4.1 Collection requirements
Under a particular directory on a server, new files are generated continuously. Whenever a new file appears, it needs to be collected into HDFS.
4.2 Add the configuration file
Add the file dir-hdfs.conf in the installation directory, and then add configuration information.
First define an agent and name it agent1; all of the following configuration keys are prefixed with agent1, and you can change the name to something else, such as agt1. A single configuration file can contain several schemes, and you choose the one to use by name when starting the agent.
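For instance, if the same dir-hdfs.conf also contained a second scheme whose keys were prefixed with agt1, it would be selected by name at start-up (illustrative command only):

bin/flume-ng agent -c conf/ -f dir-hdfs.conf -n agt1 -Dflume.root.logger=INFO,console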
According to the requirements, first define the following three elements
Data source component
That is, the source, which monitors a file directory: spooldir. The spooldir source has the following features:
Monitor a directory and collect the contents of a file whenever a new file appears in the directory
Once a file has been fully collected, the agent automatically renames it with a suffix: COMPLETED (configurable)
Duplicate files with the same file name are not allowed in the monitored directory
Sinking component
That is, the sink, writing to the HDFS file system: hdfs sink
Channel component
That is, the channel; either a file channel or a memory channel can be used.
# define the names of the three components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# configure the source component
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/log/
agent1.sources.source1.fileSuffix = .FINISHED
# maximum length of each line of the file; note that lines longer than this
# are truncated, which causes data loss
agent1.sources.source1.deserializer.maxLineLength = 5120

# configure the sink component
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://Master:9000/access_log/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = app_log
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text

# roll: rules that control when the output file is rotated
# roll by file size (bytes)
agent1.sinks.sink1.hdfs.rollSize = 512000
# roll by number of events
agent1.sinks.sink1.hdfs.rollCount = 1000000
# roll by time interval (seconds)
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# channel component configuration
agent1.channels.channel1.type = memory
# maximum number of events cached in the channel
agent1.channels.channel1.capacity = 500000
# number of events handled per flume transaction
agent1.channels.channel1.transactionCapacity = 600

# bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
Channel parameter explanation (an illustrative snippet follows the list):
capacity: the maximum number of events that the channel can store
transactionCapacity: the maximum number of events taken from the source, or delivered to the sink, in a single transaction
keep-alive: the time allowed for adding an event to, or removing an event from, the channel
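For example, a memory channel with all three parameters set explicitly might look like the snippet below (values match the configuration above; keep-alive is in seconds):

agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600
agent1.channels.channel1.keep-alive = 3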
4.3 Start flume
$ bin/flume-ng agent -c conf/ -f dir-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console
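To verify the scheme, you can drop a new file into the monitored directory and then check HDFS; a minimal sketch, assuming /tmp/access.log exists as sample data:

# copy a new file into the monitored directory (the name must not already exist there)
cp /tmp/access.log /root/log/access_$(date +%s).log

# once collected, the file is renamed with the .FINISHED suffix
ls /root/log/

# the data should appear under the date/time-partitioned HDFS path
hdfs dfs -ls /access_log/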
5 Collecting dynamic log files to HDFS
5.1 Collection requirements
For example, the logs produced by a business system using log4j keep growing, and the data appended to the log file needs to be collected to HDFS in real time.
5.2 Configuration file
Configuration file name: tail-hdfs.conf. According to the requirements, first define the following three elements:
Collection source, i.e. the source, which monitors updates to the file content: exec 'tail -F file'
Sinking target, i.e. the sink, writing to the HDFS file system: hdfs sink
Transfer channel between source and sink, i.e. the channel; either a file channel or a memory channel can be used
Configuration file content:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/app_weichat_login.log

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://Master:9000/app_weichat_login_log/%y-%m-%d/%H-%M
a1.sinks.k1.hdfs.filePrefix = weichat_log
a1.sinks.k1.hdfs.fileSuffix = .dat
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollSize = 100
a1.sinks.k1.hdfs.rollCount = 1000000
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

5.3 Start flume
Start the command:
bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1
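To produce test data for this agent, a simple loop can keep appending lines to the monitored log file (a hedged sketch; the message format is arbitrary):

while true; do
  echo "login event $(date +%s)" >> /root/app_weichat_login.log
  sleep 0.5
done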
6 Cascading two agents
The first agent gets data from the tail command and sends it to an avro port; another node configures an avro source to relay the data and send it on to external storage.
Configuration of the first agent:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/access.log

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hdp-05
a1.sinks.k1.port = 4141
# number of events sent per batch
a1.sinks.k1.batch-size = 100

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
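The two agents must be started in the right order: the downstream agent with the avro source (avro-hdfs.conf, shown next) has to be listening on hdp-05:4141 before the tail agent starts. A hedged sketch, assuming the configuration above is saved as tail-avro.conf (a file name not given in the original):

# on hdp-05: start the receiving agent first (avro source -> hdfs sink)
bin/flume-ng agent -c conf -f conf/avro-hdfs.conf -n a1

# on the sending node: then start the tail -> avro agent
bin/flume-ng agent -c conf -f conf/tail-avro.conf -n a1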
The second agent receives data from the avro port and sinks the data to HDFS.
Collection configuration file: avro-hdfs.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# the avro source acts as a receiving server
a1.sources.r1.type = avro
a1.sources.r1.bind = hdp-05
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/taildata/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = tail-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 24
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 50
a1.sinks.k1.hdfs.batchSize = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# generated file type; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

At this point you should have a deeper understanding of how to use the Flume log collection framework; the best way to consolidate it is to try it out in practice.