A Detailed Explanation of the Flume Big Data Log Collection System


I. Introduction to Flume

Flume is a highly available, reliable, distributed system for massive log collection, aggregation, and transmission, originally provided by Cloudera. Flume supports customizing various data senders in the log system to collect data, and it also provides the ability to perform simple processing on the data and write it to various (customizable) data receivers.

II. Function Introduction

Log collection

Flume was originally a log collection system provided by Cloudera and is now an Apache project. Flume supports customizing various data senders in the log system to collect data.

Typical flow: Flume monitors a specified file (for example file.txt) or a specified IP address and port, detects new data, receives it, buffers it in memory, and then stores it to the local hard disk.

Data processing.

Flume provides the ability to perform simple processing on data and write it to various (customizable) data receivers. Flume can collect data from sources such as Console, RPC (Thrift-RPC), Text (files), Tail (UNIX tail), Syslog (the syslog system, supporting TCP and UDP modes), and exec (command execution).

III. Flume Architecture

Flume is logically divided into three layers: Agent, Collector, and Storage.

Flume OG adopts a multi-Master approach. To keep the configuration data consistent, Flume introduces ZooKeeper to store it; ZooKeeper itself guarantees the consistency and high availability of the configuration data and can notify the Flume Master nodes when the configuration changes. A gossip protocol is used to synchronize data between the Flume Masters.

The characteristics of Flume OG are:

Flume OG has three node roles: the agent node, the collector node, and the master node.

The agent collects log data from various data sources and sends it to the collector, which aggregates the data and stores it in HDFS. The master is responsible for managing the activities of the agents and collectors.

Both agents and collectors are called nodes; depending on the configuration, nodes are divided into logical nodes and physical nodes.

Both agents and collectors are composed of a source and a sink: data enters the current node through the source and leaves it through the sink.

Introduction to Flume-NG Architecture

The most obvious change in Flume NG is that it removes the centralized management by Master and ZooKeeper and becomes a pure transport tool. Another major difference in Flume NG is that reading and writing are now handled by different worker threads (called Runners). In Flume OG, the read thread also did the write work (except for failure retries); if the write was slow (though not a complete failure), it would block Flume's ability to receive data. The asynchronous design in NG lets the read thread work smoothly without being affected by any downstream problems.

The characteristics of Flume NG are:

NG has only one node role: the agent.

There are no collector or master nodes; this is the core change to the core components.

The concepts of physical nodes and logical nodes, and everything related to them, have been removed.

The composition of the agent node has also changed: a Flume NG agent is composed of a source, a channel, and a sink.

IV. Introduction to the Three Major Components of Flume (source, channel, sink)

Flume uses the Agent as its smallest independent unit of operation. An Agent is where a data flow originates in Flume, and each Agent is a JVM process. A single Agent consists of three components: Source, Channel, and Sink.

Source: collects the log data, packages it into transactions and events, and puts it into the Channel.

Channel: mainly provides a queue, acting as a simple buffer for the data delivered by the Source.

Sink: takes the data out of the Channel and stores it in a file system or database, or submits it to a remote server.

The approach that requires the least change to an existing program is to read the log files the program already writes; Flume can then be integrated seamlessly without modifying the program at all. A minimal end-to-end configuration is sketched below.
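For illustration only, here is a minimal single-agent configuration in the style of the official user guide, assuming it is saved as example.conf; the names a1, r1, c1, and k1 are placeholders, and a netcat source and logger sink are used just to show how the three components are wired together:

# example.conf: a minimal single-node agent (placeholder names a1/r1/c1/k1)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source: turns each line received on the port into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# memory channel: buffers events in memory
a1.channels.c1.type = memory

# logger sink: prints events to the console
a1.sinks.k1.type = logger

# wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Such an agent could be started with flume-ng agent -c conf -f example.conf -n a1 -Dflume.root.logger=INFO,console and tested by connecting to port 44444 with telnet or nc; a complete worked example using an Avro source follows later in this article.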

1. Source

Flume has many types of Source; see the official user guide:

http://flume.apache.org/FlumeUserGuide.html#flume-sources

The agent Source types of Flume are summarized as follows:

Avro Source: supports the Avro protocol (actually Avro RPC) and listens on a configured address and port for incoming Avro messages; for example, a Log4j Appender can send messages to the Agent through an Avro Source.

Thrift Source: supports the Thrift protocol and provides a Thrift interface, similar to the Avro Source.

Exec Source: when the Source starts, it runs a configured UNIX command (such as cat file) that continuously writes to standard output (stdout); the output is packaged into Events for processing.

JMS Source: reads data from a JMS system (queues or topics), such as ActiveMQ.

Spooling Directory Source: watches a directory and, when a new file appears, packages the contents of the file into Events for processing.

Netcat Source: listens on a port and turns each line of text received on that port into an Event.

Sequence Generator Source: a sequence-generator data source that produces sequential data.

Syslog Sources: read syslog data and generate Events; both UDP and TCP are supported.

HTTP Source: a data source based on HTTP POST or GET, supporting JSON and BLOB representations.

Legacy Sources: compatible with Sources from older Flume OG versions (0.9.x).

Custom Source: a user-defined Source implemented against the interfaces provided by Flume.

There are two main Sources for reading files directly:

Exec Source

Data is gathered by a configured Unix command, the most common being tail -F [file].

This achieves real-time transfer, but if Flume is not running or the command fails, data will be lost, and resuming from a breakpoint is not supported: because the position in the file is not recorded, there is no way to know where to continue reading next time. This is a particular problem while log files keep growing: if Flume's source dies, any log content appended while the source is down can no longer be read by it. Flume does have an execStream extension, which lets an externally written tool watch the log for growth and send the appended lines to a Flume node, which then forwards them on to the sink node. It would be even better if the tail-style source itself could buffer, on the node, the content written while the next node is down and resume transmitting it once that node comes back up.
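As a sketch only (the agent name a1 and the log path are assumptions, and the channel and sink are the same minimal ones used above), an exec source tailing an application log could be configured like this:

# exec-tail.conf: exec source sketch; the data-loss caveats described above still apply
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
# restart the command if it exits, e.g. after the log file is rotated away
a1.sources.r1.restart = true
a1.sources.r1.restartThrottle = 10000
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1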

Spooling Directory Source

SpoolSource monitors the configured directory for newly added files and reads the data in them, achieving near-real-time collection. Two points to note:

1. Files copied into the spool directory must not be opened or edited afterwards.

2. The spool directory must not contain subdirectories.

In practice, SpoolSource can be combined with log4j: set log4j's file rolling to once per minute and copy the rolled files into the directory monitored by the spool source. Log4j has a TimeRolling plug-in that can move the rolled files into the spool directory, which essentially achieves real-time monitoring. After a file has been ingested, Flume renames it with the suffix .COMPLETED (the suffix can be changed in the configuration file). A matching configuration sketch follows below.
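The matching sketch (the spool directory path is an assumption; fileSuffix corresponds to the .COMPLETED rename mentioned above, and the channel and sink are again the minimal ones):

# spooldir.conf: spooling directory source sketch
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/flume-spool
# suffix appended to a file once it has been fully ingested
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1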

Note: comparison of ExecSource and SpoolSource

ExecSource enables real-time log collection, but when Flume is not running or the command fails, log data cannot be collected and its integrity cannot be guaranteed. SpoolSource cannot collect data in real time, but log files can be rolled by the minute to approach real time. If the application cannot roll its log files by the minute, the two collection methods can be combined.

2. Channel

Several Channels are currently available: Memory Channel, JDBC Channel, File Channel, and Pseudo Transaction Channel. The first three are the more common ones.

Memory Channel achieves high-speed throughput but cannot guarantee data integrity.

Memory Recover Channel has been deprecated; the official documentation recommends replacing it with File Channel.

File Channel guarantees the integrity and consistency of the data. When configuring a File Channel, it is recommended to place its directory and the directory where the program writes its log files on different disks in order to improve efficiency.

File Channel is a persistent Channel: it persists all events to disk. Therefore, even if the Java virtual machine dies, the operating system crashes or restarts, or an event has not yet been passed to the next agent in the pipeline, no data is lost. Memory Channel is a volatile channel, because it keeps all events in memory: if the Java process dies, any events still in memory are lost. In addition, the memory space is limited by the size of RAM, whereas File Channel has the advantage that, as long as there is enough disk space, it can store all event data on disk.
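As a sketch of the recommendation above (the directory paths are assumptions), the channel section of an agent configuration using a File Channel might look like this:

a1.channels = c1
a1.channels.c1.type = file
# keep checkpoint and data directories on a disk separate from the application's own log files
a1.channels.c1.checkpointDir = /data1/flume/checkpoint
a1.channels.c1.dataDirs = /data1/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000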

Channel types supported by Flume:

Memory Channel: event data is stored in memory.

JDBC Channel: event data is stored in a persistent store; currently the built-in Derby database is supported.

File Channel: event data is stored in files on disk.

Spillable Memory Channel: event data is stored in memory and on disk; when the in-memory queue is full, events are persisted to disk files (currently experimental, not recommended for production environments).

Pseudo Transaction Channel: for testing purposes only.

Custom Channel: a user-provided Channel implementation.

3. Sink

When configuring where the Sink stores data, it can target the file system, a database, or Hadoop. When the volume of log data is small, the data can be kept in the file system and saved at a fixed time interval; when the volume of log data is large, the log data can be stored in Hadoop to facilitate later analysis.
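For example, the sink section of an agent configuration writing to Hadoop could be sketched as follows (the NameNode address and path pattern are assumptions); it buckets events into a directory per day and rolls a new file every 10 minutes:

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://master:9000/flume/events/%Y-%m-%d
# write plain text instead of the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream
# roll a new file every 10 minutes; disable size- and count-based rolling
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
# needed because the path uses time escapes and events may carry no timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1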

Sink types supported by Flume:

HDFS Sink: data is written to HDFS.

Logger Sink: data is written to the log at INFO level (typically shown on the console).

Avro Sink: data is converted into Avro events and sent to the configured RPC port.

Thrift Sink: data is converted into Thrift events and sent to the configured RPC port.

IRC Sink: data is replayed on IRC.

File Roll Sink: data is stored in the local file system.

Null Sink: all data is discarded.

HBase Sink: data is written to an HBase database.

Morphline Solr Sink: data is sent to a Solr search server (cluster).

ElasticSearch Sink: data is sent to an Elasticsearch search server (cluster).

Kite Dataset Sink: data is written to a Kite Dataset (experimental).

Custom Sink: a user-provided Sink implementation.

Flume provides a large number of built-in Source, Channel, and Sink types, and different types of Source, Channel, and Sink can be combined freely. The combination is driven by a user-supplied configuration file, which is very flexible. For example, a Channel can hold events temporarily in memory or persist them to the local hard disk, and a Sink can write logs to HDFS or HBase, or even forward them to another Source. Flume also allows users to build multi-level flows, i.e. multiple Agents working together, and supports fan-in, fan-out, contextual routing, and backup routes.
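A two-tier flow of this kind can be sketched by pointing an upstream agent's Avro sink at a downstream agent's Avro source; the host names, ports, and paths below are assumptions:

# tier 1 agent (a1), on each application host: tail a log and forward it over Avro
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector01
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1

# tier 2 agent (a2), on the collector host: receive Avro events and write them to HDFS
a2.sources = r2
a2.channels = c2
a2.sinks = k2
a2.sources.r2.type = avro
a2.sources.r2.bind = 0.0.0.0
a2.sources.r2.port = 4545
a2.sources.r2.channels = c2
a2.channels.c2.type = file
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://master:9000/flume/aggregated/%Y-%m-%d
a2.sinks.k2.hdfs.useLocalTimeStamp = true
a2.sinks.k2.channel = c2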

Collector

The Collector concept no longer exists in Flume NG. In Flume OG, the Collector's function is to aggregate the data of multiple Agents and load it into Storage.

Storage

Storage is the storage layer, which can be an ordinary file, HDFS, Hive, HBase, and so on.

Master

(This applies to the OG version only.)

The Master is the controller of the Flume cluster; it manages and coordinates the configuration of the Agents and Collectors.

In Flume, the most important abstraction is the data flow, which describes the path along which data is generated, transmitted, processed, and eventually written to its destination.

For an Agent, the data flow configuration specifies where the data is obtained from and to which Collector it is sent.

A Collector receives the data sent by the Agents and forwards it to the specified target machine.

V. Introduction to Flume Features

(1) Reliability

When a node fails, logs can be delivered to other nodes without being lost. Flume provides three levels of reliability guarantees, from strong to weak:

End-to-end (the agent first writes the event to disk when the data is received and deletes it only after the transfer has succeeded; if the transmission fails, the data can be re-sent).

Store on failure (this is also the strategy adopted by Scribe: when the data receiver crashes, the data is written locally and sending resumes after recovery).

Best effort (no acknowledgement is made after the data is sent to the receiver).

(2) Scalability

Flume (OG) uses a three-tier architecture of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed centrally by a master, which makes the system easy to monitor and maintain; multiple masters are allowed (managed and load-balanced with ZooKeeper), which avoids a single point of failure.

(3) Manageability

All agents and collectors are managed centrally by the master, which makes the system easy to maintain. With multiple masters, Flume uses ZooKeeper and gossip to keep dynamic configuration data consistent. On the master, users can view the status of each data source or data flow, and each data source can be configured and reloaded dynamically. Flume provides both a web interface and shell script commands for managing data flows.

(4) Functional extensibility

Users can add their own agents, collectors, or storage back ends as needed. Flume also ships with many components, including various agents (file, syslog, etc.), collectors, and storage back ends (file, HDFS, etc.).

Summary:

Flume is a distributed, reliable, and highly available system for massive log collection, aggregation, and transmission. It supports customizing various data senders in the log system for data collection; at the same time, Flume provides the ability to perform simple processing on the data and write it to various data receivers (such as text, HDFS, HBase, etc.).

The data flow in Flume is carried end to end by Events. An Event is Flume's basic unit of data: it carries the log data (as a byte array) together with header information. Events are generated by a Source from data outside the Agent; when the Source captures an event, it formats it and then pushes it into one or more Channels. The Channel can be thought of as a buffer that holds the event until a Sink has finished processing it. The Sink is responsible for persisting the log or pushing the event on to another Source.
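To make the event structure concrete, here is a sketch that assumes an agent with an HTTP source listening on port 5140 (this source is not part of the setup described later in this article); Flume's default JSONHandler accepts a JSON array of events in exactly this headers-plus-body shape:

# configuration fragment for the assumed HTTP source
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

# inject one event whose headers and body mirror the Event structure described above
curl -X POST -H 'Content-Type: application/json' \
  -d '[{"headers": {"host": "web01", "level": "INFO"}, "body": "a single log line"}]' \
  http://localhost:5140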


----------------------------------------

1. Introduction to Building and Installing the Flume Environment

Flume official website: http://flume.apache.org/download.html

Version 1.6 is recommended; version 1.7 currently has some minor problems that remain to be solved and may be fixed by the project later.

The cluster has 3 nodes, and Flume is installed on each of them.

Decompress after downloading: tar -zxvf apache-flume-1.7.0-bin.tar.gz -C /opt/modules/flume

Hadoop and the JDK are assumed to be installed already, so those installation steps are skipped.

Create the configuration file from the template: cp flume-env.sh.template flume-env.sh

vim flume-env.sh

export JAVA_HOME=/usr/local/java_1.7.0_25

Set the system environment variables:

export FLUME_HOME=/opt/modules/flume

export PATH=$PATH:$FLUME_HOME/bin

After saving and exiting, run source /etc/profile for the changes to take effect immediately.

Test: enter flume-ng version on the terminal

If the version information is printed, Flume has been set up successfully.

Flume can then be distributed to the other nodes:

scp -r /opt/modules/flume/* root@slave1:/opt/modules/flume

scp -r /opt/modules/flume/* root@slave2:/opt/modules/flume

2. Collecting Data and Testing

1) Avro

The Avro client can send a given file to Flume's Avro source using the Avro RPC mechanism.

Create an agent configuration file:

vi /opt/modules/flume/conf/avro.conf

Add the following:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Explanation of the above:

Agent name: a1 is the name of the Agent we want to start.

a1.sources = r1 names the Agent's source r1

a1.sinks = k1 names the Agent's sink k1

a1.channels = c1 names the Agent's channel c1

# Describe/configure the source

a1.sources.r1.type = avro specifies that r1 is of type avro

a1.sources.r1.bind = 0.0.0.0 binds the Source to an IP address (here 0.0.0.0, i.e. all local addresses)

a1.sources.r1.port = 4141 specifies 4141 as the communication port

# Describe the sink

a1.sinks.k1.type = logger specifies that k1 is of type logger (no physical file is generated; events are only displayed on the console)

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

Explanation of this configuration:

Specifies that the Channel is of type memory.

Sets the maximum number of events the Channel can hold to 1000.

Sets the maximum number of events taken from the source, or delivered to the sink, per transaction to 100.

You can also set other properties of Channel here:

a1.channels.c1.keep-alive = 1000    the time (in seconds) allowed for an event to be added to, or removed from, the channel

a1.channels.c1.byteCapacity = 800000    limit on the total number of bytes of events in the channel, counting only the event bodies

a1.channels.c1.byteCapacityBufferPercentage = 20    reserves 20% of byteCapacity as a buffer for event headers, since byteCapacity itself only counts the event bodies

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Bind the source and the sink to channel c1, respectively.

Start Flume agent a1:

flume-ng agent -c . -f /home/bigdata/flume/conf/avro.conf -n a1 -Dflume.root.logger=INFO,console

-c: the directory containing the Flume configuration files (by default $FLUME_HOME/conf)

-f: the configuration file that defines the Flume components

-n: the name of the Agent to start, as defined in the component configuration file

-Dflume.root.logger: Flume's own runtime logging; configured as needed, here INFO level printed to the console

Start the flume process on Node 2

flume-ng agent -c . -f /opt/modules/flume/conf/avro.conf -n a1 -Dflume.root.logger=INFO,console    # start command

Create a file and send it.

Echo "china 51cto" > > / home/avro_log

Flume-ng avro-client-c. -H master-p 4141-F / home/avro_log

Note: the Flume framework depends on Hadoop and ZooKeeper only for their jar packages; the Hadoop and ZooKeeper services do not need to be running when Flume is started.

This covers one pattern for now; other data collection methods will be shared in follow-up posts.
