

Introduction to Flume Architecture



Shulou(Shulou.com)06/03 Report--

Friday, 2019-02-22: Introduction to Flume Architecture

Where is the official Flume website?

http://flume.apache.org/

http://www.apache.org/dyn/closer.cgi/flume/1.5.0/apache-flume-1.5.0-bin.tar.gz

Introduction to the data collection tool Flume

What is Flume?

Flume is a real-time log collection system originally developed by Cloudera that has been widely recognized and adopted in industry. Its initial releases are now collectively referred to as Flume OG (original generation) and belonged to Cloudera. As Flume's functionality expanded, however, the shortcomings of Flume OG became apparent: a bloated code base, an unreasonable core-component design, and non-standard core configuration. In the last OG release, 0.94.0, unstable log transmission was especially serious. Solving these problems motivated a redesign.

On October 22, 2011, Cloudera completed Flume-728, a landmark change to Flume that refactored the core components, core configuration, and code architecture; the refactored version is collectively referred to as Flume NG (next generation). Another reason for the change was Flume's move into Apache, after which Cloudera Flume was renamed Apache Flume.

Characteristics of flume:

Flume is a distributed, reliable, and highly available system for massive log collection, aggregation, and transmission. It supports customizing various data senders in a logging system to collect data, and at the same time provides the ability to process the data simply and write it to various data receivers (such as text files, HDFS, HBase, etc.).

Flume's data flow is carried from end to end by Events; that is, the entire flow Flume processes is a stream of events.

An Event is Flume's basic unit of data. It carries log data (in the form of a byte array) along with header information. Events are generated by sources outside the Agent; when a Source captures an event, it applies specific formatting and then pushes the event into one or more Channels. You can think of a Channel as a buffer that holds the event until a Sink has finished processing it. The Sink is responsible for persisting the log or pushing the event on to another Source.

Reliability of Flume

When a node fails, logs can be sent on to other nodes without being lost. Flume provides three levels of reliability guarantee, from strongest to weakest:

end-to-end: the agent first writes the event to disk and deletes it only after the data has been transferred successfully; if sending fails, the data can be resent.

Store on failure: this is also the strategy adopted by Scribe; when the data receiver crashes, the data is written locally, and sending resumes after recovery.

Besteffort: no acknowledgement is made after the data is sent to the receiver.

Recoverability of flume:

This again comes down to the Channel. FileChannel is recommended: events are persisted in the local file system (at the cost of performance).

Some core concepts of Flume:

Agent: a Flume process running in its own JVM. Each machine runs one agent, but a single agent can contain multiple sources and sinks. An agent is composed of source + channel + sink.

Client: produces the data; runs in a separate thread.

Source: collects data from the Client and passes it to the Channel; that is, it receives data from the data generator and passes it to one or more channels in the form of Flume events.

Channel: connects sources and sinks, a bit like a queue. It temporarily stores the event data passed in by the source and caches it until the sink consumes it; it is the bridge between source and sink.

Sink: collects data from the Channel and runs in a separate thread. It extracts events from the channel and delivers them to the destination, such as HDFS or HBase; a sink's destination can also be another agent or central storage.

Event: the data unit, consisting of a message header and a message body; events can be log records, Avro objects, and so on.
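The concepts above can be tied together in a minimal configuration sketch. The agent, source, channel, and sink names (`a1`, `r1`, `c1`, `k1`) are arbitrary placeholders, and the property set should be checked against the Flume User Guide for your version:

```properties
# Hypothetical agent "a1": one netcat source, one memory channel, one logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens on a TCP port and turns each received line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: logs events (useful for testing)
a1.sinks.k1.type = logger

# Wiring: source -> channel -> sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such an agent is typically started with something like `flume-ng agent --conf conf --conf-file example.conf --name a1`.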

Flume architecture

Flume takes the agent as its smallest independent unit of operation; an agent is a JVM. A single agent consists of three major components: Source, Sink, and Channel. Note that an agent can have multiple sources, sinks, and channels. As shown below:

It is worth noting that Flume provides a large number of built-in Source, Channel, and Sink types. Different types of sources, channels, and sinks can be combined freely; the combination is defined by a user-supplied configuration file, which is very flexible. For example:

Channel can store events temporarily in memory or persist them to your local hard disk.

Sink can write logs to HDFS, HBase, or even another Source, etc.

Flume allows users to build multi-level flows: multiple agents can work together, with support for fan-in, fan-out, contextual routing, and backup routes, which is where Flume really shines. Multi-level flow is a feature shown in the following figure:
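A two-level flow can be sketched as follows (the host name and port are invented for illustration): a downstream agent exposes an Avro source, and any number of upstream agents forward events to it through Avro sinks, which is how fan-in is usually built:

```properties
# Upstream agent "web" (runs on each web server): forwards events downstream
web.sinks = forward
web.sinks.forward.type = avro
web.sinks.forward.hostname = collector.example.com   # hypothetical downstream host
web.sinks.forward.port = 4141
web.sinks.forward.channel = c1

# Downstream agent "collector": receives events from all upstream agents (fan-in)
collector.sources = avroIn
collector.sources.avroIn.type = avro
collector.sources.avroIn.bind = 0.0.0.0
collector.sources.avroIn.port = 4141
collector.sources.avroIn.channels = c1
```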

Summary of flume's advantages:

1. Stores data to any central store.

2. When the incoming data rate exceeds the write-out rate, Flume acts as a buffer and keeps the flow stable.

3. Provides contextual routing.

4. Supports transactions.

5. Reliable, fault-tolerant, scalable, customizable, and manageable.

Advanced component explanation:

[interceptor]: sits between the source and the channel, inspecting or modifying events as they pass through.

[channel selector]: when there are multiple channels, decides which channel is used to transfer the data. There are two types of channel selector:

A. Replicating channel selector (the default):

copies each event into every configured channel.

B. Multiplexing channel selector:

decides which channel to send an event to by examining the event's header information.
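A multiplexing selector is configured on the source; the header name (`state`) and the mapping values below are illustrative, following the pattern in the Flume User Guide:

```properties
a1.sources = r1
a1.channels = c1 c2

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state     # header key to inspect
a1.sources.r1.selector.mapping.CZ = c1    # events with state=CZ go to c1
a1.sources.r1.selector.mapping.US = c2    # events with state=US go to c2
a1.sources.r1.selector.default = c1       # everything else goes to c1

a1.sources.r1.channels = c1 c2
```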

[sink processor]: selects a specific sink from a sink group to invoke. It can be used to create a failover (disaster-recovery) path for sinks or to achieve load balancing across multiple sinks.

[collector]: runs downstream of the agents, aggregating their output.

[multi-hop]: events traverse multiple agents before reaching the final destination, hopping from one agent's sink to the next agent's source.

[fan-out]: from one source to multiple channels.

[fan-in]: from multiple sources to one channel.
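A failover sink processor, for example, can be sketched like this (the sink names are illustrative): the group tries the highest-priority sink first and falls back to the next one when it fails:

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10   # preferred sink
a1.sinkgroups.g1.processor.priority.k2 = 5    # backup sink
a1.sinkgroups.g1.processor.maxpenalty = 10000 # max back-off for a failed sink, in ms
```

Setting `processor.type = load_balance` instead spreads events across the group for load balancing.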

Detailed explanation of the three major components (Source, Channel, Sink):

Source

The Source is the data collection side. It is responsible for capturing data, applying special formatting, encapsulating the data into an event, and then pushing the event into the Channel.

Flume provides a variety of Source implementations, including Avro Source, Exec Source, Spooling Directory Source, NetCat Source, Syslog Source, Syslog TCP Source, Syslog UDP Source, HTTP Source, HDFS Source, etc. Flume also supports custom sources if the built-in ones do not meet your needs.

Several of these are explained in more detail below:

(1) Avro Source: can send a given file to Flume; the Avro source uses the Avro RPC mechanism. It is very commonly used.

(2) Spooling Directory Source: Spool monitors the configured directory for new files and reads the data in those files. There are two points to note:

1) files copied into the spool directory must not be opened for editing afterwards;

2) the spool directory must not contain subdirectories.

(3) Exec Source: executes a given command and takes its output as the source. If you use the tail command, the file must be large enough for output to appear.

(4) Syslog TCP Source: listens on a TCP port for syslog data and uses it as the data source.

(5) HTTP Source: accepts events over HTTP, using JSONHandler to parse the request body.

(6) HDFS Source: a source for Hadoop.
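As configuration sketches for two of the sources above (the directory and command paths are made up for illustration):

```properties
# Spooling Directory Source: reads files dropped into a watched directory
a1.sources.spool.type = spooldir
a1.sources.spool.spoolDir = /var/log/flume-spool
a1.sources.spool.fileHeader = true    # record the originating file name in a header

# Exec Source: runs a command and treats each output line as an event
a1.sources.tailer.type = exec
a1.sources.tailer.command = tail -F /var/log/app.log
```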

Channel

The Channel is the component that connects the Source and the Sink. You can think of it as a data buffer (a data queue). It can hold events temporarily in memory or persist them to local disk until the Sink finishes processing them.

For channels, Flume provides Memory Channel, JDBC Channel, File Channel, etc.

MemoryChannel achieves high throughput but cannot guarantee the integrity of the data.

MemoryRecoverChannel is deprecated: the official documentation recommends replacing it with FileChannel.

FileChannel guarantees the integrity and consistency of the data. When configuring FileChannel, it is recommended to put the directories used by FileChannel and the directory holding the program's log files on different disks to improve efficiency.
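Following that advice, a FileChannel sketch might put its checkpoint and data directories on dedicated disks (the paths here are invented):

```properties
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data1/flume/checkpoint  # on one disk
a1.channels.c1.dataDirs = /data2/flume/data             # on another disk
```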

Sink

The Flume Sink takes data from the Channel and stores it in a file system or database, or forwards it to a remote server.

Flume also provides a variety of sink implementations, including HDFS sink, Logger sink, Avro sink, File Roll sink, Null sink, HBase sink, etc.

When configuring where a Flume sink stores data, you can choose the file system, a database, or Hadoop. When there is little log data, it can be stored in the file system with a fixed save interval; when there is a lot of log data, the log data can be stored in Hadoop to facilitate later data analysis.
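An HDFS sink sketch along those lines (the path, roll sizes, and intervals are illustrative and should be tuned to the actual data volume):

```properties
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.fileType = DataStream       # plain text instead of SequenceFile
a1.sinks.k1.hdfs.rollInterval = 3600         # roll a new file every hour...
a1.sinks.k1.hdfs.rollSize = 134217728        # ...or when the file reaches 128 MB
a1.sinks.k1.hdfs.rollCount = 0               # disable event-count-based rolling
a1.sinks.k1.hdfs.useLocalTimeStamp = true    # allow %Y-%m-%d without a timestamp header
```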

The reference link is: https://www.cnblogs.com/qingyunzong/p/8994494.html
