How do you use Apache Flume to collect data? Before we can answer that question, we need to be clear about what Apache Flume actually is.
I. What is Apache Flume
Apache Flume is a high-performance system for data collection; the name reflects its origin as a near-real-time log collection tool. It is now widely used to collect any kind of streaming event data and supports aggregating data from many sources into HDFS.
Originally developed by Cloudera, it was contributed to the Apache Foundation in 2011 and became an Apache top-level project in 2012, at which point Flume OG was superseded by Flume NG.
Flume offers horizontal scalability, extensibility, and reliability.
II. Flume architecture
Source: receives events generated by external systems
Sink: delivers events to a specified destination
Channel: buffers events from a Source until a Sink removes them
Agent: a separate Flume process that contains the Source, Channel, and Sink components
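To make these roles concrete, here is a minimal sketch of a single-agent configuration: a netcat source feeding a memory channel that is drained by a logger sink. The component names (a1, r1, c1, k1) and the port are illustrative assumptions, not taken from the article.

# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Source: turn each line received on a TCP port into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
# Channel: buffer events in memory until the sink takes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
# Sink: write events to the log, useful for testing
a1.sinks.k1.type = logger
# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

A file like this is typically started with something along the lines of: flume-ng agent --conf conf --conf-file example.conf --name a1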
III. Flume design goal: reliability
Channels are what give Flume its reliability guarantee, so how do they provide it? The default is the Memory Channel, which keeps all buffered events in memory, and that raises a problem: if the node hosting the channel loses power, the buffered data is lost. To solve this, Flume offers a disk-based channel (the File Channel), which ensures that data is not lost in the event of a power outage.
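As a rough sketch, switching the channel from memory to disk is just a configuration change: the File Channel needs a checkpoint directory and one or more data directories (the paths below are assumptions).

# Disk-backed channel: buffered events survive a process crash or power outage
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data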
In addition, data transfer between agents and channels is transactional: data that fails to transfer to the downstream agent is rolled back and retried. Multiple agents can also be configured for the same task.
For example, if two agents are set up to complete one data collection job and one of them fails, the upstream agent can switch over to the other.
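One way to express that arrangement is Flume's failover sink processor: the upstream agent defines two Avro sinks, each pointing at a different downstream agent, and groups them so that the higher-priority sink is used until it fails. This is a hedged sketch; the host names, ports, and priorities are assumptions.

# Two sinks on the upstream agent, each targeting a different downstream collector
a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector1.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = collector2.example.com
a1.sinks.k2.port = 4545
a1.sinks.k2.channel = c1
# Failover group: k1 is preferred, k2 takes over when k1 fails
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000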
IV. Flume design goal: scalability
Scalability means that when the amount of data we collect grows, we can increase system performance linearly by adding more resources. Flume scales horizontally: as the workload grows, more machines can simply be added to the configuration.
V. Flume design goal: extensibility
Extensibility is the ability to add new functionality to the system. Flume can be extended with Sources and Sinks for existing storage tiers or data platforms: common Sources include files, syslog, and the standard output of any Linux process; common Sinks include the local file system and HDFS. Developers can also write their own Sources and Sinks.
VI. Common Flume data sources
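The source types named above map directly onto Flume's built-in components. As an illustrative sketch, the following agent reads a process's output with an exec source and delivers it to HDFS; the command, file path, directories, and HDFS URL are all assumptions.

a2.sources = r1
a2.channels = c1
a2.sinks = k1
# Exec source: run a command and turn each line of its standard output into an event
a2.sources.r1.type = exec
a2.sources.r1.command = tail -F /var/log/app/app.log
a2.sources.r1.channels = c1
# File channel so buffered events survive restarts
a2.channels.c1.type = file
a2.channels.c1.checkpointDir = /var/flume/a2/checkpoint
a2.channels.c1.dataDirs = /var/flume/a2/data
# HDFS sink: write events into a date-partitioned directory
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/app/%Y-%m-%d
a2.sinks.k1.hdfs.fileType = DataStream
a2.sinks.k1.hdfs.useLocalTimeStamp = true
a2.sinks.k1.channel = c1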
VII. Large-scale deployment examples
Flume uses agents to collect data, and an agent can receive data from many sources, including other agents. Large-scale deployments use multiple tiers of agents to achieve scalability and reliability, and Flume supports inspecting and modifying data in transit.
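A hedged sketch of such a tiered layout: many first-tier agents ship events over Avro to a smaller collector tier that writes to HDFS, and a host interceptor on the first tier stamps each event with its origin host, illustrating the kind of in-transit inspection and modification mentioned above. All host names, ports, and paths are assumptions.

# --- First tier: runs on each application server ---
tier1.sources = r1
tier1.channels = c1
tier1.sinks = k1
tier1.sources.r1.type = exec
tier1.sources.r1.command = tail -F /var/log/app/access.log
tier1.sources.r1.channels = c1
# Interceptor: add a "host" header to every event as it passes through
tier1.sources.r1.interceptors = i1
tier1.sources.r1.interceptors.i1.type = host
tier1.channels.c1.type = file
tier1.channels.c1.checkpointDir = /var/flume/tier1/checkpoint
tier1.channels.c1.dataDirs = /var/flume/tier1/data
tier1.sinks.k1.type = avro
tier1.sinks.k1.hostname = collector.example.com
tier1.sinks.k1.port = 4545
tier1.sinks.k1.channel = c1

# --- Collector tier: fans in events from many first-tier agents ---
collector.sources = r1
collector.channels = c1
collector.sinks = k1
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1
collector.channels.c1.type = file
collector.channels.c1.checkpointDir = /var/flume/collector/checkpoint
collector.channels.c1.dataDirs = /var/flume/collector/data
collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
collector.sinks.k1.hdfs.fileType = DataStream
collector.sinks.k1.hdfs.useLocalTimeStamp = true
collector.sinks.k1.channel = c1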
That covers the basics of Apache Flume; more details will be shared in later posts. Big data is likely to be one of the big opportunities of the coming years, and standing in the right place when it arrives takes continued study and effort.