

How the Apache Flume regex_filter interceptor works


This article mainly introduces how the Apache Flume regex_filter interceptor works. Many people have doubts about it in daily operation, so below is a simple, easy-to-follow walkthrough; I hope it helps answer them. Please follow along and study!

In today's big data world, applications generate enormous amounts of electronic data, and these huge repositories contain valuable information. For a human analyst or domain expert, it is difficult to make interesting discoveries or find patterns that can support the decision-making process. We need automated processes to make effective use of large, information-rich data for planning and investment decisions. Before the data can be processed, it must be collected, aggregated, and transformed, and eventually moved to repositories where different analytics and data-mining tools can consume it.

One popular tool for performing all of these steps is Apache Flume. The data is usually stored in the form of events or logs. Apache Flume has three main components:

Source: the data source can be an enterprise server, a file system, the cloud, a data repository, and so on.

Sink: the sink is the target repository where the data is stored. It can be a centralized place such as HDFS, a processing engine such as Apache Spark, or a data store / search engine such as Elasticsearch.

Channel: events are staged in a channel before they are consumed by the sink. A channel is a passive store. Channels support failure recovery and high reliability; examples are the file channel, backed by the local file system, and the memory-based channel. A minimal wiring of the three components is sketched below.
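A minimal sketch of that wiring, using illustrative names (an agent a1 with a source r1, channel c1, and sink k1; the full configuration used in this article appears at the end):

# declare the components of an agent named a1 (names are arbitrary)
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# a source can feed several channels, but a sink drains exactly one channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1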

Flume is highly configurable and supports many sources, channels, serializers, and sinks. It also supports data flows. A powerful feature of Flume is the interceptor, which can modify or drop events while they are in flight. One of the supported interceptors is regex_filter.

regex_filter interprets the event body as text, compares it against the configured regular expression, and includes or excludes the event depending on whether the pattern matches. We will take a closer look at regex_filter.
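The direction of the filter is controlled by the excludeEvents property. A brief sketch of both modes, assuming a source named r1 as configured later in this article:

# excludeEvents = false (the default): keep ONLY events whose body matches
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = manager
a1.sources.r1.interceptors.i1.excludeEvents = false

# excludeEvents = true: DROP events whose body matches (used in this article)
# a1.sources.r1.interceptors.i1.regex = developer
# a1.sources.r1.interceptors.i1.excludeEvents = true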

Requirement

From the data source we receive records in the form street number, name, city, and role. The data source could be a real-time stream or any other source; in this example I have used the Netcat source, which listens on a given port and turns each line of text into an event. The requirement is to save the data to HDFS in text format, but before it is saved it must be filtered by role: only the manager records should be stored in HDFS, and records for other roles must be ignored. For example, the following data is allowed:

1,alok,mumbai,manager
2,jatin,chennai,manager

The following data is not allowed:

3,yogesh,kolkata,developer
5,jyotsana,pune,developer

How to meet this requirement

This can be achieved with the regex_filter interceptor. The interceptor filters events against the configured pattern; only the events of interest are passed on to the sink, while the others are dropped.

## Describe regex_filter interceptor and configure exclude events attribute
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = developer
a1.sources.r1.interceptors.i1.excludeEvents = true

The HDFS sink allows the data to be stored in HDFS in text or sequence-file format. It can also store the data in a compressed format.

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
## assumption is that Hadoop is CDH
a1.sinks.k1.hdfs.path = hdfs://quickstart.cloudera:8020/user/hive/warehouse/managers
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
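For the compressed format mentioned above, a minimal sketch; gzip here is just one of the codecs the HDFS sink accepts through hdfs.codeC:

# store compressed output instead of plain text
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip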

How to run the example

First, you need Hadoop to run the example, since HDFS is the sink. If you don't have a Hadoop cluster, you can change the sink to a logger and simply start Flume. Store the regex_filter_flume_conf.conf file in a directory and run the agent using the following command.

flume-ng agent --conf conf --conf-file regex_filter_flume_conf.conf --name a1 -Dflume.root.logger=INFO,console

Notice that the agent name is a1. I used Netcat as the source.

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
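If you are running without Hadoop, as suggested above, the HDFS sink can be swapped for a logger sink; a minimal sketch, which prints each event to the Flume log instead of writing to HDFS:

# logger sink: write events to the agent's log at INFO level
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1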

Once the Flume agent starts, run the following command to send events to Flume.

telnet localhost 44444

Now we just need to provide the following input text:

1,alok,mumbai,manager
2,jatin,chennai,manager
3,yogesh,kolkata,developer
4,ragini,delhi,manager
5,jyotsana,pune,developer
6,valmiki,banglore,manager

Visit HDFS and you will see that Flume has created a file under hdfs://quickstart.cloudera:8020/user/hive/warehouse/managers containing only the manager records.
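To verify, you can cat the file from the command line. A hedged sketch: FlumeData is the HDFS sink's default file prefix, so the actual file name will carry a generated numeric suffix:

hdfs dfs -cat /user/hive/warehouse/managers/FlumeData.*

The output should contain only the four manager records (1, 2, 4, and 6) from the input above.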

The complete Flume configuration, regex_filter_flume_conf.conf, is as follows:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source - netcat
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the HDFS sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://quickstart.cloudera:8020/user/hive/warehouse/managers
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

## Describe regex_filter interceptor and configure exclude events attribute
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = developer
a1.sources.r1.interceptors.i1.excludeEvents = true

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

At this point, the study of how the Apache Flume regex_filter interceptor works is over. I hope it has answered your doubts; pairing theory with practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow the site for more practical articles.
