What is Apache Beam?


This article introduces the basics of Apache Beam. Many people run into these questions in real-world work, so let's walk through the concepts and a complete worked example together. I hope you read it carefully and get something out of it!

1. Overview

In this tutorial, we will introduce Apache Beam and explore its fundamental concepts. We will begin by demonstrating the use cases and benefits of Apache Beam, and then cover its basic concepts and terminology. After that, we will walk through a simple example that illustrates all the important aspects of Apache Beam.

2. What is Apache Beam?

Apache Beam (Batch + strEAM) is a unified programming model for batch and streaming data processing jobs. It provides software development kits for defining and constructing data processing pipelines, as well as runners for executing those pipelines.

Apache Beam is designed to provide a portable programming layer. In fact, Beam pipeline runners translate the data processing pipeline into the API of the backend of the user's choice (see the runner-selection sketch after this list). Currently, the following distributed processing backends are supported:

Apache Apex

Apache Flink

Apache Gearpump (incubating)

Apache Samza

Apache Spark

Google Cloud Dataflow

Hazelcast Jet
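
Because the pipeline definition is independent of the runner, switching backends mostly comes down to pipeline options. Here is a minimal sketch, assuming the Java SDK and the chosen runner's artifact on the classpath:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelection {
    public static void main(String[] args) {
        // Pass e.g. --runner=DirectRunner or --runner=FlinkRunner
        // (plus any runner-specific flags) on the command line.
        PipelineOptions options = PipelineOptionsFactory
            .fromArgs(args)
            .withValidation()
            .create();

        Pipeline p = Pipeline.create(options);
        // ... the same PTransforms run unchanged on any backend ...
        p.run().waitUntilFinish();
    }
}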

3. Why choose Apache Beam?

Apache Beam unifies batch and streaming data processing, while other frameworks often do so through separate APIs. Therefore, it is easy to change a streaming process to a batch process, and vice versa, as requirements change.

Apache Beam improves portability and flexibility. We focus on our logic rather than the underlying details. Moreover, we can change the data processing backend at any time.

Apache Beam offers SDKs for Java, Python, Go, and Scala. Indeed, everyone on the team can use the language of their choice.

4. Basic concepts

With Apache Beam, we can construct workflow graphs (pipelines) and execute them. The key concepts in the programming model are:

PCollection - represents a data set, which can be a fixed batch or a stream of data.

PTransform - a data processing operation that takes one or more PCollections and outputs zero or more PCollections.

Pipeline - represents a directed acyclic graph of PCollections and PTransforms, and hence encapsulates the entire data processing job.

PipelineRunner - executes a Pipeline on a specified distributed processing backend.

In a nutshell, a PipelineRunner executes a Pipeline, and a Pipeline consists of PCollections and PTransforms.
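
To make the mapping from concepts to code concrete, here is a minimal sketch using the Java SDK; the Create transform builds an in-memory PCollection, and all names are illustrative:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ConceptsDemo {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();                      // Pipeline: the DAG

        PCollection<String> lines =                          // PCollection: a dataset
            p.apply(Create.of("hello beam", "hello world")); // PTransform: an operation

        PCollection<String> upper = lines.apply(             // another PTransform
            MapElements.into(TypeDescriptors.strings())
                .via((String word) -> word.toUpperCase()));

        p.run().waitUntilFinish();                           // PipelineRunner executes it
    }
}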

5. Word count example

Now that we have learned the basic concepts of Apache Beam, let's design and test a word counting task.

5.1. Building the Beam pipeline

Designing the workflow graph is the first step in every Apache Beam job. Let's define the steps of our word count task:

1. Read the text from an input file.

2. Split the text into a list of words.

3. Lowercase all words.

4. Trim punctuation.

5. Filter out stop words.

6. Count each unique word.

To achieve this, we need to convert the above steps into a single pipeline using the PCollection and PTransform abstractions.

5.2. Dependencies

Before implementing the workflow graph, we add the Apache Beam dependency to our project:

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-core</artifactId>
    <version>${beam.version}</version>
</dependency>

The Beam pipeline runner relies on the distributed processing backend to perform tasks. We add DirectRunner as a runtime dependency:

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-direct-java</artifactId>
    <version>${beam.version}</version>
    <scope>runtime</scope>
</dependency>

Unlike other pipeline runners, DirectRunner does not require any additional setup, which makes it a good choice for beginners.
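
Both artifacts reference a ${beam.version} Maven property, which must be declared once in the POM. The exact release is up to you; for example:

<properties>
    <beam.version>2.45.0</beam.version>
</properties>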

5.3. Implementation

Apache Beam uses the Map-Reduce programming paradigm (the same as Java Streams). Before going on, it is helpful to have a basic understanding of reduce(), filter(), count(), map(), and flatMap().
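
For intuition, here is roughly the same word count written with plain Java Streams; this is not Beam code, just an analogy, and input.txt is a placeholder path:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class StreamsWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = Files.lines(Paths.get("input.txt"))
            .flatMap(line -> Arrays.stream(line.split("\\s")))  // flatMap: line -> words
            .map(String::toLowerCase)                           // map: normalize case
            .filter(word -> !word.isEmpty())                    // filter: drop empties
            .collect(Collectors.groupingBy(                     // count per unique word
                word -> word, Collectors.counting()));
        System.out.println(counts);
    }
}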

The first thing to do is to create a pipeline:

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);

Then we apply the six-step word count task:

PCollection<KV<String, Long>> wordCount = p
    .apply("(1) Read all lines", TextIO.read().from(inputFilePath))
    .apply("(2) Flatmap to a list of words", FlatMapElements.into(TypeDescriptors.strings())
        .via(line -> Arrays.asList(line.split("\\s"))))
    .apply("(3) Lowercase all", MapElements.into(TypeDescriptors.strings())
        .via(word -> word.toLowerCase()))
    .apply("(4) Trim punctuations", MapElements.into(TypeDescriptors.strings())
        .via(word -> trim(word)))
    .apply("(5) Filter stopwords", Filter.by(word -> !isStopWord(word)))
    .apply("(6) Count words", Count.perElement());

The first (optional) parameter of apply() is a String that is only there to improve the readability of the code. Here is what each apply() in the above code does:

First, we read the input text file line by line using TextIO.

Then we split each line by whitespace and flat-map it to a list of words.

Word counting should be case-insensitive, so we lowercase all words.

Earlier, we split lines by whitespace only, which leaves tokens like "word!" and "word?", so we trim the punctuation.

Stop words such as "is" and "by" appear in almost every English text, so we remove them.

Finally, we count the unique words using the built-in function Count.perElement().
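
Note that trim() and isStopWord() in steps (4) and (5) are our own helpers, not part of the Beam SDK. A minimal sketch of what they might look like (the stop-word list is illustrative; requires java.util.Arrays and java.util.List):

private static final List<String> STOP_WORDS =
    Arrays.asList("is", "by", "a", "an", "the", "of", "to", "and");

// Strip leading and trailing non-word characters, e.g. "word!" -> "word".
private static String trim(String word) {
    return word.replaceAll("^\\W+|\\W+$", "");
}

private static boolean isStopWord(String word) {
    return STOP_WORDS.contains(word);
}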

As mentioned earlier, pipelines are processed on a distributed backend. It is not possible to iterate over a PCollection in memory, since it is distributed across multiple backends. Instead, we write the results to an external database or file.

First, we map the PCollection to String. Then we write the output using TextIO:

wordCount.apply(MapElements.into(TypeDescriptors.strings())
        .via(count -> count.getKey() + " --> " + count.getValue()))
    .apply(TextIO.write().to(outputFilePath));

Now that the pipeline is defined, let's do a simple test.

5.4. Run the test

So far, we have defined a pipeline for the word count task. Now let's run it:

p.run().waitUntilFinish();

In this line of code, Apache Beam sends our task to multiple DirectRunner instances, so several output files are generated at the end. They will contain content such as:

...
apache --> 3
beam --> 5
rocks --> 2
...
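
For a repeatable unit test, the Beam SDK also ships PAssert, which lets us assert on a PCollection's contents before running the pipeline. A hedged sketch, with illustrative expected counts (requires org.apache.beam.sdk.testing.PAssert and org.apache.beam.sdk.values.KV):

PAssert.that(wordCount)
    .containsInAnyOrder(
        KV.of("apache", 3L),
        KV.of("beam", 5L),
        KV.of("rocks", 2L));
p.run().waitUntilFinish();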

Defining and running a distributed job in Apache Beam is as simple as that. For comparison, word count implementations are also available on Apache Spark, Apache Flink, and Hazelcast Jet.

This concludes "What is Apache Beam". Thank you for reading!
