In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-17 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >
Share
Shulou(Shulou.com)06/03 Report--
This article introduces the relevant knowledge of "what is Apache Beam". In the operation of actual cases, many people will encounter such a dilemma. Then let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!
1. Overview
In this tutorial, we will introduce Apache Beam and explore its basic concepts. We will first demonstrate the use cases and benefits of using Apache Beam, and then introduce the basic concepts and terminology. After that, we will use a simple example to illustrate all the important aspects of Apache Beam.
2. What is Apache Beam?
Apache Beam (Batch+strEAM) is a unified programming model for batch and streaming data processing jobs. It provides a software development kit for defining and building data processing pipes and executing runners for those pipes.
Apache Beam is designed to provide a portable programming layer. In fact, the Beam pipeline runner converts the data processing pipeline into an API that is compatible with the back end of the user's choice. Currently, the backends that support these distributed processing include:
Apache Apex
Apache Flink
Apache Gearpump (incubating)
Apache Samza
Apache Spark
Google Cloud Dataflow
Hazelcast Jet
3. Why choose Apache Beam?
Apache Beam combines batch and streaming data processing, while other components typically do this through a separate API. Therefore, it is easy to change streaming to batch processing, and vice versa, for example, as requirements change.
Apache Beam improves portability and flexibility. We focus on logic, not the underlying details. In addition, we can change the data processing backend at any time.
Apache Beam can use SDK such as Java, Python, Go, and Scala. In fact, everyone on the team can use the language of their choice.
4. Basic concept
With Apache Beam, we can build workflow diagrams (pipes) and execute them. The key concepts in the programming model are:
PCollection- represents a dataset that can be a fixed batch or data stream
PTransform- A data processing operation that accepts one or more PCollections and outputs zero or more PCollections.
Pipeline- represents the directed acyclic graph of PCollection and PTransform, so it encapsulates the entire data processing job.
PipelineRunner- executes pipelines on the specified distributed processing backend.
In a nutshell, PipelineRunner executes a pipe consisting of PCollection and PTransform.
5. Word count example
Now that we have learned the basic concepts of Apache Beam, let's design and test a word counting task.
5.1 Construction of beam pipes
Designing a workflow diagram is the first step in each Apache Beam job, and the steps for the word counting task are defined as follows:
1. Read the text from the original text.
two。 Divide the text into a vocabulary.
3. All words are lowercase.
4. Delete punctuation.
5. Filter the stop words.
6. Count the number of unique words.
To achieve this, we need to convert the above steps into pipes using PCollection and PTransform abstractions.
5.2. Dependence
Before implementing the workflow diagram, add Apache Beam dependencies to our project:
Org.apache.beam beam-sdks-java-core ${beam.version}
The Beam pipeline runner relies on the distributed processing backend to perform tasks. We add DirectRunner as a runtime dependency:
Org.apache.beam beam-runners-direct-java ${beam.version} runtime
Unlike other pipeline running programs, DirectRunner does not require any additional settings, which is a good choice for beginners.
5.3. Realize
Apache Beam uses the Map-Reduce programming paradigm (similar to Java Stream). Before talking about the following, it is best to have a basic concept and understanding of reduce (), filter (), count (), map (), and flatMap ().
The first thing to do is to create a pipe:
PipelineOptions options = PipelineOptionsFactory.create (); Pipeline p = Pipeline.create (options)
Six-step word counting task:
PCollection wordCount = p. Apply ("(1) Read all lines", TextIO.read (). From (inputFilePath)) .apply ("(2) Flatmap to a list of words", FlatMapElements.into (TypeDescriptors.strings ()) .via (line-> Arrays.asList ("\\ s") .apply ("(3) Lowercase all") MapElements.into (TypeDescriptors.strings ()) .via (word-> word.toLowerCase ()) .apply ("(4) Trim punctuations", MapElements.into (TypeDescriptors.strings ()) .via (word-> trim (word) .apply ("5) Filter stopwords", Filter.by (word->! isStopWord (word)) .apply ("(6) Count words", Count.perElement ())
The first (optional) parameter of apply () is a String, which is just to improve the readability of the code. Here's what each apply () does in the above code:
First, we use TextIO to read the input text file line by line.
Separate each line by a space and map it to a vocabulary.
Word counting is not case-sensitive, so we lowercase all words.
Previously, we separated lines with spaces, but like "word!" And "word?" In this case, you need to delete punctuation.
Stop words like "is" and "by" are common in almost every English article, so we delete them.
Finally, we calculate the number of unique words using the built-in function Count.perElement ().
As mentioned earlier, pipes are processed at the distributed back end. It is not possible to iterate over an in-memory PCollection because it is distributed across multiple backends. Instead, we write the results to an external database or file.
First, we convert PCollection to String. Then, write the output using TextIO:
WordCount.apply (MapElements.into (TypeDescriptors.strings ()) .via (count-> count.getKey () + "- >" + count.getValue ()) .apply (TextIO.write () .to (outputFilePath)
Now that the pipeline is defined, let's do a simple test.
5.4. Run the test
So far, we have defined the pipeline for the word counting task, and now run the pipeline:
P.run () .waitUntilFinish ()
In this line of code, Apache Beam will send our task to multiple DirectRunner instances. Therefore, several output files will be generated in the end. They will contain the following:
... Apache-> 3 beam-- > 5 rocks-- > 2.
It is so easy to define and run distributed jobs in Apache Beam. For comparison purposes, word counting implementations are also available on Apache Spark, Apache Flink, and Hazelcast-Jet
This is the end of "what is Apache Beam". Thank you for your reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.