
Big Data Frameworks: Flink and Beam


Overview of Flink

Apache Flink is a top-level Apache project: an open source distributed stream and batch processing system. The core of Flink is a streaming engine that provides data distribution, communication, and fault-tolerant distributed computation over data streams. On top of this streaming engine, Flink builds a batch engine with native support for iterative computation, memory management, and program optimization.

Existing open source computing solutions treat stream processing and batch processing as two different application types, because the SLAs (Service Level Agreements) they provide are completely different: stream processing generally needs to support low latency and Exactly-once guarantees, while batch processing needs to support high throughput and efficient processing.

Flink looks at stream processing and batch processing from another perspective and unifies the two: Flink fully supports stream processing, meaning that a job viewed as stream processing has an unbounded input data stream; batch processing is treated as a special kind of stream processing whose input data stream is defined as bounded.

Flink stream processing features:

- Supports high-throughput, low-latency, high-performance stream processing
- Supports window operations with event time
- Supports stateful computation with Exactly-once semantics
- Supports highly flexible window operations: windows based on time, count, session, and data-driven triggers (a small streaming sketch follows this list)
- Supports a continuous streaming model with backpressure
- Supports fault tolerance based on a lightweight distributed snapshot mechanism
- A single runtime supports both batch-on-streaming processing and pure streaming processing
- Flink implements its own memory management inside the JVM
- Supports iterative computation
- Supports automatic program optimization: avoids expensive operations such as shuffle and sorting where possible, and caches intermediate results when necessary
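To make the window and stateful-computation features above concrete, here is a minimal, hedged sketch using Flink's DataStream API (it is not code from this article): it counts words per key over a tumbling 5-second processing-time window. The socket source on localhost:9999 and the window size are illustrative assumptions.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowedWordCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Unbounded source: lines arriving on a socket (e.g. fed by: nc -lk 9999)
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split("\\s+")) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                })
                .keyBy(0)                    // keyed (stateful) stream, partitioned by word
                .timeWindow(Time.seconds(5)) // tumbling 5-second window
                .sum(1)                      // count occurrences of each word per window
                .print();                    // sink: print results to stdout

        env.execute("Windowed wordcount sketch");
    }
}

You could feed it by running nc -lk 9999 in another terminal and watch the per-window counts in the TaskManager output.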

Flink architecture diagram:

Flink organizes its software stack as a layered system: each layer is built on top of the layers below it, and each layer offers programs a different level of abstraction.

At the most basic level, a Flink application consists of the following parts:

- Data source: the source of the data, which feeds input data into Flink
- Transformations: process the data
- Data sink: sends the processed data to some destination

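As a rough illustration of these three parts, here is a minimal batch wordcount sketch written against Flink's DataSet API. This is an assumed example, not the source of the WordCount.jar shipped with Flink, and the input path is a placeholder.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCountSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Data source: read lines from a text file (placeholder path)
        DataSet<String> lines = env.readTextFile("file:///data/hello.txt");

        // Transformations: split lines into (word, 1) pairs, then sum the counts per word
        DataSet<Tuple2<String, Integer>> counts = lines
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split("\\s+")) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                })
                .groupBy(0)
                .sum(1);

        // Data sink: here we simply print to stdout; writeAsText(...) would write to a file
        counts.print();
    }
}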

Currently, Flink supports the following frameworks:

- Apache Kafka (sink/source) (see the sketch after this list)
- Elasticsearch 1.x / 2.x / 5.x (sink)
- HDFS (sink)
- RabbitMQ (sink/source)
- Amazon Kinesis Streams (sink/source)
- Twitter (source)
- Apache NiFi (sink/source)
- Apache Cassandra (sink)
- Redis, Flume, and ActiveMQ (via Apache Bahir) (sink)
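As one hedged example of using a connector from this list, the sketch below reads a Kafka topic as a streaming source. It assumes the flink-connector-kafka-0.10 dependency is on the classpath; the topic name, broker address, and consumer group are placeholders.

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.setProperty("group.id", "flink-demo");              // placeholder consumer group

        // Kafka as a streaming source: each record value is deserialized into a String
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer010<>("demo-topic", new SimpleStringSchema(), props));

        stream.print(); // sink: print the records to stdout

        env.execute("Kafka source sketch");
    }
}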

The official website address of Flink is as follows:

http://flink.apache.org/

For some of the contents, please refer to the following article:

https://blog.csdn.net/jdoouddm7i/article/details/62039337

Use Flink to run a wordcount case

Download address of Flink:

http://flink.apache.org/downloads.html

Flink Quick start document address:

https://ci.apache.org/projects/flink/flink-docs-release-1.4/quickstart/setup_quickstart.html

Note: a JDK 1.7 or above environment is required on the system before installing Flink.

What I download here is Flink 1.4.2 built for Hadoop 2.6:

[root@study-01 ~]# cd /usr/local/src/
[root@study-01 /usr/local/src]# wget http://mirrors.tuna.tsinghua.edu.cn/apache/flink/flink-1.4.2/flink-1.4.2-bin-hadoop26-scala_2.11.tgz
[root@study-01 /usr/local/src]# tar -zxvf flink-1.4.2-bin-hadoop26-scala_2.11.tgz -C /usr/local
[root@study-01 /usr/local/src]# cd ../flink-1.4.2/
[root@study-01 /usr/local/flink-1.4.2]# ls
bin  conf  examples  lib  LICENSE  log  NOTICE  opt  README.txt  resources  tools
[root@study-01 /usr/local/flink-1.4.2]#

Start Flink:

[root@study-01 /usr/local/flink-1.4.2]# ./bin/start-local.sh
[root@study-01 /usr/local/flink-1.4.2]# jps
6576 Jps
6131 JobManager
6499 TaskManager
[root@study-01 /usr/local/flink-1.4.2]#

After it starts successfully, you can visit port 8081 on the host's IP to open Flink's web UI:

We can now start implementing the wordcount case. I have a file here with the following content:

[root@study-01 /usr/local/flink-1.4.2]# cat /data/hello.txt
hadoop welcome
hadoop hdfs mapreduce
hadoop hdfs
hello hadoop
spark vs mapreduce
[root@study-01 /usr/local/flink-1.4.2]#

Execute the following command to run the wordcount case. If you have studied Hadoop, you will notice that this command is similar to running the wordcount case with MapReduce on Hadoop:

[root@study-01 /usr/local/flink-1.4.2]# ./bin/flink run ./examples/batch/WordCount.jar --input file:///data/hello.txt --output file:///data/tmp/flink_wordcount_out

After the execution is completed, you can go to the web page to view the execution information of the task:

View the output:

[root@study-01 /usr/local/flink-1.4.2]# cat /data/tmp/flink_wordcount_out
hadoop 4
hdfs 2
hello 1
mapreduce 2
spark 1
vs 1
welcome 1
[root@study-01 /usr/local/flink-1.4.2]#

Beam Overview

Google's old and new troikas:

The old troika: GFS, MapReduce, BigTable
The new troika: Dremel, Pregel, Caffeine

As we all know, several frameworks in the Hadoop ecosystem are derived from Google's old troika, and some of the newer frameworks partly implement ideas from Google's new troika. As a result, there are now many big-data frameworks on the market, each with its own programming model and processing pattern, and people hoped for a tool that could unify these programming models. That is why Beam was born.

Apache Beam is an open source project announced by the Apache Software Foundation on January 10, 2017. Beam provides a portable (engine-compatible) API layer for building complex parallel data processing pipelines. The core concepts of this API layer are based on the Beam model (formerly known as the Dataflow model), which each Beam engine implements to a varying degree.

Background:

In February 2016, Google and its partners donated a large body of code to Apache, creating the incubating Beam project (originally called Apache Dataflow). Most of this code came from the Google Cloud Dataflow SDK, the library developers use to write streaming and batch pipelines that can run on any supported execution engine. At the time, the main supported engine was Google Cloud Dataflow, with support for Apache Spark and Apache Flink under development. Now that the project has officially launched, it already has five officially supported engines: in addition to the three already mentioned, the Beam direct runner (the reference implementation of the Beam model) and Apache Apex are also included.

Beam features:

Beam unifies the batch processing (batch) and stream processing (stream) programming paradigms and can run on any execution engine. It not only provides a unified model for pipeline design, but also supports a series of data-oriented workflows, including data processing, ingestion, and integration.
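To illustrate the unified model, here is a minimal wordcount pipeline sketch with the Beam Java SDK, closely modeled on Beam's MinimalWordCount example; the input path and output prefix are placeholders. The same pipeline definition runs unchanged on any supported runner, selected via the --runner option.

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamWordCountSketch {
    public static void main(String[] args) {
        // The runner is chosen through options (e.g. --runner=FlinkRunner), not in the code
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("/data/hello.txt"))                       // read lines
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                 .via((String line) -> Arrays.asList(line.split("\\s+"))))   // split into words
         .apply(Count.perElement())                                          // count each word
         .apply(MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("counts"));                                // write output shards

        p.run().waitUntilFinish();
    }
}

Because the runner is only chosen through PipelineOptions, switching from the Direct Runner to Flink or Spark does not require touching this code.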

The official website of Beam:

https://beam.apache.org/

Run the Beam WordCount program on a variety of different runners

Quick start documentation for Beam Java:

https://beam.apache.org/get-started/quickstart-java/

Before installing Beam, the system likewise needs a JDK 1.7 or above environment, as well as a Maven environment.

Download the Beam wordcount example code using the following command:

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.beam \
  -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
  -DarchetypeVersion=2.4.0 \
  -DgroupId=org.example \
  -DartifactId=word-count-beam \
  -Dversion="0.1" \
  -Dpackage=org.apache.beam.examples \
  -DinteractiveMode=false

Enter the downloaded directory to view:

[root@study-01 /usr/local/src]# cd word-count-beam/
[root@study-01 /usr/local/src/word-count-beam]# tree
.
├── pom.xml
└── src
    ├── main
    │   └── java
    │       └── org
    │           └── apache
    │               └── beam
    │                   └── examples
    │                       ├── common
    │                       │   ├── ExampleBigQueryTableOptions.java
    │                       │   ├── ExampleOptions.java
    │                       │   ├── ExamplePubsubTopicAndSubscriptionOptions.java
    │                       │   ├── ExamplePubsubTopicOptions.java
    │                       │   ├── ExampleUtils.java
    │                       │   └── WriteOneFilePerWindow.java
    │                       ├── complete
    │                       │   └── game
    │                       │       ├── GameStats.java
    │                       │       ├── HourlyTeamScore.java
    │                       │       ├── injector
    │                       │       │   ├── Injector.java
    │                       │       │   ├── InjectorUtils.java
    │                       │       │   └── RetryHttpInitializerWrapper.java
    │                       │       ├── LeaderBoard.java
    │                       │       ├── StatefulTeamScore.java
    │                       │       ├── UserScore.java
    │                       │       └── utils
    │                       │           ├── GameConstants.java
    │                       │           ├── WriteToBigQuery.java
    │                       │           ├── WriteToText.java
    │                       │           └── WriteWindowedToBigQuery.java
    │                       ├── DebuggingWordCount.java
    │                       ├── MinimalWordCount.java
    │                       ├── WindowedWordCount.java
    │                       └── WordCount.java
    └── test
        └── java
            └── org
                └── apache
                    └── beam
                        └── examples
                            ├── complete
                            │   └── game
                            │       ├── GameStatsTest.java
                            │       ├── HourlyTeamScoreTest.java
                            │       ├── LeaderBoardTest.java
                            │       ├── StatefulTeamScoreTest.java
                            │       └── UserScoreTest.java
                            ├── DebuggingWordCountTest.java
                            ├── MinimalWordCountTest.java
                            └── WordCountTest.java

20 directories, 31 files
[root@study-01 /usr/local/src/word-count-beam]#

By default, Beam's runner is the Direct Runner, so the wordcount case can be run on the Direct Runner as follows:

[root@study-01 /usr/local/src/word-count-beam]# ls
pom.xml  src  target
[root@study-01 /usr/local/src/word-count-beam]# mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--inputFile=/data/hello.txt --output=counts" -Pdirect-runner

The results of the run will be stored in the current directory:

[root@study-01 /usr/local/src/word-count-beam]# ls
counts-00000-of-00003  counts-00001-of-00003  counts-00002-of-00003  pom.xml  src  target
[root@study-01 /usr/local/src/word-count-beam]# more counts*    # view the result files
counts-00000-of-00003:
welcome: 1
spark: 1
counts-00001-of-00003:
hdfs: 2
hadoop: 4
mapreduce: 2
counts-00002-of-00003:
hello: 1
vs: 1
[root@study-01 /usr/local/src/word-count-beam]#

If you need to specify another runner, you can use the --runner parameter. For example, to use Flink as the runner, modify the command as follows:

[root@study-01 /usr/local/src/word-count-beam]# mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--runner=FlinkRunner --inputFile=/data/hello.txt --output=counts" -Pflink-runner

Delete the previously generated files and directories, and let's run it again using Spark. To use Spark, you only need to change the --runner value and the -P profile to spark-runner:

[root@study-01 /usr/local/src/word-count-beam]# mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount -Dexec.args="--runner=SparkRunner --inputFile=/data/hello.txt --output=counts" -Pspark-runner

After running successfully, the following files and directories will also be generated:

[root@study-01 /usr/local/src/word-count-beam]# ls
counts-00000-of-00003  counts-00001-of-00003  counts-00002-of-00003  pom.xml  src  target
[root@study-01 /usr/local/src/word-count-beam]#

View the processing results:

[root@study-01 /usr/local/src/word-count-beam]# more counts*
counts-00000-of-00003:
spark: 1
counts-00001-of-00003:
welcome: 1
hello: 1
mapreduce: 2
counts-00002-of-00003:
vs: 1
hdfs: 2
hadoop: 4
[root@study-01 /usr/local/src/word-count-beam]#

The two examples above simply show that the same code can run on different computing engines; there is no need to develop different code for different engines, which is one of the main design goals of the Beam framework.
