How to realize Flink batch processing 07/02 Update SLTechnology News&Howtos

How to realize Flink batch processing

2025-07-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/01 Report--

This article focuses on "how to achieve Flink batch processing", interested friends may wish to take a look. The method introduced in this paper is simple, fast and practical. Next let the editor to take you to learn "how to achieve Flink batch processing"!

Introduction to 1.Flink

Apache Flink is a framework and distributed processing engine for stateful computing of unbounded and bounded data flows. Flink is designed to run in all common cluster environments, performing calculations at memory execution speed and on any scale. Flink is an open source streaming framework with the following features

Batch and stream as a whole: unified batch processing and flow processing

Distributed: Flink programs can run on multiple machines

High performance: high processing performance

High availability: Flink supports high availability (HA)

Accuracy: Flink can ensure the accuracy of data processing.

2.Flink core module composition

First of all, by analogy to Spark, let's look at the module partition of Flink.

Deploy layer

You can start a single JVM and let Flink run in Local mode or in Standalone cluster mode. At the same time, it also supports Flink ON YARN,Flink applications to be submitted directly to YARN and Flink can also run on GCE (Google Cloud Service) and EC2 (Amazon Cloud Service).

Core layer (Runtime)

Two core sets of API,DataStream API (streaming) and DataSet API (batch processing) are provided on top of Runtime

APIs & Libraries layer

Some high-level libraries and API are extended on top of the core API

CEP stream processing

Table API and SQL

Flink ML machine learning library

Gelly diagram calculation

3.Flink ecological composition

As a member of big data's ecology, Flink can be well combined with other components in the ecology besides itself. Generally speaking, there are input and output aspects.

In the middle part, which has been introduced above, the home page looks at both sides, in which the green background is the streaming scene and the blue background is the batch scene.

Enter Connectors (left part)

Stream processing methods: including Kafka (message queue), AWS kinesis (real-time data flow service), RabbitMQ (message queue), NIFI (data pipeline), Twitter (API)

Batch processing: including HDFS (distributed file system), HBase (distributed database), Amazon S3 (file system), MapR FS (file system), ALLuxio (memory-based distributed file system)

Output Connectors (right side)

Stream processing methods: including Kafka (message queue), AWS kinesis (real-time data flow service), RabbitMQ (message queue), NIFI (data pipeline), Cassandra (NOSQL database), ElasticSearch (full-text search), HDFS rolling file (scrolling file)

Batch processing: including HBase (distributed determinant database), HDFS (distributed file system)

Introduction to 4.Flink stream processing mode

There are mainly two kinds of flow processing in Spark, one is dimensional batch processing, if there is no requirement for the time in the event, this way can meet many needs, the other is that Structed Streaming is based on an unbounded large table, the core API is Spark Sql, while Flink focuses on infinite flow and regards bounded flow as a special case of infinite flow, and the other two frameworks have state management.

Infinite flow processing

The input data is endless, like a stream of water, and the data processing starts at a certain point in the present or past, and goes on and on.

Finite flow processing

Processing data from one point in time, and then ending at another point in time, the input data itself may be limited (that is, the input data set will not grow with time), or it may be artificially set to a limited set for analysis purposes (that is, only analyzing events in a certain period of time) Flink encapsulates DataStream API for streaming processing and encapsulates DataSet API for batch processing. At the same time, Flink is also a batch-stream integrated processing engine, which provides Table API / SQL to unify batch and stream processing.

Stateful flow processing application

Based on SubTask, each SubTask processing will get the status and update the status.

Getting started with 5.Flink

Take the classic WordCount as an example, let's take a look at two batch stream processing cases of Flink. The case uses nc-lp as Source and console output as Sink, which is divided into Java and Scala versions.

Batch import org.apache.flink.api.scala._object WordCountScalaBatch {def main (args: Array [String]): Unit = {val inputPath = "E:\\ hadoop_res\\ input\\ a.txt" val outputPath = "E:\\ hadoop_res\\ output2" val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment val text: DataSet [String] = environment.readTextFile (inputPath) text .flatMap (_ .split ("\\ s +")) .map (_ ) .groupBy (0) .sum (1) .setParallelism (1) .writeAsCsv (outputPath, "\ n", " ") / / setParallelism (1) environment.execute (" job name ")}} Scala version of stream processing import org.apache.flink.streaming.api.scala._object WordCountScalaStream {def main (args: Array [String]): Unit = {/ / processing streaming data val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment val streamData: DataStream [String] = environment.socketTextStream (" linux121 ", 7777) val out: DataStream [(String) Int)] = streamData .flatMap (_ .split ("\\ s +")) .map ((_, 1)) .keyby (0) .sum (1) out.print () environment.execute ("test stream")}} batch package com.hoult.demo for Java version Import org.apache.flink.api.common.functions.FlatMapFunction;import org.apache.flink.api.java.ExecutionEnvironment;import org.apache.flink.api.java.operators.AggregateOperator;import org.apache.flink.api.java.operators.DataSource;import org.apache.flink.api.java.operators.FlatMapOperator;import org.apache.flink.api.java.operators.UnsortedGrouping;import org.apache.flink.api.java.tuple.Tuple2;import org.apache.flink.util.Collector Public class WordCountJavaBatch {public static void main (String [] args) throws Exception {String inputPath = "E:\\ hadoop_res\\ input\\ a.txt"; String outputPath = "E:\ hadoop_res\\ output"; / / get the running environment of flink ExecutionEnvironment executionEnvironment = ExecutionEnvironment.getExecutionEnvironment (); DataSource text = executionEnvironment.readTextFile (inputPath); FlatMapOperator wordsOne = text.flatMap (new SplitClz ()) / / hello,1 you,1 hi,1 him,1 UnsortedGrouping groupWordAndOne = wordsOne.groupBy (0); AggregateOperator wordCount = groupWordAndOne.sum (1); wordCount.writeAsCsv (outputPath, "\ n", "\ t"). SetParallelism (1); executionEnvironment.execute () } static class SplitClz implements FlatMapFunction {public void flatMap (String s, Collector collector) throws Exception {String [] strs = s.split ("\\ s +"); for (String str: strs) {collector.collect (new Tuple2 (str, 1));} Java version of the stream processing package com.hoult.demo;import org.apache.flink.api.common.functions.FlatMapFunction Import org.apache.flink.api.java.tuple.Tuple2;import org.apache.flink.streaming.api.datastream.DataStreamSource;import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;import org.apache.flink.util.Collector;public class WordCountJavaStream {public static void main (String [] args) throws Exception {StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment (); DataStreamSource dataStream = executionEnvironment.socketTextStream ("linux121", 7777) SingleOutputStreamOperator sum = dataStream.flatMap (new FlatMapFunction () {public void flatMap (String s, Collector collector) throws Exception {for (String word: s.split ("")) {collector.collect (new Tuple2 (word, 1));}) .keyby (0) .sum (1); sum.print () ExecutionEnvironment.execute ();}} at this point, I believe you have a deeper understanding of "how to implement Flink batch processing". You might as well do it in practice. Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.