I. First acquaintance with Flink
Official website: https://flink.apache.org/
Apache Flink is a distributed, high-performance, highly available, and accurate open-source stream processing framework for data stream applications. It entered the Apache Incubator in 2014 and quickly became a top-level project of the ASF (Apache Software Foundation).
The core of Flink is a streaming dataflow execution engine written in Java and Scala, which provides data distribution, data communication, and fault tolerance for distributed computations over data streams.
Flink performs stateful computations over both unbounded data streams (real-time streams) and bounded data streams (batch processing). It can be deployed in a variety of cluster environments and computes quickly over data of any scale.
Flink natively supports iterative computation, memory management, and program optimization.
II. Introduction to the basic architecture of Flink
Specific components:
The architecture can be roughly divided into three parts: data input on the left, data output on the right, and Flink's data processing in the middle.
Flink supports consuming events (including real-time events) from message queues. Upstream systems continuously produce data into a message queue; Flink continuously consumes and processes that data and writes the results to downstream systems. This process never stops.
Data source:
1. Transactions: transaction data. For example, orders placed by users on e-commerce platforms are continuously written into message queues.
2. Logs: for example, error logs produced by web applications at runtime are continuously sent to a message queue; downstream Flink processing gives the operations team a basis for monitoring.
3. IoT: the Internet of Things. IoT terminal devices, such as Huawei and Xiaomi fitness bands, continuously generate data and write it to a message queue; downstream Flink processing can, for example, produce health reports.
4. Clicks: clickstream data. For example, the Taobao website embeds many data collection points (probes) in its pages; when a user clicks somewhere on a page, the details of that click behavior are collected. The stream of events generated by these user clicks is called a clickstream.
Data ingestion systems:
Flink supports both real-time (streaming) and batch input. Real-time streams come from messaging systems such as Kafka. Batch systems are plentiful: databases (such as traditional MySQL and Oracle), KV stores (such as HBase and MongoDB), and file systems (such as a local file system or the distributed file system HDFS).
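For concreteness, here is a minimal sketch of the ingestion side: reading a Kafka topic as an unbounded stream of strings. It assumes the flink-connector-kafka_2.11 artifact (matching the Flink 1.8.0 version used later in this article) is on the classpath; the broker address, group id, and topic name "events" are illustrative placeholders, not details from this article:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka connection settings (placeholder broker and consumer group)
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-demo");

        // consume the hypothetical "events" topic as an unbounded stream
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        events.print();
        env.execute("kafka source sketch");
    }
}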
Flink data processing:
During data processing, Flink can use Kubernetes (abbreviated K8s, the container orchestration engine open-sourced by Google), YARN, or Mesos for resource management and scheduling, and HDFS, S3, NFS, etc. for intermediate data storage.
Data output:
Flink can output processed data to downstream applications, write it to message queues (such as Kafka), or write it to databases, file systems, and KV stores.
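And a matching sketch of the output side, writing a stream back to Kafka; the broker address and the "results" topic are again placeholders, and a tiny in-memory stream stands in for real processed results:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class KafkaSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // toy in-memory stream standing in for real processed results
        DataStream<String> processed = env.fromElements("a:1", "b:2");

        // write every record to the hypothetical "results" topic
        processed.addSink(new FlinkKafkaProducer<>(
                "localhost:9092",   // broker list (placeholder)
                "results",          // target topic (placeholder)
                new SimpleStringSchema()));

        env.execute("kafka sink sketch");
    }
}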
III. Flink core component stack
At the bottom of the Flink stack is Deploy. Flink can run in Local mode, starting a single JVM; it can run in Standalone cluster mode; and it supports Flink on YARN, where Flink applications are submitted to and run directly on YARN. Flink can also run on GCE (Google Compute Engine) and EC2 (Amazon Elastic Compute Cloud).
Above Deploy sits Flink's core, the Runtime. On top of the Runtime, two core sets of APIs are provided: the DataStream API (stream processing) and the DataSet API (batch processing). On top of the core APIs, higher-level libraries and APIs are built, such as CEP (complex event processing on streams), the Table API and SQL, the FlinkML machine learning library, and Gelly for graph processing. SQL runs on top of both the DataStream API and the DataSet API.
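To make the DataStream/DataSet split concrete, here is a minimal batch word count on the DataSet API; the in-memory input is a toy illustration, not code from the original article:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCountSketch {
    public static void main(String[] args) throws Exception {
        // the DataSet API uses ExecutionEnvironment (batch), not StreamExecutionEnvironment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> text = env.fromElements("flink spark flink", "storm flink");

        DataSet<Tuple2<String, Integer>> counts = text
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split("\\s")) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                })
                .groupBy(0)  // group on the word field of the tuple
                .sum(1);     // sum the count field

        counts.print();  // for a DataSet, print() also triggers execution of the job
    }
}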
IV. The past and present of Flink
Key moments in Flink's development:
Born in 2009 as a research project of the Technical University of Berlin, originally called Stratosphere, it focused on batch computing in its early days.
In 2014, the Flink project was incubated and donated to Apache.
In 2015, it began to attract wide attention and stepped onto the big data stage.
In 2016, it came into wide use at Alibaba.
V. Why Flink?
The big data ecosystem is very large and full of excellent frameworks and components, so why is Flink so popular?
1. From a technical point of view, among today's big data compute engines, only Spark and Flink support both stream and batch processing (Storm supports only streaming). Spark's approach is to simulate stream computing on top of micro-batches; Flink, by contrast, simulates batch computing on top of stream computing. From the perspective of technological development, simulating streams with batches has technical limitations that may be hard to break through, while Flink's stream-based approach to batch is more extensible.
2. In terms of language, the large Java user base is also an important reason: Flink provides friendly, elegant, and fluent Java and Scala APIs and support.
3. Big companies act as weathervanes, and Alibaba's wholesale shift to Flink has undoubtedly been a catalyst. Today all of Alibaba's business lines, including all of its subsidiaries, have adopted a real-time computing platform based on Flink.
Mo Wen, a senior technical expert in Alibaba's Computing Platform Division, gave a talk at the Yunqi Conference titled "Why did Alibaba choose Apache Flink?". The performance of the framework is indeed excellent: when Flink was first adopted, Alibaba ran it on only a few hundred servers, but the deployment has since grown to tens of thousands of servers, a scale matched by only a handful of installations worldwide. The state data accumulated inside Alibaba on Flink has reached the PB level; more than a trillion records are processed on Alibaba's Flink computing platform every day, and at peak it handles more than 472 million events per second. The most typical application scenario is the Alibaba Double 11 real-time dashboard.
In fact, not only Alibaba: many front-line companies in China have invested heavily, in both people and money, in Flink-based real-time computing.
VI. Representatives of stream computing: comparing Flink, Spark Streaming, and Storm
Comparative analysis and suggestions:
If latency requirements are not strict, Spark Streaming is recommended: it offers rich high-level APIs, is easy to use, connects naturally with the other components in the Spark stack, and provides high throughput, simple deployment, a fairly capable UI, an active community, and fast responses to problems, which makes it well suited to streaming ETL. Spark's development momentum is plain to see, and its performance and features should keep improving.
If latency requirements are strict, Flink is worth trying. Flink is currently a very popular streaming system: as a native stream processor it guarantees low latency, is relatively mature in fault tolerance, is fairly simple to use and easy to deploy, and its momentum keeps growing, so the community's response to problems should also keep getting faster.
VII. Case demonstration (Java & Scala)
Maven dependencies (pom.xml):

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>1.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>1.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.8.0</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.7</version>
</dependency>

Java code:

package com.fwmagic.flink;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

import java.text.SimpleDateFormat;
import java.util.Date;

/**
 * Use Flink to compute word counts over a sliding window in real time and print the result.
 * First run: nc -lk 9000
 */
public class StreamingWindowWordCountJava {

    public static void main(String[] args) throws Exception {
        // port of the local socket source, default 9999
        final int port;
        try {
            final ParameterTool parameterTool = ParameterTool.fromArgs(args);
            port = parameterTool.getInt("port", 9999);
        } catch (Exception e) {
            System.err.println("No port specified. Please run 'SocketWindowWordCount --port <port>'");
            return;
        }

        // get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // connect to the socket to read the input data
        DataStreamSource<String> text = env.socketTextStream("localhost", port, "\n");

        // flatten each line into words, group by word, window, and count
        DataStream<WordWithCount> windowCount = text
                .flatMap(new FlatMapFunction<String, WordWithCount>() {
                    public void flatMap(String value, Collector<WordWithCount> out) {
                        String[] splits = value.split("\\s");
                        for (String word : splits) {
                            out.collect(new WordWithCount(word, 1L));
                        }
                    }
                })
                // group records that carry the same word
                .keyBy("word")
                // sliding window: every second, compute the result of the last 5 seconds
                .timeWindow(Time.seconds(5), Time.seconds(1))
                .sum("count");

        // print the data to the console with a parallelism of 1
        windowCount.print().setParallelism(1);

        // note: Flink programs are lazily evaluated; nothing runs until execute() is called
        env.execute("streaming word count");
    }

    /**
     * POJO that stores a word and the number of its occurrences.
     */
    public static class WordWithCount {
        public String word;
        public long count;

        public WordWithCount() {
        }

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            String date = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
            return date + " : {word='" + word + "', count=" + count + "}";
        }
    }
}

Scala code:

package com.fwmagic.flink

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.scala._

object StreamingWWC {

  def main(args: Array[String]): Unit = {
    val parameterTool: ParameterTool = ParameterTool.fromArgs(args)
    val port: Int = parameterTool.getInt("port", 9999)

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val text: DataStream[String] = env.socketTextStream("localhost", port)

    val wc: DataStream[WordCount] = text
      .flatMap(t => t.split(","))                 // split each line on commas
      .map(w => WordCount(w, 1))                  // wrap each word with an initial count
      .keyBy("word")                              // group by the word field
      .timeWindow(Time.seconds(5), Time.seconds(1))
      .reduce((a, b) => WordCount(a.word, a.count + b.count)) // or: .sum("count")

    wc.print().setParallelism(1)
    env.execute("word count streaming!")
  }
}

case class WordCount(word: String, count: Long)
VIII. Flink deployment
Take the local deployment mode as an example; deployment on YARN will be described later:
Download: http://apache.website-solution.net/flink/flink-1.8.0/flink-1.8.0-bin-scala_2.11.tgz
Decompress: tar -zxvf flink-1.8.0-bin-scala_2.11.tgz
Launch: bin/start-cluster.sh
View the page: http://localhost:8081/
Package and submit tasks:
mvn clean package
1. Submit the task through the web UI's "Submit new Job" page
2. Submit through the command line: bin/flink run -c com.fwmagic.flink.StreamingWindowWordCountJava examples/myjar/fwmagic-flink.jar --port 6666
Note: open the port before submitting the task: nc -lk 6666
Test:
Send a message and view the log in the web UI: TaskManagers -> click the task -> Stdout
Or view the log on the command line: tail -f log/flink-*-taskexecutor-*.out
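As a hypothetical check (not output from the article): with the job running, typing hello hello world into the nc -lk 6666 session should, within a few seconds, print lines in the WordWithCount.toString() format shown above, e.g.:

2019-06-03 12:00:01 : {word='hello', count=2}
2019-06-03 12:00:01 : {word='world', count=1}

Each fire of the 5-second sliding window re-emits the current counts, so the lines repeat once per second while the words remain inside the window.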
Stop the task:
1. Stop it from the web UI
2. Or from the command line: bin/flink cancel <jobId> (the job id is shown in the web UI or by bin/flink list)