I. Overview of DataStream
1.1 What is DataStream?
DataStream is the API that Flink provides for stream computing and batch processing. It is an API encapsulation of the underlying stream-computing model, provided to make user programming convenient.
1.2 The DataStream execution model
A complete DataStream program is generally composed of three parts: Source, Transformation, and Sink. The Source is responsible for reading the data (from batch or streaming data sources), the Transformations carry out the conversion operations (that is, the actual business logic), and the Sink is responsible for outputting the final data (exporting the computed results).
1.3 DataStream program structure
Generally speaking, writing a Flink program with the DataStream API involves the following steps (a minimal skeleton follows the list):
1. Obtain an execution environment; (Execution Environment)
2. Load / create initial data; (Source)
3. Specify the conversion of these data; (Transformation)
4. Specify the location where the calculation results are placed; (Sink)
5. Trigger program execution (this step is required for stream processing, but not for batch jobs)
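Putting the five steps together, a minimal runnable sketch might look as follows; the socket host and port and the word-count logic are placeholders chosen for illustration:

package flinktest;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class SkeletonDemo {
    public static void main(String[] args) throws Exception {
        // 1. obtain an execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. load/create the initial data (Source); host and port are placeholders
        DataStream<String> text = env.socketTextStream("127.0.0.1", 9999);
        // 3. specify transformations on the data (Transformation)
        DataStream<Tuple2<String, Integer>> counts = text
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split(" ")) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                })
                .keyBy(0)
                .sum(1);
        // 4. specify where the results go (Sink)
        counts.print();
        // 5. trigger program execution
        env.execute("skeleton demo");
    }
}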
II. Using the DataStream API
2.1 Maven dependency configuration

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>SparkDemo</groupId>
    <artifactId>SparkDemoTest</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <scala.version>2.11.8</scala.version>
        <hadoop.version>2.7.3</hadoop.version>
        <scala.binary.version>2.11</scala.binary.version>
        <flink.version>1.6.1</flink.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.12</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.22</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.10_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.19</version>
                <configuration>
                    <skipTests>true</skipTests>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

2.2 Obtaining the execution environment (Execution Environment)
There are three types of execution environments:
1. StreamExecutionEnvironment.getExecutionEnvironment()
Creates an execution environment that represents the context of the currently executing program. If the program is invoked standalone, this method returns a local execution environment; if the program was submitted to a cluster from a command-line client, it returns the cluster's execution environment. In other words, getExecutionEnvironment decides which environment to return based on how the program is run, and it is the most common way to create an execution environment.
2. StreamExecutionEnvironment.createLocalEnvironment()
Returns a local execution environment; the default parallelism needs to be specified when calling it.
3. StreamExecutionEnvironment.createRemoteEnvironment()
Returns a cluster execution environment and submits the jar to a remote server. You need to specify the IP and port of the JobManager when calling it, as well as the jar package to run in the cluster.
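For illustration, the three calls look roughly as follows; the parallelism value, host, port, and jar path are placeholders:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// most common: picks a local or cluster environment based on how the program is run
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// explicit local environment; the default parallelism (here 2) is given at call time
StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment(2);

// cluster environment: JobManager host and port, plus the jar(s) to ship (placeholders)
StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment(
        "jobmanager-host", 6123, "/path/to/your-job.jar");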
2.3 Common data sources (Source)
2.3.1 File-based data sources
1. env.readTextFile(path)
Reads a text file that conforms to the TextInputFormat specification line by line and returns the results as Strings.
package flinktest;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExampleDemo {
    public static void main(String[] args) throws Exception {
        // 1. create the environment object
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. read a file as the data source
        DataStreamSource<String> fileSource = env.readTextFile("/tmp/test.txt");
        // 3. print the data
        fileSource.print();
        // 4. start executing the job
        env.execute("test file source");
    }
}
2. env.readFile(fileInputFormat, path)
Reads the file with the given FileInputFormat; the FileInputFormat here can be a custom class.
package flinktest;

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExampleDemo {
    public static void main(String[] args) throws Exception {
        // 1. create the environment object
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. read a file as the data source
        DataStreamSource<String> fileSource = env.readFile(
                new TextInputFormat(new Path("/tmp/test.txt")), "/tmp/test.txt");
        // 3. print the data
        fileSource.print();
        // 4. start executing the job
        env.execute("test file source");
    }
}

2.3.2 Socket-based data sources
socketTextStream(host, port)
package flinktest;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExampleDemo {
    public static void main(String[] args) throws Exception {
        // 1. create the environment object
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. read from a socket as the data source
        DataStreamSource<String> sourceSocket = env.socketTextStream("127.0.0.1", 1000);
        // 3. print the data
        sourceSocket.print();
        // 4. start executing the job
        env.execute("test socket source");
    }
}

2.3.3 Collection-based data sources
1. fromCollection(Collection)
Creates a data stream from a collection; all elements in the collection must be of the same type.

List<String> list = new ArrayList<>();
DataStreamSource<String> sourceCollection = env.fromCollection(list);

2. fromCollection(Iterator)
Creates a data stream from an Iterator, together with the class specifying the data type of the elements the iterator returns.
3. fromElements(Object...)
Creates a data stream from the given sequence of objects; all objects must be of the same type.
4. generateSequence(from, to)
Generates the sequence of numbers in the given interval, in parallel.
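Putting the collection-based sources together, a minimal sketch (inside main(), with env created as in the earlier examples, and java.util.Arrays / java.util.List imported; the element values are arbitrary):

// 1. fromCollection(Collection): all elements must share one type
List<String> data = Arrays.asList("a", "b", "c");
DataStreamSource<String> fromList = env.fromCollection(data);

// 3. fromElements(Object...): a stream from a fixed sequence of objects
DataStreamSource<Integer> fromElems = env.fromElements(1, 2, 3);

// 4. generateSequence(from, to): the numbers 0..999, generated in parallel
DataStreamSource<Long> sequence = env.generateSequence(0, 999);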
2.3.4 Custom data sources
env.addSource(SourceFunction)
Write a custom data source implementation class and register it with env via addSource. Typical scenarios are reading data from Kafka or from MySQL.
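A minimal sketch of a custom source, assuming a simple counter that emits one number per second (the class name and logic are invented for illustration):

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class CounterSource implements SourceFunction<Long> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0;
        while (running) {
            ctx.collect(counter++); // emit the next value
            Thread.sleep(1000);     // throttle: one element per second
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

// register it with the environment:
// DataStreamSource<Long> source = env.addSource(new CounterSource());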
2.4 Common outputs (Sink)
A data sink consumes data from a DataStream and forwards it to files, sockets, or external systems, or prints it. Flink ships with a variety of built-in output formats, encapsulated behind operations on DataStream.
1. writeAsText
Writes the elements line by line as strings (TextOutputFormat); the strings are obtained by calling each element's toString() method.
2. writeAsCsv
Writes tuples to a file as comma-separated values (CsvOutputFormat). The delimiters between rows and between fields are configurable; the value of each field comes from the object's toString() method.
3. print / printToErr
Prints the toString() value of each element to standard output or standard error. A prefix can optionally be added to the output to help distinguish different print calls, and if the parallelism is greater than 1 the output also carries a flag identifying which task produced it.
4. writeUsingOutputFormat
Custom file output method and base class (FileOutputFormat); supports custom object-to-bytes conversion.
5. writeToSocket
Writes the elements to a socket according to a SerializationSchema.
6. stream.addSink(SinkFunction)
Uses a custom sink class (a minimal sketch follows).
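As an illustration only, a minimal custom sink that just logs each element to standard output; a real sink would write to an external system such as Kafka or MySQL:

import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class StdoutSink implements SinkFunction<String> {
    @Override
    public void invoke(String value) throws Exception {
        // a real sink would forward the element to an external system here
        System.out.println("sink received: " + value);
    }
}

// attach it to a stream:
// stream.addSink(new StdoutSink());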
2.5 Common operators (Transformations)
2.5.1 map
DataStream → DataStream: takes one element, processes it, and produces one new element.
DataStream<Integer> dataStream = //...
dataStream.map(new MapFunction<Integer, Integer>() {
    // multiply each element by 2 and return it
    @Override
    public Integer map(Integer value) throws Exception {
        return 2 * value;
    }
});

2.5.2 flatMap
DataStream → DataStream: takes one element and produces zero, one, or more elements.
dataStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        // split the string and emit the pieces to the collector
        for (String word : value.split(" ")) {
            out.collect(word);
        }
    }
});

2.5.3 filter
DataStream → DataStream: evaluates a Boolean for each element and keeps the elements for which the Boolean is true. The following example keeps only the non-zero elements:
dataStream.filter(new FilterFunction<Integer>() {
    @Override
    public boolean filter(Integer value) throws Exception {
        return value != 0;
    }
});

2.5.4 keyBy
DataStream → KeyedStream: the input must be a tuple or a composite object with multiple attributes (for example, a Student class with fields such as name and age); in either case there must be fields that can serve as key and value. Partitions the stream by key so that records with the same key land in the same partition; internally this is implemented with hashing.
// there are different ways to specify the key
dataStream.keyBy("someKey"); // by field name; commonly used with composite objects
dataStream.keyBy(0);         // by position; commonly used with tuples

2.5.5 reduce
KeyedStream → DataStream: a rolling aggregation on a keyed data stream that merges the current element with the previous aggregated value to produce a new value. The returned stream contains the result of every aggregation step, not just the final result of the last aggregation.
keyedStream.reduce(new ReduceFunction<Integer>() {
    @Override
    public Integer reduce(Integer value1, Integer value2) throws Exception {
        return value1 + value2;
    }
});

2.5.6 fold
KeyedStream → DataStream
A rolling fold on a keyed data stream with an initial value; it merges the current element with the result of the previous fold to produce a new value. The returned stream contains the result of every fold step, not just the final result of the last fold.
DataStream<String> result = keyedStream.fold("start", new FoldFunction<Integer, String>() {
    @Override
    public String fold(String current, Integer value) {
        return current + "-" + value;
    }
});

For an input sequence (1, 2, ...) the emitted results are: start-1, start-1-2, ...

2.5.7 aggregations
KeyedStream → DataStream: rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value itself, while minBy returns the element whose value in the given field is minimal (the same applies to max and maxBy). The returned stream contains the result of every aggregation step, not just the final result of the last aggregation.
keyedStream.sum(0);
keyedStream.sum("key");
keyedStream.min(0);
keyedStream.min("key");
keyedStream.max(0);
keyedStream.max("key");
keyedStream.minBy(0);
keyedStream.minBy("key");
keyedStream.maxBy(0);
keyedStream.maxBy("key");
Note: the operators up to 2.5.4 can be applied directly to a stream because they are not aggregation operations. From 2.5.5 onwards, you will find that although aggregation operators can be applied directly to an unbounded stream, they emit the result of every aggregation step, which is often not what we want. In practice, aggregation operators such as reduce, fold, and the aggregations are used together with Window; only in combination with Window do they produce the desired results.
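For example, assuming dataStream is a DataStream<Tuple2<String, Integer>>, a sketch of reduce combined with a 5-second tumbling window (window size chosen arbitrarily):

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.windowing.time.Time;

// sum the counts per key over 5-second tumbling windows;
// one result per key is emitted when each window closes
dataStream
        .keyBy(0)
        .timeWindow(Time.seconds(5))
        .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> reduce(Tuple2<String, Integer> a,
                                                  Tuple2<String, Integer> b) {
                return new Tuple2<>(a.f0, a.f1 + b.f1);
            }
        })
        .print();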
2.5.8 connect, coMap, coFlatMap
1. connect
DataStream, DataStream → ConnectedStreams: connects two streams while keeping their types. After the two streams are connected, they are merely placed in the same stream; their internal data and form remain unchanged, and the two streams stay independent of each other.
DataStream<Integer> someStream = //...
DataStream<String> otherStream = //...
ConnectedStreams<Integer, String> connectedStreams = someStream.connect(otherStream);
2. coMap, coFlatMap
ConnectedStreams → DataStream: the map and flatMap operators dedicated to streams after connect.
connectedStreams.map(new CoMapFunction<Integer, String, Boolean>() {
    @Override
    public Boolean map1(Integer value) {
        return true;
    }
    @Override
    public Boolean map2(String value) {
        return false;
    }
});

connectedStreams.flatMap(new CoFlatMapFunction<Integer, String, String>() {
    @Override
    public void flatMap1(Integer value, Collector<String> out) {
        out.collect(value.toString());
    }
    @Override
    public void flatMap2(String value, Collector<String> out) {
        for (String word : value.split(" ")) {
            out.collect(word);
        }
    }
});

2.5.9 split and select
1. split
DataStream → SplitStream: splits one data stream into two or more data streams, giving each output stream an alias.
2. select
SplitStream → DataStream: obtains one or more DataStreams from a SplitStream.
SplitStream<Integer> split = someDataStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        List<String> output = new ArrayList<>();
        if (value % 2 == 0) {
            output.add("even");
        } else {
            output.add("odd");
        }
        return output;
    }
});
split.select("even").print();
split.select("odd").print();

2.5.10 union
DataStream → DataStream: unions two or more DataStreams into a new DataStream containing all the elements from all input streams. Note: if you union a DataStream with itself, each element appears twice in the new DataStream. This differs from connect, which does not merge the streams.

dataStream.union(otherStream1, otherStream2, ...);