
Lesson 93: Spark Streaming updateStateByKey basic operations, a hands-on case study, and a look inside the source code

2025-02-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

Spark Streaming's DStream provides an updateStateByKey method. Its main purpose is to maintain a piece of state for each key over time and to keep updating that state through a user-supplied update function. For every new batch, Spark Streaming applies the update function to the state of each existing key (and also to each newly seen key). If the update function returns None for a key, that key's state is deleted. Note that the state can be a data structure of any type, which opens up unlimited possibilities for our computations.
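This per-key behavior can be sketched outside Spark with a small, self-contained Scala simulation. The `applyBatch` helper below is a hypothetical stand-in for what Spark does per batch, not Spark's actual implementation: each key's new values are handed to the update function together with its previous state, and a returned None removes the key's state.

```scala
object UpdateStateSim {
  // Update function with the same shape Spark expects:
  // (new values for the key in this batch, previous state) => new state.
  // Returning None deletes the key's state.
  def update(values: Seq[Int], state: Option[Int]): Option[Int] = {
    val total = values.sum + state.getOrElse(0)
    if (total == 0) None else Some(total)
  }

  // Hypothetical stand-in for Spark's per-batch state maintenance.
  def applyBatch(state: Map[String, Int],
                 batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    (state.keySet ++ grouped.keySet).flatMap { k =>
      update(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val s1 = applyBatch(Map.empty, Seq(("spark", 1), ("hadoop", 1), ("spark", 1)))
    println(s1)  // spark -> 2, hadoop -> 1
    val s2 = applyBatch(s1, Seq(("hadoop", 1)))
    println(s2)  // spark -> 2, hadoop -> 2
    // A batch that drives spark's total to 0 makes update return None,
    // so the key is removed from the state.
    val s3 = applyBatch(s2, Seq(("spark", -2)))
    println(s3)  // hadoop -> 2
  }
}
```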

Here comes the key point: continuously updating the state of each key necessarily involves state preservation and fault tolerance, so the checkpoint mechanism must be enabled. Note that a checkpoint can persist to the file system everything that needs to survive a failure, such as unprocessed data and the state the program has already accumulated.

Additional note: preserving and updating historical state is of great practical significance for streaming applications, for example in advertising (ad billing and evaluating the effectiveness of ad operations), hot-spot tracking, and heat maps.

To put it simply: if we do a word count, each batchInterval computes counts for a new batch of data. How can these counts be merged into the results of previous computations? updateStateByKey achieves exactly this.
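To make the merge concrete with the counts that appear later in this article, here is the arithmetic as a plain-Scala sketch (the `merge` helper is illustrative, not Spark code):

```scala
object MergeCounts {
  // Merge a batch's fresh counts into the accumulated state,
  // the way updateStateByKey does for a word count.
  def merge(state: Map[String, Int], batch: Map[String, Int]): Map[String, Int] =
    (state.keySet ++ batch.keySet).map { k =>
      k -> (state.getOrElse(k, 0) + batch.getOrElse(k, 0))
    }.toMap

  def main(args: Array[String]): Unit = {
    val batch1 = Map("scala" -> 1, "hive" -> 1, "spark" -> 2, "hadoop" -> 2, "Hbase" -> 1)
    val after1 = merge(Map.empty, batch1)
    // The article's second batch happens to have the same counts.
    val after2 = merge(after1, batch1)
    println(after2)  // scala -> 2, hive -> 2, spark -> 4, hadoop -> 4, Hbase -> 2
  }
}
```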

The function is defined as follows:

def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}

updateStateByKey takes a function with two parameters: Seq[V], the sequence of new values for the key in the current batch, and Option[S], the key's previous state. The function returns the key's new state.
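The update function used in the demo below has exactly this shape and can be written and exercised as plain Scala, independent of Spark:

```scala
object UpdateFuncDemo {
  // Same shape as the demo's lambda: sum the batch's new counts
  // and add the previously accumulated count, if any.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (currValues, preValue) => Some(currValues.sum + preValue.getOrElse(0))

  def main(args: Array[String]): Unit = {
    println(updateFunc(Seq(1, 1, 1), None))    // Some(3): first batch, no prior state
    println(updateFunc(Seq(1, 1), Some(3)))    // Some(5): second batch merged with state
  }
}
```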

Let's demonstrate it with an example:

package com.dt.spark.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by Administrator on 2016-5-3.
 */
object UpdateStateByKeyDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("UpdateStateByKeyDemo")
    val ssc = new StreamingContext(conf, Seconds(20))
    // To use the updateStateByKey method, a checkpoint directory must be set.
    ssc.checkpoint("/checkpoint/")
    val socketLines = ssc.socketTextStream("spark-master", 9999)

    socketLines.flatMap(_.split(","))
      .map(word => (word, 1))
      .updateStateByKey((currValues: Seq[Int], preValue: Option[Int]) => {
        val currValue = currValues.sum
        Some(currValue + preValue.getOrElse(0))
      })
      .print()

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }
}

Package the program and upload it to the Spark cluster.

Open nc and send test data:

root@spark-master:~# nc -lk 9999
hadoop,spark,hadoop,scala,hive,Hbase,spark

Run the Spark program:

root@spark-master:~# /usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class com.dt.spark.streaming.UpdateStateByKeyDemo --master spark://spark-master:7077 ./spark.jar

View the running results:

-------------------------------------------
Time: 1462282180000 ms
-------------------------------------------
(scala,1)
(hive,1)
(spark,2)
(hadoop,2)
(Hbase,1)

We enter some more data into nc:

root@spark-master:~# nc -lk 9999
hadoop,spark,hadoop,scala,hive,Hbase,spark

View the results again:

-------------------------------------------
Time: 1462282200000 ms
-------------------------------------------
(scala,2)
(hive,2)
(spark,4)
(hadoop,4)
(Hbase,2)

As you can see, the results of the two batches have been merged.

Note:

1. DT Big Data DreamWorks WeChat official account: DT_Spark

2. IMF 8:00 p.m. big data hands-on YY live broadcast channel: 68917580

3. Sina Weibo: http://www.weibo.com/ilovepains
