This lesson covers the following:
Analysis of online blacklist filtering
Implementation of online blacklist filtering with Spark Streaming
An advertising billing system is an indispensable part of e-commerce. To prevent malicious ad clicks (suppose merchants A and B advertise on the same e-commerce platform and are competitors; if A uses a click robot to click B's ads maliciously, B's advertising budget will quickly be exhausted), ad clicks must be screened against a blacklist.
You can use leftOuterJoin to join the incoming click data with the blacklist data and then filter out the records that hit the blacklist.
This article mainly introduces the use of the transform function of DStream.
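Before the full streaming program, here is a minimal sketch of the join-and-filter idea on plain RDDs. It is an illustration only: the object name LeftOuterJoinSketch and the sample data are assumptions, not part of the program below.

import org.apache.spark.{SparkConf, SparkContext}

object LeftOuterJoinSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("LeftOuterJoinSketch").setMaster("local"))
    // Clicks as (name, record) pairs; the blacklist as (name, true) pairs.
    val clicksRDD = sc.parallelize(Seq(("hadoop", "124321 hadoop"), ("spark", "22555 spark")))
    val blackListRDD = sc.parallelize(Seq(("hadoop", true)))
    // leftOuterJoin keeps every click and attaches an Option[Boolean] that tells
    // us whether the name was found in the blacklist.
    val valid = clicksRDD.leftOuterJoin(blackListRDD)
      .filter(item => !item._2._2.getOrElse(false)) // drop clicks that hit the blacklist
      .map(item => item._2._1)                      // keep only the original click record
    valid.collect().foreach(println)                // prints: 22555 spark
    sc.stop()
  }
}

In the streaming program, the same RDD-level logic is applied to each batch of the DStream through transform, which takes a function from RDD to RDD.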
Spark Streaming code implementation
package com.dt.spark.sparkapps.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Online blacklist filtering program, developed in Scala and run on a Spark cluster.
 * Created by Limaoran on 2016-5-2.
 * Sina Weibo: http://weibo.com/ilovepains/
 *
 * Background: in an ad-click billing system, clicks from blacklisted users are
 * filtered out online to protect advertisers' interests, so that only valid ad
 * clicks are billed. The same pattern filters out invalid votes, ratings, or
 * traffic in anti-fraud scoring (or traffic) systems.
 * Implementation: use the transform API to program directly against the
 * underlying RDDs and perform a join operation.
 */
object OnlineBlackListFilter {
  def main(args: Array[String]) {
    /*
     * Step 1: create the Spark configuration object SparkConf and set the runtime
     * configuration of the Spark program. For example, setMaster sets the URL of
     * the Master of the Spark cluster the program connects to. Setting it to
     * "local" runs the program locally, which is especially suitable for beginners
     * on machines with very limited resources (for example, only 1 GB of memory).
     */
    val conf = new SparkConf()               // create the SparkConf object
    conf.setAppName("OnlineBlackListFilter") // the application name shown in the monitoring UI
    conf.setMaster("spark://Master:7077")    // run the program on the Spark cluster

    val ssc = new StreamingContext(conf, Seconds(30))

    /*
     * The blacklist data is prepared here in advance. In practice the blacklist
     * is usually dynamic, for example kept in Redis or a database, and generating
     * it often involves complex business logic that differs case by case. During
     * Spark Streaming processing, the workers can access the complete blacklist
     * for every batch.
     */
    val blackList = Array(("hadoop", true), ("mahout", true))
    val blackListRDD = ssc.sparkContext.parallelize(blackList, 8)

    val adsClickStream = ssc.socketTextStream("Master", 9999)

    /*
     * Each simulated ad-click record has the format "time name". The map
     * operation below produces pairs in the format (name, (time, name)).
     */
    val adsClientStreamFormatted = adsClickStream.map(ads => (ads.split(" ")(1), ads))

    adsClientStreamFormatted.transform(userClickRDD => {
      /*
       * The leftOuterJoin keeps all of the user-click RDD on the left while also
       * telling us, for each record, whether the clicked name is in the blacklist,
       * so that blacklisted records can then be dropped by filter.
       * Each joined element is a Tuple: (name, ((time, name), boolean)), where the
       * first element is the name and the second indicates whether a blacklist
       * value was found during the leftOuterJoin. If it was, the current ad click
       * is from the blacklist and must be filtered out; otherwise it is a valid click.
       */
      val joinedBlackListRDD = userClickRDD.leftOuterJoin(blackListRDD)
      val validClicked = joinedBlackListRDD.filter(joinedItem => {
        if (joinedItem._2._2.getOrElse(false)) {
          false
        } else {
          true
        }
      })
      validClicked.map(validClick => {
        validClick._2._1
      })
    }).print()

    /*
     * The valid data computed here is generally written to Kafka; the downstream
     * billing system pulls the valid data from Kafka for billing.
     */
    ssc.start()
    ssc.awaitTermination()
  }
}
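The closing comment notes that in practice the valid clicks are written to Kafka instead of printed, and the downstream billing system consumes them from there. A hedged sketch of that step using the kafka-clients producer API follows; the topic name validClicks, the broker address Master:9092, and the helper name writeToKafka are assumptions, not from the original program.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Replace .print() in the program above with a call such as writeToKafka(validClicks).
def writeToKafka(validClicks: DStream[String]): Unit = {
  validClicks.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // One producer per partition; a production job would pool or broadcast it.
      val props = new Properties()
      props.put("bootstrap.servers", "Master:9092") // assumed broker address
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      partition.foreach(record => producer.send(new ProducerRecord[String, String]("validClicks", record)))
      producer.close()
    }
  }
}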
Package the program and upload it to the Spark cluster.
On the spark-master node, start nc:
root@spark-master:~# nc -lk 9999
Run the OnlineBlackListFilter program:
root@spark-master:~# /usr/local/spark-1.6.0/bin/spark-submit --class com.dt.spark.sparkapps.streaming.OnlineBlackListFilter --master spark://Master:7077 ./sparkApps.jar
Enter data on the nc side:
root@spark-master:~# nc -lk 9999
22555 spark
124321 hadoop
5555 Flink
6666 HDFS
2222 Kafka
572231 Java
66662 mahout
Output of the Spark Streaming run:
16-05-02 08:28:00 INFO MapPartitionsRDD: Removing RDD 8 from persistence list
---
5555 Flink
6666 HDFS
572231 Java
22555 spark
2222 Kafka
As the output shows, hadoop and mahout, the names on the blacklist, have been filtered out.
More complex business logic rules can be added on top of this program to meet enterprise needs.
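For example, as the comment in the program notes, a real blacklist is usually dynamic (kept in Redis or a database). Because the function passed to transform runs on the driver for every batch, one simple way to pick up changes is to reload the blacklist at the start of each batch. A hedged sketch follows, reloading from a text file with one blacklisted name per line; the path /data/blacklist.txt is an assumption.

// Fragment replacing the fixed blackListRDD in the program above.
adsClientStreamFormatted.transform(userClickRDD => {
  // Reload the blacklist for every 30-second batch so that updates take
  // effect without restarting the application.
  val dynamicBlackListRDD = ssc.sparkContext
    .textFile("/data/blacklist.txt")      // assumed blacklist location
    .map(name => (name.trim, true))
  userClickRDD.leftOuterJoin(dynamicBlackListRDD)
    .filter(item => !item._2._2.getOrElse(false))
    .map(item => item._2._1)
}).print()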
Note:
1. DT Big Data DreamWorks WeChat official account: DT_Spark
2. IMF big data hands-on YY live broadcast (8:00 p.m.), channel number: 68917580
3. Sina Weibo: http://www.weibo.com/ilovepains