This article introduces how to use Spark Streaming SQL to count PV and UV, walking through a practical example step by step. I hope you find it useful.
1. Background introduction
PV/UV statistics is a common scenario in streaming analytics. PV (page views) reflects a site's traffic and hot pages; for example, advertisers can use the PV of an advertising page to estimate the traffic it brings and the corresponding advertising revenue. Other scenarios require analyzing the visiting users themselves, such as users' page-click behavior, which is where UV (unique visitor) statistics come in.
Spark Streaming SQL combined with Redis makes PV/UV statistics very convenient. This article describes how to use Streaming SQL to consume the user access logs stored in Loghub, compute PV/UV statistics over one-minute windows, and store the results in Redis.
2. Preparation
Create a Hadoop cluster with E-MapReduce version 3.23.0 or later.
Download and compile the E-MapReduce-SDK package
git clone git@github.com:aliyun/aliyun-emapreduce-sdk.git
cd aliyun-emapreduce-sdk
git checkout -b master-2.x origin/master-2.x
mvn clean package -DskipTests
After compilation, emr-datasources_shaded_${version}.jar is generated in the assembly/target directory, where ${version} is the SDK version.
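For example, a quick way to confirm the artifact after the build (the path follows from the note above; the exact file name depends on the SDK version):

ls assembly/target/emr-datasources_shaded_*.jar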
Data source
This article uses Loghub as the data source. For log collection and log parsing, refer to the Log Service documentation.
3. Counting PV and UV
In the common case, you store the PV/UV statistics together with their statistics time in Redis. In some other business scenarios, only the latest result is kept, and the old data is continuously overwritten by new results. The workflow for the first case is described first.
3.1 Start the client
Start the streaming-sql client from the command line:
streaming-sql --master yarn-client --num-executors 2 --executor-memory 2g --executor-cores 2 \
  --jars emr-datasources_shaded_2.11-${version}.jar \
  --driver-class-path emr-datasources_shaded_2.11-${version}.jar
Alternatively, you can put the SQL statements in a file and run it with streaming-sql -f, as sketched below.
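A minimal sketch of this approach, assuming a hypothetical file name pv_uv.sql and the same resource settings as above:

# write the SQL statements (table definitions and stream job from the sections below) to a file
cat > pv_uv.sql <<'EOF'
-- put the table definitions and the stream job SQL here
EOF
# run the file with the streaming-sql client
streaming-sql --master yarn-client --num-executors 2 --executor-memory 2g --executor-cores 2 \
  --jars emr-datasources_shaded_2.11-${version}.jar \
  --driver-class-path emr-datasources_shaded_2.11-${version}.jar \
  -f pv_uv.sql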
3.2 Define the data tables
The data source table is defined as follows
CREATE TABLE loghub_source (
  user_ip STRING,
  __time__ TIMESTAMP
) USING loghub OPTIONS (
  sls.project=${sls.project},
  sls.store=${sls.store},
  access.key.id=${access.key.id},
  access.key.secret=${access.key.secret},
  endpoint=${endpoint}
);
The data source table contains two fields, user_ip and __time__, which represent the user's IP address and the Loghub time column, respectively. Set the values of the configuration items in OPTIONS according to your actual environment.
The result table is defined as follows
CREATE TABLE redis_sink USING redis OPTIONS (
  table='statistic_info',
  host=${redis_host},
  key.column='interval'
);
Here, statistic_info is the name of the table in which Redis stores the results, and interval corresponds to the interval field in the statistical results; set the ${redis_host} configuration item according to your actual environment.
3.3 Create the stream job
CREATE SCAN loghub_scan ON loghub_source USING STREAM OPTIONS (
  watermark.column='__time__',
  watermark.delayThreshold='10 second'
);

CREATE STREAM job
OPTIONS (checkpointLocation=${checkpoint_location})
INSERT INTO redis_sink
SELECT COUNT(user_ip) AS pv, approx_count_distinct(user_ip) AS uv, window.end AS interval
FROM loghub_scan
GROUP BY TUMBLING(__time__, interval 1 minute), window;
3.4 View the statistical results
In the final statistical results, a new entry is generated every minute: the key has the form <table name>:<interval>, and the value contains the pv and uv counts.
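As a quick sanity check, the entries can be inspected with redis-cli. This is only an illustrative sketch: it assumes the sink stores each row as a Redis hash keyed <table>:<interval>, as described above, and the timestamp and counts shown are made up.

# list the per-interval result keys
redis-cli -h ${redis_host} --scan --pattern 'statistic_info:*'
# read one interval's pv/uv values
redis-cli -h ${redis_host} HGETALL 'statistic_info:2019-07-08 11:01:00'
# example fields (illustrative values): pv -> "132", uv -> "57"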
3.5 Implement overwrite updates
Change the result table's key.column configuration item to a fixed column, as in the following definition
CREATE TABLE redis_sink USING redis OPTIONS (
  table='statistic_info',
  host=${redis_host},
  key.column='statistic_type'
);
The SQL for creating a stream job is changed to
CREATE STREAM job
OPTIONS (checkpointLocation='/tmp/spark-test/checkpoint')
INSERT INTO redis_sink
SELECT "PV_UV" AS statistic_type, COUNT(user_ip) AS pv, approx_count_distinct(user_ip) AS uv, window.end AS interval
FROM loghub_scan
GROUP BY TUMBLING(__time__, interval 1 minute), window;
In this case, Redis keeps a single entry that is overwritten every minute, and its value contains pv, uv, and interval.
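A similar hypothetical check for the overwrite case (the key follows from table='statistic_info' and the fixed statistic_type value "PV_UV"; the field values are illustrative):

# read the single, continuously overwritten result
redis-cli -h ${redis_host} HGETALL 'statistic_info:PV_UV'
# example fields (illustrative values): pv -> "132", uv -> "57", interval -> "2019-07-08 11:01:00"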
This concludes the introduction to counting PV and UV with Spark Streaming SQL. Thank you for reading.