This article introduces how to use Spark Streaming SQL to count PV and UV, walking through a practical example step by step. I hope you find it useful.
1. Background introduction
PV/UV statistics is a common scenario in streaming analytics. PV (page views) reflects a site's traffic and hot pages; for example, advertisers can use the PV of an advertising page to estimate the traffic it brings and the corresponding advertising revenue. Other scenarios require analyzing the visiting users themselves, such as users' page-click behavior, which is where UV (unique visitor) statistics come in.
Spark Streaming SQL combined with Redis makes PV/UV statistics very convenient. This article describes how to use Streaming SQL to consume the user access logs stored in Loghub, compute PV/UV statistics over one-minute windows, and store the results in Redis.
2. Preparation
Create a Hadoop cluster with E-MapReduce version 3.23.0 or later.
Download and compile the E-MapReduce-SDK package
git clone git@github.com:aliyun/aliyun-emapreduce-sdk.git
cd aliyun-emapreduce-sdk
git checkout -b master-2.x origin/master-2.x
mvn clean package -DskipTests
After compilation, emr-datasources_shaded_${version}.jar is generated in the assembly/target directory, where ${version} is the SDK version.
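For example, a quick way to confirm the artifact after the build (the path follows from the note above; the exact file name depends on the SDK version):

ls assembly/target/emr-datasources_shaded_*.jar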
Data source
This article uses Loghub as the data source. For log collection and log parsing, refer to the Log Service documentation.
3. Counting PV and UV
In the common case, you store the PV/UV statistics together with their statistics time in Redis. In some other business scenarios, only the latest result is kept, and the old data is continuously overwritten by new results. The workflow for the first case is described first.
3.1 Start the client
Start the streaming-sql client from the command line:
streaming-sql --master yarn-client --num-executors 2 --executor-memory 2g --executor-cores 2 \
  --jars emr-datasources_shaded_2.11-${version}.jar \
  --driver-class-path emr-datasources_shaded_2.11-${version}.jar
Alternatively, you can put the SQL statements in a file and run it with streaming-sql -f, as sketched below.
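A minimal sketch of this approach, assuming a hypothetical file name pv_uv.sql and the same resource settings as above:

# write the SQL statements (table definitions and stream job from the sections below) to a file
cat > pv_uv.sql <<'EOF'
-- put the table definitions and the stream job SQL here
EOF
# run the file with the streaming-sql client
streaming-sql --master yarn-client --num-executors 2 --executor-memory 2g --executor-cores 2 \
  --jars emr-datasources_shaded_2.11-${version}.jar \
  --driver-class-path emr-datasources_shaded_2.11-${version}.jar \
  -f pv_uv.sql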
3.2 Define the data tables
The data source table is defined as follows
CREATE TABLE loghub_source (
  user_ip STRING,
  __time__ TIMESTAMP
) USING loghub OPTIONS (
  sls.project=${sls.project},
  sls.store=${sls.store},
  access.key.id=${access.key.id},
  access.key.secret=${access.key.secret},
  endpoint=${endpoint}
);
The data source table contains two fields, user_ip and __time__, which represent the user's IP address and the Loghub time column, respectively. Set the values of the configuration items in OPTIONS according to your actual environment.
The result table is defined as follows
CREATE TABLE redis_sink USING redis OPTIONS (
  table='statistic_info',
  host=${redis_host},
  key.column='interval'
);
Here, statistic_info is the name of the table in which Redis stores the results, and interval corresponds to the interval field in the statistical results; set the ${redis_host} configuration item according to your actual environment.
3.3 Create the stream job
CREATE SCAN loghub_scan ON loghub_source USING STREAM OPTIONS (
  watermark.column='__time__',
  watermark.delayThreshold='10 second'
);

CREATE STREAM job
OPTIONS (checkpointLocation=${checkpoint_location})
INSERT INTO redis_sink
SELECT COUNT(user_ip) AS pv, approx_count_distinct(user_ip) AS uv, window.end AS interval
FROM loghub_scan
GROUP BY TUMBLING(__time__, interval 1 minute), window;
3.4 View the statistical results
In the final statistical results, a new entry is generated every minute: the key has the form <table name>:<interval>, and the value contains the pv and uv counts.
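As a quick sanity check, the entries can be inspected with redis-cli. This is only an illustrative sketch: it assumes the sink stores each row as a Redis hash keyed <table>:<interval>, as described above, and the timestamp and counts shown are made up.

# list the per-interval result keys
redis-cli -h ${redis_host} --scan --pattern 'statistic_info:*'
# read one interval's pv/uv values
redis-cli -h ${redis_host} HGETALL 'statistic_info:2019-07-08 11:01:00'
# example fields (illustrative values): pv -> "132", uv -> "57"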
3.5 Implement overwrite updates
Change the result table's key.column configuration item to a fixed column, as in the following definition
CREATE TABLE redis_sink USING redis OPTIONS (
  table='statistic_info',
  host=${redis_host},
  key.column='statistic_type'
);
The SQL for creating a stream job is changed to
CREATE STREAM job
OPTIONS (checkpointLocation='/tmp/spark-test/checkpoint')
INSERT INTO redis_sink
SELECT "PV_UV" AS statistic_type, COUNT(user_ip) AS pv, approx_count_distinct(user_ip) AS uv, window.end AS interval
FROM loghub_scan
GROUP BY TUMBLING(__time__, interval 1 minute), window;
In this case, Redis keeps a single entry that is overwritten every minute, and its value contains pv, uv, and interval.
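A similar hypothetical check for the overwrite case (the key follows from table='statistic_info' and the fixed statistic_type value "PV_UV"; the field values are illustrative):

# read the single, continuously overwritten result
redis-cli -h ${redis_host} HGETALL 'statistic_info:PV_UV'
# example fields (illustrative values): pv -> "132", uv -> "57", interval -> "2019-07-08 11:01:00"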
This concludes the introduction to counting PV and UV with Spark Streaming SQL. Thank you for reading.