E-commerce big data project (2)-Real-time analysis and offline analysis of recommendation system

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

E-commerce big data project-recommendation system practice (1) Environmental construction and log, population, commodity analysis

https://blog.51cto.com/6989066/2325073

E-commerce big data project-recommendation algorithm of recommendation system in practice

https://blog.51cto.com/6989066/2326209

Real-time analysis and offline analysis of e-commerce big data project-recommendation system

https://blog.51cto.com/6989066/2326214

V. Real-time analysis of Top IPs (Top users)

(1) Module introduction

In the operation of an e-commerce website, we need to find the top N IPs visiting the site. This is mainly used to audit for abnormal IPs, and also to analyze how the site is performing.

(2) Demand analysis

① How to count IPs

- IPs can be counted from the click log.

② How to analyze IPs

- Analyze each IP's visit volume in different time periods
- Analyze hot IPs

③ Thinking deeper: how to filter out crawlers and click farms

(3) Technical proposal

- Data acquisition logic (ETL): user access logs (click logs) are generally stored on the log server and need to be pulled by Flume.
- Click log cache: the click logs collected by Flume are cached in Kafka.
- Real-time analysis of Top user information: use Apache Storm, or use Spark Streaming.
- Note: in a Flume and Kafka environment deployed by Ambari, Flume and Kafka are already integrated, so the data collected by Flume can be sent directly to Kafka.

(4) Experimental data and explanation

- Table user_click_log (user click log table)

Column name   Description   Data type     Constraint
user_id       user ID       varchar(18)   Not null
user_ip       user IP       varchar(20)   Not null
url           URL clicked   varchar(200)
click_time    click time    varchar(40)
action_type   action name   varchar(40)
area_id       area ID       varchar(40)

Note on action_type: 1 = favorite, 2 = add to shopping cart, 3 = click
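Each row of this table corresponds to one line in the raw click log. As a minimal sketch (the comma delimiter and field order are assumptions, not taken from the original logs), a log line can be parsed into these columns like this:

```python
# Minimal sketch: parse one comma-separated click-log line into the
# user_click_log fields above. Field order and delimiter are assumptions.
FIELDS = ["user_id", "user_ip", "url", "click_time", "action_type", "area_id"]

def parse_click_line(line):
    """Split one log line into a dict keyed by the user_click_log columns."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        return None  # skip malformed lines
    return dict(zip(FIELDS, values))

record = parse_click_line("u1001,192.168.1.10,/item/123,2018-06-03 10:00:01,3,025")
print(record["user_ip"])      # 192.168.1.10
print(record["action_type"])  # 3 (a click, per the legend above)
```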

- Result table: hopip (hot IP table)

Column name   Description   Data type     Constraint
ip            IP            varchar(18)   Not null
pv            visit count   varchar(200)

(5) Technical realization

① Collect user click log data through Flume

- Create a configuration file for Flume.
- Note: the default Kafka broker port in an HDP cluster is 6667, not 9092.
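A sketch of what such a Flume configuration might look like: an exec source tailing the click log into a Kafka sink. The agent name, log path, and host are placeholders; only the 6667 broker port follows the note above.

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# placeholder path: point this at the real click log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/clicklog/user_click.log

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
# HDP exposes the Kafka broker on 6667, not 9092
a1.sinks.k1.kafka.bootstrap.servers = hdp21:6667

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```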

② Use Kafka to cache data

- Create a new topic named mytopic:

bin/kafka-topics.sh --create --zookeeper hdp21:2181 --replication-factor 1 --partitions 1 --topic mytopic

- View the newly created topic:

bin/kafka-topics.sh --list --zookeeper hdp21:2181

- Test: create a consumer to consume the data in mytopic. For --bootstrap-server, fill in the host name or IP instead of localhost:

bin/kafka-console-consumer.sh --bootstrap-server hdp21:6667 --topic mytopic --from-beginning

- Delete a topic:

bin/kafka-topics.sh --delete --zookeeper hdp21:2181 --topic mytopic

Note: this command only marks the topic as "deleted". To remove a topic completely, set delete.topic.enable=true and restart Kafka.

③ Real-time analysis of Top 5 user information: based on Storm and Spark Streaming

- Implementation method 1: use Apache Storm for real-time analysis.

Notes:
- In the pom file, the Storm version is 1.1.0.
- The pom file already integrates the dependencies of Storm with Kafka, Redis, JDBC and MySQL.

- Implementation method 2: use Spark Streaming for real-time analysis, combined with Spark SQL to analyze Top users (hot users).
- Add the required dependencies to the pom.xml of the Spark project from the previous chapter.
- Note: due to the Kafka version used, a Receiver is required to receive the Kafka data.
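The Storm and Spark Streaming code itself is not reproduced in this post. As a framework-free sketch, the core per-batch logic both implementations share (count hits per IP, keep the busiest five) can be written in plain Python; the sample IPs are invented:

```python
from collections import Counter

def top_n_ips(click_ips, n=5):
    """Return the n most frequent IPs in one batch as (ip, count) pairs."""
    return Counter(click_ips).most_common(n)

batch = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1", "10.0.0.2"]
print(top_n_ips(batch, n=2))  # [('10.0.0.1', 3), ('10.0.0.2', 2)]
```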

VI. Real-time analysis of blacklisted users

(1) Module introduction

In the operation of an e-commerce website, we need to analyze the top N users. This is mainly used to audit for abnormal users, and at the same time to identify loyal users.

(2) Demand analysis

① How to define an abnormal user?

- By access frequency: a user who visits more than 40 times per hour, with an average interval between visits of less than 4 seconds, is considered abnormal.
- Join against the abnormal-user library for exception analysis.

② How to compute what users access

- It can be computed from the click log.

③ Thinking deeper: how to filter out crawlers and click farms
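The 40-visits-per-hour / 4-second-average-gap rule above can be expressed directly. A minimal sketch, assuming timestamps are plain seconds and already restricted to a one-hour window:

```python
def is_abnormal(click_times):
    """click_times: sorted click timestamps (in seconds) within one hour.
    Abnormal = more than 40 visits AND an average gap under 4 seconds."""
    if len(click_times) <= 40:
        return False
    gaps = [b - a for a, b in zip(click_times, click_times[1:])]
    return sum(gaps) / len(gaps) < 4

print(is_abnormal([i * 2 for i in range(50)]))    # True: 50 clicks, 2 s apart
print(is_abnormal([i * 300 for i in range(10)]))  # False: only 10 clicks
```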

(3) Technical proposal

- Data acquisition logic (ETL): user access logs (click logs) are generally stored on the log server and need to be pulled by Flume.
- Click log cache: the click logs collected by Flume are cached in Kafka.
- Real-time analysis of blacklisted users: use Apache Storm, or use Spark Streaming.
- Note: in a Flume and Kafka environment deployed by Ambari, Flume and Kafka are already integrated, so the data collected by Flume can be sent directly to Kafka.

(4) Experimental data and explanation

- Table UserInfo (user information table)

Column name      Description                       Data type     Constraint
UserID           user ID                           varchar(18)   Not null
Username         name                              varchar(20)   Not null
Sex              gender                            varchar(10)   Not null
Birthday         date of birth                     datetime      Not null
Birthprov        birth province                    varchar(8)    Not null
Birthcity        birth city                        varchar(8)    Not null
Job              occupation                        varchar(20)   Not null
EducationLevel   education level                   int           Not null
SnnualSalary     annual salary                     double        Not null
Addr_prov        current residence province code   varchar(8)    Not null
Addr_city        current residence city code       varchar(8)    Not null
Address          address                           varchar(50)   Not null
Mobile           contact number                    varchar(11)   Not null
Mail             email                             varchar(30)   Not null
Status           user status                       int

- Table user_click_log (user click log table)

Column name   Description   Data type     Constraint
user_id       user ID       varchar(18)   Not null
user_ip       user IP       varchar(20)   Not null
url           URL clicked   varchar(200)
click_time    click time    varchar(40)
action_type   action name   varchar(40)
area_id       area ID       varchar(40)

Note on action_type: 1 = favorite, 2 = add to shopping cart, 3 = click

- Result table from the previous chapter: hopip (hot IP table)

Column name   Description   Data type     Constraint
user_ip       IP            varchar(18)   Not null
pv            visit count   varchar(200)

- Result table: black_list (blacklist table)

Column name   Description   Data type     Constraint
user_id       user ID       varchar(18)   Not null
user_ip       user IP       varchar(40)   Not null

(5) Technical realization

1. First, the requirements of Chapter 5 already implemented the Hot IP analysis; its result table hopip is shown above.

2. We only need to join the user information table to analyze blacklisted users within a certain period, for example: every 10 seconds, count the users with more than 10 accesses in the past 30 seconds. This requires a window function, which is available in both Apache Storm and Spark Streaming.
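A framework-free sketch of what that window computes (the real job would use Storm's or Spark Streaming's window primitives; the user IDs and timestamps below are invented):

```python
from collections import Counter

def blacklist_in_window(events, now, window=30, threshold=10):
    """events: (user_id, timestamp-in-seconds) pairs. Returns the user_ids
    with more than `threshold` accesses in the last `window` seconds."""
    counts = Counter(uid for uid, ts in events if now - window <= ts <= now)
    return {uid for uid, c in counts.items() if c > threshold}

# "u1" hits 12 times inside the window, "u2" only 3 times
events = [("u1", 100 + i) for i in range(12)] + [("u2", 105), ("u2", 110), ("u2", 115)]
print(blacklist_in_window(events, now=120))  # {'u1'}
```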

3. Option 1: use Storm's window function and write the result to MySQL

- Create the database and the corresponding tables in MySQL:

create database demo;
CREATE USER 'demo'@'%' IDENTIFIED BY 'Welcome_1';
GRANT ALL PRIVILEGES ON *.* TO 'demo'@'%';
FLUSH PRIVILEGES;
create table myresult(userid int primary key, PV int);

- Notes:
- The pom file from the previous chapter already integrates the dependencies of Storm with Kafka, Redis, JDBC and MySQL.
- If Storm's ready-made JdbcInsertBolt component is used, results will only ever be inserted into MySQL. Better: write your own Bolt component that inserts when the userid does not yet exist in MySQL and updates when it does.

After analyzing each user's PV, you can run the following query in MySQL to view the blacklisted users:

select userinfo.userid, userinfo.username, myresult.PV
from userinfo, myresult
where userinfo.userid = myresult.userid;
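The insert-or-update Bolt logic suggested above can be sketched as a standalone upsert. MySQL would use INSERT ... ON DUPLICATE KEY UPDATE; the sketch below uses sqlite3 and its equivalent ON CONFLICT clause only so the example is self-contained:

```python
import sqlite3

# Sketch of the custom Bolt's "insert if absent, update if present" write.
# sqlite3 stands in for MySQL here; the table mirrors myresult above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myresult (userid INTEGER PRIMARY KEY, PV INTEGER)")

def upsert_pv(userid, pv):
    conn.execute(
        "INSERT INTO myresult (userid, PV) VALUES (?, ?) "
        "ON CONFLICT(userid) DO UPDATE SET PV = excluded.PV",
        (userid, pv),
    )

upsert_pv(1, 5)
upsert_pv(1, 12)  # second call updates instead of failing
print(conn.execute("SELECT PV FROM myresult WHERE userid = 1").fetchone()[0])  # 12
```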

4. Option 2: use the window function of Spark Streaming.

VII. Advertising click traffic statistics

(1) Module introduction

In the operation of an e-commerce website, advertising is a very important module. We need to analyze how ads are clicked, mainly in order to optimize the click volume in each city.

(2) Demand analysis

① How to analyze ad click data?

- Analyze the advertising data through the log of users' clicks on ads.

② Compute the daily number of ad clicks in each province and city.

(3) Technical proposal

Offline analysis: the advertising logs need to be pulled to HDFS through Flume and then analyzed offline with MapReduce or Spark.

(4) Experimental data and explanation

- Advertising click log

Column name   Description   Data type     Constraint
userid        user ID       varchar(18)   Not null
ip            click IP      varchar(18)   Not null
click_time    click time    varchar(20)   Not null
url           ad link       varchar(20)
area_id       area ID       varchar(20)

- Table Area_info (area information table)

Column name   Description   Data type     Constraint
area_id       area code     varchar(18)   Not null
area_name     area name     varchar(20)   Not null

(5) Technical realization

① Use Flume to collect user click logs

- A shell script is usually used to run the log collection.
- In more complex situations, a visual ETL tool can control the Flume Agent.

The configuration file tails the ad log into HDFS; note the HDFS port number.
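A sketch of what such a configuration might look like: an exec source tailing the ad log into an HDFS sink. The agent name, log path, and NameNode address are placeholders; HDP clusters typically expose the NameNode on port 8020.

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# placeholder path: point this at the real ad click log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/adlog/adclick.log

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
# HDP NameNode RPC port is typically 8020 (9000 on vanilla Hadoop)
a1.sinks.k1.hdfs.path = hdfs://hdp21:8020/flume/%Y%m%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```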

② Use Hive to analyze ads offline

- Create the area table:

create external table areainfo
(areaid int, areaname string)
row format delimited fields terminated by ','
location '/input/project07';

- Create the ad click log table:

create external table adloginfo
(userid int, ip string, clicktime string, url string, areaid int)
row format delimited fields terminated by ','
location '/flume/20180603';

- Analyze the data with SQL:

select areainfo.areaname, adloginfo.url, adloginfo.clicktime, count(adloginfo.clicktime)
from adloginfo, areainfo
where adloginfo.areaid = areainfo.areaid
group by areainfo.areaname, adloginfo.url, adloginfo.clicktime;

③ Use Spark for offline ad log analysis

④ Use Pig for offline ad log analysis

- Load the area table:

areainfo = load '/input/areainfo.txt' using PigStorage(',') as (areaid:int, areaname:chararray);

- Load the ad log table:

adloginfo = load '/flume/20180603/userclicklog.txt' using PigStorage(',') as (userid:int, ip:chararray, clicktime:chararray, url:chararray, areaid:int);

- Group by url, area and click time:

adloginfo1 = group adloginfo by (url, areaid, clicktime);

- For each group, emit the group key and the total click count:

adloginfo2 = foreach adloginfo1 generate group, COUNT(adloginfo.clicktime);

- Join with the area table:

result = join areainfo by areaid, adloginfo2 by group.areaid;
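The aggregation the Hive and Pig scripts perform can be restated in plain Python, which makes the grouping key explicit; the sample rows and area names below are invented:

```python
from collections import Counter

area_info = {1: "Beijing", 2: "Shanghai"}  # area_id -> area_name
ad_log = [                                 # (userid, ip, clicktime, url, areaid)
    (1, "10.0.0.1", "2018-06-03", "/ad/1", 1),
    (2, "10.0.0.2", "2018-06-03", "/ad/1", 1),
    (3, "10.0.0.3", "2018-06-03", "/ad/2", 2),
]

# group by (areaid, url, clicktime) and count, as in the Hive GROUP BY
counts = Counter((areaid, url, t) for _, _, t, url, areaid in ad_log)

# join with the area table to attach the area name
result = [(area_info[a], url, t, n) for (a, url, t), n in counts.items()]
print(result)  # [('Beijing', '/ad/1', '2018-06-03', 2), ('Shanghai', '/ad/2', '2018-06-03', 1)]
```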
