E-commerce big data project: recommendation system in practice (1) Environment setup and log, population, and product analysis
https://blog.51cto.com/6989066/2325073
E-commerce big data project: recommendation system in practice - recommendation algorithms
https://blog.51cto.com/6989066/2326209
E-commerce big data project: recommendation system in practice - real-time and offline analysis
https://blog.51cto.com/6989066/2326214
This is an open-source project; please do not use it for any commercial purpose.
Source code address: https://github.com/asdud/Bigdata_project
This project is a big data e-commerce recommendation system built on Spark MLlib, written in Scala and Java. A Python-based version of the recommendation system will be covered in a separate blog post. Before reading this post, you should have the following background:
1. Basic Linux commands.
2. At least a high-school level of mathematics.
3. At least basic Java SE and Scala; Java EE is a plus (it is not required, but it helps you understand the project architecture more quickly).
4. A GitHub account, and at least familiarity with the concepts of git clone, fork, and branch.
5. Basic networking knowledge, at least the difference between a server and a client.
6. Basic knowledge of big data, ideally Hadoop, HDFS, MapReduce, Sqoop, HBase, Hive, Spark, and Storm.
7. Basic MySQL skills, at least simple insert, delete, update, and select operations.
If you are already an expert, this post probably will not teach you much; please at least leave some opinions and suggestions!
PC configuration requirements
1. CPU: any mainstream CPU is fine.
2. RAM: at least 8 GB; 16 GB or more is recommended, and 32 GB will not be wasted.
3. Hard disk: because the virtual machines demand high I/O throughput, a 256 GB or larger SSD is recommended (SATA3, or better yet NVMe). The system disk needs 60-100 GB, and the rest of the disk is dedicated to the virtual machines. Alternatively, use an Optane memory + mechanical hard disk setup.
Introduction to Optane memory and how to install it:
https://product.pconline.com.cn/itbk/diy/memory/1806/11365945.html
4. GPU: not required. But if you want to learn deep learning frameworks later, consider a GTX 1060 6GB or even an RTX 2080 Ti.
5. Network speed: the CentOS image is a bit over 8 GB, HDP is close to 7 GB, and the CDH packages add up to about 2.5 GB. Estimate how long the downloads will take, or consider copying them from someone else with a USB drive.
You can also consider using cloud servers such as Alibaba Cloud or Tencent Cloud.
Step 1:
Build a CentOS + HDP environment, or a CentOS + CDH environment. Both are open source, so there are no licensing concerns, and enterprises generally use one of these two stacks.
Here I use the CentOS + HDP stack.
Building an HDP environment for big data, using three nodes as an example (part one: deploying the master node and services)
https://blog.51cto.com/6989066/2173573
Building an HDP environment for big data, using three nodes as an example (part two: adding nodes, removing nodes, and deploying other services)
https://blog.51cto.com/6989066/2175476
You can also adopt the CentOS + CDH approach.
Setting up a CDH experimental environment, using the installation and configuration of three nodes as an example
https://blog.51cto.com/6989066/2296064
Development tools: Eclipse (Oxygen) or IntelliJ IDEA
Code implementation part
1. Data: user query logs
Source: Sogou Lab
https://www.sogou.com/labs/resource/q.php
I chose the mini version.
Introduction:
The search engine query log corpus contains about one month (June 2008) of web query log data from the Sogou search engine, covering a subset of web search queries and the pages users clicked. It is provided as a benchmark corpus for researchers studying the behavior of Chinese search engine users.
Format description:
The data format is:
access time \t user ID \t [query words] \t rank of the URL in the returned results \t sequence number of the user's click \t URL clicked by the user
The user ID is assigned automatically based on the browser's cookie information when the user accesses the search engine; that is, different queries entered in the same browser session correspond to the same user ID.
2. First create a new Maven project named MyMapReduceProject, and then update its pom.xml file.
pom.xml file address:
https://github.com/asdud/Bigdata_project/blob/master/MyMapReduceProject/pom.xml
The corresponding dependency jars will then be downloaded automatically.
(1) Case 1: Sogou log query analysis
Query the records whose URL ranked first in the search results but was the second link the user clicked.
Use MapReduce for the analysis and processing:
https://github.com/asdud/Bigdata_project/blob/master/MyMapReduceProject/src/main/java/day0629/sogou/SogouLogMain.java
https://github.com/asdud/Bigdata_project/blob/master/MyMapReduceProject/src/main/java/day0629/sogou/SogouLogMapper.java
Use Spark for the analysis and processing:
First, use Ambari to add the Spark2 service. Since it depends on other services such as Hive, you need to specify the JDBC driver for MySQL when setting up Ambari Server:
ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
The --jdbc-driver path should point to wherever your mysql-connector-java.jar is installed.
Then log in to spark-shell with:
spark-shell --master yarn-client
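For reference, here is a minimal spark-shell sketch of this query, not the project's actual code. It assumes the mini log has been uploaded to a hypothetical HDFS path and that the six fields are tab-separated as described above; the field indexes may need adjusting to the real file.
// Minimal sketch; the HDFS path is hypothetical.
val lines = sc.textFile("hdfs:///myproject/data/SogouQ.mini")
// Split each record into six tab-separated fields: time, user ID, query, rank, click order, URL.
val fields = lines.map(_.split("\t")).filter(_.length == 6)
// Keep records where the URL ranked first in the results (field 4) but was the user's second click (field 5).
val hits = fields.filter(f => f(3).trim == "1" && f(4).trim == "2")
println(hits.count())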
(2) Case 2: population analysis
This case assumes that we need to analyze the gender and height of a province's population (100 million people): count the number of men and women, and find the maximum and minimum heights among the men and among the women. The source file has three columns: ID, gender, and height (cm).
Use a Scala program to generate the test data (about 1.3 GB):
https://github.com/asdud/Bigdata_project/blob/master/MySparkProject/src/main/java/day0629/PeopleInfo/PeopleInfoFileGenerator.scala
Note: you can reduce the amount of data so that processing takes less time; the example uses 100 million records, which can be reduced to 10,000.
Then put the generated data on HDFS:
hdfs dfs -put sample_people_info.txt /myproject/data
Case implementation (using MapReduce):
https://github.com/asdud/Bigdata_project/tree/master/MyMapReduceProject/src/main/java/day0629/peopleinfo
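A Spark version of the same statistics can be sketched as follows. This is only an illustration: it assumes each line looks like "id gender height" with single-space separators and genders encoded as M/F, which may differ from what the generator script actually produces.
// Illustrative sketch only; separator and gender encoding are assumptions.
val people = sc.textFile("hdfs:///myproject/data/sample_people_info.txt")
  .map(_.split(" ")).filter(_.length == 3)
val male = people.filter(_(1) == "M").map(_(2).toInt)
val female = people.filter(_(1) == "F").map(_(2).toInt)
println(s"men: ${male.count()}, tallest: ${male.max()}, shortest: ${male.min()}")
println(s"women: ${female.count()}, tallest: ${female.max()}, shortest: ${female.min()}")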
(3) Case 3: e-commerce order sales data analysis
(4) Spark accumulators and broadcast variables
Because Spark computes in a distributed fashion, tasks do not share variables with one another. To support shared variables, Spark provides two kinds: accumulators and broadcast variables.
1. An accumulator is a distributed variable mechanism provided by Spark. Its principle is similar to MapReduce: make changes in a distributed way, then aggregate those changes. A common use of accumulators is to count events during job execution while debugging.
Example:
val accum = sc.accumulator(10, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
println(accum.value)
Final result: 20
2. Broadcast variables allow the programmer to cache a read-only variable on each machine rather than ship a copy of it with every task. They can be used, for example, to give every node a copy of a large input dataset efficiently. Spark also uses efficient broadcast algorithms to distribute the variable, reducing communication overhead.
Example: store user information in a broadcast variable.
case class UserInfo(userID: Int, userName: String, userAge: Int)
val broadcastVar = sc.broadcast(UserInfo(100, "Tom", 23))   // userName must be a String; "Tom" is just a placeholder
broadcastVar.value
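As a follow-up, a broadcast variable is normally read inside a transformation, so every task running on a node reuses the node's cached copy instead of receiving its own. A small illustration using the broadcastVar defined above (the user ID list is made up):
// Illustration only; the IDs are made up.
val userIds = sc.parallelize(Seq(100, 200, 300))
val matched = userIds.filter(id => id == broadcastVar.value.userID).count()
println(matched)   // 1, since only user ID 100 matches the broadcast UserInfo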
IV. Popular products in each region
(1) Module introduction
In running an e-commerce website, it is necessary to statistically analyze which products users in each region are interested in, in order to support business decisions.
Purpose:
Analyze the differing demand for products across regions and run differentiation studies; for example, Beijing users may prefer mobile phones while Shanghai users prefer cars.
Guide product discount and promotion strategies.
(2) Requirements analysis
(1) How do we define the products that users care about?
Simple model: measure a product's popularity by the number of times users click on it.
Complex model: evaluate a product using combined data such as user clicks, purchases, and favorites.
Product popularity score = clicks * 2 + purchases * 5 + favorites * 3
where 2, 5, and 3 are the scoring weights.
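The score is straightforward to compute once the per-product action counts are known. Below is an illustrative Spark sketch only; the (product ID, action type) records and the action-type codes (1 = favorite, 2 = add to cart treated as the purchase signal, 3 = click) are assumptions for the example.
// Illustration only; the data and action-type codes are assumptions.
val actions = sc.parallelize(Seq(("p01", 3), ("p01", 3), ("p01", 1), ("p02", 2)))
// Weights from the formula above: click = 2, purchase = 5, favorite = 3.
val weight = Map(3 -> 2, 2 -> 5, 1 -> 3)
val scores = actions
  .map { case (productId, actionType) => (productId, weight.getOrElse(actionType, 0)) }
  .reduceByKey(_ + _)
scores.collect().foreach(println)   // (p01, 7), (p02, 5)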
(2) How do we obtain the region?
The region of an order can be obtained from the click log.
① The click log data comes from the logging system and is pulled with Flume; it can also be collected every 30 minutes.
② The order data comes from the database and is imported with Sqoop; T+1 (daily) import works, and every 30 minutes is also possible.
The database must use read/write separation.
Sqoop imports the data from the read replica, so the imported data lags slightly behind the live business database.
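For reference, a typical Sqoop import from the read replica into HDFS might look like the following; the host, database name, credentials, and target directory are all hypothetical.
# Hypothetical values throughout; adjust to your environment.
sqoop import \
  --connect jdbc:mysql://read-replica:3306/ecommerce \
  --username reader --password '******' \
  --table order_info \
  --target-dir /myproject/data/order_info \
  --fields-terminated-by ',' \
  -m 1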
(3) Further thinking: how do we filter out traffic from crawlers and click farms?
(3) Technical approach
Data acquisition logic (ETL):
E-commerce logs are generally stored on log servers and need to be pulled to HDFS with Flume.
Data cleaning logic:
Use MapReduce for the data cleaning.
Use Spark for the cleaning.
Analysis and calculation of popular products in each region:
Use Hive for the data analysis and processing.
Use Spark SQL for the data analysis and processing.
Question to think about: can MapReduce be used for this analysis and processing?
(4) Experimental data and explanation
Table Product (product information table)
Column name | Description | Data type | Null constraint
product_id | product ID | varchar(18) | Not null
product_name | product name | varchar(20) | Not null
marque | product model | varchar(10) | Not null
barcode | warehouse barcode | varchar | Not null
price | product price | double | Not null
brand_id | brand ID | varchar(8) | Not null
market_price | market price | double | Not null
stock | stock quantity | int | Not null
status | status | int | Not null
Note: status values are -1 = off the shelf, 0 = on the shelf, 1 = pre-sale.
Table Area_info (area information table)
Column name | Description | Data type | Null constraint
area_id | area ID | varchar(18) | Not null
area_name | area name | varchar(20) | Not null
Note: none.
Table order_info (order information table)
Column name | Description | Data type | Null constraint
order_id | order ID | varchar(18) | Not null
order_date | order date | varchar(20) | Not null
user_id | user ID | varchar(20) |
product_id | product ID | varchar(20) |
Note: none.
Table user_click_log (user click log table)
Column name | Description | Data type | Null constraint
user_id | user ID | varchar(18) | Not null
user_ip | user IP | varchar(20) | Not null
url | URL clicked by the user | varchar(200) |
click_time | click time | varchar(40) |
action_type | action type | varchar(40) |
area_id | area ID | varchar(40) |
Note: action_type values are 1 = favorite, 2 = add to cart, 3 = click.
Table area_hot_product (regional hot products table): the final result table
Column name | Description | Data type | Null constraint
area_id | area ID | varchar(18) | Not null
area_name | area name | varchar(20) | Not null
product_id | product ID | varchar |
product_name | product name | varchar(40) |
pv | number of visits | bigint |
Note: none.
(5) Technical implementation
① Use Flume to collect the user click logs.
Usually a shell script is used to run the log collection.
In more complex situations, a visual ETL tool can be used to control the Flume agents.
Below is the configuration file; pay attention to the HDFS port number.
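The original configuration file is not reproduced in this post, so the following is only a hedged sketch of what such a Flume agent configuration might look like. The agent name, spool directory, HDFS host, and port (8020 is a common NameNode port on HDP) are all assumptions.
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Read click-log files dropped into a spool directory (hypothetical path).
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/clicklog
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Write the events to HDFS; note the NameNode port in the path.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hdp01:8020/myproject/data/clicklog
a1.sinks.k1.hdfs.fileType = DataStream
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1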
② Data cleaning
We need to identify the product-click actions in the user click log.
Filter out records that do not have 6 fields.
Filter out records whose URL field is empty; in other words, keep only log records whose URL starts with http.
Implementation 1: use a MapReduce program for the data cleaning.
Implementation 2: use a Spark program for the data cleaning.
Note: if you do not want Spark to print too many logs during execution, you can use the following statements:
Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
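A minimal Spark sketch of the cleaning rules above might look like this. It assumes the raw click log is tab-separated with the URL as the third field, matching the user_click_log layout; the paths and delimiter are assumptions.
// Sketch only; paths, delimiter, and field positions are assumptions.
val raw = sc.textFile("hdfs:///myproject/data/clicklog")
val cleaned = raw.map(_.split("\t"))
  .filter(_.length == 6)                     // rule 1: keep only records with 6 fields
  .filter(f => f(2).startsWith("http"))      // rule 2: keep records whose URL starts with http
cleaned.map(_.mkString("\t")).saveAsTextFile("hdfs:///myproject/data/cleandata")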
③ Regional hot product statistics: based on Hive and Spark SQL
Method 1: use Hive for the statistics.
Method 2: use Spark SQL for the statistics.
The SQL executed in Spark SQL:
select a.area_id, a.area_name, p.product_id, p.product_name, count(c.product_id)
from area a, product p, clicklog c
where a.area_id = c.area_id and p.product_id = c.product_id
group by a.area_id, a.area_name, p.product_id, p.product_name
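In a Spark application, this query could be run roughly as follows. The DataFrame variables and the temporary view names (area, product, clicklog) are assumptions; createOrReplaceTempView and spark.sql are standard Spark 2.x API calls.
// Sketch only; the DataFrames are assumed to have been built from the cleaned data.
areaDF.createOrReplaceTempView("area")
productDF.createOrReplaceTempView("product")
clickLogDF.createOrReplaceTempView("clicklog")
val result = spark.sql(
  "select a.area_id, a.area_name, p.product_id, p.product_name, count(c.product_id) as pv " +
  "from area a, product p, clicklog c " +
  "where a.area_id = c.area_id and p.product_id = c.product_id " +
  "group by a.area_id, a.area_name, p.product_id, p.product_name")
result.show()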