E-commerce big data project: recommendation system in practice (1) Environment setup and log, population, and product analysis
https://blog.51cto.com/6989066/2325073
E-commerce big data project: recommendation system in practice - recommendation algorithms
https://blog.51cto.com/6989066/2326209
E-commerce big data project: recommendation system in practice - real-time and offline analysis
https://blog.51cto.com/6989066/2326214
This is an open-source project; please do not use it for any commercial purpose.
Source code address: https://github.com/asdud/Bigdata_project
This project is a big data e-commerce recommendation system built on Spark MLlib, written in Scala and Java. A Python-based version of the recommendation system will be covered in a separate blog post. Before reading this post, you should have the following background:
1. Basic Linux commands.
2. At least a high-school level of mathematics.
3. At least basic Java SE and Scala; Java EE is a plus (it is not required, but it helps you understand the project architecture more quickly).
4. A GitHub account, and at least familiarity with the concepts of git clone, fork, and branch.
5. Basic networking knowledge, at least the difference between a server and a client.
6. Basic knowledge of big data, ideally Hadoop, HDFS, MapReduce, Sqoop, HBase, Hive, Spark, and Storm.
7. Basic MySQL skills, at least simple insert, delete, update, and select operations.
If you are already an expert, this post probably will not teach you much; please at least leave some opinions and suggestions!
PC configuration requirements
1. CPU: any mainstream CPU is fine.
2. RAM: at least 8 GB; 16 GB or more is recommended, and 32 GB will not be wasted.
3. Hard disk: because the virtual machines demand high I/O throughput, a 256 GB or larger SSD is recommended (SATA3, or better yet NVMe). The system disk needs 60-100 GB, and the rest of the disk is dedicated to the virtual machines. Alternatively, use an Optane memory + mechanical hard disk setup.
Introduction to Optane memory and how to install it:
https://product.pconline.com.cn/itbk/diy/memory/1806/11365945.html
4. GPU: not required. But if you want to learn deep learning frameworks later, consider a GTX 1060 6GB or even an RTX 2080 Ti.
5. Network speed: the CentOS image is a bit over 8 GB, HDP is close to 7 GB, and the CDH packages add up to about 2.5 GB. Estimate how long the downloads will take, or consider copying them from someone else with a USB drive.
You can also consider using cloud servers such as Alibaba Cloud or Tencent Cloud.
Step 1:
Build a CentOS + HDP environment, or a CentOS + CDH environment. Both are open source, so there are no licensing concerns, and enterprises generally use one of these two stacks.
Here I use the CentOS + HDP stack.
Building an HDP environment for big data, using three nodes as an example (part one: deploying the master node and services)
https://blog.51cto.com/6989066/2173573
Building an HDP environment for big data, using three nodes as an example (part two: adding nodes, removing nodes, and deploying other services)
https://blog.51cto.com/6989066/2175476
You can also adopt the CentOS + CDH approach.
Setting up a CDH experimental environment, using the installation and configuration of three nodes as an example
https://blog.51cto.com/6989066/2296064
Development tools: Eclipse (Oxygen) or IntelliJ IDEA
Code implementation part
1. Data: user query logs
Source: Sogou Lab
https://www.sogou.com/labs/resource/q.php
I chose the mini version.
Introduction:
The search engine query log corpus contains about one month (June 2008) of web query log data from the Sogou search engine, covering a subset of web search queries and the pages users clicked. It is provided as a benchmark corpus for researchers studying the behavior of Chinese search engine users.
Format description:
The data format is:
access time \t user ID \t [query words] \t rank of the URL in the returned results \t sequence number of the user's click \t URL clicked by the user
The user ID is assigned automatically based on the browser's cookie information when the user accesses the search engine; that is, different queries entered in the same browser session correspond to the same user ID.
2. First create a new Maven project named MyMapReduceProject, and then update its pom.xml file.
pom.xml file address:
https://github.com/asdud/Bigdata_project/blob/master/MyMapReduceProject/pom.xml
The corresponding dependency jars will then be downloaded automatically.
(1) Case 1: Sogou log query analysis
Query the records whose URL ranked first in the search results but was the second link the user clicked.
Use MapReduce for the analysis and processing:
https://github.com/asdud/Bigdata_project/blob/master/MyMapReduceProject/src/main/java/day0629/sogou/SogouLogMain.java
https://github.com/asdud/Bigdata_project/blob/master/MyMapReduceProject/src/main/java/day0629/sogou/SogouLogMapper.java
Use Spark for the analysis and processing:
First, use Ambari to add the Spark2 service. Since it depends on other services such as Hive, you need to specify the JDBC driver for MySQL when setting up Ambari Server:
ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
The --jdbc-driver path should point to wherever your mysql-connector-java.jar is installed.
Then log in to spark-shell with:
spark-shell --master yarn-client
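For reference, here is a minimal spark-shell sketch of this query, not the project's actual code. It assumes the mini log has been uploaded to a hypothetical HDFS path and that the six fields are tab-separated as described above; the field indexes may need adjusting to the real file.
// Minimal sketch; the HDFS path is hypothetical.
val lines = sc.textFile("hdfs:///myproject/data/SogouQ.mini")
// Split each record into six tab-separated fields: time, user ID, query, rank, click order, URL.
val fields = lines.map(_.split("\t")).filter(_.length == 6)
// Keep records where the URL ranked first in the results (field 4) but was the user's second click (field 5).
val hits = fields.filter(f => f(3).trim == "1" && f(4).trim == "2")
println(hits.count())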
(2) Case 2: population analysis
This case assumes that we need to analyze the gender and height of a province's population (100 million people): count the number of men and women, and find the maximum and minimum heights among the men and among the women. The source file has three columns: ID, gender, and height (cm).
Use a Scala program to generate the test data (about 1.3 GB):
https://github.com/asdud/Bigdata_project/blob/master/MySparkProject/src/main/java/day0629/PeopleInfo/PeopleInfoFileGenerator.scala
Note: you can reduce the amount of data so that processing takes less time; the example uses 100 million records, which can be reduced to 10,000.
Then put the generated data on HDFS:
hdfs dfs -put sample_people_info.txt /myproject/data
Case implementation (using MapReduce):
https://github.com/asdud/Bigdata_project/tree/master/MyMapReduceProject/src/main/java/day0629/peopleinfo
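A Spark version of the same statistics can be sketched as follows. This is only an illustration: it assumes each line looks like "id gender height" with single-space separators and genders encoded as M/F, which may differ from what the generator script actually produces.
// Illustrative sketch only; separator and gender encoding are assumptions.
val people = sc.textFile("hdfs:///myproject/data/sample_people_info.txt")
  .map(_.split(" ")).filter(_.length == 3)
val male = people.filter(_(1) == "M").map(_(2).toInt)
val female = people.filter(_(1) == "F").map(_(2).toInt)
println(s"men: ${male.count()}, tallest: ${male.max()}, shortest: ${male.min()}")
println(s"women: ${female.count()}, tallest: ${female.max()}, shortest: ${female.min()}")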
(3) Case 3: e-commerce order sales data analysis
(4) Spark accumulators and broadcast variables
Because Spark computes in a distributed fashion, tasks do not share variables with one another. To support shared variables, Spark provides two kinds: accumulators and broadcast variables.
1. An accumulator is a distributed variable mechanism provided by Spark. Its principle is similar to MapReduce: make changes in a distributed way, then aggregate those changes. A common use of accumulators is to count events during job execution while debugging.
Example:
val accum = sc.accumulator(10, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
println(accum.value)
Final result: 20
2. Broadcast variables allow the programmer to cache a read-only variable on each machine rather than ship a copy of it with every task. They can be used, for example, to give every node a copy of a large input dataset efficiently. Spark also uses efficient broadcast algorithms to distribute the variable, reducing communication overhead.
Example: store user information in a broadcast variable.
case class UserInfo(userID: Int, userName: String, userAge: Int)
val broadcastVar = sc.broadcast(UserInfo(100, "Tom", 23))   // userName must be a String; "Tom" is just a placeholder
broadcastVar.value
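As a follow-up, a broadcast variable is normally read inside a transformation, so every task running on a node reuses the node's cached copy instead of receiving its own. A small illustration using the broadcastVar defined above (the user ID list is made up):
// Illustration only; the IDs are made up.
val userIds = sc.parallelize(Seq(100, 200, 300))
val matched = userIds.filter(id => id == broadcastVar.value.userID).count()
println(matched)   // 1, since only user ID 100 matches the broadcast UserInfo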
IV. Popular products in each region
(1) Module introduction
In running an e-commerce website, it is necessary to statistically analyze which products users in each region are interested in, in order to support business decisions.
Purpose:
Analyze the differing demand for products across regions and run differentiation studies; for example, Beijing users may prefer mobile phones while Shanghai users prefer cars.
Guide product discount and promotion strategies.
(2) Requirements analysis
(1) How do we define the products that users care about?
Simple model: measure a product's popularity by the number of times users click on it.
Complex model: evaluate a product using combined data such as user clicks, purchases, and favorites.
Product popularity score = clicks * 2 + purchases * 5 + favorites * 3
where 2, 5, and 3 are the scoring weights.
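The score is straightforward to compute once the per-product action counts are known. Below is an illustrative Spark sketch only; the (product ID, action type) records and the action-type codes (1 = favorite, 2 = add to cart treated as the purchase signal, 3 = click) are assumptions for the example.
// Illustration only; the data and action-type codes are assumptions.
val actions = sc.parallelize(Seq(("p01", 3), ("p01", 3), ("p01", 1), ("p02", 2)))
// Weights from the formula above: click = 2, purchase = 5, favorite = 3.
val weight = Map(3 -> 2, 2 -> 5, 1 -> 3)
val scores = actions
  .map { case (productId, actionType) => (productId, weight.getOrElse(actionType, 0)) }
  .reduceByKey(_ + _)
scores.collect().foreach(println)   // (p01, 7), (p02, 5)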
(2) How do we obtain the region?
The region of an order can be obtained from the click log.
① The click log data comes from the logging system and is pulled with Flume; it can also be collected every 30 minutes.
② The order data comes from the database and is imported with Sqoop; T+1 (daily) import works, and every 30 minutes is also possible.
The database must use read/write separation.
Sqoop imports the data from the read replica, so the imported data lags slightly behind the live business database.
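For reference, a typical Sqoop import from the read replica into HDFS might look like the following; the host, database name, credentials, and target directory are all hypothetical.
# Hypothetical values throughout; adjust to your environment.
sqoop import \
  --connect jdbc:mysql://read-replica:3306/ecommerce \
  --username reader --password '******' \
  --table order_info \
  --target-dir /myproject/data/order_info \
  --fields-terminated-by ',' \
  -m 1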
(3) Further thinking: how do we filter out traffic from crawlers and click farms?
(3) Technical approach
Data acquisition logic (ETL):
E-commerce logs are generally stored on log servers and need to be pulled to HDFS with Flume.
Data cleaning logic:
Use MapReduce for the data cleaning.
Use Spark for the cleaning.
Analysis and calculation of popular products in each region:
Use Hive for the data analysis and processing.
Use Spark SQL for the data analysis and processing.
Question to think about: can MapReduce be used for this analysis and processing?
(4) Experimental data and explanation
Table Product (product information table)
Column name | Description | Data type | Null constraint
product_id | product ID | varchar(18) | Not null
product_name | product name | varchar(20) | Not null
marque | product model | varchar(10) | Not null
barcode | warehouse barcode | varchar | Not null
price | product price | double | Not null
brand_id | brand ID | varchar(8) | Not null
market_price | market price | double | Not null
stock | stock quantity | int | Not null
status | status | int | Not null
Note: status values are -1 = off the shelf, 0 = on the shelf, 1 = pre-sale.
Table Area_info (area information table)
Column name | Description | Data type | Null constraint
area_id | area ID | varchar(18) | Not null
area_name | area name | varchar(20) | Not null
Note: none.
Table order_info (order information table)
Column name | Description | Data type | Null constraint
order_id | order ID | varchar(18) | Not null
order_date | order date | varchar(20) | Not null
user_id | user ID | varchar(20) |
product_id | product ID | varchar(20) |
Note: none.
Table user_click_log (user click log table)
Column name | Description | Data type | Null constraint
user_id | user ID | varchar(18) | Not null
user_ip | user IP | varchar(20) | Not null
url | URL clicked by the user | varchar(200) |
click_time | click time | varchar(40) |
action_type | action type | varchar(40) |
area_id | area ID | varchar(40) |
Note: action_type values are 1 = favorite, 2 = add to cart, 3 = click.
Table area_hot_product (regional hot products table): the final result table
Column name | Description | Data type | Null constraint
area_id | area ID | varchar(18) | Not null
area_name | area name | varchar(20) | Not null
product_id | product ID | varchar |
product_name | product name | varchar(40) |
pv | number of visits | bigint |
Note: none.
(5) Technical implementation
① Use Flume to collect the user click logs.
Usually a shell script is used to run the log collection.
In more complex situations, a visual ETL tool can be used to control the Flume agents.
Below is the configuration file; pay attention to the HDFS port number.
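The original configuration file is not reproduced in this post, so the following is only a hedged sketch of what such a Flume agent configuration might look like. The agent name, spool directory, HDFS host, and port (8020 is a common NameNode port on HDP) are all assumptions.
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Read click-log files dropped into a spool directory (hypothetical path).
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/clicklog
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Write the events to HDFS; note the NameNode port in the path.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hdp01:8020/myproject/data/clicklog
a1.sinks.k1.hdfs.fileType = DataStream
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1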
② Data cleaning
We need to identify the product-click actions in the user click log.
Filter out records that do not have 6 fields.
Filter out records whose URL field is empty; in other words, keep only log records whose URL starts with http.
Implementation 1: use a MapReduce program for the data cleaning.
Implementation 2: use a Spark program for the data cleaning.
Note: if you do not want Spark to print too many logs during execution, you can use the following statements:
Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
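A minimal Spark sketch of the cleaning rules above might look like this. It assumes the raw click log is tab-separated with the URL as the third field, matching the user_click_log layout; the paths and delimiter are assumptions.
// Sketch only; paths, delimiter, and field positions are assumptions.
val raw = sc.textFile("hdfs:///myproject/data/clicklog")
val cleaned = raw.map(_.split("\t"))
  .filter(_.length == 6)                     // rule 1: keep only records with 6 fields
  .filter(f => f(2).startsWith("http"))      // rule 2: keep records whose URL starts with http
cleaned.map(_.mkString("\t")).saveAsTextFile("hdfs:///myproject/data/cleandata")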
③ Regional hot product statistics: based on Hive and Spark SQL
Method 1: use Hive for the statistics.
Method 2: use Spark SQL for the statistics.
The SQL executed in Spark SQL:
select a.area_id, a.area_name, p.product_id, p.product_name, count(c.product_id)
from area a, product p, clicklog c
where a.area_id = c.area_id and p.product_id = c.product_id
group by a.area_id, a.area_name, p.product_id, p.product_name
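In a Spark application, this query could be run roughly as follows. The DataFrame variables and the temporary view names (area, product, clicklog) are assumptions; createOrReplaceTempView and spark.sql are standard Spark 2.x API calls.
// Sketch only; the DataFrames are assumed to have been built from the cleaned data.
areaDF.createOrReplaceTempView("area")
productDF.createOrReplaceTempView("product")
clickLogDF.createOrReplaceTempView("clicklog")
val result = spark.sql(
  "select a.area_id, a.area_name, p.product_id, p.product_name, count(c.product_id) as pv " +
  "from area a, product p, clicklog c " +
  "where a.area_id = c.area_id and p.product_id = c.product_id " +
  "group by a.area_id, a.area_name, p.product_id, p.product_name")
result.show()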