Big data's collection, cleaning and processing: a complete case of offline data analysis using MapReduce

2025-01-30 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

[TOC]

1 Common methods of big data processing

There are currently two popular approaches to big data processing: offline processing and online processing. The basic processing architecture is as follows:

In Internet applications, whichever approach is used, the basic data source is log data; for a web application, this might be user access logs, user click logs, and so on.

If the analysis results are needed under strict latency requirements, online processing frameworks such as Spark or Storm can be used to analyze the data. A good example is the turnover display for Tmall's Singles' Day: the transaction volume on the display board is updated dynamically in real time, which calls for online processing.

If, on the other hand, you only need the analysis results and the processing time is not critical, offline processing is perfectly feasible: first collect the log data into HDFS, then analyze it further with MapReduce, Hive, and so on.

This article walks through the offline processing and analysis of the user access log (access.log) generated by an e-commerce website. Using MapReduce, it ultimately counts the UV and PV of visits to the site from each province on a given day.
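Before diving into the details, the core counting logic can be previewed with a minimal in-memory Python sketch (the real job runs as a Hadoop Mapper/Reducer pair; `ip_to_province` here is a hypothetical stand-in for a proper IP-geolocation lookup):

```python
from collections import defaultdict

def ip_to_province(ip):
    # Placeholder lookup: a real job would query an IP library
    # (e.g. an ip2region-style database) to resolve the province.
    return "Shanghai" if ip.startswith("222.") else "Unknown"

def map_phase(records):
    # Map step: emit one (province, mid) pair per log record.
    for rec in records:
        yield ip_to_province(rec["ip"]), rec["mid"]

def reduce_phase(pairs):
    # Reduce step: PV = total records per province,
    # UV = number of distinct mids (browsers) per province.
    pv = defaultdict(int)
    mids = defaultdict(set)
    for province, mid in pairs:
        pv[province] += 1
        mids[province].add(mid)
    return {p: (pv[p], len(mids[p])) for p in pv}
```

This is only a sketch of the aggregation semantics, not the MapReduce job itself; in Hadoop the distinct-mid set for UV would typically be built in the reducer for each province key.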

2 Production scenario and requirements

In our scenario, the deployment of the Web application is as follows:

That is, a typical Nginx load balancing + Keepalived high-availability cluster architecture, where user access logs are generated on each web server. The log format given by the business side is as follows:

1001 211.167.248.22 eecf0780-2578-4d77-a8d6-e2225e8b9169 40604 1 GET /top HTTP/1.0 40604 null null 1523188122767
1003 222.68.207.11 eecf0780-2578-4d77-a8d6-e2225e8b9169 20202 1 GET /tologin HTTP/1.1 504 null Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3 1523188123267
1001 61.53.137.50 c3966af9-8a43-4bda-b58c-c11525ca367b 0 1 GET /update/pass HTTP/1.0 302 null null 1523188123768
1000 221.195.40.145 1aa3b538-2f55-4cd7-9f46-6364fdd1e487 0 0 GET /user/add HTTP/1.1 200 null Mozilla/4.0 (Windows NT 5.2) 1523188124269
1000 121.11.87.171 8b0ea90a-77a5-4034-99ed-403c800263dd 20202 1 GET /top HTTP/1.0 408 null Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12 1523188120263

Each of its fields is described as follows:

appid ip mid userid login_type request status http_referer user_agent time

Among them:

- appid: the application id: web: 1000, android: 1001, ios: 1002, ipad: 1003
- mid: a unique id planted in the browser's cookie on the first visit; if it already exists it is not planted again, and it serves as the browser's unique identifier. Mobile and pad clients use the machine code directly.
- login_type: login status; 0 = not logged in, 1 = logged in
- request: a request line such as "GET /userList HTTP/1.1"
- status: the response status, mainly: 200 OK, 404 Not Found, 408 Request Timeout, 500 Internal Server Error, 504 Gateway Timeout, etc.
- http_referer: the URL from which the request was referred
- user_agent: browser information, for example: "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
- time: timestamp in long format, e.g. 1451451433818
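Parsing this format is slightly tricky because user_agent itself contains spaces. A hedged Python sketch of one way to split a line into the fields above (assuming the first five fields and the request line are space-separated, and that the last field is a 13-digit millisecond timestamp):

```python
import re

# Hypothetical parser for the access.log format described above.
LOG_PATTERN = re.compile(
    r"^(?P<appid>\d+)\s+"
    r"(?P<ip>\S+)\s+"
    r"(?P<mid>\S+)\s+"
    r"(?P<userid>\d+)\s+"
    r"(?P<login_type>\d)\s+"
    r"(?P<request>\S+ \S+ HTTP/\d\.\d)\s+"
    r"(?P<status>\d+)\s+"
    r"(?P<http_referer>\S+)\s+"
    r"(?P<user_agent>.*?)\s+"     # lazy: user_agent may contain spaces
    r"(?P<time>\d{13})$"
)

def parse_line(line):
    """Return a dict of named fields, or None if the line is malformed."""
    m = LOG_PATTERN.match(line.strip())
    return m.groupdict() if m else None
```

Returning None for malformed lines lets the cleaning step simply drop records that do not match, which is a common first pass in log cleaning.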

According to the log data within a given time range, the business side now has the following requirements:

Count the daily PV and UV of visits from each province.

3 Data collection: obtaining the raw data

Data collection is handled by the operations team. Flume is used to collect the user access logs, and the collected data is saved to HDFS. The structure is as follows:
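For reference, a minimal Flume agent along these lines might look like the following sketch (agent name, log path, and HDFS path are all illustrative, not taken from the actual deployment):

```
# Hypothetical Flume agent: tail the Nginx access log on each web
# server and deliver it to HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/nginx/access.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://nameservice/input/access/%Y/%m/%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

Partitioning the HDFS path by date (%Y/%m/%d) makes it easy for the downstream MapReduce job to select the log data for a given day.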
