
Use Hadoop to count log data


Overview of user behavior logs

User behavior log:

All the behavior data generated every time a user visits the site: page accesses, browsing, searches, clicks, and so on. It is also called the user behavior track or the traffic log.

Why log user access behavior:

To statistically analyze page visits, to measure the stickiness of the website, and to train the website's recommendation system.

User behavior log generation channel:

The web server records web access logs; Ajax requests record access logs and other related logs.

The user behavior log is roughly as follows:

Access time; the client (UserAgent) used by the visitor; the visitor's IP address; the dwell time on a page; the time and place of access; the referring link address (referer); and access information such as session_id, module, and AppID.

The significance of user behavior log analysis:

User behavior logs are the eyes of a website: they show where users mainly come from and what content on the site they like. They are also its nerves, revealing things such as user loyalty, so that by analyzing the logs we can further optimize the site's layout and functions and improve the user experience. And they are its brain: the analysis results help divide the promotion budget and focus optimization on what the user group tends to care about.

Offline data processing architecture

Offline data processing process:

Data acquisition: Flume, for example, can be used to collect the data, writing the web logs into HDFS (a minimal Flume sketch follows this list).
Data cleaning: frameworks such as Spark, Hive, or MapReduce can be used; after cleaning, the data can be stored in HDFS or in Hive / Spark SQL tables.
Data processing: statistics and analysis for the corresponding business, according to our needs.
Result storage: the results of data processing can be stored in RDBMS or NoSQL databases.
Data visualization: the results can be displayed graphically as pie charts, bar charts, maps, line charts and so on, with tools such as ECharts, HUE, or Zeppelin.
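For the data acquisition step, a common pattern is a Flume agent that tails the web server's access log and writes the events into HDFS. The configuration below is only a rough sketch of that idea and is not from the original article; the agent name, the log path, and the HDFS address are all assumptions:

# minimal Flume agent sketch; agent name, paths, and addresses are assumed for illustration
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# tail the web server access log (path is an assumption)
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /var/log/nginx/access.log
agent1.sources.r1.channels = c1

# buffer events in memory between source and sink
agent1.channels.c1.type = memory

# write the collected events to HDFS (namenode address is an assumption)
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.hdfs.path = hdfs://localhost:8020/flume/access-logs/%Y%m%d
agent1.sinks.k1.hdfs.fileType = DataStream
agent1.sinks.k1.hdfs.useLocalTimeStamp = true
agent1.sinks.k1.channel = c1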

Process diagram:

Project requirements

Requirement:

Count the number of visits made by each browser in the website access log.

The log snippet is as follows:

183.162.52.7 - - [10/Nov/2016:00:01:02 +0800] "POST /api3/getadv HTTP/1.1" 200 813 "www.xxx.com" - "cid=0&timestamp=1478707261865&uid=2871142&marking=androidbanner&secrect=a6e8e14701ffe9f6063934780d9e2e6d&token=f51e97d1cb1a9caac669ea8acc162b96" "mukewang/5.0.0 (Android 5.1.1; Xiaomi Redmi 3 Build/LMY47V),Network 2G/3G" "-" 10.100.134.244:80 200 0.027 0.027
10.100.0.1 - - [10/Nov/2016:00:01:02 +0800] "HEAD / HTTP/1.1" 301 0 "117.121.101.40" - "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.16.2.3 Basic ECC zlib/1.2.3 libidn/1.18 libssh2/1.4.2" "-" - 0.000

Function implementation

UserAgent parsing class test

First of all, we need to extract the browser information from each log line and then run the statistics per browser. Although I could implement this myself, I don't want to reinvent the wheel, so I found a small tool on GitHub that does the job. Its GitHub address is as follows:

https://github.com/LeeKemp/UserAgentParser

After downloading it locally through git clone or a browser, go to its home directory on the command line, then package it with Maven and install it into the local repository:

$ mvn clean package -DskipTests
$ mvn clean install -DskipTests

After the installation is complete, add the dependency and plug-in configuration to the project's pom.xml:

<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        <releases><enabled>true</enabled></releases>
        <snapshots><enabled>false</enabled></snapshots>
    </repository>
</repositories>

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>2.6.0-cdh6.7.0</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.kumkee</groupId>
        <artifactId>UserAgentParser</artifactId>
        <version>0.0.1</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.10</version>
        <scope>test</scope>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
        </plugin>
    </plugins>
</build>

Then we write a test case to try out the parsing class. Since we have never used this tool before, it is a good habit to test an unfamiliar tool before using it in a project:

package org.zero01.project;

import com.kumkee.userAgent.UserAgent;
import com.kumkee.userAgent.UserAgentParser;

/**
 * @program: hadoop-train
 * @description: UserAgent parsing test class
 * @author: 01
 * @create: 2018-04-01 22:43
 **/
public class UserAgentTest {

    public static void main(String[] args) {
        String source = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36";
        UserAgentParser userAgentParser = new UserAgentParser();
        UserAgent agent = userAgentParser.parse(source);

        String browser = agent.getBrowser();
        String engine = agent.getEngine();
        String engineVersion = agent.getEngineVersion();
        String os = agent.getOs();
        String platform = agent.getPlatform();
        boolean isMobile = agent.isMobile();

        System.out.println("browser: " + browser);
        System.out.println("engine: " + engine);
        System.out.println("engine version: " + engineVersion);
        System.out.println("operating system: " + os);
        System.out.println("platform: " + platform);
        System.out.println("is it a mobile device: " + isMobile);
    }
}

The output from the console is as follows:

browser: Chrome
engine: Webkit
engine version: 537.36
operating system: Windows 7
platform: Windows
is it a mobile device: false

As can be seen from the printed results, the relevant UserAgent information is obtained correctly, so we can use this tool in the project.
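As a side note, junit is already declared in the pom with test scope, so the same check could also be written as a small JUnit test instead of a main method. This is only a sketch of mine, not part of the original project; the class name and assertions are assumptions, and the expected values simply mirror the console output above:

package org.zero01.project;

import com.kumkee.userAgent.UserAgent;
import com.kumkee.userAgent.UserAgentParser;
import org.junit.Assert;
import org.junit.Test;

// hypothetical JUnit version of the check above; not from the original article
public class UserAgentParserJUnitTest {

    @Test
    public void testParseChromeOnWindows() {
        String source = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36";
        UserAgent agent = new UserAgentParser().parse(source);

        // expected values mirror the console output shown above
        Assert.assertEquals("Chrome", agent.getBrowser());
        Assert.assertEquals("Webkit", agent.getEngine());
        Assert.assertEquals("537.36", agent.getEngineVersion());
        Assert.assertFalse(agent.isMobile());
    }
}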

Use MapReduce to complete the required statistics

Create a class and write the following code:

package org.zero01.hadoop.project;

import com.kumkee.userAgent.UserAgent;
import com.kumkee.userAgent.UserAgentParser;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @program: hadoop-train
 * @description: use MapReduce to complete the statistics of browser visits
 * @author: 01
 * @create: 2018-04-02 14:20
 **/
public class LogApp {

    /**
     * Map: read the contents of the input file
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        LongWritable one = new LongWritable(1);
        private UserAgentParser userAgentParser;

        protected void setup(Context context) throws IOException, InterruptedException {
            userAgentParser = new UserAgentParser();
        }

        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // each line of log information received
            String line = value.toString();
            // the UserAgent field starts after the 7th double quote in the line
            String source = line.substring(getCharacterPosition(line, "\"", 7) + 1);
            UserAgent agent = userAgentParser.parse(source);
            String browser = agent.getBrowser();
            // output the processing result of map through context
            context.write(new Text(browser), one);
        }

        protected void cleanup(Context context) throws IOException, InterruptedException {
            userAgentParser = null;
        }
    }

    /**
     * Reduce: merge operation
     */
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values) {
                // find the total number of key occurrences
                sum += value.get();
            }
            // output the final statistical results
            context.write(key, new LongWritable(sum));
        }
    }

    /**
     * get the index position of the nth occurrence of a character
     *
     * @param value
     * @param operator
     * @param index
     * @return
     */
    private static int getCharacterPosition(String value, String operator, int index) {
        Matcher slashMatcher = Pattern.compile(operator).matcher(value);
        int mIdex = 0;
        while (slashMatcher.find()) {
            mIdex++;
            if (mIdex == index) {
                break;
            }
        }
        return slashMatcher.start();
    }

    /**
     * define Driver: encapsulate all the information of the MapReduce job
     */
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration configuration = new Configuration();

        // clean the output directory if it already exists
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
            System.out.println("output file exists, but it has been deleted");
        }

        // create the Job and set its name by parameter
        Job job = Job.getInstance(configuration, "LogApp");

        // set Job's processing class
        job.setJarByClass(LogApp.class);

        // set the input path for job processing
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // set map related parameters
        job.setMapperClass(LogApp.MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // set reduce related parameters
        job.setReducerClass(LogApp.MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // set the output path after job processing
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
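One optional tweak that is not in the original code: since the reduce step only adds up counts, the reducer could also be registered as a combiner, so partial sums are computed on the map side and less data is shuffled between map and reduce. A one-line sketch, assuming it is placed next to the other job.set* calls in the driver:

// optional (my addition, not in the original driver): reuse the reducer as a combiner
job.setCombinerClass(LogApp.MyReducer.class);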

Open the console under the project directory and enter the following command to package:

mvn assembly:assembly

Packaged successfully:

Upload the jar package to the server:

[root@localhost ~]# rz    # using the Xshell tool, so the rz command can be used to upload the file directly
[root@localhost ~]# ls | grep hadoop-train-1.0-jar-with-dependencies.jar    # check whether the upload succeeded
hadoop-train-1.0-jar-with-dependencies.jar
[root@localhost ~]#

Upload the pre-prepared log file to the HDFS file system:

[root@localhost ~]# hdfs dfs -put ./10000_access.log /
[root@localhost ~]# hdfs dfs -ls /10000_access.log
-rw-r--r--   1 root supergroup    2769741 2018-04-02 22:33 /10000_access.log
[root@localhost ~]#

Execute the following command to run the job:

[root@localhost ~]# hadoop jar ./hadoop-train-1.0-jar-with-dependencies.jar org.zero01.hadoop.project.LogApp /10000_access.log /browserout

Successful execution:

View the processing results:

[root@localhost ~]# hdfs dfs -ls /browserout
Found 2 items
-rw-r--r--   1 root supergroup          0 2018-04-02 22:42 /browserout/_SUCCESS
-rw-r--r--   1 root supergroup         56 2018-04-02 22:42 /browserout/part-r-00000
[root@localhost ~]# hdfs dfs -text /browserout/part-r-00000
Chrome  2775
Firefox 327
MSIE    78
Safari  115
Unknown 6705
[root@localhost ~]#
