Lesson 55: Practical Hive Analysis of Search Engine Data


1. Obtain the data

Sogou Labs provides logs of user queries against the Sogou search engine, available for download at:

http://download.labs.sogou.com/dl/q.html

This article uses the simplified version of the dataset.

The data format is as follows: one record per line, with tab-separated fields for access time, user session ID, query word, the rank of the result in the result list (S_SEQ), the sequence number of the user's click (C_SEQ), and the clicked URL.
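Based on the first record queried later in this article, a line looks roughly like this (fields are tab-separated; spacing below is widened for readability):

20111230000005  57375476989eea12893c0c3811607bcf  Qiyi HD  1  1  http://www.qiyi.com/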

2. Upload data to HDFS

Create an HDFS directory:

root@spark-master:~# hdfs dfs -mkdir -p /library/sougou

Upload the files:

root@spark-master:~# hdfs dfs -put SogouQ1.txt /library/sougou
root@spark-master:~# hdfs dfs -put SogouQ2.txt /library/sougou
root@spark-master:~# hdfs dfs -put SogouQ3.txt /library/sougou
root@spark-master:~#

3. Create a table using Hive

root@spark-master:/usr/local/hive/apache-hive-1.2.1/bin# ./hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!...]
SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-...]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:...hive-log4j.properties
hive> CREATE TABLE SOUGOU(ID STRING, WEBSESSION STRING, WORD STRING, S_SEQ INT, C_SEQ INT, WEBSITE STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
OK
Time taken: 1.995 seconds
hive>

4. Load the data

hive> LOAD DATA INPATH '/library/sougou/SogouQ1.txt' INTO TABLE sougou;
Loading data to table default.sougou
Table default.sougou stats: [numFiles=1, totalSize=108750574]
OK
Time taken: 1.101 seconds

At this point, if we look at the source directory again, SogouQ1.txt is gone. Where did the file go?

As you can see, importing data with LOAD DATA simply moves the file to the table's directory on HDFS (here /user/hive/warehouse/sougou/), which is why it is so fast.
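To confirm this, list both directories; a quick check using the same paths as above:

root@spark-master:~# hdfs dfs -ls /library/sougou                 # SogouQ1.txt is no longer listed here
root@spark-master:~# hdfs dfs -ls /user/hive/warehouse/sougou/    # the file now lives here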

Could we place SogouQ1.txt directly into /user/hive/warehouse/sougou/ on HDFS instead of using the LOAD statement?

For a simple, non-partitioned table like this one, Hive will in fact read files dropped straight into the table's directory (we rely on that below when re-uploading the transcoded file). LOAD DATA is still the preferred route, because it keeps the metastore's record of the table, such as its file count and size statistics, in step with the data files.
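If files are added directly, the stale table statistics can be recomputed by hand; a minimal sketch using Hive's standard ANALYZE command:

hive> ANALYZE TABLE sougou COMPUTE STATISTICS;    -- recomputes numFiles, totalSize, row counts, etc.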

5. Operate on the data

5.1 Compute the count

hive> select count(*) from sougou;
Query ID = root_20160314192407_792e772a-c225-4b37-b948-7050d6b529b4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0002, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0002/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-14 19:24:29,014 Stage-1 map = 0%, reduce = 0%
2016-03-14 19:24:47,137 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.14 sec
2016-03-14 19:25:04,563 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.38 sec
MapReduce Total cumulative CPU time: 9 seconds 380 msec
Ended Job = job_1457942575478_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 9.38 sec   HDFS Read: 108757501 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 380 msec
OK
1000000
Time taken: 58.603 seconds, Fetched: 1 row(s)

5.2 View data

hive> select * from sougou limit 5;
OK
20111230000005  57375476989eea12893c0c3811607bcf  ����  1  1  http://www.qiyi.com/
20111230000005  66c5bb7774e31d0a22278249b26bc83a  ����  3  1  http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007  b97920521c78de70ac38e3713f524b50  ����  1  1  http://www.bblianmeng.com/
20111230000008  6961d0c97fe93701fc9c0d861d096cd9  ����  1  1  http://lib.scnu.edu.cn/
20111230000008  f2f5a21c764aebde1e8afcc2871e086f  ����  2  1  http://proxyie.cn/
Time taken: 0.246 seconds, Fetched: 5 row(s)

The query words are garbled because the source file is GB2312-encoded, while Hadoop and Hive both use UTF-8. We transcode the file and upload it to HDFS again.
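Before transcoding, you can verify the encoding; a quick check, assuming the standard file utility is available on the machine:

root@spark-master:~# file -i SogouQ1.txt    # prints the MIME type and detected charset; anything other than utf-8 confirms the problem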

root@spark-master:~# iconv -t utf-8 -f gb2312 -c SogouQ1.txt > SogouQ1.txt.utf8
root@spark-master:~# rm SogouQ1.txt
root@spark-master:~# mv SogouQ1.txt.utf8 SogouQ1.txt
root@spark-master:~# hdfs dfs -rm /user/hive/warehouse/sougou/SogouQ1.txt
16/03/14 19:44:25 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hive/warehouse/sougou/SogouQ1.txt
root@spark-master:~# hdfs dfs -put SogouQ1.txt /user/hive/warehouse/sougou/
root@spark-master:~#

Check again

hive> select * from sougou limit 5;
OK
20111230000005  57375476989eea12893c0c3811607bcf  Qiyi HD  1  1  http://www.qiyi.com/
20111230000005  66c5bb7774e31d0a22278249b26bc83a  Xiu Xian Zhuan  3  1  http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007  b97920521c78de70ac38e3713f524b50  Ben Union  1  1  http://www.bblianmeng.com/
20111230000008  6961d0c97fe93701fc9c0d861d096cd9  Library of South China Normal University  1  1  http://lib.scnu.edu.cn/
20111230000008  f2f5a21c764aebde1e8afcc2871e086f  online proxy  2  1  http://proxyie.cn/
Time taken: 0.151 seconds, Fetched: 5 row(s)

Now the query words display correctly.

5.3 A more complex query

hive> select count(*) from sougou where s_seq=1 and c_seq=1 and website like '%baidu%';
Query ID = root_20160314194855_8c9aa844-e088-4695-942f-3579718962f6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0003, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0003/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-14 19:49:12,041 Stage-1 map = 0%, reduce = 0%
2016-03-14 19:49:33,174 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.94 sec
2016-03-14 19:49:48,672 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.55 sec
MapReduce Total cumulative CPU time: 11 seconds 550 msec
Ended Job = job_1457942575478_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 11.55 sec   HDFS Read: 114834192 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 550 msec
OK
59090
Time taken: 55.022 seconds, Fetched: 1 row(s)

Next, query the ranking of search words by number of clicks:

hive> select word, count(*) cnt from sougou group by word order by cnt desc limit 5;
Query ID = root_20160314202108_58aeca03-8ed6-4626-b15e-af6643c94107
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0007, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0007/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
...
2016-03-14 20:22:16,021 Stage-1 map = 100%, reduce = 68%, Cumulative CPU 20.27 sec
...
2016-03-14 20:22:23,721 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 26.9 sec
MapReduce Total cumulative CPU time: 26 seconds 900 msec
Ended Job = job_1457942575478_0007
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
Starting Job = job_1457942575478_0008, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0008/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0008
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
...
MapReduce Total cumulative CPU time: 16 seconds 880 msec
Ended Job = job_1457942575478_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 26.9 sec    HDFS Read: 114832713 HDFS Write: 15044297 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 16.88 sec   HDFS Read: 15048892 HDFS Write: 153 SUCCESS
Total MapReduce CPU Time Spent: 43 seconds 780 msec
OK
Baidu  7564
baidu  3652
father of Yan Ning, head of Guantao County  2388
4399 Mini Games  2119
...
Time taken: 140.18 seconds, Fetched: 5 row(s)
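The same pattern extends from search words to clicked sites; a sketch, not part of the original session, using Hive's built-in parse_url function to rank hosts instead:

hive> -- count clicks per website host; parse_url extracts the HOST component of each URL
hive> select parse_url(website, 'HOST') host, count(*) cnt
    > from sougou
    > group by parse_url(website, 'HOST')
    > order by cnt desc limit 5;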

6. External tables

The table we created in step 3 is an internal (managed) table. When an internal table is created, a directory with the same name as the table is created under /user/hive/warehouse, and when data is loaded, the source file is moved into that directory. When the table is dropped, the directory and its files are deleted along with it.
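You can check which kind of table you have; a quick check with Hive's standard DESCRIBE FORMATTED command (run before dropping the table):

hive> describe formatted sougou;    -- the Table Type field shows MANAGED_TABLE or EXTERNAL_TABLE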

hive> drop table sougou;
OK
Time taken: 0.983 seconds

View HDFS:

root@spark-master:~# hdfs dfs -ls /user/hive/warehouse/
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-03-14 17:10 /user/hive/warehouse/t1

Hive also provides another type of table, called an external table.

The table is created as follows:

hive> CREATE EXTERNAL TABLE SOUGOU(ID STRING, WEBSESSION STRING, WORD STRING, S_SEQ INT, C_SEQ INT, WEBSITE STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE LOCATION '/library/sougou/sougouExternal';
OK
Time taken: 0.123 seconds
root@spark-master:~# hdfs dfs -ls /user/hive/warehouse/
16/03/14 20:02:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-03-14 17:10 /user/hive/warehouse/t1
root@spark-master:~# hdfs dfs -ls /library/sougou/
16/03/14 20:02:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   3 root supergroup  217441417 2016-03-14 18:55 /library/sougou/SogouQ2.txt
-rw-r--r--   3 root supergroup 1086552775 2016-03-14 18:56 /library/sougou/SogouQ3.txt
drwxr-xr-x   - root supergroup          0 2016-03-14 20:01 /library/sougou/sougouExternal

The directory is created directly in the specified location.

Upload a file:

root@spark-master:~# hdfs dfs -put SogouQ1.txt /library/sougou/sougouExternal

Query data in Hive

hive> select count(*) from sougou;
Query ID = root_20160314200414_b514251b-58d3-40aa-a9ee-4a9cf5eef8f2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457942575478_0004, Tracking URL = http://spark-master:8088/proxy/application_1457942575478_0004/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457942575478_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
...
2016-03-14 20:04:41,458 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.66 sec
2016-03-14 20:04:52,341 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.26 sec
MapReduce Total cumulative CPU time: 7 seconds 260 msec
Ended Job = job_1457942575478_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.26 sec   HDFS Read: 114832746 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 260 msec
OK
1000000
Time taken: 39.823 seconds, Fetched: 1 row(s)

After the external table is dropped, the files on HDFS are not deleted:

hive> drop table sougou;
OK
Time taken: 0.363 seconds
root@spark-master:~# hdfs dfs -ls /library/sougou/sougouExternal/
16/03/14 20:16:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   3 root supergroup  114825752 2016-03-14 20:03 /library/sougou/sougouExternal/SogouQ1.txt
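Because the data files survive, access can be restored at any time by recreating the external table over the same location; a minimal sketch reusing the DDL from above:

hive> -- recreate the external table; the surviving SogouQ1.txt is picked up automatically
hive> CREATE EXTERNAL TABLE SOUGOU(ID STRING, WEBSESSION STRING, WORD STRING, S_SEQ INT, C_SEQ INT, WEBSITE STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE LOCATION '/library/sougou/sougouExternal';
hive> select count(*) from sougou;    -- should again return 1000000, with no reload needed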
