In this issue, the editor walks through common Hive query commands and their usage. The article is content-rich and explains each step from a practical point of view; hopefully you will take something away after reading it.
1. Transfer log files to HDFS
```bash
hdfs dfs -mkdir /user/hive/warehouse/original_access_logs_0104
hdfs dfs -put access.log /user/hive/warehouse/original_access_logs_0104
```
Check that the file has been copied correctly
```bash
hdfs dfs -ls /user/hive/warehouse/original_access_logs_0104
```
2. Create a Hive external table that maps to the log file
```sql
DROP TABLE IF EXISTS original_access_logs;
CREATE EXTERNAL TABLE original_access_logs (
  ip STRING,
  request_time STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^ ]*) ([^ ]*) ([^ ]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"',
  'output.format.string' = '%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s')
LOCATION '/user/hive/warehouse/original_access_logs_0104';
```
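To confirm that the RegexSerDe parses the raw lines as intended, a quick sanity check can be run against the new table (a minimal sketch; it assumes the table was created as above and that the hive-contrib jar is available):
```sql
-- Sample a few parsed rows; all-NULL columns usually mean the regex did not match the line format
SELECT ip, request_time, method, url, code1
FROM original_access_logs
LIMIT 5;
```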
3. Convert TEXT table to PARQUET table
```sql
DROP TABLE IF EXISTS pq_access_logs;
CREATE TABLE pq_access_logs (
  ip STRING,
  request_time STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING,
  `timestamp` INT)
STORED AS PARQUET;

-- ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar
-- ADD JAR /opt/cloudera/parcels/CDH/lib/hive/contrib/hive-contrib-2.1.1-cdh7.3.2.jar

INSERT OVERWRITE TABLE pq_access_logs
SELECT
  ip,
  from_unixtime(unix_timestamp(request_time, 'dd/MMM/yyyy:HH:mm:ss z'), 'yyyy-MM-dd HH:mm:ss z'),
  method,
  url,
  http_version,
  code1,
  code2,
  dash,
  user_agent,
  unix_timestamp(request_time, 'dd/MMM/yyyy:HH:mm:ss z')
FROM original_access_logs;
```
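A simple way to confirm that the conversion copied every record is to compare row counts between the source and target tables (a minimal check, assuming both tables exist as created above):
```sql
-- The two counts should be identical after the INSERT OVERWRITE
SELECT 'original' AS src, count(*) AS cnt FROM original_access_logs
UNION ALL
SELECT 'parquet' AS src, count(*) AS cnt FROM pq_access_logs;
```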
4. Count the top 5 IPs by number of visits
```sql
SELECT ip, count(*) AS cnt
FROM pq_access_logs
GROUP BY ip
ORDER BY cnt DESC
LIMIT 5;
```
Observe how the Hive job is split into MapReduce jobs and executed.
How to view the logs of a Hive job execution:
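One common approach (a hedged sketch; the application ID is a placeholder and the client log path varies by installation) is to pull the YARN application logs for the MapReduce job that Hive launched, or to tail the Hive client log:
```bash
# Fetch the logs of the MapReduce job launched by Hive
# (replace <application_id> with the id printed in the Hive console output)
yarn logs -applicationId <application_id> | less

# The Hive client/session log; /tmp/${USER}/hive.log is a common default location
tail -f /tmp/${USER}/hive.log
```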
## Demonstration - Partitioned Table
### Steps
1. Create a partitioned table
```sql
DROP TABLE IF EXISTS partitioned_access_logs;
CREATE EXTERNAL TABLE partitioned_access_logs (
  ip STRING,
  request_time STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING,
  `timestamp` INT)
PARTITIONED BY (request_date STRING)
STORED AS PARQUET;
```
2. Write the log table into the partitioned table using a dynamic partition insert
```sql
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE partitioned_access_logs
PARTITION (request_date)
SELECT ip, request_time, method, url, http_version, code1, code2, dash, user_agent, `timestamp`, to_date(request_time) AS request_date
FROM pq_access_logs;
```
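To verify which partitions the dynamic-partition insert produced (a small check against the table created above):
```sql
-- Each distinct request_date value becomes its own partition
SHOW PARTITIONS partitioned_access_logs;
```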
Default partition: records whose partition value does not match are placed into the __HIVE_DEFAULT_PARTITION__ partition.
3. Observe the directory structure of the partitioned table
```bash
hdfs dfs -ls /user/hive/warehouse/partitioned_access_logs
```
## Demonstration - Bucketed Table
### Steps
1. Create a bucketed log table
Bucket by the first octet of the IP address, and sort by request time within each bucket.
```sql
DROP TABLE IF EXISTS bucketed_access_logs;
CREATE TABLE bucketed_access_logs (
  first_ip_addr INT,
  request_time STRING)
CLUSTERED BY (first_ip_addr)
SORTED BY (request_time)
INTO 10 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- If DISTRIBUTE BY / SORT BY are not written, the following Hive parameters must be set
-- (not needed after Hive 2.0, where they default to true)
SET hive.enforce.sorting = true;
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE bucketed_access_logs
SELECT cast(split(ip, '\\.')[0] AS INT) AS first_ip_addr, request_time
FROM pq_access_logs
-- DISTRIBUTE BY first_ip_addr
-- SORT BY request_time
;
```
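One practical benefit of bucketing is cheap sampling: TABLESAMPLE can read a single bucket instead of scanning the whole table (a minimal sketch against the table created above):
```sql
-- Read only the first of the 10 buckets
SELECT first_ip_addr, request_time
FROM bucketed_access_logs
TABLESAMPLE (BUCKET 1 OUT OF 10 ON first_ip_addr)
LIMIT 10;
```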
2. Observe the physical storage layout of the bucketed table
```bash
hdfs dfs -ls /user/hive/warehouse/bucketed_access_logs/
# Guess: how many files are there?
hdfs dfs -cat /user/hive/warehouse/bucketed_access_logs/000000_0 | head
hdfs dfs -cat /user/hive/warehouse/bucketed_access_logs/000001_0 | head
hdfs dfs -cat /user/hive/warehouse/bucketed_access_logs/000009_0 | head
# Can you see the bucketing rule?
```
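The bucketing rule is the hash of the clustering column modulo the number of buckets; the query below makes the assignment visible (a sketch using Hive's built-in hash and pmod functions; bucket N corresponds roughly to the output file 00000N_0):
```sql
-- The bucket number is pmod(hash(first_ip_addr), 10); for INT values the hash is typically the value itself
SELECT DISTINCT first_ip_addr, pmod(hash(first_ip_addr), 10) AS bucket_no
FROM bucketed_access_logs
ORDER BY first_ip_addr
LIMIT 20;
```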
## Demonstration - ORC Table Compression
1. Create a new ORC table for the access log and enable compression when inserting data
```sql
DROP TABLE IF EXISTS compressed_access_logs;
CREATE TABLE compressed_access_logs (
  ip STRING,
  request_time STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING,
  `timestamp` INT)
STORED AS ORC
TBLPROPERTIES ("orc.compression" = "SNAPPY");

-- SET hive.exec.compress.intermediate=true;
-- SET mapreduce.map.output.compress=true;

INSERT OVERWRITE TABLE compressed_access_logs
SELECT * FROM pq_access_logs;

DESCRIBE FORMATTED compressed_access_logs;
```
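To get each table's total on-disk size in a single command (handy for the comparison in the next step), HDFS can report aggregate directory sizes (a small sketch using hdfs dfs -du):
```bash
# -s summarizes the whole directory, -h prints human-readable sizes
hdfs dfs -du -s -h /user/hive/warehouse/pq_access_logs/
hdfs dfs -du -s -h /user/hive/warehouse/compressed_access_logs/
```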
2. Compare with the original Parquet table, which does not have compression enabled

Sizes:

The original text file is 38 MB.

```bash
hdfs dfs -ls /user/hive/warehouse/pq_access_logs/
```

Parquet, no compression: 4,158,592 bytes (4.1 MB)

```bash
hdfs dfs -ls /user/hive/warehouse/compressed_access_logs/
```

ORC with Snappy compression: 1,074,404 bytes (1.0 MB)

Compression ratio: approximately 4:1 (Parquet raw : ORC compressed)
Note: enabling compression is recommended for archival and backup data. For read-heavy workloads, compression does not necessarily improve query performance, because the data must be decompressed on every read.
That covers the common Hive query commands and usage. If you have similar questions, hopefully the walkthrough above helps clarify them.