Common query commands and usage of Hive

2025-01-20 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/02 Report --

In this issue, the editor walks through the common query commands and usage of Hive, using a web-server access log as the running example: loading data into HDFS, external tables, Parquet conversion, partitioning, bucketing, and ORC compression. I hope you can get something out of this article.

1. Transfer log files to HDFS

```bash
hdfs dfs -mkdir /user/hive/warehouse/original_access_logs_0104
hdfs dfs -put access.log /user/hive/warehouse/original_access_logs_0104
```

Check that the file has been copied correctly:

```bash
hdfs dfs -ls /user/hive/warehouse/original_access_logs_0104
```

2. Create a Hive external table over the log file

```sql
DROP TABLE IF EXISTS original_access_logs;

CREATE EXTERNAL TABLE original_access_logs (
  ip STRING,
  request_time STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^ ]*) ([^ ]*) ([^ ]*)" ([^ ]*) ([^ ]*) ([^ ]*) "([^"]*)"',
  'output.format.string' = '%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s')
LOCATION '/user/hive/warehouse/original_access_logs_0104';
```
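As a quick sanity check, the pattern in `input.regex` can be tried against a sample line in Python; the nine capture groups line up with the nine table columns. The sample log line below is an illustrative assumption, not taken from the actual access.log:

```python
import re

# Same pattern as input.regex, with one backslash layer removed
# (the Hive string literal escapes '\\' down to '\').
LOG_REGEX = r'([^ ]*) - - \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ ]*)" ([^ ]*) ([^ ]*) ([^ ]*) "([^"]*)"'

# Illustrative sample line in Apache access-log style.
sample = ('10.0.0.1 - - [20/Jan/2025:10:15:30 +0000] '
          '"GET /index.html HTTP/1.1" 200 1043 - "Mozilla/5.0"')

m = re.match(LOG_REGEX, sample)
# Nine groups map to: ip, request_time, method, url, http_version,
# code1, code2, dash, user_agent
print(m.groups())
```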

3. Convert the TEXT table to a PARQUET table

```sql
DROP TABLE IF EXISTS pq_access_logs;

CREATE TABLE pq_access_logs (
  ip STRING,
  request_time STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING,
  `timestamp` INT)
STORED AS PARQUET;

-- If the RegexSerDe is not on the classpath, add the contrib jar first, e.g.:
-- ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
-- ADD JAR /opt/cloudera/parcels/CDH/lib/hive/contrib/hive-contrib-2.1.1-cdh7.3.2.jar;

INSERT OVERWRITE TABLE pq_access_logs
SELECT
  ip,
  from_unixtime(unix_timestamp(request_time, 'dd/MMM/yyyy:HH:mm:ss z'), 'yyyy-MM-dd HH:mm:ss z'),
  method,
  url,
  http_version,
  code1,
  code2,
  dash,
  user_agent,
  unix_timestamp(request_time, 'dd/MMM/yyyy:HH:mm:ss z')
FROM original_access_logs;
```
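The `from_unixtime(unix_timestamp(...))` round trip above reformats the Apache-style timestamp and also stores the raw epoch seconds. The equivalent transformation can be sketched in Python (the sample timestamp is an illustrative assumption; `%d/%b/%Y:%H:%M:%S %z` roughly corresponds to Hive's `dd/MMM/yyyy:HH:mm:ss z`):

```python
from datetime import datetime

# Illustrative Apache-style timestamp, as it appears in request_time.
raw = '20/Jan/2025:10:15:30 +0000'
dt = datetime.strptime(raw, '%d/%b/%Y:%H:%M:%S %z')

# Reformatted value, comparable to from_unixtime's output,
# plus the epoch seconds stored in the `timestamp` column.
print(dt.strftime('%Y-%m-%d %H:%M:%S'))   # 2025-01-20 10:15:30
print(int(dt.timestamp()))
```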

4. Count the top 5 IPs by number of visits

```sql
SELECT ip, count(*) cnt
FROM pq_access_logs
GROUP BY ip
ORDER BY cnt DESC
LIMIT 5;
```
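The same top-N aggregation pattern can be sketched outside Hive, e.g. with Python's `collections.Counter` (the IP list here is made up for illustration):

```python
from collections import Counter

# Made-up visit log; each entry is one request's source IP.
ips = ['10.0.0.1'] * 5 + ['10.0.0.2'] * 3 + ['10.0.0.3'] * 4 + ['10.0.0.4']

# Equivalent of GROUP BY ip / ORDER BY cnt DESC / LIMIT 5
top5 = Counter(ips).most_common(5)
print(top5)
```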

Observe how the Hive job is split into MapReduce jobs during execution.

How to view the logs of a Hive job's execution.

## Demonstration: Partitioned Table

### Steps

1. Create a partitioned table

```sql
DROP TABLE IF EXISTS partitioned_access_logs;

CREATE EXTERNAL TABLE partitioned_access_logs (
  ip STRING,
  request_time STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING,
  `timestamp` INT)
PARTITIONED BY (request_date STRING)
STORED AS PARQUET;
```

2. Write the log table into the partitioned table, using a dynamic partition insert

```sql
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE partitioned_access_logs
PARTITION (request_date)
SELECT ip, request_time, method, url, http_version, code1, code2, dash, user_agent, `timestamp`,
       to_date(request_time) AS request_date
FROM pq_access_logs;
```

Default partition: records whose partition key does not match (e.g. is NULL) are placed in the `__HIVE_DEFAULT_PARTITION__` partition.
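The routing of rows during a dynamic-partition insert can be sketched in Python. The `key=value` directory naming mirrors Hive's convention; the sample rows are made up for illustration:

```python
# Hive's name for the partition that receives unmatched/NULL keys.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def partition_dir(request_date):
    """Return the partition subdirectory a row lands in."""
    if request_date is None:
        return f"request_date={HIVE_DEFAULT_PARTITION}"
    return f"request_date={request_date}"

# Made-up (ip, request_date) rows; the second has no parseable date.
rows = [("10.0.0.1", "2025-01-20"), ("10.0.0.2", None)]
dirs = [partition_dir(d) for _, d in rows]
print(dirs)
```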

3. Observe the partitioned table's directory structure

```bash
hdfs dfs -ls /user/hive/warehouse/partitioned_access_logs
```

## Demonstration: Bucketed Table

### Steps

1. Create a bucketed log table

Bucket by the first octet of the IP address, and sort within each bucket by request time.

```sql
DROP TABLE IF EXISTS bucketed_access_logs;

CREATE TABLE bucketed_access_logs (
  first_ip_addr INT,
  request_time STRING)
CLUSTERED BY (first_ip_addr)
SORTED BY (request_time)
INTO 10 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```

Note: if DISTRIBUTE BY and SORT BY are not written, the following Hive parameters must be set (no longer needed after 2.0, where they default to true):

```sql
SET hive.enforce.sorting = true;
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE bucketed_access_logs
SELECT cast(split(ip, '\\.')[0] AS INT) AS first_ip_addr, request_time
FROM pq_access_logs;
-- DISTRIBUTE BY first_ip_addr
-- SORT BY request_time
```

2. Observe the physical storage structure of the bucketed table

```bash
hdfs dfs -ls /user/hive/warehouse/bucketed_access_logs/
# Guess: how many files are there?

hdfs dfs -cat /user/hive/warehouse/bucketed_access_logs/000000_0 | head
hdfs dfs -cat /user/hive/warehouse/bucketed_access_logs/000001_0 | head
hdfs dfs -cat /user/hive/warehouse/bucketed_access_logs/000009_0 | head
# Can you see the bucketing rule?
```
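Hive places a row in bucket `(hash(key) & Integer.MAX_VALUE) % numBuckets`, and for a non-negative INT key the hash is the value itself, so here the bucket is simply the first octet modulo 10. A minimal Python sketch (treating the identity hash for INT keys as an assumption drawn from Hive's documented behavior):

```python
NUM_BUCKETS = 10

def bucket_for(first_ip_addr: int) -> int:
    # Mirrors Hive's bucket assignment for an INT clustering key:
    # (hash & Integer.MAX_VALUE) % numBuckets, with hash(int) == int.
    return (first_ip_addr & 0x7FFFFFFF) % NUM_BUCKETS

# The last decimal digit of the first octet decides the bucket.
print(bucket_for(192), bucket_for(10), bucket_for(172))
```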

## Demonstration: ORC Table Compression

1. Create a new ORC table for the access logs, with compression enabled when inserting data

```sql
DROP TABLE IF EXISTS compressed_access_logs;

CREATE TABLE compressed_access_logs (
  ip STRING,
  request_time STRING,
  method STRING,
  url STRING,
  http_version STRING,
  code1 STRING,
  code2 STRING,
  dash STRING,
  user_agent STRING,
  `timestamp` INT)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- SET hive.exec.compress.intermediate=true;
-- SET mapreduce.map.output.compress=true;

INSERT OVERWRITE TABLE compressed_access_logs
SELECT * FROM pq_access_logs;

DESCRIBE FORMATTED compressed_access_logs;
```

2. Compare with the original Parquet table, which has no compression enabled

Sizes:

The original text file is 38 MB.

```bash
hdfs dfs -ls /user/hive/warehouse/pq_access_logs/
```

Parquet, no compression: 4158592 bytes (4.1 MB)

```bash
hdfs dfs -ls /user/hive/warehouse/compressed_access_logs/
```

ORC with Snappy compression: 1074404 bytes (1.0 MB)

Compression ratio: roughly 4:1 (uncompressed Parquet : compressed ORC)

Note: enabling compression is recommended when archiving data. For read-heavy workloads, compression does not necessarily improve query performance, since the data must be decompressed on every read.
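The ratio can be checked directly from the byte counts reported above:

```python
# Sizes reported by `hdfs dfs -ls`, in bytes.
parquet_bytes = 4158592   # uncompressed Parquet
orc_bytes = 1074404       # ORC with Snappy

ratio = parquet_bytes / orc_bytes
print(f"{ratio:.2f}:1")   # roughly 4:1
```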

The above are the common Hive query commands and usage patterns shared here. If you have similar questions, you may find the analysis above a useful reference.
