Wednesday, 2019-2-20
Basic Syntax Summary of hive (hql)
-
Hive: the three kinds of table-creation statements, explained in detail: https://blog.csdn.net/qq_36743482/article/details/78383964
There are three ways to create tables in Hive:
1. Direct creation
For example: create table table_name (col_name data_type);
2. Create-as-select (CTAS)
For example: build the table from an AS query: the subquery's result is stored in the new table along with its data. Generally used for intermediate tables.
3. LIKE creation
Creates a table with exactly the same structure but no data. Also common for intermediate tables.
For more information, see the link above.
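A minimal sketch of the three methods side by side (table names t_src, t_ctas, t_like are illustrative):
create table t_src (id int, name string);             -- 1. direct creation
create table t_ctas as select id, name from t_src;    -- 2. CTAS: structure and data come from the query
create table t_like like t_src;                       -- 3. LIKE: same structure, no data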
Hive file storage formats (the four STORED AS options): https://blog.csdn.net/hereiskxm/article/details/42171325
The hive file storage formats fall into the following categories:
1. TEXTFILE
2. SEQUENCEFILE (sequence file)
3. RCFILE
4. ORCFILE (available since Hive 0.11)
Summary
TEXTFILE is the default format, used when no format is specified at table creation. When importing data, the data file is copied to HDFS as-is, without processing.
Tables stored as SEQUENCEFILE, RCFILE, or ORCFILE cannot load data directly from local files. The data must first be loaded into a TEXTFILE table and then copied into the SequenceFile/RCFile/ORCFile table with an insert from that table.
For more information, see the link above.
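A sketch of that two-step load, assuming illustrative table and file names:
create table staging_txt (id int, name string)
row format delimited fields terminated by ','
stored as textfile;
load data local inpath '/home/hadoop/data.txt' into table staging_txt;
create table final_orc (id int, name string) stored as orcfile;
insert overwrite table final_orc select * from staging_txt;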
Summary
1. TEXTFILE: the default format; data is not compressed, so disk overhead and parsing overhead are high.
It can be combined with Gzip or Bzip2 (hive detects the compression automatically and decompresses during query execution), but hive then cannot split the data,
so the data cannot be processed in parallel.
2. SEQUENCEFILE: a binary file format provided by the Hadoop API; easy to use, splittable, and compressible.
SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD compression gives a low compression ratio; BLOCK compression is generally recommended (see the sketch after the summary below).
3. RCFILE: a storage format combining row and column layouts. First, it partitions the data by rows, guaranteeing that a whole record stays in one block, so reading a record never requires reading multiple blocks. Second, within a block the data is stored by column, which benefits compression and fast column access.
4. ORCFILE: Optimized Row Columnar, an optimized successor to RCFile.
Summary:
Compared with TEXTFILE and SEQUENCEFILE, RCFILE, being column-oriented, costs more when loading data, but offers better compression and query response. Since a data warehouse is typically written once and read many times, RCFILE has a clear overall advantage over the other two formats.
set hive.cli.print.header=true;    -- print column headers in the hive CLI
CREATE TABLE page_view (viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY (dt STRING, country STRING)    -- partition columns
ROW FORMAT DELIMITED                          -- one line in the file is one record in the table
FIELDS TERMINATED BY '\001'                   -- fields separated by '\001' (Ctrl-A), not tab
STORED AS SEQUENCEFILE;                       -- stored as a sequence file
-- sequencefile
create table tab_ip_seq (id int, name string, ip string, country string)
row format delimited
fields terminated by ','
stored as sequencefile;
-- external table
CREATE EXTERNAL TABLE tab_ip_ext (id int, name string, ip STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/external/user';
-- load local data into a hive table (essentially uploads the file to the table's directory in HDFS)
load data local inpath '/home/hadoop/ip.txt' into table tab_ext;
-- load data from HDFS into a hive table (essentially moves the file into the hive-managed directory)
load data inpath 'hdfs://ns1/aa/bb/data.log' into table tab_user;
-- bulk-insert data with a select statement
insert overwrite table tab_ip_seq select * from tab_ext;
/ / create & load
Create table tab_ip (id int,name string,ip string,country string)
Row format delimited
Fields terminated by','
Stored as textfile
-- CTAS: derive the table structure from the select statement
CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip, country new_country FROM tab_ip_ext
SORT BY new_id;    -- sorted by new_id
-- CLUSTERED BY: bucketed table
create table tab_ip_cluster (id int, name string, ip string, country string)
clustered by (id) into 3 buckets;    -- hash rows by id into 3 buckets
load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_cluster;
set hive.enforce.bucketing=true;     -- make hive enforce bucketing on insert
insert into table tab_ip_cluster select * from tab_ip;
select * from tab_ip_cluster tablesample (bucket 2 out of 3 on id);
-- PARTITIONED BY: partitioned table
create table tab_ip_part (id int, name string, ip string, country string)
partitioned by (year string)
row format delimited
fields terminated by ',';
load data local inpath '/home/hadoop/data.log' overwrite into table tab_ip_part
partition (year='1990');
load data local inpath '/home/hadoop/data2.log' overwrite into table tab_ip_part
partition (year='2000');
The statements above load /home/hadoop/data2.log into the year='2000' partition of tab_ip_part.
select * from tab_ip_part;
select * from tab_ip_part where year='2000';
select count(*) from tab_ip_part where year='2000';
-- ALTER commands
alter table tab_ip change id id_alter string;
ALTER TABLE tab_cts ADD PARTITION (partCol = 'dt') location '/external/hive/dt';
show partitions tab_ip_part;    -- list the table's partitions
Hive bucketed tables explained, with a worked example: https://www.cnblogs.com/kouryoushine/p/7809299.html
The difference between Hive partitioned tables and bucketed tables: https://blog.csdn.net/jenrey/article/details/80588493
Hive basics: partitions, buckets, and Sort Merge Bucket Join: https://blog.csdn.net/wisgood/article/details/17186107
Partitioning and bucketing are advanced hive features worth studying in depth (see xinghuan).
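As a hedged sketch of how the two features combine (table name tab_ip_pb is illustrative): partitions prune whole directories at query time, while buckets split each partition's data into a fixed number of files.
create table tab_ip_pb (id int, name string, ip string, country string)
partitioned by (year string)
clustered by (id) into 3 buckets
row format delimited fields terminated by ',';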
-- insert ... select: bulk-insert data into another table
create table tab_ip_like like tab_ip;
insert overwrite table tab_ip_like select * from tab_ip;
-- write query results to a local directory or to HDFS
insert overwrite local directory '/home/hadoop/hivetemp/test.txt' select * from tab_ip_part where year='1990';
insert overwrite directory '/hiveout.txt' select * from tab_ip_part where year='1990';
-- run hive hql statements from the shell
hive -S -e 'select country, count(*) from tab_ext group by country' > /home/hadoop/hivetemp/e.txt
select * from tab_ext sort by id desc limit 5;
select a.ip, b.book from tab_ext a join tab_ip_book b on (a.name = b.name);
Hive complex types (array, map, struct) in use: https://blog.csdn.net/u010670689/article/details/72885944
-- array
create table tab_array (a array<int>, b array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';
select a[0] from tab_array;
select * from tab_array where array_contains(b, 'word');
insert into table tab_array select array(0), array(name, ip) from tab_ext t;
Using array in Hive: https://blog.csdn.net/zhao897426182/article/details/78347960
-- map
create table tab_map (name string, info map<string,string>)
row format delimited
fields terminated by '\t'
collection items terminated by ','    -- separates the map's entries
map keys terminated by ':';           -- separates each key from its value
load data local inpath '/home/hadoop/hivetemp/tab_map.txt' overwrite into table tab_map;
insert into table tab_map select name, map('name', name, 'ip', ip) from tab_ext;
-- struct
create table tab_struct (name string, info struct<age:int, tel:string, addr:string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';
load data local inpath '/home/hadoop/hivetemp/tab_st.txt' overwrite into table tab_struct;
insert into table tab_struct select name, named_struct('age', id, 'tel', name, 'addr', country) from tab_ext;
-- UDFs (operations staff can skip this; learn it if you are interested)
select if(id=1, 'first', 'no-first'), name from tab_ext;
hive> add jar /home/hadoop/myudf.jar;
hive> CREATE TEMPORARY FUNCTION fanyi AS 'cn.itcast.hive.Fanyi';
select id, name, ip, fanyi(country) from tab_ip_ext;
What is a Hive UDF? https://blog.csdn.net/yqlakers/article/details/70211522
First, the name: UDF stands for user-defined function. Why does it exist? Sometimes the query you want cannot be expressed easily with the built-in functions Hive provides. By writing a UDF you can plug your own processing code into Hive and call it from queries, effectively defining your own functions in HQL (Hive SQL). A UDF must be written in Java, since Hive itself is written in Java; so if you want to master Hadoop and its related technologies, proficiency in Java is a basic skill!
Author: YQlakers
Source: CSDN
Original: https://blog.csdn.net/yqlakers/article/details/70211522
Copyright notice: this is the blogger's original article; please include a link to it when reposting.
Advanced HQL queries (hive query syntax)
The function and usage of order by, sort by, distribute by, and cluster by in hive: https://blog.csdn.net/jthink_/article/details/38903775
MapReduce scripts
Joins (join)
Inner join
Outer join
Semi join
Map join
Subquery (sub query)
View (view)
Hive's order by clause can produce output in total order, but because Hive runs on Hadoop, producing a total order forces Hadoop to finish the job with a single Reduce task, which lowers efficiency.
If the final result does not need to be totally ordered, you can sort with the sort by clause instead, which only guarantees that each Reduce task's output is ordered. If you additionally want certain rows to be handled by the same Reduce task, use the distribute by clause. For example:
The table student (classNo, stuNo, score) contains the following data:
C01 N0101 82
C01 N0102 59
C02 N0201 81
C01 N0103 65
C03 N0302 92
C02 N0202 82
C02 N0203 79
C03 N0301 56
C03 N0306 72
We want to output each class's score information ordered from low to high. Execute the following statement:
select classNo, stuNo, score from student distribute by classNo sort by score;
The output is as follows:
C02 N0203 79
C02 N0201 81
C02 N0202 82
C03 N0301 56
C03 N0306 72
C03 N0302 92
C01 N0102 59
C01 N0103 65
C01 N0101 82
We can see that the scores within each class are ordered: records with the same classNo are distributed to the same reduce task, and sort by guarantees that each reduce task's output is ordered.
Note:
To observe the effect of distribute by in the example above, you must first configure enough reduce tasks. The example has three distinct classNo values, so the number of reduce tasks must be set to at least 3. With fewer than 3, several different classNo values would be distributed to the same reduce task, and the expected output would not be produced. The setup command is as follows:
set mapred.reduce.tasks = 3;
MapReduce scripts
If we need to call an external script, such as a Python script, from a query statement, we can use the
transform, map, and reduce clauses.
For example, suppose we want to filter out all failing students' records and output only the passing students' score information.
Create a new Python script file, score_pass.py, with the following contents:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    (classNo, stuNo, score) = line.strip().split('\t')
    if int(score) >= 60:
        print "%s\t%s\t%s" % (classNo, stuNo, score)
Execute the following statements:
add file /home/user/score_pass.py;
select transform (classNo, stuNo, score) using 'score_pass.py' as classNo, stuNo, score from student;
The output is as follows:
C01 N0101 82
C02 N0201 81
C01 N0103 65
C03 N0302 92
C02 N0202 82
C02 N0203 79
C03 N0306 72
Note:
1) In the Python script above, the input delimiter can only be a tab (\t), and likewise the output delimiter must be a tab. This is fixed by hive and cannot be changed; do not try other delimiters or an error will be raised. You also need to call strip to remove the newline at the end of each line (or simply call line.split() with no arguments instead).
2) Before using the script, register the script file with the add file statement so that hive can distribute it across the Hadoop cluster.
3) transform passes the data to the Python script, and the as clause names the output columns.
Joins (join)
Programming joins directly in Hadoop MapReduce is time-consuming; Hive greatly simplifies the operation.
Inner join
Similar to SQL's inner join. Execute the following statement to query each student's number together with the teacher's name:
select a.stuNo, b.teacherName from student a join teacher b on a.classNo = b.classNo;
The output is as follows:
N0203 Sun
N0202 Sun
N0201 Sun
N0306 Wang
N0301 Wang
N0302 Wang
N0103 Zhang
N0102 Zhang
N0101 Zhang
Note:
See the previous article for the contents of the data files.
Do not use syntax like select xx from aa, bb where aa.f = bb.f; hive does not support it.
If you need to view hive's execution plan, prefix the statement with explain, for example:
explain select a.stuNo, b.teacherName from student a join teacher b on a.classNo = b.classNo;
Outer join
Like traditional SQL, Hive provides left outer join, right outer join, and full outer join.
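A minimal sketch on the same tables (assuming teacher holds a class, such as C04, that has no students; the left outer join keeps that row and fills the student columns with NULL):
select b.teacherName, a.stuNo
from teacher b left outer join student a on a.classNo = b.classNo;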
Semi join
Hive does not provide in subqueries. You can use left semi join to achieve the same effect. Execute the following statement:
select * from teacher left semi join student on student.classNo = teacher.classNo;
The output is as follows:
C02 Sun
C03 Wang
C01 Zhang
As you can see, C04 Dong does not appear in the query results because C04 does not exist in the table student.
Note:
The fields of the right-hand table (student) may only appear in the on clause; they may not be referenced anywhere else, such as in the select clause.
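For comparison, a sketch of the in form that the semi join replaces (not supported by the hive versions these notes assume):
select * from teacher where teacher.classNo in (select classNo from student);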
Map join
When one table is small enough to be loaded entirely into memory, a map join can be used to improve efficiency.
For example:
select /*+ mapjoin(teacher) */ a.stuNo, b.teacherName from student a join teacher b on a.classNo = b.classNo;
The mapjoin hint above uses C-style comment syntax.
A map join is also the better fit when the join condition is a non-equality comparison; understanding exactly why requires a deeper look at how Hive and MapReduce work.
Subquery (sub query)
Running the following statement returns the highest of the class-average scores:
select max(avgScore) as maxScore from
(select classNo, avg(score) as avgScore from student group by classNo) a;
Output result:
80.66666666666667
The parenthesized part of the statement above is a subquery, aliased a. The subquery's result behaves like a table and can be queried further.
View (view)
Like views in a traditional database, a Hive view is only a definition; the view's data is not stored in the file system. Views are also read-only.
Run the following two commands:
create view avg_score as
select classNo, avg(score) as avgScore from student group by classNo;
select max(avgScore) as maxScore from avg_score;
You can see that the output is the same as in the previous example.
Hive data types
- basic types
- compound types