01. What is Hive?
1. Introduction to Hive
Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query capabilities.
Hive is a SQL parsing engine: it translates SQL statements into MapReduce jobs and runs them on Hadoop.
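For a quick illustration (not from the original walkthrough, and assuming a table like the t_order table created later in this article), a single SQL-like statement such as the following is compiled by Hive into a MapReduce job: the map phase emits (rongliang, 1) pairs and the reduce phase sums them.
select rongliang, count(*) as cnt
from t_order
group by rongliang;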
2. Hive architecture
User interfaces, including the CLI, JDBC/ODBC, and the WebUI
Metadata storage, usually in a relational database such as MySQL or Derby
Interpreter, compiler, optimizer, executor
Hadoop: use HDFS for storage and MapReduce for calculation
PS: Hive's metadata is not stored on HDFS but in a database (the metastore); currently only MySQL and Derby are supported. Hive metadata includes table names, the tables' columns and partitions and their properties, table attributes (whether a table is external, etc.), the directory where each table's data is located, and so on.
Metadata is data that describes the data; Hive's actual data is stored in Hadoop HDFS.
The data files themselves remain plain text; the metadata simply organizes them under a planned directory layout.
3. The relationship between Hive and Hadoop
Hive uses HDFS to store data and MapReduce to query data.
4. Hive installation and deployment
Hive is just a tool and does not require cluster configuration.
export HIVE_HOME=/usr/local/hive-2.0.1
export PATH=$PATH:$HIVE_HOME/bin
Configure MySQL as the metastore. If it is not configured, the embedded Derby database is used by default, which is inconvenient: a metastore_db directory is created in whatever directory the ./hive command is run from.
MySQL can be installed on one of the nodes.
5. Hive's thrift service
HiveServer2 can be installed on one node and published as a service; other nodes then connect to it with Beeline.
Startup mode (if on master):
Start in the foreground: bin/hiveserver2
Start in the background: nohup bin/hiveserver2 1>/var/log/hiveserver.log 2>/var/log/hiveserver.err &
Connection method:
Run hive/bin/beeline to enter the Beeline command-line interface
Enter the command to connect to hiveserver2:
beeline> !connect jdbc:hive2://master:10000
beeline> !connect jdbc:hive2://localhost:10000
(master is the hostname of the machine where hiveserver2 is running; the port defaults to 10000)
02. Basic operations of Hive
1. Create a database
hive> create database tabletest;
When you create a new database, a tabletest.db folder is created under /user/hive/warehouse/ in HDFS.
If you do not create a database and do not issue a use statement, tables go into the system default database; you can also switch to it explicitly with hive> use default;. Tables in the default database are created directly under /user/hive/warehouse/.
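A small sketch of how the choice of database affects the warehouse layout (assuming the tabletest database created above):
hive> show databases;   -- lists default, tabletest, ...
hive> use default;      -- tables created now go directly under /user/hive/warehouse/
hive> use tabletest;    -- tables created now go under /user/hive/warehouse/tabletest.db/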
2. Create a table
Syntax:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC | DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
Example:
create table t_order(id int, name string, rongliang string, price double)
row format delimited
fields terminated by '\t';
This creates the t_order table; in the MySQL metastore, the TBLS table gains a row for the table and the column information is recorded as well.
At the same time, a t_order folder is added under the tabletest.db folder in HDFS. All table data (except for external tables) is stored in this directory.
3. Import data
You can use HDFS to upload files directly to the t_order folder, or you can use Hive's load command.
load data local inpath '/home/hadoop/ip.txt' [OVERWRITE] into table tab_ext;
The effect is the same as uploading a local Linux file to HDFS. Note, however, that if the path after inpath is an HDFS path (no local keyword), the source file is moved (cut) into the target folder rather than copied, which can be an unpleasant surprise.
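A hedged sketch of the two forms of load (the HDFS path /input/ip.txt is hypothetical):
load data local inpath '/home/hadoop/ip.txt' into table tab_ext;   -- local file: copied into the table directory
load data inpath '/input/ip.txt' into table tab_ext;               -- HDFS file: moved (cut) into the table directory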
4. External table
The EXTERNAL keyword allows users to create an external table and specify a path to the actual data (LOCATION) while creating the table. When creating an internal table, Hive moves the data to the path pointed to by the data warehouse; if an external table is created, only the path where the data is located is recorded, and no change is made to the location of the data.
To avoid the problem of losing the source file, you can create an external table, and the data source can be anywhere.
CREATE EXTERNAL TABLE tab_ip_ext (id int, name string, ip STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/external/hive';
When you create the table, you specify the path of the HDFS files, so the source files can simply be uploaded into the /external/hive/ folder.
When the external table is deleted, only the metadata information is deleted, and the data stored in the HDFS is not deleted.
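For illustration only, a sketch of the difference when dropping the two kinds of tables (table names taken from this article):
drop table t_order;       -- internal table: both the metadata and the data under the warehouse directory are deleted
drop table tab_ip_ext;    -- external table: only the metadata is deleted; the files under /external/hive remain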
5. Partition tables
Purpose: when the data files are very large, a partition table lets queries quickly filter down to the data belonging to a given value of the partition field.
Suppose t_order has two partitions, part1 and part2.
In practice this just creates two more folders under the t_order folder in HDFS, one per partition.
create table t_order(id int, name string, rongliang string, price double)
partitioned by (part_flag string)
row format delimited
fields terminated by '\t';
Insert data:
load data local inpath '/home/hadoop/ip.txt' overwrite into table t_order partition (part_flag='part1');
The ip.txt file ends up in the .../t_order/part_flag=part1/ folder.
View the data in the partition table:
select * from t_order where part_flag='part1';
When querying, the partition field is also returned, but it is not a real column in the data files; it is a pseudo-column.
A partition in Hive is simply one more level of directories; the benefits are easier per-partition statistics, higher efficiency, and a smaller dataset to scan.
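A hedged sketch of working with t_order's partitions (the part2 partition here is purely illustrative):
alter table t_order add if not exists partition (part_flag='part2');   -- creates .../t_order/part_flag=part2/ in HDFS
show partitions t_order;
select count(*) from t_order where part_flag='part1';                  -- only the part1 directory is scanned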
Related commands:
SHOW TABLES; # View all tables
SHOW TABLES 'TMP'; # supports fuzzy queries
SHOW PARTITIONS TMP_TABLE; # check what partitions the table has
DESCRIBE TMP_TABLE; # View table structure
6. Bucketed tables
A bucket in Hive corresponds to a partition in MapReduce, whereas a partition in Hive simply splits the data into different folders.
1. Create a bucketed table
create table stu_buck(Sno int, Sname string, Sex string, Sage int, Sdept string)
clustered by (Sno)
sorted by (Sno DESC)
into 4 buckets
row format delimited
fields terminated by ',';
Meaning: the data is bucketed by the Sno field, each bucket is locally sorted by Sno, and there are 4 buckets. Creating the table does not reorganize existing data; it is the data being inserted that must follow the bucketing rule.
2. Insert data into the bucketed table
In general, load data is not suitable for a bucketed table, because loaded data is not automatically split into four files according to the bucketing rule. It is therefore common to insert into a bucketed table from a select query.
Set the variables: enable bucketing and set the number of reducers to the number of buckets.
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table stu_buck
select * from student cluster by (Sno);
insert overwrite table stu_buck
select * from student distribute by (Sno) sort by (Sno asc);
Here distribute by (Sno) sort by (Sno asc) can replace cluster by (Sno); the two are equivalent, since cluster by = distribute (bucket) + sort.
Distribute first, then sort locally. The difference is that distribute by is more flexible: it can distribute by one field and sort by another.
The subquery's output is split across four reducers, producing four files that feed the outer insert.
3. How bucketed tables work and what they are for
Principle:
Hive organizes buckets on a column: it hashes the column value and takes the remainder modulo the number of buckets to determine which bucket a record is stored in. (The principle is the same as MapReduce's getPartition method.)
Function:
(1) The biggest benefit is improving the efficiency of join operations.
The precondition is that both tables are bucketed on the join column and their bucket counts are equal or multiples of each other.
Think about this question:
select a.id, a.name, b.addr from a join b on a.id = b.id;
If tables a and b are already bucketed tables, with id as the bucketing field,
does this join still need to compute a Cartesian product over the full tables?
For a JOIN where both tables are bucketed on the join column, buckets holding the same column values can be joined directly against each other, which greatly reduces the amount of data involved in the JOIN.
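As a sketch under the assumptions above (a and b both bucketed on id, with equal or multiple bucket counts), a bucket map join can be requested so that only the corresponding buckets are joined:
set hive.optimize.bucketmapjoin = true;
select a.id, a.name, b.addr
from a join b on a.id = b.id;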
(2) Sampling is more efficient. When working with large datasets, being able to develop and refine a query against a small slice of the data is very convenient.
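For example, on the stu_buck table created earlier, a single bucket out of the four can be sampled with Hive's TABLESAMPLE clause (a minimal sketch):
select * from stu_buck tablesample(bucket 1 out of 4 on Sno);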
7. Insert statement
Hive's row-by-row insert is very slow.
But it can do batch inserts; in effect it just writes files into the table's folder.
create table tab_ip_like like tab_ip;
insert overwrite table tab_ip_like
select * from tab_ip;
This writes the query results as files under tab_ip_like.
8. Several ways to save the results of a select query
1. Save the query results into a new Hive table
create table t_tmp
as
select * from tweep;
2. Save the query results into an existing Hive table
insert into table t_tmp
select * from tweep;
3. Save the query results to a specified directory (local or HDFS)
insert overwrite local directory '/home/hadoop/test'
select * from tweep;
Write to HDFS:
insert overwrite directory '/aaa/test'
select * from tweep;
9. View, delete
Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]]
[LIMIT number]
Note:
CLUSTER BY field: distributes rows to reducers by this field; note that you need to set the number of reducers accordingly.
Order by sorts the input globally, so there is only one reducer, which results in a longer computing time when the input is large.
Sort by is not a global sort, it sorts the data before it enters the reducer. Therefore, if you sort with sort by and set mapred.reduce.tasks > 1, sort by only guarantees that the output of each reducer is ordered, not globally.
Distribute by (field) divides the data into different reducer according to the specified field, and the distribution algorithm is a hash hash.
Cluster by (field) sorts the field in addition to the function of Distribute by.
So, if the bucket and sort fields are the same, then cluster by = distribute by + sort by
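A small illustrative sketch of this equivalence, reusing the t_order table from earlier (the reducer count is an arbitrary example):
set mapreduce.job.reduces=3;
select * from t_order cluster by id;
-- equivalent to:
select * from t_order distribute by id sort by id;
-- distribute by is more flexible: distribute by one field, sort by another
select * from t_order distribute by id sort by price desc;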
select * from inner_table;
select count(*) from inner_table;
When you delete a table, both the metadata and the data are deleted.
drop table inner_table;
10. Joins in Hive
LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN, INNER JOIN, LEFT SEMI JOIN
Prepare data:
a.txt:
1,a
2,b
3,c
4,d
7,y
8,u
b.txt:
2,bb
3,cc
7,yy
9,pp
Build a table:
create table a(id int, name string) row format delimited fields terminated by ',';
create table b(id int, name string) row format delimited fields terminated by ',';
Import data:
load data local inpath '/home/hadoop/a.txt' into table a;
load data local inpath '/home/hadoop/b.txt' into table b;
1. inner join
select * from a inner join b on a.id=b.id;
+-------+---------+-------+---------+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 7     | y       | 7     | yy      |
+-------+---------+-------+---------+
This gives the intersection of the two tables.
2. left join
select * from a left join b on a.id=b.id;
+-------+---------+-------+---------+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
+-------+---------+-------+---------+
All rows from the left table are kept; rows with no match show NULL on the right.
3. right join
select * from a right join b on a.id=b.id;
4. full outer join
select * from a full outer join b on a.id=b.id;
+-------+---------+-------+---------+
| a.id  | a.name  | b.id  | b.name  |
+-------+---------+-------+---------+
| 1     | a       | NULL  | NULL    |
| 2     | b       | 2     | bb      |
| 3     | c       | 3     | cc      |
| 4     | d       | NULL  | NULL    |
| 7     | y       | 7     | yy      |
| 8     | u       | NULL  | NULL    |
| NULL  | NULL    | 9     | pp      |
+-------+---------+-------+---------+
Rows from both sides are shown, matched where possible.
5. left semi join
select * from a left semi join b on a.id = b.id;
+-------+---------+
| a.id  | a.name  |
+-------+---------+
| 2     | b       |
| 3     | c       |
| 7     | y       |
+-------+---------+
It is more efficient and returns only the left side, i.e. the rows of a that have a match in b.
11. Create a temporary table
Intermediate results can be stored.
CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip, country new_country
FROM tab_ip_ext
SORT BY new_id;
03. Custom functions and Transform of Hive
When the built-in functions provided by Hive cannot meet your business processing needs, you can consider using user-defined functions (UDF:user-defined function) at this time.
1. Custom function category
UDF acts on a single row of data, producing a row of data as output. (mathematical function, string function)
UDAF (user-defined aggregation function): receives multiple input data rows and produces one output data row. (count,max)
2.UDF development example
1. First develop a Java class that inherits UDF and overloads the evaluate method
package cn.itcast.bigdata.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {   // must be public
    // Hive calls evaluate() once per input row
    public String evaluate(String field) {
        String result = field.toLowerCase();
        return result;
    }
}
2. Upload it to the server in a jar package
3. Add the jar package to hive's classpath
hive> add JAR /home/hadoop/udf.jar;
4. Create a temporary function to associate with the developed Java class
hive> create temporary function tolowercase as 'cn.itcast.bigdata.udf.ToLowerCase';
5. You can then use the custom function tolowercase in HQL:
select id, tolowercase(name) from tweep;
3. Transform
The TRANSFORM keyword of Hive provides the ability to call self-written scripts in SQL
It is suitable for situations where you do not want to write UDF and implement functions that are not available in Hive.
1. Load the rating.json file into rat_json, an original table of hive.
/ / {"movie": "1721", "rate": "3", "timeStamp": "965440048", "uid": "5114"} create table rat_json (line string) row format delimited;load data local inpath'/ opt/rating.json' into table rat_json
2. Parse the json data into four fields and insert it into a new table t_rating (which has the four fields needed for the processed data).
create table t_rating(movieid string, rate int, timestring string, uid string)
row format delimited fields terminated by '\t';
insert overwrite table t_rating
select get_json_object(line,'$.movie') as movieid,
       get_json_object(line,'$.rate') as rate,
       get_json_object(line,'$.timeStamp') as timestring,
       get_json_object(line,'$.uid') as uid
from rat_json;
-- get_json_object is a built-in JSON function; the same thing could also be done with a custom UDF.
3. Use transform+Python to convert unixtime to weekday
(1) Edit a python script file first
Python Code:
vi weekday_mapper.py

import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])
(2) add the file to the classpath of hive:
hive> add FILE /home/hadoop/weekday_mapper.py;
hive> create TABLE u_data_new as
SELECT
  TRANSFORM (movieid, rate, timestring, uid)
  USING 'python weekday_mapper.py'
  AS (movieid, rate, weekday, uid)
FROM t_rating;

select distinct(weekday) from u_data_new limit 10;
The query's processed output is inserted into the new u_data_new table.