This article explains in detail why you would choose Hive. It is intended to be practical and is shared here for reference; I hope you get something out of it after reading.
Why choose Hive?
Big-data computing and scalability based on Hadoop
Support for a SQL-like query language (see the sketch after this list)
Unified metadata management
Simple programming
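As a quick sketch of the SQL-like query language, assume a hypothetical log table wlan_log with columns msisdn and flux (these names are only for illustration). A familiar-looking aggregation query could be written as follows, and Hive would run it as MapReduce jobs over files in HDFS:
hive> SELECT msisdn, SUM(flux) AS total_flux
    > FROM wlan_log
    > GROUP BY msisdn
    > ORDER BY total_flux DESC
    > LIMIT 10;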
Installation of Hive
1. Hive plays the role of a data warehouse in the Hadoop ecosystem. It can both manage and query the data stored in Hadoop.
In essence, Hive is a SQL parsing engine: it converts SQL queries into MapReduce jobs and runs them.
Hive has a set of mapping tools that translate SQL into MapReduce jobs, and that map tables and fields in SQL to files (folders) in HDFS and to the columns within those files.
This set of mappings is called the metastore and is generally stored in Derby or MySQL.
Hive's default location in HDFS is /user/hive/warehouse, which is determined by the property hive.metastore.warehouse.dir in the configuration file hive-site.xml.
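A quick way to check which warehouse directory is configured is to print the property from the Hive CLI (a minimal sketch; the value shown below is just the default mentioned above):
hive> set hive.metastore.warehouse.dir;
hive.metastore.warehouse.dir=/user/hive/warehouse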
2. Installation of Hive
(1) Unpack the tarball, rename the directory, and set the environment variables.
(2) In the directory $HIVE_HOME/conf/, run mv hive-default.xml.template hive-site.xml to rename the file.
    In the directory $HIVE_HOME/conf/, run mv hive-env.sh.template hive-env.sh to rename the file.
(3) Modify Hadoop's configuration file hadoop-env.sh as follows:
    export HADOOP_CLASSPATH=.:$CLASSPATH:$HADOOP_CLASSPATH:$HADOOP_HOME/bin
(4) In the directory $HIVE_HOME/bin, modify the file hive-config.sh by adding the following:
    export JAVA_HOME=/usr/local/jdk
    export HIVE_HOME=/usr/local/hive
    export HADOOP_HOME=/usr/local/hadoop
3. Install MySQL
(1) Remove any MySQL-related packages already installed on Linux: rpm -e xxxxxxx --nodeps
    Run rpm -qa | grep mysql to check that the removal is complete.
(2) Run rpm -i mysql-server-* to install the MySQL server.
(3) Start the MySQL server: mysqld_safe &
(4) Run rpm -i mysql-client-* to install the MySQL client.
(5) Run mysql_secure_installation to set the root user's password.
4. Using MySQL as the Hive metastore
(1) Place the MySQL JDBC driver jar in Hive's lib directory.
(2) Modify the hive-site.xml file as follows:
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop0:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>admin</value>
</property>
The architecture of Hive consists of:
User interfaces, including CLI, JDBC/ODBC, and WebUI
Metadata storage, usually in a relational database such as MySQL or Derby
An interpreter, compiler, optimizer, and executor
Hadoop: HDFS is used for storage and MapReduce for computation
There are three main user interfaces: CLI, JDBC/ODBC, and WebUI.
CLI: the shell command line.
JDBC/ODBC: Hive's Java interface, similar to the way JDBC is used with a traditional database.
WebGUI: access Hive through a browser.
Hive stores its metadata in a database (the metastore); currently only MySQL and Derby are supported. The metadata includes table names, the columns and partitions of each table and their attributes, table properties (whether it is an external table, and so on), and the directory where each table's data is located.
The interpreter, compiler, and optimizer carry out lexical analysis, syntax analysis, compilation, optimization, and query-plan generation for a HiveQL statement. The generated query plan is stored in HDFS and subsequently executed by MapReduce calls.
Hive's data is stored in HDFS, and most queries are executed by MapReduce (queries such as select * from table, which simply read the data, do not launch MapReduce jobs).
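To make the last point concrete, here is a hedged sketch (the table name t is only a placeholder): a plain select * just reads the table's files from HDFS, while an aggregation is compiled into a MapReduce job:
select * from t;          -- no MapReduce job; Hive simply streams the files under the table's directory
select count(*) from t;   -- compiled into a MapReduce job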
Metastore of Hive
The metastore is the centralized repository for Hive metadata. By default it uses the embedded Derby database as its storage engine.
Drawback of the Derby engine: only one session can be open at a time.
Using MySQL as an external storage engine allows multiple users to access the metastore concurrently.
Shell of Hive
1. Hive command-line mode: run #/hive/bin/hive directly, or run # hive --service cli
2. Hive web-interface mode (port 9999):
# hive --service hwi &
Used to access Hive through a browser at
http://hadoop0:9999/hwi/
3. Hive remote-service mode (port 10000):
# hive --service hiveserver &
Hive and traditional Database

                        Hive                                   Traditional database
Query language          HiveQL                                 SQL
Data storage location   HDFS                                   Raw device or local FS
Data format             User defined                           Decided by the system
Data updates            Not supported                          Supported
Indexes                 Available in newer versions, but weak  Yes
Execution               MapReduce                              Executor
Execution latency       High                                   Low
Scalability             High                                   Low
Data scale              Large                                  Small
Data type of Hive
Basic data type
tinyint / smallint / int / bigint
float / double
boolean
string
Complex data type
array / map / struct
There is no date/datetime type (a table sketch using several of these types follows).
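A minimal sketch that puts several of these types into one table definition (the table and column names are made up for illustration; since there is no date/datetime type, the registration time is kept as a string):
CREATE TABLE user_profile (
  id       bigint,
  age      tinyint,
  score    double,
  active   boolean,
  name     string,
  tags     array<string>,
  props    map<string, string>,
  address  struct<city:string, zip:string>,
  reg_time string
);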
Data storage of Hive
The data storage of Hive is based on Hadoop HDFS
Hive does not have a special data storage format
The storage structure mainly includes: database, file, table and view.
By default Hive can load plain text files (TextFile) directly; SequenceFile is also supported.
When creating a table, specify the column and row delimiters of the data so that Hive can parse it.
Data Model of Hive-Internal Table
Similar in concept to Table in the database
Each Table has a corresponding directory in HDFS where its data is stored. For example, a table test has the HDFS path /warehouse/test, where warehouse is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml.
All Table data (excluding External Table data) is stored in this directory.
When you delete a table, both metadata and data are deleted
Data Model of Hive-Partition Table
A Partition corresponds roughly to a dense index on the partition column in a database.
In Hive, each Partition of a table corresponds to a subdirectory under the table's directory, and all of a Partition's data is stored in that directory.
For example, if the test table has the two partition columns date and city:
the partition date=20130201, city=bj corresponds to the HDFS subdirectory
/warehouse/test/date=20130201/city=bj
the partition date=20130202, city=sh corresponds to the HDFS subdirectory
/warehouse/test/date=20130202/city=sh
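A hedged sketch of how the test table above could be declared and how adding partitions creates those directories (the data column msisdn is assumed only for illustration):
create table test (msisdn string)
partitioned by (date string, city string)
row format delimited fields terminated by '\t';
alter table test add partition (date='20130201', city='bj');   -- creates /warehouse/test/date=20130201/city=bj
alter table test add partition (date='20130202', city='sh');   -- creates /warehouse/test/date=20130202/city=sh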
CREATE TABLE tmp_table                  # table name
(
  title         string,                 # field name and field type
  minimum_bid   double,
  quantity      bigint,
  have_invoice  bigint
) COMMENT 'comments: XXX'               # table comment
PARTITIONED BY (pt STRING)              # partition column (if your files are very large, a partitioned table lets you quickly filter out the data belonging to one partition)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'           # how fields are delimited
STORED AS SEQUENCEFILE;                 # how the data is stored; SEQUENCEFILE is a compressed file format that ships with Hadoop
Some related commands
SHOW TABLES;                  # list all tables
SHOW TABLES '*TMP*';          # fuzzy matching is supported
SHOW PARTITIONS tmp_table;    # list the table's partitions
DESCRIBE tmp_table;           # view the table structure
Shell of the partition table
Create a data file partition_table.dat
Create a table:
create table partition_table (rectime string, msisdn string) partitioned by (daytime string, city string) row format delimited fields terminated by '\t' stored as TEXTFILE;
Load data into a partition:
load data local inpath '/home/partition_table.dat' into table partition_table partition (daytime='2013-02-01', city='bj');
View data:
select * from partition_table;
select count(*) from partition_table;
Delete the table: drop table partition_table;
The data model of Hive-bucket table
Bucket tables hash the data and store them in different files.
Create the table:
create table bucket_table(id string) clustered by(id) into 4 buckets;
Load data:
set hive.enforce.bucketing = true;
insert into table bucket_table select name from stu;
insert overwrite table bucket_table select name from stu;
When data is loaded into a bucket table, the hash value of the bucketing column is taken modulo the number of buckets, and each row is placed in the corresponding file.
Sampling query:
select * from bucket_table tablesample(bucket 1 out of 4 on id);
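The sampling clause reads only some of the buckets instead of the whole table: with 4 buckets, "bucket 1 out of 4 on id" reads a single bucket (about a quarter of the rows). As a hedged sketch, widening the sample to half of the data would look like this:
select * from bucket_table tablesample(bucket 1 out of 2 on id);   -- reads 2 of the 4 buckets, roughly half the data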
Data Model of Hive-external Table
An external table points to data that already exists in HDFS, and it can also have Partitions.
Its metadata is organized in the same way as an internal table's, but the actual data is stored quite differently.
For an internal table, table creation and data loading are two steps of one process (both can be done in a single statement); during loading the actual data is moved into the data
warehouse directory, and later access to the data reads it directly from that directory. When the table is deleted, both its data and its metadata are deleted.
An external table involves only one process: loading the data and creating the table happen at the same time. The data is not moved into the data warehouse directory; only a link to the external data is established. When an external table is deleted, only that link is removed.
CREATE EXTERNAL TABLE page_view
(
  viewTime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination'
)
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION 'hdfs://centos:9000/user/data/staging/page_view';
Shell of the external table
Create a data file external_table.dat
Create a table:
hive> create external table external_table1 (key string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/home/external';
Create the directory /home/external in HDFS and put the data file there:
# hadoop fs -put /home/external_table.dat /home/external
Load data:
LOAD DATA INPATH '/home/external_table.dat' INTO TABLE external_table1;
View data:
select * from external_table1;
select count(*) from external_table1;
Delete the table:
drop table external_table1;
Import data
When data is loaded into a table, no transformation is applied to it; the Load operation simply copies or moves the data file to the location corresponding to the Hive table.
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2...)]
Import data from one Hive table into another existing Hive table:
INSERT OVERWRITE TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement FROM from_statement
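For example, a hedged sketch that overwrites one partition using the schema of the partition_table from the partition-table section (the source table src_table is assumed only for illustration):
INSERT OVERWRITE TABLE partition_table PARTITION (daytime='2013-02-01', city='bj')
SELECT rectime, msisdn FROM src_table;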
CTAS
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
(col_name data_type, ...)
AS SELECT...
Example: create table new_external_test as select * from external_table1;
Query
Query based on Partition
A general SELECT query performs a full table scan. For a partitioned table, however, the query can take advantage of partition pruning (input pruning), similar to a "partition index", and scan only the part of the table it actually needs. Hive's current implementation enables partition pruning only when the partition predicate appears in the WHERE clause closest to the FROM clause. For example, if the page_views table is partitioned by the date column, the following statement reads only the data in the partitions that fall within the given date range.
SELECT page_views.* FROM page_views WHERE page_views.date >= '2013-03-01' AND page_views.date <= '2013-03-31';
Prepare a table for the join examples below:
hive> create table acinfo (name string, acip string) row format delimited fields terminated by '\t' stored as TEXTFILE;
hive> load data local inpath '/home/acinfo/ac.dat' into table acinfo;
Inner join
select b.name, a.* from dim_ac a join acinfo b on (a.ac = b.acip) limit 10;
Left outer join
select b.name, a.* from dim_ac a left outer join acinfo b on a.ac = b.acip limit 10;
Java client
Start the Hive remote service: # hive --service hiveserver > /dev/null 2>/dev/null &
Java client code:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
Connection con = DriverManager.getConnection("jdbc:hive://192.168.1.102:10000/wlan_dw", "", "");
Statement stmt = con.createStatement();
String querySQL = "SELECT * FROM wlan_dw.dim_m order by flux desc limit 10";
ResultSet res = stmt.executeQuery(querySQL);
while (res.next()) {
    System.out.println(res.getString(1) + "\t" + res.getLong(2) + "\t" + res.getLong(3) + "\t" + res.getLong(4) + "\t" + res.getLong(5));
}
UDF
1. A UDF can be used directly in a SELECT statement to format the query result before the content is output.
2. When writing a UDF, pay attention to the following points:
a) A custom UDF needs to extend org.apache.hadoop.hive.ql.exec.UDF.
b) The evaluate method needs to be implemented; evaluate supports overloading.
3. Steps
a) Package the program and copy the jar to the target machine.
b) Enter the Hive client and add the jar: hive> add jar /run/jar/udf_test.jar;
c) Create a temporary function: hive> CREATE TEMPORARY FUNCTION add_example AS 'hive.udf.Add';
d) Query with HQL statements:
SELECT add_example(8, 9) FROM scores;
SELECT add_example(scores.math, scores.art) FROM scores;
SELECT add_example(6, 7, 8, 6.8) FROM scores;
e) Drop the temporary function: hive> DROP TEMPORARY FUNCTION add_example;
Note: a UDF can only implement one-in, one-out operations. To implement a many-in, one-out (aggregate) operation, you need to implement a UDAF instead.
This is the end of the article on "Why choose Hive". I hope the content above has been helpful and that you have learned something from it; if you found the article useful, please share it so more people can see it.