
How to use Hive


This article introduces how to use Hive. It should be a useful reference for anyone interested, and I hope you learn a lot from reading it. Let's walk through it step by step.

# introduction #

The Apache Hive data warehouse software, an Apache top-level project, makes it easy to query and manage large datasets that reside on distributed storage. It provides:

A tool for simple data extraction / transformation / loading (ETL)

A mechanism for structuring a variety of data formats

Access to files stored directly on Apache HDFS or files on other data storage systems, such as Apache HBase

Query execution via MapReduce

Hive defines a simple SQL-like language called QL. It enables users who are familiar with SQL to query the data. At the same time, the language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that is not supported by the language's built-in capabilities. QL can also be extended with custom scalar functions (UDFs), aggregate functions (UDAFs), and table functions (UDTFs).
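As a hedged illustration of how such an extension is typically hooked in (the jar path, the class name, and the use of the pokes table created later in this article are invented for this sketch), a custom UDF is registered and invoked from the Hive CLI roughly like this:

hive> ADD JAR /path/to/my-udfs.jar;
hive> CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
hive> SELECT my_lower(bar) FROM pokes;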

Hive does not mandate a "Hive format" for reading or writing data; there is no such thing. Hive works equally well on Thrift, control-delimited, or your own specialized data formats. Please see File Formats and Hive SerDe in the Developer Guide for details.
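For example (a minimal sketch; the table name is invented for illustration), choosing a different on-disk format is just a matter of the STORED AS clause:

CREATE TABLE pokes_seq (foo INT, bar STRING) STORED AS SEQUENCEFILE;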

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (such as web logs). What Hive values most is scalability (it scales out as more machines are dynamically added to the Hadoop cluster), extensibility (through the MapReduce framework and UDF/UDAF/UDTF), fault tolerance, and loose coupling with its input formats.

Components of Hive include HCatalog and WebHCat. HCatalog is a component of Hive: it is a table and storage management layer for Hadoop that enables users with different data processing tools, including Pig and MapReduce, to more easily read and write data on the grid.

WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, or Hive jobs, or to perform Hive metadata operations, through an HTTP (RESTful) interface.

# Installation and configuration #

You can install a stable release of Hive by downloading a tarball, or you can download the Hive source code and build Hive from it.

## Requirements ##

Java 1.6

Hadoop 0.20.x, 0.23.x, or 2.0.x-alpha

## Installing Hive from a stable release ##

First, download the latest stable release of Hive from one of the Apache download mirrors.

Next you need to unpack the tarball. This will create a subdirectory named hive-x.y.z (where x.y.z is the release number):

$ tar -xzvf hive-x.y.z.tar.gz

Set the environment variable HIVE_HOME to point to this installation directory.

$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH

$ export PATH=$HIVE_HOME/bin:$PATH

## Running Hive ##

Hive uses Hadoop, so:

You need to add Hadoop to your PATH, or

$ export HADOOP_HOME=<hadoop-install-dir>

In addition, before you can create a table in Hive, you must create /tmp and /user/hive/warehouse in HDFS and make them group-writable (chmod g+w).

The commands to do this:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

You will find it useful to set HIVE_HOME, although it is not necessary.

$ export HIVE_HOME=<hive-install-dir>

To use the Hive command line interface (CLI) from the shell:

$ $HIVE_HOME/bin/hive

# DDL operations #

The DDL operations of Hive are documented in Hive Data Definition Language.

# create Hive table #

CREATE TABLE pokes (foo INT, bar STRING);

Create a table called pokes with two columns, the first column being an integer and the second column being a String.

CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Create a table called invites with two columns and a partition column called ds. The partition column is a virtual column: it is not part of the data itself but is derived from the particular partition that a dataset is loaded into.

By default, tables are assumed to be in text input format and the delimiter is assumed to be ^A (ctrl-a).
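As a rough sketch of what that default corresponds to when spelled out explicitly (the table name here is hypothetical; '\001' is the escape for ctrl-a):

CREATE TABLE pokes_explicit (foo INT, bar STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;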

# browse the table #

SHOW TABLES;

List all tables.

SHOW TABLES '.*s';

Lists all tables that end in 's'. The pattern matching follows Java regular expressions. See this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html.

DESCRIBE invites;

Displays a list of the columns of the table.

# Altering and dropping tables #

Table names can be changed and columns can be added or replaced:

hive> ALTER TABLE events RENAME TO 3koobecaf;
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');

Note that REPLACE COLUMNS replaces all existing columns and only changes the table's schema, not the data. The table must use a native SerDe. REPLACE COLUMNS can also be used to drop columns from the table's schema:

hive> ALTER TABLE invites REPLACE COLUMNS (foo INT COMMENT 'only keep the first column');

Dropping tables:

hive> DROP TABLE pokes;

# Metadata storage #

Metadata is stored in an embedded Derby database whose disk location is determined by the Hive configuration variable javax.jdo.option.ConnectionURL. By default, this location is ./metastore_db (see conf/hive-default.xml).

Right now, in the default configuration, this metadata can only be seen by one user at a time.

Metadata can be stored in any database that supports JPOX.
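If you want to see which metastore location a running session is actually using, the variable can be inspected from the Hive CLI (a small sketch; set simply prints the current value of a configuration variable):

hive> set javax.jdo.option.ConnectionURL;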

# DML operations #

The Hive DML operations are documented in Hive Data Manipulation Language.

Load data from a flat file to Hive:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Loads a file that contains two columns separated by ctrl-a into the table pokes. 'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted, the file is looked for on HDFS.

The keyword 'OVERWRITE' signifies that existing data in the table is deleted. If 'OVERWRITE' is omitted, the data file is appended to the existing dataset.
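For example, a minimal sketch of an appending load, reusing the sample file from above but without the OVERWRITE keyword:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' INTO TABLE pokes;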

Note:

NO verification of data against the schema is performed by the load command.

If the file is in hdfs, it is moved into the Hive-controlled file system namespace. The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table invites. The table invites must have been created with a partition key ds for this to succeed.

hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

The above command loads data from an HDFS file/directory into the table.

Note that loading data from HDFS will cause the file/directory to be moved. As a result, this operation is almost instantaneous.

# SQL operations #

The Hive query operations are documented in Select.

## Example queries ##

Some example queries are shown below. They can be found in build/dist/examples/queries.

More examples can be found in ql/src/test/queries/positive in the Hive source code.

## SELECT and FILTERS ##

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

Selects the column foo from all rows of the partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console.

Note that in all the examples below, INSERT (to a Hive table, local directory, HDFS directory) is optional.

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

Selects all rows from the partition ds=2008-08-15 of the invites table into an HDFS directory. The result data is in files in that directory (the number of files depends on the number of mappers).

Note: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses. Partitioned tables must always have a partition selected in the WHERE clause of the statement.
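For instance (a minimal sketch), the partition column ds can appear in the projection just like an ordinary column:

hive> SELECT a.ds, a.foo FROM invites a WHERE a.ds='2008-08-15';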

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Selects all rows from the pokes table into a local directory.

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Sums of a column, averages, minimums, and maximums can also be used. Note that for versions of Hive that do not include HIVE-287, you will need to use COUNT(1) in place of COUNT(*).
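A rough sketch of those aggregates over the invites table used above:

hive> SELECT SUM(a.foo), AVG(a.foo), MIN(a.foo), MAX(a.foo) FROM invites a WHERE a.ds='2008-08-15';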

## GROUP BY ##

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Note that for versions of Hive that do not include HIVE-287, you will need to use COUNT(1) in place of COUNT(*).

## JOIN ##

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

## MULTITABLE INSERT ##

FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 AND src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 AND src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

## STREAMING ##

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (similar to Hadoop streaming).

Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for an example).
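A sketch of what reduce-side streaming can look like, adapted from the pattern shown in the Hive Tutorial (the column names and the reuse of /bin/cat as the reduce script are purely illustrative, not from this article):

FROM (
  FROM invites a
  SELECT TRANSFORM(a.foo, a.bar) USING '/bin/cat' AS (oof, rab)
  CLUSTER BY oof
) mapped
INSERT OVERWRITE TABLE events
SELECT TRANSFORM(mapped.oof, mapped.rab) USING '/bin/cat' AS (foo, bar);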

# simple use case #

## MovieLens user ratings ##

First, create a table with a tab-delimited text file format:

CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Then, download the MovieLens 100k data file from the GroupLens datasets page (which also has a README.txt file and an index of the unzipped files):

wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

Or:

curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip

Note: if the GroupLens datasets link does not work, please report the problem on HIVE-5341 or send a message to the user@hive.apache.org mailing list.

Extract the data file:

unzip ml-100k.zip

Load the data u.data into the table you just created:

LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;

Calculate the number of rows in table u_data:

SELECT COUNT(*) FROM u_data;

Note that for versions of Hive that do not include HIVE-287, you will need to use COUNT(1) in place of COUNT(*).

Now we can do some complex analysis on the table u_data:

Create weekday_mapper.py

import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])

Use this mapper script:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;

Note that if you are using Hive 0.5.0 or earlier, you need to replace COUNT(*) with COUNT(1).

## Apache weblog data ##

The format of Apache weblogs is customizable, but most web administrators use the default.

For the default Apache weblog, we can create a table with the following command.

More information about RegexSerDe can be found at HIVE-662 and HIVE-1719.

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;

Thank you for reading this article carefully; I hope "How to use Hive" has been helpful to you.
