A brief introduction to Hive views and indexes 04/07 Update SLTechnology News&Howtos

A brief introduction to Hive views and indexes

2025-04-07 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/02 Report--

The main content of this article is "A brief introduction to Hive View and Index". Interested friends may wish to take a look. The method introduced in this paper is simple, fast and practical. Let's let the editor take you to learn "A brief introduction to Hive View and Index".

I. introduction to Hive View 1.1

The concept of the view in Hive is the same as that in RDBMS, which is the logical representation of a set of data and is essentially the result set of a SELECT statement. Views are purely logical objects with no associated storage (except for materialized views introduced by Hive 3.0.0). When a query references a view, Hive can combine the definition of the view with the query, such as pushing the filter in the query to the view.

1.2 Features

Materialized views are not supported

Only query, not load data operation

When a view is created, the query view executes the corresponding subquery only by saving a copy of metadata.

If the ORDER BY/LIMIT statement is included in the view definition, the ORDER BY/LIMIT statement operation is also performed when querying the view, and the priority defined in view is higher.

Hive view supports iterative view

1.3 create view CREATE VIEW [IF NOT EXISTS] [db_name.] view_name-- View name [(column_name [COMMENT column_comment],...)]-- column name [COMMENT view_comment]-- View comment [TBLPROPERTIES (property_name = property_value,...)]-- additional information AS SELECT...

Considerations for creating views

CREATE VIEW creates a view with the given name. If a table or view with the same name already exists, an error is raised. You can use IF NOT EXISTS to skip this error.

Deleting the base table does not delete the view, you need to delete the view manually

View is read-only and cannot be used as a target for LOAD / INSERT / ALTER

When you create a view, if no column name is provided, the column name is automatically derived from the SELECT statement

A view may contain ORDER BY and LIMIT clauses. If the reference query also contains these terms, the query-level clause evaluates the post-view terms (and any other actions subsequently in the query). For example, if the view specifies LIMIT 5 and the reference query executes as (select * from v LIMIT 10), a maximum of five rows will be returned.

Prepare data

-- create test table create table default.user (id string,-- key sex string,-- gender name string-- name);-- Import data insert into default.user (id, sex, name) values ("1", "male", "Zhang San"), ("2", "female", "floret"), ("3", "male", "Zhao Liu"), ("4", "male", "Li Hei Hei")

Create a test view

Hive (default) > create view if not exists default.user_view as select * from default.user;OKid sex nameTime taken: 0.181 seconds1.4 query View-- query View content? select * from default.user_view;-- query View structure desc default.user_view;-- query View details desc formatted default.user_view;-- query View is not specified in the same way as querying all tables show tables Delete View-template DROP VIEW [IF EXISTS] [db_name.] view_name;-- Delete View DROP VIEW IF EXISTS user_view;1.6 modify View Properties

Syntax:

ALTER VIEW [db_name.] view_name SET TBLPROPERTIES table_properties; table_properties:: (property_name = property_value, property_name = property_value,...)

Example:

Alter view default.user_view set tblproperties ('name'='DSJLG','GZH'='DSJLG')

Through desc formatted default.user_view; details

II. Introduction to Index 2.1

Hive introduced the function of index in 0.7.0. the design goal of the index is to improve the query speed of some columns of the table. If there is no index, a query with a predicate (such as' WHERE table1.column = 10') loads the entire table or partition and processes all rows. But if an index exists in column, only part of the file needs to be loaded and processed.

2.2 create index template CREATE INDEX index_name-index name ON TABLE base_table_name (col_name,...)-indexed column AS index_type-index type [WITH DEFERRED REBUILD]-rebuild index [IDXPROPERTIES (property_name=property_value,...)]-index extra properties [IN TABLE index_table_name]-name of index table [[ROW FORMAT...]] STORED AS. | STORED BY...]-- Index table row delimiter, storage format [LOCATION hdfs_path]-Index table storage location [TBLPROPERTIES (...)]-Index table properties [COMMENT "index comment"];-Index Note 2.3 create an index

We use the user table created above to create the name user_index for the id field, and the index is stored in the user_index_table index table.

Create index user_index on table user (id) as' org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'with deferred rebuild in table user_index_table

At this point, there is no data in the index table, and the index needs to be rebuilt to have indexed data.

Re-index hive (default) > ALTER index user_index on user rebuild; Query ID = root_20201015081313_879ce697-a6a4-4c38-a1a9-0e72a52feb6bTotal jobs = 1Launching Job 1 out of 1Number of reduce tasks not specified. Estimated from input data size: 1In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=In order to limit the maximum number of reducers: set hive.exec.reducers.max=In order to set a constant number of reducers: set mapreduce.job.reduces=Starting Job = job_1602711568359_0002 Tracking URL = http://node01:8088/proxy/application_1602711568359_0002/Kill Command = / export/servers/hadoop-2.6.0-cdh6.14.0/bin/hadoop job-kill job_1602711568359_0002Hadoop job information for Stage-1: number of mappers: 1 Number of reducers: 12020-10-1508 13 Stage-1 map 47425 Stage-1 map = 0%, reduce = 0% 20-10-15 08 13 seconds 48546 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.66 sec2020-10-1508 13 seconds 1349 576 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.5 secMapReduce Total cumulative CPU time: 2 seconds 500 msecEnded Job = job_1602711568359_0002Loading data to table default.user_index_tableTable default.user_index_table stats: [numFiles=1, numRows=4, totalSize=231 RawDataSize=227] MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.5 sec HDFS Read: 12945 HDFS Write: 581944 SUCCESSTotal MapReduce CPU Time Spent: 2 seconds 500 msecOKTime taken: 12.85 seconds

Hive will start the MapReduce job to build the index, and then check the index table data as follows. The three table fields represent the value of the index column, the HDFS file path corresponding to the value, and the offset of the value in the file.

Hive (default) > select * from user_index_table OKuser_index_table.id user_index_table._bucketname user_index_table._offsets1 hdfs://node01:8020/user/hive/warehouse/user/000000_0 [0] 2 hdfs://node01:8020/user/hive/warehouse/user/000000_0 [13] 3 hdfs://node01:8020/user/hive/warehouse/user/000000_0 [26] 4 hdfs://node01:8020/user/hive/warehouse / user/000000_0 [39] Time taken: 0.047 seconds Fetched: 4 row (s) 2.5 automatically uses the index

By default, although the index is established, Hive does not automatically use the index when querying, and the relevant configuration needs to be enabled. When the configuration is turned on, queries involving index columns use the index function to optimize the query.

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;SET hive.optimize.index.filter=true;SET hive.optimize.index.filter.compact.minsize=0;2.6 View Index

Show index on user

2.7 Delete index

Deleting an index deletes the corresponding index table.

DROP INDEX [IF EXISTS] index_name ON table_name

If a table with an index is deleted, its corresponding index and index table will be deleted. If a partition of the indexed table is deleted, the partition index corresponding to the partition will also be deleted.

2.8 the principle of index

Building an index on a specified column produces an index table (a physical table of Hive) with fields including the value of the index column, the HDFS file path corresponding to the value, and the offset of the value in the file.

When performing the index field query, an extra MapReduce job is generated at first. According to the filtering conditions of the index column, the hdfs file path and offset corresponding to the value of the index column are filtered from the index table and output to a file on the hdfs. Then, according to the hdfs path and offset in these files, the original input file is filtered to generate a new split as the split of the whole job, so that there is no need to scan the whole table.

At this point, I believe you have a deeper understanding of the "brief introduction to Hive views and indexes". You might as well do it in practice. Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.