Hive from Entry to Proficiency: Example Analysis

2025-02-24 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 Report

This article shares the key points of "Hive from Entry to Proficiency: Example Analysis". I hope you will gain something from reading it; let's work through it together!

1 Background

The volume of data that industrial business intelligence must collect and analyze keeps growing, making traditional data-warehouse solutions prohibitively expensive. Hadoop is a popular open-source MapReduce implementation used at companies such as Yahoo! and Facebook to store and process very large data sets on commodity hardware. However, the MapReduce programming model is very low level: it requires developers to write custom programs, which are often difficult to maintain and reuse.

HBase is used as the database, but because HBase has no SQL-like query interface, operating on and computing over the data is very inconvenient. We therefore integrate Hive, so that HQL queries can be run over data stored in HBase. Hive is also called a data warehouse.

2 Definition

Hive is a data warehouse tool built on Hadoop (HDFS, MapReduce). It maps structured data files to database tables and provides SQL-like query functionality.

In essence, it converts SQL into MapReduce programs.
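As a sketch of this idea, a HiveQL aggregation such as the following is compiled by Hive into a MapReduce job (the table and column names here are illustrative, borrowed from the page_view examples later in this article):

```sql
-- Count page views per country. Hive compiles this into a
-- MapReduce job: the map phase emits (country, 1) pairs and
-- the reduce phase sums them per country.
SELECT country, COUNT(*) AS views
FROM page_view
GROUP BY country;
```

The user writes only declarative SQL; Hive plans, generates, and submits the underlying MapReduce tasks.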

3 Architecture

Hive is built on top of Hadoop: it maps structured data files to database tables, provides largely complete SQL query functionality, and translates SQL statements into an execution plan from which MapReduce tasks are generated and handed to the Hadoop cluster for processing. The architecture of Hive is shown in figure 1-1:

Figure 1-1 Architecture of Hive

4 Data storage in Hive

Hive's storage is based on the Hadoop file system. Hive itself has no special data storage format, nor can it index the data. Users are free to organize tables in Hive; when creating a table, they only need to tell Hive the column and row delimiters used in the data so that Hive can parse it.

Hive mainly contains four types of data models: table (Table), external table (External Table), partition (Partition) and bucket (Bucket).

Tables in Hive are conceptually similar to tables in a database, and each table has a corresponding storage directory in HDFS. For example, a table pokes has the HDFS path /warehouse/pokes, where /warehouse is the data-warehouse directory specified by hive.metastore.warehouse.dir in the hive-site.xml configuration file.
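A minimal sketch of creating such a table (the column names foo and bar are illustrative):

```sql
-- Creates the table pokes; its data files will live under the
-- warehouse directory, e.g. /warehouse/pokes in HDFS.
CREATE TABLE pokes (foo INT, bar STRING);
```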

Each partition in Hive corresponds to an index on the partition columns, but partitions are organized differently than in traditional relational databases. In Hive, each partition of a table corresponds to a subdirectory under the table's directory, and all of that partition's data is stored in the corresponding directory. For example, the htable table in figure 1-2 has three partition columns: year, month and day, corresponding to three directory levels. The HDFS subdirectory for year=2012, month=01, day=01 is /warehouse/htable/year=2012/month=01/day=01; the subdirectory for year=2012, month=02, day=14 is /warehouse/htable/year=2012/month=02/day=14.

A bucket hashes a specified column and splits the data according to the hash value, with each bucket stored as one file. For example, to spread the Uniqueid column of the htable table in figure 1-2 across 32 buckets, the first step is to hash Uniqueid. The bucket with hash value 0 is written to the HDFS file /warehouse/htable/year=2012/month=01/day=01/part-0; the bucket with hash value 1 is written to /warehouse/htable/year=2012/month=01/day=01/part-1.

Figure 1-2 Hive data store
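The htable layout described above could be declared roughly as follows (the non-partition column names are illustrative; only Uniqueid and the partition columns come from the example):

```sql
-- Partitioned by year/month/day: one HDFS subdirectory per partition.
-- Hashed on uniqueid into 32 buckets: one file part-0 .. part-31
-- inside each partition directory.
CREATE TABLE htable (
  uniqueid STRING,
  payload  STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (uniqueid) INTO 32 BUCKETS;
```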

I. Brief introduction to Hive functionality

Function introduction: the key clauses when creating a table are as follows.
PARTITIONED BY: declares the table's partition columns.
CLUSTERED BY: divides the partitions into buckets.
ROW FORMAT: defines how each record is stored, including how fields are separated, how elements within collection fields are separated, and how the keys of a MAP are separated from their values.
STORED AS SEQUENCEFILE: uses Hadoop's SequenceFile storage format.
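Putting these clauses together, a table definition might look like the following sketch (table and column names are hypothetical):

```sql
CREATE TABLE logs (
  id    INT,
  tags  ARRAY<STRING>,
  props MAP<STRING, STRING>
)
PARTITIONED BY (dt STRING)           -- partition columns
CLUSTERED BY (id) INTO 4 BUCKETS     -- buckets within each partition
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'          -- separator between fields
  COLLECTION ITEMS TERMINATED BY ',' -- separator inside ARRAY/MAP fields
  MAP KEYS TERMINATED BY ':'         -- separator between MAP keys and values
STORED AS SEQUENCEFILE;              -- Hadoop SequenceFile storage format
```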

(2) View a table's structure: DESCRIBE tablename;
(3) Modify a table by adding a column: ALTER TABLE pokes ADD COLUMNS (new_col INT);

(4) Drop a table: DROP TABLE tablename;

DML
(1) Import data. An import operation simply copies files into the table's directory; it does not validate the files against the schema.
Import from HDFS: LOAD DATA INPATH 'data.txt' INTO TABLE page_view PARTITION (date='2008-06-08', country='US');
Import from the local file system, overwriting the existing data: LOAD DATA LOCAL INPATH 'data.txt' OVERWRITE INTO TABLE page_view PARTITION (date='2008-06-08', country='US');

Hive architecture: HiveServer
HiveServer startup: hive --service hiveserver
HiveServer supports multiple connection methods: Thrift, JDBC, ODBC.

Metastore
The metastore stores Hive's metadata (table and database definitions, etc.). By default it is bundled with Hive and deployed in the same JVM, with the metadata kept in Derby. The drawback of storing metadata in Derby is that only one Hive instance can be open at a time (Derby cannot be shared among multiple service instances).

Hive therefore provides configuration options to replace Derby with a relational database such as MySQL, so that the metadata is stored independently and shared among multiple service instances.

You can even run the metastore as a separate service, deployed in another JVM and accessed through remote calls.

Common metastore configuration:
hive.metastore.warehouse.dir: directory where table data is stored
hive.metastore.local: use the embedded metastore service (default: true)
hive.metastore.uris: URI of the remote metastore service, required when the embedded service is not used
javax.jdo.option.ConnectionURL: JDBC URL of the database backing the metastore
javax.jdo.option.ConnectionDriverName: database driver class
javax.jdo.option.ConnectionUserName: connection username
javax.jdo.option.ConnectionPassword: connection password
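For example, a hive-site.xml pointing the metastore at a MySQL database might contain the following properties inside its configuration element (the host, database name and credentials are placeholders):

```xml
<!-- JDBC connection to the metastore database (placeholder values) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>secret</value>
</property>
```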

Hive data storage format

If you do not specify ROW FORMAT and STORED AS clauses when defining a table, Hive uses the following defaults: CREATE TABLE ... ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' LINES TERMINATED BY '\n' STORED AS TEXTFILE; that is, the default is the plain-text format TEXTFILE.

If the stored data is not plain text but contains binary data, SequenceFile and RCFile are available. RCFile uses column-based storage, similar to HBase: when querying a table, if only specific columns are needed rather than whole records, RCFile is more efficient than SequenceFile, because only the data files for the specified columns need to be traversed. Use the following syntax when creating the table: CREATE TABLE ... ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFILE;

In addition, Hive can use a regular expression to describe the format of the input data: CREATE TABLE stations (usaf STRING, wban STRING, name STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "(\d{6}) (\d{5}) (.{29}) .*");

After reading this article, I believe you have a basic understanding of "Hive from Entry to Proficiency: Example Analysis". If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!
