2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article introduces the basic concepts of Hive: what it is, why it is used, how its architecture is organized, and how it stores data.
About Hive
What is Hive?
Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query functionality through its query language, HQL.
In essence, Hive converts SQL statements into MapReduce jobs, with HDFS providing the underlying data storage. Hive can therefore be understood as a tool for translating SQL into MapReduce tasks.
Hive can store and compute data
Data storage relies on HDFS
Data computation relies on MapReduce
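As a sketch of what this means in practice, an HQL query reads like ordinary SQL, but Hive compiles it into one or more MapReduce jobs that read files from HDFS. The table and column names below are hypothetical, for illustration only:

```sql
-- Looks like ordinary SQL, but Hive translates it into
-- MapReduce jobs over files stored in HDFS.
-- "employees" and "dept" are hypothetical names.
SELECT dept, COUNT(*) AS cnt
FROM   employees
GROUP  BY dept;
```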
Why Hive?
Problems with using Hadoop directly
The learning cost for staff is too high
Project development cycles are too short
Complex query logic is too difficult to implement in MapReduce
Why use Hive?
The operation interface adopts SQL-like syntax and provides rapid development capability.
Avoid writing MapReduce and reduce the learning costs of developers.
Functionality is easy to extend.
Hive's characteristics
Scalability
Hive can scale out by adding nodes to the cluster, generally without restarting services.
Extensibility
Hive supports user-defined functions (UDFs), so users can implement their own functions as needed.
Fault tolerance
Hive has good fault tolerance: a SQL query can still complete even if a node has a problem.
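The extensibility point can be sketched with Hive's function-registration syntax. The JAR path and class name below are hypothetical; the UDF class itself would be written in Java against Hive's UDF API:

```sql
-- Sketch of registering a user-defined function (UDF).
-- '/tmp/my_udfs.jar' and 'com.example.hive.MyLowerUDF'
-- are hypothetical placeholders.
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLowerUDF';

-- The registered function can then be used like any built-in.
SELECT my_lower(name) FROM employees;
```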
Hive Architecture
Metadata:
The data that describes the data is metadata.
table names,
table columns,
column types
Hive internal execution process:
Interpreter -> Compiler (uses metadata) -> Optimizer -> Executor
basic composition
User interfaces: CLI, JDBC/ODBC, and WebGUI. The CLI (command-line interface) is the Hive shell; JDBC/ODBC is Hive's Java client interface, analogous to a traditional database's JDBC driver; WebGUI provides access to Hive through a browser.
Metadata storage: metadata is usually stored in a relational database such as MySQL or Derby. Hive metadata includes table names, table columns and partitions with their attributes, table properties (e.g., whether a table is external), and the HDFS directory where each table's data resides.
Interpreter, compiler, optimizer, executor: these components carry an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and query-plan generation. The generated query plan is stored in HDFS and subsequently executed by MapReduce.
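Hive's EXPLAIN statement makes this pipeline visible: it prints the plan that the compiler and optimizer produce without actually running the query. The table name below is hypothetical:

```sql
-- EXPLAIN shows the generated query plan (the MapReduce
-- stages Hive would run) without executing the query.
EXPLAIN
SELECT dept, COUNT(*)
FROM   employees
GROUP  BY dept;
```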
Hive and Hadoop
Hive uses HDFS to store data and MapReduce to query and analyze it.
Hive vs. traditional databases
Hive targets offline analysis of massive data sets.
1. Data formats. Hive does not enforce a specific data format.
The data format can be specified by the user; a user-defined format requires three attributes:
a column separator (commonly a space, "\t", or "\x001"),
a row separator ("\n"), and
a method for reading file data (Hive supports three file formats by default: TextFile, SequenceFile, and RCFile).
2. Hive does not convert data from the user's format into a Hive-defined format during loading.
3. Hive does not modify, or even scan, the data itself during loading; it simply copies or moves the data into the appropriate HDFS directory.
4. Hive does not support rewriting or updating data in place; all data content is determined at load time.
5. Hive does not build indexes on keys in the data during loading. To access specific values that satisfy a condition, Hive must brute-force scan the entire data set, so access latency is high. This high latency makes Hive unsuitable for online data queries.
6. Hive is built on top of Hadoop, so its scalability matches that of Hadoop.
Summary: Hive has the appearance of a SQL database, but its application scenarios are completely different; Hive is suitable only for statistical analysis of batch data.
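The points above about user-specified formats (point 1) and load-time behavior (point 3) can be sketched in DDL. The table, columns, and file path below are hypothetical:

```sql
-- A table declaration spelling out the three user-specified
-- attributes: column separator, row separator, and file format.
CREATE TABLE access_log (
    ip  STRING,
    url STRING,
    ts  BIGINT
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Loading simply moves the file into the table's HDFS directory;
-- no parsing, conversion, or indexing happens at this point.
LOAD DATA INPATH '/data/access_log.txt' INTO TABLE access_log;
```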
Hive's Data Storage
1. All data in Hive is stored in HDFS. There is no special required storage format; supported formats include Text, SequenceFile, ParquetFile, ORC, and RCFile.
2. Hive can parse the data as long as it is told the column and row delimiters when the table is created.
3. Hive includes the following data models: DB, table, external table, partition, and bucket.
DB: appears in HDFS as a folder under the ${hive.metastore.warehouse.dir} directory
Table: a folder under the DB directory in HDFS
External table: similar to a table, but its data can be stored at any specified path
Partition: a subdirectory under the table directory in HDFS
Bucket: multiple files under the same table directory, produced by hashing
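These data models map directly onto HDFS paths, which the following illustrative DDL sketches. All names and paths are hypothetical:

```sql
-- DB: creates a folder under ${hive.metastore.warehouse.dir}
CREATE DATABASE sales;

-- External table: data lives at an arbitrary user-specified path
CREATE EXTERNAL TABLE sales.raw_orders (id BIGINT, amount DOUBLE)
LOCATION '/data/external/raw_orders';

-- Partition: each dt value becomes a subdirectory of the table dir;
-- Bucket: rows are hashed on id into 4 files per partition.
CREATE TABLE sales.orders (id BIGINT, amount DOUBLE)
PARTITIONED BY (dt STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
```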
This concludes the study of Hive's basic concepts. Combining theory with practice makes the material easier to absorb, so go try it out!