What does hive mean?

2025-01-16 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 report:

This article explains what Hive is and how it is used. It goes into some detail and should be a useful reference for anyone interested.

Hive is one of the most commonly used companion projects in the Hadoop ecosystem.

The basic components of Hadoop are HDFS and MapReduce. HDFS stores the data for the whole Hadoop cluster, and its design follows the Google File System (GFS). As I understand it, a file is split into blocks and the blocks are spread across the machines in the cluster. The stored data behaves like hidden files: you cannot find them as ordinary files on your local disk, but you can list and manipulate them with Hadoop's HDFS commands, such as hadoop dfs -ls <path> (newer releases spell this hdfs dfs -ls <path>). MapReduce, for its part, distributes the computation over that data to the machines in the cluster, so each machine processes part of the data in parallel and large jobs finish much faster.

Having covered HDFS and MapReduce, you should now have a basic picture of Hadoop. On to Hive.
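To make the map and reduce steps concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the two lists in `blocks` merely stand in for pieces of one file stored on different nodes, and the function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of input lines."""
    return [(word, 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input: each inner list plays the role of one HDFS block.
blocks = [["hive hadoop", "hdfs"], ["hadoop hive hive"]]

# On a real cluster each block would be mapped on a different machine.
mapped = [pair for block in blocks for pair in map_phase(block)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each block is mapped independently, the map step can run wherever the block is stored, which is the source of the speed-up described above.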

Hive analyzes the data in a data warehouse by parsing the HiveQL statements you write and turning them into MapReduce jobs. In effect, Hive wraps MapReduce in a set of classes and methods so that we can run MapReduce by writing HiveQL. Anything you can do in HiveQL you could also write directly in MapReduce, if you don't mind the extra work.

Hive's main functions are: a data ETL (extract, transform, load) tool, data storage, and query and analysis of large datasets.

Hive contains four data models: table (Table), external table (External Table), partition (Partition), and bucket (Bucket).

A table and an external table are nearly the same. The difference is that creating an external table does not move the data into the data warehouse directory; Hive only records where the data lives and does not manage it. When you drop an external table, only the metadata is deleted and the underlying data stays where it is, whereas dropping an ordinary (managed) table deletes its data as well.
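The drop behaviour can be illustrated with a toy model in Python. This is only a sketch of the semantics, not Hive's implementation: `ToyWarehouse`, its methods, and the paths are all hypothetical.

```python
class ToyWarehouse:
    """Toy model: `metadata` plays the metastore, `files` plays HDFS."""

    def __init__(self):
        self.metadata = {}  # table name -> {"location": str, "external": bool}
        self.files = {}     # path -> list of records

    def create_table(self, name, location, external=False):
        self.metadata[name] = {"location": location, "external": external}
        self.files.setdefault(location, [])

    def drop_table(self, name):
        info = self.metadata.pop(name)
        # Only a managed table takes its data with it when dropped.
        if not info["external"]:
            self.files.pop(info["location"], None)


wh = ToyWarehouse()
wh.create_table("managed_t", "/warehouse/managed_t")           # hypothetical path
wh.create_table("external_t", "/data/raw_logs", external=True)  # hypothetical path
wh.files["/warehouse/managed_t"].append("row 1")
wh.files["/data/raw_logs"].append("row 1")

wh.drop_table("managed_t")
wh.drop_table("external_t")
print(sorted(wh.files))
```

After both drops, only the external table's location is still present, which matches the rule above: metadata goes, external data stays.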

As for partitions and buckets: in most cases partitions alone are enough. Both exist to make Hive's analysis more efficient. Previously I kept the data for the 10th through the 20th in a single file; with partitions, each day's data from the 10th to the 20th is stored in its own per-day file. When I query the data for one day, I read only that day's files rather than the whole unpartitioned file, and that is where the efficiency comes from. This is just my understanding, of course; partitions may serve other purposes as well.

As I understand it, Hive's basic workflow is to load the source data into tables you create in Hive, and then get the results you want by querying and analyzing the data in those tables.
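The effect of partitioning on how much data a query touches can be sketched in a few lines of Python. The rows, the `dt` column values, and the function names are invented for illustration; Hive does this with per-partition directories rather than an in-memory dictionary.

```python
from collections import defaultdict


def build_partitions(rows, column):
    """Group rows by the partition column, one bucket of rows per value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def scan_all(rows, day):
    """Unpartitioned query: every row is read to find one day's data."""
    hits = [row for row in rows if row["dt"] == day]
    return hits, len(rows)


def scan_partition(partitions, day):
    """Partitioned query: only the rows stored under that day are read."""
    part = partitions.get(day, [])
    return list(part), len(part)


# Hypothetical data: three records per day for the 10th through the 20th.
rows = [{"dt": str(day), "n": n} for day in range(10, 21) for n in range(3)]
partitions = build_partitions(rows, "dt")

full_hits, full_scanned = scan_all(rows, "15")
part_hits, part_scanned = scan_partition(partitions, "15")
print(full_scanned, part_scanned)
```

Both queries return the same rows, but the partitioned one reads 3 rows instead of 33.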

Hive can store its metadata in three ways:

1. Single User Mode. Metadata is kept in the embedded Derby database. Only one session can connect to it at a time, so it is basically not used in day-to-day work.

2. Multi User Mode. Metadata is stored in a local MySQL database that Hive connects to. This is the most common setup in everyday work, and it seems to be what we use at work.

3. Remote Server Mode. Metadata is served by a metastore server (MetaStoreServer) running on another machine and accessed over the Thrift protocol, so the metadata is not local.

I won't go into Hive configuration here; there is plenty of material online.

On table operations in Hive, there is one point that is easy to misunderstand:

Although Hive has the concept of a table, it is an offline data analysis tool. Tables exist only to serve analysis, and you cannot insert or delete individual records one at a time. The normal workflow is to import the source data into a Hive table and then analyze it with HiveQL. Manually inserting one or a few records is therefore not supported in the Hive version described here. The counterpart to manual insertion is what I think of as automatic insertion, for example: insert overwrite table abc select * from bbc;. In other words, you cannot add data by hand; you can only load data or populate a table from a query. Deleting and modifying individual records is not properly supported either.

I won't introduce HiveQL statements any further; the many resources online cover them clearly. I do want to give one example, though: a mapjoin I wrote today that troubled me all day.

Mapjoin has two advantages: 1. When one of the two joined tables is very small, mapjoin loads the small table into memory and matches the large table's rows against it during the map phase, which is quite efficient. For example: select /*+ mapjoin(a) */ a.id, a.name from a join b on (a.id = b.id), with a as the small table. For equi-joins like this one, the efficiency is quite good.

2. The other advantage is that it supports non-equi joins. In join ... on (...), the on clause must contain equality conditions such as a.id = b.id; if you put a like there, Hive reports an error. With mapjoin you can express a non-equality condition by leaving out on and using where directly after the join. Try it yourself, but my personal impression is that this is not efficient. For example: select /*+ mapjoin(a) */ a.id, a.name from a join b where a.username like 'error'.
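The two cases can be sketched in Python to show why their costs differ. This is an illustration of the idea only, not Hive's implementation; the tables, column names, and function names are hypothetical, and the substring test stands in for a `like '%keyword%'` condition (it ignores SQL wildcard characters inside the keyword).

```python
def map_join_equi(small, large, key):
    """Hash the small table once, then probe it with each large-table row."""
    lookup = {}
    for row in small:
        lookup.setdefault(row[key], []).append(row)
    return [(s, b) for b in large for s in lookup.get(b[key], [])]


def map_join_like(small, large, small_col, large_col):
    """Non-equi variant: every pair is compared, len(small) * len(large) times."""
    comparisons = 0
    matches = []
    for b in large:
        for s in small:
            comparisons += 1
            # Stand-in for: b.wd LIKE concat('%', a.keyword, '%')
            if s[small_col] in b[large_col]:
                matches.append((s, b))
    return matches, comparisons


# Hypothetical small and large tables.
keywords = [{"id": 1, "keyword": "hive"}, {"id": 2, "keyword": "hdfs"}]
page_views = [
    {"id": 1, "wd": "what is hive"},
    {"id": 3, "wd": "hdfs and hive"},
    {"id": 2, "wd": "mapreduce"},
]

equi = map_join_equi(keywords, page_views, "id")
like_matches, comparisons = map_join_like(keywords, page_views, "keyword", "wd")
print(len(equi), len(like_matches), comparisons)
```

The equi-join does one dictionary lookup per large-table row, while the like version has no key to hash on and must compare every pair of rows, which is consistent with the impression that the non-equi form is slow.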

I haven't studied the other aspects of Hive yet. I'll share my impressions once I have.

PS: a non-equi join test with mapjoin:

Two tables: pv_temp_test1 (large table) and title_keyword_test (small table)

pv_temp_test1: 530180

title_keyword_test: 5646

Test statement:

hive> insert overwrite table hbase_test_keyword select /*+ mapjoin(a) */ a.keyword, b.pid, b.area, count(*), count(distinct clientid) from title_keyword_test a join (select dt, pid, area, wd, clientid from pv_temp_test1 where wd != '' and wd != 'None') b where b.wd like concat('%', a.keyword, '%') group by a.keyword, b.dt, b.area, b.pid;

Time: 4927 seconds.
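A rough back-of-the-envelope check in Python suggests why the run took so long. It assumes the two figures above are row counts and that every large-table row is compared with every keyword; the subquery's filter on wd removes some rows first, so this is an upper bound, not a measurement.

```python
# Figures from the test above, assumed to be row counts.
large_rows = 530180      # pv_temp_test1
small_rows = 5646        # title_keyword_test
elapsed_seconds = 4927

# Without an equality key, each pair of rows needs one LIKE evaluation.
comparisons = large_rows * small_rows
per_second = comparisons / elapsed_seconds

print(f"{comparisons:,} pairwise comparisons at most")
print(f"about {per_second:,.0f} comparisons per second")
```

That is close to three billion string comparisons for a small table of only a few thousand rows, which explains the poor efficiency of the non-equi form.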
