What are the tips for using Apache Hive? 07/13 Update SLTechnology News&Howtos

2025-07-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shares a set of practical tips for working with Apache Hive. The editor found them useful and passes them along here for reference.

1. Do not use MapReduce

Whether you favor Tez, Spark, or Impala, do not count on MapReduce. It is slow on its own, and slower still under Hive. If you are on the Hortonworks distribution, you can put set hive.execution.engine=tez at the top of your script; if you are on Cloudera, use Impala. If Impala is not applicable, hive.execution.engine=spark is the hoped-for alternative, where your Hive build supports it.
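As a minimal sketch of the setting described above, the engine can be chosen per session at the top of a script. Which engines are actually installed depends on your distribution, so treat the commented-out line as an assumption about your cluster.

```sql
-- Select the execution engine for this session; place at the top of the script.
SET hive.execution.engine=tez;

-- Alternative, only where Hive on Spark is installed and configured:
-- SET hive.execution.engine=spark;

-- Print the current value to confirm which engine the session will use.
SET hive.execution.engine;
```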

2. Do not do string matching in SQL

Never, and especially not in Hive! If you use a LIKE match in a WHERE clause in place of a proper join condition, Hive generates a cross-product warning. A query that would otherwise take a few seconds can take minutes once string matching is involved. The better alternative is to use one of the tools that add search to Hadoop: try Elasticsearch's Hive integration, Lucidworks' integration for Solr, or Cloudera Search. Relational databases have never been good at this, but Hive is even worse.

3. Do not join against subqueries

Rather than relying on Hive to handle subqueries intelligently, create a temporary table and join against that. That is, do not do this:

select a.* from something a inner join (select ... from somethingelse b union select ... from anotherthing c) d on a.key1 = d.key1 and a.key2 = d.key2 where a.condition = 1

Instead, do this:

create table var_temp as select ... from somethingelse b union select ... from anotherthing c

and then:

select a.* from something a inner join var_temp b on a.key1 = b.key1 and a.key2 = b.key2 where a.condition = 1

In general, this is much faster than Hive's own processing of subqueries.

4. Use Parquet or ORC, but do not convert to them needlessly

That is, use Parquet or ORC instead of TEXTFILE. However, if text data is coming in and you are reshaping it into something more structured, do the conversion when you write to the target tables. LOAD DATA moves files into place rather than converting them, so instead of trying to load a text file into an ORC table, do the initial load into a text table.

If you create other tables on which most of your analysis will ultimately run, store those as ORC. Converting to ORC or Parquet takes time and is not worth doing at the initial load stage of your ETL processing. If you have simple plain-text files coming in and no transformation to apply, you have to load them into a temporary table and then write them into ORC or Parquet with a create-table-as-select. It is a little slow.
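The load-then-convert flow described in this tip might look like the sketch below. The table names (events_raw, events_orc), the columns, and the HDFS path are all hypothetical placeholders, not names from the article.

```sql
-- Step 1: a plain-text staging table that matches the incoming file layout.
CREATE TABLE events_raw (
  event_id BIGINT,
  location STRING,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Step 2: move the text file into the staging table (no conversion happens here).
LOAD DATA INPATH '/tmp/events.tsv' INTO TABLE events_raw;

-- Step 3: convert once, into the table that the analysis queries will read.
CREATE TABLE events_orc
STORED AS ORC
AS
SELECT event_id, location, payload
FROM events_raw;
```

Swapping STORED AS ORC for STORED AS PARQUET gives the Parquet variant of the same pattern.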

5. Try switching vectorization on and off

Add set hive.vectorized.execution.enabled = true and set hive.vectorized.execution.reduce.enabled = true at the top of your script, then run it with them both on and off, because vectorization has been problematic in recent versions of Hive.

6. Do not use structs in table joins

I have to admit that the SQL in my head is still SQL-92, so I would not think of using structs anyway. But if you are doing something elaborate, such as writing ON clauses over compound primary keys, structs are convenient. Unfortunately, Hive does not cope well with them, especially in ON clauses. On smaller datasets this usually works and yields no errors. In Tez, you get an interesting vector error. This limitation is not covered in any documentation I know of, so treat it as a way to get to know the inside of your execution engine.

7. Check your container size

You may need to increase your container size for Impala or Tez. If your nodes are relatively large, the "recommended" container size may not apply to your system. Also make sure that your YARN queue and overall YARN memory are sized appropriately. Note too that the default queue, which all regular users share, may not be the right place for your jobs.
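For Tez, the container size and queue can be set per session, as in the sketch below. The numbers and the queue name are illustrative assumptions only; suitable values depend on your node memory and on how YARN is configured on your cluster.

```sql
-- Memory for each Tez container, in MB (illustrative value, not a recommendation).
SET hive.tez.container.size=4096;

-- JVM heap for tasks in that container; kept below the container size
-- to leave headroom for non-heap memory.
SET hive.tez.java.opts=-Xmx3276m;

-- Run in a dedicated YARN queue instead of the shared default one.
-- "etl" is a hypothetical queue name.
SET tez.queue.name=etl;
```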

8. Enable statistics

Hive makes poor decisions when joining tables unless statistics are enabled. In Impala, you can also use query hints.
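One way to act on this in Hive, as a sketch: turn on statistics gathering and the cost-based optimizer for the session, then compute statistics for the tables you join. The table name something is borrowed from the example in tip 3; whether these settings are already enabled by default depends on your Hive version.

```sql
-- Gather basic statistics automatically when data is inserted.
SET hive.stats.autogather=true;

-- Let the optimizer use the statistics when planning.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- Compute table-level and column-level statistics for an existing table.
ANALYZE TABLE something COMPUTE STATISTICS;
ANALYZE TABLE something COMPUTE STATISTICS FOR COLUMNS;
```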

9. Consider MapJoin optimization

If you run an explain on your query, you may find that Hive is smart enough to apply the optimization automatically. But you may still need to tune it.
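A sketch of what checking and tuning this can look like: the settings below control automatic conversion to a map join, and EXPLAIN shows whether the plan contains a map join operator. The size threshold is illustrative, and the tables are the ones from the tip 3 example.

```sql
-- Allow Hive to convert a join to a map join when one side is small enough.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;

-- Combined size limit, in bytes, for the small tables loaded into memory
-- (illustrative value; tune to your container memory).
SET hive.auto.convert.join.noconditionaltask.size=10000000;

-- Inspect the plan and look for a "Map Join Operator" entry.
EXPLAIN
SELECT a.*
FROM something a
JOIN var_temp b
  ON a.key1 = b.key1 AND a.key2 = b.key2;
```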

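10. If you can, put the largest table last

It is as simple as the title says. Hive buffers the earlier tables of a join and streams the last one, so the largest table belongs in that last position.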
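This tip concerns join order: in a Hive join, the last table in the sequence is the one streamed while the others are buffered. A minimal sketch, assuming two hypothetical tables, small_dim (a small lookup table) and big_fact (a large one):

```sql
-- The largest table is written last, so it is the one that gets streamed.
SELECT d.name, f.amount
FROM small_dim d
JOIN big_fact f
  ON d.key1 = f.key1;

-- If the join order cannot be changed, the STREAMTABLE hint names
-- the table to stream explicitly.
SELECT /*+ STREAMTABLE(f) */ d.name, f.amount
FROM big_fact f
JOIN small_dim d
  ON d.key1 = f.key1;
```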

11. Partitioning helps, within limits

If one item appears in many of your where clauses, such as a date (ideally not a date range) or a recurring location, it may be a good partition key. A partition basically means "split this into its own directory", so Hive no longer has to search through one large file. When your join/where clause only looks at a small subset of the data, such as location='NC', Hive can read just that partition. In addition, unlike column values, partitions can be specified in your LOAD DATA statements. Keep in mind, though, that HDFS does not like small files, so do not partition so finely that each partition holds only a tiny file.
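A sketch of partitioning by location, using the location='NC' case from the text. The table name visits, its columns, and the file path are hypothetical.

```sql
-- Each distinct location value gets its own directory under the table.
CREATE TABLE visits (
  visit_id   BIGINT,
  visit_date STRING
)
PARTITIONED BY (location STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- The partition is named directly in the LOAD DATA statement.
LOAD DATA INPATH '/tmp/visits_nc.tsv'
INTO TABLE visits PARTITION (location='NC');

-- Filtering on the partition column lets Hive read only that directory.
SELECT count(*)
FROM visits
WHERE location = 'NC';
```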

12. Use hashes to compare columns

If you compare the same 10 fields in every query, consider using hash() and comparing the resulting hash values instead. These can be useful enough to store in an output table. Note that the hash function in Hive 0.12 is weak, and the hashing available in 0.13 is better.
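As a sketch of the idea, the query below finds rows whose compared fields differ between two tables by comparing one hash value instead of each column. The tables come from the tip 3 example, the column names col1 to col3 are hypothetical, and three columns stand in for the ten mentioned above.

```sql
-- Rows that share a key but differ in at least one compared column.
-- hash() accepts any number of arguments and returns a single INT.
SELECT a.key1
FROM something a
JOIN somethingelse b
  ON a.key1 = b.key1
WHERE hash(a.col1, a.col2, a.col3) <> hash(b.col1, b.col2, b.col3);
```

Different hashes prove the rows differ, but equal hashes do not strictly prove they match, since collisions are possible. Where an exact answer matters, confirm matches with a column-by-column comparison.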
