2025-01-30 Update From: SLTechnology News & Howtos > Internet Technology
Shulou (Shulou.com) 06/01 Report
This article is about how to make Hive run faster. The editor finds it very practical and shares it here for your reference; follow along and have a look.
Hive is not a relational database, but it pretends to be one in most cases: it has tables, it runs SQL, and it supports JDBC and ODBC.
This revelation is both good news and bad news: Hive does not run queries the way a database does. It's a long story, but I recently spent more than 80 hours of my work week tuning Hive myself. Needless to say, my head still hurts. So, for your benefit, here are some suggestions to make your Hive project run a little faster than mine did.
1. Do not use MapReduce
Whether you believe in Tez, Spark, or Impala, do not believe in MapReduce. It is slow on its own, and under Hive it is slower still. If you are on a Hortonworks distribution, you can type set hive.execution.engine=tez at the top of your script; on Cloudera, use Impala. Where Impala is not appropriate, I hope you can set hive.execution.engine=spark instead.
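As a concrete sketch, switching engines is a one-line setting at the top of a script (assuming Tez or Hive-on-Spark is actually installed on the cluster):

```sql
-- On a Hortonworks/Tez cluster, put this at the top of the script:
set hive.execution.engine=tez;

-- Where Impala is not a fit and Hive-on-Spark is available instead:
-- set hive.execution.engine=spark;
```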
2. Do not do string matching in SQL
Note that in Hive especially, if you do string matching where a WHERE clause should be doing the work, it can generate a cross-product warning. A query that runs in seconds can take minutes once it has to match strings. Your best option is to use one of the many tools that add search to Hadoop: look at Elasticsearch's Hive integration or Lucidworks' integration for Solr, or there is Cloudera Search. RDBMSes are good at this kind of thing; Hive is very bad at it.
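A minimal sketch of the anti-pattern, with hypothetical table and column names; the LIKE predicate forces a scan-and-compare over every row, which is exactly the work better delegated to a search tool:

```sql
-- Slow in Hive: free-text matching in a WHERE clause (hypothetical names).
SELECT request_id, url
FROM weblogs
WHERE user_agent LIKE '%Mozilla%';   -- string-compares every single row
```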
3. Do not join on a subquery
You are better off creating a temporary table and joining against that temporary table than asking Hive to be smart about how it handles subqueries. Which means: do not join directly against a subquery; materialize it first, then join. At this point in Hive's evolution it really should not make a difference, but it usually does.
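A sketch of the rewrite, with hypothetical table and column names: materialize the subquery as its own table, then join against it:

```sql
-- Instead of joining directly on a subquery:
--   SELECT a.* FROM orders a
--   JOIN (SELECT id FROM customers WHERE region = 'NC') c ON a.cust_id = c.id;

-- ...materialize it first, then join:
CREATE TABLE nc_customers AS
  SELECT id FROM customers WHERE region = 'NC';

SELECT a.*
FROM orders a
JOIN nc_customers c ON a.cust_id = c.id;
```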
4. Use Parquet or ORC, but do not convert to them religiously
That is to say, use Parquet or ORC rather than, for example, TEXTFILE. However, if you have text data coming in and you are massaging it into something more structured, do the conversion in the target table. You cannot LOAD DATA from a text file into an ORC table, so do the initial load into a text table.
If you create other tables against which you will ultimately run most of your analysis, do your ORCing there, because converting to ORC or Parquet takes time and is not worth being the first step of your ETL process. If you have simple flat files coming in and are not doing any tweaking, then you are stuck loading into a temporary table and doing a SELECT ... CREATE into an ORC or Parquet table. I do not envy you, because that is really slow.
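A sketch of that flow, under hypothetical names and paths: land the raw text in a TEXTFILE staging table, then create the analysis table as ORC from a select:

```sql
-- The initial load goes into a plain text staging table...
CREATE TABLE staging_events (id BIGINT, payload STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

LOAD DATA INPATH '/landing/events' INTO TABLE staging_events;

-- ...and the table you actually analyze against is ORC, populated by a select.
CREATE TABLE events STORED AS ORC AS
  SELECT * FROM staging_events;
```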
5. Try turning vectorization on and off
Add the vectorized-execution settings at the top of your script. Try turning them on and off, because there seem to be problems with vectorization in recent versions of Hive.
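Assuming a vectorization-capable version of Hive (0.13 or later), these are the usual switches to toggle:

```sql
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
```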
6. Do not use structs in joins
I have to admit that my native SQL brain is still of the SQL-92 era, so I am not inclined to use structs anyway. But if you are doing something super-repetitive, such as comparing compound primary keys in ON clauses, structs are convenient. Unfortunately, Hive chokes on them, particularly in ON clauses. Of course, it does not do so on smaller datasets, and much of the time it produces no error at all. With vectorization on, you get a fun vector error instead. This limitation is not documented anywhere that I know of. Consider it an entertaining way to get to know the internals of your execution engine!
7. Check the size of the container
You may need to increase your container size for Impala or Tez. Also, the "recommended" sizes may not apply to your system if you have larger node sizes. You may want to make sure your YARN queue and general YARN memory settings are appropriate. You may also want to pin Hive to something other than the default queue that everyone uses.
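For Tez, the knobs look roughly like this; the sizes are hypothetical and must fit within your YARN maximum allocation:

```sql
set hive.tez.container.size=4096;    -- MB per Tez container; tune to your nodes
set hive.tez.java.opts=-Xmx3276m;    -- heap, commonly ~80% of container size
set tez.queue.name=etl;              -- pin to a non-default YARN queue
```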
8. Turn on statistics
Hive does somewhat dumb things with joins unless statistics are turned on. You may also want to use query hints in Impala.
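A sketch of turning statistics on, with a hypothetical table and columns:

```sql
set hive.stats.autogather = true;    -- gather stats automatically on inserts

-- Or compute them explicitly for existing data:
ANALYZE TABLE sales COMPUTE STATISTICS;                         -- table level
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS id, amount;  -- column level
```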
9. Consider MapJoin optimizations
If you EXPLAIN your query, you may find that recent versions of Hive are smart enough to apply the optimization automatically, but you may still need to tweak it.
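The usual MapJoin knobs look like this; the threshold value is illustrative:

```sql
set hive.auto.convert.join = true;   -- convert to a map join when one side is small
set hive.auto.convert.join.noconditionaltask.size = 10000000;  -- small-table limit, bytes
```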
10. If possible, put your largest table last
In a Hive join, the last table in the sequence is streamed while the earlier ones are buffered, so putting the largest table last reduces memory pressure.
11. Partitioning is your friend... sometimes
If you have a field that shows up in many of your WHERE clauses, such as a date (though ideally not a fine-grained range) or a repeated location, you may have your partition key! Partitioning basically means "split into its own directory": instead of scanning one large file, Hive looks only at the files it needs, because your join/WHERE clause, say location='NC', restricts it to one small slice of your dataset. Also, unlike with column values, you can push partition values in your LOAD DATA statements. Remember, however, that HDFS does not like small files.
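A sketch with hypothetical names: partition on the repeated field, push the partition value at load time, and let queries on that field read only the matching directory:

```sql
CREATE TABLE visits (user_id BIGINT, url STRING)
  PARTITIONED BY (location STRING)
  STORED AS ORC;

-- The partition value is pushed in the insert itself:
INSERT INTO TABLE visits PARTITION (location = 'NC')
  SELECT user_id, url FROM staging_visits WHERE state = 'NC';

-- This reads only the location=NC directory, not the whole table:
SELECT count(*) FROM visits WHERE location = 'NC';
```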
12. Use hashes for column comparisons
If you compare the same 10 fields in every query, consider using hash() and comparing the hashes instead. These are sometimes so useful that you may want to put them in an output table. Note that the hash in Hive 0.12 is low-resolution; better hashes are available in 0.13.
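A sketch with hypothetical tables and columns: one hash comparison stands in for a repeated multi-field comparison:

```sql
-- Instead of: ON a.f1 = b.f1 AND a.f2 = b.f2 AND a.f3 = b.f3 ...
SELECT a.id
FROM a JOIN b
  ON hash(a.f1, a.f2, a.f3) = hash(b.f1, b.f2, b.f3);
-- hash() is variadic in Hive; beware collisions with the
-- low-resolution hash in 0.12 and earlier.
```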
Thank you for reading! This is the end of the article on "how to make Hive run faster". I hope the content above was of some help and that you learned something from it. If you think the article is good, share it so more people can see it!