What are the considerations for the use of Hive sql 07/02 Update SLTechnology News&Howtos

What are the considerations for the use of Hive sql

2025-07-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/01 Report--

This article is to share with you what you should pay attention to when using Hive sql. The editor thinks it is very practical, so share it with you as a reference and follow the editor to have a look.

Preface

Sql is the most commonly used in data warehouse construction and data analysis, and its syntax is simple and easy to understand. at present, several mainstream frameworks in the field of big data all support sql syntax, including hive,spark,flink, so sql plays an irreplaceable role in the field of big data, which we need to focus on.

If you are not familiar with or not careful in using sql, it is very easy to make mistakes in query analysis. Next, let's take a look at a few error-prone sql statements and precautions.

The text begins with 1. Decimal

In addition to supporting common types such as int,double,string, hive also supports the decimal type, which is used to store precise numeric values in the database and is often used in fields that represent amounts.

Note:

For example, decimal stands for up to 11 digits, the last two of which are decimal places and the integer part is 9 digits

If the integer part exceeds 9 bits, the field becomes null, and if the integer part does not exceed 9 bits, the original field displays

If the decimal part is less than 2 digits, then complete the two digits with 0, and if the decimal part exceeds two digits, the excess part is rounded.

You can also write decimal directly, and the number of digits is not specified. By default, it is decimal (1010) integer 10 digits, without decimals.

2. When creating a location table, you can use location to specify a file or folder create table stu (id int, name string) location'/ user/stu2'

Note:

Use location when creating tables

When you specify a folder, hive will load all the files under the folder. When there is no partition in the table, there can be no more folders under this folder, otherwise an error will be reported.

When the table is a partitioned table, such as partitioned by (day string), each folder under this folder is a partition, and the folder name is day=20201123

In this format, then use: msck repair table score; to repair the table structure. After success, you can see that all the data has been loaded into the table.

3. Load data and load data local load the file load data inpath'/ hivedatas/techer.csv' into table techer; from hdfs and load the file load data local inpath'/ user/test/techer.csv' into table techer from the local system

Note:

Use load data local to indicate that the file is loaded from the local file system and the file is copied to the hdfs

Use load data to indicate that the file is loaded from the hdfs file system, and the file will be moved directly to the hive-related directory. Note that it is not copied in the past, because hive thinks that the hdfs file already has 3 copies, so there is no need to copy it again.

If the table is a partitioned table, an error will be reported if the partition is not specified in load

If you load a file with the same file name, it will be automatically renamed

4. Drop and truncate delete table operation drop table score1; empty table operation truncate table score2

Note:

If hdfs opens the Recycle Bin, the table data deleted by drop can be recovered from the Recycle Bin, and the table structure cannot be recovered and needs to be recreated. Tables emptied by truncate do not enter the Recycle Bin, so tables emptied by truncate cannot be restored.

Therefore, truncate must be used with caution. Once emptied, it will be powerless except for physical recovery.

5. Join connection intra-INNER JOIN join: only data matching the join conditions in both tables will be retained select * from techer t [inner] join course c on t.t_id = c.tjoin;-- inner can omit the LEFT OUTER JOIN left outer join: all data on the left will be returned, and select * from techer t left join course c on t.t_id = c.t_id will be returned on the right -- outer can omit RIGHT OUTER JOIN right outer join: all data on the right side will be returned, left qualified data will be returned, select * from techer t right join course c on t.t_id = c.tqualified full OUTER JOIN full join: all eligible records in all tables will be returned. If the specified field of any table does not have a qualified value, the null value is used instead. SELECT * FROM techer t FULL JOIN course c ON t.t_id = c.t_id

Note:

The hive2 version already supports unequal concatenation, that is, the join on condition can be followed by a greater than than symbol, and it also supports join on condition followed by or (previous versions of on only supported = and and, but not >.

< 和 or) 如hive执行引擎使用MapReduce，一个join就会启动一个job，一条sql语句中如有多个join，则会启动多个job 注意：表之间用逗号(,)连接和 inner join 是一样的，例： select tableA.id, tableB.name from tableA , tableB where tableA.id=tableB.id; 和 select tableA.id, tableB.name from tableA join tableB on tableA.id=tableB.id; 它们的执行效率没有区别，只是书写方式不同，用逗号是sql 89标准，join 是sql 92标准。用逗号连接后面过滤条件用 where ，用 join 连接后面过滤条件是 on。 6. left semi join为什么把这个单独拿出来说，因为它和其他的 join 语句不太一样，这个语句的作用和 in/exists 作用是一样的，是 in/exists 更高效的实现SELECT A.* FROM A where id in (select id from B)SELECT A.* FROM A left semi join B ON A.id=B.id上述两个 sql 语句执行结果完全一样，只不过第二个执行效率高注意事项： left semi join 的限制是：join 子句中右边的表只能在 on 子句中设置过滤条件，在 where 子句、select 子句或其他地方过滤都不行。 left semi join 中 on 后面的过滤条件只能是等于号，不能是其他的。 left semi join 是只传递表的 join key 给 map 阶段，因此left semi join 中最后 select 的结果只许出现左表。因为 left semi join 是 in(keySet) 的关系，遇到右表重复记录，左表会跳过 7. 聚合函数中 null 值hive支持 count(),max(),min(),sum(),avg() 等常用的聚合函数注意事项：聚合操作时要注意 null 值： count(*) 包含 null 值，统计所有行数； count(id) 不包含id为 null 的值； min 求最小值是不包含 null，除非所有值都是 null； avg 求平均值也是不包含 null。以上需要特别注意，null 值最容易导致算出错误的结果 8. 运算符中 null 值hive 中支持常用的算术运算符(+,-,*,/) 比较运算符(>

, 100

Get the result

Idclassifyprice1 Electrical Appliances 702 Electrical Appliances 1303 Electrical Appliances 804Furniture

The result is wrong, and all the types of electrical appliances are queried, because the priority of and is higher than that of or. The above sql statement actually executes: first find out classify = 'furniture' and price > 100, and then find out that classify = 'electrical'

The correct sql is to add parentheses and first calculate the parentheses:

Select * from product where (classify = 'Electrical' or classify = 'Furniture') and price > 100Thank you for reading! This is the end of this article on "what are the precautions for the use of Hive sql?". I hope the above content can be of some help to you, so that you can learn more knowledge. if you think the article is good, you can share it for more people to see!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.