This article describes Hive's composite data structures and the usage of the related functions. The material is concise and easy to follow; I hope the detailed walkthrough below proves useful.
Hive currently supports the following composite data types:
map(key1, value1, key2, value2, ...): creates a map with the given key/value pairs
struct(val1, val2, val3, ...): creates a struct with the given field values; the struct field names will be col1, col2, ...
named_struct(name1, val1, name2, val2, ...): creates a struct with the given field names and values (as of Hive 0.8.0)
array(val1, val2, ...): creates an array with the given elements
create_union(tag, val1, val2, ...): creates a union type with the value that is being pointed to by the tag parameter
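These constructors can be called inline in a query. A minimal sketch (untested; t is a hypothetical placeholder for any existing one-row table, and Hive 0.13+ also allows dropping the FROM clause):

SELECT array('a', 'b', 'c')[0],          -- 'a' (array indexes start at 0)
       map('k1', 1, 'k2', 2)['k2'],      -- 2
       named_struct('x', 1, 'y', 2).y,   -- 2
       struct(10, 'ten').col2            -- 'ten' (fields auto-named col1, col2, ...)
FROM t LIMIT 1;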
I. Using map, struct, and array

1. Array

Create a table with an array column:

create table person(name string, work_locations array<string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

Data (person.txt):

biansutao beijing,shanghai,tianjin,hangzhou
linan changchu,chengdu,wuhan

Load the data:

LOAD DATA LOCAL INPATH '/home/hadoop/person.txt' OVERWRITE INTO TABLE person;

Queries:

hive> select * from person;
biansutao ["beijing","shanghai","tianjin","hangzhou"]
linan ["changchu","chengdu","wuhan"]
Time taken: 0.355 seconds
hive> select name from person;
linan
biansutao
Time taken: 12.397 seconds
hive> select work_locations[0] from person;
changchu
beijing
Time taken: 13.214 seconds
hive> select work_locations from person;
["changchu","chengdu","wuhan"]
["beijing","shanghai","tianjin","hangzhou"]
Time taken: 13.755 seconds
hive> select work_locations[3] from person;
NULL
hangzhou
Time taken: 12.722 seconds
hive> select work_locations[4] from person;
NULL
NULL
Time taken: 15.958 seconds

2. Map

Create a table with a map column:

create table score(name string, score map<string,int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

Data (score.txt):

biansutao Mathematics:80,Chinese:89,English:95
jobs Chinese:60,Mathematics:80,English:99

Load the data:

LOAD DATA LOCAL INPATH '/home/hadoop/score.txt' OVERWRITE INTO TABLE score;

Queries:

hive> select * from score;
biansutao {"Mathematics":80,"Chinese":89,"English":95}
jobs {"Chinese":60,"Mathematics":80,"English":99}
Time taken: 0.665 seconds
hive> select name from score;
jobs
biansutao
Time taken: 19.778 seconds
hive> select t.score from score t;
{"Chinese":60,"Mathematics":80,"English":99}
{"Mathematics":80,"Chinese":89,"English":95}
Time taken: 19.353 seconds
hive> select t.score['Chinese'] from score t;
60
89
Time taken: 13.054 seconds
hive> select t.score['English'] from score t;
99
95
Time taken: 13.769 seconds

3. Struct

Create a table with a struct column:

CREATE TABLE test(id int, course struct<course:string,score:int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ',';

Data (test.txt):

1 english,80
2 math,89
3 chinese,95

Load the data:

LOAD DATA LOCAL INPATH '/home/hadoop/test.txt' OVERWRITE INTO TABLE test;

Queries:

hive> select * from test;
OK
1 {"course":"english","score":80}
2 {"course":"math","score":89}
3 {"course":"chinese","score":95}
Time taken: 0.275 seconds
hive> select course from test;
{"course":"english","score":80}
{"course":"math","score":89}
{"course":"chinese","score":95}
Time taken: 44.968 seconds
hive> select t.course.course from test t;
english
math
chinese
Time taken: 15.827 seconds
hive> select t.course.score from test t;
80
89
95
Time taken: 13.235 seconds

4. Combining composite types (here, a map whose values are arrays)

create table test1(id int, a map<string, array<int>>)
row format delimited
fields terminated by '\t'
collection items terminated by ','
MAP KEYS TERMINATED BY ':';

Data (test1.txt):

1 english:80,90,70
2 math:89,78,86
3 chinese:99,100,82

Load the data:

LOAD DATA LOCAL INPATH '/home/hadoop/test1.txt' OVERWRITE INTO TABLE test1;
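Reading such nested data back is a matter of chaining the accessors. A small sketch against test1 (untested, following the data above):

hive> select id, a['english'][0] from test1;
-- expected: 80 for row 1; NULL for rows 2 and 3, whose map keys are math/chinese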
II. Usage of some less common functions in Hive

We will not dwell on the common functions, which behave much like standard SQL; what follows are mostly functions specific to HQL.
The functions in Hive fall roughly into these categories: Built-in, Misc., UDF, UDTF, UDAF.
Below we pick a few that standard SQL does not offer but that are commonly used for statistical analysis in Hive SQL.
1. array_contains (Collection Functions)
This is a built-in function for operating on collections. For example:
create EXTERNAL table IF NOT EXISTS userInfo(id int, sex string, age int, name string, email string, sd string, ed string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/hive/dw';

select * from userinfo where sex='male'
and (id != 1 and id != 2 and id != 3 and id != 4 and id != 5)
and age < 30;

select * from (
  select * from userinfo where sex='male'
  and !array_contains(split('1,2,3,4,5',','), cast(id as string))
) tb1 where tb1.age < 30;

The test data used to populate this table can be generated automatically with the script linked here:
http://my.oschina.net/leejun2005/blog/76631

2. get_json_object (Misc. Functions)

Test data:

first {"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} third
first {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":91,"type":"pear"}],"bicycle":{"price":19.952,"color":"red2"}},"email":"amy@only_for_json_udf_test.net","owner":"amy2"} third
first {"store":{"fruit":[{"weight":10,"type":"apple"},{"weight":911,"type":"pear"}],"bicycle":{"price":19.953,"color":"red3"}},"email":"amy@only_for_json_udf_test.net","owner":"amy3"} third

create external table if not exists t_json(f1 string, f2 string, f3 string)
row format delimited fields TERMINATED BY ' '
location '/test/json';

select get_json_object(t_json.f2, '$.owner') from t_json;
SELECT * from t_json where get_json_object(t_json.f2, '$.store.fruit[0].weight') = 9;
SELECT get_json_object(t_json.f2, '$.non_exist_key') FROM t_json;

Pay particular attention here to the UDTF issue; the official documentation explains:

json_tuple
A new json_tuple() UDTF is introduced in hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string. In any case where a single JSON string would be parsed more than once, your query will be more efficient if you parse it once, which is what JSON_TUPLE is for. As JSON_TUPLE is a UDTF, you will need to use the LATERAL VIEW syntax in order to achieve the same goal.
For example,
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appenvets, '$.eventname') from log a;
should be changed to
select a.timestamp, b.* from log a lateral view json_tuple(a.appevent, 'eventid', 'eventname') b as f1, f2;

UDTFs (User-Defined Table-Generating Functions) address one-to-many mapping: one input row is turned into multiple output rows. Lateral view makes it convenient to join the row set produced by a UDTF back together with other columns, because using a UDTF directly in SELECT is restricted: the SELECT may contain only a single expression. Multiple UDTFs are not allowed, and even a single UDTF combined with other fields is rejected; Hive reports that only a single expression is supported with a UDTF. For example:

hive> Select my_test("abcef:aa") as qq, 'abcd' from sunwg01;
FAILED: Error in semantic analysis: Only a single expression in the SELECT clause is supported with UDTF's
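To make the recommended pattern concrete, here is json_tuple with LATERAL VIEW applied to the t_json table defined earlier (a sketch; the keys owner and email appear in the sample JSON above):

select t.f1, b.owner, b.email
from t_json t lateral view json_tuple(t.f2, 'owner', 'email') b as owner, email;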
Requirements like the one rejected above can be implemented using Lateral view. The Lateral view syntax is as follows:
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
hive> create table sunwg(a array<int>, b array<string>)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > COLLECTION ITEMS TERMINATED BY ',';
OK
Time taken: 1.145 seconds
hive> load data local inpath '/home/hjl/sunwg/sunwg.txt' overwrite into table sunwg;
Copying data from file:/home/hjl/sunwg/sunwg.txt
Loading data to table sunwg
OK
Time taken: 0.162 seconds
hive> select * from sunwg;
OK
[10,11] ["tom","mary"]
[20,21] ["kate","tim"]
Time taken: 0.069 seconds
hive> SELECT a, name
    > FROM sunwg LATERAL VIEW explode(b) r1 AS name;
OK
[10,11] tom
[10,11] mary
[20,21] kate
[20,21] tim
Time taken: 8.497 seconds
hive> SELECT id, name
    > FROM sunwg LATERAL VIEW explode(a) r1 AS id
    > LATERAL VIEW explode(b) r2 AS name;
OK
10 tom
10 mary
11 tom
11 mary
20 kate
20 tim
21 kate
21 tim
Time taken: 9.687 seconds
3. parse_url_tuple
Test data:
url1 http://facebook.com/path2/p.php?k1=v1&k2=v2#Ref1
url2 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject
url3 https://www.google.com.hk/#hl=zh-CN&newwindow=1&safe=strict&q=hive+translate+example&oq=hive+translate+example&gs_l=serp.3...10174.11861.6.12051.8.8.0.0.0.0.132.883.0j7.7.0...0.0...1c.1j4.8.serp.0B9C1T_n0Hs&bav=on.2,or.&bvm=bv.44770516,d.aGc&fp=e13e41a6b9dab3f6&biw=1241&bih=589
create external table if not exists t_url(f1 string, f2 string)
row format delimited fields TERMINATED BY ' '
location '/test/url';

SELECT f1, b.* FROM t_url LATERAL VIEW parse_url_tuple(f2, 'HOST', 'PATH', 'QUERY', 'QUERY:k1') b as host, path, query, query_id;
Results:
url1 facebook.com /path2/p.php k1=v1&k2=v2 v1
url2 cwiki.apache.org /confluence/display/Hive/LanguageManual+UDF NULL NULL
url3 www.google.com.hk / NULL NULL
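For comparison, the scalar parse_url function extracts a single part per call, while parse_url_tuple is the UDTF counterpart that extracts several parts in one pass. A quick sketch (Hive 0.13+ allows a FROM-less SELECT; otherwise select from any one-row table):

SELECT parse_url('http://facebook.com/path2/p.php?k1=v1&k2=v2#Ref1', 'HOST');         -- facebook.com
SELECT parse_url('http://facebook.com/path2/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1');  -- v1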
4. explode
explode is a Hive built-in table-generating function (Built-in Table-Generating Functions, UDTF). It mainly solves 1-to-N problems: it splits one row of input into multiple rows, for example turning an array into one row per element, and outputs the result as a virtual table. Note the following points:
Using the syntax "SELECT udtf(col) AS colAlias..." has a few limitations:
No other expressions are allowed in SELECT: "SELECT pageid, explode(adid_list) AS myCol..." is not supported.
UDTFs can't be nested: "SELECT explode(explode(adid_list)) AS myCol..." is not supported.
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported: "SELECT explode(adid_list) AS myCol ... GROUP BY myCol" is not supported.
From the principles and grammar above, we can see:
A UDTF cannot be mixed with other non-UDTF columns in the select list.
UDTFs cannot be nested.
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
In addition, a UDTF appearing in select must be given an alias, otherwise an error is reported:
SELECT explode(myCol) AS myNewCol FROM myTable;
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
SELECT posexplode(myCol) AS pos, myNewCol FROM myTable;

5. lateral view
Lateral view is the companion syntax Hive provides for UDTFs; it solves the problem that a UDTF cannot carry additional select columns. When we want to split a column of a Hive table, converting one row into N rows, Hive does not allow other select expressions alongside the UDTF.
For example, suppose the ids of users who logged into a game are stored in a single field user_ids; we want to apply a UDTF to each row of data and output multiple rows.
select game_id, explode(split(user_ids, '\\[')) as user_id from login_game_log where dt='2014-05-15';
FAILED: Error in semantic analysis: UDTF's are not supported outside the SELECT clause, nor nested in expressions.
The parser rejects the query: a UDTF does not support select expressions outside the function. What if we need them anyway? This is where Lateral View takes the stage.
Lateral view is used in conjunction with UDTF functions such as explode. Lateral view puts the rows generated by the UDTF into a virtual table and then joins this virtual table (1 to N) with each input row, i.e. each game_id, so that the select fields outside the UDTF can be attached (the source row joins 1-to-N with the rows of the split-out virtual table). This is also why a table alias and column aliases are required after LATERAL VIEW udtf(expression).
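Rewritten with lateral view, the failing query above becomes something like the following (a sketch, keeping the table, columns, and separator from the example):

select t.game_id, u.user_id
from login_game_log t
lateral view explode(split(t.user_ids, '\\[')) u as user_id
where t.dt = '2014-05-15';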
Lateral View Syntax
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
As the grammar shows, Lateral view appears in two places:
the LATERAL VIEW keyword is written in front of the udtf expression;
the lateral view clause itself comes after baseTable in the from clause.
For example:
pageid adid_list
front_page [1, 2, 3]
contact_page [3, 4, 5]

SELECT pageid, adid FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;

pageid adid
front_page 1
front_page 2
front_page 3
contact_page 3
contact_page 4
contact_page 5
A From statement can be followed by multiple Lateral View clauses; as the Hive wiki puts it:
A FROM clause can have multiple LATERAL VIEW clauses. Subsequent LATERAL VIEWS can reference columns from any of the tables appearing to the left of the LATERAL VIEW.
Given data:
Array<int> col1    Array<string> col2
[1, 2]             ["a", "b", "c"]
[3, 4]             ["d", "e", "f"]
Conversion target:
If you want to split the first column and the second column at the same time, it's like doing a Cartesian product.
We can write like this:
SELECT myCol1, myCol2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;
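With the sample data above, each row expands into the Cartesian product of its two arrays, 12 rows in total (expected output, derived by hand):

myCol1 myCol2
1 a
1 b
1 c
2 a
2 b
2 c
3 d
3 e
3 f
4 d
4 e
4 f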
What about the other direction: what if the array produced by the UDTF is empty?
Hive 0.12 introduced the outer keyword. By default, if the UDTF returns no rows, the input row produces no output at all.
With the outer keyword, the behavior resembles a left outer join: the selected columns are still output, and the UDTF's output columns are NULL.
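A minimal sketch following the example in the Hive wiki (assuming some table src exists): explode(array()) produces no rows, so without OUTER the query returns nothing at all, while with OUTER every src row is emitted with NULL in column a.

SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a LIMIT 10;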
To summarize: Lateral View usually appears together with a UDTF, to work around the restriction that other select fields are not allowed next to a UDTF.
Multiple Lateral Views can implement a Cartesian-product-like expansion.
The outer keyword emits rows whose UDTF result is empty with NULL output instead of dropping them, preventing data loss.
The above covers Hive's composite data structures and the usage of the related functions. I hope you have picked up something useful from this walkthrough.