2025-03-29 Update From: SLTechnology News & Howtos
This article is a detailed tutorial on parsing JSON arrays in Hive. It walks through a simple, practical approach; pairing the theory with hands-on practice will make it stick, so follow along and try the examples yourself.
In Hive, a lot of data is stored in JSON format. For example, when developers add event tracking to app pages, they often store multiple fields in a single JSON array, so the data platform has to parse that tracking data before it can use it. Let's look at how JSON data is parsed in Hive.
JSON parsing functions built into Hive
1. get_json_object
Syntax: get_json_object(json_string, '$.key')
Description: parses the JSON string json_string and returns the content at the specified path. If the input JSON string is invalid, NULL is returned. The function can only extract one field at a time.
Example:
select get_json_object('{"name":"zhangsan","age":18}', '$.name');
Results:
name
zhangsan
If you want to parse both name and age fields, you can write:
select get_json_object('{"name":"zhangsan","age":18}', '$.name'), get_json_object('{"name":"zhangsan","age":18}', '$.age');
But if there are many fields to parse, writing it this way quickly becomes tedious, which is where json_tuple comes in.
2. json_tuple
Syntax: json_tuple(json_string, k1, k2, ...)
Description: parses the JSON string json_string; you can specify multiple keys and get the corresponding values back. If the input JSON string is invalid, NULL is returned.
Example:
select b.name, b.age from tableName a lateral view json_tuple('{"name":"zhangsan","age":18}', 'name', 'age') b as name, age;
Results:
name        age
zhangsan    18
Note: there is no $. prefix on the keys passed to json_tuple above.
If you do add $. when using json_tuple, parsing fails:
select b.name, b.age from tableName a lateral view json_tuple('{"name":"zhangsan","age":18}', '$.name', '$.age') b as name, age;
Results:
name    age
NULL    NULL
Both fields come back NULL, so keys passed to json_tuple must not carry the $. prefix; otherwise nothing is parsed.
Summary: the advantage of json_tuple over get_json_object is that it can parse multiple JSON fields in one call. But if we have a JSON array, neither of these functions can handle it.
Parsing a JSON array in Hive
1. Parsing a JSON array with a nested subquery
Suppose you have a Hive table whose json_str field contains the following:
json_str
[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]
We want to parse this field to form the following structure:
website       name
baidu.com     Baidu
google.com    Google
The two functions described above are not enough on their own to parse this JSON array; a few more functions are needed:
Explode function
Syntax: explode(array | map)
Description: explode() takes an array or a map as input and outputs each of its elements as a separate row; that is, it splits a complex array or map column in Hive into multiple rows, which is why it is also called a table-generating (column-to-rows) function.
Example:
-- parsing an array
hive> select explode(array('A', 'B', 'C'));
OK
A
B
C
-- parsing a map
hive> select explode(map('A', 10, 'B', 20, 'C', 30));
OK
A	10
B	20
C	30
Regexp_replace function
Syntax: regexp_replace (string A, string B, string C)
Description: replaces the parts of string A that match the Java regular expression B with C. Note that escape characters are needed in some cases. It is similar to the regexp_replace function in Oracle.
Example:
hive> select regexp_replace('foobar', 'oo|ar', '');
OK
fb
The above example replaces oo or ar in the string with the empty string ''.
With the above functions, let's parse the contents of the json_str field:
First, parse out the elements of the JSON array and turn each element into its own row:
hive> SELECT explode(split(regexp_replace(regexp_replace('[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]', '\\[|\\]', ''), '\\}\\,\\{', '\\};\\{'), '\\;'));
OK
{"website":"baidu.com","name":"Baidu"}
{"website":"google.com","name":"Google"}
A brief explanation of the above SQL:
regexp_replace(json_str, '\\[|\\]', '') removes the square brackets on both sides of the JSON array; regexp_replace(..., '\\}\\,\\{', '\\};\\{') replaces the commas between the elements of the JSON array with semicolons; split(..., '\\;') then splits on those semicolons.
Why replace commas with semicolons between elements of the json array?
Because the fields inside each element are also separated by commas. If the commas between elements were not replaced first, the split function would also cut apart the data inside each element, which is not what we want.
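The comma-to-semicolon trick can also be sketched outside of Hive. Here is a minimal Python simulation of the regexp_replace/split pipeline, using the sample string from this article (the variable names are illustrative, not from Hive):

```python
import re

json_str = '[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]'

# Step 1: remove the outer square brackets,
# like regexp_replace(json_str, '\\[|\\]', '')
no_brackets = re.sub(r'\[|\]', '', json_str)

# Step 2: replace the comma BETWEEN objects with a semicolon,
# like regexp_replace(..., '\\}\\,\\{', '\\};\\{'),
# so the next split will not cut inside an object
delimited = re.sub(r'\},\{', '};{', no_brackets)

# Step 3: split on the semicolon, like split(..., '\\;'):
# one complete JSON object per row
rows = delimited.split(';')

# Had we split on the comma directly, each object would be torn apart:
broken = no_brackets.split(',')
```

Here rows holds the two intact JSON objects, while broken contains four fragments such as '{"website":"baidu.com"', which is exactly the failure mode the semicolon replacement avoids.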
The previous step turned the JSON array into multiple JSON strings; next, try to parse the fields inside each JSON string with json_tuple:
select json_tuple(explode(split(regexp_replace(regexp_replace('[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]', '\\[|\\]', ''), '\\}\\,\\{', '\\};\\{'), '\\;')), 'website', 'name');
The execution of the above statement resulted in an error:
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
It means that a UDTF cannot be nested inside another expression, i.e. explode cannot be written inside json_tuple here.
Since explode cannot be nested inside json_tuple, we can use a subquery instead, as shown below:
select json_tuple(json, 'website', 'name') from (select explode(split(regexp_replace(regexp_replace('[{"website":"baidu.com","name":"Baidu"},{"website":"google.com","name":"Google"}]', '\\[|\\]', ''), '\\}\\,\\{', '\\};\\{'), '\\;')) as json) t;
No error was reported when executing the above statement, and the execution result is as follows:
baidu.com     Baidu
google.com    Google
2. Using lateral view to parse a JSON array
The contents of the goods_id and json_str fields in the hive table are as follows:
goods_id
1,2,3
json_str
[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
Goal: parse out goods_id together with the monthSales field inside json_str.
Let's start with the analysis:
First attempt: split the goods_id field and convert the JSON array into multiple JSON strings in the same select:
select explode(split(goods_id, ',')) as good_id, explode(split(regexp_replace(regexp_replace(json_str, '\\[|\\]', ''), '\\}\\,\\{', '\\};\\{'), '\\;')) as sale_info from tableName;
An error was reported when the above statement was executed:
FAILED: SemanticException 3:0 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'sale_info'
It means that when a UDTF is used, the SELECT clause supports only a single expression. The statement above selects two, so an error is reported.
What should we do? To solve this problem, we need to introduce another piece of Hive syntax:
Lateral view
Lateral view is used together with UDTFs such as split and explode to turn one row of data into multiple rows, on top of which the split-out data can then be aggregated. Lateral view first calls the UDTF for each row of the original table; the UDTF splits that row into one or more rows; lateral view then joins the results together to produce a virtual table that supports a table alias.
Example:
Suppose we have a user interest table hobbies_table with two columns: the first is name, and the second is the user's interest list id_list, an array storing interest id values:
name        id_list
zhangsan    [1,2,3]
lisi        [3,4,5]
We want to count the number of id occurrences of all interests among all users:
1. Parse out the interest ids:
SELECT name, hobby_id FROM hobbies_table LATERAL VIEW explode(id_list) tmp_table AS hobby_id;
The results of the above sql implementation:
name        hobby_id
zhangsan    1
zhangsan    2
zhangsan    3
lisi        3
lisi        4
lisi        5
2. Group and aggregate by hobby_id:
SELECT hobby_id, count(1) client_num FROM hobbies_table LATERAL VIEW explode(id_list) tmp_table AS hobby_id group by hobby_id;
Results:
hobby_id    client_num
1           1
2           1
3           2
4           1
5           1
Now that lateral view has been introduced, let's go back and solve the problem that SELECT supports only a single UDTF expression:
select good_id, get_json_object(sale_json, '$.monthSales') as monthSales from tableName LATERAL VIEW explode(split(goods_id, ',')) goods as good_id LATERAL VIEW explode(split(regexp_replace(regexp_replace(json_str, '\\[|\\]', ''), '\\}\\,\\{', '\\};\\{'), '\\;')) sales as sale_json;
Note: the statement above produces the Cartesian product of the three tables (the base table and the two lateral views), so this method is only suitable when the data volume is not very large.
The execution result of the above statement is as follows:
goods_id    monthSales
1           4900
1           2090
1           6987
2           4900
2           2090
2           6987
3           4900
3           2090
3           6987
If there are other fields in the table, we can filter out the data that matches the results based on the other fields.
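The cross-join behaviour noted above, and the position-based filtering it implies, can be sketched in Python. The pairing-by-position assumption below is mine, inferred from the example data, not stated by Hive:

```python
from itertools import product

goods_ids = "1,2,3".split(",")      # what explode(split(goods_id, ',')) produces
month_sales = [4900, 2090, 6987]    # monthSales values from the JSON array

# Two LATERAL VIEW explodes on the same row behave like a cross join:
# every goods_id is paired with every array element, 3 * 3 = 9 rows.
cartesian = list(product(goods_ids, month_sales))

# If the i-th goods_id corresponds to the i-th array element,
# the rows we actually want are just the positional pairs:
wanted = list(zip(goods_ids, month_sales))
```

In Hive itself, posexplode (which emits each element's position alongside its value) is a common way to realize this kind of positional pairing without keeping the full cross product.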
Summary: lateral view usually appears together with a UDTF in order to work around the restriction that a UDTF allows only a single expression in SELECT.
That concludes this detailed tutorial on parsing JSON arrays in Hive. Pairing the theory with hands-on practice is the best way to learn, so go and try it!