What is the Hive JSON data storage format?

2025-07-01 Update, from SLTechnology News&Howtos (Shulou.com)

This article explains how to store JSON data in Hive and query it through a JSON SerDe, including how to handle files whose JSON is not in the desired shape.

The data is stored as JSON, one JSON object per line.

It must have the following form:

{"field1": "data1", "field2": 100, "field3": "more data1", "field4": 123.001}
{"field1": "data2", "field2": 200, "field3": "more data2", "field4": 123.002}
{"field1": "data3", "field2": 300, "field3": "more data3", "field4": 123.003}
{"field1": "data4", "field2": 400, "field3": "more data4", "field4": 123.004}

Each object must stay on a single line; the file must not be pretty-printed!
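If the source data is pretty-printed, it has to be flattened to one object per line before Hive can read it. A minimal sketch in Python, assuming the input is a JSON array of records; the function name to_json_lines is hypothetical and not part of Hive:

```python
import json

def to_json_lines(pretty_text):
    """Convert a pretty-printed JSON array into one compact object per line."""
    records = json.loads(pretty_text)
    # json.dumps without indent emits no literal line breaks,
    # so every record ends up on exactly one line.
    return "\n".join(json.dumps(record) for record in records)

pretty = """
[
  {"field1": "data1", "field2": 100},
  {"field1": "data2", "field2": 200}
]
"""
print(to_json_lines(pretty))
```

Line breaks inside string values are written as the escape sequence \n by json.dumps, so they do not split a record across lines.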

Download the version of hive-hcatalog-core.jar that matches your Hive installation.

Add it to Hive:

ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;

Create a JSON table:

CREATE TABLE json_table (a string, b bigint)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

Prepare data

{"a": "k", "b": 1}

{"a": "l", "b": 2}

Load the data:

LOAD DATA LOCAL INPATH '/home/hadoop/json.txt' INTO TABLE json_table;

If the loaded data does not meet the format requirements, for example because it is not valid JSON, no error is reported at load time.

The problem is only reported when the table is queried.
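Because the load step does not report malformed input, it can help to check the file before loading it. A minimal sketch in Python, assuming one JSON object per line as described above; find_bad_lines is a hypothetical helper, not a Hive tool:

```python
import json

def find_bad_lines(lines):
    """Return (line_number, reason) for every line that is not a single JSON object."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            value = json.loads(line)
        except json.JSONDecodeError as err:
            bad.append((number, "invalid JSON: " + err.msg))
            continue
        if not isinstance(value, dict):
            # Valid JSON, but not an object that can map onto table columns.
            bad.append((number, "not a JSON object"))
    return bad

sample = ['{"a": "k", "b": 1}', '{"a": "l", "b": 2', '[1, 2]']
for number, reason in find_bad_lines(sample):
    print(number, reason)
```

To check a real file, pass the open file object instead of the sample list, for example find_bad_lines(open('/home/hadoop/json.txt')).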

Complex JSON can be processed as well, by declaring columns with Hive's complex types.

In queries, a map column is accessed by key, for example WHERE conflict["xxx"] = yyy. For array and struct columns, refer to the Hive documentation.

Now a new question: if a file on HDFS is not in the desired JSON format, how can it still be read through a JsonSerDe?

For example:

{
"es": "1459442280603 es 0 gfhgfh,1511240411010000754,\n",
"hd": {
"a": "90014A0507091BC4",
"b": "19",
"c": "74:04:2b:da:00:97"
}
}

The JSON as a whole is well-formed, but the es field is a plain string. I want to turn es into a JSON array of objects or something similar, but I cannot change the structure of the data already on HDFS: many MapReduce programs have been written against it, and the changes would be huge.

The obvious answer is to customize a JsonSerDe by modifying part of its source code.

https://github.com/rcongiu/Hive-JSON-Serde on GitHub is very good: you can download it, modify the code, and recompile it. The code I modified is the deserialize method of org.openx.data.jsonserde.JsonSerDe. As the name suggests, this method parses the data read from HDFS; its parameter is a Writable.

In that method, take the es value, re-parse it, build a new JSON object from it, and put that object into the top-level JSON object.

The new es1 property can then be used when creating the Hive table.
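The actual change lives in the Java deserialize method, which the article does not show. The following Python sketch only illustrates the idea. It assumes that records inside es are separated by newlines and fields by commas; both are assumptions about the data, and add_es1 is a hypothetical helper name:

```python
import json

def add_es1(line):
    """Parse one JSON line and add es1, a structured version of the es string."""
    obj = json.loads(line)
    records = []
    for record in obj.get("es", "").split("\n"):
        record = record.strip().rstrip(",")  # drop the trailing comma of each record
        if record:
            records.append(record.split(","))
    obj["es1"] = records  # the original es string is left untouched
    return obj

line = '{"es": "1459442280603 es 0 gfhgfh,1511240411010000754,\\n", "hd": {"b": "19"}}'
print(add_es1(line)["es1"])
```

Keeping es unchanged and adding es1 next to it means existing programs that read es keep working.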

It is worth noting that if es is parsed into [[xx,yy,zz], [xx1,yy1,zz1]], the Hive table is defined as follows:

CREATE EXTERNAL TABLE jsontest (
  es string,
  es1 array<...>,
  hd map<string,string>
)

What I did at first was different. es was parsed into [{"name1": xx, "name2": yy, "name3": zz}, {"name1": xx1, "name2": yy1, "name3": zz1}]

The Hive table was then defined as:

CREATE EXTERNAL TABLE jsontest (
  es string,
  es1 array<...>,
  hd map<string,string>
)

This kept causing problems, and it took me a while to realize why. It is better to use struct as the element type: the field names can then be specified in the Hive table definition rather than hard-coded in the SerDe code.

Build the package:

mvn -Dcdh.version=1.3.1 package -Dmaven.test.skip=true

This approach does turn the irregular string into regular data that maps onto a Hive table, but it has a drawback. es1 is an array, and if I want to filter on a field of one of its struct elements in a WHERE clause, I cannot: the size of es1 is not fixed, so I do not know which element of the array to test.

New method:

events1 is still an array, but its elements are strings instead of structs:

CREATE EXTERNAL TABLE test.nginx_logs2 (
  events string,
  events1 array<string>,
  header map<string,string>
)
PARTITIONED BY (datepart string, app_token string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'

With this definition, the modified SerDe code only has to assemble each record into a well-formed JSON string.

Then use explode to expand the events1 array, and get_json_object to read an attribute from each JSON string, as in the following query.

SELECT event_name,
       COUNT(DISTINCT user_id) AS num
FROM (SELECT header["user_id"] AS user_id,
             get_json_object(event, '$.name') AS event_name
      FROM test.nginx_logs2 LATERAL VIEW explode(events1) events1 AS event
      WHERE get_json_object(event, '$.name') = 'xxx'
        AND get_json_object(event, '$.type') = '0') f
GROUP BY event_name
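To make the semantics of this query concrete, here is a small Python emulation of what LATERAL VIEW explode combined with get_json_object computes. The sample rows and the helper name count_users_by_event are made up for illustration; in practice Hive evaluates the query itself.

```python
import json

def count_users_by_event(rows, name, type_):
    """Emulate the query: explode events1, filter by name and type, count distinct users."""
    users = {}
    for row in rows:
        user_id = row["header"].get("user_id")
        for event_json in row["events1"]:   # one output row per array element, like explode
            event = json.loads(event_json)  # each element is a JSON string
            # The query compares against the string '0', so compare as strings here too.
            if event.get("name") == name and str(event.get("type")) == type_:
                ids = users.setdefault(event["name"], set())
                if user_id is not None:     # COUNT(DISTINCT ...) ignores NULL values
                    ids.add(user_id)
    return {event_name: len(ids) for event_name, ids in users.items()}

rows = [
    {"header": {"user_id": "u1"},
     "events1": ['{"name": "xxx", "type": "0"}', '{"name": "yyy", "type": "0"}']},
    {"header": {"user_id": "u2"},
     "events1": ['{"name": "xxx", "type": "0"}', '{"name": "xxx", "type": "1"}']},
    {"header": {"user_id": "u1"},
     "events1": ['{"name": "xxx", "type": "0"}']},
]
print(count_users_by_event(rows, "xxx", "0"))
```

Because every array element becomes its own row before filtering, the WHERE clause no longer needs to know which position in events1 holds the matching event.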
