Shulou (shulou.com), SLTechnology News & Howtos > Servers. Published 05/31, updated 2025-01-18.
This article explains how Pig and Hive handle custom input/output delimiters, and how to resolve the delimiter conflict that arises when Array types are nested inside Map types. The content is easy to follow; I hope it helps clear up any doubts you may have.
In Pig the default input/output delimiter is the tab character \t, while in Hive it is the octal character \001, i.e. ASCII Ctrl-A:

Oct Dec Hex ASCII_Char
001 1   01  SOH (start of heading)

The official explanation is that the delimiter should be a character unlikely to appear in the text itself, which is why Ctrl-A was chosen. A single-character delimiter can be specified in Hive via row format delimited fields terminated by '#'; and in Pig via PigStorage.
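As a quick illustration of the single-character case (a minimal sketch; the table name, columns, and file name are illustrative):

```sql
-- Hive: declare '#' as the field delimiter for a plain-text table
CREATE TABLE t (c0 string, c1 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
```

In Pig the equivalent is A = LOAD 'input.txt' USING PigStorage('#');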
But what about using multiple characters as a delimiter? Pig reports an error outright, while Hive silently recognizes only the first character and ignores the rest.
Solution:
Pig: write a custom load function by inheriting LoadFunc and overriding a few methods, and you're done.
For details, see: http://my.oschina.net/leejun2005/blog/83825
In Hive, there are two ways to support a custom multi-character delimiter string:
1. Use RegexSerDe:
RegexSerDe is a serialization/deserialization (SerDe) class that ships with Hive and parses rows using regular expressions.
RegexSerDe has three main parameters:
input.regex
output.format.string
input.regex.case.insensitive
Here is a complete example:
add jar /home/june/hadoop/hive-0.8.1-bin/lib/hive_contrib.jar;

CREATE TABLE b (c0 string, c1 string, c2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^,]*),([^,]*),([^,]*)',
  'output.format.string' = '%1$s %2$s %3$s'
)
STORED AS TEXTFILE;

load data local inpath 'b.txt' overwrite into table b;
select * from b;
REF:
http://www.oratea.net/?p=652
http://grokbase.com/t/hive/user/115sw9ant2/hive-create-table
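The effect of input.regex can be checked outside Hive with plain java.util.regex, since RegexSerDe simply maps each capture group to one column (a standalone sketch; the class name and sample line are mine, not from the original post):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demonstrates RegexSerDe-style parsing: each capture group in
// input.regex becomes one column of the row.
public class RegexSerDeDemo {
    static final Pattern ROW = Pattern.compile("([^,]*),([^,]*),([^,]*)");

    // Returns the columns extracted from one text line, or null if no match.
    public static String[] parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] cols = parse("1,apple,12");
        System.out.println(cols[0] + " | " + cols[1] + " | " + cols[2]);
        // prints "1 | apple | 12"
    }
}
```

Lines that do not match the pattern yield no columns, which mirrors how RegexSerDe produces NULLs for non-matching rows.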
2. Override the corresponding InputFormat and OutputFormat methods:

// To use multiple characters to separate fields, a custom InputFormat is required.
package org.apache.hadoop.mapred;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class MyDemoInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit genericSplit, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(genericSplit.toString());
        MyDemoRecordReader reader = new MyDemoRecordReader(
                new LineRecordReader(job, (FileSplit) genericSplit));
        return reader;
    }

    public static class MyDemoRecordReader
            implements RecordReader<LongWritable, Text> {
        LineRecordReader reader;
        Text text;

        public MyDemoRecordReader(LineRecordReader reader) {
            this.reader = reader;
            text = reader.createValue();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }

        @Override
        public LongWritable createKey() {
            return reader.createKey();
        }

        @Override
        public Text createValue() {
            return new Text();
        }

        @Override
        public long getPos() throws IOException {
            return reader.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            return reader.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {
            while (reader.next(key, text)) {
                // Replace the visible delimiter "|" with Hive's internal
                // \001 delimiter before Hive sees the record.
                Text txtReplace = new Text();
                txtReplace.set(text.toString().toLowerCase()
                        .replaceAll("\\|", "\001"));
                value.set(txtReplace.getBytes(), 0, txtReplace.getLength());
                return true;
            }
            return false;
        }
    }
}

The table statement is then:

create external table IF NOT EXISTS test (
  id string,
  name string
)
partitioned by (day string)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.MyDemoInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/log/dw_srclog/test';
Collect logs to Hive http://blog.javachen.com/2014/07/25/collect-log-to-hive/
Reference:
Hive handles logs, custom inputformat
http://running.iteye.com/blog/907806
http://superlxw1234.iteye.com/blog/1744970
The principle is simple: Hive's internal delimiter is \001, so the custom InputFormat just replaces the real delimiter with \001.
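The substitution performed inside next() above can be isolated into one line of plain Java (a minimal sketch; the class and method names are mine):

```java
// Demonstrates the core of the custom InputFormat: rewrite each input
// line so that the user-visible delimiter ("|") becomes Hive's internal
// \001 delimiter before Hive parses the record.
public class DelimiterRewrite {
    public static String rewrite(String line) {
        // "|" is a regex metacharacter, so it must be escaped for replaceAll.
        return line.toLowerCase().replaceAll("\\|", "\001");
    }

    public static void main(String[] args) {
        String out = rewrite("1|JOHN|abu1");
        // The fields are now separated by the unprintable \001 byte.
        System.out.println(out.split("\001").length);  // prints 3
    }
}
```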
3. Incidentally, here is how to customize the output of NULL in Hive. By default, NULL is escaped to \N when stored.
If you need to change it to something custom, such as an empty string, you again need a custom serialization:
hive> CREATE TABLE sunwg02 (id int, name STRING)
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    > WITH SERDEPROPERTIES ('field.delim'='\t', 'escape.delim'='\\', 'serialization.null.format'='')
    > STORED AS TEXTFILE;
OK
Time taken: 0.046 seconds

hive> insert overwrite table sunwg02 select * from sunwg00;
Loading data to table sunwg02
2 Rows loaded to sunwg02
OK
Time taken: 18.756 seconds

View sunwg02's file in HDFS:

[hjl@sunwg src]$ hadoop fs -cat /hjl/sunwg02/attempt_201105020924_0013_m_000000_0
	mary
101	tom

The NULL id in the first row is stored as an empty string rather than being converted to '\N'.
PS:
This feature is actually quite simple, yet for some reason Hive does not support it directly; perhaps a future version will.
4. The conflict between Hive Map and Array nested delimiters

Sample file:

1|JOHN|abu1/abu21|key1:1\0042\0043/key12:6\0047\0048
2|Rain|abu2/abu22|key2:2\0042\0043/key22:6\0047\0048
3|Lisa|abu3/abu23|key3:3\0042\0043/key32:6\0047\0048
In the file above, the third field is an array, and each map value is itself an array. To avoid a delimiter conflict between the outer array and the array nested inside the map,
two different delimiters are used: one is /, the other is \004. Why \004?
Because Hive supports eight levels of delimiters by default, \001 through \008, of which users can only override \001 through \003; delimiters at the remaining levels Hive recognizes and parses on its own.
So in this case, the table statement is as follows:
create EXTERNAL table IF NOT EXISTS testSeparator (
  id string,
  name string,
  itemList array<string>,
  kvMap map<string, array<string>>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  COLLECTION ITEMS TERMINATED BY '/'
  MAP KEYS TERMINATED BY ':'
  LINES TERMINATED BY '\n'
LOCATION '/tmp/dsap/rawdata/ooxx/3';
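How such a row decomposes under these delimiters can be traced manually in plain Java (a standalone sketch, not Hive code; the class and method names are mine):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Traces how one row of the sample file splits under the declared
// delimiters: '|' for fields, '/' for collection items, ':' for map
// keys, and Hive's level-4 delimiter \004 for the array nested inside
// the map values.
public class NestedDelimiterDemo {
    public static Map<String, String[]> parseMapField(String field) {
        Map<String, String[]> map = new LinkedHashMap<>();
        for (String entry : field.split("/")) {
            String[] kv = entry.split(":", 2);
            map.put(kv[0], kv[1].split("\004"));
        }
        return map;
    }

    public static void main(String[] args) {
        String row = "1|JOHN|abu1/abu21|key1:1\0042\0043/key12:6\0047\0048";
        String[] fields = row.split("\\|");
        String[] itemList = fields[2].split("/");
        Map<String, String[]> kvMap = parseMapField(fields[3]);
        System.out.println(Arrays.toString(itemList));         // prints [abu1, abu21]
        System.out.println(Arrays.toString(kvMap.get("key1"))); // prints [1, 2, 3]
    }
}
```

Because the nested array uses \004 while the outer collections use / and :, no level of the structure collides with another.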
The Hive query results were shown as a screenshot in the original post.
That is the full content of "how Pig and Hive solve the problem of custom input and output delimiters and the conflict of Map and Array nested delimiters". Thank you for reading! I hope it has been helpful; if you want to learn more, you are welcome to follow the industry information channel.