How Hadoop Users Can Customize Data Types
This article mainly explains how Hadoop users can customize data types, data input formats, and data output formats. The explanation is simple, clear, and easy to learn; follow it step by step to study the topic in depth.
One: Hadoop built-in data types.
Hadoop provides the following built-in data types. They all implement the WritableComparable interface, so data defined with these types can be serialized for network transfer and file storage, and can also be compared by size.
BooleanWritable   standard Boolean value
ByteWritable      single-byte value
DoubleWritable    double-precision floating point
FloatWritable     single-precision floating point
IntWritable       integer
LongWritable      long integer
Text              text stored in UTF-8 format
NullWritable      used when the key or value is empty

// A quick look at these types:
IntWritable iw = new IntWritable(1);
System.out.println(iw.get());     // 1
BooleanWritable bw = new BooleanWritable(true);
System.out.println(bw.get());     // true
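To see the serialization side in action, here is a minimal sketch (the class name WritableRoundTrip is illustrative, not from the original) that writes an IntWritable to an in-memory buffer and reads it back, mimicking what the framework does for network transfer and file storage:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialize: this is what Hadoop does for network transfer and file storage.
        IntWritable original = new IntWritable(42);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));

        // Deserialize the raw bytes back into a fresh instance.
        IntWritable copy = new IntWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(copy.get());                                // 42

        // WritableComparable also supports size comparison.
        System.out.println(original.compareTo(new IntWritable(100)) < 0); // true
    }
}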
Two: Hadoop user-defined data types.
When customizing a data type, you need to meet two basic requirements:
1. Implement the Writable interface so that the data can be serialized for network transfer or file input/output.
2. If the data needs to be used as a key, or if its values need to be compared, implement the WritableComparable interface instead.
// Writable and WritableComparable source (Hadoop 2.6.4):
public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

public interface WritableComparable<T> extends Writable, Comparable<T> {
}
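As a minimal sketch of a type that satisfies both requirements, the illustrative Point class below (not from the original; a fuller UserInfo example follows in section Five) serializes two int fields and orders keys by x, then y:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class Point implements WritableComparable<Point> {
    private int x;
    private int y;

    public Point() {                       // no-arg constructor: required so the
    }                                      // framework can instantiate it by reflection

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);                   // serialize in a fixed field order...
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();                  // ...and deserialize in the same order
        y = in.readInt();
    }

    @Override
    public int compareTo(Point other) {    // order by x, then by y
        int cmp = Integer.compare(x, other.x);
        return cmp != 0 ? cmp : Integer.compare(y, other.y);
    }
}

A key type used in MapReduce should also override hashCode and equals consistently with compareTo, so that partitioning and grouping behave correctly.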
Three: Hadoop built-in data input formats and RecordReader.
The data input format (InputFormat) describes the input specification of a MapReduce job. The MapReduce framework relies on the data input format to check the input specification, split the input data files into InputSplits, and provide the functionality of reading data records one by one from an input split and converting them into the key-value pairs consumed by the Map phase.
Hadoop provides a wealth of built-in data input formats; the most commonly used are TextInputFormat and KeyValueTextInputFormat.
TextInputFormat is the system's default data input format. It lets text files be split into chunks and read line by line for processing by Map nodes. For each line read, the resulting key is the byte offset of that line within the whole text file, and the value is the content of the line.
// TextInputFormat (partial source):
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        String delimiter = context.getConfiguration().get("textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter)
            recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
        return new LineRecordReader(recordDelimiterBytes);
    }
    // ...
}
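As a hedged usage sketch (class name and paths are illustrative), the driver below wires TextInputFormat into a job explicitly and shows the textinputformat.record.delimiter property that the source above reads:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TextInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Records are split on '\n' unless a custom record delimiter is configured.
        conf.set("textinputformat.record.delimiter", "\n");

        Job job = Job.getInstance(conf, "text-input-demo");
        job.setJarByClass(TextInputDriver.class);
        job.setInputFormatClass(TextInputFormat.class); // key = byte offset, value = line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}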
KeyValueTextInputFormat is another commonly used data input format. It reads, line by line, a text file stored in a "key separator value" format and automatically parses each line into the corresponding key and value.
// KeyValueTextInputFormat (partial source):
public class KeyValueTextInputFormat extends FileInputFormat<Text, Text> {
    // ...
    public RecordReader<Text, Text> createRecordReader(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        context.setStatus(genericSplit.toString());
        return new KeyValueLineRecordReader(context.getConfiguration());
    }
}
RecordReader: every data input format needs a corresponding RecordReader, which is mainly used to split the data records in a file into concrete key-value pairs. The default RecordReader of TextInputFormat is LineRecordReader, and the default RecordReader of KeyValueTextInputFormat is KeyValueLineRecordReader.
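As a hedged sketch of how the separator is configured (class name and paths are illustrative), the driver below makes KeyValueLineRecordReader split lines such as "user1,Tom" on a comma instead of the default tab:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyValueInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split each line on ',' instead of the default '\t':
        // an input line "user1,Tom" becomes key = "user1", value = "Tom".
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "kv-input-demo");
        job.setJarByClass(KeyValueInputDriver.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}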
Four: Hadoop built-in data output formats and RecordWriter.
The data output format (OutputFormat) describes the output specification of a MapReduce job. The MapReduce framework relies on the data output format to check the output specification and to write out the job's result data.
Similarly, the most commonly used data output format is TextOutputFormat, which is also the system's default. It writes the computed results line by line to a text file in the form "key + \t + value".
Like the data input formats, each data output format provides a corresponding RecordWriter so that the system can control the concrete format in which output is written to a file. The default RecordWriter of TextOutputFormat is LineRecordWriter, which actually writes the result data to a text file in the form "key + \t + value".
// TextOutputFormat (partial source):
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {

    protected static class LineRecordWriter<K, V> extends RecordWriter<K, V> {
        // ...
        public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
            // ...
        }

        public LineRecordWriter(DataOutputStream out) {
            this(out, "\t");
        }

        private void writeObject(Object o) throws IOException {
            // ...
        }

        public synchronized void write(K key, V value) throws IOException {
            // ...
            out.write(newline);
        }
    }

    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        // ...
    }
}
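As a hedged usage sketch (class name and paths are illustrative), the driver below selects TextOutputFormat explicitly and swaps the default "\t" separator for ":" via the mapreduce.output.textoutputformat.separator property:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TextOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Write "key:value" lines instead of the default "key\tvalue".
        conf.set("mapreduce.output.textoutputformat.separator", ":");

        Job job = Job.getInstance(conf, "text-output-demo");
        job.setJarByClass(TextOutputDriver.class);
        job.setOutputFormatClass(TextOutputFormat.class); // the default, shown explicitly
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}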
Five: a small example that prints UserInfo records, implementing a simple user-defined data type, data input format, and data output format. (Simply put, it imitates the built-in source code and changes very little.)
The source code of the example is attached below:
1. Define your own data type: UserInfo.
package com.hadoop.mapreduce.test4.outputformat;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class UserInfo implements WritableComparable<UserInfo> {

    private int id;
    private String name;
    private int age;
    private String sex;
    private String address;

    public UserInfo() {
    }

    public UserInfo(int id, String name, int age, String sex, String address) {
        this.id = id;
        this.name = name;
        this.age = age;
        this.sex = sex;
        this.address = address;
    }

    // Ordinary JavaBean getters and setters omitted here.

    @Override
    public void readFields(DataInput in) throws IOException {
        this.id = in.readInt();
        this.name = in.readUTF();
        this.age = in.readInt();
        this.sex = in.readUTF();
        this.address = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
        out.writeInt(age);
        out.writeUTF(sex);
        out.writeUTF(address);
    }

    @Override
    public String toString() {
        return "Id:" + id + ", Name:" + name + ", Age:" + age + ", Sex:" + sex + ", Address:" + address;
    }

    @Override
    public int compareTo(UserInfo userInfo) {
        return 0;
    }
}
2. Customize your own data input format: UserInfoTextInputFormat.
package com.hadoop.mapreduce.test4.outputformat;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class UserInfoTextInputFormat extends FileInputFormat<Text, UserInfo> {

    @Override
    public RecordReader<Text, UserInfo> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        context.setStatus(split.toString());
        UserInfoRecordReader userInfoRecordReader = new UserInfoRecordReader(context.getConfiguration());
        return userInfoRecordReader;
    }
}
3. Customize your own RecordReader: UserInfoRecordReader.
package com.hadoop.mapreduce.test4.outputformat;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class UserInfoRecordReader extends RecordReader<Text, UserInfo> {

    public static final String KEY_VALUE_SEPERATOR =
            "mapreduce.input.keyvaluelinerecordreader.key.value.separator";

    private final LineRecordReader lineRecordReader;
    private byte separator = (byte) '\t';
    private Text innerValue;
    private Text key;
    private UserInfo value;

    public Class getKeyClass() {
        return Text.class;
    }

    public UserInfoRecordReader(Configuration conf) throws IOException {
        lineRecordReader = new LineRecordReader();
        String sepStr = conf.get(KEY_VALUE_SEPERATOR, "\t");
        this.separator = (byte) sepStr.charAt(0);
    }

    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        lineRecordReader.initialize(genericSplit, context);
    }

    public static int findSeparator(byte[] utf, int start, int length, byte sep) {
        for (int i = start; i < (start + length); i++) {
            if (utf[i] == sep) {
                return i;
            }
        }
        return -1; // position of the separator, or -1 if it is absent
    }

    public static void setKeyValue(Text key, UserInfo value, byte[] line, int lineLen, int pos) {
        if (pos == -1) {
            key.set(line, 0, lineLen);
            value.setId(0);
            value.setName("");
            value.setAge(0);
            value.setSex("");
            value.setAddress("");
        } else {
            key.set(line, 0, pos); // the key runs from position 0 up to the separator
            Text text = new Text();
            text.set(line, pos + 1, lineLen - pos - 1);
            System.out.println("value of text: " + text);
            String[] str = text.toString().split(",");
            // The source post is truncated from this point on; the rest of the class is
            // an assumed reconstruction modeled on KeyValueLineRecordReader.
            value.setId(Integer.parseInt(str[0]));
            value.setName(str[1]);
            value.setAge(Integer.parseInt(str[2]));
            value.setSex(str[3]);
            value.setAddress(str[4]);
        }
    }

    public synchronized boolean nextKeyValue() throws IOException {
        byte[] line;
        int lineLen;
        if (lineRecordReader.nextKeyValue()) {
            innerValue = lineRecordReader.getCurrentValue();
            line = innerValue.getBytes();
            lineLen = innerValue.getLength();
        } else {
            return false;
        }
        if (key == null) {
            key = new Text();
        }
        if (value == null) {
            value = new UserInfo();
        }
        int pos = findSeparator(line, 0, lineLen, this.separator);
        setKeyValue(key, value, line, lineLen, pos);
        return true;
    }

    public Text getCurrentKey() {
        return key;
    }

    public UserInfo getCurrentValue() {
        return value;
    }

    public float getProgress() throws IOException {
        return lineRecordReader.getProgress();
    }

    public synchronized void close() throws IOException {
        lineRecordReader.close();
    }
}
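To run the custom format, a job driver is needed; the sketch below is assumed, not from the original (class name and paths are illustrative), and only shows the wiring for the Text / UserInfo pair produced by the record reader:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserInfoDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "userinfo-demo");
        job.setJarByClass(UserInfoDriver.class);

        // Use the custom input format: keys are Text, values are UserInfo.
        job.setInputFormatClass(UserInfoTextInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(UserInfo.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}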