
What is Hadoop WritableSerialization?

2025-01-16 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31

This article explains what Hadoop's WritableSerialization is, how it differs from JavaSerialization, and how Avro fits in as a language-neutral alternative.

Serialization Framework

Hadoop has a pluggable serialization framework API. Each serialization framework is represented by an implementation of the Serialization interface.
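To make the idea of a pluggable framework concrete, here is a minimal sketch in plain Java. MiniSerialization and PluggableDemo are hypothetical stand-ins invented for this illustration, not Hadoop classes; they only mirror the idea that each framework declares which classes it accepts and that a lookup picks the first registered framework that matches.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a serialization framework; not the Hadoop interface.
interface MiniSerialization {
    boolean accept(Class<?> c);      // can this framework handle the class?
    byte[] serialize(Object value);  // encode one value to bytes
    String name();                   // label used only by this demo
}

class PluggableDemo {
    // Framework for Integer values: fixed 4-byte big-endian encoding.
    static final MiniSerialization INT_FRAMEWORK = new MiniSerialization() {
        public boolean accept(Class<?> c) { return c == Integer.class; }
        public byte[] serialize(Object value) {
            try {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                new DataOutputStream(bytes).writeInt((Integer) value);
                return bytes.toByteArray();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        public String name() { return "int"; }
    };

    // Fallback framework for anything Serializable: standard Java Object Serialization.
    static final MiniSerialization JAVA_FRAMEWORK = new MiniSerialization() {
        public boolean accept(Class<?> c) { return Serializable.class.isAssignableFrom(c); }
        public byte[] serialize(Object value) {
            try {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                ObjectOutputStream out = new ObjectOutputStream(bytes);
                out.writeObject(value);
                out.flush();
                return bytes.toByteArray();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        public String name() { return "java"; }
    };

    // Registration order matters: the first framework that accepts the class wins.
    static final List<MiniSerialization> REGISTERED =
            Arrays.asList(INT_FRAMEWORK, JAVA_FRAMEWORK);

    static MiniSerialization lookup(Class<?> c) {
        for (MiniSerialization s : REGISTERED) {
            if (s.accept(c)) {
                return s;
            }
        }
        throw new IllegalArgumentException("No serialization registered for " + c.getName());
    }

    public static void main(String[] args) {
        System.out.println(lookup(Integer.class).name());
        System.out.println(lookup(String.class).name());
    }
}
```

Because Integer is also Serializable, the order of registration decides which framework handles it; swapping the two entries would route integers through Java Object Serialization instead.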

WritableSerialization

WritableSerialization is the Serialization implementation for Writable types.
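For context, a Writable type serializes itself through two methods, write(DataOutput) and readFields(DataInput). The following sketch shows that pattern using only the JDK. MiniWritable, IntPair and WritableStyleDemo are hypothetical names made up for this illustration; MiniWritable merely copies the shape of those two methods so the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical local interface with the same two methods as a Writable.
interface MiniWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// A pair of ints that knows how to write and read its own fields.
class IntPair implements MiniWritable {
    int left;
    int right;

    IntPair() {}

    IntPair(int left, int right) {
        this.left = left;
        this.right = right;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(left);
        out.writeInt(right);
    }

    public void readFields(DataInput in) throws IOException {
        left = in.readInt();
        right = in.readInt();
    }
}

class WritableStyleDemo {
    // Serializes by delegating to the object's own write method.
    static byte[] serialize(MiniWritable w) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            w.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes into an existing instance, so the object can be reused.
    static void deserialize(MiniWritable w, byte[] data) {
        try {
            w.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new IntPair(3, 42));
        IntPair copy = new IntPair();
        deserialize(copy, data);
        System.out.println(data.length + " bytes: " + copy.left + ", " + copy.right);
    }
}
```

Only the field values are written (two 4-byte ints), and deserialization fills an instance the caller already owns rather than allocating a new one.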

Package: org.apache.hadoop.io.serializer

An abridged outline of the class (method bodies omitted):

    public class WritableSerialization extends Configured implements Serialization<Writable> {

        static class WritableSerializer extends Configured implements Serializer<Writable> {
            @Override
            public void serialize(Writable w) throws IOException {
                // body omitted
            }
        }

        static class WritableDeserializer extends Configured implements Deserializer<Writable> {
            @Override
            public Writable deserialize(Writable w) throws IOException {
                // body omitted
            }
        }

        @Override
        public Serializer<Writable> getSerializer(Class<Writable> c) {
            return new WritableSerializer();
        }

        @InterfaceAudience.Private
        @Override
        public Deserializer<Writable> getDeserializer(Class<Writable> c) {
            return new WritableDeserializer(getConf(), c);
        }
    }

JavaSerialization

JavaSerialization is the Serialization implementation for Serializable types. It uses standard Java Object Serialization. Although this makes it convenient to use standard Java types, Java Object Serialization is less efficient than Writable.

Package: org.apache.hadoop.io.serializer

public class JavaSerialization extends Object implements Serialization<Serializable> {}

Why not use Java Object Serialization?

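A serialization format for Hadoop should be compact, fast, extensible and interoperable. Java Object Serialization falls short on these criteria:

1. Not compact. Java Object Serialization writes the class name of each object to the stream, and subsequent instances of the same class carry only a reference handle to the first occurrence. Handles do not work well with random access, sorting, or splitting, because a record can no longer be interpreted without the earlier part of the stream.

2. Not fast. A new instance has to be created for every object that is deserialized, which wastes time and space. Writable objects can be reused instead.

3. Extensible. Java Object Serialization has some support for evolving a type. Writable currently has none.

4. Interoperable. In theory other languages could read the Java serialization stream, but in practice it is only implemented in Java. The same is true of Writable.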
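The compactness point can be observed with the JDK alone. The sketch below is illustrative only: the Point class and the method names are made up for this example. It measures how many bytes ObjectOutputStream adds for the first object of a class and for a later object of the same class, and compares that with writing the two int fields directly, as a Writable-style type would.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class JavaSerializationSizeDemo {
    // Hypothetical value class used only for measuring stream sizes.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Total stream size after writing n distinct points with Java Object Serialization.
    static int javaSize(int n) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < n; i++) {
                out.writeObject(new Point(i, i));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Total size when only the field values are written (two 4-byte ints per point).
    static int compactSize(int n) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < n; i++) {
                out.writeInt(i);
                out.writeInt(i);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int first = javaSize(1) - javaSize(0);  // includes the class description
        int later = javaSize(2) - javaSize(1);  // refers back to it by handle
        System.out.println("first object: " + first + " bytes");
        System.out.println("later object: " + later + " bytes");
        System.out.println("fields only:  " + compactSize(1) + " bytes");
    }
}
```

The first object is the expensive one because it carries the class description. Later objects are cheaper but depend on that earlier part of the stream, which is exactly what makes the format awkward to split or sort.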

Avro

Avro is a language-neutral data serialization system. Data is described by a language-independent schema, from which native code can optionally be generated for each supported language. Avro schemas are usually written in JSON, and data is usually encoded in a binary format.

Avro has rich schema resolution capabilities: the schema used to read data does not have to be identical to the schema used to write it. This is how Avro supports schema evolution.
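The resolution rule for record fields can be pictured with a small simulation. This is not the Avro library: SchemaResolutionSketch and ReaderField are hypothetical names, and records are modelled as plain maps. The sketch assumes the basic rules that fields are matched by name, that fields known only to the writer are ignored, and that a field known only to the reader must have a default value.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SchemaResolutionSketch {
    // A field in the reader's schema: a name plus an optional default value.
    static final class ReaderField {
        final String name;
        final boolean hasDefault;
        final Object defaultValue;

        private ReaderField(String name, boolean hasDefault, Object defaultValue) {
            this.name = name;
            this.hasDefault = hasDefault;
            this.defaultValue = defaultValue;
        }

        static ReaderField required(String name) {
            return new ReaderField(name, false, null);
        }

        static ReaderField withDefault(String name, Object defaultValue) {
            return new ReaderField(name, true, defaultValue);
        }
    }

    // Projects a record written with the writer's fields onto the reader's fields.
    static Map<String, Object> resolve(Map<String, Object> written, List<ReaderField> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (ReaderField field : readerFields) {
            if (written.containsKey(field.name)) {
                result.put(field.name, written.get(field.name));  // matched by name
            } else if (field.hasDefault) {
                result.put(field.name, field.defaultValue);       // filled from the default
            } else {
                throw new IllegalStateException("No value and no default for field " + field.name);
            }
        }
        return result;  // writer-only fields were never copied, so they are dropped
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("name", "Ann");
        written.put("age", 30);

        List<ReaderField> reader = Arrays.asList(
                ReaderField.required("name"),
                ReaderField.withDefault("isman", false));

        System.out.println(resolve(written, reader));
    }
}
```

Here the writer knew about name and age, while the reader expects name and isman: age is dropped and isman is filled from its default.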

Avro is said to perform better than other serialization systems such as Thrift and Google Protocol Buffers, although the difference depends on the workload.

Datatype and Schema of Avro

Primitive Datatype

null, boolean, int, long, float, double, bytes, string

Complex Datatype

Array: an ordered collection of objects of the same type.

{"name": "myarray", "type": "array", "items": "long"}

Map: an unordered collection of key-value pairs. Keys must be strings, so the schema defines only the value type.

{"name": "mymap", "type": "map", "values": "string"}

Record: similar to a struct, a collection of named fields. It is used very frequently in data formats.

{
  "type": "record",
  "name": "WeatherRecord",
  "doc": "A weather reading.",
  "fields": [
    {"name": "myint", "type": "int"},
    {"name": "mynull", "type": "null"}
  ]
}

Enum: a set of named values.

{
  "type": "enum",
  "name": "title",
  "symbols": ["engineer", "Manager", "vp"]
}

Fixed: a fixed number of 8-bit unsigned bytes. The size attribute is required; an MD5 hash is 16 bytes.

{"type": "fixed", "name": "md5", "size": 16}

Union: a union of schemas, represented by a JSON array. The data must match one of the schemas in the union.

["int", "long", {"type": "array", "items": "long"}]

This indicates that the data must be an int, a long, or an array of longs.

The evolution of Avro.

Question:

How is Avro data sorted?

How is an Avro file made splittable?

Object Container File

The data file structure of Avro is as follows:

File header

Four bytes: ASCII 'O', 'b', 'j', followed by 1.

File metadata, including the schema.

The 16-byte, randomly-generated sync marker for this file.

All metadata properties that start with "avro." are reserved.

avro.schema contains the schema of objects stored in the file, as JSON data (required).

avro.codec contains the name of the compression codec used to compress blocks, as a string. Implementations are required to support the following codecs: "null" and "deflate". If codec is absent, it is assumed to be "null". The codecs are described in more detail below.

Required Codecs

Null

The "null" codec simply passes through data uncompressed.

Deflate

The "deflate" codec writes the data block using the deflate algorithm as specified in RFC 1951, and typically implemented using the zlib library. Note that this format (unlike the "zlib format" in RFC 1950) does not have a checksum.

Optional Codecs

Snappy

The "snappy" codec uses Google's Snappy compression library. Each compressed block is followed by the 4-byte, big-endian CRC32 checksum of the uncompressed data in the block.

One or more file data blocks. Each data block consists of:

A long indicating the count of objects in this block.

A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied.

The serialized objects. If a codec is specified, this is compressed by that codec.

The file's 16-byte sync marker.

Thus, each block's binary data can be efficiently extracted or skipped without deserializing the contents. The combination of block size, object counts, and sync markers enable detection of corrupt blocks and help ensure data integrity.
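The block layout described above can be sketched with the JDK alone. This is a simplified illustration, not the Avro library: ContainerBlockSketch and its methods are hypothetical, header metadata parsing is left out, and the block walker operates only on the data-block region that follows the header. It relies on Avro's binary encoding of long values as variable-length zig-zag integers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ContainerBlockSketch {
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // True if the bytes begin with the four-byte container file magic.
    static boolean hasAvroMagic(byte[] fileStart) {
        return fileStart.length >= 4 && Arrays.equals(Arrays.copyOf(fileStart, 4), MAGIC);
    }

    // Writes a long as a variable-length zig-zag integer.
    static void writeLong(OutputStream out, long n) throws IOException {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Reads a variable-length zig-zag long.
    static long readLong(InputStream in) throws IOException {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) {
                throw new EOFException("truncated long");
            }
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    // Builds one data block: object count, payload size, payload, sync marker.
    static byte[] block(long count, byte[] payload, byte[] sync) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeLong(out, count);
            writeLong(out, payload.length);
            out.write(payload);
            out.write(sync);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sums the object counts of all blocks, skipping payloads without decoding them.
    static long countObjects(byte[] blocks, byte[] sync) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(blocks));
            byte[] marker = new byte[16];
            long total = 0;
            while (in.available() > 0) {
                long count = readLong(in);
                long size = readLong(in);
                if (in.skipBytes((int) size) != size) {
                    throw new EOFException("truncated block");
                }
                in.readFully(marker);
                if (!Arrays.equals(marker, sync)) {
                    throw new IOException("sync marker mismatch: block is corrupt");
                }
                total += count;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sync = new byte[16];
        Arrays.fill(sync, (byte) 7);
        byte[] one = block(3, new byte[10], sync);
        byte[] two = block(200, new byte[300], sync);
        byte[] both = Arrays.copyOf(one, one.length + two.length);
        System.arraycopy(two, 0, both, one.length, two.length);
        System.out.println(countObjects(both, sync));
    }
}
```

The walker never interprets the payload bytes: the size field says how far to jump, and the sync marker after each block confirms that the jump landed on a block boundary.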

Avro Read/Write

Data can be read and written with either GenericRecord or SpecificRecord. SpecificRecord requires object classes generated from the schema with avro-tools:

% hadoop jar /usr/lib/avro/avro-tools.jar compile schema /pair.avsc /home/cloudera/workspace/

The schema is shown below. Its namespace becomes the package of the generated class.

{"namespace": "com.jinbao.hadoop.hdfs.avro.compile", "type": "record", "name": "MyAvro", "fields": [{"name": "name", "type": "string"}, {"name": "age", "type": "int"}, {"name": "isman", "type": "boolean"}]}

The code is as follows:

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.Schema.Parser;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumReader;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.Encoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroTest {
        private static String avscfile = "/home/cloudera/pair.avsc";
        private static String avrofile = "/home/cloudera/pair.avro";

        /**
         * @param args
         * @throws IOException
         */
        public static void main(String[] args) throws IOException {
            // schemaReadWrite();
            // WriteData();
            ReadData();
        }

        private static void schemaReadWrite() throws IOException {
            // Read the schema from the schema file
            Parser ps = new Schema.Parser();
            Schema schema = ps.parse(new File(avscfile));
            if (schema != null) {
                System.out.println(schema.getName());
                System.out.println(schema.getType());
                System.out.println(schema.getDoc());
                System.out.println(schema.getFields());
            }

            // Construct a record
            GenericRecord datum = new GenericData.Record(schema);
            datum.put("left", new String("mother"));
            datum.put("right", new String("father"));

            // Write the record to an output stream
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
            Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            writer.write(datum, encoder);
            encoder.flush();
            out.close();

            // Read the record back from the serialized bytes
            DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
            Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            GenericRecord record = reader.read(null, decoder);
            System.out.print(record.get("left"));
            System.out.print(record.get("right"));
        }

        public static void WriteData() throws IOException {
            Parser ps = new Schema.Parser();
            Schema schema = ps.parse(new File(avscfile));
            File file = new File(avrofile);
            DatumWriter writer = new GenericDatumWriter(schema);
            DataFileWriter fileWriter = new DataFileWriter(writer);
            fileWriter.create(schema, file);
            MyAvro datum = new MyAvro();
            for (int i = 0; ...
            // ... (the remainder of the listing is missing)

