Avro in Hadoop


Avro is a data serialization framework that supports serialization across multiple languages. It was created mainly to address the limitation that Hadoop's Writable types only support Java.

1 Introduction to Avro

Many people ask: with similar frameworks such as Thrift and Protocol Buffers already available, why build a new one, and how is Avro different? Like those frameworks, Avro describes data with a language-independent schema. The differences are that in Avro code generation is optional, and the schema is stored together with the data. Because the schema travels with the data, processing it requires neither generated code nor static data types; the only assumption is that the schema is available when the data is read. The result is a more compact encoding and no need for user-assigned field identifiers.

Avro's schema is written in JSON, while the encoded data is in a binary format (other encodings are also available), which makes Avro easy to implement for languages that already have JSON libraries.

Avro also supports schema evolution. The schema used for writing and the schema used for reading do not have to be identical, which keeps old and new schemas, and old and new clients, compatible. For example, if the new schema adds a field, both new and old clients can still read old data; the new client writes data according to the new schema, and the new field is simply ignored when the old client reads the new data.

Avro also supports datafiles: the schema is written in the metadata at the beginning of the file, and an Avro datafile supports compression and splitting, which means it can be used as input to MapReduce.

2 Avro Schemas

2.1 Schema definition

The schema is written in JSON and takes one of the following three forms:

1. A JSON string, mainly used for the primitive types

2. A JSON array, mainly used for unions

3. A JSON object, of the form:

{"type": "typeName"... attributes...}

This form covers all types other than the primitives and unions. The attributes may also include attributes that avro does not define; such extra attributes do not affect the serialization of the data.
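As a quick sketch of that last point (the myDoc attribute name is made up for illustration; this is a plain Java fragment using org.apache.avro.Schema, not code from the original article):

Schema s = new Schema.Parser().parse(
        "{\"type\": \"fixed\", \"name\": \"md5\", \"size\": 16, \"myDoc\": \"not an avro attribute\"}");
// the unknown attribute is kept as a property on the schema, but the encoding of data is unchanged
System.out.println(s.getProp("myDoc"));   // prints: not an avro attribute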

2.2 primitive types

There are 8 primitive types: null, boolean, int, long, float, double, bytes and string.

1. Primitive types do not require attributes.

2. They can also be specified via type: "string" and {"type": "string"} are equivalent (see the sketch after this list).

3. The concrete implementation differs by language: the double type, for example, is double in C/C++ and Java, float in Python, and Float in Ruby.
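A small sketch of point 2 (a plain Java fragment, not from the original article), showing that both spellings parse to the same primitive schema:

Schema a = new Schema.Parser().parse("\"string\"");
Schema b = new Schema.Parser().parse("{\"type\": \"string\"}");
System.out.println(a.getType());   // STRING
System.out.println(a.equals(b));   // true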

2.3 compound types

1. records

A record is usually the top-level unit of serialized data, and records can be nested, including recursively:

{"type": "record", "name": "LongList", "aliases": ["LinkedLongs"], "fields": [{"name": "value", "type": "long"}, {"name": "next", "type": ["LongList", "null"]}]}

2. enums (enumerations):

{"type": "enum", "name": "Suit", "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]}

3. arrays:

{"type": "array", "items": "string"}

4. maps

A map; keys must be strings, so only the type of the values is specified here:

{"type": "map", "values": "long"}

5. unions

A union; it may not contain two or more schemas of the same type, unless they are named types (which are told apart by their name attribute):

["string", "null"]

6. fixed

size specifies how many bytes each value occupies:

{"type": "fixed", "size": 16, "name": "md5"}

2.4 three kinds of mapping

Generic mapping

A language may have more than one mapping, but every language must support the dynamic (generic) mapping, which does not require the schema to be known before processing.

Specific mapping

For Java and C++, source code can be generated from the schema. Compared with the generic mapping, the generated classes provide a richer, domain-oriented API.

Reflect mapping

Reflection is used to map avro types onto existing Java types. This mapping is slower than the other two, so it is generally avoided.
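Although the article does not demonstrate it, a minimal sketch of the reflect mapping looks roughly like this (the Pair class and TestReflectMapping are made up for illustration and are not part of the original code):

package com.sweetop.styhadoop;

import org.apache.avro.Schema;
import org.apache.avro.io.*;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumReader;
import org.apache.avro.reflect.ReflectDatumWriter;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class TestReflectMapping {
    // an ordinary java class: no generated code and no hand-written .avsc file
    public static class Pair {
        public String left;
        public String right;
    }

    public static void main(String[] args) throws IOException {
        // the schema is derived from the class by reflection
        Schema schema = ReflectData.get().getSchema(Pair.class);
        System.out.println(schema.toString());

        Pair datum = new Pair();
        datum.left = "L";
        datum.right = "R";

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<Pair> writer = new ReflectDatumWriter<Pair>(Pair.class);
        Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(datum, encoder);
        encoder.flush();
        out.close();

        DatumReader<Pair> reader = new ReflectDatumReader<Pair>(Pair.class);
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        Pair result = reader.read(null, decoder);
        System.out.println(result.left + "," + result.right);   // L,R
    }
}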

3 Avro serialization and deserialization

3.1 preparation work

Save the following schema as a file StringPair.avsc and put it in the src/test/resources directory:

{"type": "record", "name": "StringPair", "doc": "A pair ofstrings", "fields": [{"name": "left", "type": "string"}, {"name": "right", "type": "string"}]}

Take care with the version of avro you pull in. The latest avro package at the time of writing is 1.7.4; it depends on org.codehaus.jackson:jackson-core-asl:1.8.8, but that version is no longer available in the maven repository, so switch to another version:

<dependency>
    <groupId>org.codehaus.jackson</groupId>
    <artifactId>jackson-core-asl</artifactId>
    <version>1.9.9</version>
</dependency>

If you are using hadoop 1.0.4 (or another version), it depends on jackson-mapper-asl. If its version is inconsistent with jackson-core-asl, you will get exceptions such as a method not being found, so import the same version of both:

<dependency>
    <groupId>org.codehaus.jackson</groupId>
    <artifactId>jackson-mapper-asl</artifactId>
    <version>1.9.9</version>
</dependency>

3.2 generic mode

package com.sweetop.styhadoop;

import junit.framework.Assert;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.*;
import org.junit.Test;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-8-5
 */
public class TestGenericMapping {
    @Test
    public void test() throws IOException {
        // load the schema from the StringPair.avsc file
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(getClass().getResourceAsStream("/StringPair.avsc"));

        // create a record instance based on the schema
        GenericRecord datum = new GenericData.Record(schema);
        datum.put("left", "L");
        datum.put("right", "R");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // DatumWriter turns the GenericRecord into a form the encoder understands
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
        // the encoder writes the data to the stream; the second argument of binaryEncoder
        // is an encoder to reuse, and null means no encoder is reused here
        Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(datum, encoder);
        encoder.flush();
        out.close();

        DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord result = reader.read(null, decoder);
        Assert.assertEquals("L", result.get("left").toString());
        Assert.assertEquals("R", result.get("right").toString());
    }
}

Note that result.get returns a Utf8 value rather than a java.lang.String, so you need to call toString() before comparing it with a String.
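A quick illustration of this (a fragment continuing from the test above, not part of the original listing); the generic reader returns org.apache.avro.util.Utf8 for string fields by default:

Object left = result.get("left");
System.out.println(left.getClass().getName());    // org.apache.avro.util.Utf8
System.out.println(left.equals("L"));             // false: a Utf8 is not a java.lang.String
System.out.println(left.toString().equals("L"));  // true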

3.3 specific mode

First use avro-maven-plugin to generate the code. The pom configuration is:

<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.7.0</version>
    <executions>
        <execution>
            <id>schemas</id>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
            </goals>
            <configuration>
                <includes>
                    <include>StringPair.avsc</include>
                </includes>
                <sourceDirectory>src/test/resources</sourceDirectory>
                <outputDirectory>${project.build.directory}/generated-sources/java</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>

The avro-maven-plugin is bound to the generate-sources phase, so calling mvn generate-sources generates the source code. Let's take a look at the generated source:

package com.sweetop.styhadoop;

/**
 * Autogenerated by Avro
 *
 * DO NOT EDIT DIRECTLY
 */
@SuppressWarnings("all")
/** A pair of strings */
public class StringPair extends org.apache.avro.specific.SpecificRecordBase
        implements org.apache.avro.specific.SpecificRecord {

    public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"StringPair\",\"doc\":\"A pair of strings\","
            + "\"fields\":[{\"name\":\"left\",\"type\":\"string\"},{\"name\":\"right\",\"type\":\"string\"}]}");

    @Deprecated public java.lang.CharSequence left;
    @Deprecated public java.lang.CharSequence right;

    public org.apache.avro.Schema getSchema() { return SCHEMA$; }

    // Used by DatumWriter. Applications should not call.
    public java.lang.Object get(int field$) {
        switch (field$) {
            case 0: return left;
            case 1: return right;
            default: throw new org.apache.avro.AvroRuntimeException("Bad index");
        }
    }

    // Used by DatumReader. Applications should not call.
    @SuppressWarnings(value = "unchecked")
    public void put(int field$, java.lang.Object value$) {
        switch (field$) {
            case 0: left = (java.lang.CharSequence) value$; break;
            case 1: right = (java.lang.CharSequence) value$; break;
            default: throw new org.apache.avro.AvroRuntimeException("Bad index");
        }
    }

    /** Gets the value of the 'left' field. */
    public java.lang.CharSequence getLeft() { return left; }

    /** Sets the value of the 'left' field. @param value the value to set. */
    public void setLeft(java.lang.CharSequence value) { this.left = value; }

    /** Gets the value of the 'right' field. */
    public java.lang.CharSequence getRight() { return right; }

    /** Sets the value of the 'right' field. @param value the value to set. */
    public void setRight(java.lang.CharSequence value) { this.right = value; }
}

For compatibility with earlier versions, a set of get/put methods is generated; since 1.6.0 getter/setter methods are generated as well, together with a Builder class, which I had no use for and deleted.
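For reference, a minimal sketch of how the generated Builder is normally used, assuming it is kept rather than deleted (standard usage of Avro generated code, not part of the article's listing):

StringPair datum = StringPair.newBuilder()
        .setLeft("L")
        .setRight("R")
        .build();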

A namespaced name such as com.sweetop.styhadoop.StringPair can be used in the schema's name so that the generated source code gets the corresponding package.

Let's see how using this generated class differs from the generic approach:

package com.sweetop.styhadoop;

import junit.framework.Assert;
import org.apache.avro.io.*;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import org.junit.Test;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-8-6
 */
public class TestSprecificMapping {
    @Test
    public void test() throws IOException {
        // because the StringPair source code has been generated, no schema is needed here;
        // the setters and getters can be called directly
        StringPair datum = new StringPair();
        datum.setLeft("L");
        datum.setRight("R");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // instead of passing a schema, StringPair is used directly as the type parameter and argument
        DatumWriter<StringPair> writer = new SpecificDatumWriter<StringPair>(StringPair.class);
        Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(datum, encoder);
        encoder.flush();
        out.close();

        DatumReader<StringPair> reader = new SpecificDatumReader<StringPair>(StringPair.class);
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        StringPair result = reader.read(null, decoder);
        Assert.assertEquals("L", result.getLeft().toString());
        Assert.assertEquals("R", result.getRight().toString());
    }
}

To sum up, the differences are:

Schema -> StringPair.class, GenericRecord -> StringPair.

4 Avro datafile

4.1 datafile composition

A datafile consists of a file header followed by data blocks. If the layout is still not clear, the datafile header schema should make it so:

{"type": "record", "name": "org.apache.avro.file.Header", "fields": [{"name": "magic", "type": {"type": "fixed", "name": "Magic", "size": 4}}, {"name": "meta", "type": {"type": "map", "values": "bytes"}}, {"name": "sync" "type": {"type": "fixed", "name": "Sync", "size": 16},]}

Note the 16-byte synchronization marker: it means a datafile supports random reads and can be split, so it can be used as input to MapReduce.

DataFileReader can read a datafile randomly via the synchronization markers:

void seek(long position)
    Move to a specific, known synchronization point, one returned from DataFileWriter.sync() while writing.

void sync(long position)
    Move to the next synchronization point after a position.
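A small sketch of how these are typically used (it reuses the file, schema and datum variables set up in the examples of sections 3.2 and 4.2; an illustration, not code from the original article). DataFileWriter.sync() returns a position that DataFileReader.seek() can later jump back to:

DataFileWriter<GenericRecord> dataFileWriter =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
dataFileWriter.create(schema, file);
dataFileWriter.append(datum);
// remember the current position as a synchronization point
long syncPoint = dataFileWriter.sync();
dataFileWriter.append(datum);
dataFileWriter.close();

DataFileReader<GenericRecord> dataFileReader =
        new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
// jump straight to the remembered synchronization point and read from there
dataFileReader.seek(syncPoint);
while (dataFileReader.hasNext()) {
    System.out.println(dataFileReader.next());
}
dataFileReader.close();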

4.2 datafile write operation

The explanation is given in the code comments:

// first create a file; the .avro extension is optional, just to make the file easy to recognize
File file = new File("data.avro");
// as in the code above, create a datum writer for GenericRecord
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
// unlike an Encoder, DataFileWriter can write avro data to a file
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
// create the file and write the header information
dataFileWriter.create(schema, file);
// write the datum data
dataFileWriter.append(datum);
dataFileWriter.append(datum);
dataFileWriter.close();

4.3 datafile read operation

The explanation is given in the code comments:

// the datum reader for GenericRecord is the same as before, except that no schema needs to be
// passed here, because the schema is already included in the datafile's header information
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
// the datafile reader, given the file and the datum reader
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
// check that the schema read back matches the schema that was written
Assert.assertEquals(schema, dataFileReader.getSchema());
// iterate over the GenericRecords
for (GenericRecord record : dataFileReader) {
    System.out.println("left=" + record.get("left") + ",right=" + record.get("right"));
}

5 Avro schema compatibility

5.1 compatibility conditions

In practice, because application versions change, you often have to read and write with different schemas; fortunately, avro provides a solution for this.


5.2 Record compatibility

In practical hadoop applications, data is mostly exchanged in the form of records, so below we focus on the compatibility of records.

Looking at it in terms of the reader's and writer's schemas, there are only two kinds of difference: the reader's schema has one field more than the writer's schema, or one field less. Both cases are very simple to handle.

First, take a look at the writer's schema:

{"type": "record", "name": "com.sweetop.styhadoop.StringPair", "doc": "A pair ofstrings", "fields": [{"name": "left", "type": "string"}, {"name": "right", "type": "string"}]}

1. A field is added.

The schema after adding the field:

{"type": "record", "name": "com.sweetop.styhadoop.StringPair", "doc": "A pair ofstrings", "fields": [{"name": "left", "type": "string"}, {"name": "right", "type": "string"}, {"name": "description", "type": "string", "default": ""}]}

Read the data with the schema that contains the added field.

new GenericDatumReader<GenericRecord>(null, newSchema): the first parameter is the writer's schema, and the second is the reader's schema.

Since this is an avro datafile, whose schema is stored in the file header, the writer's schema can be passed as null.

@Test
public void testAddField() throws IOException {
    // load the new schema from the addStringPair.avsc file
    Schema.Parser parser = new Schema.Parser();
    Schema newSchema = parser.parse(getClass().getResourceAsStream("/addStringPair.avsc"));
    File file = new File("data.avro");
    DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(null, newSchema);
    DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
    for (GenericRecord record : dataFileReader) {
        System.out.println("left=" + record.get("left") + ",right=" + record.get("right")
                + ",description=" + record.get("description"));
    }
}

The output is as follows:

left=L,right=R,description=
left=L,right=R,description=

description is filled in with its default value, the empty string.

2. A field is removed.

The schema with a field removed:

{"type": "record", "name": "com.sweetop.styhadoop.StringPair", "doc": "A pair ofstrings", "fields": [{"name": "left", "type": "string"}]}

Read with the schema that has the field removed:

@Test
public void testRemoveField() throws IOException {
    // load the reduced schema from the removeStringPair.avsc file
    Schema.Parser parser = new Schema.Parser();
    Schema newSchema = parser.parse(getClass().getResourceAsStream("/removeStringPair.avsc"));
    File file = new File("data.avro");
    DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(null, newSchema);
    DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
    for (GenericRecord record : dataFileReader) {
        System.out.println("left=" + record.get("left"));
    }
}

The output is as follows:

left=L
left=L

The removed field is simply ignored.

3. Old and new schema versions

The same two cases can also be viewed from the perspective of old and new versions.

When the new version of the schema adds a field compared with the old version:

1. The new version reads old data: the new field takes its default value from the new schema.

2. The old version reads new data: the new field in the new schema is ignored by the old version.

When the new version of the schema has one field fewer than the old version:

1. The new version reads old data: the removed field is ignored by the new version.

2. The old version reads new data: the old schema uses the default value of the removed field; if there is no default, an error is reported and the old version has to be upgraded.

5.3 Alias

Aliases are another mechanism for schema compatibility: they map field names in the writer's schema onto fields of the reader's schema. Keep in mind that an aliases attribute does not add a field to the data.

Instead, the writer's field name is listed in the reader field's aliases attribute, and only the reader's name attribute is recognized when reading.

Let's take a look at a schema with aliases:

{"type": "record", "name": "com.sweetop.styhadoop.StringPair", "doc": "A pair ofstrings", "fields": [{"name": "first", "type": "string", "aliases": ["left"]}, {"name": "second", "type": "string", "aliases": ["right"]}]}

Read the data with the aliased schema; instead of left and right, you now use first and second:

@Test
public void testAliasesField() throws IOException {
    // load the schema from the aliasesStringPair.avsc file
    Schema.Parser parser = new Schema.Parser();
    Schema newSchema = parser.parse(getClass().getResourceAsStream("/aliasesStringPair.avsc"));
    File file = new File("data.avro");
    DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(null, newSchema);
    DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(file, reader);
    for (GenericRecord record : dataFileReader) {
        System.out.println("first=" + record.get("first") + ",second=" + record.get("second"));
    }
}

The output is as follows:

first=L,second=R
first=L,second=R
