Example Analysis of the Avro Data Storage Format in Hive

The editor would like to share with you this example analysis of the Avro data storage format in Hive. I hope you will get something out of this article; let's work through it together.

Avro (pronounced roughly [ævrə]) is a sub-project of Hadoop led by Doug Cutting, the founder of Hadoop (and also of projects such as Lucene and Nutch); at the time of the original post the latest version was 1.3.3. Avro is a data serialization system designed to support applications that exchange large volumes of data. Its main features are: support for binary serialization, so that large amounts of data can be handled conveniently and quickly; and friendliness to dynamic languages, since the mechanisms Avro provides let dynamic languages process Avro data easily.

There are many similar serialization systems, such as Google's Protocol Buffers and Facebook's Thrift. These systems perform well and can fully meet the needs of ordinary applications. In response to the question of why yet another system was being developed, Doug Cutting explained that Hadoop's existing RPC system had run into problems: performance bottlenecks (it used an IPC system built on Java's own DataOutputStream and DataInputStream), the requirement that server and client run the same version of Hadoop, and the fact that it could only be used from Java. The existing serialization systems also have their own issues. Protocol Buffers, for example, requires users to define a data structure, generate code from that structure, and then assemble the data. If you need to work with datasets from multiple data sources, you have to define multiple data structures and repeat the process for each one, so arbitrary datasets cannot be handled uniformly. Second, for scripting frameworks in Hadoop such as Hive and Pig, requiring code generation is unreasonable. Moreover, to cope with the possibility that the data does not exactly match its definition, Protocol Buffers adds annotations to the serialized data, which makes the data larger and processing slower. Other serialization systems suffer from similar problems. So, with Hadoop's future in mind, Doug Cutting led the development of a new serialization system, Avro, which joined the Hadoop project family in 2009.

The comparison with Protocol Buffers above gives a rough sense of what is special about Avro. Let's now look at Avro in more detail.

Avro relies on a schema (Schema) to define its data structures. You can think of a schema as a Java class: it defines the structure of each instance and which attributes it can contain, and you can create as many instance objects as you want from the class. To serialize an instance you must know its basic structure, which means referring to the class information. An Avro object generated according to a schema is similar to an instance of a class; you need to know the schema's concrete structure every time you serialize or deserialize. Therefore, in the scenarios where Avro is used, such as file storage or network communication, schema and data must exist together. Avro data is always read and written with the schema at hand (from a file or over the network), so the serialized data needs no extra identifiers appended to it; serialization is fast and the result is compact. Because programs can process data directly according to the schema, Avro is also a good fit for scripting languages. Note that Avro is not columnar storage; it is simply a serialization format.

Avro schemas are mainly represented as JSON objects, which may carry specific attributes describing the different forms of a type (Type). Avro supports eight primitive types (Primitive Type) and six complex types (Complex Type). A primitive type can be represented by a plain JSON string. Each complex type is defined by its own set of attributes (Attribute), some required and some optional, and where needed a JSON array can hold multiple JSON object definitions. With these Avro types, users can build rich data structures to support complex data.
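As a minimal illustration (the record name and fields below are made up for this example), a schema that combines primitive types with the complex types record, array and union could look like this:

{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "emails", "type": {"type": "array", "items": "string"}},
    {"name": "nickname", "type": ["null", "string"], "default": null}
  ]
}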

Avro supports two serialization encodings: binary encoding and JSON encoding. Binary encoding makes serialization efficient and the result small, while JSON encoding is generally used for debugging or for WEB-based applications. Serialization and deserialization of Avro data follows the schema in depth-first (Depth-First), left-to-right (Left-to-Right) traversal order. Serializing the primitive types is straightforward; the complex types each have their own rules. The binary encoding of both primitive and complex types is specified in the documentation, and the bytes are laid out in the order in which the schema is parsed. For JSON encoding, the union type (Union Type) behaves differently from the other complex types.
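As a small worked example of the binary encoding (the schema and values are made up for illustration), consider a record with a long field a and a string field b. A long is written as a zig-zag varint and a string as a length followed by its UTF-8 bytes, with no field names or tags in the output:

Schema: {"type": "record", "name": "R", "fields": [{"name": "a", "type": "long"}, {"name": "b", "type": "string"}]}
Datum: a = 1, b = "foo"
Binary encoding (5 bytes): 02 06 66 6F 6F
  02 = long 1 (zig-zag of 1 is 2, written as a varint)
  06 = string length 3 (zig-zag of 3 is 6)
  66 6F 6F = the UTF-8 bytes of "foo"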

To make processing with MapReduce convenient, Avro defines a container file format (Container File Format). Such a file can have only one schema, and every object stored in the file must be written according to that schema, using binary encoding. Objects are organized into blocks (Block) within the file, and blocks can be compressed. There is a synchronization marker (Synchronization Marker) between blocks, so MapReduce can easily split the file for processing. The structure of such a file, as described in the documentation, is as follows.

A storage file consists of two parts: header information (Header) and data blocks (Data Block). The header consists of three parts: a four-byte prefix (similar to a magic number), the file's metadata, and a randomly generated 16-byte synchronization marker. One might wonder what the metadata can contain besides the file's schema. The documentation specifies only two metadata entries reserved by Avro: schema and codec. The codec indicates how the subsequent file data blocks (File Data Block) are compressed. An Avro implementation must support at least two codecs: null (no compression) and deflate (compress data blocks with the Deflate algorithm). Besides these two reserved entries, users can also define their own metadata. A long value records how many metadata key/value pairs there are, so users can define as much metadata as they need in practice. Each pair consists of a string key (keys reserved by Avro carry the "avro." prefix) and a binary-encoded value. Each data block following the header has this structure: a long value recording how many objects are in the block, a long value recording the block's size in bytes after compression, the serialized objects themselves, and a 16-byte synchronization marker. Because objects are organized into separate blocks, you can operate on a block of data without deserializing it, and a corrupted block can be located using the object counts, block sizes and synchronization markers, which helps preserve data integrity.
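Putting the description above together, the byte layout of a container file looks roughly like this (per the Avro specification):

Header:
  4-byte magic: the ASCII bytes 'O' 'b' 'j' followed by 0x01
  file metadata map: avro.schema = the JSON schema, avro.codec = null or deflate, plus any user-defined entries
  16-byte synchronization marker, randomly generated per file
Each data block:
  long: number of objects in this block
  long: size in bytes of the serialized objects (after compression, if any)
  the serialized objects themselves
  the file's 16-byte synchronization marker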

The above covers serializing Avro objects to a file. Avro is also used as an RPC framework. Before a client interacts with a server, the two sides need to exchange the communication protocol, which, like a schema, must be defined by both parties; its operations are called Messages in Avro. Both sides of the communication must hold this protocol in order to parse the data the other side sends, and this is the legendary handshake phase.

Messages sent from the client to the server pass through the transport layer (Transport Layer), which sends the message and receives the server's response. The data that reaches the transport layer is already binary. HTTP is usually used as the transport, with the data sent to the other side via POST. In Avro, a message is encapsulated into a series of buffers (Buffer), roughly as follows.

Each buffer begins with four bytes giving its length, followed by that many bytes of buffered data, and the series ends with an empty buffer. The advantage of this framing is that the sender can easily assemble data from different data sources when sending, and the receiver can likewise store the data into different areas. In addition, when writing data into buffers, a large object can occupy a buffer of its own rather than being mixed with small objects, making it easy for the receiver to read large objects.

Let's touch on some other aspects of Avro. Doug Cutting was quoted above as saying that when Protocol Buffers transmits data, it adds annotations to the data to handle the case where the data structure does not exactly match the data, which directly leads to drawbacks such as larger data and harder parsing. So how does Avro deal with differences between schema and data? To keep Avro efficient, it assumes that the schemas mostly match and defines a set of resolution rules; data is validated under those rules, and if the schemas cannot be matched an error is reported. When exchanging data under matching schemas, if a field is missing from the data, the default value declared in the schema is used; if the data contains extra values that the schema does not describe, those values are ignored.
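For illustration, here is a made-up pair of schemas showing those rules, where data written with an older schema is read with a newer one that adds a field with a default and drops another:

Writer schema: {"type": "record", "name": "User", "fields": [
  {"name": "name", "type": "string"},
  {"name": "old_field", "type": "int"}]}
Reader schema: {"type": "record", "name": "User", "fields": [
  {"name": "name", "type": "string"},
  {"name": "new_field", "type": "int", "default": 0}]}

When reading, new_field is filled in with its default value 0, and the writer's old_field is simply skipped.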

Another advantage Avro lists is that its data can be sorted: an Avro program in one language can serialize data, and an Avro program in another language can then sort that data without deserializing it. I am not sure in which scenarios this mechanism is used, but it looks quite nice.

Reposted from: http://langyu.iteye.com/blog/708568

Implementing Avro storage in Hive is very simple; there is a very detailed introduction at https://cwiki.apache.org/confluence/display/Hive/AvroSerDe.

1) For newer versions of Hive, you can use the avro format directly without manually specifying an Avro schema file. Hive derives the schema from the way the table is created and stores it in the header of each data file.

CREATE TABLE kst (
  name string,
  age int
) STORED AS AVRO;

After that, data can be imported into the kst table in several ways: 1) insert into ... select from another table; 2) generate data in Avro format with another program and load it into the Hive table; or 3) add a partition that points at existing Avro data. A sketch of the second way follows the first example below.

The first way:

FROM (SELECT * FROM stus) base
INSERT INTO TABLE kst SELECT *;

After that, you can view the data files that Hive generates.

As you can see, the schema is generated automatically.
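As a minimal sketch of the second way from the list above (the HDFS path and file name are hypothetical), data generated elsewhere in Avro format can be loaded straight into the table, provided it was written with a schema compatible with the table's columns:

LOAD DATA INPATH '/tmp/stus.avro' INTO TABLE kst;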

2) When creating the table, the schema is written directly into the CREATE TABLE statement (use avro.schema.literal and embed the schema in the create statement):

CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema embedded in the CREATE TABLE statement"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/wyp/examples/input/'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "type": "record",
    "name": "Tweet",
    "namespace": "com.miguno.avro",
    "fields": [
      {"name": "username", "type": "string"},
      {"name": "tweet", "type": "string"},
      {"name": "timestamp", "type": "long"}
    ]
  }'
);
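The AvroSerDe also accepts avro.schema.url in place of avro.schema.literal, pointing to a schema file (the HDFS path below is hypothetical); the rest of the CREATE TABLE statement stays the same and only the TBLPROPERTIES changes:

TBLPROPERTIES (
  'avro.schema.url'='hdfs:///user/wyp/examples/schemas/tweet.avsc'
);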

3) Through a Hive script (use avro.schema.literal and pass the schema into the script as a variable):

Hive can do simple variable substitution, so you can pass the schema to the script embedded in a variable. Note that to do this, the schema must be completely escaped (carriage returns converted to \n, tabs to \t, quotes escaped, etc.). An example:

set hiveconf:schema;
DROP TABLE example;
CREATE TABLE example
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.literal'='${hiveconf:schema}');

To execute this script file, assuming $SCHEMA has been defined to be the escaped schema value:

hive --hiveconf schema="${SCHEMA}" -f your_script_file.sql

Note that $SCHEMA is interpolated into the quotes to correctly handle spaces within the schema.
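For instance, a minimal sketch where the schema is short enough to keep on one line (the record below is made up), so no line breaks or tabs need escaping:

SCHEMA='{"type":"record","name":"example","fields":[{"name":"id","type":"long"}]}'
hive --hiveconf schema="${SCHEMA}" -f your_script_file.sql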

Finally, whether you specify a schema explicitly or let Hive generate one automatically, the resulting schema is written into the generated data files.

After reading this article, I believe you have gained some understanding of the Avro data storage format in Hive. Thank you for reading!
