Getting started with gRPC (2)-- Analysis of the principle of Protobuf Serialization 07/06 Update SLTechnology News&Howtos


2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

I. Introduction to the principle of Protobuf Serialization

1. Serialization

Serialization is the process of converting a data structure or object into a binary byte stream.

Protobuf serializes message fields with different encoding and data storage methods for different field types to ensure efficient and compact data compression.

The Protobuf serialization process is as follows:

(1) Check whether each field has been set; only fields that have a value are encoded.

(2) Encode each field value with the encoding method determined by its data type, together with its field identification number.

(3) Pack the encoded data blocks into a binary byte stream, using the data storage mode that matches each field type.

2. Deserialization

Deserialization is the process of converting a stream of binary bytes generated during serialization into a data structure or object.

The Protobuf deserialization process is as follows:

(1) Call the parseFrom(input) method of the message class to parse the binary byte stream read from the input stream.

(2) Read the parsed data, according to the specified format, into the corresponding structure type of the target language, such as Java, C++ or Python.

II. Protobuf encoding methods

1. Varint encoding

Varint encoding is a variable-length encoding for integers. Each byte carries 7 bits of the number, so the smaller the value, the fewer bytes are used. Data is compressed by reducing the number of bytes needed to represent numbers.

A number of type int32 normally takes 4 bytes. With Varint encoding, a very small int32 value needs only 1 byte, while a large value can need up to 5 bytes. Because most messages do not contain very large numbers, Varint encoding uses fewer bytes in most cases, though not always.

The highest bit of each byte after Varint encoding has a special meaning:

A. if it is 1, it means that subsequent bytes are also part of the number.

B. if it is 0, it means that this byte is the last byte of the number.

In both cases, the remaining 7 bits of the byte carry part of the value.

When using Varint decoding, as long as the byte with the highest bit 0 is read, it indicates that this byte is the last byte of a Varint-encoded byte stream.

In a computer, a negative number is stored in two's complement form with the sign bit as the highest bit, so it looks like a very large unsigned integer. If a negative number is stored in an int32 or int64 field, Protobuf sign-extends it to 64 bits before Varint encoding, so it always occupies 10 bytes.

To avoid this, Protobuf defines the sint32 / sint64 types for values that may be negative. They reduce the number of encoded bytes by first applying Zigzag encoding (converting signed numbers into unsigned numbers) and then applying Varint encoding.

The Varint encoding of the int32 value 300 is as follows:

The binary representation of 300 is: 100101100 (256 + 32 + 8 + 4).

Take the lowest 7 bits and set the highest bit to 1, because more bytes follow: [1] 010 1100

Take the next 7 bits (padded with leading zeros) and set the highest bit to 0, because this is the last byte: [0] 000 0010

The two bytes are: [0] 000 0010 [1] 010 1100

Write them in little-endian order (least significant group first): 10101100 00000010

Encoding result: 1010 1100 0000 0010 (0xAC 0x02)
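The steps above can be sketched in Python. This is a minimal illustration rather than the Protobuf library's own code; the function names encode_varint and decode_varint are hypothetical, and the sketch only handles non-negative integers.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Varint (7 data bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F            # take the lowest 7 bits
        value >>= 7
        if value:                      # more groups follow: set the highest bit
            out.append(low7 | 0x80)
        else:                          # last group: highest bit stays 0
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one Varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # groups arrive least significant first
        if not byte & 0x80:                # highest bit 0 marks the last byte
            return result, pos
        shift += 7


print(encode_varint(300).hex())               # ac02
print(decode_varint(bytes([0xAC, 0x02])))     # (300, 2)
```

The decoder returns the next read position as well as the value, which is what a parser needs in order to continue with the following field.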

2. Zigzag coding

Zigzag encoding maps signed numbers to unsigned numbers so that numbers with a small absolute value, negative ones in particular, can be represented in fewer bytes once Varint encoding is applied. The mapping alternates between non-negative and negative values: 0 maps to 0, -1 to 1, 1 to 2, -2 to 3, and so on.

Zigzag encoding makes up for the weakness of Varint encoding in representing negative numbers, which helps Protobuf compress data. Therefore, if you expect a field value to be negative, you should use the sint32/sint64 data type.

When Protobuf is encoded through Varint and Zigzag, the number of bytes consumed by field values is greatly reduced.

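The Zigzag process for -2 (as a sint32) is as follows:

The 32-bit two's complement representation of -2 is: 1111 1111 1111 1111 1111 1111 1111 1110

Shift left by 1 bit (n << 1): 1111 1111 1111 1111 1111 1111 1111 1100

Arithmetic shift right by 31 bits (n >> 31): 1111 1111 1111 1111 1111 1111 1111 1111

XOR the two results, (n << 1) ^ (n >> 31): 0000 0000 0000 0000 0000 0000 0000 0011

The Zigzag value is 3, which Varint encoding stores in a single byte: 0000 0011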
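A minimal Python sketch of the Zigzag mapping for 32-bit values, combined with Varint encoding to compare the encoded sizes of a negative number stored as int32 and as sint32. The function names are hypothetical, and the int32 case assumes the value is sign-extended to 64 bits before Varint encoding.

```python
def zigzag_encode32(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned one (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    # Python's >> on negative ints is an arithmetic shift, so n >> 31 is -1 or 0.
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag_decode32(z: int) -> int:
    """Invert zigzag_encode32."""
    return (z >> 1) ^ -(z & 1)


def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


print(zigzag_encode32(-2))                               # 3
print(zigzag_decode32(3))                                # -2

as_sint32 = encode_varint(zigzag_encode32(-2))           # Zigzag, then Varint
as_int32 = encode_varint(-2 & 0xFFFFFFFFFFFFFFFF)        # sign-extended to 64 bits
print(len(as_sint32), len(as_int32))                     # 1 10
```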

III. Protobuf data storage mode

1. T-L-V data storage mode

T-L-V (Tag-Length-Value) is the identifier-length-field value storage mode. Each piece of data is represented by its identifier, length and field value, and all the pieces are concatenated into a single byte stream.

Length is optional. For example, Varint-encoded data does not need a stored Length; in that case the storage mode is T-V.

Advantages of T-L-V storage:

A. Fields can be told apart without delimiters, so no bytes are spent on delimiters.

B. Each field is stored very compactly, so storage space is used efficiently.

C. If a field has not been set, it does not appear in the serialized data at all, so nothing has to be encoded; the field is simply set to its default value when decoding.

2. T-V data storage mode

After Protobuf has encoded the identification number, data type and field value of a message field with Varint and Zigzag, the data is stored in T-V (Tag-Value) mode.

For the data encoded by Varint and Zigzag, the byte length Length in T-L-V is omitted.

Tag is the Varint-encoded combination of the field identifier (field_number) and the data type (wire_type): the wire_type occupies the low bits and the field_number the bits above it.

A Tag normally occupies one byte (more if the field identifier is greater than 15). In a one-byte Tag, the field data type (wire_type) occupies the lowest 3 bits, the field identifier (field_number) occupies the next 4 bits, and the highest bit is reserved as the Varint continuation bit.
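The Tag layout described above can be sketched in Python as follows. The helper names make_tag and split_tag are hypothetical, and the sketch works on the tag value before it is Varint encoded; for field identifiers up to 15 that value is below 128 and fits in a single byte.

```python
from typing import Tuple


def make_tag(field_number: int, wire_type: int) -> int:
    """Combine a field number and a 3-bit wire type into a tag value."""
    if not 0 <= wire_type <= 7:
        raise ValueError("wire_type must fit in 3 bits")
    return (field_number << 3) | wire_type


def split_tag(tag: int) -> Tuple[int, int]:
    """Recover (field_number, wire_type) from a tag value."""
    return tag >> 3, tag & 0x7


tag = make_tag(2, 2)                 # field 2, wire type 2
print(tag, format(tag, "08b"))       # 18 00010010
print(split_tag(tag))                # (2, 2)
```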

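Tag = (field_number << 3) | wire_type

field_number = Tag >> 3

wire_type = Tag & 0x7

For example, for the Tag byte 0001 0010 (decimal 18): field_number = 0010 (2) and wire_type = 010 (2).

IV. Analysis of Protobuf serialization principle

1. Introduction to Protobuf serialization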

Protobuf's three principles for data storage:

(1) After each field in the message has been encoded by Protocol Buffer, the data is stored in T-L-V storage mode, and the result is a binary byte stream.

(2) Protobuf uses different serialization methods (data encoding and data storage) for different data types.

Protobuf serializes message fields with different encoding and data storage methods for different field types to ensure efficient and compact data compression. Different types of data are encoded and stored as follows:

Varint-encoded data and fixed-width (32-bit and 64-bit) data do not need a stored byte length Length, so they are stored in T-V mode; length-delimited data (LENGTH_DELIMITED) is stored in T-L-V mode.

(3) Protobuf's encoding of field values, combined with the T-L-V data storage mode, makes the serialized data very small.
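To illustrate how a reader can walk such a byte stream without delimiters, here is a minimal Python sketch that splits a stream into fields by looking at the wire type in each Tag. The function names read_varint and parse_fields are hypothetical, and the sketch returns raw values without knowing the message definition.

```python
from typing import List, Tuple, Union


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Read one Varint at pos; return (value, next position)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(data: bytes) -> List[Tuple[int, int, Union[int, bytes]]]:
    """Return (field_number, wire_type, raw value) for every field in data."""
    fields = []
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:                       # Varint: T-V, no Length
            value, pos = read_varint(data, pos)
        elif wire_type == 1:                     # 64-bit: T-V, always 8 bytes
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:                     # length-delimited: T-L-V
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:                     # 32-bit: T-V, always 4 bytes
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


# Field 2 (string "testing") followed by field 1 (Varint 300).
sample = bytes([18, 7, 116, 101, 115, 116, 105, 110, 103, 8, 0xAC, 0x02])
print(parse_fields(sample))   # [(2, 2, b'testing'), (1, 0, 300)]
```

Only wire type 2 reads a Length; for the other wire types the size of the value follows from the encoding itself.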

2. Serialization of WireType=0

The types of WireType=0 include int32, int64, uint32, uint64, bool, enum, sint32 and sint64.

The values are Varint encoded (sint32 and sint64 values are Zigzag encoded first), and the data is stored in the binary byte stream in T-V mode.

3. Serialization of WireType=1

The types of WireType=1 include fixed64,sfixed64,double.

The values use 64-bit encoding (the encoded data is always 64 bits long and is written in little-endian order, low-order byte first), and the data is stored in the binary byte stream in T-V mode.

4. Serialization of WireType=2

The types of WireType=2 include string,bytes, nested messages, and packed repeated fields.

As for encoding, the Tag is Varint encoded and the byte length Length is Varint encoded. Field values of string type are UTF-8 encoded, and the field values of a nested message are encoded according to the data types of the fields inside the nested message.

The data storage mode uses T-L-V mode to store binary byte streams.

5. Serialization of WireType=5

The types of WireType=5 include fixed32,sfixed32,float.

The values use 32-bit encoding (the encoded data is always 32 bits long and is written in little-endian order, low-order byte first), and the data is stored in the binary byte stream in T-V mode.
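The four wire types described above can be compared side by side with a small Python sketch. The helper names are hypothetical; the fixed-width cases use double and float values packed in little-endian order with the standard struct module.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    return encode_varint((field_number << 3) | wire_type)


def varint_field(field_number: int, value: int) -> bytes:
    """WireType=0: Tag + Varint value (T-V)."""
    return tag(field_number, 0) + encode_varint(value)


def double_field(field_number: int, value: float) -> bytes:
    """WireType=1: Tag + 8 little-endian bytes (T-V)."""
    return tag(field_number, 1) + struct.pack("<d", value)


def bytes_field(field_number: int, value: bytes) -> bytes:
    """WireType=2: Tag + Varint length + payload (T-L-V)."""
    return tag(field_number, 2) + encode_varint(len(value)) + value


def float_field(field_number: int, value: float) -> bytes:
    """WireType=5: Tag + 4 little-endian bytes (T-V)."""
    return tag(field_number, 5) + struct.pack("<f", value)


print(varint_field(1, 300).hex())          # 08ac02
print(double_field(1, 1.0).hex())          # 09000000000000f03f
print(bytes_field(2, b"testing").hex())    # 120774657374696e67
print(float_field(1, 1.0).hex())           # 0d0000803f
```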

V. Protobuf serialization examples

1. String type

The value of the String type field is encoded with UTF-8. The message data flow is as follows:

message Test {
  required string str = 2;
}

// set str to: testing
Test.setStr("testing")

// output data serialized by protobuf encoding in binary format:
// 18,7,116,101,115,116,105,110,103
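The bytes of this example can be reproduced with a short Python sketch. It does not use the Protobuf library; encode_string_field is a hypothetical helper that builds the Tag, the Length and the UTF-8 value by hand.

```python
def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a string field as Tag + Length + UTF-8 bytes (wire type 2)."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2
    # Length counts bytes, not characters.
    return encode_varint(tag) + encode_varint(len(payload)) + payload


print(list(encode_string_field(2, "testing")))
# [18, 7, 116, 101, 115, 116, 105, 110, 103]
```

Here 18 is the Tag (field 2, wire type 2), 7 is the Length, and the remaining bytes are the UTF-8 codes of "testing".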

2. Nested message types

The nested message type is stored in T-L-V, and the V of the external message is the field of the nested message.

A series of T-L-V is nested in the V of T-L-V.

Encoding method: the field value (V) adopts different encoding methods according to the data type of the field.

message Test2 {
  required string str = 1;
  required int32 id1 = 2;
}

message Test3 {
  required Test2 c = 1;
}

// set the field str in Test2 to: testing
// set the field id1 in Test2 to: 296
// encoded bytes are: 10,12,10,7,116,101,115,116,105,110,103,16,168,2
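A hand-written Python sketch of this nesting, assuming the field numbers in the definitions above (str = 1 and id1 = 2 in Test2, c = 1 in Test3). The function names are hypothetical; the point is that the serialized Test2 becomes the V of a T-L-V entry in Test3.

```python
def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def length_delimited(field_number: int, payload: bytes) -> bytes:
    """Tag (wire type 2) + Length + payload."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def encode_test2(text: str, id1: int) -> bytes:
    """Test2: string str = 1 (T-L-V), int32 id1 = 2 (T-V, non-negative here)."""
    str_part = length_delimited(1, text.encode("utf-8"))
    id1_part = encode_varint((2 << 3) | 0) + encode_varint(id1)
    return str_part + id1_part


def encode_test3(inner: bytes) -> bytes:
    """Test3: Test2 c = 1, stored as T-L-V whose V is the encoded Test2."""
    return length_delimited(1, inner)


inner = encode_test2("testing", 296)
print(len(inner))                    # 12
print(list(encode_test3(inner)))
# [10, 12, 10, 7, 116, 101, 115, 116, 105, 110, 103, 16, 168, 2]
```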

3. Repeated fields modified by packed

message Test {
  repeated int32 Car = 4;                  // expression 1: without packed=true
  repeated int32 Car = 4 [packed=true];    // expression 2: with packed=true
}

Test.setCar(3);
Test.setCar(270);
Test.setCar(86942);

If the values are serialized as multiple T-V pairs (without packed=true), the Tag is redundant: the same Tag is stored once for every value.

To remove this redundancy, a repeated field can be stored with packed=true. The Tag is stored only once, followed by the total length Length of all the field values, and then the values themselves are stored back to back. The result is one large Tag-Length-Value-Value-Value group, that is, a T-L-V-V-V group.

Storing repeated fields with packed=true therefore makes the serialized data shorter.
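The difference between the two layouts can be checked with a minimal Python sketch for the values 3, 270 and 86942 in field 4. The function names encode_unpacked and encode_packed are hypothetical.

```python
from typing import Iterable


def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_unpacked(field_number: int, values: Iterable[int]) -> bytes:
    """One T-V pair per value: the Tag (wire type 0) is repeated every time."""
    tag = encode_varint((field_number << 3) | 0)
    return b"".join(tag + encode_varint(v) for v in values)


def encode_packed(field_number: int, values: Iterable[int]) -> bytes:
    """One T-L-V-V-V group: Tag (wire type 2), total Length, then all values."""
    payload = b"".join(encode_varint(v) for v in values)
    tag = encode_varint((field_number << 3) | 2)
    return tag + encode_varint(len(payload)) + payload


values = [3, 270, 86942]
unpacked = encode_unpacked(4, values)
packed = encode_packed(4, values)
print(unpacked.hex(), len(unpacked))   # 2003208e02209ea705 9
print(packed.hex(), len(packed))       # 2206038e029ea705 8
```

With only three values the saving is a single byte; the gap grows with the number of values, because the unpacked form repeats the Tag for each one.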

VI. Suggestions for the use of Protobuf

Based on the analysis of Protobuf serialization principle, in order to effectively reduce the amount of data after serialization, the following measures can be taken:

(1) use more optional or repeated modifiers

If the optional or repeated field is not set to a field value, the field does not exist in the serialized data at all, that is, it does not need to be encoded, but the corresponding field is set to the default value when decoding.

(2) try to use only field identification numbers (Field_Number) from 1 to 15, and do not skip numbers.

The Tag takes up byte space. If Field_Number is greater than 15, the Tag needs 2 bytes or more instead of 1. Defining field identification numbers as continuously increasing values can also give better encoding and decoding performance.

(3) if the field values may be negative, use sint32/sint64 instead of int32/int64.

When using the sint32/sint64 data type to represent negative numbers, Zigzag coding will be used first and then Varint coding will be used to compress the data more effectively.

(4) for repeated fields, add the packed=true option whenever possible.

With packed=true, a repeated field is stored contiguously, that is, in T-L-V-V-V mode. Note that only repeated fields of scalar numeric types can be packed.
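Suggestions (1) and (2) can be checked with a small Python sketch. It is a simplified model, not the Protobuf library: serialize_varint_fields is a hypothetical function that takes a mapping from field number to an integer value, where None stands for an unset field, and tag_size is a hypothetical helper that reports how many bytes a Tag needs.

```python
from typing import Dict, Optional


def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def serialize_varint_fields(fields: Dict[int, Optional[int]]) -> bytes:
    """Encode Varint fields in field-number order, skipping unset ones."""
    out = bytearray()
    for number in sorted(fields):
        value = fields[number]
        if value is None:          # unset field: nothing is written at all
            continue
        out += encode_varint((number << 3) | 0)
        out += encode_varint(value)
    return bytes(out)


def tag_size(field_number: int) -> int:
    """Number of bytes the Varint-encoded Tag occupies."""
    return len(encode_varint(field_number << 3))


# Field 2 left unset versus field 2 explicitly set to 0.
print(serialize_varint_fields({1: 300, 2: None, 3: 1}).hex())   # 08ac021801
print(len(serialize_varint_fields({1: 300, 2: 0, 3: 1})))       # 7

# Tag size for several field numbers.
print([tag_size(n) for n in (1, 15, 16, 2047, 2048)])           # [1, 1, 2, 2, 3]
```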

References:

Carson_Ho: Revealing the principle of Protocol Buffer Serialization
