What are the common Serialize technologies?


This article mainly explains the common Serialize technologies. The explanation is meant to be simple, fast and practical, so interested readers may wish to take a look and follow along.

[I. Commonly used Serialize schemes for API and message communication]:

Scheme 1: serialization and deserialization of objects based on Java's native ObjectOutputStream.writeObject() and ObjectInputStream.readObject().

Scheme 2: serialization and deserialization based on JSON.

Scheme 3: serialization and deserialization based on XML.

[Scheme 1 analysis: ObjectXXXStream]:

Advantages:

(1) Serialization comes with Java's own API; it is simple, convenient, and has no third-party dependencies.

(2) There is no need to worry about precision loss during parsing, missing fields, or uncertainty about the deserialized type of an Object.

Disadvantages:

(1) Debugging is troublesome for both parties. The sender and receiver should use the same version of the object definition, otherwise strange problems appear; the debugging cycle is relatively long, and cross-team upgrades cause many problems.

(2) The transmitted stream carries metadata about the object, which takes up a lot of space.
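As a reference point for Scheme 1, here is a minimal round trip using the JDK's built-in mechanism (the PP class mirrors the example used later in this article; the class and stream setup are only illustrative):

import java.io.*;

// Illustrative serializable class, matching the PP example used later in this article.
class PP implements Serializable {
    private static final long serialVersionUID = 1L;
    long userId = 102333320132133L;
    int passportNumber = 123456;
}

public class NativeSerializeDemo {
    public static void main(String[] args) throws Exception {
        // Serialize the object graph (including class metadata) into a byte array.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new PP());
        }
        byte[] bytes = bos.toByteArray();
        System.out.println("serialized size = " + bytes.length + " bytes"); // far more than the 12 bytes of real data

        // Deserialize; the receiver must have a compatible version of PP on its classpath.
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            PP pp = (PP) ois.readObject();
            System.out.println(pp.userId + " / " + pp.passportNumber);
        }
    }
}

The size printed here makes disadvantage (2) concrete: the stream carries the class name and field metadata along with the 12 bytes of actual data.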

[Scheme 2 analysis: JSON serialization]:

Advantages:

(1) Simple and convenient; there is no need to pay special attention to the format of the object being serialized.

(2) Many open-source components support it; FastJSON, for example, has very good performance.

(3) Most RPC frameworks support this scheme.

Disadvantages:

(1) If an attribute is declared as Object, deserialization becomes troublesome when the business code itself is not sure of the concrete data type (see the small sketch after this list).

(2) Because it is a text format, it inevitably takes up much more space than the raw data (see the magnification examples below).

(3) Compatibility and performance depend on the JSON parsing library being used; different libraries may handle some details of JSON (for example, non-standard JSON) in different ways.

(4) Regardless of the data type, serialization must first produce a String and then convert it to byte[], which increases the number of memory copies.

(5) For deserialization, the entire JSON must be parsed into objects before it can be read. Keep in mind that Java objects, especially hierarchically nested ones, take up much more memory than the data itself.
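To make disadvantage (1) concrete, here is a small sketch using FastJSON, which was mentioned above (any JSON library shows the same kind of ambiguity; the field names are made up for the example):

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;

public class JsonTypeDemo {
    public static void main(String[] args) {
        String text = "{\"userId\":102333320132133,\"passportNumber\":123456,\"payload\":{\"k\":\"v\"}}";
        JSONObject obj = JSON.parseObject(text);
        // Numeric types are chosen by magnitude (Long vs Integer), and nested structures
        // come back as generic JSONObject rather than a domain class, so code that only
        // knows a field is "Object" has to guess or re-map the real type itself.
        System.out.println(obj.get("userId").getClass());          // java.lang.Long
        System.out.println(obj.get("passportNumber").getClass());  // java.lang.Integer
        System.out.println(obj.get("payload").getClass());         // com.alibaba.fastjson.JSONObject
    }
}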

Extreme case 1 of data magnification:

The data to be transmitted is described as follows:

class PP {
    long userId = 102333320132133L;
    int passportNumber = 123456;
}

The JSON passed at this point looks like:

{
    "userId": 102333320132133,
    "passportNumber": 123456
}

The data we actually want to pass is one long and one int, i.e. 12 bytes. The length of this JSON string is the number of bytes actually transmitted (ignoring carriage returns and spaces, which are only there for readability; also note that the long is carried as a string of decimal digits). The string is about 51 bytes, so the data is magnified roughly 4.25x.
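You can check this magnification yourself with a few lines of JDK-only code (a minimal sketch; the values match the PP example above):

import java.nio.charset.StandardCharsets;

public class JsonSizeDemo {
    public static void main(String[] args) {
        // The compact JSON form of the PP object above.
        String json = "{\"userId\":102333320132133,\"passportNumber\":123456}";
        int jsonBytes = json.getBytes(StandardCharsets.UTF_8).length;
        int rawBytes = Long.BYTES + Integer.BYTES; // 8 + 4 = 12 bytes of real payload
        System.out.println("JSON bytes: " + jsonBytes);   // around 50
        System.out.println("raw bytes : " + rawBytes);    // 12
        System.out.printf("magnification: about %.2fx%n", (double) jsonBytes / rawBytes);
    }
}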

Extreme case 2 of data magnification:

When the data inside your object is of type byte[], the problem is that JSON is a text format and cannot carry raw byte[] data; the only way to serialize such data is to convert the bytes into characters. There are usually two ways:

(1) Use BASE64 encoding, which is the common practice in JSON today.

(2) Encode each byte as two hexadecimal characters; for example, the string "FF" represents the byte 0xFF.

With hexadecimal encoding every byte becomes two characters (2x); with BASE64 every 3 bytes become 4 characters (about 1.33x). Either way the byte[] data is noticeably magnified. Why not just encode the bytes as ISO-8859-1 characters? Because although such a string, when serialized back into the network byte[], would not grow, the receiver does not know the text is ISO-8859-1 and will typically decode it into a String with a more common character set such as GBK or UTF-8 before parsing the JSON further, so the byte[] may be corrupted during that decoding, and repairing that is very troublesome.
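The expansion of both encodings is easy to see with JDK classes alone (a minimal sketch):

import java.util.Base64;

public class BytesInJsonDemo {
    public static void main(String[] args) {
        byte[] raw = new byte[300]; // pretend this is a 300-byte binary payload
        // BASE64: 4 output characters per 3 input bytes -> about 1.33x
        String b64 = Base64.getEncoder().encodeToString(raw);
        // Hex: 2 output characters per input byte -> 2x
        StringBuilder hex = new StringBuilder();
        for (byte b : raw) {
            hex.append(String.format("%02X", b));
        }
        System.out.println("raw bytes   : " + raw.length);   // 300
        System.out.println("BASE64 chars: " + b64.length()); // 400
        System.out.println("hex chars   : " + hex.length()); // 600
    }
}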

[Scheme 3 analysis: XML serialization]:

Advantages:

(1) Simple and easy to use; there is no need to pay special attention to the format of the object being serialized.

(2) Good readability; XML is widely used in the industry, and we are all used to seeing XML in configuration files.

(3) A large number of RPC frameworks support it, and XML can directly serve as a document to be circulated.

Disadvantages:

(1) Serialization and deserialization performance has never been very good.

(2) It has the same data-type and data-magnification problems as JSON, and the magnification is even more serious; the extra memory copies are just as unavoidable as with JSON.

XML data magnification description:

XML's data magnification is usually more serious than JSON's. For the same data as in the JSON case above, XML would typically transmit it as:

<PP>
    <userId>102333320132133</userId>
    <passportNumber>123456</passportNumber>
</PP>

This message already takes 80+ bytes. If the XML carries some Property attributes and the objects are nested, the magnification can reach 10x, so XML magnifies data more severely than JSON, which is why more and more APIs prefer JSON over XML.
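For reference, a minimal sketch of producing that XML with JAXB (assuming Java 8, where javax.xml.bind is still bundled; the annotation setup and class are illustrative, not taken from the original article):

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.StringWriter;

@XmlRootElement(name = "PP")
class PP {
    public long userId = 102333320132133L;
    public int passportNumber = 123456;
}

public class XmlSizeDemo {
    public static void main(String[] args) throws Exception {
        Marshaller marshaller = JAXBContext.newInstance(PP.class).createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_FRAGMENT, true); // drop the XML prolog for a fair size comparison
        StringWriter out = new StringWriter();
        marshaller.marshal(new PP(), out);
        String xml = out.toString();
        System.out.println(xml);                                           // the <PP>...</PP> message above
        System.out.println(xml.length() + " bytes vs 12 bytes of real payload");
    }
}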

[What problems does this magnification cause]:

(1) More time is spent assembling strings and copying memory, which consumes more Java heap and produces more fragmentation.

(2) Because it is a text protocol, converting the generated JSON into byte[] requires first producing the full String and then encoding it into byte[], which naturally adds one more full memory copy.

(3) Transmission consumes more network bandwidth because the data is magnified.

(4) As more network packets are sent, more TCP ACKs are needed, so the overall system load also grows. Under the same packet-loss rate, more packets will be lost and the overall transfer takes longer. If the link used for this transfer has high latency and a high loss rate, we should try to shrink the data as much as possible. Compression is one way, but it puts a heavy extra load on the CPU, so we would rather reduce the data magnification as much as possible first and then, when transmitting, decide whether to compress based on RT and data size; when necessary, if the pre-compression data is very large, we can compress a sample of it first to estimate the compression ratio (see the sketch below).

(5) The receiver spends more time processing the data.

(6) Because it is a text protocol, processing adds extra overhead: converting numbers to strings and strings back to numbers, and converting byte[] to String and String back to byte[], all add extra memory and computation cost.

However, in a great many applications this overhead is negligible compared with the business logic, so it is not usually the focus of optimization. But some scenarios process far more data, that is, core business paths that serialize and deserialize heavily, and they do have to consider this issue, so I will keep discussing it.
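As mentioned in point (4), one pragmatic policy is to compress only when the payload is large enough to be worth the CPU cost. A minimal sketch of that idea (the threshold is an arbitrary placeholder and would need tuning against real RT and size measurements; a real protocol would also need a header flag telling the receiver whether the body is compressed):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class MaybeCompress {
    // Hypothetical threshold: below this size the CPU cost of gzip is not worth it.
    private static final int COMPRESS_THRESHOLD = 4 * 1024;

    public static byte[] maybeCompress(byte[] payload) throws IOException {
        if (payload.length < COMPRESS_THRESHOLD) {
            return payload; // small payload: send as-is
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(payload);
        }
        byte[] compressed = bos.toByteArray();
        // If the data does not actually shrink (e.g. it is already compressed), keep the original.
        return compressed.length < payload.length ? compressed : payload;
    }
}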

A few questions are raised at this point:

(1) Is there a better solution for passing data over the network? If so, why has it not been adopted on a large scale?

(2) How do relatively low-level communications such as JDBC do it, and would they suffer the same problems if they passed result sets the way the three schemes above do?

[II. MySQL JDBC data transfer scheme]:

Above I mentioned that data is magnified several times during serialization; shall we see whether relatively low-level communication does the same? Let's take MySQL JDBC as an example and see whether the same happens in its communication with the database server.

JDBC drivers have different implementations for different databases, and the details vary greatly between them. This article takes MySQL JDBC's data parsing (before MySQL 8.0) as an example to explain how it transmits data; in that transfer, what we care about most is how ResultSet data is carried.

Leaving aside the basic information such as MetaData in the result set, just look at the data itself:

(1) When JDBC reads data rows, it first reads a row package from the buffer. A row package is obtained from network packages: the size of the package is determined by the length given in the package header defined by the protocol, and then that many bytes are read from the network buffer. Note that the packages transferred over the network do not necessarily correspond one-to-one with the packages of business data; once the network packages reach the local buffer they are logically contiguous. The step in which JDBC reads a row package out of the local buffer is the copy from the kernel buffer into the JVM. For us Java programmers, the row package is what matters (inside the JDBC driver there are special cases where the packages read are not row-level; if you are interested in those, please refer to the source code).

Setting bits aside for a moment (31 of the bits are zero), in terms of bytes, 7 of the 8 bytes are zero, which means those bytes carry no data and only one byte is actually meaningful. Take a look at the many auto-increment columns in your own databases: until the id exceeds 4,194,303 the first five bytes are wasted, and until it grows beyond 4 bytes (2^32 - 1) the first four bytes are all zero and wasted. What's more, even once the first of the 8 bytes comes into use, a great deal of data will still have middle bytes that are very likely to be zero, just as, in decimal, a number on the order of 100 million can have as many as 8 zeros in its lower digits, and the higher digits are hard to fill.

If you really want to try, you can use a marker byte, though it costs some extra computation, so whether to trade CPU for that space is up to you; what follows is only a technical discussion:

Method 1: use a few low-order bits of the marker byte to record how many bytes are actually used. Since a long has only 8 bytes, 3 bits are enough; the other 5 bits are wasted, but that does not matter. During deserialization, simply pad the high-order bytes back with 0x00 (a sketch of this method follows Method 2 below).

Method 2: more thorough than Method 1, but more complex to handle. Use the 8 bits of a one-byte flag, each 0 or 1, to indicate which of the long's 8 bytes are present. Serialization and deserialization drop or pad bytes with 0x00 according to the flag byte and the data itself; padding back to a full 8 bytes recovers the long value. In the worst case a long takes 9 bytes, and in the best case (the value 0) just 1 byte. Even when the number grows quite large, many values still have zero bytes scattered through them and can usually be expressed in fewer than 8 bytes; only when 7 data bytes are needed does it occupy as much space as the original long, and by then the value already exceeds 2 to the 48th power.
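A minimal sketch of Method 1 under the assumptions above (the wire format, one marker byte whose low 3 bits record the payload length, is purely illustrative):

public class CompactLong {

    // Serialize: strip leading 0x00 bytes; the marker's low 3 bits store (payload length - 1).
    public static byte[] encode(long value) {
        int used = Math.max(1, 8 - Long.numberOfLeadingZeros(value) / 8); // payload bytes needed, 1..8
        byte[] out = new byte[1 + used];
        out[0] = (byte) (used - 1);                                       // 0..7 fits in 3 bits
        for (int i = 0; i < used; i++) {
            out[1 + i] = (byte) (value >>> (8 * (used - 1 - i)));         // big-endian payload
        }
        return out;
    }

    // Deserialize: read the length back from the marker and pad the missing high bytes with 0x00.
    public static long decode(byte[] in) {
        int used = (in[0] & 0x07) + 1;
        long value = 0;
        for (int i = 0; i < used; i++) {
            value = (value << 8) | (in[1 + i] & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        long id = 102333320132133L;
        byte[] wire = encode(id);
        System.out.println("bytes on the wire: " + wire.length + " (vs 8 for a plain long)"); // 7
        System.out.println("round trip ok: " + (decode(wire) == id));
    }
}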

[III. Google Protocol Buffer technical proposal]:

Many people may never have used this, or even know what it is for, but I have to say that it is currently a marvel of data serialization and deserialization. It was designed inside Google to standardize its own internal data communication. We all know that Google's global network is very powerful, so naturally it goes to extremes on data transmission. Here I will explain its principle; for usage details, please refer to other people's blogs, since the space of this article does not allow a step-by-step walkthrough.

From the name, whether you read it as a protocol buffer or as protocol encoding, its use is similar to using JSON or XML for RPC calls as described above, that is, passing messages or calling APIs between systems. But Google wanted, on the one hand, readability and cross-language generality similar to XML and JSON, and on the other hand higher serialization and deserialization performance with controllable data magnification, so it aimed for something easier to use than hand-written low-level encoding while still using a low-level encoding underneath and retaining document-like readability.

It first requires you to define a schema file, as follows:

Syntax = "proto2"

Package com.xxx.proto.buffer.test

Message TestData2 {

Optional int32 id = 2

Optional int64 longId = 1

Optional bool boolValue = 3

Optional string name = 4

Optional bytes bytesValue = 5

Optional int32 id2 = 6

}

This file is neither a Java file nor a C file and has nothing to do with any particular language; its extension is usually .proto (the numbers 1, 2, 3 and so on in the file are the field tags used on the wire; fields are serialized in this tag order and deserialized accordingly). After protobuf is installed locally (installation differs by OS; official downloads and instructions are available), the protoc executable is available and can be added to your PATH. Then run the command, specifying a target directory:

protoc --java_out=~/temp/ TestData2.proto

A directory tree matching the package declaration is then generated under the specified directory, containing a Java source file (other language targets generate code in those languages). This code is generated for you by Google's tooling; writing it by hand would be far too laborious, so Google does it for you. In your local Java project you then need to add the protobuf dependency via Maven (the version is up to you):

<dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>3.6.1</version>
</dependency>

The generated code calls into the library provided by the Google package to perform the actual serialization; our own code only needs to call the API of the generated class to serialize and deserialize. You can put these generated files into a module and publish it to a Maven repository for others to reference. As for test code, many blogs provide working examples.
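For orientation, a round trip with the generated class usually looks something like this (a sketch only: the exact outer class name depends on the .proto file name and options, so the import below is an assumption):

// Assumes protoc generated an outer class from TestData2.proto; adjust the import to whatever was generated.
import com.xxx.proto.buffer.test.TestData2OuterClass.TestData2;

public class ProtobufDemo {
    public static void main(String[] args) throws Exception {
        TestData2 data = TestData2.newBuilder()
                .setLongId(102333320132133L)
                .setId(123456)
                .build();

        byte[] bytes = data.toByteArray();             // serialize: only slightly larger than the raw data
        System.out.println("serialized size = " + bytes.length);

        TestData2 parsed = TestData2.parseFrom(bytes); // deserialize on the receiving side
        System.out.println(parsed.getLongId() + " / " + parsed.getId());
    }
}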

Google's encoding is rather magical: you define the transmitted data in an object-like way with extremely high readability, arguably even more programmer-friendly than XML or JSON; the definition can serve as communication documentation, it is shared across languages, object definitions can still be nested, and yet the serialized bytes are only slightly larger than the raw data. That is seriously powerful.

After testing different data types, deliberately building multiple levels of nesting, and nesting binary arrays several layers deep, I found that the magnification ratio is very small, almost equivalent to raw binary transmission. Dumping the serialized data in binary shows that its encoding is very close to the JDBC scheme above; there are differences in detail, but it is very close. Its serialization has several characteristics:

(1) If a field is empty, it produces no bytes at all; if every attribute of the whole object is null, the result is 0 bytes.

(2) int32 and int64 data use variable-length encoding, which shares the idea we described above: a relatively small int64 value can be expressed with fewer bytes, and internally a set of shift and XOR operations handles this (see the varint sketch after this list).

(3) Strings and byte[] are not converted at all; they are placed directly into the byte array, similar to raw binary encoding.

(4) Because an empty field can produce no bytes at all, its approach is that wherever there is data, a position (tag) code accompanies it; you can try changing the field numbers and see whether the generated bytes change. This also gives it strong compatibility: simply adding fields is fine, which is hard to achieve with ordinary hand-rolled binary encoding.

(5) Serialization produces no metadata, i.e. the structure of the object is not written into the bytes; the receiving side holds the same object definition and can parse the data back with it.
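A minimal sketch of the varint idea mentioned in point (2), following the widely documented base-128 layout (7 data bits per byte plus a continuation bit); this is illustrative, not Google's own code. The shift-and-XOR the article alludes to also covers the ZigZag step protobuf applies to signed sint types before varint-encoding them.

public class Varint {

    // Encode an unsigned long as a base-128 varint: low 7 bits per byte,
    // and the high bit of each byte says whether another byte follows.
    public static byte[] encode(long value) {
        byte[] buf = new byte[10];                   // a 64-bit value needs at most 10 varint bytes
        int pos = 0;
        while ((value & ~0x7FL) != 0) {
            buf[pos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buf[pos++] = (byte) value;
        byte[] out = new byte[pos];
        System.arraycopy(buf, 0, out, 0, pos);
        return out;
    }

    public static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;              // continuation bit clear: last byte
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode(123456L).length);            // 3 bytes instead of 8
        System.out.println(encode(102333320132133L).length);   // 7 bytes instead of 8
        System.out.println(decode(encode(102333320132133L)));  // round trip
    }
}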

What's the difference between this and my own coding?

(1) Writing your own encoding carries a lot of uncertainty; if it is not written well, the data may end up larger and the code error-prone.

(2) Google engineers turned it into an internal standard, Google's open-source products widely use this communication protocol, more and more middleware products in the industry are adopting it, and even newer versions of MySQL are starting to support protobuf in their data transmission.

(3) Google is effectively defining a new industry data-transmission scheme: it performs well, reduces development effort, and works across languages, so more and more people like to use it.

So what are its shortcomings? Not many; it covers most of what you need to consider in serialization and deserialization and strikes a very good balance. But if we must pick faults, we have to look at specific scenarios:

(1) protobuf requires both sides to agree on data types, and every field in the definition file must have a concrete type. There is no way to express an open-ended Object type; you must know the concrete type in advance.

(2) repeated can express arrays, but only of a single type; for example, when the columns of a JDBC row have different types, as mentioned above, this representation becomes awkward. In addition, repeated only expresses one-dimensional arrays; a two-dimensional array has to be expressed indirectly through nested objects.

(3) The data types it provides are all basic types; anything else must first be converted into a basic type for transmission. For example, a Document object fetched from MongoDB has to be serialized into byte[] or String by yourself. XML and JSON libraries generally offer such recursive, generic handling, but if protobuf offered it, it would inevitably face the data-magnification problem again; there is always a tension between generality and performance.

(4) Compared with a custom byte-level protocol, serialization and deserialization are done in one shot rather than incrementally, so if nested arrays are transferred, a large number of Java objects are created during deserialization. A custom byte protocol can also further reduce memory copies, although Google's copying is already far less than that of the text protocols.

At this point, I believe you have a deeper understanding of the common Serialize technologies; you might as well try them out in practice.
