Understanding serialization and deserialization in data exchange

2025-01-15 Update From: SLTechnology News&Howtos


# Summary

Serialization and deserialization are everyday work for engineers, but accurately grasping the two concepts is not easy: on the one hand, they often appear as part of a framework and thus disappear into it; on the other hand, they usually show up alongside more approachable concepts such as encryption and persistence. Nevertheless, the choice of a serialization protocol is an important part of system design and refactoring, especially for distributed and big-data systems. A proper serialization protocol can improve versatility, robustness and security, optimize system performance, and make the system easier to debug and extend. This article analyzes serialization and deserialization from several angles and compares a number of popular serialization protocols, hoping to help readers with serialization selection.

Brief introduction

The author works in Meituan's recommendation and personalization group, which is committed to providing high-quality personalized recommendation and ranking services to Meituan users at the scale of billions of requests every day. From terabyte-scale user behavior data to gigabyte-scale Deal/POI data, and from millisecond-fresh real-time user location data to regularly scheduled background job data, the recommendation and ranking system relies on a wide variety of data services. Its clients include various internal services, the Meituan client apps, and the Meituan website. To provide high-quality data services and integrate well with upstream and downstream systems, the choice of serialization and deserialization is often an important consideration in system design.

The content of this article is organized as follows:

The first part gives the definition of serialization and deserialization and their position in the communication protocol.

The second part discusses some characteristics of serialization protocol from the point of view of users.

The third part describes the typical serialization components in the specific implementation process, and makes an analogy with the database construction.

The fourth part explains the characteristics and application scenarios of several common serialization protocols, and gives examples of related components.

In the last part, the author gives some suggestions on technology selection based on the characteristics of various protocols and relevant benchmark data.

# I. Definition and related concepts

The emergence of the Internet brought the need for machine-to-machine communication, and both sides of a connection must adopt an agreed protocol; serialization and deserialization are part of such communication protocols. Communication protocols usually adopt a layered model, and different models define layers with different functions and granularity: the TCP/IP protocol suite is a four-layer model, while the OSI model has seven layers. The main function of the presentation layer (Presentation Layer) in the OSI model is to convert application layer objects into a contiguous binary string, and conversely to convert binary strings back into application layer objects; these two functions are exactly serialization and deserialization. Generally speaking, the application layer of TCP/IP corresponds to the application, presentation, and session layers of the OSI model, so a serialization protocol is part of the TCP/IP application layer. The discussion of serialization protocols in this article is based mainly on the OSI seven-layer model.

Serialization: the process of converting a data structure or object into a binary string

Deserialization: the process of converting a binary string generated during serialization into a data structure or object
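The two definitions can be made concrete in Java with the JDK's built-in object serialization. This is a minimal sketch; JDK serialization is only one concrete protocol among the many discussed later, and the helper names here are illustrative:

```java
import java.io.*;
import java.util.Arrays;

public class RoundTrip {
    // Serialization: turn an in-memory data structure/object into a binary string (byte[])
    static byte[] serialize(Serializable obj) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Deserialization: turn the binary string back into a data structure/object
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        byte[] wire = serialize(data);           // bytes, ready for the network or disk
        int[] back = (int[]) deserialize(wire);  // the reconstructed object
        System.out.println(Arrays.equals(data, back)); // true
    }
}
```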

Data structure, object and binary string

Data structures, objects and binary strings are represented differently in different computer languages.

Data structures and objects: in a fully object-oriented language like Java, everything an engineer operates on is an object (Object), instantiated from a class. The concepts closest to a data structure in Java are POJO (Plain Old Java Object) and JavaBean: classes that have only setter/getter methods. In C++, a semi-object-oriented language, data structures correspond to struct and objects correspond to class.

Binary string: the binary string produced by serialization is a piece of data stored in memory. C++ has direct memory operators, so the concept is easy to grasp there; for example, a C-style string can be used directly by the transport layer, because it is essentially a binary string in memory terminated by '\0'. In Java, the concept of a binary string is easily confused with String; in fact, String is a first-class citizen of Java and an Object. For cross-language communication, serialized data obviously cannot be a type special to one language. The binary string in Java is byte[], and byte is one of Java's eight primitive data types.
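The distinction between a Java String (an Object) and the language-neutral byte[] can be shown in a couple of lines; a minimal illustration, with the charset made explicit because the byte count depends on it:

```java
import java.nio.charset.StandardCharsets;

public class ByteStringDemo {
    public static void main(String[] args) {
        // A Java String is an Object; the cross-language "binary string" is byte[]
        String text = "serialize me";
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);    // characters -> bytes
        String back = new String(wire, StandardCharsets.UTF_8); // bytes -> characters
        System.out.println(wire.length);       // 12 bytes for this ASCII-only text
        System.out.println(back.equals(text)); // true
    }
}
```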

# II. Serialization protocol features

Each serialization protocol has its advantages and disadvantages, and they have their own unique application scenarios at the beginning of their design. In the process of system design, we need to consider all aspects of serialization requirements, comprehensively compare the characteristics of various serialization protocols, and finally give a compromise solution.

Versatility

Versatility has two levels of meaning:

First, at the technical level: does the serialization protocol support cross-platform and cross-language use? If not, its technical versatility is greatly reduced.

Second, popularity: serialization and deserialization involve multiple parties, and a protocol used by few people means high learning costs; moreover, unpopular protocols often lack stable, mature cross-language, cross-platform public libraries.

Robustness

A protocol may fall short on robustness for two reasons:

First, insufficient maturity. A protocol often takes a long time to go from formulation to implementation to final maturity, and its robustness depends on a large amount of comprehensive testing. For systems committed to providing high-quality service, adopting a serialization protocol that is still in its testing stage brings high risk.

Second, partiality toward a language or platform. To support cross-language, cross-platform functionality, protocol makers must do a great deal of work; when irreconcilable features exist between supported languages or platforms, they face a hard decision: keep the feature and support fewer languages and platforms, or abandon it and support more. When a protocol maker chooses to favor a particular language or platform, robustness is sacrificed for everyone else.

Debuggability / readability

Debugging the data correctness and business correctness of serialization and deserialization often takes a long time, and a good debugging mechanism greatly improves development efficiency. Serialized binary strings are usually not human-readable. To verify serialization results, the writer has to write a deserialization program as well, or provide a query platform, which is costly. On the other hand, if the reader fails to deserialize, it is hard to locate whether the cause is a bug in its own deserializer or incorrect data serialized by the writer. For cross-company debugging the problem becomes even worse, for the following reasons:

First, the support is not in place, and cross-company debugging may not get timely support after problems occur, which greatly prolongs the debugging cycle.

Second, access restrictions, the debugging phase of the query platform may not be open to the public, which increases the verification difficulty of the reader.

If the serialized data is human-readable, debugging efficiency improves greatly; XML and JSON have this advantage.

Performance

Performance includes two aspects, time complexity and space complexity:

First, space overhead (Verbosity), serialization needs to add a description field to the original data for deserialization parsing. If the extra overhead introduced by the serialization process is too high, it may lead to excessive pressure on networks, disks, and so on. For massive distributed storage systems, the amount of data is often in TB units, and the huge extra space overhead means high cost.

Second, time overhead (Complexity), complex serialization protocols can lead to long parsing time, which may make the serialization and deserialization phases become the bottleneck of the whole system.
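The space-overhead point is easy to observe even on a tiny payload: a serialization protocol adds descriptive framing around the raw data. A small illustration using JDK built-in serialization (the exact overhead differs per protocol; the helper name is illustrative):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Overhead {
    // Serialize with the JDK's built-in mechanism, which prefixes stream
    // magic, a type tag and a length before the payload bytes.
    static byte[] javaSerialize(Serializable obj) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String payload = "hello";
        byte[] raw = payload.getBytes(StandardCharsets.UTF_8); // 5 bytes of actual data
        byte[] framed = javaSerialize(payload);                // data plus descriptive framing
        System.out.println(raw.length);    // 5
        System.out.println(framed.length); // larger: the framing is pure space overhead
    }
}
```

At terabyte scale, this per-record overhead multiplies into significant network, disk, and cost pressure, which is why compact protocols matter.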

Scalability / compatibility

In the mobile Internet era, the requirements of business systems change faster, new requirements keep emerging, and old systems still need maintenance. If a serialization protocol has good scalability and supports adding new business fields without affecting old services, it greatly improves the flexibility of the system.
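Tag-based binary protocols (discussed later, e.g. Thrift and Protobuf) achieve this kind of compatibility by numbering each field, so an old reader can simply skip tags it does not know. A toy sketch of the idea, not any real wire format:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class TaggedDecode {
    // Toy tagged format: each field is (tag byte, length byte, UTF-8 value bytes).
    // A reader skips tags it does not recognize, so new fields can be added
    // by the writer without breaking deployed readers.
    static Map<Integer, String> decode(byte[] data, Set<Integer> knownTags) {
        Map<Integer, String> out = new LinkedHashMap<>();
        int i = 0;
        while (i < data.length) {
            int tag = data[i++] & 0xFF;
            int len = data[i++] & 0xFF;
            if (knownTags.contains(tag)) {
                out.put(tag, new String(data, i, len, StandardCharsets.UTF_8));
            }
            i += len; // known or not, advance past the value
        }
        return out;
    }

    public static void main(String[] args) {
        // Writer emits tag 1 ("messi") plus a newer tag 2 the reader predates
        byte[] wire = {1, 5, 'm', 'e', 's', 's', 'i', 2, 2, 'v', '2'};
        Map<Integer, String> fields = decode(wire, Set.of(1));
        System.out.println(fields); // {1=messi} -- the unknown tag 2 was skipped safely
    }
}
```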

Security / access restriction

In serialization selection, security considerations usually arise in cross-LAN access scenarios. When communication happens between companies or across machine rooms, cross-LAN access is often limited, for security reasons, to HTTP/HTTPS on ports 80 and 443. Using a serialization protocol without compatible, mature support from an HTTP transport framework may lead to one of three results:

First, service availability is reduced because of access restrictions.

Second, it is forced to re-implement the security protocol, which leads to a great increase in implementation cost.

Third, open more firewall ports and protocol access at the expense of security.

# III. Serialized and deserialized components

Typical serialization and deserialization processes often require the following components:

IDL (Interface description language) file: the parties involved in the communication need to make a convention (Specification) about the content of the communication. To establish a convention independent of any language or platform, it must be described in a language that is itself independent of specific development languages and platforms. Such a language is called an interface description language (IDL), and a convention written in IDL is called an IDL file.

IDL Compiler: for the convention to be usable across languages and platforms, a compiler is needed to convert IDL files into libraries for each target language.

Stub/Skeleton Lib: the working code responsible for serialization and deserialization. Stub is a piece of code deployed on the client side of a distributed system. On the one hand, it receives the parameters of the application layer and sends them to the server through the underlying protocol stack. On the other hand, it receives the serialized result data of the server side and delivers it to the client application layer after deserialization. Skeleton is deployed on the server side, and its function is opposite to Stub. It receives serialization parameters from the transport layer, deserializes them to the server application layer, and serializes the execution results of the application layer and finally transmits them to the client Stub.

Client/Server: the application layer program code, which works with the language-specific class or struct generated from the IDL.

Underlying protocol stack and Internet: the serialized data is converted into digital signals and transmitted over the Internet through the underlying transport layer, network layer, link layer, and physical layer protocols.

Comparison between serialization components and database access components

Database access is familiar to many engineers, and the components involved are relatively easy to understand. Comparing the components used in the serialization process with their database-access counterparts helps us better grasp the concepts of serialization-related components. (The comparison table is not reproduced here.)

# IV. Several common serialization and deserialization protocols

The early serialization protocols of the Internet were mainly COM and CORBA.

COM was mainly used on the Windows platform and was never truly cross-platform. In addition, COM serialization makes use of compiler-generated virtual tables, which makes it expensive to learn (imagine an engineer who needs only a simple serialization protocol but must first master the language's compiler). Extending attributes is also cumbersome because the serialized data is tightly coupled to the compiler.

CORBA was an early, fairly good implementation of a cross-platform, cross-language serialization protocol. Its main problems were too many versions caused by too many participants, poor compatibility between versions, and complex, obscure usage. These problems of governance, technical implementation, and immature early design eventually led to CORBA's gradual demise. Since J2SE 1.3, RMI-IIOP, based on the CORBA protocol, has allowed Java developers to program CORBA in pure Java.

This article mainly introduces and compares several popular serialization protocols, including XML, JSON, Protobuf, Thrift and Avro.

An example

As mentioned earlier, serialization and deserialization often appear obscurely and covertly, intertwined with other concepts. To better understand how each protocol realizes serialization and deserialization concretely, we thread one example through the various protocols: passing a piece of user information across multiple systems. At the application layer, using Java, the class objects we face are as follows:

```java
class Address
{
    private String city;
    private String postcode;
    private String street;
}

public class UserInfo
{
    private Integer userid;
    private String name;
    private List<Address> address;
}
```

XML & SOAP

XML is a common serialization and deserialization protocol with cross-machine and cross-language advantages. XML has a long history; its 1.0 version became a standard as early as 1998 and has been widely used ever since. The original goal of XML was to mark up Internet documents, so its design philosophy includes readability for both humans and machines. However, when this document markup design is used to serialize objects, it becomes verbose and complex. XML is essentially a description language with self-describing properties, so XML itself can be used as the IDL for XML serialization. There are two standard XML description formats: DTD (Document Type Definition) and XSD (XML Schema Definition). As a human-readable description language, XML is also widely used in configuration files, such as O/R mapping and Spring bean configuration files.
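To make XML's readability (and verbosity) concrete, here is a minimal sketch that parses a hand-written XML rendering of the UserInfo example with the JDK's DOM parser; the element names are chosen for illustration and are not a standard mapping:

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XmlDemo {
    // Deserialize one field out of an XML document using the JDK's DOM parser
    static String cityOf(String xml) {
        try {
            DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = b.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("city").item(0).getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Every value pays for an opening and a closing tag: readable, but verbose
        String xml = "<userinfo><userid>1</userid><name>messi</name>"
                   + "<address><city>Beijing</city><postcode>1000000</postcode>"
                   + "<street>wangjingdonglu</street></address></userinfo>";
        System.out.println(cityOf(xml)); // Beijing
    }
}
```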

SOAP (Simple Object Access Protocol) is a widely used, XML-based structured messaging protocol for serialization and deserialization. SOAP's impact on the Internet is so large that SOAP-based solutions have their own name: Web service. Although SOAP can ride on various transport layer protocols, its most common use is XML + HTTP. The main interface description language of SOAP is WSDL (Web Service Description Language). SOAP is secure, extensible, cross-language and cross-platform, and supports multiple transport layer protocols. When cross-platform and cross-language are not required, some languages offer very convenient XML serialization that needs no IDL file or third-party compiler, such as Java with XStream.

Self-description and recursion

SOAP is a protocol that uses XML for serialization and deserialization, and its IDL is WSDL; the description file of WSDL is XSD, while XSD is itself an XML file. This gives rise to an interesting phenomenon, called "recursion" in mathematics, which often occurs in self-describing things.

Examples of IDL files

An example of using WSDL to describe the above basic user information is omitted here.

Typical application scenarios and non-application scenarios

The SOAP protocol has a broad user base; its HTTP-based transport gives it good firewall-traversal characteristics, the human readability of XML gives it outstanding debuggability, and ever-growing Internet bandwidth largely compensates for its high space overhead (verbosity). It is a good choice for inter-company services that transfer relatively small amounts of data or have relatively relaxed real-time requirements (on the order of seconds).

Because of XML's large extra space overhead, the amount of data grows sharply after serialization; for applications that persist massive amounts of data, this means huge memory and disk overhead, so XML is unsuitable there. In addition, the time and space costs of XML serialization and deserialization are both large, so it is not recommended for services with millisecond-level performance requirements. Although WSDL can describe objects, and the S in SOAP stands for Simple, using SOAP is by no means simple: for users accustomed to object-oriented programming, WSDL files are far from intuitive.

JSON (Javascript Object Notation)

JSON originates from the weakly typed language JavaScript and from a concept called the "associative array"; its essence is to describe an object as attribute-value pairs. In fact, in weakly typed languages such as JavaScript and PHP, classes are implemented as associative arrays. JSON quickly became one of the most widely used serialization protocols because of the following advantages:

1. The associative array format matches engineers' understanding of objects very well.

2. It keeps XML's advantage of human readability.

3. Compared with XML, the serialized data is more compact. Research at the following link shows that a serialized XML file is nearly twice the size of its JSON counterpart: http://www.codeproject.com/Articles/604720/JSON-vs-XML-Some-hard-numbers-about-verbosity

4. It has native support in JavaScript, so it is widely used in Web browser applications and is the de facto standard protocol of Ajax.

5. Compared with XML, the protocol is simpler and parsing is faster.

6. The loose associative array gives it good extensibility and compatibility.

IDL paradox

JSON is so simple, and so similar to classes in various languages, that serializing with JSON needs no IDL. It seems miraculous: a serialization protocol that is naturally cross-language and cross-platform. The truth is less magical; the illusion comes from two facts:

First, the associative array is the class concept of weakly typed languages; it is the actual implementation of classes in PHP and JavaScript, so JSON is naturally well supported in these languages.

Second, the purpose of an IDL is to produce IDL files that an IDL compiler compiles into the code (Stub/Skeleton) that actually performs serialization and deserialization. Because associative arrays and classes in common languages are so similar, with a one-to-one correspondence between them, a single set of standard code can perform the conversion. For weakly typed languages that support associative arrays, the language itself can manipulate JSON-serialized data; for strongly typed languages like Java, reflection offers a uniform solution, for example Google's Gson.
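The reflection idea can be sketched in a few lines of Java. This is a toy, not how Gson actually works: real libraries handle nesting, collections, nulls, and string escaping, all of which this sketch ignores:

```java
import java.lang.reflect.Field;

public class TinyJson {
    // Toy reflection-based serializer: walk the declared fields of any object
    // and emit them as "name":value pairs, quoting everything but numbers.
    static String toJson(Object obj) {
        StringBuilder sb = new StringBuilder("{");
        Field[] fields = obj.getClass().getDeclaredFields();
        for (int i = 0; i < fields.length; i++) {
            fields[i].setAccessible(true);
            Object value;
            try {
                value = fields[i].get(obj);
            } catch (IllegalAccessException e) {
                throw new RuntimeException(e);
            }
            if (i > 0) sb.append(",");
            sb.append("\"").append(fields[i].getName()).append("\":");
            if (value instanceof Number) sb.append(value);
            else sb.append("\"").append(value).append("\"");
        }
        return sb.append("}").toString();
    }

    static class User { int userid = 1; String name = "messi"; }

    public static void main(String[] args) {
        System.out.println(toJson(new User())); // e.g. {"userid":1,"name":"messi"}
    }
}
```

This is exactly the one-to-one class-to-associative-array correspondence mentioned above, realized through the language's reflection API instead of generated Stub/Skeleton code.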

Typical application scenarios and non-application scenarios

JSON can replace XML in many application scenarios, which is more concise and faster. Typical application scenarios include:

1. Services with relatively small amount of data transferred between companies and relatively low real-time requirements (for example, second level).

2. Ajax request based on Web browser.

3. Scenarios where the interface changes frequently and high debuggability is required, thanks to JSON's strong forward and backward compatibility, such as communication between a mobile app and its server.

4. Since the typical application scenario of JSON is JSON+HTTP, it is suitable for access across firewalls.

On the other hand, the extra space overhead of JSON serialization is large, which means huge memory and disk overhead for big-data services or persistence, so it is unsuitable for such scenarios. The lack of a unified IDL weakens the constraints on the participants; in practice the parties can only agree via documentation, which may inconvenience debugging and lengthen the development cycle. And since JSON serialization and deserialization require reflection in some languages, it is not recommended for services with millisecond-level performance requirements.

Example of serialized data

Here is UserInfo after JSON serialization:

```json
{"userid": 1, "name": "messi", "address": [{"city": "Beijing", "postcode": "1000000", "street": "wangjingdonglu"}]}
```

Thrift

Thrift is a high-performance, lightweight RPC service framework open-sourced by Facebook, created to meet today's needs for big-data, distributed, cross-language and cross-platform data communication. Thrift is not merely a serialization protocol but a full RPC framework. Compared with JSON and XML, it greatly improves on both space overhead and parsing performance, making it an excellent RPC solution for distributed systems with high performance requirements. However, because serialization is embedded in the Thrift framework, and the framework does not expose stand-alone serialization and deserialization interfaces, it is difficult to combine with other transport layer protocols such as HTTP.

Typical application scenarios and non-application scenarios

Thrift is an excellent solution for high-performance, distributed RPC services. It supports many languages and rich data types, and it handles the addition and deletion of data fields with strong compatibility, so it is well suited as a company-internal standard RPC framework for service-oriented architecture (SOA).

However, Thrift's documentation is relatively scarce and its user base is still relatively small. Because its server is based on its own socket service, security is a concern when crossing firewalls, so be careful with inter-company communication. Moreover, Thrift-serialized data is a binary array that is not human-readable, which makes debugging relatively difficult. Finally, because Thrift serialization is tightly coupled with the framework, it cannot read and write data directly against a persistence layer, so it is not suitable as a data persistence serialization protocol.

Examples of IDL files

The code is as follows:

```thrift
struct Address
{
    1: required string city;
    2: optional string postcode;
    3: optional string street;
}

struct UserInfo
{
    1: required i32 userid;
    2: required string name;
    3: optional list<Address> address;
}
```

Protobuf

Protobuf has many typical features required for an excellent serialization protocol:

1. Standard IDL and IDL compilers, which make it very engineer-friendly.

2. The serialized data is very concise and compact; compared with XML, the serialized size is about 1/3 to 1/10.

3. Parsing is very fast, about 20-100 times faster than the corresponding XML.

4. It provides very friendly dynamic libraries that are easy to use; deserialization needs only one line of code.

Protobuf is a pure presentation layer protocol that can be used with various transport layer protocols, and its documentation is very complete. But because Protobuf comes from Google, it currently supports only Java, C++ and Python. In addition, Protobuf supports relatively few data types and does not support constant types. Because its design goal is a pure presentation layer protocol, there is no RPC framework bundled with it.
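Part of Protobuf's compactness comes from base-128 varint encoding, which Protobuf's documentation describes for integer fields: 7 payload bits per byte, with the high bit marking continuation, so small numbers need a single byte instead of a fixed 4 or 8. A sketch of the scheme (not generated Protobuf code):

```java
import java.io.ByteArrayOutputStream;

public class Varint {
    // Encode an unsigned value as a base-128 varint: 7 bits per byte,
    // high bit set on every byte except the last.
    static byte[] encode(long n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
        return out.toByteArray();
    }

    // Reassemble the value from its 7-bit groups (least significant group first)
    static long decode(byte[] bytes) {
        long n = 0;
        int shift = 0;
        for (byte b : bytes) {
            n |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(encode(1).length);    // 1 byte
        System.out.println(encode(300).length);  // 2 bytes: 0xAC 0x02
        System.out.println(decode(encode(300))); // 300
    }
}
```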

Typical application scenarios and non-application scenarios

Protobuf has a broad user base; its low space overhead and high parsing performance are its highlights, making it very suitable for internal RPC calls with high performance requirements. Because Protobuf provides a standard IDL and a corresponding compiler, its IDL files form a strong business contract among all participants. In addition, Protobuf is transport-layer agnostic, and HTTP has good firewall-traversal properties, so Protobuf also suits inter-company scenarios with high performance requirements. Thanks to its high parsing performance and relatively small serialized size, it is also well suited to persisting application layer objects.

Its main problem is that it supports relatively few languages, and because there is no bound standard underlying transport layer protocol, it is relatively troublesome to debug the transport layer protocol between companies.

Examples of IDL files

The code is as follows:

```protobuf
message Address
{
    required string city = 1;
    optional string postcode = 2;
    optional string street = 3;
}

message UserInfo
{
    required int32 userid = 1;
    required string name = 2;
    repeated Address address = 3;
}
```

Avro

Avro arose to address JSON's verbosity and lack of an IDL; it is a subproject of Apache Hadoop. Avro provides two serialization formats: JSON and binary. The binary format is comparable to Protobuf in space overhead and parsing performance, while the JSON format is convenient for debugging during testing. Avro supports rich data types, including union types like those in C. It supports a JSON-format IDL as well as an IDL similar to Thrift's and Protobuf's (still experimental), and the two are interconvertible. The schema can be transmitted along with the data, which, combined with JSON's self-describing property, makes Avro a very good fit for dynamically typed languages. When Avro persists files, the schema is usually stored together with the data, so Avro files are themselves self-describing, which makes Avro very suitable as a persistent data format for Hive, Pig and MapReduce. As for different schema versions, the server and client can negotiate the schema during the handshake phase of an RPC call, which greatly improves the final data parsing speed.

Typical application scenarios and non-application scenarios

Avro has high parsing performance and the serialized data is very concise, so it is suitable for high-performance serialization services.

On the other hand, Avro's non-JSON IDL is still experimental, and its JSON-format IDL is not intuitive for engineers accustomed to statically typed languages.

Examples of IDL files

The code is as follows:

```avdl
protocol Userservice {
    record Address {
        string city;
        string postcode;
        string street;
    }

    record UserInfo {
        string name;
        int userid;
        array<Address> address = [];
    }
}
```

The corresponding JSON Schema format is as follows:

```json
{
    "protocol": "Userservice",
    "namespace": "org.apache.avro.ipc.specific",
    "version": "1.0.5",
    "types": [{
        "type": "record",
        "name": "Address",
        "fields": [
            {"name": "city", "type": "string"},
            {"name": "postcode", "type": "string"},
            {"name": "street", "type": "string"}
        ]
    }, {
        "type": "record",
        "name": "UserInfo",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "userid", "type": "int"},
            {"name": "address", "type": {"type": "array", "items": "Address"}, "default": []}
        ]
    }],
    "messages": {}
}
```

# V. Benchmark and suggestions for type selection

## Benchmark

The following data are from https://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking

The benchmark compares parsing performance and serialization space overhead; the original charts are not reproduced here. The following conclusions can be drawn from the data:

1. XML serialization (Xstream) is poor in terms of performance and simplicity.

2. Compared with Protobuf, Thrift is at some disadvantage in both time and space cost.

3. Protobuf and Avro perform very well in both aspects.

## Type selection suggestions

The five serialization and deserialization protocols described above have their own characteristics and are suitable for different scenarios:

1. For inter-company system calls, where the performance requirement is on the order of 100 ms or above, SOAP based on XML is a solution worth considering.

2. JSON is the first choice for browser-based Ajax and for communication between mobile apps and servers. More broadly, JSON is a very good choice where performance requirements are not too high, dynamically typed languages dominate, or the transferred payload is very small.

3. For scenarios with a poor debugging environment, using JSON or XML can greatly improve debugging efficiency and reduce system development cost.

4. Where the requirements for performance and compactness are high, Protobuf, Thrift and Avro compete with one another.

5. For terabyte-scale data persistence scenarios, Protobuf and Avro are the first choices. If the persisted data is stored in a Hadoop subproject, Avro is the better choice.

6. Because the design concept of Avro is biased towards dynamic typed language, Avro is a better choice for dynamic language-based application scenarios.

7. For non-Hadoop persistence projects dominated by statically typed languages, Protobuf better matches the development habits of statically typed language engineers.

8. If you need to provide a complete RPC solution, Thrift is a good choice.

9. If the serialized data must work with different transport layer protocols, or in high-performance scenarios that must cross firewalls, Protobuf is a good first choice.

