
Analyze the principle of Java distributed system


This article explains the principles behind Java distributed systems. The material is presented simply and clearly so that it is easy to follow and understand.

1 Concepts

1.1 Model

Node

In a concrete engineering project, a node is usually a process running on an operating system. In the model used in this article, a node is treated as a complete, indivisible whole; if a process actually consists of several relatively independent parts, it can be modeled as multiple nodes.

Exceptions

Machine downtime: machine downtime is one of the most common exceptions. In a large cluster, the daily probability of any given machine going down is roughly 1/1000. In practice, the recovery time of a downed machine is usually assumed to be 24 hours, and restarting it generally requires manual intervention.

Network exceptions: messages are lost, and two nodes may become completely unable to communicate with each other, that is, a "network partition" occurs.

Messages may arrive out of order: with some probability they do not reach the destination node in the order in which they were sent. Mechanisms such as sequence numbers can be used to handle out-of-order messages, so that invalid and expired messages do not affect the correctness of the system.

Data corruption; unreliable TCP: the TCP protocol provides the application layer with reliable, connection-oriented transmission, but when designing protocols for distributed systems it cannot be assumed that all network communication is based on TCP.

TCP can only guarantee that messages within the same TCP connection are not reordered; it cannot guarantee the ordering of messages across different TCP connections.

The distributed three states: if a node makes an RPC (Remote Procedure Call) to another node, that is, node A sends a message to node B, node B performs some operation according to the received message and returns the result to node A in another message, then the result of the RPC has three possible states: "success", "failure", and "timeout (unknown)". These are called the three states of a distributed system.

Storage data loss: for stateful nodes, data loss means state loss, and usually the stored state can only be read and recovered from other nodes.

Exception handling principle: the golden rule of exception handling, validated by a great deal of engineering practice, is that any exception considered during the design phase will eventually occur in the running system, while the running system will also encounter exceptions that were not considered in the design. Therefore, unless the requirements explicitly allow it, no exception may be ignored in the system design.

1.2 Replicas

A replica (copy) is redundancy provided for data or services in a distributed system. A data replica means that the same data is persisted on different nodes, so that when the data stored on one node is lost, it can be read back from a replica.

Data replication is the only means by which a distributed system can cope with data loss. The other kind of replica is the service replica, in which several nodes redundantly provide the same kind of service. Such a service generally does not depend on a node's local storage; the data it needs usually comes from other nodes.

Replica protocol is the theoretical core of the whole distributed system.

Replica consistency

Through a replica control protocol, a distributed system ensures that, under certain constraints, the data read from any replica from outside the system is the same; this is called replica consistency (consistency). Replica consistency is a property of the distributed system as a whole, not of any particular replica.

Strong consistency: at any time, any user or node can read the most recently successfully updated data. Strong consistency is the strongest consistency requirement and also the most difficult to achieve in practice.

Monotonic consistency: at any time, once a user has read the value of a piece of data after some update, that user will never again read a value older than it.

Monotonic consistency is weaker than strong consistency but very practical, because in general a user only cares about the consistency observed from their own point of view, not about the consistency observed by other users.

Session consistency: once a user has read the value of some data after an update within a session, the user will not read a value older than it for the remainder of that session.

By introducing the concept of a session, session consistency further relaxes the constraints of monotonic consistency: it only guarantees monotonic data visibility within a single session of a single user, and makes no guarantee across different users or across different sessions of the same user.

In practice, many mechanisms correspond to the concept of a session, such as the session concept in PHP.

Eventual consistency: eventual consistency requires that once an update succeeds, the data on all replicas eventually reaches a fully consistent state, but the time needed to reach that state cannot be guaranteed.

In an eventually consistent system, as long as a user always reads from the same replica, an effect similar to monotonic consistency is obtained; but once the user switches to a different replica, no consistency can be guaranteed.

Weak consistency: after an update succeeds, users are not guaranteed to read the updated value within any bounded time, and even if the new value is read on one replica, there is no guarantee it will be read on other replicas.

Weakly consistent systems are generally difficult to use in practice; using one requires additional work on the application side to make the system usable.

1.3 Metrics for evaluating distributed systems

Performance: the throughput of a system is the total amount of data the system can process in a given time, usually measured as the amount of data processed per second.

The response latency of a system is the time it takes for the system to complete a given operation.

The concurrency of a system is its ability to handle many operations simultaneously, usually measured in QPS (queries per second).

These three performance metrics often constrain one another: a system that pursues high throughput usually finds it hard to achieve low latency, and when the average response time is long it is also hard to raise QPS.

Availability: availability of the system refers to the ability of the system to provide services correctly in the face of various exceptions.

The availability of a system can be measured by the ratio of the time it is out of service to the time it serves normally, or by the ratio of failed to successful attempts at some operation. Availability is an important metric of a distributed system; it measures the system's robustness and reflects its fault tolerance.

Scalability: the scalability of a distributed system refers to its ability to improve performance (throughput, latency, concurrency), storage capacity, and computing power by growing the size of the machine cluster. A good distributed system always pursues "linear scalability", meaning that some metric of the system grows linearly with the number of machines in the cluster.

Consistency: to improve availability, distributed systems inevitably use replication, which raises the problem of replica consistency. The stronger the consistency model, the easier the system is for users to use.

2 Principles of distributed systems

2.1 Data distribution methods

As the name implies, a distributed system uses multiple computers to solve computing, storage, and other problems that a single computer cannot solve.

The biggest difference between a stand-alone system and a distributed system lies in the scale of the problem, that is, the difference in the amount of data calculated and stored.

To solve a single-machine problem with a distributed solution, the first task is to break the problem into subsets that can each be solved within the multi-machine distributed scheme. Since the input of the problem is data, whether for computation or storage, how to partition the input data is the fundamental problem of a distributed system.

Hash distribution

Hash distribution maps each piece of data to a machine by hashing some feature of the data (such as a key) and taking the result modulo the number of machines. Its drawbacks are also obvious, most notably poor scalability: once the cluster needs to grow, almost all the data has to be migrated and redistributed. In practice, when expanding a hash-distributed system, the cluster size is often doubled and the hashes are recomputed, so that only half of the data on each machine needs to be migrated to a single corresponding new machine to complete the expansion.
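
The following minimal Java sketch (class and method names are illustrative, not from the original text) shows modulo-based placement and why changing the machine count forces data to move:

import java.util.Objects;

// Hash distribution sketch: a record is mapped to a machine by hashing its key
// and taking the result modulo the machine count.
public class HashDistribution {
    static int machineFor(String key, int machineCount) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(Objects.hashCode(key), machineCount);
    }

    public static void main(String[] args) {
        String key = "user-42";
        System.out.println("5 machines  -> machine " + machineFor(key, 5));
        // Doubling the cluster changes the modulus, so keys are reassigned;
        // this is the poor scalability described above.
        System.out.println("10 machines -> machine " + machineFor(key, 10));
    }
}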

To address the poor scalability of hashing, one approach is not to map hash values to machines simply by taking a modulus, but to manage the hash-value-to-machine mapping as metadata on a dedicated metadata server. At the same time, the number of hash buckets is made much larger than the number of machines, so that a single machine is responsible for the data of several buckets; when the cluster grows, buckets can be reassigned to new machines. However, the large amount of metadata then has to be maintained by a more complex mechanism.

Another disadvantage of hash distribution is that once the data associated with a particular feature value is severely uneven, the problem of "data skew" easily arises.

Distribution by data range

Distribution by data range is another common data distribution method: the data is divided into different intervals according to the range of some feature value, and each server in the cluster handles the data of different intervals.

In practice, to make load-balancing operations such as data migration easier, dynamic interval partitioning is often used so that each interval serves roughly the same amount of data. When the amount of data in an interval becomes large, the interval is split into two, keeping the amount of data in each interval below a fixed threshold as far as possible.

In general, a dedicated server is needed to maintain the data distribution information in memory; this information is a kind of metadata. For very large clusters, the metadata may be too large for a single machine to maintain on its own, and multiple machines must be used as metadata servers.
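
As a rough illustration (class names are assumptions, not from the original), the routing metadata for range distribution can be kept in a sorted map from interval start key to server, and splitting an interval simply adds a new entry:

import java.util.TreeMap;

// Sketch of distribution by data range: a sorted map from the start key of each
// interval to the server responsible for it. This is the kind of routing table a
// dedicated metadata server would hold in memory.
public class RangeDistribution {
    private final TreeMap<String, String> rangeStartToServer = new TreeMap<>();

    void assign(String rangeStartKey, String server) {
        rangeStartToServer.put(rangeStartKey, server);
    }

    String serverFor(String key) {
        // The owning interval is the one with the greatest start key <= key.
        return rangeStartToServer.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        RangeDistribution meta = new RangeDistribution();
        meta.assign("", "server-1");   // interval ["", "m")
        meta.assign("m", "server-2");  // interval ["m", ...)
        System.out.println(meta.serverFor("alice")); // server-1
        // Dynamic splitting: when ["m", ...) grows too large, register a new
        // interval starting at "t" and hand it to a new server.
        meta.assign("t", "server-3");
        System.out.println(meta.serverFor("tom"));   // server-3
    }
}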

Distribution by amount of data

Distribution by data amount is independent of any specific data feature: the data is treated as a sequentially growing file, which is split into data blocks (chunks) of a relatively fixed size, and different blocks are distributed to different servers.

Similar to the way you distribute data by data range, you also need to record the specific distribution of data blocks and manage the distribution information as metadata using a metadata server.
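
A minimal sketch under assumed names and an assumed 64 MB chunk size: the chunk index is derived from the byte offset, and a metadata map records which server holds each chunk:

import java.util.HashMap;
import java.util.Map;

// Sketch of distribution by data amount: the data stream is cut into fixed-size
// chunks, and a metadata map records which server stores each chunk.
public class ChunkDistribution {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB, an assumed size

    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    public static void main(String[] args) {
        Map<Long, String> chunkToServer = new HashMap<>();
        chunkToServer.put(0L, "server-1");
        chunkToServer.put(1L, "server-2");
        long offset = 70L * 1024 * 1024;  // falls inside the second chunk
        System.out.println(chunkToServer.get(chunkIndex(offset))); // server-2
    }
}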

Because this method is independent of the actual data content, distribution by data amount generally does not suffer from data skew; the data is always evenly sliced and spread across the cluster.

When the cluster needs to be rebalanced, this can be done simply by migrating data blocks. Cluster expansion is also largely unconstrained: it is completed by migrating some of the data blocks to the newly added machines.

The disadvantage of distribution by data amount is that, as with distribution by range, more complex metadata has to be managed. When the cluster is large, the amount of metadata also becomes very large, and managing it efficiently becomes a new problem.

Consistent hash

Consistent hashing is another data distribution method widely used in engineering. It was originally devised as a common data distribution algorithm for distributed hash tables (DHT) in P2P networks.

The basic approach of consistent hashing is to hash the data (or some feature of it) with a hash function whose output range forms a closed ring, that is, the largest output value of the hash function wraps around to precede the smallest. Nodes are placed at random positions on the ring, and each node is responsible for all data whose hash value falls in the range from its own position clockwise up to the next node.

Using consistent hashing requires managing the positions of the nodes on the ring as metadata, which is more complex than plain hash distribution. However, the node position information depends only on the number of machines in the cluster, and its volume is usually much smaller than the metadata needed for distribution by data range or by data amount.

A weakness of the basic scheme is that randomly placed nodes tend to divide the ring unevenly, and when a node is added or removed the load change falls entirely on its neighbors. For this reason, a common improvement is to introduce the concept of virtual nodes. A large number of virtual nodes is created when the system starts, generally far more than the number of machines the cluster will ever have. The virtual nodes are evenly distributed on the consistent hash ring and play the same role as the nodes in the basic algorithm; each real node is then assigned several virtual nodes.

When operating on a piece of data, we first use its hash value to find the corresponding virtual node on the ring, and then look up the metadata to find the real node that owns it. Using virtual nodes brings several advantages.

First, when a node becomes unavailable, several virtual nodes become unavailable with it, so the load of the failed node is spread across multiple adjacent real nodes. Similarly, when a new node is added, it is assigned several virtual nodes and therefore takes over load from multiple existing nodes, which makes it easier to achieve global load balancing during expansion.
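
A minimal Java sketch of a consistent hash ring with virtual nodes (the class names and the simple hash function are assumptions for illustration): each physical node registers many positions on a sorted ring, and a key is served by the first virtual node clockwise from its hash:

import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hashing sketch: ring positions (virtual nodes) map to physical nodes.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>(); // hash -> physical node

    void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        // Wrap around to the smallest hash when we pass the ring's maximum.
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // simple non-negative hash, enough for a sketch
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-A", 100);
        ring.addNode("node-B", 100);
        ring.addNode("node-C", 100);
        System.out.println(ring.nodeFor("user-42"));
    }
}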

Replicas and data distribution

The basic means by which a distributed system achieves fault tolerance and improves availability is replication, and the way data replicas are distributed mainly affects the system's scalability. A basic replication strategy uses the machine as the unit: several machines are replicas of one another and hold exactly the same data. This strategy works with all the data distribution methods above; its advantage is simplicity, but its drawbacks are low data-recovery efficiency and poor scalability.

A better approach is not to use the machine as the unit of replication, but to split the data into more reasonably sized data segments and replicate at the granularity of data segments.

In practice, data segments are usually kept as equal in size as possible and bounded by some limit. Data segments go by many names: segment, fragment, chunk, partition, and so on. How data segments are chosen is directly related to the data distribution method.

For hash distribution, each hash bucket can serve as a data segment; to control segment size, the number of buckets is usually much larger than the cluster size. Once the data is divided into segments, replicas can be managed per segment, so that replicas are no longer tied to particular machines and each machine can hold replicas of various segments.

Once replica placement is decoupled from machines, recovery after data loss becomes very efficient. This is because, when one machine loses its data, the replicas of its data segments are spread over all the machines in the cluster rather than over just a few replica machines, so data can be copied and restored from the whole cluster in parallel, with each source machine contributing only a small amount of resources. Even if every source machine is throttled to 1MB/s, with 100 machines participating the recovery speed reaches 100MB/s.

Furthermore, machine-independent replica placement also helps cluster fault tolerance: when a machine goes down, its load naturally spreads across the whole cluster, because the replicas that were on the downed machine are scattered throughout the cluster.

Finally, machine-independent replica placement also helps cluster expansion. In theory, if the cluster has N machines, when a new machine is added, a new load balance can be achieved by migrating roughly a 1/(N+1) fraction of the data segments from the existing machines to the new one. Because data is migrated from every machine in the cluster, this is, like data recovery, also efficient.

In practice, replicating strictly per data segment increases the amount of metadata to manage, and maintaining replicas becomes correspondingly harder. A compromise is to group several data segments into a segment group and manage replicas at the granularity of segment groups, which keeps replica granularity within a more suitable range.

Localized computing

In distributed systems, the distribution of data also deeply affects the distribution of computing. In a distributed system, computing nodes and storage nodes that store computing data can be on the same physical machine or on different physical machines.

If the computing node and the storage node are on different physical machines, the data to be computed must be transferred over the network, which is costly and may even make network bandwidth the overall bottleneck of the system.

The alternative is to schedule the computation, as far as possible, onto a computing node on the same physical machine as the storage node that holds the data; this is called localized computing. Localized computing is an important optimization of computation scheduling and embodies an important idea in distributed scheduling: moving computation is cheaper than moving data.

The choice of data distribution mode

In engineering practice, the data distribution method should be chosen according to the requirements and the implementation complexity. The methods can also be combined flexibly, often gaining the advantages of several approaches and achieving a better overall result.

For example, data skew can be addressed by introducing distribution by data amount on top of hash distribution. Suppose data is distributed by the hash of the user id; when one user's data volume is exceptionally large, all of that user's data lands on a single machine. Distribution by data amount is then introduced: user data volumes are tracked, and when a user exceeds a certain threshold, that user's data is cut into multiple even data segments which are spread across the cluster. Since most users' data does not exceed the threshold, only the segment distribution information of the users who exceed it needs to be kept in the metadata, which keeps the metadata small. This scheme, combining hash distribution with distribution by data amount, has been used in real systems with good results.

2.2 Basic replica protocols

A replica control protocol is a distributed protocol that controls the reading and writing of replica data according to a specific protocol flow, so that the replicas satisfy certain availability and consistency requirements. A replica control protocol must tolerate abnormal states to give the system a degree of availability, and it must provide a certain level of consistency. According to the CAP principle (analyzed in detail in Section 2.9), no replica protocol can both satisfy strong consistency and remain available under arbitrary network anomalies. Therefore, in practice, replica control protocols always trade off among availability, consistency, and performance according to the specific requirements.

Replica control protocols can be divided into two categories: centralized (centralized) replica control protocol and decentralized (decentralized) replica control protocol.

Centralized replica control protocol

The basic idea of the centralized replica control protocol is that a central node coordinates the update of replica data and maintains the consistency between replicas.

The figure shows the general architecture of a centralized replica protocol. The advantage of the centralized replica control protocol is that the protocol is relatively simple: all replica-related control is handled by the central node, and concurrency control is performed by the central node as well, so a distributed concurrency-control problem is reduced to a single-machine concurrency-control problem.

Concurrency control means that when multiple nodes need to modify replica data at the same time, conflicts such as "write-write" and "read-write" must be resolved. Locking is a common concurrency-control technique in single-machine systems and is also common for distributed concurrency control, but without a central node to manage locks, a fully distributed locking system is needed, which makes the protocol very complex.

The disadvantage of the centralized replica control protocol is that the system's availability depends on the central node: when the central node fails or communication with it is interrupted, the system loses some of its services (usually at least the update service), so centralized protocols incur a certain amount of service downtime.

Primary-secondary protocol

In primary-secondary protocols, replicas fall into two classes: exactly one replica is the primary, and all other replicas are secondaries. The node holding the primary replica acts as the central node; it is responsible for coordinating data updates, concurrency control, and replica consistency.

Primary-secondary type protocols generally solve four kinds of problems: data update process, data reading mode, determination and switching of Primary copies, and data synchronization (reconcile).

Basic process of data update

Data updates are all coordinated by the primary node, following the steps below (a minimal code sketch follows the list).

The external node sends the update operation to the primary node

The primary node performs concurrency control, that is, it determines the order of the concurrent update operations.

The primary node sends the update operation to the secondary nodes.

The primary decides whether the update succeeded based on the completion results of the secondary nodes, and returns the result to the external node.
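
A minimal Java sketch of these steps (hypothetical class names; the success rule here is the simple "all secondaries applied it" variant, one of several possibilities discussed later in the text):

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// The primary orders the update, applies it, pushes it to the secondaries, and
// reports success only if every secondary applied it.
public class PrimaryReplica {
    private final ConcurrentHashMap<String, String> data = new ConcurrentHashMap<>();
    private final List<SecondaryReplica> secondaries;

    PrimaryReplica(List<SecondaryReplica> secondaries) { this.secondaries = secondaries; }

    // Called by an external node; synchronized = the primary's concurrency control.
    synchronized boolean update(String key, String value) {
        data.put(key, value);                        // apply on the primary's own replica
        boolean allOk = true;
        for (SecondaryReplica s : secondaries) {     // push the update to the secondaries
            allOk &= s.apply(key, value);
        }
        return allOk;                                // result reported back to the caller
    }

    public static void main(String[] args) {
        PrimaryReplica p = new PrimaryReplica(List.of(new SecondaryReplica(), new SecondaryReplica()));
        System.out.println(p.update("K1", "v1"));    // true once all secondaries applied it
    }
}

class SecondaryReplica {
    private final ConcurrentHashMap<String, String> data = new ConcurrentHashMap<>();
    boolean apply(String key, String value) { data.put(key, value); return true; }
}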

In engineering practice, if the primary sends the data to the other N replicas simultaneously, the update throughput of each secondary is limited by the primary's total outbound network bandwidth, at most 1/N of the primary's outbound bandwidth.

To solve this problem, some systems (for example, GFS) use relayed replication: the primary sends the update to the first secondary, the first secondary forwards it to the second secondary, and so on.

Data reading mode

The way data is read is also highly related to consistency. If only final consistency is required, reading any copy can satisfy the requirement.

If session consistency is required, a version number can be assigned to the replicas and incremented after every update; when a user reads a replica, the version number is checked, ensuring that the versions the user sees within a session only ever increase.
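
A minimal sketch of that version check on the client side (assumed names; a replica is represented here just by the version and value it returns):

// The client remembers the highest version seen in this session and ignores any
// replica whose version is older.
public class SessionConsistentReader {
    private long lastSeenVersion = 0;   // per-session state kept by the client

    // Returns null if the replica is too stale for this session.
    String read(long replicaVersion, String replicaValue) {
        if (replicaVersion < lastSeenVersion) {
            return null;                // stale replica: try another one
        }
        lastSeenVersion = replicaVersion;
        return replicaValue;
    }

    public static void main(String[] args) {
        SessionConsistentReader session = new SessionConsistentReader();
        System.out.println(session.read(5, "v5"));  // v5, session now at version 5
        System.out.println(session.read(3, "v3"));  // null: older than what was already read
    }
}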

The difficulty with primary-secondary lies in achieving strong consistency. Because data updates are all controlled by the primary, the data on the primary replica is always the most recent, so always reading only the primary replica achieves strong consistency. However, if only the primary is read, the secondary replicas provide no read service.

In practice, if replicas are not bound to machines but are maintained per data segment, having only primary replicas serve reads does not waste machine resources in many scenarios.

If the replicas of the data segments are spread over the cluster and the primary of each segment is also chosen more or less at random, then each machine holds primary replicas of some data segments and secondary replicas of others, so every server ends up providing both read and write service.

The availability of secondary replicas is controlled by the primary. When the primary fails to update a secondary replica, it marks that replica as unavailable so that users no longer read it. An unavailable secondary can keep trying to synchronize with the primary, and once synchronization completes the primary marks it available again.

This approach makes all available replicas, whether primary or secondary, readable: within a bounded time a secondary replica is either brought up to date with the primary or marked unavailable, which satisfies a high consistency requirement. The approach relies on a central metadata management system to record which replicas are available and which are not; in a sense, it improves the system's consistency by reducing its availability.

Determination and switching of primary copy

In primary-secondary-type protocols, another core problem is how to determine the primary copy, especially when there is an exception such as a downtime on the machine where the original primary copy resides, some mechanism is needed to switch the primary copy so that a secondary copy becomes a new primary copy.

In general, in primary-secondary-type distributed systems, the information about which replica is primary is meta-information and is maintained by a dedicated metadata server. When performing the update operation, the metadata server is first queried to obtain the primary information of the copy, so as to further execute the data update process.

Reliably detecting a node anomaly in a distributed system takes a certain amount of time, usually on the order of 10 seconds. This means that once a primary fails, up to 10 seconds may pass before the system can begin switching the primary. During these 10 seconds the system has no primary and cannot provide update service, and if the system only reads the primary replica it cannot even provide read service during this period. From this we can see that the biggest drawback of primary-secondary replica protocols is the downtime introduced by primary switchover.

Data synchronization

Inconsistent copies of secondary need to be synchronized with primary (reconcile).

There are usually three forms of inconsistency:

First, due to network partition or other anomalies, the data on a secondary lags behind the data on the primary.

Second, under some protocols the data on a secondary may be dirty and need to be discarded. Dirty data arises when an update operation was never performed on the primary replica, yet the superfluous modification was applied on a secondary replica, leaving the secondary's data incorrect.

Third, the secondary is a newly added replica that has no data at all and must copy data from other replicas.

For the first case where secondary data lags behind, a common synchronization method is to play back the operation log on the primary (usually the redo log) to catch up with the update progress of the primary.

For dirty data, it is preferable to design a distributed protocol that cannot produce it. If the protocol can produce dirty data, the probability of doing so should be kept very low, so that when dirty data does appear the whole replica can simply be discarded, which is equivalent to the replica having no data at all.

In addition, approaches based on undo logs can be designed to remove dirty data. If a secondary replica has no data at all, the usual practice is to copy the primary replica's data directly, which is often much faster than replaying logs to catch up. But for the primary to keep serving updates while being copied, it must support a snapshot facility: a snapshot of the replica's data at some instant is taken and copied, and after the copy completes the updates made since the snapshot are caught up by replaying the log.

Decentralized replica control protocol

The decentralized replica control protocol has no central node, all the nodes in the protocol are completely equal, and the nodes reach agreement through equal negotiation. As a result, the decentralization protocol has no problems such as service outage caused by the exception of the centralized node.

The biggest disadvantage of decentralized protocols is that the protocol flow is usually complex, especially when strong consistency is required, in which case the flow becomes complicated and hard to understand. Because of this complexity, the efficiency and performance of decentralized protocols are generally lower than those of centralized protocols. An imperfect analogy: a centralized replica control protocol is like an autocratic system, efficient but highly dependent on the central node and badly affected when that node fails; a decentralized replica control protocol is like a democratic system, in which nodes negotiate collectively, which is less efficient, but the failure of individual nodes has little impact on the system as a whole.

2.3 Lease mechanism

The lease mechanism is the most important distributed protocol and is widely used in a variety of practical distributed systems.

Distributed cache system based on lease

The background of the problem is as follows: in a distributed system there is a central server node that stores and maintains certain data, which is the system's metadata. The other nodes in the system read and modify the metadata by accessing the central server node.

Because all kinds of operations in the system depend on the metadata, if every metadata read had to access the central server node, the performance of that node would become the bottleneck of the system. For this reason, a metadata cache is designed: metadata is cached on each node, reducing accesses to the central server and improving performance.

On the other hand, the correct operation of the system depends strictly on correct metadata, which requires that the cached data on each node always be consistent with the data on the central server; the cache must never hold stale, dirty data. Finally, the cache system should handle node outages, network interruptions, and other anomalies as far as possible, maximizing system availability.

For this reason, a set of cache system is designed by using lease mechanism, and its basic principle is as follows.

When the central server sends data to a node, it also issues a lease to that node. Each lease has an expiration time, similar to the expiry date on a credit card; the expiration is usually a specific point in time, for example 12:00:10, after which the lease is invalid. The validity of a lease is therefore independent of when the node receives it; the lease may already have expired by the time it arrives. Here we first assume that the clocks of the central server and of every node are synchronized; the effect of clock skew on leases is discussed in the next section. A lease issued by the central server means that the server guarantees not to modify the value of the corresponding data while the lease is valid. Therefore, after receiving the data and the lease, a node adds the data to its local cache; once the corresponding lease expires, the node removes the data from its local cache. When the central server wants to modify the data, it first blocks all new read requests and waits until every lease previously issued for that data has expired, and only then modifies the value.

Lease-based cache: flow for a client node reading metadata


1. Determine whether the metadata is already in the local cache and its lease is still valid.
1.1 Yes: return the metadata from the cache directly.
1.2 No: request the metadata from the central server node.
1.2.1 On receiving the read request, the server returns the metadata together with a corresponding lease.
1.2.2 Did the client successfully receive the data returned by the server?
1.2.2.1 Failure or timeout: exit the process; the read failed and may be retried.
1.2.2.2 Success: record the metadata and its lease in memory and return the metadata.

Lease-based cache: flow for a client node modifying metadata
2.1 The node sends a modify-metadata request to the server.
2.2 On receiving the request, the server blocks all new read requests: it accepts them but does not return data.
2.3 The server waits for all leases issued for this metadata to expire.
2.4 The server modifies the metadata and returns success to the client node.
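
A minimal Java sketch of the server side of this scheme (hypothetical names; a short lease is used so the demo finishes quickly, while the text suggests leases on the order of 10 seconds). Because both methods are synchronized, new reads block while the server waits for leases to expire, which matches step 2.2 of the basic flow:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LeaseServer {
    private static final long LEASE_MILLIS = 3_000; // short lease for the demo; ~10 s is typical in practice
    private final Map<String, String> metadata = new ConcurrentHashMap<>();
    private final Map<String, Long> lastLeaseExpiry = new ConcurrentHashMap<>();

    static class Leased {
        final String value; final long expiresAtMillis;
        Leased(String v, long e) { value = v; expiresAtMillis = e; }
    }

    // Read path: return the value plus a lease promising no change before expiry.
    synchronized Leased read(String key) {
        long expiry = System.currentTimeMillis() + LEASE_MILLIS;
        lastLeaseExpiry.merge(key, expiry, Math::max);
        return new Leased(metadata.get(key), expiry);
    }

    // Write path: wait until every lease issued for the key has expired, then modify.
    // Holding the monitor while waiting also blocks new reads (step 2.2 of the flow).
    synchronized void write(String key, String value) throws InterruptedException {
        long wait = lastLeaseExpiry.getOrDefault(key, 0L) - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);
        metadata.put(key, value);
    }

    public static void main(String[] args) throws InterruptedException {
        LeaseServer server = new LeaseServer();
        server.metadata.put("config", "v1");
        Leased lease = server.read("config");
        System.out.println(lease.value + " cacheable until " + lease.expiresAtMillis);
        server.write("config", "v2");   // returns only after the lease above has expired
        System.out.println(server.read("config").value);
    }
}

A client would cache the returned value only while the current time is before expiresAtMillis, matching the read flow above.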

The above mechanism ensures that the cache on each node stays consistent with the central server. This is because the central server grants a lease along with the data and promises not to modify the data while the lease is valid, so a client node can safely cache the data for the lifetime of the lease. The key to the fault tolerance of the lease mechanism is this: once the server has sent out the data and the lease, regardless of whether the client received them, whether the client later crashes, and whether the network later fails, the server only needs to wait for the lease to expire to be certain that the client node no longer caches the data; it can then modify the data without breaking cache consistency.

The basic process above has some performance and availability problems, but it can easily be optimized. Optimization 1: when the server modifies metadata it first blocks all new read requests, causing a read outage. This is done to prevent new leases from being issued, which would keep handing new client nodes leases and cached data, forming a "livelock". The fix is simple: once the server enters the data-modification flow, it responds to read requests with the data only, without issuing a lease. As a result, while the modification is in progress clients can read the metadata but cannot cache it. A further optimization: when the modification flow begins, the validity period of any lease the server issues is capped at the largest expiration time among the leases already issued. This way clients can still cache metadata after the server has entered the modification flow, but the server's wait for all leases to expire is not prolonged by newly issued leases.

Finally, the difference between the cache mechanism and the multi-replica mechanism. They are similar in that one piece of data is stored on multiple nodes. But the cache mechanism is much simpler: cached data can be deleted and discarded at any time, and the only consequence of losing a cache entry is that the data source must be accessed to read the data again. The replica mechanism is different: replicas cannot be discarded at will; every lost replica degrades the quality of service, and once the number of replicas drops below a certain level the service becomes unavailable.

Analysis of lease Mechanism

Definition of a lease: a lease is a promise granted by an issuer that is valid for a certain period. Once the issuer has sent out a lease, it strictly honors the promise as long as the lease has not expired, regardless of whether the receiver got it and regardless of the receiver's subsequent state; conversely, the receiver may rely on the issuer's promise while the lease is valid, but must stop relying on it once the lease expires.

The lease mechanism is highly fault tolerant. First, by introducing a validity period, the lease mechanism tolerates network anomalies very well. Lease issuance depends only on one-way network communication: even if the receiver cannot send messages back to the issuer, lease issuance is unaffected.

Because a lease's validity period is a fixed point in time, its semantics are independent of when it was sent, so the issuer can resend the same lease repeatedly; even if issuance occasionally fails, the issuer can simply resend it. Once a lease has been successfully accepted by the receiver, the mechanism no longer depends on network communication; even a complete network outage does not affect it.

In addition, the lease mechanism tolerates node downtime well. If the issuer goes down, it normally cannot change promises it has already made, so the correctness of outstanding leases is unaffected. After the issuer recovers, if it can restore the previous lease information it can continue to honor those promises; if it cannot, it only needs to wait one maximum lease period so that all previously issued leases become invalid, which does not break the mechanism.

For example, in the cache-system example of the previous section, once the server goes down the metadata will not be modified; after recovery it only needs to wait one maximum lease period, by which time the cached information on all nodes will have been cleared.

If the receiver crashes, the issuer needs no extra fault handling; it simply waits for the lease to expire to revoke the promise, that is, to take back the permissions or identity it previously granted. Finally, the lease mechanism does not depend on persistent storage. The issuer may persist the leases it has issued so that they remain valid after a restart, but this is only an optimization; as analyzed above, even without persisting lease information the issuer can invalidate all previously issued leases by waiting one maximum lease period, keeping the mechanism correct.

The lease mechanism depends on expiration times, which requires the clocks of the issuer and receiver to be synchronized. On one hand, if the issuer's clock is slower than the receiver's, the issuer still considers a lease valid after the receiver thinks it has expired; the receiver can handle this by requesting a new lease before the old one expires. On the other hand, if the issuer's clock is faster than the receiver's, the issuer considers a lease expired while the receiver still thinks it is valid; the issuer might then issue the lease to another node, invalidating its earlier promise and breaking the system's correctness. The usual engineering remedy for such clock skew is to make the issuer's validity period slightly longer than the receiver's; the margin only needs to exceed the clock error to avoid affecting the validity of the lease.

Determining Node State based on lease Mechanism

Such distributed protocols depend on a globally consistent view of node state: once node Q considers node A abnormal, node A must also consider itself abnormal, so that A stops acting as primary and the "dual-primary" problem is avoided.

There are two ways to address this. The first is to design a distributed protocol that tolerates "dual-primary" errors, that is, one that does not depend on a globally consistent view of node state, or in which the globally consistent state is the result of negotiation among all nodes.

The second is to use the lease mechanism. The first idea amounts to abandoning the centralized design in favor of a decentralized one, which is beyond the scope of this section; the following focuses on using leases to determine node state.

The central node issues leases to the other nodes; a node holding a valid lease is considered able to serve normally. In example 2.3.1, nodes A, B, and C still periodically send heartbeats to report their state; on receiving a heartbeat, node Q issues a lease, confirming the node's state and allowing it to work normally within the lease's validity period. Node Q can give the primary node a special lease indicating that it may act as primary. When node Q wants to switch to a new primary, it only needs to wait for the previous primary's lease to expire; it can then safely issue a new lease to the new primary node without creating a "dual-primary" problem.

In a real system, having a single central node issue leases is also quite risky: if that node goes down or the network fails, every node loses its lease and the system becomes highly unavailable. For this reason, real systems always use a small cluster of central nodes, acting as replicas of one another, to provide the lease-issuing function with high availability. Both Chubby and ZooKeeper are based on this design.

Selection of validity time of lease

In engineering, the commonly chosen lease duration is 10 seconds. This is an empirically validated value and can serve as a reference; in practice an appropriate duration should be chosen after weighing the specific circumstances.

2.4 Quorum mechanism

First, a convention: update operations (writes) are a sequence of ordered operations whose order is determined by some other mechanism (for example, the order decided by the primary in a primary-secondary architecture). Each update operation is denoted wi, where i is a monotonically increasing sequence number. After each wi executes successfully the replica data changes to a new version, denoted vi. Assume that each replica keeps all historical versions of the data.

Write-all-read-one

Write-all-read-one (WARO for short) is the simplest replica control rule. As the name implies, all replicas are written on update: an update is considered successful only if it succeeds on every replica, so all replicas stay consistent and the data can be read from any single replica.

Since an update must succeed on all N replicas to be considered successful, the failure of any single replica makes updates fail and the update service unavailable: although there are N replicas, the system cannot tolerate the failure of even one of them for updates. On the other hand, as long as any one of the N replicas is normal the system can serve reads, so for the read service the system tolerates N-1 replica failures. From this analysis, WARO gives high availability for reads but low availability for updates: even though replicas are used, the availability of the update service is the same as having no replicas at all.

Quorum definition

Under the Quorum mechanism, once an update operation wi has succeeded on W of the N replicas, it is called a "successfully committed update operation" and the corresponding data is called "successfully committed data". Let R > N-W, that is, W + R > N. Since the update wi succeeded on only W replicas, reading data requires reading at least R replicas in order to be sure of seeing wi's updated data vi: because W + R > N, any set of R replicas must intersect the set of W replicas on which wi succeeded, so reading R replicas is guaranteed to include wi's updated data vi.

As shown in figure 2-10, the principle of the Quorum mechanism can be illustrated with a Venn diagram.

Consider a system with 5 replicas. Initially the data on all 5 replicas is consistent, all at v1. An update operation w2 then succeeds on the first 3 replicas, and the replica state becomes (v2 v2 v2 v1 v1).

At this point, any set of three replicas must contain v2. In the definition above, setting W = N and R = 1 yields WARO, so WARO is a special case of the Quorum mechanism. The availability of the Quorum mechanism can be analyzed in the same way as WARO. Restrict the Quorum parameters to W + R = N + 1. Since an update must succeed on W replicas before it can be considered successful, once N - W + 1 replicas are abnormal the update can never succeed on W replicas and the update service becomes unavailable.

On the other hand, once N - R + 1 replicas are abnormal, it can no longer be guaranteed that a set of replicas intersecting the W written replicas can be read, and the consistency of the read service degrades.

To emphasize: the Quorum mechanism alone cannot guarantee strong consistency, because with the Quorum mechanism alone the latest successfully committed version number cannot, in general, be determined; it is hard to determine unless the latest committed version number is managed as metadata by a dedicated metadata server or metadata cluster. The next section discusses the circumstances under which the latest successfully committed version number can be determined purely through the Quorum mechanism.

The three system parameters N, W, and R of the Quorum mechanism control the system's availability and also constitute the system's service commitment to its users: the data has at most N replicas, and once an update succeeds on W replicas, success is returned to the user. A Quorum system with high consistency requirements should additionally promise never to return uncommitted data, that is, the data returned has succeeded on at least W replicas.
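
A minimal sketch of these parameters as code (assumed class name; the N=5, W=3, R=3 configuration matches the example above):

// The Quorum condition: with N replicas, a write counts as committed once W
// replicas acknowledge it, and reads contact R replicas with W + R > N so that
// every read set intersects every committed write set.
public class QuorumConfig {
    final int n, w, r;

    QuorumConfig(int n, int w, int r) {
        if (w + r <= n) throw new IllegalArgumentException("need W + R > N");
        this.n = n; this.w = w; this.r = r;
    }

    boolean writeCommitted(int acks) { return acks >= w; }

    public static void main(String[] args) {
        QuorumConfig q = new QuorumConfig(5, 3, 3);
        System.out.println(q.writeCommitted(3));   // true: the update is successfully committed
        System.out.println(q.writeCommitted(2));   // false: not yet committed
    }
}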

Read the latest successfully submitted data

The Quorum mechanism only requires an update to succeed on W of the N replicas, and when R replicas are read, the latest successfully committed data is guaranteed to be among them. However, because unsuccessful updates also exist, merely reading R replicas does not always reveal which version is the latest committed one. For a strongly consistent Quorum system, if the highest version among the R replicas appears on fewer than W of them, say on X replicas, then continue reading other replicas: if W replicas with that version are eventually found, that version is the latest successfully committed data; if it is certain that the number of replicas holding that version cannot reach W, then the second-highest version among the R replicas is the latest successfully committed one.

Example: when reading (v2 v1 v1), continue to read the remaining copies. If you read that the remaining two copies are (v2 v2), then v2 is the latest committed copy; if you read that the remaining two copies are (v2 v1) or (v1 v1), then v1 is the latest successfully committed version; if there is a timeout or failure in reading the next two copies, it is impossible to determine which version is the latest successfully committed version.

As you can see, when the Quorum mechanism is used alone, determining the latest successfully committed version may require reading up to R + (W - 1) = N replicas, and if any replica is abnormal, it may become impossible to read the latest successfully committed version.
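
The decision rule described above can be sketched as follows (assumed names; the helper is also told how many replicas remain unread so it can distinguish "committed" from "still undetermined"):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A version counts as committed once it is seen on at least W replicas; if the
// highest version could still reach W among the unread replicas, the answer is
// undetermined and more replicas must be read.
public class QuorumReader {
    static Long latestCommitted(List<Long> versionsSeen, int unreadReplicas, int w) {
        Map<Long, Integer> counts = new TreeMap<>(Collections.reverseOrder());
        for (long v : versionsSeen) counts.merge(v, 1, Integer::sum);
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            if (e.getValue() >= w) return e.getKey();               // committed, highest such version
            if (e.getValue() + unreadReplicas >= w) return null;    // could still reach W: keep reading
        }
        return null;                                                // nothing committed among what was seen
    }

    public static void main(String[] args) {
        int w = 3;
        // The (v2 v1 v1) example: 2 replicas still unread, outcome undetermined.
        System.out.println(latestCommitted(List.of(2L, 1L, 1L), 2, w));          // null
        // Remaining replicas turned out to be (v2 v2): v2 is the latest committed version.
        System.out.println(latestCommitted(List.of(2L, 1L, 1L, 2L, 2L), 0, w));  // 2
        // Remaining replicas turned out to be (v2 v1): v1 is the latest committed version.
        System.out.println(latestCommitted(List.of(2L, 1L, 1L, 2L, 1L), 0, w));  // 1
    }
}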

In real projects, other technical means should be used to avoid having to determine the latest successfully committed version through the Quorum mechanism alone. For example, when the Quorum mechanism is combined with the primary-secondary control protocol, the latest committed data can be read simply by reading the primary.

Selection of primary replica based on Quorum mechanism

Reading data can be done differently according to consistency requirements: if you need to read the latest successfully submitted data immediately with strong consistency, you can simply read the data on the primary copy, or you can read it as shown in the previous section.

If session consistency is required, you can selectively read on individual replicas based on the previously read data version number; if only weak consistency is needed, you can choose any copy to read.

In the primary-secondary protocol, when the primary is abnormal, a new primary needs to be selected, and then the secondary copy synchronizes the data with the primary.

Usually, the task of selecting the new primary is performed by a central node. After introducing the Quorum mechanism, the common way of selecting the primary is similar to the way data is read: the central node reads R replicas and selects the one with the highest version number among them as the new primary. After the new primary has completed data synchronization with at least W replicas, it begins to provide read and write service as the new primary.

First, the replica with the highest version number among the R replicas read must contain the latest successfully committed data. Furthermore, although it is not certain that this highest-version data was itself successfully committed, the new primary then synchronizes it to the secondaries so that the number of replicas holding this version reaches W, which turns this version's data into successfully committed data.

Example: in a system with N=5, W=3, R=3, suppose the replica versions at some moment are (v2 v2 v1 v1 v1); then v1 is the system's latest successfully committed data and v2 is uncommitted data in an intermediate state. Suppose the original primary replica fails and the central node performs a primary switch. Whether this "intermediate state" data is deleted as dirty data, or becomes effective as new data after synchronization, depends entirely on whether this data takes part in the election of the new primary. The two cases are analyzed below.

First, as shown in figure 2-12, if the central node successfully communicates with three of the replicas and reads versions (v1 v1 v1), then any of these replicas can be selected as primary. The new primary takes v1 as the latest successfully committed version and synchronizes it to the other replicas. When synchronizing with the first and second replicas, the data on those replicas is dirty, because their version v2 is higher than the new primary's; this can be resolved with the dirty-data handling described in Section 2.2.2.4.

In practice it is also possible that the new primary begins serving data after completing synchronization with the last two replicas, and its own version is later updated to v2 as well. If the system cannot guarantee that this later v2 is exactly the same as the earlier v2, then when synchronizing with the first and second replicas the new primary must compare not only the version numbers but also the actual content of the update operations.

Second, if the central node communicates successfully with the other three replicas and reads versions (v2 v1 v1), then the replica with version v2 is selected as the new primary. Once the new primary completes data synchronization with two other replicas, the number of replicas holding v2 reaches W; v2 becomes the latest successfully committed data, and the new primary can provide normal read and write service.

2.5 Logging techniques

Logging is one of the main techniques for downtime recovery. It originated in database systems. Strictly speaking, logging is not a distributed-system technique, but in distributed-system practice it is widely used for downtime recovery, and systems such as BigTable even store their logs in a distributed system to further strengthen fault tolerance.

Redo Log and Check point

Consider a high-speed single-machine query system that keeps all of its data in memory for fast queries, where each update operation modifies a small part of the data (for example, one key in a key-value store). The problem is to use logging to implement downtime recovery for this in-memory query system. Unlike database transactions, every successful update in this model takes effect immediately; this is equivalent to each database transaction containing exactly one update operation that is committed immediately (auto commit).

Redo Log

Write the result of the update operation to the log file on disk in append form (for example, for "Set K1=1", the record "K1=1" is appended).

Modify the data in memory according to the update operation.

Return success for the update.

As can be seen from the Redo Log process, what Redo writes to the log is the result of the completed update operation (this article does not discuss Undo Log, but this is one of the differences from it), and because the log file is written by sequential append, it is efficient on storage devices such as disks that strongly favor sequential writes.

Downtime recovery with a Redo Log is very simple: just "replay" the log.

Process 2.5.2: Redo Log downtime recovery

Read the result of each update operation from the beginning of the log file and apply these results to the data in memory.

As can be seen from Redo Log's downtime recovery process, only the update results written to the log file can be recovered after downtime. This is why you need to update the log file first and then update the data in memory in the Redo Log process.

If the data in memory were updated first, the user could read the updated value immediately; if the system then crashed between modifying memory and writing the log, the last update could not be recovered, even though the user may already have read the updated value, resulting in inconsistency.
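
A minimal Java sketch of this log-first discipline and of replay-based recovery (the file format and class names are assumptions for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Redo logging for an in-memory key-value store: the result of the update is
// appended to the log first, and only then is the in-memory state changed;
// recovery simply replays the log from the beginning.
public class RedoLogStore {
    private final Map<String, String> memory = new HashMap<>();
    private final FileWriter log;

    RedoLogStore(String logPath) throws IOException {
        this.log = new FileWriter(logPath, true);   // append mode: sequential writes
    }

    void set(String key, String value) throws IOException {
        log.write(key + "=" + value + "\n");        // 1. append the update's result
        log.flush();                                // make it durable before acknowledging
        memory.put(key, value);                     // 2. apply it to memory
    }                                               // 3. caller is then told the update succeeded

    // Downtime recovery: replay every logged result into memory.
    static RedoLogStore recover(String logPath) throws IOException {
        RedoLogStore store = new RedoLogStore(logPath);
        try (BufferedReader in = new BufferedReader(new FileReader(logPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                int eq = line.indexOf('=');
                store.memory.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return store;
    }

    public static void main(String[] args) throws IOException {
        RedoLogStore store = new RedoLogStore("redo.log");
        store.set("K1", "1");
        RedoLogStore recovered = RedoLogStore.recover("redo.log");
        System.out.println(recovered.memory.get("K1"));   // "1" after replaying the log
    }
}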

Check point

Under the simplified model, the check point technique dumps the data in memory to disk in full, in a form that is easy to reload, so as to reduce the amount of log data that needs to be replayed during downtime recovery.

Process: check point


Record "Begin Check Point" to the log file

Dump the data in memory to disk in a way that is easy to reload

Record "End Check Point" to the log file in the check point process, the data can continue to be updated in accordance with process 2.5.1, during which the updated data can be dump to disk or not dump to disk, depending on the implementation. For example, at the beginning of check point, the value of K1 from dump to disk can be v1 or v2 if it is updated to K1 = v2 during the process of checking point.

Process: check point-based downtime recovery

Load the data previously dumped to disk into memory.

Scan the log file from back to front for the last "End Check Point" log.

Find the most recent "End Check Point" log forward from the last "Begin Check Point" log, and play back all update operation logs since that log.

No Undo/No Redo log

If the data is maintained on disk and a batch of updates consists of several update operations, these operations need to take effect atomically, that is, either all of them take effect at the same time or none of them do.

In the 0/1 directory technique there are two directory structures, called directory 0 and directory 1, plus another structure called the master record, which records the directory currently in use (the active directory): the master record either points to directory 0 or points to directory 1. The location of each piece of data in the log file is recorded in directory 0 or directory 1. Data updates in the 0/1 directory always take place on the inactive directory; before the data takes effect, the 0/1 value in the master record is flipped, thereby switching the active directory.

Process: 0/1 directory data update flow


Fully copy the active directory to the inactive directory.

For each update operation, create a new log entry to record the value after the operation, and change the location of the corresponding data in the inactive directory to the location of the newly created log entry.

Atomically modify the master record: flip its 0/1 value so that the previously inactive directory becomes active.

The update flow of the 0/1 directory is very simple; a batch of modifications is made to take effect atomically by switching the 0/1 value in the master record. Through this directory indirection, the 0/1 directory reduces the atomicity of a batch of transactional operations to the atomic switch of a single master record.

Because atomically modifying multiple records is generally difficult to achieve while atomically modifying a single record can often be achieved, this reduces the difficulty of the problem.

In engineering, the idea of the 0/1 directory is widely used, and its form is not limited to the above flow: it can be two data structures in memory that are switched back and forth, or two file directories on disk that are switched back and forth. A minimal in-memory sketch is given below.
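The following is a minimal in-memory sketch of the 0/1 directory idea, with illustrative names (ZeroOneDirectory, applyBatch) that are not from the text; it keeps two maps as the directories and an atomic integer as the master record.

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch: two directories plus a master record; a batch of updates is
// applied to the inactive directory and made visible by one atomic flip.
public class ZeroOneDirectory {
    // For brevity the directory maps keys directly to values instead of log locations.
    private final List<Map<String, String>> directories =
            Arrays.asList(new HashMap<>(), new HashMap<>());
    private final AtomicInteger masterRecord = new AtomicInteger(0); // active directory: 0 or 1

    public String read(String key) {
        return directories.get(masterRecord.get()).get(key);   // readers go through the active directory
    }

    public synchronized void applyBatch(Map<String, String> batch) {
        int active = masterRecord.get();
        int inactive = 1 - active;
        Map<String, String> work = directories.get(inactive);
        work.clear();
        work.putAll(directories.get(active));   // 1. fully copy active -> inactive
        work.putAll(batch);                     // 2. apply every update on the inactive directory
        masterRecord.set(inactive);             // 3. one atomic switch makes the whole batch visible
    }
}
```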

2.6 Two-phase commit protocol

The two-phase commit protocol is a classic strongly consistent centralized replica control protocol. Although the protocol has many problems in engineering, studying it gives a good understanding of several typical problems of distributed systems.

Process description

The two-phase commit protocol is a typical "centralized replica control" protocol. In this protocol, the participating nodes are divided into two categories: one centralized coordinator node and N participant nodes, where each participant node manages a database replica.

The idea of two-phase commit is relatively simple. In the first phase, the coordinator asks all participants whether they can commit the transaction (asks the participants to vote), and all participants send their votes to the coordinator.

In the second phase, the coordinator decides whether the transaction can be committed globally based on the votes of all participants and notifies all participants to carry out the decision. During a two-phase commit, participants cannot change their votes.

The global commit of the two-phase commit protocol is based on the premise that all participants agree to commit the transaction, and as long as one participant votes to abort the transaction, the transaction must be abandoned.

Process: two-phase commit, coordinator flow


Write the local log "begin_commit" and enter the WAIT state

Send "prepare messages" to all participants

Wait for and receive the participants' responses to the "prepare" message:
3.1 if a "vote-abort" message is received from any participant:
3.1.1 write the local "global-abort" log and enter the ABORT state;
3.1.2 send a "global-abort" message to all participants;
3.2 if "vote-commit" messages are received from all participants:
3.2.1 write the local "global-commit" log and enter the COMMIT state;
3.2.2 send a "global-commit" message to all participants.

Wait for and receive the participants' acknowledgements of the "global-abort" or "global-commit" message; once acknowledgements from all participants are received, write the local "end_transaction" log and the process ends.

Process: two-phase commit, participant flow


Write local log "init" record and enter INIT state

Wait and receive the "prepare message" sent by the coordinator. 2.1 if the participant can submit the transaction 2.1.1 to write the local log "ready", enter the READY status 2.1.2 send the "vote-commit" message to the coordinator 2.1.4 waiting for the coordinator's message 2.1.4.1 if you receive the coordinator's "global-abort" message 2.1.4.1.1 write the local log "abort" Enter ABORT status 2.1.4.1.2 send confirmation message of "global-abort" to coordinator 2.1.4.2 if you receive coordinator's "global-commit" message 2.1.4.1.1 write local log "commit", enter COMMIT status 2.1.4.1.2 send confirmation message of "global-commit" to coordinator 2.2.If the participant is unable to submit this transaction 2.2.1 write local log "abort" Enter ABORT status 2.2.2 send a "vote-abort" message to the coordinator 2.2.3 the process ends 2.2.4 if you later receive a "global-abort" message from the coordinator, you can respond.

Even after the process ends, whenever a "global-abort" or "global-commit" message is received from the coordinator, the corresponding acknowledgement is sent. A minimal sketch of the two flows is given below.
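The following is a minimal in-process sketch of the coordinator and participant flows, with no real network, logging or recovery; the Participant interface and the class names are illustrative assumptions, not part of the protocol specification.

```java
import java.util.*;

// Minimal in-process sketch: one coordinator and N participants exchanging the
// four 2PC messages (prepare, vote, global decision, acknowledgement).
public class TwoPhaseCommit {

    interface Participant {
        boolean prepare();          // vote-commit (true) or vote-abort (false)
        void globalCommit();        // apply the transaction and acknowledge
        void globalAbort();         // abandon the transaction and acknowledge
    }

    static class Coordinator {
        private final List<Participant> participants;
        Coordinator(List<Participant> participants) { this.participants = participants; }

        // Phase 1: collect all votes. Phase 2: broadcast the global decision.
        boolean commitTransaction() {
            boolean allVoteCommit = true;
            for (Participant p : participants)
                allVoteCommit &= p.prepare();                         // "prepare" / vote
            if (allVoteCommit) {
                for (Participant p : participants) p.globalCommit();  // global-commit
            } else {
                for (Participant p : participants) p.globalAbort();   // global-abort
            }
            return allVoteCommit;
        }
    }

    public static void main(String[] args) {
        Participant ok = new Participant() {
            public boolean prepare() { return true; }
            public void globalCommit() { System.out.println("commit applied"); }
            public void globalAbort() { System.out.println("aborted"); }
        };
        // One reluctant participant forces a global abort.
        Participant refuses = new Participant() {
            public boolean prepare() { return false; }
            public void globalCommit() { }
            public void globalAbort() { System.out.println("aborted"); }
        };
        System.out.println(new Coordinator(Arrays.asList(ok, refuses)).commitTransaction());
    }
}
```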

Exception handling

Downtime recovery


Coordinator downtime recovery: after the coordinator recovers from downtime, it first determines its state before the outage from the log. If the log ends with a "begin_commit" record, the coordinator was in the WAIT state before the outage; it may or may not have sent the "prepare" message, but it certainly has not sent a "global-commit" or "global-abort" message, that is, the global state of the transaction has not yet been determined. The coordinator can then resend the "prepare" message and continue the two-phase commit process; even if a participant has already responded to the "prepare" message, it simply retransmits its previous response, which does not affect the consistency of the protocol. If the log ends with a "global-commit" or "global-abort" record, the coordinator was in the COMMIT or ABORT state before the outage; it only needs to resend the "global-commit" or "global-abort" message to all participants to continue the two-phase commit process.

Participant downtime recovery: after a participant recovers from downtime, it first determines its state before the outage from the log. If the log ends with an "init" record, the participant was in the INIT state and has not yet voted on the transaction; it can continue the process and wait for the coordinator's "prepare" message. If the log ends with a "ready" record, the participant was in the READY state and has already voted on the transaction, but whether the "vote-commit" message was sent to the coordinator before the outage is unknown; the participant can therefore resend "vote-commit" to the coordinator and continue the protocol. If the log ends with a "commit" or "abort" record, the participant has already received the coordinator's "global-commit" message (and is in the COMMIT state) or "global-abort" message (and is in the ABORT state). Whether the corresponding acknowledgement has been sent to the coordinator is unknown, but even if it has not, the coordinator will keep resending "global-commit" or "global-abort", so the participant only needs to acknowledge these messages when they arrive, which does not affect the global consistency of the protocol.

Protocol analysis

The two-phase commit protocol is rarely used in engineering practice, mainly for the following reasons:


The fault tolerance of the two-phase commit protocol is poor. As can be seen from the above analysis, in some cases the protocol cannot make progress and the process state cannot even be determined. A good distributed protocol in engineering can always continue to execute even when exceptions occur. For example, recall the Lease mechanism (2.3): once a lease is issued, no matter what exception occurs, the node that issued the Lease can always determine whether the Lease is valid by time, or revoke the Lease permission simply by waiting for the Lease to time out; no step of the Lease protocol ever blocks and fails to proceed. Compared with the simple and effective Lease mechanism, the two-phase commit protocol is more complex and has poor fault tolerance.

The performance of the two-phase commit protocol is poor. A successful two-phase commit requires at least two rounds of interaction between the coordinator and each participant, with four messages: "prepare", "vote-commit", "global-commit" and the acknowledgement of "global-commit". Too many interactions degrade performance. Moreover, the coordinator has to wait for the votes of all participants, so a single slow participant slows down the execution of the entire global process.

Although there are some improved two-phase commit protocols that can improve fault tolerance and performance, these protocols are still less used in engineering, and their theoretical value is greater than practical significance.

2.7 MVCC

MVCC (Multi-version Concurrency Control) was originally proposed in database systems, but the idea is not limited to stand-alone systems; it is equally effective in distributed systems.

MVCC is a technique for concurrency control over multiple versions of data. Its basic idea is to generate a new version of the data for each transaction; by choosing which version to read, transactions see complete, consistent results. With MVCC, each transaction updates the data based on a base version that has already taken effect, and transactions can proceed in parallel, which produces a graph-like version history.

The base data is version 1, and two transactions start at the same time: transaction A and transaction B. Each makes some local modifications to the data (visible only to the transaction itself and not affecting the real data). Transaction A commits first, producing data version 2; based on data version 2, transaction C is started, and transaction C then commits, producing data version 3. Finally transaction B commits, and its result must be merged with the result of transaction C: if there is no data conflict, that is, transaction B did not modify any variable that transaction A or transaction C modified, then transaction B can commit; otherwise transaction B's commit fails.

The process of MVCC is very similar to that of version control systems such as SVN; or rather, version control systems such as SVN use the idea of MVCC. When a transaction makes local modifications based on a base data version, there are usually two ways to avoid affecting the real data. One is to copy all the data of the base version and then modify it; SVN uses this method, and an SVN check out is exactly such a copy. The other is to record only the update operations of each transaction rather than the complete data, and to apply these update operations to the base version when reading in order to compute the result; this is similar to SVN's incremental commit. A minimal sketch of the conflict check at commit time is given below.
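The sketch below illustrates the commit-time conflict check when each transaction buffers its own writes; the class and method names such as SimpleMvcc and begin() are illustrative assumptions, not a real library API.

```java
import java.util.*;

// Minimal sketch: each transaction buffers its writes against a base version;
// commit succeeds only if no key it wrote was also modified by a version
// committed after its base version.
public class SimpleMvcc {
    // committed versions: version number -> key/value pairs written in that version
    private final TreeMap<Long, Map<String, String>> versions = new TreeMap<>();
    private long latest = 1;

    public SimpleMvcc() { versions.put(1L, new HashMap<>()); }

    public class Txn {
        private final long baseVersion = latest;                 // base version at start
        private final Map<String, String> writes = new HashMap<>();

        public void put(String key, String value) { writes.put(key, value); }

        public String get(String key) {                          // own writes first, then the snapshot
            if (writes.containsKey(key)) return writes.get(key);
            for (Map<String, String> v :
                    versions.headMap(baseVersion, true).descendingMap().values())
                if (v.containsKey(key)) return v.get(key);
            return null;
        }

        public boolean commit() {
            synchronized (SimpleMvcc.this) {
                // conflict check: did any later version touch a key this transaction wrote?
                for (Map<String, String> later : versions.tailMap(baseVersion, false).values())
                    for (String k : writes.keySet())
                        if (later.containsKey(k)) return false;  // conflict: commit fails
                versions.put(++latest, new HashMap<>(writes));   // publish a new data version
                return true;
            }
        }
    }

    public Txn begin() { return new Txn(); }

    public static void main(String[] args) {
        SimpleMvcc db = new SimpleMvcc();
        SimpleMvcc.Txn a = db.begin(), b = db.begin();
        a.put("x", "1");
        b.put("x", "2");
        System.out.println(a.commit());   // true: transaction A commits first
        System.out.println(b.commit());   // false: B conflicts with A on "x"
    }
}
```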

2.8 Paxos protocol

The Paxos protocol is one of the few decentralized distributed protocols with strong consistency and high availability that has been proven in engineering practice. The process of the Paxos protocol is complicated, but its basic idea is not difficult to understand; it resembles a voting process in human society. In the Paxos protocol there is a group of completely peer participating nodes (called acceptors), each of which makes a decision on some event; a resolution takes effect once it is approved by more than half of the nodes. As long as more than half of the nodes are healthy, the Paxos protocol can work well in the face of downtime, network partition and other abnormal situations.

Role

Proposer: the proposer. There can be multiple Proposers, and a Proposer proposes a proposal (value). The so-called value can be any operation in a project, such as "set the value of some variable to some value" or "set the current primary to some node"; these operations are all abstracted as value in the Paxos protocol. Different Proposers can propose different and even contradictory values, for example one Proposer proposes "set variable X to 1" and another proposes "set variable X to 2", but for the same round of the Paxos process at most one value is approved.

Acceptor: the approver. There are N Acceptors, and a value proposed by a Proposer must be approved by more than half of the Acceptors. The Acceptors are completely independent of each other.

Learner: the learner. A Learner learns the approved value by reading each Acceptor's decision on the value: if a value has been approved by more than half of the Acceptors, the Learner learns that value. Similar to the Quorum mechanism, a value needs the approval of W = N/2 + 1 Acceptors, so a Learner needs to read at least N/2 + 1 Acceptors; after reading the results of at most N Acceptors, it can learn the approved value.

The above three roles are only a logical division; in practice one node can play all three roles at the same time.

Process flow

The Paxos protocol proceeds round by round, and each round has a number. Each round of the Paxos protocol may or may not approve a value. If a value is approved in one round of the Paxos protocol, then only that value can be approved in all subsequent rounds. These rounds together constitute one Paxos protocol instance, that is, one Paxos protocol instance can approve only one value, which is an important embodiment of the strong consistency of the Paxos protocol. Each round of the Paxos protocol is divided into a prepare phase and an accept phase, and in each phase the Proposer and the Acceptor have their own processing flows.

Process: Proposer process (preparation phase)

Hongmeng official Strategic Cooperation to build HarmonyOS Technology Community

Send the message "Prepare (b)" to all Acceptor; where b is the number of rounds of Paxos, incremented each round

If you receive any message "Reject (B)" sent by Acceptor, the current round of Paxos fails for this Proposer. Set the number of rounds b to Bau1 and restart step 1; (approval phase, different choices are made according to the messages received by Acceptor)

If the "Promise (b, Acceptor I)" message received from Acceptor reaches N _ Acceptor (N is the total number of value, division and division rounding, the same below); Acceptor last approved value v in the I round. 3.1If v is empty in the "Promise (bjournal v)" message received, Proposer selects a value v and broadcasts Accept to all Acceptor; 3.2Otherwise, select the value v which is the largest of all received "Promise (bjournal v)" messages and broadcast the message Accept to all Acceptor.

If a "Nack(B)" message is received, set the round number b to B+1 and restart from step 1.

Process: Acceptor process (prepare phase)

Receive the message "Prepare(b)" from some Proposer. Let B be the largest Paxos round number this Acceptor has received so far, and V the value this Acceptor has approved, which may be empty. 1.1 If b > B, reply "Promise(b, V_B)" and set B = b, which means this Acceptor will no longer accept proposals with a round number smaller than b. 1.2 Otherwise, reply "Reject(B)". (Accept phase)

Receive "Accept(b, v)": 2.1 if b < B, reply "Nack(B)", hinting to the Proposer that this Acceptor has already received a proposal with a higher round number; 2.2 otherwise, set V = v, meaning that the value approved by this Acceptor is v, and broadcast an Accepted message. A minimal sketch of these message rules is given below.
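The following is a minimal in-process sketch of the message rules above for a single Paxos instance, with no networking, failures or Learners; the class names and the simplified Promise structure are illustrative assumptions.

```java
import java.util.*;

// Minimal single-instance sketch of the Prepare/Promise/Accept/Nack rules above.
public class PaxosSketch {

    static class Promise { final long acceptedRound; final String acceptedValue;
        Promise(long r, String v) { acceptedRound = r; acceptedValue = v; } }

    static class Acceptor {
        long maxRound = 0;           // B: highest round seen in Prepare
        long acceptedRound = 0;      // round in which the last value was accepted
        String acceptedValue = null; // V: value last accepted

        // Prepare(b): promise if b is newer than anything seen, otherwise reject (null).
        synchronized Promise prepare(long b) {
            if (b <= maxRound) return null;                   // Reject(B)
            maxRound = b;
            return new Promise(acceptedRound, acceptedValue); // Promise(b, V_B)
        }

        // Accept(b, v): Nack if a newer round was promised, otherwise accept v.
        synchronized boolean accept(long b, String v) {
            if (b < maxRound) return false;                   // Nack(B)
            maxRound = b; acceptedRound = b; acceptedValue = v;
            return true;                                      // Accepted
        }
    }

    static String propose(List<Acceptor> acceptors, long round, String myValue) {
        int quorum = acceptors.size() / 2 + 1;
        List<Promise> promises = new ArrayList<>();
        for (Acceptor a : acceptors) {                        // phase 1: Prepare
            Promise p = a.prepare(round);
            if (p != null) promises.add(p);
        }
        if (promises.size() < quorum) return null;            // in practice: retry with round+1
        // choose the value accepted in the highest round, otherwise use our own value
        String value = myValue; long best = 0;
        for (Promise p : promises)
            if (p.acceptedValue != null && p.acceptedRound > best) {
                best = p.acceptedRound; value = p.acceptedValue;
            }
        int accepted = 0;
        for (Acceptor a : acceptors)                          // phase 2: Accept
            if (a.accept(round, value)) accepted++;
        return accepted >= quorum ? value : null;             // approved value, or failure
    }

    public static void main(String[] args) {
        List<Acceptor> acceptors = Arrays.asList(
                new Acceptor(), new Acceptor(), new Acceptor(), new Acceptor(), new Acceptor());
        System.out.println(propose(acceptors, 1, "v1"));   // approves v1
        System.out.println(propose(acceptors, 2, "v2"));   // still v1: the approved value cannot change
    }
}
```

The second call to propose reuses the value already approved in round 1, illustrating the point below that once a value is approved it cannot be changed by higher-round proposals.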

Examples

In the basic example there are 5 Acceptors and 1 Proposer, and there are no network or downtime anomalies. We focus on the changes of variables B and V on each Acceptor and the change of variable b on the Proposer.


Initial state

Proposer sends "Prepare (1)" to all Accpetor, all Acceptor handles correctly, and replies Promise (1, NULL)

The Proposer receives five "Promise(1, NULL)" messages; since the value in more than half of the Promises is empty, it sends "Accept(1, v1)", where v1 is a value chosen by the Proposer.

At this point v1 has been approved by more than half of the Acceptors and becomes the value approved by this Paxos protocol instance. If a Learner learns a value, it can only learn v1.

In the same Paxos instance, the approved value cannot be changed; even if a subsequent Proposer initiates the Paxos protocol with a higher round number, the value cannot be changed. The core of the Paxos protocol is that "the approved value cannot be changed", which is the basis of the correctness of the whole protocol.

The Paxos protocol was deliberately designed, and its design process is also its derivation process. The Paxos protocol makes use of the Quorum mechanism, with the choice W = R = N/2 + 1.

Put simply, the protocol is a process in which a Proposer updates the Acceptors; once it succeeds in updating more than half of the Acceptors, the update succeeds. A Learner reads the Acceptors in the Quorum manner: once a value is successfully read from more than half of the Acceptors, it is an approved value. The protocol avoids deadlock by introducing rounds, so that proposals with higher round numbers preempt those with lower round numbers. The key point of the protocol design is how to satisfy the constraint that "only one value is approved in a single Paxos algorithm instance".

2.9 CAP

The definition of CAP theory is simple. The three letters of CAP represent three contradictory attributes in a distributed system:

Consistency: replica consistency in CAP theory refers to strong consistency (1.3.4)

Availability: the system can still provide services when exceptions occur

Partition tolerance: the system can tolerate and handle abnormal situations such as network partitions (1.1.4.2).

CAP theory states that it is impossible to design a distributed protocol that has all three CAP attributes at the same time, that is: 1) the replicas under the protocol are always strongly consistent, 2) the service is always available, and 3) the protocol can tolerate any network partition anomaly. Distributed system protocols can only compromise among the three.

Just as the second law of thermodynamics tells us that perpetual motion machines cannot exist and we should not try to design them, the significance of CAP theory is that we should not attempt to design a perfect system that possesses all three CAP attributes, because such a system has been proved not to exist in theory.

Lease mechanism: the Lease mechanism sacrifices A in some abnormal cases, thereby obtaining full C and good P.

Quorum mechanism: the Quorum mechanism compromises among the three CAP factors, with a certain degree of C, better A and better P; it is a relatively balanced distributed protocol.

Two-phase commit protocol: a two-phase commit system has full C, very poor A and very poor P.

Paxos protocol: also a strongly consistent protocol, Paxos is much better than the two-phase commit protocol in all three aspects of CAP. The Paxos protocol has full C, better A and better P; its A and P properties are similar to those of the Quorum mechanism, because the Paxos protocol inherently contains the factor of the Quorum mechanism.

Thank you for reading. The above is the content of "analyzing the principle of Java distributed system". After studying this article, I believe you have a deeper understanding of the principles of Java distributed systems; their specific use still needs to be verified in practice. The editor will push more related articles for you; welcome to follow!
