This article explains the basic concepts of web distributed systems. The content is kept simple and easy to follow.
1. Distributed
Xiaoming's company has another three systems: system A, system B, and system C. The three systems handle different parts of the business and are deployed on three separate machines; they call each other over the network and work together to complete the company's business process.
Different business units deployed in different places constitute a distributed system. The problem now is that system A is the front face of the whole distributed system: users access it directly, and when traffic is heavy it becomes extremely slow or simply crashes. What should Xiaoming do?
Since there is only one copy of system A, it is a single point of failure.
2. Cluster (Cluster)
Xiaoming's company is not short of money, so it buys a few more machines. Xiaoming deploys several copies of system A (say, on three servers); each copy is an instance of system A and provides the same service to the outside world. If one of them breaks, the other two are still there.
The instances on these three servers form a cluster.
But from the user's point of view there are suddenly several copies of system A, each with a different IP address. Which one should be accessed?
If everyone accesses server 1.1, that server will be worked to death while the other two sit idle, a waste of money.
3. Load balancing (Load Balancer)
Xiaoming wants to spread the work of system A as evenly as possible across the three machines; for example, if there are 30,000 requests, each server should ideally handle 10,000. This is called load balancing.
Clearly, the load balancer itself is best run independently, on a separate server of its own (for example, nginx):
Later, Xiaoming finds that although the load balancer's job is simple (receive a request, dispatch the request), it can still fail, and a single point of failure appears again.
There is no choice but to turn the load balancer into a cluster as well. This new cluster differs from the system A cluster in two ways:
1. Although there are two machines in this new cluster, they can be made to expose a single IP address, so that users appear to see only one machine.
2. Only one load-balancing machine works at a time; the other stands by. If the working one goes down, the standby takes over.
4. Elasticity
If the three instances of system A still cannot handle the volume of requests, for example on Singles' Day, more servers can be added. After Singles' Day the new servers sit idle and become decoration, so Xiaoming decides to try cloud computing: in the cloud, virtual servers can be created and deleted easily, so the number of servers can be increased or decreased to match user demand.
5. Failover
The system above looks good, but it makes an unrealistic assumption:
all services are stateless; in other words, any two requests from the same user are unrelated.
In reality, most services, such as shopping carts, are stateful.
A user visits the system, a shopping cart is created on server 1.1 and a few items are added to it, and then server 1.1 goes down; subsequent visits from the user can no longer reach it. At this point a failover is needed: one of the other servers must take over and handle the user's requests.
But the question is, does server 1.2 or 1.3 have the user's shopping cart? If not, the user will complain: where is the shopping cart I just created?
Worse, suppose the user's login information is saved in the session on server 1.1. Now that the server is dead, the session is gone, and the user is kicked back to the login page and forced to log in again!
If state is handled poorly, the power of the cluster is greatly reduced; real failover cannot be achieved, and the cluster may even become unusable.
What shall I do?
One approach is to replicate the state information among the servers in the cluster and keep them in agreement. Who does this work? Application servers such as WebSphere and WebLogic do.
Another approach is to store the state information in one place that every server in the cluster can access:
Xiaoming has heard that Redis is good, so he uses Redis to store it.
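As a minimal sketch of this second approach, the snippet below (assuming the redis-py client and a reachable Redis instance; the host, key names, and fields are invented for illustration) keeps a user's cart in Redis so that any application server in the cluster can read it after a failover:

```python
import redis

# Shared state store that every server in the cluster can reach
# (host/port are placeholders for this sketch).
store = redis.Redis(host="redis.internal", port=6379, decode_responses=True)

def add_to_cart(session_id, item_id, quantity):
    """Any app server can call this; the cart lives in Redis, not in local memory."""
    store.hset(f"cart:{session_id}", item_id, quantity)
    store.expire(f"cart:{session_id}", 3600)  # cart expires after an hour of inactivity

def get_cart(session_id):
    """After a failover, another server can still see the same cart."""
    return store.hgetall(f"cart:{session_id}")
```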
Understanding distributed Architecture
As computer systems grow in scale, the architecture that concentrates all business units on one or a few mainframes can no longer keep up with the rapid development of today's computer systems, especially large Internet systems. A variety of flexible architecture models keep emerging, and distributed processing is increasingly favored by the industry: computer systems are undergoing an unprecedented shift from centralized to distributed architecture.
Centralized and distributed
Centralized system
A centralized system is one in which one or more host computers form a central node: data is stored centrally on this node, all business units of the system are deployed on it, and all functions of the system are handled there.
The most important characteristic of a centralized system is a very simple deployment structure; the underlying hardware is usually an expensive mainframe purchased from vendors such as IBM or HP. There is therefore no need to consider how to deploy a service across multiple nodes, nor to worry about coordination among nodes. However, because of the single-machine deployment, the system tends to be large, complex, and hard to maintain, constitutes a single point of failure (the failure of that one point affects the whole system or network and can paralyze it), and scales poorly.
Distributed system
A distributed system is one in which hardware or software components are located on different networked computers and communicate and coordinate with each other only by passing messages. Put simply, a group of independent computers works together to provide a service, yet to the users of the system it looks like a single computer. Being distributed means that many ordinary computers (as opposed to expensive mainframes) can be combined into a cluster to provide the service; the more computers, the more CPU, memory, and storage resources, and the more concurrent requests can be handled.
Since hosts in a distributed system communicate and coordinate mainly over the network, there are almost no restrictions on where the computers are located: they may sit in different racks, be deployed in different server rooms, in different cities, or even in different countries and regions. However spatially distributed it is, a standard distributed system has the following main characteristics:
Distribution
The computers in a distributed system can be distributed arbitrarily in space, and their distribution can change at any time.
Peer equivalence
There is no master/slave in a distributed system: no host controls the whole system and no machine is merely controlled; all the computer nodes that make up the distributed system are peers. The replica is one of the most common concepts in distributed systems; it refers to the redundancy a distributed system provides for its data and services. To offer highly available services, distributed systems commonly make replicas of data and of services. A data replica means the same data is persisted on different nodes, so that when the data stored on one node is lost it can be read from a replica; this is the most effective way to deal with data loss in a distributed system. The other kind is the service replica: multiple nodes provide the same service, and each of them can receive outside requests and process them.
Concurrency
In a network of computers, concurrent execution of programs is very common. Multiple nodes in a distributed system may, for example, operate on shared resources concurrently. Coordinating distributed concurrent operations accurately and efficiently is one of the biggest challenges in distributed system architecture and design.
Lack of global clock
In a distributed system it is hard to determine which of two events happened first, because the system lacks a global clock to order them.
Failures always happen.
Any computer that makes up a distributed system can fail, in any form. Unless the requirements explicitly allow it, no anomaly can be ignored in the system design.
Problems faced by distributed systems
Communication anomalies
A distributed system requires network communication between its nodes, and every network communication carries the risk that the network or the remote system becomes unavailable and the communication fails. Moreover, even when communication between nodes works, its latency is far greater than that of operations within a single machine, which affects the sending and receiving of messages; message loss and message delay are therefore very common.
Network partition
When the network misbehaves and the delay between some nodes keeps growing, only a subset of the nodes can still communicate normally with each other while the rest cannot. This phenomenon is called a network partition, commonly known as "split-brain". When a network partition appears, small local clusters form inside the distributed system; in extreme cases these local clusters independently carry out functions that should be performed by the whole system, which poses a great challenge to distributed consistency.
Three states
Every request and response in a distributed system has three possible outcomes, the "three states": success, failure, and timeout. When a timeout occurs, the initiator of the communication cannot tell whether the request was processed successfully.
Node failure
Node failure is another common problem in distributed environments: a server node that is part of the distributed system goes down or hangs.
Distributed Theory (1)-CAP Theorem
Preface
The CAP principle, also called the CAP theorem, states that in a distributed system the three basic requirements Consistency, Availability, and Partition tolerance cannot all be satisfied at the same time; at most two of them can be satisfied simultaneously.
A preliminary study of distributed Theory
Text
1. Brief introduction of CAP principles
Option and description:
Consistency: data remains consistent across multiple replicas (strict consistency).
Availability: the service provided by the system must remain available, and every request receives a correct response (though the data returned is not guaranteed to be the most recent).
Partition tolerance: the distributed system continues to provide service satisfying consistency and availability despite arbitrary network partition failures, unless the entire network environment fails.
What is a partition?
In a distributed system, different nodes are distributed across different subnetworks. For some special reason the subnetworks cannot reach each other, even though the network inside each subnetwork is normal, so the environment of the whole system is split into several isolated regions. This is a partition.
2. Demonstration of CAP principle
In the basic CAP scenario there are two nodes, N1 and N2, in the network. Think of them simply as two computers connected by a network. N1 runs an application A and a database V; N2 runs an application B and a database V. A and B are the two parts of the distributed system, and the two copies of V are the two sub-databases in which the system stores its data.
When consistency holds, the data in N1 and N2 are the same: both hold V0.
When availability holds, the user gets an immediate response whether the request goes to N1 or to N2.
When partition tolerance holds, the failure of either N1 or N2, or of the network between them, does not affect the normal operation of the other.
This is how the distributed system operates normally: a user sends a data update request to N1, application A updates the database value V0 to V1, the system performs the synchronization operation M so that V0 in N2 is also updated to V1, and N2 then answers requests with V1.
According to the definition of the CAP principle, the system's consistency, availability, and partition tolerance in this scenario break down as follows:
Consistency: whether the data between the database V of N1 and N2 are exactly the same.
Availability: whether N1 and N2 can respond normally to external requests.
Partition fault tolerance: whether the networks between N1 and N2 are interoperable.
That is the normal, ideal scenario. The biggest difference between a distributed system and a single-machine system is the network. Now suppose that, in an extreme case, the network between N1 and N2 is cut, and we want the system to tolerate this network anomaly, that is, to satisfy partition tolerance. Can it then satisfy both consistency and availability, or must it choose between them?
Suppose that while the network between N1 and N2 is down, a user sends a data update request to N1; the value V0 in N1 is updated to V1. Because the network is down, the synchronization operation M cannot run, so the data in N2 is still V0. Now a user sends a read request to N2; since the data has not been synchronized, the application cannot immediately return the latest value V1 to the user.
There are two options:
First: sacrifice data consistency to preserve availability, and respond to the user with the old data V0.
Second: sacrifice availability to preserve data consistency, block and wait until the network connection is restored and the synchronization operation M has completed, and then respond to the user with the latest data V1.
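A minimal sketch of the two options, using a hypothetical two-node replica pair (all names are invented for illustration): when a partition is detected, an AP-style replica answers with whatever local value it has, while a CP-style replica refuses to answer rather than return stale data.

```python
class Replica:
    """A toy replica of database V, used to illustrate the AP-versus-CP choice."""

    def __init__(self, value="V0"):
        self.value = value
        self.partitioned = False  # True when the link to the other node is down

    def read_ap(self):
        # Option 1: stay available, possibly returning stale data (here, V0).
        return self.value

    def read_cp(self):
        # Option 2: stay consistent, refusing to answer while out of sync.
        if self.partitioned:
            raise TimeoutError("partitioned: blocking until synchronization M completes")
        return self.value


n2 = Replica("V0")      # N2 never received the update to V1
n2.partitioned = True   # the network between N1 and N2 is down
print(n2.read_ap())     # AP: responds immediately with the old value V0
# n2.read_cp()          # CP: would raise instead of serving stale data
```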
This shows that, in order to satisfy partition tolerance, a distributed system can only choose between consistency and availability.
3. CAP principle tradeoff
Through CAP theory, we know that consistency, availability and partition fault tolerance cannot be satisfied at the same time, so which one should be abandoned?
3.1. CA without P
If P is dropped (partitions are not allowed), then C (strong consistency) and A (availability) can be guaranteed. But partitioning is not something you can opt out of; it will always occur, so a "CA" system in practice usually means allowing each subsystem to keep CA after a partition happens.
3.2. CP without A
If A (availability) is not required, it means that every request must be strongly consistent across servers, and a partition P can make the synchronization time grow without bound; under that condition CP can still be guaranteed. Many traditional distributed database transactions belong to this mode.
3.3. AP without C
To be highly available while allowing partitions, consistency has to be given up. Once a partition occurs, nodes may lose contact with one another; to remain highly available, each node can only serve from its local data, which leads to global data inconsistency. Many NoSQL systems fall into this category.
Summary
For most large Internet application scenarios, there are many hosts deployed across many locations, and cluster sizes keep growing, so node failures and network failures are the norm. Such applications generally have to keep service availability at N nines, so P and A are guaranteed and C is given up (settling for eventual consistency instead). Although this can hurt the user experience in places, it is not serious enough to break the user's flow.
For situations involving money, where not the slightest concession is possible, C must be guaranteed. One option is to stop the service rather than serve inconsistent data during a network failure: guarantee CA and give up P. The domestic banking industry seems to have had no small number of such incidents in recent years, but the impact was small, few people noticed, and the public knew little about them. The other option is to guarantee CP and give up A, for example allowing reads but not writes during a network failure.
Which choice is better has no final answer; it can only be decided by the scenario. Whatever fits is best.
Distributed Theory (2)-BASE Theory
Preface
The BASE theory was proposed by an eBay architect. BASE is the result of trading off consistency against availability in CAP; it comes from summarizing the practice of large-scale distributed Internet systems and evolved gradually out of the CAP theorem. Its core idea is that even when strong consistency cannot be achieved, each application can, according to its own business characteristics, adopt an appropriate way to reach eventual consistency.
Text
1. "Two out of three" in CAP is a false dilemma
In fact, it is not the case that, given P (partition tolerance), you must permanently choose between C (consistency) and A (availability). Partitions are rare, and most of the time CAP can satisfy both C and A.
When a partition exists, or its impact is anticipated, a strategy should be prepared in advance:
Detect that a partition has occurred
Enter an explicit partition mode and restrict certain operations
Start a recovery process, restore data consistency, and compensate for the errors made while the partition lasted.
2. A brief introduction to BASE theory
BASE is the abbreviation of three phrases: Basically Available, Soft State, and Eventually Consistent.
Its core idea is:
Even when strong consistency (Strong Consistency) cannot be achieved, each application can, according to its own business characteristics, adopt an appropriate way to make the system eventually consistent (Eventual Consistency).
3. The content of BASE theory.
Basically available (Basically Available)
Soft state (Soft State)
Eventual consistency (Eventually Consistent)
The following is a discussion:
3.1. Basic availability
What is basic availability? Suppose the system suffers an unpredictable failure but keeps working; compared with a normal system it may show:
Loss in response time: normally a search engine returns results to the user in 0.5 seconds, while a basically available search engine may take 2 seconds.
Loss of function: on an e-commerce site, users can normally complete every order, but during a big promotion, to protect the stability of the shopping system, some consumers may be directed to a degraded page.
3.2. Soft state
What is soft state? Its opposite, the "hard state", requires the data replicas on multiple nodes to be consistent at all times, a kind of atomicity.
Soft state means the system is allowed to have data in an intermediate state, and this state is considered not to affect the overall availability of the system; that is, replicas on different nodes are allowed to lag behind one another.
3.3. Final consistency
The soft state cannot last forever; there must be a time bound. After that period, all replicas must be brought back into agreement so that the data reaches eventual consistency. How long the period is depends on network latency, system load, the design of the data replication scheme, and so on.
In engineering practice, eventual consistency comes in five variants:
3.3.1. Causal consistency (Causal consistency)
Causal consistency means that if node A notifies node B after updating some data, node B's subsequent reads and writes of that data are based on A's updated value. Meanwhile node C, which has no causal relationship with node A, is under no such restriction when it accesses the data.
3.3.2. Read what you write (Read your writes)
Read-your-writes means that after node A updates a data item, it always sees the latest value it wrote and never the old one. It is in fact a special case of causal consistency.
3.3.3. Session consistency (Session consistency)
Session consistency frames the access to system data in a session: the system can guarantee the consistency of "reading and writing" in the same valid session, that is, after performing the update operation, the client can always read the latest value of the data item in the same session.
3.3.4. Monotonic read consistency (Monotonic read consistency)
Monotonic read consistency means that once a node has read a certain value of a data item, the system never returns an older value of that item to any of that node's subsequent reads.
3.3.5. Monotonic write consistency (Monotonic write consistency)
Monotonic write consistency means the system guarantees that writes coming from the same node are applied in order.
In practice, these five variants are often combined to build a distributed system with eventual consistency.
In fact, eventual consistency is not used only by distributed systems; relational databases use it for certain features as well, for example replication for backups. Database replication takes time, and during that window the value read from a replica is the old one; eventually the data becomes consistent. This too is a classic case of eventual consistency.
More specific distributed problems
I. Distributed transactions
A distributed transaction is one whose operations are located on different nodes while the ACID properties of the transaction still have to be guaranteed. For example, when placing an order, if the inventory and the order are not on the same node, a distributed transaction is involved.
Local message table
Principle
The local message table sits in the same database as the business data tables, so a local transaction can guarantee that operations on the two tables together satisfy the transaction properties.
On one side of the distributed operation, after writing the business data, the same local transaction also writes a message into the local message table; the local transaction guarantees that the message is recorded.
The messages in the local message table are then forwarded to a message queue (MQ) such as Kafka. If forwarding succeeds, the message is deleted from the local message table; otherwise forwarding is retried.
On the other side of the distributed operation, messages are read from the message queue and the operations they describe are executed.
Analysis.
The local message table uses local transactions to implement a distributed transaction, and the message queue guarantees eventual consistency.
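The sketch below illustrates the producer side of this pattern with SQLite standing in for the business database (the table names, columns, and the send_to_mq callback are invented for illustration; a real system would forward to Kafka or another MQ):

```python
import sqlite3

conn = sqlite3.connect("orders.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS local_messages (id INTEGER PRIMARY KEY, payload TEXT)")

def place_order(item):
    """Write the business row and the outgoing message in ONE local transaction."""
    with conn:  # commits both inserts atomically, or rolls both back
        conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        conn.execute("INSERT INTO local_messages (payload) VALUES (?)",
                     (f"deduct stock for {item}",))

def forward_messages(send_to_mq):
    """Periodically forward pending messages to the MQ; delete only on success, otherwise retry later."""
    rows = conn.execute("SELECT id, payload FROM local_messages").fetchall()
    for msg_id, payload in rows:
        if send_to_mq(payload):  # send_to_mq is a placeholder for the real MQ producer
            with conn:
                conn.execute("DELETE FROM local_messages WHERE id = ?", (msg_id,))
```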
Two-phase commit protocol (2PC)
II. Distributed locks
Within a single process, Java's built-in locks can be used for synchronization: synchronized, implemented by the JVM, and Lock, provided by the JDK. But in a distributed scenario the processes that need synchronizing may run on different nodes, so a distributed lock is needed.
Principle
Locks can be implemented as blocking locks or optimistic locks; this section mainly discusses the blocking kind. Blocking locks are usually implemented with a mutex: a value of 1 means another process holds the lock (locked state), and 0 means unlocked. The 1 and 0 can be stored as an integer, or represented by the presence or absence of some piece of data, where presence stands for 1, the locked state.
Implementations
Database unique index
To acquire the lock, insert a record into a table; to release it, delete the record. A unique index guarantees that the record can be inserted only once, so the existence of the record indicates the locked state.
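A minimal sketch of this idea, again using SQLite as a stand-in for the shared database (the table and lock names are invented for illustration):

```python
import sqlite3

db = sqlite3.connect("locks.db")
db.execute("CREATE TABLE IF NOT EXISTS dist_lock (name TEXT PRIMARY KEY)")  # unique index on name

def try_acquire(name):
    """Insert the lock row; a unique-constraint violation means someone else holds it."""
    try:
        with db:
            db.execute("INSERT INTO dist_lock (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        return False

def release(name):
    with db:
        db.execute("DELETE FROM dist_lock WHERE name = ?", (name,))

if try_acquire("order_job"):
    try:
        pass  # critical section goes here
    finally:
        release("order_job")
```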
There are several problems with this approach:
The lock has no expiration time; if unlocking fails, a deadlock results and no other thread can ever acquire the lock again.
It can only be a non-blocking lock: a failed insert reports an error immediately and cannot be retried automatically.
It is not reentrant: the same thread cannot acquire the lock again before releasing it.
The Redis SETNX command
The SETNX (SET if Not eXists) command inserts a key-value pair: if the key already exists it returns False, otherwise the insert succeeds and it returns True.
SETNX is similar to the database unique index: it guarantees that only one key-value pair with that key exists, so whether the key exists can represent whether the lock is held.
The EXPIRE command sets an expiration time on the key, which avoids the deadlock problem.
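A minimal sketch with the redis-py client (the host, key name, token, and timeout are chosen arbitrarily for illustration); setting the value and the expiry in a single SET call with the nx and px options avoids a window in which the key exists without a TTL:

```python
import uuid
import redis

r = redis.Redis(host="redis.internal", port=6379)

def acquire(lock_name, ttl_ms=30_000):
    """Returns a token if the lock was acquired, None otherwise."""
    token = str(uuid.uuid4())
    # SET key value NX PX ttl: create only if absent, with an expiry to avoid deadlock.
    if r.set(lock_name, token, nx=True, px=ttl_ms):
        return token
    return None

def release(lock_name, token):
    # Only delete the key if we still own it (best-effort check for this sketch;
    # a production version would do the check-and-delete atomically, e.g. in a Lua script).
    if r.get(lock_name) == token.encode():
        r.delete(lock_name)
```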
The Redis RedLock algorithm
Multiple independent Redis instances are used to implement the distributed lock, so that it remains usable even if a single instance fails.
Try to acquire the lock from N independent Redis instances in turn; if one instance is unavailable, move on to the next as quickly as possible.
Measure the time spent acquiring the lock; only if that time is less than the lock's expiration time, and the lock was obtained from a majority of the instances (at least N/2 + 1), is the acquisition considered successful.
If acquisition fails, release the lock on every instance.
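A rough sketch of these steps, assuming a list of independent redis-py clients (the addresses are invented); this only outlines the majority-vote logic described above, not a complete RedLock implementation (it omits clock-drift compensation and retries):

```python
import time
import uuid
import redis

instances = [redis.Redis(host=h, port=6379) for h in ("r1.internal", "r2.internal", "r3.internal")]

def redlock_acquire(lock_name, ttl_ms=30_000):
    token = str(uuid.uuid4())
    start = time.monotonic()
    acquired = 0
    for inst in instances:
        try:
            if inst.set(lock_name, token, nx=True, px=ttl_ms):
                acquired += 1
        except redis.RedisError:
            continue  # instance unavailable, move on to the next one quickly
    elapsed_ms = (time.monotonic() - start) * 1000
    # Success only with a majority of instances AND time spent well under the lock's lifetime.
    if acquired >= len(instances) // 2 + 1 and elapsed_ms < ttl_ms:
        return token
    for inst in instances:  # acquisition failed: release on every instance
        try:
            if inst.get(lock_name) == token.encode():
                inst.delete(lock_name)
        except redis.RedisError:
            pass
    return None
```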
Zookeeper ordered nodes
Zookeeper is software that provides consistency services for distributed applications, such as configuration management, distributed coordination, and centralized naming. These are very low-level, indispensable building blocks of distributed systems, yet keeping them consistent and available while achieving high throughput and low latency is actually very hard.
(1) Abstract model
Zookeeper provides a tree-shaped namespace; for example, the node /app1/p_1 has /app1 as its parent.
(2) Node types
Permanent node: does not disappear when the session ends or times out.
Ephemeral node: disappears when the session ends or times out.
Ordered node: a numeric suffix is appended to the node name, and the suffixes are ordered; for example, the ordered node /lock/node-0000000000 is followed by /lock/node-0000000001, and so on.
(3) Listeners
A listener can be registered on a node; when the node's state changes, a notification is sent to the client.
(4) Implementing a distributed lock
Create a lock directory /lock.
Create ephemeral, ordered child nodes under /lock: the first client corresponds to /lock/lock-0000000000, the second to /lock/lock-0000000001, and so on.
The client fetches the list of child nodes under /lock and checks whether the node it created has the lowest sequence number in the list. If so, it considers itself the lock holder; otherwise it listens to the child node immediately before its own and, on receiving a change notification for that node, repeats this step until it obtains the lock (a sketch follows after these steps).
Run the business code, then delete the corresponding child node when finished.
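A rough sketch of these steps using the kazoo client (the connection string and lock path are placeholders); note that kazoo also ships a ready-made Lock recipe, so this is only meant to illustrate the algorithm described above:

```python
import threading
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk.internal:2181")  # placeholder connection string
zk.start()
zk.ensure_path("/lock")

def acquire_lock():
    """Create an ephemeral sequential node and wait until it is the smallest one."""
    my_path = zk.create("/lock/lock-", b"", ephemeral=True, sequence=True)
    my_node = my_path.rsplit("/", 1)[1]
    while True:
        children = sorted(zk.get_children("/lock"))
        if children[0] == my_node:
            return my_path  # lowest sequence number: we hold the lock
        predecessor = "/lock/" + children[children.index(my_node) - 1]
        gone = threading.Event()
        # Watch only the node just before ours to avoid the herd effect.
        if zk.exists(predecessor, watch=lambda event, ev=gone: ev.set()):
            gone.wait()  # woken when the predecessor changes (typically: it is deleted)

def release_lock(my_path):
    zk.delete(my_path)
```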
(5) Session timeout
If a session that has already acquired the lock times out, the ephemeral node it created is deleted, and other sessions can then acquire the lock. As you can see, the Zookeeper-based distributed lock does not suffer from the deadlock problem of the database-unique-index lock.
(6) Herd effect
A node that fails to get the lock listens only to the child node immediately before its own. If every node instead watched all the child nodes, then a change in any child's state would notify all the others (the herd effect), whereas we only want the node's immediate successor to be notified.
III. Distributed sessions
In a distributed setting, if a user's Session is stored on only one server, then when the load balancer forwards the user's next request to another server, that server has no Session for the user, which may force the user to log in again, among other problems.
Sticky Sessions
The load balancer is configured so that all requests from a given user are routed to the same server node, so the user's Session can be stored on that node.
Disadvantage: if a server node goes down, all the Sessions on that node are lost.
Session Replication
Session synchronization is performed between server nodes, so the user can access any server node.
Disadvantages: it requires better server hardware and extra server configuration.
Persistent DataStore
Session information is persisted to a database.
Disadvantage: code for accessing the Session may have to be written by hand.
In-Memory DataStore
In-memory databases such as Redis and Memcached can be used to store Sessions, which greatly improves Session read and write performance; these databases can also persist data to disk to keep it safe.
IV. Load balancing
Algorithms
Round robin (Round Robin)
The round-robin algorithm sends requests to each of the servers in turn. For example, if six clients generate six requests sent in the order (1, 2, 3, 4, 5, 6), requests (1, 3, 5) go to server 1 and requests (2, 4, 6) go to server 2.
This algorithm suits scenarios where the servers have similar performance; if performance differs, a weaker server may be unable to bear the load.
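A minimal sketch of round robin (the server names are invented for illustration):

```python
from itertools import cycle

servers = ["server1", "server2"]
rr = cycle(servers)  # endlessly yields server1, server2, server1, ...

def pick_round_robin():
    return next(rr)

print([pick_round_robin() for _ in range(6)])
# ['server1', 'server2', 'server1', 'server2', 'server1', 'server2']
```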
Weighted round robin (Weighted Round Robin)
Weighted round robin builds on round robin by assigning each server a weight according to its performance. For example, if server 1 is given weight 5 and server 2 weight 1, then requests (1, 2, 3, 4, 5) are sent to server 1 and request (6) to server 2.
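One simple way to sketch weighted round robin is to expand each server by its weight before cycling (weights taken from the example above; production balancers such as nginx use a smoother variant of the same idea):

```python
from itertools import cycle

weights = {"server1": 5, "server2": 1}
# Expand servers by weight: server1 appears 5 times per round, server2 once.
schedule = cycle([s for s, w in weights.items() for _ in range(w)])

def pick_weighted():
    return next(schedule)

print([pick_weighted() for _ in range(6)])
# ['server1', 'server1', 'server1', 'server1', 'server1', 'server2']
```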
Least connections (Least Connections)
Because each request holds its connection for a different length of time, round robin or weighted round robin can leave one server with far more active connections than another, producing an unbalanced load. For example, requests (1, 3, 5) are sent to server 1, but (1, 3) disconnect quickly, leaving only (5) connected; requests (2, 4, 6) are sent to server 2 and only (2) disconnects. As the system keeps running, server 2 ends up carrying too much load.
The least-connections algorithm sends each new request to the server that currently has the fewest connections. In the example above, server 1 has the fewest current connections, so a new request would be sent to server 1.
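A minimal sketch of the least-connections choice, tracking live connection counts in a dict (the counts here are invented for illustration):

```python
# Current number of open connections per server (illustrative values).
active_connections = {"server1": 1, "server2": 2}

def pick_least_connections():
    # Choose the server with the fewest active connections right now.
    return min(active_connections, key=active_connections.get)

target = pick_least_connections()   # 'server1'
active_connections[target] += 1     # the new request opens a connection there
```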
Weighted least connections (Weighted Least Connections)
Building on least connections, each server is assigned a weight according to its performance, and the number of connections each server can handle is then derived from its weight.
Random algorithm (Random)
Requests are sent to servers at random. Like round robin, this algorithm suits scenarios where server performance is similar.
Source address hash (IP Hash)
The source-address hash computes a number from the client IP, takes it modulo the number of servers, and uses the result as the index of the target server.
Advantage: requests from the same client IP are always hashed to the same server.
Disadvantage: it is unfriendly to cluster scaling, because changing the number of back-end servers changes the hash results; consistent hashing can be used as an improvement.
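A minimal sketch of source-address hashing (the server list is invented; a stable hash such as CRC32 is used instead of Python's built-in hash(), which is randomized per process):

```python
import zlib

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def pick_by_ip(client_ip):
    # Stable hash of the client IP, taken modulo the server count.
    index = zlib.crc32(client_ip.encode()) % len(servers)
    return servers[index]

print(pick_by_ip("203.0.113.7"))  # the same client IP always maps to the same server
```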
Implementations
HTTP redirection
An HTTP-redirect load balancer, after receiving an HTTP request, chooses a server and writes that server's address into the HTTP redirect response returned to the browser. On receiving the redirect, the browser has to send the request again.
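A toy sketch of this scheme with Python's standard http.server (the backend addresses and port are invented; real sites rarely use this approach, for the reasons listed in the disadvantages below):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from itertools import cycle

backends = cycle(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])  # placeholder backends

class RedirectBalancer(BaseHTTPRequestHandler):
    def do_GET(self):
        # Pick a backend round-robin and tell the browser to go there.
        self.send_response(302)
        self.send_header("Location", next(backends) + self.path)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), RedirectBalancer).serve_forever()
```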
Disadvantages:
User access latency increases.
If the load balancer goes down, the site becomes inaccessible.
DNS redirection
DNS acts as the load balancer, returning different server IP addresses according to the load. Large websites almost always use this as the first level of load balancing and then use other mechanisms internally for a second level.
Disadvantages:
The DNS lookup result may be cached by the client, so all subsequent requests from that client go to the same server.
Modify MAC address
LVS (Linux Virtual Server), a link-layer load balancer, modifies the destination MAC address of the request according to the load.
Modify IP address
Modify the destination IP address of the request at the network layer.
Proxy auto-configuration (PAC)
The difference between forward and reverse proxies:
Forward proxy: happens on the client side and is initiated by the user; the client actively contacts the proxy server, which fetches the required public-network data and forwards it back to the client.
Reverse proxy: happens on the server side; the user is unaware of the proxy's existence.
A PAC server is used to determine whether a request should go through the proxy.
Highly available "brain fissure"
When it comes to high availability, we often hear "brain fissure". What exactly is "brain fissure"?
Bottom line: when two or more nodes contend for resources because they think they are the only active servers at the same time, this contention scenario is called "split-brain" or "partitioned cluster".
HeartBeat principle:
HeartBeat Heartbeat running on the standby host can detect the running status of the primary server through an Ethernet connection, and automatically take over the resources of the primary server if it cannot detect the "heartbeat" of the primary server. In general, the heartbeat connection between the primary and standby servers is a separate physical connection, which can be a serial cable and an Ethernet connection implemented by a "cross-wire". Heartbeat can even detect the working status of the primary server through multiple physical connections at the same time, and as long as it can receive information that the primary server is active through one of the connections, it will consider the primary server to be in a normal state. From a practical experience point of view, it is recommended to configure multiple independent physical connections for Heartbeat to avoid a single point of failure on the Heartbeat communication line itself.
In the "dual hot standby" high availability (HA) system, when the heartbeat line connecting two nodes is disconnected, the HA system, which is originally a whole and coordinated, is split into two independent individuals. Because they have lost contact with each other, they all think that there is something wrong with each other, and the HA software on the two nodes, like "brain crackers", instinctively scramble for "shared resources" and "application services", which will have serious consequences: or shared resources will be carved up and two sides of "services" will not be able to get up. Or both sides of the "service" are up, but at the same time read and write "shared storage", resulting in data corruption (such as database polling online log error).
1. Serial cable: considered slightly more secure than an Ethernet connection, because an attacker cannot run telnet, ssh, or rsh over a serial link, which reduces the chance of using a compromised primary to break into the standby. However, serial cables have a limited length, so the primary and standby servers must be physically close together.
2. Ethernet connection: removes the length restriction of the serial cable, and the link can also be used to synchronize the file system between primary and standby, reducing the bandwidth taken from the normal communication links.
For redundancy, two physical links should be used to carry heartbeat control information between the primary and standby servers, so that a network or cable failure does not leave both nodes believing they are the only active server and contending for resources, the scenario called "split-brain" or a "partitioned cluster". When the two nodes share the same physical device resources, split-brain can have quite terrible consequences. To avoid it, the following precautions can be taken:
Add redundant heartbeat links, for example a double line, to minimize the chance of split-brain.
Enable disk locks. The serving side locks the shared disk, so that when split-brain occurs the other side cannot take the shared disk away. But locked disks have a big problem of their own: if the side holding the shared disk does not actively "unlock" it, the other side can never obtain it; in reality, if the serving node suddenly crashes, it has no chance to run the unlock command, and the standby cannot take over the shared resources and application services. For this reason some HA products use a "smart" lock: the serving side engages the disk lock only when it finds that all heartbeat links are down (it can no longer see the peer), and leaves the disk unlocked in normal operation.
Set up an arbitration mechanism. For example, configure a reference IP (such as the gateway IP); when the heartbeat lines are completely cut, both nodes ping the reference IP. If the ping fails, the break is on the local side: not only the heartbeat but also the local link used to serve external traffic is broken, so starting (or continuing) the application service is pointless, and that node actively withdraws from the contest, while the node that can ping the reference IP starts the service. To be safer, the node that cannot ping the reference IP simply reboots itself, so as to release completely any shared resources it may still hold.
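A rough sketch of the arbitration idea on Linux, assuming a reachable gateway IP is used as the reference (the IP address, commands, and policy are illustrative only):

```python
import subprocess

REFERENCE_IP = "192.168.1.1"  # e.g. the gateway, used as the arbitration reference

def can_reach_reference():
    # Send a single ping with a short timeout; success means our own network link is fine.
    result = subprocess.run(["ping", "-c", "1", "-W", "2", REFERENCE_IP],
                            capture_output=True)
    return result.returncode == 0

def on_heartbeat_lost():
    if can_reach_reference():
        print("Peer unreachable but network is fine: take over / keep serving.")
    else:
        print("Cannot reach reference IP: the break is local, give up resources (or reboot).")
```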
Thank you for reading. That covers the basic concepts of web distributed systems; a deeper understanding will come from verifying them in practice.