Shulou (Shulou.com), SLTechnology News&Howtos, Development. Updated 2025-01-16; report dated 06/03.
This article explains in detail what master-slave (primary/backup) replication is. I hope interested readers find it helpful.
Replication approaches
There are two main ways to implement replication in a cluster:
State transfer: the host (Primary) sends a copy of its entire state to the standby (Backup), usually as incremental updates.
Replicated state machine: the standby is treated as a deterministic state machine. Clients send operations to the host, the host forwards them to the standby in order, and every standby executes every operation. If replicas start from the same state and apply the same operations in the same order, they produce the same outputs.
VMware FT uses the replicated state machine approach.
State transfer ships memory contents, whereas a replicated state machine ships client operations and other external events. People tend to prefer replicated state machines because external operations and events are usually much smaller than the service's in-memory state. A database's in-memory state, for example, can reach gigabytes.
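The difference between the two approaches can be sketched in a few lines of Python. This is only a toy illustration: the `KVStateMachine` class, the operation tuples, and both helper functions are hypothetical names invented for this example and have nothing to do with VMware's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class KVStateMachine:
    """Toy deterministic state machine: same start state + same ops = same state."""
    data: dict = field(default_factory=dict)

    def apply(self, op):
        kind, key, value = op
        if kind == "set":
            self.data[key] = value
        elif kind == "incr":
            self.data[key] = self.data.get(key, 0) + value
        else:
            raise ValueError(f"unknown operation: {kind}")
        return self.data[key]


def replicate_by_operations(ops):
    """Replicated state machine: forward every operation, in order, to the standby."""
    host, standby = KVStateMachine(), KVStateMachine()
    for op in ops:
        host.apply(op)
        standby.apply(op)  # same operation, same order
    return host, standby


def replicate_by_state_transfer(ops):
    """State transfer: run the operations on the host only, then copy its state."""
    host = KVStateMachine()
    for op in ops:
        host.apply(op)
    standby = KVStateMachine(data=dict(host.data))  # ship the whole state
    return host, standby


ops = [("set", "x", 10), ("incr", "x", 1), ("set", "y", 5), ("incr", "y", -2)]
h1, s1 = replicate_by_operations(ops)
h2, s2 = replicate_by_state_transfer(ops)
print(h1.data == s1.data, h2.data == s2.data)
```

Both strategies end with identical replicas. What differs is what crosses the network: a handful of small operations in the first case, the full state in the second.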
The challenges of replication
Several big questions to consider:
Which state do we need to replicate?
Does the host have to wait for the standby to catch up?
When should we switch over to the standby?
Can clients observe anomalies during the switchover?
When a replica fails we need to bring up a new one, which can be expensive because the state to copy may be very large. How can we speed up adding a new replica?
Let's look at how the virtualization giant VMware handles this.
Summary of the VMware FT paper
Overview
As shown in Figure 1, by convention the primary virtual machine (Primary VM) is called the host and the backup virtual machine (Backup VM) is called the standby.
VMware FT requires two physical servers. The host and the standby run in sync, and the virtual machine's virtual disks reside on shared storage.
All inputs (network, mouse, keyboard, and so on) go to the host and are then forwarded to the standby over the logging channel. For non-deterministic operations, extra information is sent so that the standby can perform them deterministically.
Both virtual machines process the inputs, but only the host's output is returned to the client; the hypervisor discards the standby's output.
Deterministic replay
Non-deterministic events (such as virtual interrupts) and non-deterministic operations (such as reading the processor's clock cycle counter) can cause the host and the standby to diverge.
This presents three challenges:
Correctly capture all inputs and all necessary non-deterministic information, so that the standby executes deterministically.
Correctly apply those inputs and non-deterministic events on the standby.
Do so without degrading system performance.
VMware deterministic replay captures all inputs and all possible non-determinism and writes them to a log file. Reading the log file back makes it possible to replay the virtual machine's execution exactly.
For non-deterministic events, enough information must be recorded to replay them. The paper does not describe the log format; Professor Robert speculates that each entry may hold three fields:
The instruction number at which the event occurred.
The log entry type: ordinary network input, or the result of a non-deterministic ("weird") instruction.
The data.
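To make the three speculated fields concrete, here is a minimal Python sketch. Since the paper does not publish the log format, everything here is an assumption: the `LogEntry` field names, the `kind` strings, and the `replay` function are hypothetical, and the "virtual machine" is reduced to a simple instruction counter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    instruction_number: int  # position in the instruction stream where the event occurred
    kind: str                # hypothetical types: "network_input", "nondeterministic_result"
    data: bytes              # payload: packet bytes, or the value the host observed


def replay(entries, total_instructions):
    """Deliver each logged event at exactly the instruction where it was recorded."""
    pending = sorted(entries, key=lambda e: e.instruction_number)
    delivered = []
    idx = 0
    for pc in range(total_instructions):
        # Inject every event recorded for this instruction before executing it.
        while idx < len(pending) and pending[idx].instruction_number == pc:
            entry = pending[idx]
            delivered.append((pc, entry.kind, entry.data))
            idx += 1
        # (A real hypervisor would execute guest instruction `pc` here.)
    return delivered


log = [
    LogEntry(7, "nondeterministic_result", b"\x2a"),
    LogEntry(3, "network_input", b"GET /"),
]
events = replay(log, total_instructions=10)
print(events)
```

The point of the instruction number is that the standby sees each event at the same place in its execution as the host did, which is what keeps the two executions identical.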
FT protocol
VMware FT uses deterministic replay to generate log entries, but instead of writing them to disk, it sends them to the standby over the logging channel. The standby replays the entries in real time.
To achieve fault tolerance, a strict fault-tolerance protocol must be enforced on the logging channel, with the following requirement:
Output requirement: if the standby takes over after the host fails, the standby continues executing in a way that is fully consistent with all output the host has already sent to the outside world.
The simplest way to achieve this is to create a special log entry for each output operation.
That alone is not enough, however. Suppose the virtual machine runs a database and the value is 10 on both the host and the standby. A client sends an increment request; the host adds 1, replies 11 to the client, and then crashes immediately. To make matters worse, the +1 log entry the host sent to the standby is lost in transit. The standby still holds 10 when it takes over. When the client sends another increment, it receives 11 again, a strange result (two increments, yet still 11).
Hence the following rule:
Output rule: the host may not send an output to the outside world until the standby has received and acknowledged the log entry associated with that output.
This ensures that as long as the standby has received all the log entries, it can replay to the state the client last observed even if the host goes down.
As shown in Figure 2, output to the outside world is delayed until the host receives the acknowledgment from the standby.
Almost every replication system has this problem: at some point the host must stop and wait for the standby, which inevitably limits performance.
Note: because no two-phase commit transaction is used, there is no guarantee that every output is produced exactly once. The standby cannot tell whether the host crashed before or after sending its last output, so the standby may perform that output again. However, the network infrastructure (for example, TCP) detects duplicate packets, so a repeated output is not delivered to the client twice.
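The counter scenario and the effect of the output rule can be reproduced with a small simulation. This is a sketch under simplifying assumptions: the `Host` and `Standby` classes, the `channel_up` switch that models the lost log entry, and `failover_scenario` are all hypothetical, and a withheld output is represented by `None`.

```python
class Standby:
    def __init__(self, value):
        self.value = value
        self.log = []  # log entries received but not yet replayed

    def receive(self, entry):
        self.log.append(entry)
        return True  # acknowledgment sent back to the host

    def go_live(self):
        # Replay everything received before taking over as the new host.
        for kind, amount in self.log:
            if kind == "incr":
                self.value += amount
        self.log.clear()

    def handle_incr(self):
        self.value += 1
        return self.value


class Host:
    def __init__(self, value, standby, channel_up):
        self.value = value
        self.standby = standby
        self.channel_up = channel_up  # False models the lost log entry

    def handle_incr(self, enforce_output_rule):
        self.value += 1
        acked = self.channel_up and self.standby.receive(("incr", 1))
        if enforce_output_rule and not acked:
            return None  # output withheld: the client sees no reply
        return self.value


def failover_scenario(enforce_output_rule, channel_up):
    standby = Standby(10)
    host = Host(10, standby, channel_up)
    first_reply = host.handle_incr(enforce_output_rule)
    # The host crashes here; the standby takes over.
    standby.go_live()
    second_reply = standby.handle_incr()
    return first_reply, second_reply


print(failover_scenario(enforce_output_rule=False, channel_up=False))
print(failover_scenario(enforce_output_rule=True, channel_up=False))
print(failover_scenario(enforce_output_rule=True, channel_up=True))
```

Without the rule and with the log entry lost, the client sees 11 twice. With the rule, the first reply is never sent, so the 11 returned after failover is the only reply the client ever observes and nothing looks inconsistent.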
Detecting and handling failures
The host and the standby must learn of each other's failure quickly. Detection combines UDP heartbeats with monitoring of the traffic on the logging channel: a failure is declared if the heartbeats time out or the logging channel traffic stops.
If the standby fails, the host stops sending log entries on the logging channel and continues to run normally.
How does a new standby then catch up with the host? VMware has a tool called VMotion that can clone a running virtual machine with minimal interruption to its execution.
If the host fails, the standby must keep replaying until it has consumed the last log entry. It then takes over as the host and begins producing output to clients.
To ensure that only one virtual machine acts as the host at a time and to avoid split brain, VMware executes an atomic test-and-set operation on the shared storage. The operation succeeds for only one machine, which matters when a network partition leads both the host and the standby to try to take over. If the shared storage itself is unreachable because of network problems, however, the virtual machine could not do useful work anyway.
When one of the virtual machines fails, VMware FT automatically starts a new standby virtual machine on another physical machine to restore redundancy.
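The detection rule (heartbeat timeout or silent logging channel) can be sketched as follows. The `FailureDetector` class, its method names, and the 1.0 second timeout are assumptions made for illustration; timestamps are passed in explicitly so the example stays deterministic.

```python
class FailureDetector:
    """Suspects the peer if heartbeats time out OR logging channel traffic stops."""

    def __init__(self, timeout):
        self.timeout = timeout        # hypothetical value; the article gives no number
        self.last_heartbeat = 0.0
        self.last_log_traffic = 0.0

    def on_heartbeat(self, now):
        self.last_heartbeat = now

    def on_log_traffic(self, now):
        self.last_log_traffic = now

    def peer_suspected(self, now):
        heartbeat_late = now - self.last_heartbeat > self.timeout
        channel_silent = now - self.last_log_traffic > self.timeout
        return heartbeat_late or channel_silent


detector = FailureDetector(timeout=1.0)
detector.on_heartbeat(0.5)
detector.on_log_traffic(0.6)
healthy = detector.peer_suspected(now=1.0)    # both signals are recent

detector.on_heartbeat(2.5)                    # heartbeats continue...
suspected = detector.peer_suspected(now=3.0)  # ...but log traffic stopped at 0.6
print(healthy, suspected)
```

A suspicion is not proof: during a network partition both sides can suspect each other while both are alive, which is why takeover is additionally gated by the test-and-set on shared storage.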
Practical implementation details of FT
The previous section described the basic fault-tolerance design and protocol, but building a usable, robust, and automatic system requires designing and implementing many other components.
Starting and restarting FT VMs
One challenge is starting a standby in the same state as the host while the host is running. VMware VMotion already allows a running virtual machine to be migrated from one server to another with minimal disruption. For fault tolerance it was adapted into FT VMotion, which clones the virtual machine to a remote server instead of migrating it and interrupts the host for less than 1 second.
Managing the logging channel
Figure 3 illustrates the path of log entries from their generation on the host to their consumption on the standby.
The hypervisors maintain a large log buffer for the host and for the standby. The host writes log entries into its log buffer, and the standby consumes log entries from its log buffer.
If the standby finds its log buffer empty, it pauses until entries arrive. If the host finds its log buffer full when writing an entry, it also pauses until entries are flushed out, and this pause is visible to the virtual machine's clients. The implementation must therefore minimize the chance that the host's log buffer fills up.
The host's log buffer typically fills up for one of these reasons:
The bandwidth is too low. A logging channel bandwidth of 1 Gbit/s is recommended.
The standby executes too slowly and therefore consumes log entries too slowly, so the host's log buffer fills up.
VMware FT implements a mechanism that slows down the host when the standby falls too far behind (more than 1 second, according to the paper). It does so by giving the host fewer CPU resources.
Note that slowing down the host is rare and usually happens only when the system is under extreme load.
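The interaction between a bounded log buffer and the slowdown mechanism can be explored with a toy tick-based simulation. All of it is hypothetical: the single shared buffer, the rates, the `lag_limit` threshold, and the one-step-per-tick throttling policy are assumptions for illustration, not the feedback loop VMware actually uses.

```python
from collections import deque


def simulate(ticks, capacity, host_rate, standby_rate, lag_limit=None):
    """Count how often the host stalls on a full log buffer.

    If lag_limit is set, the host's rate (its CPU share) is reduced whenever
    the backlog exceeds lag_limit, and restored as the backlog drains.
    """
    buffer = deque()
    stalls = produced = consumed = 0
    rate = host_rate
    for _ in range(ticks):
        # Host: append up to `rate` entries, stalling if the buffer is full.
        for _ in range(rate):
            if len(buffer) >= capacity:
                stalls += 1
                break
            buffer.append(produced)
            produced += 1
        # Standby: consume up to `standby_rate` entries.
        for _ in range(standby_rate):
            if not buffer:
                break
            buffer.popleft()
            consumed += 1
        # Feedback: throttle the host when the standby lags too far behind.
        if lag_limit is not None:
            if len(buffer) > lag_limit and rate > 1:
                rate -= 1
            elif len(buffer) < lag_limit // 2 and rate < host_rate:
                rate += 1
    return {"stalls": stalls, "produced": produced, "consumed": consumed}


unthrottled = simulate(ticks=20, capacity=10, host_rate=3, standby_rate=2)
throttled = simulate(ticks=20, capacity=10, host_rate=3, standby_rate=2, lag_limit=4)
print(unthrottled, throttled)
```

In this toy model the unthrottled host repeatedly hits a full buffer, while the throttled host trades a little of its own speed for never stalling, which mirrors the motivation given above for slowing the host down gradually instead of pausing it.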
Disk IO implementation issues
There are some subtle implementation issues related to disk IO.
Problem 1: disk operations are non-blocking and can execute in parallel, so simultaneous accesses to the same disk location can lead to non-determinism.
Solution: detect all such IO races and force the racing disk operations to execute sequentially, in the same order on the host and the standby.
How are such races detected? The paper does not say.
Problem 2: a disk operation can also race with a memory access by an application (or the operating system) in the virtual machine.
Solution: a bounce buffer, a temporary buffer the same size as the memory being accessed by the disk operation. A disk read is modified to read the data into the bounce buffer, and the data is copied into the virtual machine's memory only when the IO completion is delivered. Similarly, for a disk write the data to be sent is first copied into the bounce buffer, and the disk write is modified to write the data from the bounce buffer.
Using bounce buffers can slow down disk operations, but the paper reports no noticeable performance loss.
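A bounce buffer can be illustrated with plain byte arrays. This is a sketch under stated assumptions: `GuestMemory`, the two helper functions, and the returned `complete` callbacks (standing in for delivery of the IO completion) are hypothetical names, and the "disk" is just a `bytearray`.

```python
class GuestMemory:
    def __init__(self, size):
        self.cells = bytearray(size)


def disk_read_with_bounce_buffer(disk, offset, length, memory, address):
    """Read from disk into a bounce buffer; touch guest memory only on completion."""
    bounce = bytearray(disk[offset:offset + length])  # the transfer lands here

    def complete():
        # Runs when the IO completion is delivered to the guest.
        memory.cells[address:address + length] = bounce

    return complete


def disk_write_with_bounce_buffer(disk, offset, memory, address, length):
    """Snapshot guest memory into a bounce buffer; write to disk from the snapshot."""
    bounce = bytes(memory.cells[address:address + length])

    def complete():
        disk[offset:offset + length] = bounce

    return complete


disk = bytearray(b"ABCDEFGH")
memory = GuestMemory(8)

finish_read = disk_read_with_bounce_buffer(disk, 0, 4, memory, 0)
before = bytes(memory.cells[:4])   # guest memory is untouched while the IO is in flight
finish_read()
after = bytes(memory.cells[:4])

finish_write = disk_write_with_bounce_buffer(disk, 4, memory, 0, 4)
memory.cells[0:4] = b"zzzz"        # guest changes memory while the write is in flight
finish_write()
print(before, after, bytes(disk))
```

Because guest memory changes only at the completion point, and a write uses the snapshot instead of live memory, the host and the standby observe the same bytes regardless of how the transfer was timed.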
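Problem 3: some disk IOs were still outstanding on the host when it failed. What should the standby do after it takes over?
Solution: the standby cannot know whether those IOs completed. It could report an error completion for each of them, but the guest operating system may not react well to errors from its local disk, so the pending IOs are re-issued when the standby goes live.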
Alternative designs
This section discusses some alternatives and their tradeoffs.
Shared and non-shared disks: VMware FT uses shared storage that both the host and the standby can access. An alternative is to give the host and the standby separate (non-shared) virtual disks, each written by its own virtual machine. This design is useful when shared storage is not accessible to both the host and the standby, or when shared storage is too expensive. The disadvantage is the extra work of keeping the two disks in sync.
Executing disk reads on the standby: in the current implementation the standby never reads from its virtual disk; the result of a disk read is treated as an input and sent over the logging channel. An alternative design lets the standby execute disk reads itself, which reduces logging channel traffic for workloads with many disk reads. This approach has two main challenges:
It may slow down the standby, because the standby must execute all disk reads itself.
What if a read succeeds on the host but fails on the standby (or vice versa)? Extra work is needed to handle failed disk reads.
VMware's performance evaluation shows that executing disk reads on the standby reduces throughput by 1-4%, but also reduces logging bandwidth.
FAQ
From: https://pdos.csail.mit.edu/6.824/papers/vm-ft-faq.txt
Q: Both GFS and VMware FT provide fault tolerance. Which is better?
FT provides fault tolerance for computation; you can use it to make any existing network server fault tolerant. FT provides fairly strict consistency and is transparent to both clients and servers. For example, you could apply FT to an existing mail server to make it fault tolerant.
GFS provides fault tolerance only for storage. Because GFS specializes in one simple service (storage), its replication is more efficient than FT's. For example, GFS does not need to make interrupts occur at exactly the same instruction on all replicas. GFS is usually only one part of a larger system that provides a complete fault-tolerant service. For example, VMware FT itself relies on fault-tolerant storage shared between the host and the standby, which could be implemented with something like GFS (although in its details GFS is not a good fit for FT).
Q: What is the atomic test-and-set operation on shared storage?
Think of a flag kept by a service on the shared storage, initially false. When the host or the standby believes the other has failed and that it should take over, it must first send a test-and-set operation to the shared storage. The pseudocode is:
test-and-set() {
    acquire_lock()
    if flag == true:
        release_lock()
        return false
    else:
        flag = true
        release_lock()
        return true
}
A machine may take over as the host only if true is returned. The main purpose is to avoid split brain (two hosts at the same time) when a network partition separates the host and the standby.
It resembles a distributed lock. The catch is that the pseudocode does not show when flag is set back to false.
The lecturer's explanation: the paper does not say when flag is reset to false. Perhaps an administrator resets it manually, or perhaps some automated cleanup does.
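The pseudocode translates directly into runnable Python. In this sketch a local `threading.Lock` stands in for whatever atomicity the shared storage service provides, and the `SharedStorageFlag` class, the contender names, and the `reset` method are hypothetical; in particular, `reset` only illustrates the open question above and is not something the paper describes.

```python
import threading


class SharedStorageFlag:
    """Stand-in for the flag kept by the shared storage service."""

    def __init__(self):
        self._lock = threading.Lock()
        self._flag = False

    def test_and_set(self):
        # Atomically: return False if already set, otherwise set it and return True.
        with self._lock:
            if self._flag:
                return False
            self._flag = True
            return True

    def reset(self):
        # Hypothetical: the paper does not say who clears the flag, or when.
        with self._lock:
            self._flag = False


flag = SharedStorageFlag()
results = {}


def try_take_over(name):
    results[name] = flag.test_and_set()


# Both sides of a partition believe the other has failed and race to take over.
contenders = [threading.Thread(target=try_take_over, args=(n,)) for n in ("host", "standby")]
for t in contenders:
    t.start()
for t in contenders:
    t.join()

winners = [name for name, won in results.items() if won]
print(winners)  # exactly one name, whichever thread got there first
```

Whichever machine wins, the other receives false and must not act as the host, which is exactly the property that prevents split brain.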