Brief introduction of distributed system

2025-07-08 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

What is a distributed system?

A distributed system is characterized by high cohesion and transparency.

Cohesion: each node is highly autonomous and has its own local database management system.

Transparency: the distribution of data across nodes is transparent to the user, who does not perceive the system as "distributed". That is, the user does not need to know whether a relation is split, whether replicas exist, which node the data is stored on, or at which site a transaction is executed.

CAP principle

C: consistency (Consistency)

Whether all replicas of the data in a distributed system hold the same value at the same moment.

A: availability (Availability)

After the failure of some nodes in the cluster, can the cluster as a whole still respond to the read and write requests of the client?

P: partition tolerance (Partition Tolerance)

A partition is effectively a time limit on communication: if the system cannot reach data consistency within that limit, a partition has occurred.

In a distributed system, partition tolerance is a basic requirement; without it the system loses its value. Designing a distributed data system therefore comes down to striking a balance between consistency and availability. Most web applications do not need strong consistency, so the usual approach is to obtain high availability at the expense of strong consistency, which is also the direction taken by most distributed database products.

Sacrificing consistency here means that the strong consistency of relational databases is no longer required; eventual consistency is enough, provided that the window of inconsistency is short enough to be imperceptible to users. The common practice is to achieve high availability and eventual consistency through asynchronous replication to multiple copies, and the length of the window depends on how long it takes for the data to be replicated to a consistent state.

About eventual consistency

Consistency problems arise from concurrent reads and writes. From the client's point of view, the consistency level is determined by the policy for how updated data becomes visible to different processes accessing it concurrently.

Strong consistency: as in relational databases, updated data must be visible to all subsequent accesses.

Weak consistency: the system tolerates some or all subsequent accesses not seeing the updated data.

Eventual consistency: the updated data must become visible to accesses after a period of time.
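The difference between these levels can be sketched with a toy model. The following is a minimal illustration, assuming a hypothetical primary with one asynchronously replicated copy; the class and method names (`Store`, `replicate`, `read_replica`) are invented for this example and do not come from any real database.

```python
class Store:
    """Toy primary/replica pair with asynchronous replication (illustrative only)."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # updates accepted by the primary but not yet replicated

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has stored it.
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Apply every pending update to the replica, closing the inconsistency window.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def read_replica(self, key):
        return self.replica.get(key)


store = Store()
store.write("x", 1)
stale = store.read_replica("x")  # read before replication has run
store.replicate()
fresh = store.read_replica("x")  # read after replication has run
```

Here `stale` is `None` and `fresh` is `1`: the read before replication misses the update, and the read after replication sees it, which is the behavior eventual consistency permits.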

Strong and weak consistency from the perspective of data synchronization

On the server side, distributing updated data to the whole system as quickly as possible, and so shrinking the time window of eventual consistency, is very important for a distributed system.

Assume a distributed system that keeps N copies of the data, in which an update must be written to W nodes and a read must consult R nodes:

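If W + R > N, the set of nodes written and the set of nodes read overlap, so the system is strongly consistent; an example is a relational database replicated synchronously between one master and one standby.

If W + R <= N, the two sets may not overlap, so the system is only weakly consistent. To keep the system highly available, a distributed system generally uses N >= 3.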
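The relationship between N, W and R can be checked mechanically. The sketch below is only an illustration: the function name `consistency_level` is hypothetical, and the concrete values used for the master/standby examples (N=2, W=2, R=1 for synchronous replication, N=2, W=1, R=1 for asynchronous replication) are assumptions about how such a setup maps onto the rule.

```python
def consistency_level(n: int, w: int, r: int) -> str:
    """Classify a replication setup using the W + R > N overlap rule."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must each be between 1 and N")
    # When W + R > N, any R nodes read must include at least one of the W nodes written.
    return "strong" if w + r > n else "weak"


# Assumed mapping: one master and one standby, replicated synchronously.
sync_pair = consistency_level(n=2, w=2, r=1)
# Assumed mapping: the same pair, replicated asynchronously.
async_pair = consistency_level(n=2, w=1, r=1)
```

With these assumed values, `sync_pair` is `"strong"` and `async_pair` is `"weak"`, because a read from the standby may miss a write that has only reached the master.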

Approaches to distributed system architecture

Two ways of thinking

1. Suppose one server can handle 1 million requests per second. As the business grows, peak traffic is expected to reach 2 million requests per second; if nothing is done, the server will refuse requests or even crash. The simplest solution is to add another machine (in practice, adding machines is a common way to solve this kind of problem). Each machine then handles half of the requests, and if traffic keeps growing, more machines can be added. This is horizontal expansion. How to balance the load is set aside for now.

2. Suppose an application provides several external services, and each call to a service is a request. The current server can handle 1 million requests per second, and according to current statistics, service A receives 400,000 requests per second and service B another 400,000. The business then grows, requests for both service A and service B double, and the system needs to expand. One option is to use two machines and split the load evenly, each machine handling half of service A and half of service B. Splitting evenly like this is complicated, so it is simpler to make one machine responsible only for service A and the other only for service B. This is vertical expansion.

To summarize horizontal and vertical expansion: splitting by business is vertical expansion; splitting by request is horizontal expansion.
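The two ways of splitting can be contrasted in a few lines. This is a minimal sketch under assumed names (`make_horizontal_router`, `make_vertical_router`, `server-1`, `server-2` are all hypothetical): the horizontal router spreads requests over identical servers regardless of service, while the vertical router dedicates one server to each service.

```python
import itertools


def make_horizontal_router(servers):
    """Split by request: every server runs every service; requests are spread in turn."""
    cycle = itertools.cycle(servers)
    return lambda service: next(cycle)


def make_vertical_router(service_to_server):
    """Split by business: each service is handled by its own dedicated server."""
    return lambda service: service_to_server[service]


horizontal = make_horizontal_router(["server-1", "server-2"])
vertical = make_vertical_router({"A": "server-1", "B": "server-2"})

requests = ["A", "A", "B", "B"]
horizontal_targets = [horizontal(service) for service in requests]
vertical_targets = [vertical(service) for service in requests]
```

For the four requests above, the horizontal router alternates between the two servers, while the vertical router sends both A requests to `server-1` and both B requests to `server-2`.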

Load balancing

The task of a load balancer is to decide which server in the distributed system a client's request should be sent to. The usual practice is to assign requests through an intermediate server.

Common load balancing strategies:

1. FastDFS distributed file system: the client asks the tracker for a storage node from which a file can be downloaded, the tracker returns an available storage node, and the client then communicates directly with that storage node to download the file.

The tracker here is the load balancing server.

2. Distributed RPC middleware Hedwig: the client asks ZooKeeper which server can execute the request, ZooKeeper returns an available server, and the client then communicates directly with that server.

ZooKeeper is a coordination framework for distributed systems that serves here as the load balancer. It is an open source implementation of Google's Chubby and an important component of Hadoop and HBase.

3. Nginx is also a load balancing server for distributed web servers.
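The first two examples share a lookup-then-connect pattern: the client asks an intermediary for an available server and then talks to that server directly. The sketch below models only that pattern. The `Registry` class and its methods are hypothetical and are not the API of FastDFS, ZooKeeper, or any other real system.

```python
import random


class Registry:
    """Hypothetical stand-in for a tracker or coordination service."""

    def __init__(self):
        self.available = set()

    def register(self, server):
        # A server announces that it can accept requests.
        self.available.add(server)

    def unregister(self, server):
        # A server that goes away is no longer handed out to clients.
        self.available.discard(server)

    def pick(self):
        # Return one available server; the client then contacts it directly.
        if not self.available:
            raise LookupError("no server available")
        return random.choice(sorted(self.available))


registry = Registry()
registry.register("storage-1")
registry.register("storage-2")
registry.unregister("storage-1")
chosen = registry.pick()  # only storage-2 is still registered
```

The intermediary only answers the question of where to go; the actual file download or RPC call happens between the client and the chosen server.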

Load balancing scheduling algorithm

1. Polling (Round Robin)

Requests are handed to the servers one after another in a fixed order. For this to work well, all servers attached to the virtual service should have similar resource capacity and run similarly loaded applications.
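A minimal sketch of polling, with hypothetical class and server names, might look like this:

```python
class RoundRobin:
    """Hand out servers one after another in a fixed, repeating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self.servers = list(servers)
        self.index = 0

    def next_server(self):
        server = self.servers[self.index]
        # Advance to the next server, wrapping around at the end of the list.
        self.index = (self.index + 1) % len(self.servers)
        return server


balancer = RoundRobin(["a", "b", "c"])
order = [balancer.next_server() for _ in range(5)]
```

Five consecutive requests are sent to a, b, c, a, b. Nothing in the selection looks at how busy a server is, which is why the servers need similar capacity.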

2. Weighted round robin (Weighted Round Robin)

To address the shortcoming of plain polling, each server is assigned a weight in advance according to its resources and capacity, and incoming requests are assigned to the servers in the cluster in turn, in proportion to those weights.
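One possible way to realize weighted round robin is the "smooth" variant sketched below, which interleaves servers rather than sending a burst to the heaviest one. This is only one of several approaches and is not necessarily the one the text has in mind; the class name and weights are assumptions for illustration.

```python
from collections import Counter


class WeightedRoundRobin:
    """Smooth weighted round robin: picks are proportional to weight and interleaved."""

    def __init__(self, weights):
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("every server needs a positive weight")
        self.weights = dict(weights)
        self.current = {server: 0 for server in self.weights}
        self.total = sum(self.weights.values())

    def next_server(self):
        # Every server earns credit equal to its weight on each round.
        for server, weight in self.weights.items():
            self.current[server] += weight
        # The server with the most credit is chosen and pays back the total weight.
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen


balancer = WeightedRoundRobin({"big": 3, "small": 1})
picks = Counter(balancer.next_server() for _ in range(8))
```

Over any full cycle of four requests, `big` receives three and `small` receives one, so after eight requests the counts are six and two.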

3. Minimum number of connections

Neither of the methods above takes into account how many connections each server is holding at a given time. Because different connections take different amounts of time to process, server A may have received fewer connections than server B yet still be overloaded, because the users on A keep their connections open longer.

This potential problem can be avoided by the "minimum number of connections" algorithm, where client requests are allocated according to the number of connections currently open on each machine, that is, the server with the least number of active connections automatically receives the next incoming request.

As with simple polling, each server should have similar resource capacity and similarly loaded applications. Note also that in an environment with a low request rate, traffic will not be spread evenly across the servers, because the first server is given priority.
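The selection rule can be sketched as follows. The class name `LeastConnections` and its `acquire`/`release` methods are hypothetical; a real balancer would track connections from actual socket events.

```python
class LeastConnections:
    """Send each new request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # min() returns the first server among ties, so the first server is preferred.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        if self.active[server] == 0:
            raise ValueError("no active connection to release")
        self.active[server] -= 1


balancer = LeastConnections(["A", "B"])
first = balancer.acquire()   # both idle, so A is chosen
second = balancer.acquire()  # A holds one connection, so B is chosen
balancer.release("A")
third = balancer.acquire()   # A is idle again, so A is chosen

# Low traffic: each request finishes before the next one arrives.
quiet = LeastConnections(["A", "B"])
low_traffic = []
for _ in range(3):
    server = quiet.acquire()
    low_traffic.append(server)
    quiet.release(server)
```

The second part shows the low-traffic caveat: when every connection closes before the next request arrives, all servers are tied at zero and the first server receives every request.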

4. Weighted least connection

If server resource capacity varies, the weighted least connection method is more appropriate. It combines the advantages of connection counts and weights, and in most cases it is a fairly fair algorithm.

Again, this approach requires the same attention as the minimum number of connections.
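A minimal sketch of the weighted variant, assuming the common rule of comparing active connections divided by weight (the class name and the weights are hypothetical):

```python
class WeightedLeastConnections:
    """Choose the server with the lowest ratio of active connections to weight."""

    def __init__(self, weights):
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("every server needs a positive weight")
        self.weights = dict(weights)
        self.active = {server: 0 for server in self.weights}

    def acquire(self):
        # A server with weight 3 may hold three times the connections of a weight-1 server.
        server = min(self.active, key=lambda s: self.active[s] / self.weights[s])
        self.active[server] += 1
        return server

    def release(self, server):
        if self.active[server] == 0:
            raise ValueError("no active connection to release")
        self.active[server] -= 1


balancer = WeightedLeastConnections({"big": 3, "small": 1})
for _ in range(4):
    balancer.acquire()
distribution = dict(balancer.active)
```

After four long-lived connections, `big` holds three and `small` holds one, matching the 3:1 weights.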

5. Minimum number of connections with slow start time

For methods 3 and 4, a node newly added to the cluster can be configured with a period during which its number of connections is limited and increases gradually, which gives the server a "transition time".
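How the limit grows during the slow start period depends on the load balancer. The sketch below assumes a simple linear ramp from an initial limit up to the full limit; the function name, parameters, and the linear shape are all assumptions made for illustration.

```python
def connection_limit(seconds_since_join, slow_start_seconds,
                     max_connections, initial_connections=1):
    """Connection cap for a newly added node, ramping up linearly (assumed policy)."""
    if slow_start_seconds <= 0 or seconds_since_join >= slow_start_seconds:
        # The slow start period is over: the node may take its full share.
        return max_connections
    fraction = max(seconds_since_join, 0) / slow_start_seconds
    ramped = initial_connections + fraction * (max_connections - initial_connections)
    return int(ramped)


at_join = connection_limit(0, 60, 100)
halfway = connection_limit(30, 60, 100)
finished = connection_limit(60, 60, 100)
```

With a 60-second slow start and a full limit of 100, the node is capped at 1 connection when it joins, 50 halfway through, and 100 once the period ends. A least-connections balancer would skip the node whenever it is already at its current cap.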

6. Source IP hash

The hash of the source IP is computed and used to select the real server, so requests from the same host always go to the same server. The drawback is that this can lead to an unbalanced load across servers.
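A minimal sketch of source IP hashing follows. The choice of SHA-256 and the modulo mapping are assumptions for illustration; real load balancers use their own hash functions, and the function name `pick_server` is hypothetical.

```python
import hashlib


def pick_server(client_ip, servers):
    """Map a client IP to a server so the same IP always reaches the same server."""
    if not servers:
        raise ValueError("at least one server is required")
    # A stable hash is used so the mapping does not change between processes.
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["web-1", "web-2", "web-3"]
first_visit = pick_server("203.0.113.7", servers)
second_visit = pick_server("203.0.113.7", servers)
```

Both visits from the same address land on the same server. The imbalance mentioned above arises because the mapping ignores load: if many heavy clients happen to hash to one server, that server receives all of their traffic.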
