What is big data's brief introduction and theory of distribution? 04/18 Update SLTechnology News&Howtos

What is big data's brief introduction and theory of distribution?

2025-04-18 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/01 Report--

This article will explain in detail the introduction and theory of big data's distribution. The content of the article is of high quality, so the editor will share it with you for reference. I hope you will have some understanding of the relevant knowledge after reading this article.

First: introduction to distributed 1.1: distributed system concept

Distributed system: a business is divided into multiple services, distributed in different server nodes, and the system built together is called distributed system.

The difference between distribution and clustering: distribution is multiple people doing different things together; clustering is multiple people doing the same thing together.

Characteristics of distributed systems:

Distributive property

Reciprocity

Concurrency

Lack of global clock

1.2: evolution of distributed systems

1: single application architecture-> 2: application server is separated from database server-> 3: application server cluster-> 4: application server load balancing-> 5: database read-write separation-> 6: add search engine to relieve database pressure-> 7: add cache mechanism to relieve database pressure-> 8: database horizontal split, vertical split-> 9: application split-> 10: service

Diagram of distributed architecture evolution

1.3: problems faced by distributed systems 1.3.1 communication anomalies

Due to the unreliability of the network itself, each network communication will be accompanied by the risk of network unavailability, which will lead to the distributed system can not successfully carry out a network communication.

1.3.2 Network partition

There is a network disconnection between the networks, but the internal network of each sub-network is normal, so that the network environment of the whole system is divided into several isolated areas, and local small clusters will appear in the distributed system.

1.3.3 Node failure

Node failure is another common problem in distributed systems, which refers to the downtime or "dead" phenomenon of the server nodes that make up the distributed system.

1.3.4 three states (success, failure, timeout)

In most cases, network traffic can receive successful or failed responses, but when there is an exception in the network, there will be a timeout, usually in the following two cases:

The request was not successfully sent to the receiver, but was lost during the sending process.

The request was successfully received by the receiver and processed, but in the process of response feedback to the sender, the message was lost.

Second, distributed theory 2.1: consistency

Distributed consistency: when data is stored in multiple copies, the data in each copy is consistent.

Replica consistency: in distributed systems, there are often multiple copies of data, and the problem of synchronization is brought about by ensuring the consistency of each copy.

Therefore, how to ensure the consistency of data without affecting the performance of the system is a key consideration and tradeoff for every distributed system. As a result, the level of consistency is born:

2.1.1 strong consistency

Users change what data, read what is the data, which has a great impact on the performance of the system. It is often implemented in the way of locking, for example, the data of node 1 is changed, and the data on other nodes is locked until the changed data synchronization is successful.

2.1.2 weak consistency

After a successful write, the system does not promise to read the written value immediately, nor does it promise how long the data will be consistent, but will try its best to ensure that the data can reach the final consistency after reaching a certain time level (such as second level).

Ultimate consistency: do not consider the impact of all intermediate states, only ensure that after a period of time, the data of all copies in the system are correct after there are no new updates.

Example: for example, after A deducts the money from a bank transfer, B receives the money after a long time. As long as the final state is correct, other aspects can reduce the requirements.

Final consistency includes the following ways, depending on when and how each process accesses the data after the update:

1) read and write consistency

Read-write consistency: users read the consistency of their own writing results, ensuring that users can always see their updated content for the first time.

Example: for example, if we post a moments, it doesn't matter whether the content of the moments is seen by friends at the first time, but it must be displayed on our own list.

Solution:

We go to the main library to read some specific content every time. (the main library is under great pressure)

Set an update time window. For a period of time just updated, we all read from the master library by default. After this window, we select the slave library that has been updated recently to read it.

Record the timestamp of the user update directly, and bring this timestamp with the request. All slave libraries whose last update time is less than this timestamp will not respond.

2) monotone read consistency

Monotonous read consistency: the data read this time cannot be older than that read last time. Due to the inconsistent time for master-slave nodes to update data, users can sometimes brush it out (request to nodes that have already been synchronized) when they keep refreshing. After refreshing again, you will find that the data is gone (and the nodes that have not been synchronized yet are requested), and then refresh may be brushed out again, as if you have encountered a psychic event.

The solution is to calculate a hash value based on the user's ID and map it to the machine through the hash value. No matter how much the same user refreshes, it will only be mapped to the same machine.

3) consistency of cause and effect

It means that if node A notifies node B after updating some data, then node B's subsequent access to and modification of the data is based on the updated value of A. At the same time, there is no such restriction for data access of node C which has no causal relationship with node A.

2.2:CAP theory

The theoretical meaning of CAP is that a distributed system can not meet the three basic requirements of consistency (C:Consistency), availability (A: Availability) and partition fault tolerance (P:Partition tolerance) at most two of them at the same time.

C consistency: all nodes (or replicas) in a distributed system have the same data (consistency here means strong consistency)

Goal: if the data is successfully written to the master database, the query to the slave database is also successful, and if the data is not written to the master database, the query to the slave database also fails.

Implementation: the slave database is locked after the master database is written, and the lock is not released until the slave database synchronization is successful

Features of distributed consistency:

There will be a delay in the write operation because of the process of data synchronization

Resources are temporarily locked during synchronization

If data synchronization fails, requesting this data will return a failure message instead of old data

An availability: Reads and writes always succeed. The system is always available

Goal: respond to database query requests immediately (regardless of synchronization), with no response timeouts or errors.

Implementation: after writing data to the master database, regardless of whether the data from the slave database has been synchronized, you can directly return the data when reading the data from the slave database, even if the data is old data.

P-partition fault tolerance: can still provide services that meet consistency and availability in the event of a system node or network partition failure

Goal: even if the master database fails to synchronize data to the slave database, it will not affect the write operation; the failure of one node will not affect the service provided by the other node.

Implementation of Partition tolerance (Partition Fault tolerance):

Synchronize the data from the master database to the slave database asynchronously

Add a database node, which will be serviced by another slave node when the middle slave node dies

2.2.1 CAP can only satisfy two of them.

Abandon An and conceal CP (consistency and fault tolerance)

A system ensures consistency and partition fault tolerance, and abandons availability. In other words, in extreme cases, the system is allowed to be inaccessible, which often sacrifices the user experience and keeps the user waiting until the system data is consistent before resuming the service.

Abandon C to meet AP (availability and fault tolerance)

Most distributed systems are designed to ensure high availability and partition fault tolerance, but at the expense of consistency.

Abandon P to meet CA (consistency and availability)

If you want to abandon P, then you have to abandon the distributed system, and CAP will be out of the question and return to the traditional stand-alone service.

2.3:BASE theory

BASE theory: full name: abbreviations of three phrases: Basically Available (basic available), Soft state (soft state), and Eventually consistent (final consistency); it is a coordination of CAP theory, the result of the tradeoff between consistency and usability in CAP theory, so that the three conditions of CAP achieve good results without having to choose two or three.

The significance of BASE theory is that we do not have to choose between An or C, we can realize part of An and C, which is an extension of AP in CAP.

The core idea of BASE theory is that even if strong consistency cannot be achieved, each application can make the system achieve final consistency in an appropriate way according to its own business characteristics.

Basically Available (basic available)

When an unpredictable failure occurs in a distributed system, partial availability is allowed to be lost-but it does not mean that the system is completely unavailable.

For example, when a failure occurs, the response time can be appropriately increased by one or two seconds by using the search function; at 11: 00 am, in order to protect the stability of the system, some users will be directed to a degraded page.

Soft state (soft state)

Soft state: allows the data in the system to have an intermediate state, and it is considered that this state does not affect the overall availability of the system, that is, there is a delay in the process of allowing the system to synchronize data between replicas of multiple different nodes.

For example: takeout delivery status, for users, there is a distribution status.

Eventually consistent (final consistency)

The ultimate consistency emphasizes that all the copies of data in the system can reach a consistent state after a period of synchronization. There is no need to guarantee the strong consistency of the system data in real time. For example, the transfer will not arrive immediately, the data synchronization will be successful after a period of time, but will be successful in the end.

About big data distributed introduction and theory is shared here, I hope the above content can be of some help to you, can learn more knowledge. If you think the article is good, you can share it for more people to see.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.