Knowledge sharing in distributed systems: a correct understanding of the CAP theorem

2025-07-13 Update. From: SLTechnology News&Howtos (Shulou.com), 06/02 report.

Preface

I have read many books and many blog posts by peers on CAP, and nearly everyone understands it differently. Professor Brewer's own definition is very brief, with no detailed description or case studies. So I consulted several sources and put together this article to share with you.

The title says "correct understanding". Some points may still be inaccurate or ambiguous, but I hope that sharing and discussing them with you will get us to a correct understanding in the end.

Brief introduction

The CAP theorem, also known as Brewer's theorem, started as a conjecture put forward by Professor Eric Brewer in 2000. It states that a distributed system cannot satisfy all three of the following at the same time:

Consistency: all nodes see the same data at the same time.

Availability: every request receives a response indicating whether it succeeded or failed.

Partition tolerance: the system continues to operate even if any one part of it is lost or fails.

Many books and articles cite http://robertgreiner.com/2014/08/cap-theorem-revisited/, a blog post written by Robert Greiner in August 2014. Greiner's explanation is easier to follow than Professor Brewer's terse definition.

Definition

Original: In a distributed system (a collection of interconnected nodes that share data.), you can only have two out of the following three guarantees across a write/read pair: Consistency, Availability, and Partition Tolerance - one of them must be sacrificed.

Keywords: interconnected nodes, share data, a write/read pair

These keywords set the scope. When we talk about the CAP theorem, we are choosing two of the three properties on the premise that the nodes are interconnected, that they share data, and that the system serves reads and writes. The definition also suggests that we should not spend time and energy trying to satisfy all three at once.

For example, web clusters and memcached clusters fall outside this discussion.

A web cluster only replicates and distributes resources across nodes; the nodes are not interconnected and share no data (session IDs, in-memory caches).

A memcached cluster distributes stored data through consistent hashing on the client side; the cluster nodes are not interconnected and share no data.
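The client-side hashing mentioned for memcached can be sketched roughly as follows. This is a minimal illustration, not the code of any real memcached client: the `HashRing` class, the `replicas` parameter, and the node addresses are all hypothetical. The point is that the client alone decides where a key lives, so the cache nodes never need to talk to each other.

```python
import bisect
import hashlib


class HashRing:
    """Client-side consistent-hash ring (hypothetical helper, not a memcached API)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring at `replicas` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Assumed node addresses, for illustration only.
ring = HashRing(["cache-a:11211", "cache-b:11211", "cache-c:11211"])
print(ring.node_for("session:42"))
```

If one node is removed from the list, only the keys that lived on that node move to another node; keys on the surviving nodes stay where they were.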

In short, the CAP theorem does not cover every kind of distributed system.

Consistency

Original: A read is guaranteed to return the most recent write for a given client.

Keyword: a given client

Consistency here is slightly different from the consistency we know from ACID, which is concerned with the data integrity of the database.

The definition above does not require all nodes to hold the same data at the same time; its focus is the client. Consider this scenario: you deposit RMB 500 to a bank card at an ATM (the client). When the ATM then queries the balance, the new balance is displayed immediately, and you can withdraw the RMB 500 again. The balance query may be served by the master database immediately after the write, or by a slave database some time after the write, once synchronization has completed.
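One way to picture this client-focused consistency is the routing rule below: a read goes to the slave only if the slave already contains that client's latest write, and otherwise falls back to the master. This is a simplified sketch under assumed names (`Replicated`, `replicate`, the version counters); a real database tracks replication progress differently.

```python
class Replicated:
    """Toy master with one asynchronously updated slave (names are illustrative)."""

    def __init__(self):
        self.master = {}
        self.slave = {}
        self.master_version = 0  # bumps on every write
        self.slave_version = 0   # last version the slave has applied
        self.last_write = {}     # client id -> version of that client's latest write

    def write(self, client, key, value):
        self.master[key] = value
        self.master_version += 1
        self.last_write[client] = self.master_version

    def replicate(self):
        # Stand-in for asynchronous replication catching up.
        self.slave = dict(self.master)
        self.slave_version = self.master_version

    def read(self, client, key):
        # Serve from the slave only if it already contains this client's writes.
        if self.slave_version >= self.last_write.get(client, 0):
            return self.slave.get(key), "slave"
        return self.master.get(key), "master"


db = Replicated()
db.write("atm-1", "balance", 500)
print(db.read("atm-1", "balance"))  # (500, 'master')
print(db.read("atm-2", "balance"))  # (None, 'slave')
db.replicate()
print(db.read("atm-1", "balance"))  # (500, 'slave')
```

The ATM that made the deposit always sees its own write, while another client may briefly read older data from the slave. The nodes differ for a moment, yet the guarantee "for a given client" still holds.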

Availability

Original: A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).

Keywords: non-failing node, reasonable response

Availability here is slightly different from high availability as we usually understand it, which refers to the ability of a system to perform its functions without interruption.

A failed node is not available, because requests to it end in an error or a timeout. A reasonable response is not necessarily a successful one, but it must state accurately whether the request succeeded or failed. For example, when we read from a slave database in a SQL Server cluster, synchronization takes time, so the data returned may not be the latest, yet it is still a reasonable response.

Partition tolerance

Original: The system will continue to function when network partitions occur.

Keyword: continue to function

Suppose we build a Redis cluster with one master and two slaves. If one slave becomes unreachable because of a network failure while the master and the other slave keep working normally, we consider the cluster partition tolerant.

CA: sacrificing partition tolerance

In a distributed system, partitions are bound to occur (whether 50 minutes once every two years or 10 minutes three times a year), so most discussions of CAP take P as a given. Suppose we sacrifice P: when a node becomes unreachable because of a network failure, requests end in errors or timeouts, which conflicts with the definition of availability.

However, if no partition exists most of the time, reads and writes against a single node need no trade-off between C and A. But we just said that partitions always happen; isn't that a contradiction? It is still a trade-off. Suppose the system runs normally for 99.99% of a year and is unavailable for 0.01% of it (52.56 minutes). If that downtime is acceptable to the business, or affects only one region (South China, North China, or Central China, say), then CA is also an option.
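The 52.56-minute figure comes from simple arithmetic on the number of minutes in a year. The helper below (the function name is only illustrative) reproduces it and makes it easy to try other availability targets.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes(availability_percent):
    """Yearly downtime allowed by an availability target given in percent."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes(target):.2f} minutes/year")
```

For 99.99% this prints 52.56 minutes per year, matching the figure in the text.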

PC: sacrificing availability

The most typical cases are RDBMS clusters and Redis clusters, both of which separate reads from writes by means of master-slave replication. Suppose each is deployed as one master with several slaves and data is written on the master. To ensure that a subsequent read gets the latest data (consistency), that read still goes to the master. (The tricky part of read-write separation is that reading from a slave that has not yet synchronized can cause business errors, so to keep the business correct, a read that follows a write is sent to the master.) If a slave dies while the master and the other slaves keep running normally, the system satisfies partition tolerance. But when writes to the master end in an error or a timeout because of a network failure, the system becomes unavailable (availability is sacrificed).

At that point other mechanisms can be introduced, such as Redis Sentinel mode and its failover feature.
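The PC behaviour described above can be pictured with a toy model. This is only a sketch: `PCCluster`, `MasterUnreachable`, and the `master_reachable` flag are invented for illustration, and failover (as Redis Sentinel would provide) is deliberately not modeled. When the master is cut off, the cluster refuses the request rather than risk an inconsistent answer.

```python
class MasterUnreachable(Exception):
    """Raised instead of answering with possibly stale or lost data."""


class PCCluster:
    """Toy one-master cluster: writes and read-after-write all go to the master."""

    def __init__(self):
        self.master = {}
        self.master_reachable = True  # flipped to simulate a network partition

    def write(self, key, value):
        if not self.master_reachable:
            raise MasterUnreachable("write rejected: master is partitioned away")
        self.master[key] = value

    def read_latest(self, key):
        if not self.master_reachable:
            raise MasterUnreachable("read rejected: cannot guarantee latest value")
        return self.master.get(key)


cluster = PCCluster()
cluster.write("balance", 500)
cluster.master_reachable = False  # the partition isolates the master
try:
    cluster.write("balance", 600)
except MasterUnreachable as exc:
    print("unavailable:", exc)
```

The data stays consistent (the balance is still 500 on the master), but the client receives an error, which is exactly the loss of availability.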

PA: sacrificing consistency

The most typical cases are Cassandra clusters and Riak clusters. In this type of distributed database, any node can serve both writes and reads. In a cluster, whichever node receives a write synchronizes that data to the other nodes. Because of this synchronization, it is enough to access a single node when reading. However, the data on that node may not be up to date, because synchronization from the other nodes may not have completed yet (consistency is sacrificed). If the current node becomes unavailable for reads or writes because of a network failure, the client can switch to another node (availability).

Note that sacrificing consistency here does not mean giving it up altogether: PA chooses eventual consistency (all copies of the data in the system reach a consistent state after a period of synchronization).
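Eventual consistency can be illustrated with two toy nodes that accept writes independently and reconcile later. The sketch below assumes a last-write-wins rule based on a caller-supplied timestamp; that rule, and the `Node` and `sync` names, are simplifications chosen for illustration and do not describe how Cassandra or Riak are actually implemented.

```python
class Node:
    """Toy replica that answers reads and writes on its own."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Reconcile two nodes: for each key, both keep the entry with the newest timestamp."""
    for key in set(a.data) | set(b.data):
        newest = max(
            (e for e in (a.data.get(key), b.data.get(key)) if e is not None),
            key=lambda e: e[0],
        )
        a.data[key] = b.data[key] = newest


n1, n2 = Node("n1"), Node("n2")
n1.write("balance", 500, timestamp=1)
print(n2.read("balance"))  # None: n2 responds, but with stale data
sync(n1, n2)
print(n2.read("balance"))  # 500: the replicas have converged
```

Both nodes answer every request (availability), and they agree only after synchronization has run (eventual consistency).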

Summary

The word "sacrifice" above does not imply an either-or choice; the options can be mixed according to the design of subsystems and modules (for example PA with PC, or CA with PC).

This article gives a simple description of the CAP theorem, drawing on books, articles, and my own understanding. If you have different views or suggestions, or spot errors in the article, please point them out in the comments below and I will correct them promptly.

By: Chen Xun

Source: http://www.cnblogs.com/skychen1218/

About the author: focuses on project development on the Microsoft platform. If you have any questions or suggestions, please let me know!
