This article mainly explains how to understand Amazon's website data storage architecture. The content is simple, clear, and easy to learn and understand. Please follow the editor's train of thought to study "how to understand Amazon's website data storage architecture".
I. System overview
1. Overview of Amazon platform
The Amazon platform is a service-oriented architecture composed of hundreds of services, adhering to the principles of being highly decentralized, loosely coupled, and completely distributed, as shown in the following figure.
In this environment there is a particular need for a storage system that is always available, and Dynamo was born out of that need.
2. Overview of Dynamo
Dynamo is a highly available distributed Key-Value storage system built by Amazon that provides scalability, availability, and reliability.
In terms of the CAP theorem: partition tolerance (P) is achieved through consistent hashing, availability (A) through replication, and consistency (C) through object versioning and vector clocks. Dynamo sacrifices strong consistency (C) in favor of high availability (A), but the data eventually becomes consistent. Moreover, whether to sacrifice C for A or A for C can be tuned at deployment time according to the NWR model to balance benefit and cost.
There are three levels of concepts within Dynamo:
Key-Value: the Key uniquely identifies a data object, the Value is the body of the data object, and the data object is read and written through its Key.
Node: a node is a physical host. Each node runs three essential components: a request coordinator (request coordination), membership and failure detection, and a local persistence engine, all implemented in Java. The local persistence engine supports different storage engines, chiefly the Berkeley Database (BDB) Transactional Data Store, which is well suited to objects up to a few hundred kilobytes, along with BDB Java Edition, MySQL, and a consistent in-memory cache. The local persistence engine is a pluggable component, and the application can choose the most appropriate storage engine as needed: for example, BDB if the stored objects are usually a few kilobytes in size, and MySQL for larger objects. In production, Dynamo usually uses the BDB Transactional Data Store (a minimal interface sketch follows this list of concepts).
Instance: from the application's point of view, an instance is a service that provides read/write (IO) functionality. Each instance consists of a set of nodes, which may be located in different IDCs (data centers), so that a failure of one IDC does not lead to data loss; this gives better disaster recovery and reliability.
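As a hedged illustration of the pluggable persistence engine described under "Node", the sketch below shows what such a backend interface might look like. The class and method names are hypothetical, and Python is used for brevity even though Dynamo's real components are Java.

```python
from abc import ABC, abstractmethod


class PersistenceEngine(ABC):
    """Hypothetical pluggable storage backend (BDB, MySQL, in-memory cache, ...)."""

    @abstractmethod
    def get(self, key):
        """Return the stored value for key, or None if absent."""

    @abstractmethod
    def put(self, key, value):
        """Store value under key."""


class InMemoryEngine(PersistenceEngine):
    """Stand-in for the consistent in-memory cache backend."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, value):
        self._store[key] = value


# A node would be configured with whichever engine suits its object sizes,
# e.g. engine = InMemoryEngine(), or a BDB/MySQL-backed implementation.
```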
II. Background conditions
1. System assumptions and requirements
(1) query model
Dynamo is based on the Key-Value model rather than SQL or the relational model. Stored objects are relatively small, usually under 1 MB.
(2) ACID attribute
In traditional relational databases, ACID (Atomicity, Consistency, Isolation, Durability) is used to guarantee transactions, which often results in poor availability. Dynamo uses weaker consistency (C) to achieve high availability, does not provide isolation (I), and only allows single-Key updates.
(3) efficiency
SLAs must be satisfied on inexpensive machines, and the system is configured to meet latency and throughput requirements, so there is necessarily a tradeoff among performance, cost, availability, and durability guarantees.
(4) other assumptions
Dynamo is used only within Amazon, so its environment is considered reliable.
2. Service Level Agreement (SLA)
A service level agreement is an agreement between client and server on several metrics, typically the client's API request rate and the server's expected latency. For example: at a peak of 500 client requests per second, 99.9% of responses must be returned within 300 ms.
In the industry at large, such performance-oriented SLAs are usually described using the average, the median, and the expected variance. But those metrics only guarantee a good experience for most clients, not all of them. To solve this problem, Dynamo uses the 99.9th percentile in place of those indicators.
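As a rough illustration (with made-up numbers) of why the tail percentile is the better SLA metric, this sketch compares the average latency with the 99.9th percentile over a sample of response times:

```python
def percentile(latencies_ms, p):
    """Return the p-th percentile (0-100) of a list of latencies (nearest rank)."""
    ordered = sorted(latencies_ms)
    # index of the smallest value such that at least p% of samples are at or below it
    idx = max(0, int(len(ordered) * p / 100.0) - 1)
    return ordered[idx]


# 997 fast responses plus a short slow tail: the average looks fine,
# but the 99.9th percentile exposes the tail that the SLA cares about.
samples = [20] * 997 + [800, 900, 1000]
print(sum(samples) / len(samples))   # ~22.6 ms average
print(percentile(samples, 99.9))     # 900 ms at the 99.9th percentile
```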
3. Design considerations (data replication)
Traditional data replication algorithms are forced to sacrifice availability in order to guarantee consistency in the presence of failures: rather than risk serving data that may be incorrect, they keep the data unavailable until it is known to be absolutely correct.
However, a highly flexible system should let applications know which properties can be achieved under which conditions, and this is the case with Dynamo.
For systems in which failures are the norm, optimistic replication techniques can improve availability, but they require conflict detection and resolution, which raises two questions: when to resolve conflicts and who resolves them. Dynamo is designed to be an eventually consistent data store: all update operations eventually reach all replicas.
(1) when to coordinate
There are only two situations: writing or reading to coordinate conflicts.
Traditional data stores coordinate conflicts when writing, that is, if the data does not reach all or most of the required copies within a given time, the write may be rejected.
Amazon believes that rejecting a customer's update leads to a poor user experience; the typical example is the shopping cart service, where customers must still be able to add and remove items even when failures occur. Dynamo's goal is therefore to be "always writable", i.e. the data store's writes are highly available. In other words, to guarantee that writes are never rejected, Dynamo coordinates conflicts when reading.
(2) who will coordinate
There are only two situations: coordinated by the data store itself or by the client application.
If the data store itself coordinates, it can only use a simple policy to resolve conflicting updates, such as "last write wins".
If the client application coordinates, it can choose the conflict-resolution method that best fits its business needs.
Dynamo chooses the latter. The typical example is again the shopping cart service: the store returns all conflicting versions of the data object, and the application merges them into a single result.
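A toy illustration of such application-level merging, assuming a cart is represented as a dict of item to quantity; this is a sketch of one possible business rule, not Amazon's actual cart logic:

```python
def merge_cart_versions(versions):
    """versions: list of conflicting carts, each a dict of item_id -> quantity.
    A simple business rule: union the items, keeping the larger quantity."""
    merged = {}
    for cart in versions:
        for item, qty in cart.items():
            merged[item] = max(qty, merged.get(item, 0))
    return merged


print(merge_cart_versions([{"book": 1, "pen": 2}, {"book": 1, "mug": 1}]))
# {'book': 1, 'pen': 2, 'mug': 1}
```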
III. Key technologies
As a typical representative of this class of distributed systems, Dynamo relies on a number of key technologies that bring it a series of advantages, as shown in the following table:
1. Data partition
Hash algorithm: MD5 is applied to the Key to produce a 128-bit identifier, which determines the node that stores the Key.
To achieve incremental scalability, Dynamo uses consistent hashing for data partitioning. In consistent hashing, the output range of the hash function is treated as a ring. As shown in figure 2, each node in the system is mapped to a position on the ring, and each Key is also hashed to a position on the ring. Starting from its mapped position, a Key moves clockwise and the first node it meets becomes its storage node; equivalently, each node is responsible for the region of the ring between itself and its predecessor (walking counterclockwise).
The biggest advantage of consistent hashing is that adding or removing a node only affects its immediate neighbors and has no effect on other nodes. This looks perfect, but Amazon did not stop there, and that is part of Dynamo's strength, because two problems remain: uneven distribution of data across nodes, and disregard for the heterogeneity of node performance. To solve both, Dynamo improves on consistent hashing by introducing virtual nodes: each node is logically split into multiple virtual nodes, each of which looks like a real node, so that every node is assigned multiple points on the ring rather than a single point.
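A minimal sketch of this partitioning scheme, using MD5 (as described above) and a configurable number of virtual nodes per host; the node names and the virtual-node count are illustrative assumptions:

```python
import bisect
import hashlib


def md5_position(data: str) -> int:
    """Map a string to a position on the 128-bit ring, as in Dynamo's Key hashing."""
    return int(hashlib.md5(data.encode()).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node is placed on the ring at several virtual positions.
        self._ring = sorted(
            (md5_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's position to the first (virtual) node."""
        pos = md5_position(key)
        idx = bisect.bisect_right(self._positions, pos) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["A", "B", "C", "D"])
print(ring.node_for("cart:12345"))   # the physical node that stores this key
```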
2. Data replication
For high availability, Dynamo replicates each piece of data onto N hosts, where N is a per-instance configuration parameter with a recommended value of 3. Each Key is assigned to a coordinator node, which manages replication of the data items within its range: in addition to storing each Key in its range locally, it also copies these Keys to its N-1 clockwise successor nodes on the ring. In this way, each node in the system is responsible for the region of the ring between itself and its Nth predecessor. The specific logic is shown in figure 2: node B, besides storing key K locally, replicates it at nodes C and D, so that node D stores all keys that fall in the ranges (A, B], (B, C], and (C, D].
The list of nodes responsible for storing a particular key is called its preference list. Because of virtual nodes, some positions on the ring are skipped when constructing the preference list, so that the list contains only distinct physical nodes; this guards against node failures and ensures high availability.
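A companion sketch of preference-list construction under those rules: walk clockwise from the key's position and collect the first N distinct physical nodes, skipping further virtual nodes of hosts already chosen (again an illustrative sketch, not Dynamo's code):

```python
import bisect
import hashlib


def md5_position(data: str) -> int:
    return int(hashlib.md5(data.encode()).hexdigest(), 16)


def preference_list(ring, key, n=3):
    """ring: sorted list of (position, physical_node) pairs covering all virtual nodes.
    Returns the first n distinct physical nodes clockwise from the key's position."""
    positions = [pos for pos, _ in ring]
    idx = bisect.bisect_right(positions, md5_position(key))
    chosen = []
    for step in range(len(ring)):
        node = ring[(idx + step) % len(ring)][1]
        if node not in chosen:          # skip virtual nodes of hosts already chosen
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen


ring = sorted(
    (md5_position(f"{node}#{i}"), node)
    for node in ["A", "B", "C", "D"]
    for i in range(8)
)
print(preference_list(ring, "cart:12345"))   # e.g. three distinct physical hosts
```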
To keep replicas consistent during replication, Dynamo uses a consistency protocol similar to Quorum systems. Three key parameters (N, R, W) are involved: N is the number of hosts each data object is replicated to, with the coordinator responsible for copying the data to the other N-1 nodes (Amazon recommends N = 3); R is the minimum number of nodes that must participate in a successful read operation; and W is the minimum number of nodes that must participate in a successful write operation. If R + W > N, the effect is the same as a Quorum. In this model the read (write) latency is determined by the slowest of the R (W) replicas, so to obtain lower latency R and W are usually configured to be smaller than N. Amazon recommends setting (N, R, W) to (3, 2, 2) to strike a balance between performance and availability. R and W directly affect performance, scalability, and consistency: if W is set to 1, a write succeeds as long as a single node in the instance is available; if R is set to 1, a read succeeds as long as a single node is available. Values of R and W that are too small hurt consistency, while values that are too large hurt availability, so a balance between R and W is needed; this tunability is one of Dynamo's highlights.
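A small sketch of the quorum arithmetic just described: with the recommended (3, 2, 2), R + W > N, so every read quorum overlaps every write quorum, while smaller values trade that guarantee for latency and availability:

```python
def quorum_properties(n, r, w):
    """Summarize what a given (N, R, W) configuration buys, per the Quorum argument."""
    return {
        "read/write quorums always overlap": r + w > n,
        "writes tolerate this many failed nodes": n - w,
        "reads tolerate this many failed nodes": n - r,
    }


print(quorum_properties(3, 2, 2))  # Amazon's recommended balance
print(quorum_properties(3, 1, 1))  # fastest, but quorums may not overlap
```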
3. Version merging
As noted above, to guarantee high availability Dynamo keeps multiple replicas of every piece of data (3 are recommended). Because data is replicated to the other copies asynchronously, a get issued before replication completes may return inconsistent data; Dynamo only promises eventual consistency. On the Amazon platform the shopping cart is the typical application of this behavior: to keep the cart always available, a change applied to any replica is stored as a new version of the data, so a user may get back multiple versions, which then need to be merged. Dynamo pushes the merge down to the application, which in this case performs it when the shopping cart is read.
Dynamo uses vector clocks to capture causality between different versions of the same data on different nodes. A vector clock is essentially a list of (node, counter) pairs. Two data versions are either causally related or parallel, and the distinction is made by comparing counters: if every counter in the first clock is less than or equal to the corresponding counter in the second, the first version is an ancestor of the second and can be treated as old data and ignored; otherwise the two versions are parallel, i.e. they conflict and must be coordinated and merged.
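A compact sketch of that comparison rule, representing each clock as a dict of node to counter and using the Sx/Sy/Sz versions from the walkthrough below:

```python
def descends(clock_a, clock_b):
    """True if clock_b dominates clock_a, i.e. version A is an ancestor of version B.
    Clocks are dicts of node -> counter; a missing node counts as 0."""
    return all(clock_b.get(node, 0) >= counter for node, counter in clock_a.items())


def relation(clock_a, clock_b):
    if descends(clock_a, clock_b):
        return "A is an ancestor of B (A can be discarded)"
    if descends(clock_b, clock_a):
        return "B is an ancestor of A (B can be discarded)"
    return "parallel versions: conflict, needs reconciliation"


d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
print(relation(d2, d3))   # D2 is an ancestor of D3
print(relation(d3, d4))   # D3 and D4 are parallel -> merge required
```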
In Dynamo, when a client updates an object, it must specify which version of the data to update, which depends on the vector clock obtained during the earlier get operation.
Figure 3 shows how the vector clock is used; the specific process is as follows:
The client writes a new object. Node Sx handles the request: it increments its sequence number and creates a vector clock for the data, producing object D1 with vector clock [(Sx, 1)] on that node.
The client updates the object. Suppose the request is handled by the same node, Sx. Since the node already has D1 and vector clock [(Sx, 1)], updating the object produces D2 with vector clock [(Sx, 2)]. D2 descends from D1, i.e. D2 supersedes D1 and the counter is incremented by 1, but at this point other nodes may still hold D1 or D2, depending on network and node state.
Suppose the same client updates the object again, but the request is handled by a different server, node Sy. After the update, that node holds object D3 with vector clock [(Sx, 2), (Sy, 1)].
Suppose another client reads D2 and tries to update it, and the request is handled by yet another server, node Sz. After the update, that node holds object D4 with vector clock [(Sx, 2), (Sz, 1)].
Node data version reclamation. Four versions of the data now exist and are passed between nodes. When a node receives D3 or D4, D1 and D2 are reclaimed based on the vector clocks, because they are ancestors of D3 and D4. However, when a node holding D3 receives D4 (or vice versa), the vector clocks show that the two versions are parallel, so both are retained, and both are returned to the client on a get so that the versions can be coordinated and merged.
Suppose the client now reads the data: it gets both D3 and D4 along with their vector clocks and merges them into D5. Node Sx coordinates the write, increments its counter, and the resulting vector clock is [(Sx, 3), (Sy, 1), (Sz, 1)].
As the process above suggests, when many nodes are involved and conditions are extreme, the vector clock list can grow without bound. Dynamo uses a clock truncation scheme to address this: each (node, counter) pair carries a timestamp, and when the number of pairs reaches a threshold (for example, 10), the oldest pair is removed from the vector clock.
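A minimal sketch of that truncation rule, assuming each entry is stored as a (node, counter, last_update_timestamp) triple:

```python
def truncate_clock(clock, threshold=10):
    """clock: list of (node, counter, last_update_ts) triples.
    When the clock grows past the threshold, drop the pair updated longest ago."""
    clock = list(clock)
    while len(clock) > threshold:
        clock.remove(min(clock, key=lambda entry: entry[2]))  # oldest timestamp
    return clock
```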
4. Fault detection
(1) Ring Membership
At startup, each node stores its own mapping information for the ring and persists it to disk. Each node then picks a random peer every second and exchanges this mapping information via a Gossip protocol. Eventually every node knows the ranges handled by its peers, so a node can forward a key's read/write operation directly to the correct node without intermediate routing or extra hops.
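A toy sketch of such a gossip round, where each node's view of the ring is modeled as a dict of node name to its set of ring tokens; the data structure is an assumption, as real Dynamo exchanges persisted membership and partition information:

```python
import random


def merge_views(a, b):
    """Merge two ring views (dict of node -> set of tokens) in place, both directions."""
    for owner, tokens in list(a.items()):
        b.setdefault(owner, set()).update(tokens)
    for owner, tokens in list(b.items()):
        a.setdefault(owner, set()).update(tokens)


def gossip_round(views):
    """Once per round, every node exchanges its ring view with one random peer,
    so knowledge of the ring spreads epidemically."""
    names = list(views)
    for node in names:
        peer = random.choice([n for n in names if n != node])
        merge_views(views[node], views[peer])
```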
(2) External Discovery
If nodes A and B are both manually added to the Dynamo ring, Ring Membership alone will not detect this change immediately, and the ring will be temporarily split logically (A and B each believe they are in the ring but are unaware of each other). Dynamo uses External Discovery to solve this: some Dynamo nodes act as seed nodes, the IPs of the seed nodes are configured on the non-seed nodes, and all non-seed nodes reconcile membership with the seeds.
(3) Failure Detection
Dynamo uses a Gossip-like protocol to implement decentralized failure detection, so that every node in the system learns of the arrival and departure of other nodes.
5. Fault handling
In a traditional Quorum, the system becomes unavailable when nodes or the network fail. To improve availability, Dynamo uses Sloppy Quorum and Hinted Handoff: all read and write operations are performed on the first N healthy nodes of the preference list, data destined for a failed node is marked with a hint and sent to a healthy node instead, and the replica is handed back once the failed node becomes available again.
As above, suppose Dynamo is configured with N = 3. If node A is temporarily unavailable during a write (down or unreachable), the replica that would have gone to A is sent to node D instead, with a hint in its metadata indicating that node A was the intended recipient. D keeps this replica in a separate local store, and when it detects that A is available again it attempts to deliver the replica to A; once the transfer succeeds, D deletes the data from its local store, so the total number of replicas in the system is never reduced.
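A minimal sketch of that handoff flow; the liveness map and the in-memory stores are stand-ins for Dynamo's failure detector and local persistence engines:

```python
def write_replica(node, key, value, alive, stores, hinted):
    """Write one replica; if the intended node is down, park the data on a
    substitute node together with a hint naming the intended recipient."""
    if alive[node]:
        stores[node][key] = value
    else:
        substitute = next(n for n in alive if alive[n] and n != node)
        hinted.setdefault(substitute, []).append((node, key, value))


def handoff(alive, stores, hinted):
    """When a hinted node comes back, deliver the parked replicas and drop the hints."""
    for holder, parked in hinted.items():
        still_parked = []
        for intended, key, value in parked:
            if alive[intended]:
                stores[intended][key] = value   # replica restored to its home node
            else:
                still_parked.append((intended, key, value))
        hinted[holder] = still_parked
```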
It is very important for a highly available storage system to be able to handle the failure of an entire IDC (power outage, natural disaster, network failure, and so on), and Dynamo has this ability: it can be configured to replicate objects across multiple IDCs, i.e. a key's preference list is made up of nodes spread across several IDCs connected by high-speed dedicated lines. This cross-IDC replication scheme lets Dynamo survive the failure of an entire IDC.
In addition, to handle the case where a hinted replica becomes unavailable before it can be returned to its intended node, Dynamo implements an anti-entropy protocol to keep replicas synchronized. To detect inconsistencies between replicas faster and reduce the amount of data transferred, Dynamo uses Merkle trees.
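A simplified sketch of the Merkle-tree comparison: each replica builds a hash tree over its key range, and two replicas descend only into branches whose hashes differ, so only out-of-sync keys need to be exchanged. This is a toy binary tree over a fixed key list, not Dynamo's actual structure:

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.md5(data).digest()


def build_merkle(leaves):
    """Build a binary Merkle tree bottom-up over a list of (key, value) byte pairs.
    Each node is (hash, children); a leaf's second element is the key it covers."""
    level = [(h(key + value), key) for key, value in leaves]
    while len(level) > 1:
        level = [
            (h(b"".join(node[0] for node in level[i:i + 2])), level[i:i + 2])
            for i in range(0, len(level), 2)
        ]
    return level[0]


def differing_keys(a, b):
    """Compare two trees of the same shape; descend only where hashes differ
    and return the keys whose values are out of sync between the replicas."""
    if a[0] == b[0]:
        return []                        # identical subtrees: nothing to transfer
    if isinstance(a[1], bytes):          # leaf: this key differs between replicas
        return [a[1]]
    return [k for ca, cb in zip(a[1], b[1]) for k in differing_keys(ca, cb)]


tree_a = build_merkle([(b"k1", b"v1"), (b"k2", b"v2"), (b"k3", b"v3")])
tree_b = build_merkle([(b"k1", b"v1"), (b"k2", b"OLD"), (b"k3", b"v3")])
print(differing_keys(tree_a, tree_b))    # [b'k2']
```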
6. Capacity expansion / reduction
(1) capacity expansion
When a new node X joins the system, it is assigned some tokens scattered randomly on the ring. Node X then becomes responsible for a set of key ranges that, before it joined, were handled by existing nodes; once X joins, those nodes hand the keys over to X. Taking figure 2 as an example, suppose node X is added between A and B on the ring. After joining, X is responsible for the key ranges (F, G], (G, A], and (A, X]. Nodes B, C and D each lose part of the range they no longer need to store: before X joined, B was responsible for (F, G], (G, A], (A, B]; C for (G, A], (A, B], (B, C]; and D for (A, B], (B, C], (C, D]. After X joins, B is responsible for (G, A], (A, X], (X, B]; C for (A, X], (X, B], (B, C]; and D for (X, B], (B, C], (C, D]. Nodes B, C and D begin this transfer after receiving confirmation that node X has joined.
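The range bookkeeping in this example can be reproduced with a small helper: with N = 3, a node stores the three ring segments that end at or before its own position, walking counterclockwise. The sketch below uses single letters as stand-ins for ring tokens:

```python
def ranges_served(positions, node, n=3):
    """positions: ring tokens in clockwise order; returns the n key ranges
    (exclusive start, inclusive end] stored by the given node."""
    i = positions.index(node)
    m = len(positions)
    return [(positions[(i - k - 1) % m], positions[(i - k) % m]) for k in range(n)]


before = ["A", "B", "C", "D", "E", "F", "G"]
after = ["A", "X", "B", "C", "D", "E", "F", "G"]
print(ranges_served(before, "D"))   # [('C', 'D'), ('B', 'C'), ('A', 'B')]
print(ranges_served(after, "X"))    # [('A', 'X'), ('G', 'A'), ('F', 'G')]
print(ranges_served(after, "D"))    # [('C', 'D'), ('B', 'C'), ('X', 'B')]
```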
(2) scale down
When a node is removed from the system, keys are reallocated by the reverse of the process in (1).
7. Read / write operation
Reads and writes are driven by the request coordination component. Each client request leads to the creation of a state machine on the node handling that request, and each state machine contains all the logic for:
Identifying the nodes responsible for a key
Sending the requests
Waiting for responses
Possibly retrying
Processing the responses and packaging the reply to the client
Each state machine instance handles exactly one client request. For a read request, the state machine works as follows:
Send read requests to the appropriate nodes
Wait for the minimum number of responses required
If too few responses are received within a given time, the request fails
Otherwise, collect all versions of the data and determine which versions to return
If versioning is enabled, perform syntactic reconciliation and generate a write context, opaque to the client, that contains a vector clock subsuming all the versions
After the read response has been returned to the client, the state machine waits a short while to receive any outstanding responses; if any of them returns a stale version, the coordinator updates those nodes with the latest version, completing the read repair (a condensed sketch of this read path follows)
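A condensed sketch of the read path above: gather responses, keep only the versions that no other version dominates, and note which replicas returned stale data so they can be repaired. The vector-clock comparison reuses the `descends` idea from the version-merging section, and the default R of 2 here is only illustrative:

```python
def descends(ancestor, descendant):
    """True if 'descendant' vector clock dominates 'ancestor' (dicts node -> counter)."""
    return all(descendant.get(node, 0) >= c for node, c in ancestor.items())


def coordinate_read(responses, min_r=2):
    """responses: list of (replica, value, clock). Fail if fewer than R replied,
    otherwise return the non-dominated versions plus the replicas needing repair."""
    if len(responses) < min_r:
        raise TimeoutError("fewer than R replicas responded")
    latest = [
        (value, clock)
        for _, value, clock in responses
        if not any(descends(clock, other) and clock != other for _, _, other in responses)
    ]
    stale = [
        replica
        for replica, _, clock in responses
        if any(descends(clock, other) and clock != other for _, _, other in responses)
    ]
    return latest, stale   # 'stale' replicas then receive the latest versions (read repair)
```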
A write request usually follows a read request, and the node that replied fastest to that read is chosen as the coordinator for the write; this optimization improves the chance of read-your-writes consistency.
IV. Problems solved
1. Availability
Completely decentralized, with no single point of failure; writes are always accepted.
2. Scalability
Consistent hashing with virtual nodes: consistent hashing solves the capacity expansion/reduction problem, and virtual nodes solve the problem of machine heterogeneity.
3. Reliability
Data is replicated in multiple copies, and vector clocks solve the version-merging problem.
4. Configurable
The tradeoff is adjustable: availability and consistency are balanced according to the (N, W, R) model, with (3, 2, 2) as the suggested parameters.
Thank you for reading. The above is the content of "how to understand Amazon's website data storage architecture". After studying this article, I believe you have a deeper understanding of Amazon's website data storage architecture; the specifics still need to be verified in practice.