Case Analysis of data Model in Database


Many inexperienced readers are unsure how to approach a case analysis of the data models used in databases. This article summarizes the relevant concepts and trade-offs, and I hope it helps you work through the topic.

The data model is arguably the most important part of software development, because it shapes how we think, how we frame problems, and how we write code. Most applications are built by layering one data model on top of another, and the key question for each layer is how to represent it in terms of the next-lower layer.

Most application development is done in object-oriented programming languages, so how well a data model can represent objects and the relationships between them becomes a key criterion when choosing one.

An object consists of attributes, and the relationships between objects are usually one-to-many/many-to-one or many-to-many.

Relational model

The relational model uses tables, rows, and columns to represent a collection of entities, a single entity, and an entity's attributes, respectively. A many-to-one relationship is expressed by storing another entity's ID in a column of the referencing entity, and a many-to-many relationship is expressed with a separate association table that stores the IDs of both entities.

The relational model has a strict schema that must be defined before data is written (schema-on-write), similar to static (compile-time) type checking in a programming language.

The classic example is a relational representation of a LinkedIn resume.
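The original figure is not reproduced here, so below is a minimal sketch of how such a resume could be laid out relationally, using Python's built-in sqlite3 module; the table and column names (users, positions, and so on) are hypothetical stand-ins, not the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT
    );
    -- one-to-many: a user has many positions (foreign key back to users)
    CREATE TABLE positions (
        position_id  INTEGER PRIMARY KEY,
        user_id      INTEGER REFERENCES users(user_id),
        job_title    TEXT,
        organization TEXT
    );
""")
conn.execute("INSERT INTO users VALUES (251, 'Bill', 'Gates')")
conn.execute("INSERT INTO positions VALUES (1, 251, 'Co-chair', 'Gates Foundation')")

# fetch the resume: one user row joined to its many position rows
for row in conn.execute("""
        SELECT u.first_name, p.job_title, p.organization
        FROM users u JOIN positions p ON p.user_id = u.user_id"""):
    print(row)
```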

Document model

The document model stores data in a JSON-like format, and each stored unit is a document. One-to-many relationships are expressed flexibly through JSON's natural nesting, while many-to-one and many-to-many relationships can still be expressed by storing the ID of the referenced document.

Compared with the relational model, the document model reduces the impedance mismatch between application code and the storage layer, and it offers better locality for one-to-many relationships.

The document model uses schema-on-read: there is no schema enforced at write time, similar to dynamic (runtime) type checking in a programming language.

For the example resume above, the representation using the document model is as follows:
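The original figure is likewise not reproduced here; as a rough illustration, the same resume as a single self-contained JSON document might look like this (field names are hypothetical), with the one-to-many positions nested inside the document:

```python
import json

resume = {
    "user_id": 251,
    "first_name": "Bill",
    "last_name": "Gates",
    # one-to-many relationship nests directly inside the document
    "positions": [
        {"job_title": "Co-chair", "organization": "Gates Foundation"},
        {"job_title": "Co-founder", "organization": "Microsoft"},
    ],
    # many-to-one references can still be stored as plain IDs
    "region_id": "us:91",
}
print(json.dumps(resume, indent=2))
```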

Graph model

The graph model emphasizes the connections between objects. When an application revolves around many interconnected objects and around querying and computing over those connections, a graph database is worth considering.

A graph consists of vertices (representing entities) and edges (relationships between entities). A large graph may contain billions of vertices and hundreds of billions of edges.

The following is an example of a social network: it represents the relationship between two people and where they live.
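The original diagram is not shown here; a minimal property-graph-style sketch of that idea, using plain Python dictionaries with hypothetical vertices and edges, could look like this:

```python
# vertices: entities such as people and places
vertices = {
    "lucy":   {"type": "person", "name": "Lucy"},
    "alain":  {"type": "person", "name": "Alain"},
    "idaho":  {"type": "location", "name": "Idaho"},
    "quincy": {"type": "location", "name": "Quincy"},
}

# edges: relationships between entities as (tail vertex, label, head vertex)
edges = [
    ("lucy",  "married_to", "alain"),
    ("alain", "married_to", "lucy"),
    ("lucy",  "lives_in",   "idaho"),
    ("alain", "lives_in",   "quincy"),
]

# simple traversal: where does each person live?
for tail, label, head in edges:
    if label == "lives_in":
        print(vertices[tail]["name"], "lives in", vertices[head]["name"])
```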

Each data model has its own query language: the relational model uses SQL, and graph models have their own query languages suited to describing graph traversals, though no single industry standard has emerged yet.

Storage engine

The data models above describe how we see the data; understanding how data is stored and retrieved internally is also very helpful when designing applications and choosing a database.

A database's main job is to store data and then query and update it. Broadly, there are two families of storage engines: the page-oriented engines of traditional relational databases and the log-structured engines common in NoSQL databases.

Page-oriented

The B-tree is the near-universal index implementation in databases. A B-tree breaks the database down into fixed-size blocks or pages, usually 4 KB to 32 KB, and reads or writes one page at a time. This design maps closely onto the underlying hardware, since disks are also arranged in fixed-size blocks.

Each page is identified by an address, and one page can reference another, much like a pointer, except that the reference lives on disk rather than in memory.

The number of references to child pages in a B-tree page is called the branching factor; it depends on the page size and the size of the index keys, and the larger it is, the better. (A four-level tree of 4 KB pages with a branching factor of 500 can store up to about 256 TB.)
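A quick back-of-the-envelope check of that capacity figure:

```python
branching_factor = 500
page_size = 4 * 1024      # 4 KB pages

# four levels of branching can address branching_factor**4 data pages
data_pages = branching_factor ** 4
capacity = data_pages * page_size
print(f"{capacity / 1e12:.0f} TB")   # -> 256 TB
```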

To query data, start at the root page (usually cached in memory) and follow page references toward the key range you want, all the way down to a leaf page.

To update a value, navigate to the leaf page and overwrite that disk page with the new data.

Inserting and deleting data may split or merge pages in order to keep the B-tree balanced.

To keep queries and writes fast, the database usually caches pages in memory. When data is updated, the disk pages are not rewritten immediately: the cached page in memory is updated first, the change is written synchronously to a write-ahead log (WAL), and the dirty pages in memory are flushed to disk asynchronously (turning random page writes into sequential log writes). After a crash, this log is used to restore the B-tree to a consistent state.
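A minimal sketch of that update path, with a hypothetical BufferPool class and a stand-in write_page_to_disk helper (not any particular database's API): append to the WAL first, dirty the cached page, and flush asynchronously later.

```python
import os

def write_page_to_disk(page_id, data):
    """Stand-in for a random write of one fixed-size page to its slot on disk."""
    pass

class BufferPool:
    def __init__(self, wal_path):
        self.pages = {}                  # page_id -> cached page contents
        self.dirty = set()               # pages changed in memory but not yet on disk
        self.wal = open(wal_path, "ab")  # write-ahead log, append-only

    def update(self, page_id, data):
        # 1. append the change to the WAL and force it to disk (sequential write)
        self.wal.write(f"{page_id}:{data}\n".encode())
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2. update only the cached page in memory and mark it dirty
        self.pages[page_id] = data
        self.dirty.add(page_id)

    def flush_dirty_pages(self):
        # 3. a background task later writes dirty pages back to their disk slots
        for page_id in list(self.dirty):
            write_page_to_disk(page_id, self.pages[page_id])
        self.dirty.clear()
```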

Log-structured

In log-structured storage, every insert or update simply appends the data to the current log file; when the file exceeds a certain size, a new file is opened for subsequent writes.

Each log-structured segment is a sequence of key-value pairs, and to make later queries efficient, the pairs within a file are required to be sorted by key. A table sorted this way is called a Sorted String Table, or SSTable.

To keep the number of segment files bounded, multiple segments are periodically merged (as in mergesort). When the same key appears in several segments, the newer value overwrites the older one, so each key appears only once in the merged segment.

An in-memory index maps keys to offsets in the segment files, and it can be sparse: one key every few kilobytes of segment file is enough, because a few kilobytes can be scanned very quickly. (Records can also be grouped into blocks and compressed before being written to disk.)

How to build and maintain SSTables (keeping the storage sorted by key):

When data is written (insert, delete, update), it is added to an in-memory balanced tree structure (such as a red-black tree), called a memtable

To avoid data loss, every write is also appended to a WAL in addition to updating the memtable (the log is used for crash recovery)

When the memtable grows beyond a threshold (typically a few megabytes), it is written out to disk as an SSTable file, which becomes the newest segment of the database

To answer a query, first look in the memtable, then search the segment files from newest to oldest (merging keeps the number of segments bounded so lookups stay efficient)

Storage engines built on this principle of merging and compacting sorted files are often called LSM storage engines.

An LSM-tree can be slow when looking up keys that do not exist, so Bloom filters are usually added to optimize such lookups.

The basic idea of the LSM-tree

Keep a cascade of SSTables that are merged in the background, and keep working even when the dataset is much larger than the available memory. Because data is stored in sorted order, range queries are efficient (scanning all keys between some minimum and maximum), and because disk writes are sequential, very high write throughput can be sustained.
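A minimal sketch of the memtable-plus-SSTable write and read path described above; a Python dict stands in for the balanced in-memory tree, in-memory sorted lists stand in for on-disk segment files, and all names are hypothetical.

```python
import bisect

MEMTABLE_LIMIT = 4   # flush threshold; real systems use a few megabytes

class TinyLSM:
    def __init__(self):
        self.memtable = {}   # in-memory structure (dict stands in for a red-black tree)
        self.segments = []   # "SSTables", newest last; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            # write the memtable out as a new segment, sorted by key
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # look in the memtable first, then in segments from newest to oldest
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"key{i}", i)
print(db.get("key3"))   # 3
```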

Transactions

In a database system, many things can go wrong:

Database software or hardware may fail at any time (including halfway through a write operation)

The application may crash at any time (including in the middle of a series of operations)

A network interruption may cut the application off from the database, or cut database nodes off from one another

Multiple clients may write to the database at the same time, overwriting each other's changes

A client may read data that makes no sense because only part of an update has been applied

Race conditions between clients can lead to a variety of unexpected results

Transactions have long been the mechanism of choice for simplifying these issues. A transaction is a way for an application to group several reads and writes into a single logical unit. Conceptually, all the reads and writes in a transaction execute as one operation: the whole transaction either succeeds, or fails and rolls back. If it fails, the application can safely retry. With transactions, error handling in the application becomes much easier, because partial failure is no longer a concern.

The safety guarantees provided by transactions are described by ACID: Atomicity, Consistency, Isolation, and Durability, terms intended to establish precise vocabulary for fault tolerance in databases.

Single-object vs. multi-object

A transaction is usually understood as a mechanism for grouping operations on multiple objects into a single unit of execution. Many distributed databases, however, provide only single-object atomicity and isolation (atomicity via a log written synchronously for crash recovery; isolation via a lock on each object so it is accessed by one thread at a time), plus slightly richer atomic operations such as increment and compare-and-set (CAS). Check whether that is enough for your application scenario.
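A minimal sketch of the single-object compare-and-set idea, using an in-memory register as a stand-in rather than any particular database's API:

```python
import threading

class Register:
    """A single object supporting atomic compare-and-set."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def compare_and_set(self, expected, new):
        # atomically: write `new` only if the current value still equals `expected`
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

counter = Register(0)

def increment(reg):
    # retry loop: re-read and try again if another writer got in first
    while True:
        current = reg.read()
        if reg.compare_and_set(current, current + 1):
            return

increment(counter)
print(counter.read())   # 1
```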

Multi-object transactions must handle atomicity and isolation in full, and in a distributed setting they may also span multiple partitions (which can sit on different machines); that is, they become distributed transactions.

Isolation level

If two transactions do not touch the same data, they can safely run in parallel, since neither depends on the other. Concurrency problems arise when one transaction reads data that another is modifying at the same time, or when two transactions try to modify the same data simultaneously.

Concurrency bugs are hard to find through testing, because they are triggered only with unlucky timing and are difficult to reproduce. For this reason, databases have long tried to hide concurrency problems from application developers by providing transaction isolation. The stronger the isolation level, the more concurrency problems are prevented; serializability, for example, guarantees that transactions have the same effect as if they ran one at a time, but at the cost of concurrency and performance. Databases therefore commonly use weaker isolation levels that prevent some, but not all, concurrency problems, so understanding them is essential for building correct applications.

Dirty writes

A dirty write is when one transaction overwrites data that another transaction has written but not yet committed. Existing isolation levels all prevent dirty writes, usually by means of row-level locks.

Dirty reads

A dirty read is when one transaction has written data but not yet committed, and another transaction reads that uncommitted data.

Non-repeatable reads

Within the same transaction, reading the same data twice returns different values (read skew), or reading the same predicate twice returns a different set of records (phantom reads).

Lost updates

Two transactions concurrently read the same value and then update it; both updates succeed, but because each update was computed from the value read earlier, one transaction's commit silently clobbers the other's, and that update is lost. The typical pattern is read -> modify -> write.

Write skew

Write skew can be seen as a generalization of the lost-update problem: two transactions read the same objects and then each updates some of them (possibly different objects), and the combined result violates an invariant that each transaction checked on its own.

Read committed

Read committed provides two guarantees:

When reading from the database, you only see data that has been committed (no dirty reads)

When writing to the database, you only overwrite data that has been committed (no dirty writes)

Repeatable read / snapshot isolation

Databases that support snapshot isolation retain several committed versions of an object, because various in-progress transactions may need to see the state of the database at different points in time. This technique is called multi-version concurrency control (MVCC).

When a transaction starts, it is assigned a unique, monotonically increasing transaction ID (txid). Whenever the transaction writes anything to the database, the data it writes is tagged with the writer's transaction ID.

A transaction can see an object only if both of the following conditions hold:

At the time the reading transaction started, the transaction that created the object had already committed

The object is not marked for deletion, or if it is, the transaction that requested deletion had not yet committed when the reading transaction started

For lost updates, and for write skew where the written data overlaps, the database can detect the conflict automatically in conjunction with snapshot isolation and abort the offending transaction. MySQL/InnoDB's repeatable read, however, does not detect lost updates; some authors argue that a database must prevent lost updates to qualify as providing snapshot isolation, and by that definition MySQL does not provide it.

Under MySQL/InnoDB's repeatable read isolation level, you can avoid lost updates with a locking read (SELECT ... FOR UPDATE) or with a compare-and-set (CAS) update.

Note that if the database allows the WHERE clause to read from an old snapshot, a compare-and-set statement may not prevent lost updates, because the WHERE condition can still evaluate as true even though another concurrent write has occurred.
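A minimal sketch of the two approaches, assuming a hypothetical accounts table and a MySQL-style Python DB-API connection conn (the exact driver calls may differ):

```python
def transfer_with_locked_read(conn, account_id, delta):
    # Option 1: locking read -- SELECT ... FOR UPDATE holds a row lock until commit
    cur = conn.cursor()
    cur.execute("BEGIN")
    cur.execute("SELECT balance FROM accounts WHERE id = %s FOR UPDATE", (account_id,))
    (balance,) = cur.fetchone()
    cur.execute("UPDATE accounts SET balance = %s WHERE id = %s",
                (balance + delta, account_id))
    conn.commit()

def transfer_with_cas(conn, account_id, delta):
    # Option 2: compare-and-set -- only update if the row still holds the value we read
    cur = conn.cursor()
    cur.execute("SELECT balance FROM accounts WHERE id = %s", (account_id,))
    (balance,) = cur.fetchone()
    cur.execute("UPDATE accounts SET balance = %s WHERE id = %s AND balance = %s",
                (balance + delta, account_id, balance))
    conn.commit()
    return cur.rowcount == 1   # False means another writer got in first; the caller retries
```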

Serializability

Write skew in which the written data does not overlap, however, can only be prevented by the serializable isolation level, although at the application level you can materialize the conflict by artificially introducing a lockable object into the database.

There are three ways to implement the serializable isolation level:

Actual serial execution: literally executing transactions one at a time, in sequence.

Two-phase locking: reads take a shared lock on the object being read and writes take an exclusive lock on the object being written (a shared lock can be upgraded to an exclusive one). Shared locks do not conflict with one another, but a shared lock conflicts with an exclusive lock, and exclusive locks conflict with each other. The database also automatically detects deadlocks between transactions and aborts one of them. Two-phase locking is a so-called pessimistic concurrency control mechanism.

Serializable snapshot isolation (SSI): an optimistic concurrency control mechanism that reads and writes without locking, detects serialization conflicts between writes with a specific algorithm when a transaction commits, and decides which transactions to abort. The advantage is that reads and writes do not block each other and read-only queries run on a consistent snapshot, which is very attractive for read-heavy workloads. However, the abort rate significantly affects SSI's overall performance: transactions that read and write over a long period are likely to conflict and be aborted, so SSI requires read-write transactions to be kept as short as possible.

Distributed transactions

In a multi-object transaction, if the objects live in different partitions, you need a distributed transaction. Any discussion of distributed transactions has to start with two-phase commit, which is their basic building block.

Two-phase commit

Two-phase commit (2PC) is an algorithm for achieving atomic transaction commit across multiple nodes. It may be used internally by the database, or exposed to applications in the form of XA transactions.

Two-phase commit introduces a coordinator role and proceeds in two phases overall; the process is as follows:

When the application wants to start a distributed transaction, it requests a globally unique transaction ID from the coordinator.

The application starts a single-node transaction on each participant, each tagged with the global transaction ID, and performs all reads and writes within those single-node transactions. If anything goes wrong at this stage, the coordinator or any participant can abort.

When the application is ready to commit, the coordinator sends a prepare request, tagged with the global transaction ID, to all participants. If any of those requests fails or times out, the coordinator sends an abort request for that transaction ID to all participants.

When a participant receives a prepare request, it must make sure the transaction can definitely be committed under all circumstances. That includes writing all transaction data to disk (a crash, a power failure, or running out of disk space is not an acceptable excuse for refusing to commit later) and checking for conflicts or constraint violations. Once the participant promises, it is not allowed to go back on that promise.

When the coordinator has received responses to all prepare requests, it makes a definitive decision to commit or abort the transaction (committing only if every participant voted yes). The coordinator must write that decision to its transaction log on disk.

Once the coordinator's decision has been written to disk, the commit or abort request is sent to all participants. If a request fails or times out, the coordinator must retry forever until it succeeds.

The inherent cost of two-phase commit: the forced disk flushes required for crash recovery and the extra network round trips mean the whole process holds locks on resources for longer.
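A minimal sketch of the coordinator side of the steps above; the participant objects and their prepare/commit/abort methods are hypothetical, not a real XA API:

```python
def two_phase_commit(coordinator_log, participants, txid):
    # Phase 1: send prepare to every participant
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txid))   # participant durably promises before voting yes
        except Exception:
            votes.append(False)

    decision = "commit" if all(votes) else "abort"

    # The decision is the point of no return: write it to the coordinator's log first
    coordinator_log.append((txid, decision))

    # Phase 2: tell every participant; on failure the coordinator must retry forever
    for p in participants:
        while True:
            try:
                p.commit(txid) if decision == "commit" else p.abort(txid)
                break
            except Exception:
                continue   # keep retrying until the participant acknowledges
```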

Percolator

Percolator is a system Google developed for incremental processing and updating over a big-data cluster, used mainly for Google's web search indexing. After replacing the original batch indexing system with the Percolator-based incremental system, Google reduced the average delay for documents to appear in search results by 50% while processing the same volume of data.

Percolator implements a decentralized (coordinator-free) two-phase commit: building on BigTable's single-row transactions, it provides a cross-row transaction engine. In addition, it achieves snapshot isolation by using BigTable's multiple timestamped versions of each value.

Percolator relies on a central timestamp service but has no single coordinator role; the locking protocol is coordinated entirely by the clients, which means locks can be leaked when a client crashes. Percolator cleans up leaked locks lazily: when another client Get()s the same row and encounters a lock, it either backs off and retries or cleans up the lock itself.

However, because Percolator uses an optimistic conflict-detection mechanism, it does not handle concurrent updates to hot rows well. In my view this can be addressed by implementing a pessimistic locking mechanism on top of Percolator.

Partitioning

Partitioning, also known as sharding, splits the dataset into multiple partitions, each stored on a different machine, which expands overall storage capacity and improves read and write performance. It also creates new difficulties, however, when reads and writes have to span partitions.

Partitioning schemes

The goal of partitioning is to spread the data and the query load evenly across nodes. If partitioning is unfair, or hot-spot data is not taken into account, some partitions end up with more data or more queries than others; this is called skew. Partitioning is usually based on a key, so where skew is a concern, the key design deserves special attention in light of the database's particular partitioning algorithm.

Partitioning by key range assigns each partition a contiguous range of keys, with the boundaries usually chosen automatically by the database. The advantage is that range scans are very simple; the drawback is that a poorly designed key can concentrate load on hot partitions and hurt query efficiency.

Partitioning by hash of key runs the key through a hash function and partitions on the result. This removes most of the risk of skew and hot spots, but loses the ability to do efficient range queries over the original key.
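A minimal sketch contrasting the two schemes, with hypothetical range boundaries and partition count:

```python
import hashlib

# Range partitioning: each partition owns a contiguous range of keys
RANGE_BOUNDARIES = ["g", "n", "t"]   # partition 0: < "g", 1: < "n", 2: < "t", 3: the rest

def range_partition(key: str) -> int:
    for i, boundary in enumerate(RANGE_BOUNDARIES):
        if key < boundary:
            return i
    return len(RANGE_BOUNDARIES)

# Hash partitioning: a hash of the key decides the partition, destroying key order
NUM_PARTITIONS = 4

def hash_partition(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

print(range_partition("alice"), hash_partition("alice"))
```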

Some databases, such as Cassandra, take a compromise approach with a compound primary key made up of several columns. Only the first column is hashed to determine the partition, while the remaining columns form a concatenated index used to sort the data in Cassandra's SSTables. A query therefore cannot range-scan over the first column, but once the first column is fixed to a particular value it can perform an efficient range scan over the other columns of the key. The concatenated-index approach provides an elegant data model for one-to-many relationships.

Secondary indexes

We discussed partitioning strategies for primary keys above, but secondary indexes are also needed in practice, especially in the relational model.

There are two ways to partition a secondary index: local indexes and global indexes.

With a local index (a document-partitioned index), each partition is completely independent: every partition maintains its own secondary index, covering only the documents in that partition. On a write (insert, delete, update), only the index within the affected partition needs updating. On a query, however, the request must be sent to all partitions and the returned results merged.

This approach to querying a partitioned database is sometimes called scatter/gather, and it makes reads on secondary indexes quite expensive: even when the partitions are queried in parallel, it is prone to tail-latency amplification. MongoDB, Cassandra, Elasticsearch, and SolrCloud all use document-partitioned secondary indexes.
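A minimal sketch of a scatter/gather query over document-partitioned (local) secondary indexes; each partition's index is just a dict from the indexed value to matching document IDs, and all data is made up for illustration:

```python
# each partition keeps its own local secondary index: color -> doc ids in that partition
partition_indexes = [
    {"red": [191, 214], "silver": [515]},      # partition 0
    {"red": [768], "black": [893]},            # partition 1
    {"silver": [306], "red": [341]},           # partition 2
]

def query_by_color(color):
    # scatter: ask every partition, since any of them may hold matching documents
    results = []
    for index in partition_indexes:
        results.extend(index.get(color, []))
    # gather: merge the partial results into one answer
    return sorted(results)

print(query_by_color("red"))   # [191, 214, 341, 768]
```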

With a global index (a term-partitioned index), the index itself is partitioned by the indexed term, in the same way primary keys are partitioned. Compared with the document-partitioned index, reads are more efficient: there is no scatter/gather across all partitions, and a client only needs to query the partition that holds the term it is looking for. The downside is that writes are slower and more complicated, because a write to a single document may touch multiple partitions of the index.

Ideally the index would always be up to date, with every document written to the database immediately reflected in the index. For a term-partitioned global index, that would require a distributed transaction across partitions, which not all databases support; in practice, updates to global secondary indexes are usually asynchronous.

Rebalancing partitions

As the dataset grows and query throughput increases, more machines are needed to handle the load. All of this requires moving data and requests from one node to another, a process called rebalancing.

Rebalancing is usually expected to meet the following requirements:

After rebalancing, the load (data storage, read, and write requests) should be fairly shared among the nodes in the cluster

While rebalancing is in progress, the database should continue to accept reads and writes

Only the necessary data should be moved between nodes, so that rebalancing is fast and the load on the network and disks stays low

Rebalancing strategies fall into a few types: a fixed number of partitions, a dynamic number of partitions, and a number of partitions proportional to the number of nodes.

With a fixed number of partitions, you create many more partitions than there are nodes and assign several partitions to each node. When a node is added to the cluster, the new node steals a few partitions from every existing node until the partitions are fairly distributed again. Elasticsearch uses this partitioning strategy.

Only whole partitions move between nodes; the number of partitions does not change, nor does the partition a given key maps to. The only thing that changes is which node a partition lives on. The move is not instantaneous (transferring data over the network takes time), so the old location keeps serving reads and writes for the partition while the transfer is in progress.

The number of partitions is usually fixed when the database is first set up and does not change afterwards. Each partition holds a fixed fraction of the total data, so partition size grows in proportion to the total amount of data in the cluster. That makes the right number of partitions hard to choose when the eventual dataset size is hard to predict: partitions that are too large make rebalancing and recovery from node failures expensive, while partitions that are too small generate too much overhead.
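A minimal sketch of the fixed-number-of-partitions idea: far more partitions than nodes, and a new node steals whole partitions until the assignment is roughly even again (the assignment structures are hypothetical):

```python
def assign_partitions(num_partitions, nodes):
    # initial assignment: spread partitions round-robin over the nodes
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}

def add_node(assignment, new_node):
    # the new node steals partitions from the most loaded nodes, one at a time
    nodes = set(assignment.values()) | {new_node}
    target = len(assignment) // len(nodes)         # fair share per node
    load = {}
    for node in assignment.values():
        load[node] = load.get(node, 0) + 1
    stolen = 0
    for partition, node in sorted(assignment.items()):
        if stolen >= target:
            break
        if load[node] > target:
            assignment[partition] = new_node       # only the partition moves; keys don't change
            load[node] -= 1
            stolen += 1
    return assignment

assignment = assign_partitions(12, ["node0", "node1", "node2"])
print(add_node(assignment, "node3"))
```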

A dynamic number of partitions: for databases partitioned by key range, a fixed number of partitions with fixed boundaries would be very inconvenient, since badly placed boundaries could leave some partitions with no data at all. Databases partitioned by key range therefore usually create partitions dynamically.

When a partition grows beyond a configured size, it is split into two partitions, each holding roughly half the data. The advantage of dynamic partitioning is that the number of partitions adapts to the total data volume, keeping the various overheads in balance. This is the strategy adopted by HBase and MongoDB.

A dataset starts out small, though, and until it reaches the first split point all writes must be handled by a single node while the other nodes sit idle. To address this, HBase and MongoDB allow a set of initial partitions to be configured on an empty database (pre-splitting). With key-range partitioning, pre-splitting requires knowing in advance how the keys will be distributed.

Partitioning proportionally to nodes makes the number of partitions proportional to the number of nodes, that is, each node carries a fixed number of partitions. The size of each partition grows in proportion to the dataset size. When a node is added, it randomly picks a fixed number of existing partitions to split and takes over half of each split partition.

Request routing

Now that the dataset is split into shards spread across multiple nodes, how does a client know which node to connect to when it makes a request? As partitions are rebalanced, the assignment of partitions to nodes keeps changing.

This problem is not limited to databases; it can be generalized as service discovery, and there are usually three approaches:

Allow the client to connect to any node; if that node happens to own the requested partition it handles the request directly, otherwise it forwards the request to the appropriate node, receives the result, and returns it to the client.

The client sends the request to the routing layer, which determines which node should process the request and forwards it accordingly.

The client is aware of the partition-to-node assignment and connects directly to the appropriate node.

The crux in every case is: how does the component making the routing decision learn about changes in the partition-to-node assignment? This is a challenging problem because it requires agreement among all participants.

Many distributed systems rely on a separate coordination service, such as ZooKeeper, to track cluster metadata.

Each node registers itself in ZooKeeper, which maintains the authoritative mapping of partitions to nodes

The routing layer can subscribe to this information in ZooKeeper, so changes in partition assignment are picked up in near real time

Replication

Replication means keeping copies of the same data on multiple machines connected through the network, and replicating data provides the following benefits:

Improve availability so that the system can continue to work even if part of the system fails

Increase read throughput by expanding the number of machines that can accept read requests

Make the data geographically close to the user, thereby reducing latency

The hard part of replication is handling changes to the replicated data. Three families of replication algorithms are in common use: single-leader, multi-leader, and leaderless. Almost all distributed databases use one of the three.

The single-leader replication process:

Every write to the database must be propagated to all replicas, or the replicas will end up containing different data. The most common solution is called leader-based replication, or master-slave replication.

One replica is designated the leader (or master). When a client wants to write to the database, the write must be sent to the leader, which writes the new data to its local storage.

The other replicas, called followers (read replicas, or slaves), consume the leader's change log and apply the writes in the same order as the leader.

When a client reads from the database, it can query either the leader or a follower.

Synchronous or asynchronous

An important detail of any replication system is whether replication happens synchronously or asynchronously. Synchronous replication makes writes slower, while asynchronous replication leaves replicas temporarily inconsistent: a client may read stale data, and data may be lost if the leader fails. The heart of a replication system is therefore keeping replicas consistent and failing over automatically when the leader goes down.

Consistency models

A consistency model is essentially a contract between processes and the data store: if processes agree to follow certain rules, the data store promises to behave correctly. Normally, when a process reads a data item, it expects to get back the value of that item as of the most recent write.

Without a global clock, however, it is hard to define precisely which write is the most recent. We therefore need alternative definitions, which give rise to a range of consistency models; each model constrains which values a read of a data item is allowed to return.

Note: do not confuse this with the consistency of database transactions (the C in ACID); the consistency of distributed replicas concerns reads and writes of a single object.

Data-centric

Linearizability

Linearizability, also known as strict consistency (Strict Consistency) or atomic consistency (Atomic Consistency), requires two conditions:

Any read of a data item returns the most recently written value of that item.

The order of operations seen by all processes matches their order under a global clock.

The idea of linearizability is to make the system appear as if there were only a single copy of the data and every operation on it were atomic. Applications then need not worry about the many problems caused by multiple replicas. It is the ideal, and strongest, consistency model, and serves as a reference point for the other models.

In a linearizable data store there are effectively no concurrent operations: there is exactly one timeline, all operations sit on it, and together they form a total order.

Sequential consistency

Sequential consistency first appeared in the single-machine setting of shared-memory multiprocessor systems, where it gives programmers a strong memory-visibility guarantee. A sequentially consistent memory model has the following characteristics:

The result of any execution is the same as if the operations of all processors were executed in some single sequential order.

The operations of each individual processor appear in that sequence in the order specified by its program.

Operations appear to take effect atomically with respect to every other processor (each operation becomes visible to all processors at once).

Suppose that, in real time, operation C1 occurs after operation B2. Linearizability requires C1 to be ordered after B2, whereas sequential consistency allows B2 to be ordered after C1.

Sequential consistency can produce nondeterministic results, because the interleaving of operations across processors may differ between runs of the program.

Sequential consistency requires finding a legal sequential execution that preserves the original order within each thread / process.

Linearizability likewise requires finding a legal sequential execution, but that execution must preserve not only the order within each thread / process but also the real-time order of operations across threads / processes.

Linearizability can thus be defined as sequential consistency with a real-time constraint.

My personal understanding: in the realm of distributed replicas, it is unlikely that processes can agree on any ordering other than real-time order, so sequential consistency is of limited relevance there and more likely to cause confusion.

Causal consistency

Whereas linearizability guarantees a single global order over all reads and writes, causal consistency only requires that reads and writes which depend on one another be seen in the same order. In fact, causal consistency is the strongest consistency model that can be achieved without sacrificing performance and availability.

The difficulty in implementing causal consistency is defining and capturing causality: you need to know which operation happened before which (the happens-before relation). But much of that causality comes from the application layer and is invisible to the underlying storage, so tracking every causal dependency is impractical.

Causally related operations must be ordered consistently with time, so in practice operations are ordered by a total-order sequencer or by timestamps (logical clocks), giving every pair of operations a temporal order. In that sense, linearizability simply makes all operations satisfy causal consistency (even though most operations have no real dependency on one another).

Eventual consistency

Eventual consistency is barely a consistency model at all: it makes no guarantee except that, if no further updates arrive, the replicas will eventually converge. Because of network delays, the application may read inconsistent data at any moment. It can be regarded as the weakest acceptable consistency model.

Client-centric

The models above take the perspective of the data store. Under causal consistency and stronger models, the client sees no surprising read or write anomalies; under weaker models, a variety of read and write anomalies can appear.

Client-centric consistency provides guarantees to a single client about its own accesses to the data store, but it provides no consistency guarantees for concurrent accesses by different clients.

Client-centric consistency consists of four models:

Monotonic-read consistency (Monotonic-read Consistency): once a process has read the value of data item x, any subsequent read of x by that process returns that value or a newer one. In other words, the client never reads an older value than one it has already seen.

Monotonic-write consistency (Monotonic-write Consistency): a write by a process to data item x completes before any subsequent write to x by the same process. In other words, the client's writes are applied in order.

Read-your-writes consistency (Read-your-writes Consistency): the effect of a process's write to data item x is always visible to that process's subsequent reads of x. In other words, the client can always read the value it most recently wrote.

Writes-follow-reads consistency (Writes-follow-reads Consistency): a write performed by a process after it has read data item x takes effect on the value that was read, or on a newer one. In other words, the client's writes are applied on top of the latest value the client has read.

In practice, however, a client session may be moved around by server-side load balancing or server failures, so consistency models tied to a client's session are unreliable.

Consensus protocols

Lamport timestamps

In a distributed system, it is unlikely that every machine shares the same clock (there is no global clock). In a 1978 paper, Lamport proposed logical timestamps to solve the problem of ordering the events that occur in a distributed system; it is one of the most frequently cited papers in the field.

A Lamport timestamp is simply the pair counter + node ID. The rules are as follows:

Every event carries a Lamport timestamp; each node's counter starts at 0

For an event local to a node, the node increments its counter by 1

For a send event, the node increments its counter by 1 and attaches the counter to the message

For a receive event, the node sets its counter to max(local counter, counter in the message) + 1

Events are ordered by counter value; ties are broken by node ID

(The original article includes a figure of three nodes A, B, and C exchanging messages, and derives the resulting total order of all their events from these rules.)

The idea behind Lamport timestamps is that two events can only be causally (and hence temporally) related if information has flowed between them. Lamport timestamps therefore only guarantee that the causal (partial) order is respected, not that the order matches absolute physical time.
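A minimal sketch of those rules; node IDs break ties, so every event ends up with a unique position in the total order:

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return (self.counter, self.node_id)

    def send(self):
        # a send is a local event whose counter travels with the message
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, message_counter):
        # take the max of the local and received counters, then advance
        self.counter = max(self.counter, message_counter) + 1
        return (self.counter, self.node_id)

a, b = LamportClock("A"), LamportClock("B")
e1 = a.local_event()          # (1, 'A')
msg = a.send()                # (2, 'A')
e2 = b.receive(msg[0])        # (3, 'B') -- ordered after the send
# total order: sort by (counter, node_id)
print(sorted([e1, msg, e2]))
```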

Total order broadcast

Lamport timestamps establish the ordering of events through message exchange, which leads to total order broadcast, a protocol for exchanging messages between nodes. Total order broadcast must satisfy two safety properties:

Reliable delivery: no message is lost; if a message is delivered to one node, it is delivered to all nodes

Totally ordered delivery: messages are delivered to every node in the same order

Total order broadcast is exactly what database replication needs: if every message represents a database write and every replica processes the same writes in the same order, the replicas stay consistent with one another (apart from temporary replication lag; read operations can also be sent as messages to obtain consistent reads). This principle is called state machine replication.

Because the database's reads and writes are agreed upon through message exchange and, following the Lamport-timestamp argument, all operations form a total order, linearizable storage can be built on top of it.

Raft protocol

Raft is a consensus algorithm designed to be easy to understand. It is equivalent to Paxos in fault tolerance and performance; the difference is that it is decomposed into relatively independent subproblems and cleanly addresses all the major pieces a practical system needs. In effect it is an engineered implementation of the total order broadcast / state machine replication described above.

Raft protocol animation demonstration: thesecretlivesofdata.com/raft/

In a Raft cluster, a server is in one of three roles: leader, follower, or candidate. Under normal operation there is exactly one leader and all the other servers are followers. The leader is responsible for all external requests; if a request arrives at a machine that is not the leader, it is redirected to the leader.

Raft splits the problem into several sub-problems to solve separately:

Leadership election (Leader Election)

Log replication (Log Replication)

Cluster membership change (Cluster membership changes)

Security (Safety)

Having read this far, have you got a handle on analyzing the data models used in databases? If you would like to learn more, you are welcome to follow the industry information channel. Thank you for reading!
