This article explains in detail how master-master replication can be implemented in a NoSQL database. The content is of high quality, so the editor shares it here for reference; I hope you will have a good understanding of the relevant topics after reading it.
Many people have heard about the high performance of the Tarantool DBMS, its rich toolset and its specific features, such as the powerful on-disk storage engine Vinyl and its ability to handle JSON documents. However, most articles overlook a key point: Tarantool is usually regarded as just an in-memory database, when in fact its distinguishing feature is the ability to run code right next to the data and thus process it very efficiently. If you want to know how igorcoding and I built a system inside Tarantool, read on.
If you have ever used the Mail.Ru email service, you know that it can collect mail from other accounts. When the third-party service supports the OAuth protocol, we do not need to ask the user for their credentials in order to collect that mail; we can use an OAuth token instead. Moreover, Mail.Ru Group has many projects that require authorization through third-party services and need the user's OAuth token to work with certain applications. Therefore, we decided to build a service for storing and updating tokens.
I guess you all know what an OAuth token looks like. Close your eyes and recall: an OAuth structure consists of the following 3-4 fields:
{"token_type": "bearer", "access_token": "XXXXXX", "refresh_token": "YYYYYY", "expires_in": 3600}
Access token (access_token) - lets you perform actions, fetch user data, download the user's friend list, and so on.
Update token (refresh_token) - lets you obtain a new access_token an unlimited number of times.
Expiration time (expires_in) - the token's expiration timestamp or some other predefined time; once your access_token expires, you can no longer access the resources you need.
Now let's look at the basic layout of the service. Imagine some front ends that write and read tokens through our service, plus a separate updater that obtains a new access token from the OAuth service provider as soon as a token expires.
The structure of the database is also very simple: two database nodes (master and slave) located in two data centers, separated in the diagram by a vertical dashed line. One data center contains the master node together with its front end and updater; the other contains the slave node and its front end, plus an updater that talks to the master node.
The difficulties faced
The main problem we face is the token lifetime (one hour). After learning more about the project, someone might ask: "Is updating 10 million records an hour really a high-load service? Dividing it out gives roughly 3,000 requests per second." However, things get unpleasant if some records are not updated because of database maintenance or failure, or even a server crash (anything is possible). For example, if our service (the master database) is down for 15 minutes for some reason, 25% of the tokens expire and can no longer be used; if it is down for 30 minutes, half the data goes stale; if it is down for an hour, all the tokens expire. Suppose the database crashes for an hour, we restart the system, and now the entire 10 million tokens need to be updated as quickly as possible. Is that a high-load service?
At first everything went well, but two years later we had expanded the logic, added a few metrics, and implemented some auxiliary logic; in short, Tarantool ran out of CPU resources. Every resource is exhaustible, but the result really did surprise us.
Fortunately, the system administrators helped us out by installing the CPUs they had in stock at the time, which covered our CPU needs for the next 6 months. But this was only a stopgap, and we had to come up with a real solution. At that time we were studying a new version of Tarantool (our system was written in Tarantool 1.5, which is hardly used anywhere outside Mail.Ru Group). Tarantool 1.6 strongly advocates master-master replication, so we thought: why not set up a copy of the database in each of the three data centers and connect them with master-master replication? It sounded like a good plan.
Three masters, three data centers, and three updaters, each connected to its own master database. Even if one or two masters go down, the system keeps running as usual, right? So what is the drawback of this plan? The drawback is that we have effectively tripled the number of requests to the OAuth service provider: we update almost the same set of tokens once per replica. The most direct solution is to have the nodes decide among themselves who the leader is, so that tokens are updated only on the leader.
Select the leader node
There are many leader-election algorithms. One of them is Paxos, which is quite complex, and we did not know how to simplify it, so we decided to use Raft instead. Raft is a very easy-to-understand algorithm: whichever node can be reached can be elected leader, and once the connection fails (or some other factor intervenes), a new leader is elected. The implementation went as follows:
Tarantool ships with neither Raft nor Paxos out of the box, but we can use the built-in net.box module to connect all the nodes into a mesh (each node connects to all the others) and then run the Raft leader-election algorithm over these connections. Eventually every node becomes either a leader, a follower, or neither.
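As a rough illustration, such a mesh can be built with the current net.box API roughly as follows; this is a minimal sketch, and the peer addresses and connection options are assumptions, not the project's real configuration:

local netbox = require('net.box')

-- Hypothetical peer list; in a real system it would come from configuration.
local peers = {'10.0.0.1:3301', '10.0.0.2:3301', '10.0.0.3:3301'}
local connections = {}

for _, uri in ipairs(peers) do
    -- reconnect_after re-establishes a dropped link automatically
    connections[uri] = netbox.connect(uri, {reconnect_after = 1, wait_connected = false})
end

-- An election round can then call a stored "request_vote" function on every
-- peer and count the answers, as the snippet below does.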
If you find the Raft algorithm difficult to implement, the following Lua code can help you:
local r = self.pool.call(self.FUNC.request_vote, self.term, self.uuid)
self._vote_count = self:count_votes(r)

if self._vote_count > self._nodes_count / 2 then
    log.info("[raft-srv] node %d won elections", self.id)
    self:_set_state(self.S.LEADER)
    self:_set_leader({ id = self.id, uuid = self.uuid })
    self._vote_count = 0
    self:stop_election_timer()
    self:start_heartbeater()
else
    log.info("[raft-srv] node %d lost elections", self.id)
    self:_set_state(self.S.IDLE)
    self:_set_leader(msgpack.NULL)
    self._vote_count = 0
    self:start_election_timer()
end
Now we send requests to the remote servers (the other Tarantool replicas) and count the votes from each node. If we have a quorum, we become the leader and send heartbeats to tell the other nodes we are still alive. If we lose the election, we start another one, and after a while we can vote again or be elected leader ourselves.
As long as we have a quorum and a leader has been elected, we can deploy an updater on every node but allow it to work only on the leader.
In this way we regulate the traffic: because tasks are handed out by a single node, each updater gets roughly a third of them. With this setup we can lose any master, because if one fails, another election is started and the updater switches to another node. However, as in other distributed systems, there are several issues related to quorum.
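A sketch of that gating logic might look like the following; the raft object and its is_leader() method stand in for the election module described above and are not the real API:

local fiber = require('fiber')

-- Every node runs this fiber, but tasks are processed only while the local
-- node is the Raft leader; followers just sleep and re-check after elections.
local function updater_loop(raft, process_next_token)
    while true do
        if raft:is_leader() then
            process_next_token()   -- refresh one expiring token
        else
            fiber.sleep(1)
        end
    end
end

-- fiber.create(updater_loop, raft, process_next_token)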
"abandoned" node
If the connection between data centers is lost, we need mechanisms that keep the whole system running normally, as well as mechanisms that restore the system's integrity afterwards. Raft handles both:
Suppose the Dataline data center goes offline. The node in that location becomes an "abandoned" node: it cannot see the other nodes, while the rest of the cluster can see that this node is lost. Another election is triggered, a new node in the cluster is elected leader, and the whole system keeps running, because the remaining nodes are still consistent (the majority can still see one another).
So the question is: what happens to the updater attached to the lost data center? The Raft specification does not give such nodes a separate name; normally, a node without a quorum and without contact with the leader simply sits idle. Yet it can still establish network connections and update tokens on its own. Tokens are normally updated in connected mode, but perhaps they could also be updated by an updater attached to an "abandoned" node. At first we were not sure this made sense: wouldn't it lead to redundant updates?
We needed to sort this out while implementing the system. Our first thought was not to update: we have consistency, we have a quorum, and if we lose any member, we should not update. But then we had another idea. Consider master-master replication in Tarantool. Suppose there are two master nodes and a variable (key) X, and we assign a new value to this variable on each node at the same time, 2 on one and 3 on the other. The two nodes then exchange replication logs (that is, the value of X) with each other. From a consistency standpoint this is a terrible way to implement master-master replication (no offense to the Tarantool developers).
If we need strict consistency, this will not work. However, recall that our OAuth token consists of two important parts:
The update (refresh) token, which is essentially valid forever
The access token, valid for one hour
Our updater has a refresh function that can obtain any number of access tokens from an update token; once issued, each remains valid for an hour.
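For illustration only, such a refresh call could follow the generic OAuth 2.0 flow below, assuming Tarantool's http.client module; the endpoint URL and parameter names are the standard ones, not those of any particular provider:

local http_client = require('http.client').new()
local json = require('json')

-- Exchange a refresh token for a fresh access token (generic OAuth 2.0 flow).
local function refresh_access_token(provider_url, client_id, client_secret, refresh_token)
    local body = string.format(
        'grant_type=refresh_token&refresh_token=%s&client_id=%s&client_secret=%s',
        refresh_token, client_id, client_secret)
    local resp = http_client:post(provider_url, body,
        {headers = {['Content-Type'] = 'application/x-www-form-urlencoded'}})
    if resp.status ~= 200 then
        return nil, resp.status
    end
    local data = json.decode(resp.body)
    -- data.access_token stays valid for data.expires_in seconds (typically one hour)
    return data.access_token, data.expires_in
end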
Consider the following scenario: two follower nodes are interacting with a leader; they refresh their tokens and receive new access tokens, which get replicated, so every node now has this access token. Then the connection breaks, so one of the followers becomes an "abandoned" node: it has no quorum and can see neither the leader nor the other follower. However, we allow its updater to keep refreshing tokens on the "abandoned" node. If that node has no network connection at all, the whole scheme stops anyway; but if it is just a network split, the updater can still operate normally.
Once the network split is over and the "abandoned" node rejoins the cluster, another election or data exchange is triggered. Note that the second token, like the third, is still "good".
After the original cluster membership is restored, the next update happens on only one node and is then replicated. In other words, while the cluster is split, its parts are updated independently, but once they are reunited, data consistency is restored. Normally you need a majority of nodes active (2 out of 3 for a three-node cluster) to keep the cluster operational. For us, however, even a single active node is enough: it keeps sending outgoing requests as needed.
To reiterate what was said about the growing number of requests: during a network split or node outage we can get by with a single active node, which we update as usual. In the case of an absolute split (when the cluster falls apart into individual nodes, each still having an outside network connection), the number of requests to the OAuth service provider triples, as mentioned above. But since such an event lasts a relatively short time, the situation is not too bad, and we do not intend to work in split mode all the time. Normally the system has a quorum and network connectivity, and all nodes are up and running.
Sharding
One more problem remained unsolved: we had hit the CPU ceiling, and the most obvious solution is sharding.
Suppose we have two database shards, each with a replica, and a function that, given some key, tells us which shard holds the required data. If we shard by email address, part of the addresses are stored on one shard and the rest on the other, and we always know exactly where our data is.
There are two approaches to sharding. One is client-side sharding: we choose a consistent sharding function that returns the shard number, such as CRC32, Guava or Sumbur, and implement it the same way on all clients. An obvious advantage of this approach is that the database knows nothing about sharding: it just works normally, and sharding happens on the side.
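For illustration, such a client-side function can be as simple as a CRC32 of the key taken modulo the number of shards; this is a sketch, and the shard count and key format are assumptions:

local digest = require('digest')

local SHARD_COUNT = 2   -- illustrative

-- Every client maps a key (e.g. an email address) to a shard number with the
-- same deterministic function, so all clients agree where the data lives.
local function shard_for(key)
    return digest.crc32(key) % SHARD_COUNT + 1
end

-- shard_for('user@example.com') returns the same shard number on every client.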
However, this approach also has a serious flaw. At first, the clients get very busy: if you want to reshard, you have to add the new sharding logic to every client. The problem is that some clients may still be running the old scheme while others are already using a completely different one, and the database itself has no idea that two different sharding schemes exist.
We chose the other approach: sharding inside the database. In this case the database code becomes more complex, but as a trade-off the client stays simple: every client that connects to the database can be routed to any node, and a special function computes which node should actually be contacted and which node is in charge. As said, the database gets more complex so that the client can be simpler, but then the database is solely responsible for its data. Besides, the hardest part is resharding: it becomes very difficult if there are many clients that cannot be updated, whereas if the database manages its own data, resharding becomes much easier.
How will it be implemented?
Each hexagon represents a Tarantool instance: three nodes form shard 1, and another three-node cluster forms shard 2. What happens if we connect all the nodes to one another? According to Raft, we can know the state of each cluster: who the leader is and who the followers are. Because the connections also span clusters, we can know the state of the other shard (its leader and followers) as well. In general, if a user hits a shard that turns out not to be the one they need, we know exactly where to send them.
Let's look at some simple examples.
Suppose the user sends a request for a key residing on the first shard, and the request arrives at some node of that shard. This node knows who the leader is, so it reroutes the request to the shard's leader, which in turn reads or writes the key and returns the result to the user.
In the second scenario, the user's request arrives at the same node of the first shard, but the requested key lives on the second shard. This is handled similarly: the first shard knows who the leader of the second shard is, sends the request to that leader for processing, and the result is returned to the user.
This solution is very simple, but it has its drawbacks. The first problem is the number of connections: in the two-shard case, every node connects to all the remaining nodes, which gives 6 × 5 = 30 connections. Add one more 3-node shard, and the number of connections grows to 9 × 8 = 72. Isn't that a bit much?
How can we solve this? We simply add some Tarantool instances that we call proxies rather than shards or databases, and let them solve all the sharding questions: computing the shard for a key and locating the shard leader. The Raft clusters, meanwhile, stay self-contained and work only within their own shard. When a user contacts a proxy, the proxy computes the required shard; if the leader is needed, it redirects the user to the leader, otherwise to any node of that shard.
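The routing decision a proxy makes can be sketched as follows; shard_for (the same hash function as in the earlier sketch), get_shard_leader and get_any_shard_node are hypothetical helpers, not the real API:

-- The proxy computes the shard for the key and redirects the client either to
-- that shard's leader (when the leader is needed) or to any node of the shard.
local function route(key, need_leader)
    local shard_id = shard_for(key)
    if need_leader then
        return get_shard_leader(shard_id)
    end
    return get_any_shard_node(shard_id)
end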
The resulting complexity is linear and depends on the number of nodes. With three nodes per shard, the total number of connections is now several times smaller.
The proxy scheme was designed with further scaling in mind (when the number of shards grows beyond two). With only two shards the number of connections stays the same, but as the number of shards grows, the number of connections drops sharply. The shard list is stored in a Lua configuration file, so to pick up a new list we only need to reload the code.
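Such a shard list might look roughly like this in the Lua configuration; the layout and addresses are purely illustrative, since the article does not show the real file:

shards = {
    [1] = {'10.0.1.1:3301', '10.0.1.2:3301', '10.0.1.3:3301'},
    [2] = {'10.0.2.1:3301', '10.0.2.2:3301', '10.0.2.3:3301'},
}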
To sum up: first we set up master-master replication and applied the Raft algorithm, then we added sharding and proxies. What we got in the end is a single entity, a cluster, so the current solution looks fairly simple.
All that's left are the front ends that read and write tokens, and the updaters that refresh tokens: they take the update token, pass it to the OAuth service provider, and write back a new access token.
As mentioned earlier, some of our auxiliary logic had eaten up the CPU, so let's move that logic to a separate cluster.
The auxiliary logic mainly concerns address books. For a given user token there is a corresponding address book, holding about as much data as the tokens themselves. To avoid exhausting the CPU on one machine, we obviously need a second cluster with the same replication scheme; we only have to add a set of updaters that refresh the address books (this task is rarer, so address books are not refreshed together with the tokens).
Finally, by integrating the two clusters, we get a relatively simple and complete structure.
Token update queue
Why did we use our own queue when standard queues were available? It comes down to our token update model. Once issued, a token is valid for one hour; when it is about to expire it needs to be refreshed, and the refresh must complete before a particular point in time.
Suppose the system goes down and we end up with a pile of expired tokens. While we are updating them, other tokens keep expiring one after another. We will certainly update them all eventually, but wouldn't it be more sensible to first update those that are about to expire (within 60 seconds) and only then spend the remaining resources on those that have already expired, giving priority to tokens with 4-5 minutes left before expiry?
Implementing this logic with third-party software is not easy, but with Tarantool it is almost effortless. Consider a simple scheme: a tuple in Tarantool stores the data, and some of the tuple's IDs form its primary key. To turn this into the queue we need, we only have to add two fields: status (the queued token's status) and time (the expiration time or some other predefined time).
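A space shaped like that could be declared roughly as follows; this is a sketch with illustrative names, since the article does not show the real schema:

-- Primary key plus the two extra fields (status and time), and a secondary
-- index over them so that take() can scan ready tasks in time order.
box.schema.space.create('queue', {if_not_exists = true})
box.space.queue:create_index('primary',
    {parts = {1, 'unsigned'}, if_not_exists = true})
box.space.queue:create_index('status_time',
    {parts = {2, 'string', 3, 'number'}, unique = false, if_not_exists = true})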
Now consider the two main queue operations, put and take. Put writes new data: given some payload, put sets the status and time itself and writes a new tuple.
Take builds an index-based iterator, picks out the tasks waiting to be handled (tasks in the ready state), and checks whether it is time to pick them up or whether they have already expired. If there is no task, take switches to wait mode. Beyond built-in Lua, Tarantool has so-called channels, which are essentially fiber synchronization primitives. Any fiber can create a channel and say "I will wait here," and any other fiber can wake that channel up and send it a message.
The waiting function (waiting for a released task, for a specified time, or for something else) creates a channel, labels it appropriately, places it somewhere and then listens on it. If an urgent token to update arrives, put notifies the channel, and take picks up the update task.
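A minimal sketch of that notification mechanism with a fiber channel (the names are illustrative, not the production code):

local fiber = require('fiber')

local wakeup = fiber.channel(0)   -- rendezvous channel: put blocks unless someone is waiting

-- take() side: wait up to `timeout` seconds for an urgent task
local function wait_for_task(timeout)
    return wakeup:get(timeout)    -- nil on timeout, the task otherwise
end

-- put() side: if the new token must be refreshed right away, wake a waiter up
local function notify(task)
    wakeup:put(task, 0)           -- zero timeout: do not block if nobody is waiting
end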
Tarantool has one more special feature: whether a token was issued, an update task was received via take, or any task was handed out, in all these cases Tarantool can detect that the client has disconnected. We associate each connection with the tasks assigned to it and keep these mappings in a session store. Suppose the update process fails because of a network outage, and we do not know whether the token was updated and written back to the database. The client disconnect is detected, all tasks associated with the failed process are looked up in the session store and released automatically. Any released task can then use the same channel to notify another waiting take, which quickly picks the task up and executes it.
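Tarantool exposes this through session triggers; a sketch of releasing a session's tasks on disconnect might look like this (tasks_by_session and release_task are illustrative names, not the real implementation):

local tasks_by_session = {}

box.session.on_connect(function()
    tasks_by_session[box.session.id()] = {}
end)

box.session.on_disconnect(function()
    local sid = box.session.id()
    -- put every task taken by this connection back into the 'ready' state
    for _, task in pairs(tasks_by_session[sid] or {}) do
        release_task(task)
    end
    tasks_by_session[sid] = nil
end)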
In fact, the specific implementation does not require much code:
function put(data)
    local t = box.space.queue:auto_increment({
        'r',           -- [[status]]
        util.time(),   -- [[time]]
        data           -- [[any payload]]
    })
    return t
end

function take(timeout)
    local start_time = util.time()
    local q_ind = box.space.tokens.index.queue
    local _, t
    while true do
        local it = util.iter(q_ind, {'r'}, {iterator = box.index.GE})
        _, t = it()
        if t and t[F.tokens.status] ~= 't' then
            break
        end
        local left = (start_time + timeout) - util.time()
        if left <= 0 then
            return nil               -- timed out, nothing to take
        end
        -- the source snippet breaks off at this point; per the description
        -- above, the fiber then waits on the notification channel (q) for the
        -- remaining time and retries
        t = q:wait(left)
        if t ~= nil then
            break
        end
    end
    return t
end