This article discusses how to build high availability for sub-database, sub-table (sharding) middleware: how to detect and remove failed middleware nodes, how to limit traffic loss when a physical machine goes down, and how to upgrade and release the middleware without losing traffic.
What are the high availability issues?
As stateless middleware, high availability is not that hard to achieve in itself. However, minimizing traffic loss during the unavailable period still takes some work. That traffic loss mainly comes from two situations:
(1) The physical machine hosting a middleware instance suddenly goes down.
(2) The middleware is upgraded and released.
Because our middleware is exposed to the application as a database proxy, the application treats it exactly as if it were a database, as shown in the following figure:
Therefore, once the problems above occur, it is hard for the business side to shield itself from the impact through measures such as retries. We have to do some work at the lower layer to sense the middleware's state automatically and avoid traffic loss effectively.
Downtime of the physical machine where the middleware is located
A physical machine outage is actually a fairly common event, and when it happens the middleware instance simply stops responding. Any SQL running on it at that moment has failed (strictly speaking its state is unknown; the application cannot determine the real outcome without re-querying the backend database). This part of the traffic cannot be saved. What we can do is quickly detect the downed middleware node on the client side (the Druid data source) and remove it.
Find and eliminate unavailable nodes
Find unavailable nodes by heartbeat
Naturally, we use a heartbeat to probe whether the backend middleware is alive. The heartbeat periodically creates a brand-new connection, sends MySQL's ping, and then closes the connection immediately (this makes it easy to distinguish heartbeat traffic from normal traffic; it would be more troublesome if we kept a long-lived connection and repeatedly sent a SQL such as SELECT 1).
To avoid the occasional connect failure caused by network jitter, we only judge a middleware instance unavailable after three consecutive connect failures. These three probes prolong the time before a failure is perceived, however, so the interval between the three connect attempts decays exponentially, as shown in the following figure:
Why not send the second and third connects immediately after the first connect failure? Network jitter may last for a time window; if all three probes were sent inside that window while the network is fine outside it, a backend node would be mistakenly judged unavailable. Exponentially decaying intervals are the compromise we chose. A minimal sketch of such a probe is shown below.
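Here is a minimal sketch of such a probe in Java. The class name, the interval values, and the use of Connection.isValid() as the MySQL ping are assumptions for illustration; the article does not show its real implementation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/**
 * Heartbeat sketch: open a brand-new connection, ping it, close it, and only
 * declare the node down after three consecutive connect failures spaced by
 * exponentially decaying intervals.
 */
public class HeartbeatProbe {
    // Assumed retry intervals in ms: shorter each time, but never back-to-back.
    private static final long[] RETRY_INTERVALS_MS = {8_000, 4_000, 2_000};

    public static boolean isNodeAlive(String jdbcUrl, String user, String password) {
        for (int attempt = 0; attempt < RETRY_INTERVALS_MS.length; attempt++) {
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
                if (conn.isValid(3)) {      // isValid() issues a lightweight ping on MySQL drivers
                    return true;            // one successful probe is enough
                }
            } catch (SQLException connectFailure) {
                // fall through and retry after a (shorter) pause
            }
            if (attempt < RETRY_INTERVALS_MS.length - 1) {
                try {
                    Thread.sleep(RETRY_INTERVALS_MS[attempt]);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;                       // three consecutive failures: mark the node unavailable
    }
}
```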
Find unavailable nodes by error counting
The heartbeat detection above always leaves a time window, and when traffic is heavy, requests that hit the unavailable node inside that window fail. Error counting can therefore assist in sensing unavailable nodes (this mechanism is still planned rather than implemented).
One thing to note here: only connection-creation exceptions should be counted, not exceptions such as read timeouts. A read timeout may be caused by a slow SQL or by a problem in the backend database, so only a failure to create a connection reliably indicates a middleware problem (even a connection-closed error may just mean the backend closed that particular connection, not that the node as a whole is unavailable), as shown in the following figure. A sketch of this classification follows.
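A small sketch of that classification rule, assuming a simple per-node counter; the threshold and class structure are illustrative, and as noted above the mechanism itself is still only planned:

```java
import java.net.ConnectException;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Error-counting sketch: only failures to CREATE a connection are counted,
 * because a read timeout may simply be a slow SQL or a backend database issue.
 */
public class ErrorCounter {
    private static final int THRESHOLD = 10;            // assumed failure threshold
    private final AtomicInteger connectFailures = new AtomicInteger();

    /** Record an error from a node; returns true when the node should be treated as unavailable. */
    public boolean record(Throwable error) {
        if (isConnectFailure(error)) {
            return connectFailures.incrementAndGet() >= THRESHOLD;
        }
        return false;                                    // read timeouts and the like are ignored
    }

    /** The node came back (e.g. a heartbeat succeeded), so forget its failures. */
    public void reset() {
        connectFailures.set(0);
    }

    private boolean isConnectFailure(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof ConnectException) {         // TCP connect refused / host unreachable
                return true;
            }
        }
        return false;
    }
}
```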
Problems caused by using several connections in one request
Because we want to keep transactions as small as possible, the multiple SQL statements in one request do not share a single connection. In the non-transactional (auto-commit) case, every SQL takes a connection from the pool and returns it when done. Keeping transactions small is important, but it causes trouble when a middleware instance goes down, as shown in the following figure:
As shown above, during the fault-discovery window (i.e., before the middleware has been judged unavailable), each connection is picked from the data sources at random, so every pick has a certain probability of hitting the unavailable middleware (with N middleware instances, that probability is 1/N). A single failed SQL then fails the whole request. Let's do the math:
Suppose N = 8 and a request contains 20 SQL statements.
Then the probability that a request fails during this window is 1 - (7/8)^20 ≈ 0.93.
That is, each request has a 93% chance of failing!
What's worse, the entire application cluster goes through this stage, i.e., every application instance sees the same 93% failure rate.
So the downtime of a single middleware instance makes almost all requests of the whole service fail for ten-plus seconds, which is unacceptable. The calculation is spelled out below.
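For completeness, the same calculation written out as a tiny runnable snippet (the class and variable names are just for illustration):

```java
/**
 * If each of the k SQL statements in a request independently picks one of N
 * middleware nodes at random and one node is down, the request only succeeds
 * when every pick avoids that node.
 */
public class FailureProbability {
    public static void main(String[] args) {
        int n = 8;     // number of middleware nodes
        int k = 20;    // SQL statements per request
        double perRequestFailure = 1 - Math.pow((n - 1) / (double) n, k);
        System.out.printf("per-request failure probability: %.2f%n", perRequestFailure);  // ~0.93
    }
}
```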
Using a sticky data source to solve the problem
Since we cannot detect and confirm the middleware's unavailability instantly, the fault-discovery window is unavoidable (error counting would, of course, shorten it considerably). Ideally, a single node going down should cost only 1/N of the traffic. We use a sticky data source to achieve this, so that the probability of losing traffic is only 1/N, as shown in the following figure:
Combined with error counting, the total traffic loss would be even smaller (because the fault window is shorter).
As shown above, only the requests that happened to be stuck to middleware 2 (the unavailable one) during the downtime fail. Looking at the whole application cluster, only the request traffic sticky to middleware 2 is lost, and because the selection is random, that loss amounts to 1/N of the traffic. A sketch of the sticky selection follows.
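A minimal sketch of the sticky-selection idea, assuming a thread-per-request model and a ThreadLocal to hold the choice; the class names and the clean-up hook are illustrative assumptions, not the article's actual code:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import javax.sql.DataSource;

/**
 * Sticky data source sketch: one middleware node is chosen at random per
 * request and reused for every SQL in that request, so during the
 * fault-discovery window only ~1/N of requests can hit the dead node.
 */
public class StickyDataSourceSelector {
    private static final ThreadLocal<DataSource> STICKY = new ThreadLocal<>();
    private final List<DataSource> middlewareNodes;

    public StickyDataSourceSelector(List<DataSource> middlewareNodes) {
        this.middlewareNodes = middlewareNodes;
    }

    /** Called for every SQL; all SQL in one request see the same node. */
    public DataSource select() {
        DataSource ds = STICKY.get();
        if (ds == null) {
            ds = middlewareNodes.get(ThreadLocalRandom.current().nextInt(middlewareNodes.size()));
            STICKY.set(ds);
        }
        return ds;
    }

    /** Must be called when the request finishes (e.g. in a servlet filter). */
    public void clear() {
        STICKY.remove();
    }
}
```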
High availability during middleware upgrade and release
Upgrading and releasing the sharding middleware is unavoidable: bug fixes and new features require restarting it, and the restart window would otherwise mean unavailability. Unlike a physical machine crash, however, the moment of unavailability is known in advance and the restart action is under our control, so we can use this to make the traffic switch smooth and lossless.
Let the client side know the server is about to go offline
In many practices I know of, making the client side aware of the offlining is done by introducing a third-party coordinator (such as zookeeper/etcd). We did not want to introduce a third-party component for this, because it would bring in zookeeper's own high availability problem and complicate the client-side configuration. The rough idea of the smooth, lossless offlining (a state machine) is shown in the following figure:
Let heartbeat traffic sense the offlining while normal traffic is maintained
We reuse the client-side logic that detects unavailability through new connections: new heartbeat connections fail while new connections for normal requests succeed. The client side then considers the Server unavailable and removes it internally. Since the unavailability is only simulated, both the already-established connections and normal (non-heartbeat) new connections keep working, as shown in the following figure:
A heartbeat connection can be distinguished by its first statement: a heartbeat executes MySQL's ping, whereas normal traffic starts with a SQL statement. (The Druid connection pool we use also pings after a new connection succeeds, so we actually use another way to tell them apart; that detail is not discussed here.) A sketch of the ping-based distinction appears below.
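A small sketch of the first distinction method mentioned above (reject ping-first connections while draining). Note the article says its middleware actually uses a different method because of the Druid ping, so this is purely illustrative; the type names and the state enum are assumptions.

```java
/**
 * While the server is draining, a connection whose first command is MySQL's
 * COM_PING (the heartbeat) is rejected, while normal traffic keeps working.
 */
public class FrontendConnectionHandler {
    /** Minimal stand-in for the middleware's client-facing connection type. */
    interface FrontendConnection { void close(); }

    enum ServerState { RUNNING, DRAINING }

    private static final byte COM_PING = 0x0e;         // MySQL protocol command code for ping

    private volatile ServerState state = ServerState.RUNNING;

    /** Invoked with the first command byte received on a brand-new client connection. */
    boolean acceptFirstCommand(byte command, FrontendConnection conn) {
        if (state == ServerState.DRAINING && command == COM_PING) {
            conn.close();                               // heartbeat probes see a "dead" server
            return false;
        }
        return true;                                    // normal traffic keeps flowing
    }

    /** Called when the server is asked to go offline gracefully. */
    void startDraining() {
        state = ServerState.DRAINING;
    }
}
```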
After three heartbeat failures, the client side judges that Server1 has failed, and the connections to Server1 need to be destroyed. The idea is that when the business layer finishes with a connection and returns it to the pool, the connection is closed directly instead of being recycled (this is a simplified description; the actual handling in the Druid data source differs slightly).
Since a maximum connection hold time is configured, the number of connections to Server1 is guaranteed to drop to 0 after that time.
And since online traffic is not low, this convergence is quite fast (a further step would be to destroy the connections proactively, but we have not done that yet). A sketch of the close-on-return rule follows.
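A sketch of that close-on-return rule, assuming a simple set of known-down servers and a generic idle pool; as the text notes, the real hooks into the Druid data source differ, so the names here are illustrative:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Once a server is judged down, its connections are not recycled into the pool
 * but closed when the business layer returns them, so the connection count to
 * that server drains to zero.
 */
public class ConnectionReturnHook {
    private final Set<String> downServers = ConcurrentHashMap.newKeySet();

    /** Mark a middleware server as failed (e.g. after three heartbeat failures). */
    public void markDown(String serverAddress) {
        downServers.add(serverAddress);
    }

    /** Called when the business layer returns a connection to the pool. */
    public void onReturn(Connection conn, String serverAddress, Queue<Connection> idlePool)
            throws SQLException {
        if (downServers.contains(serverAddress)) {
            conn.close();              // drop it so the pool drains to zero for that server
        } else {
            idlePool.offer(conn);      // normal recycling back into the idle pool
        }
    }
}
```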
How to determine that the offlining Server has no more traffic
After the careful steps above, no traffic is lost while Server1 goes offline. But we still need to determine when no new traffic will reach the server side, i.e., when Server1 no longer has any client-side connections.
This is exactly why we destroy connections after their SQL finishes: it drives the connection count to zero, as shown in the following figure:
Once the connection count reaches 0, we can redeploy Server1 (the sharding middleware). For this we wrote a script whose pseudocode is roughly as follows:
while (true) {
    count = `netstat -anp | grep port | grep ESTABLISHED | wc -l`
    if (0 == count) {
        // the traffic has reached 0, shut down the server
        kill Server
        // publish the upgraded server
        publish Server
        break
    } else {
        sleep(30s)
    }
}
Hook this script into the release platform, and rolling offline/online releases become possible.
Now we can also explain why recover_time needs to be relatively long: new heartbeat connections would also raise the connection count seen by the script, so a time window in which no heartbeat connections are established is needed for the script to finish cleanly.
recover_time is actually unnecessary
If the port used for heartbeat connections were separated from the port used for normal traffic, recover_time would not be needed at all, as shown in the following figure:
Adopting this scheme would greatly reduce the complexity of our client-side code.
But it would also add a new configuration item on the client side (a burden on users) and require one more firewall opening on the network, so we stuck with the recover_time scheme.
Start-up sequence of the middleware
The process above gives graceful offlining, but we found that in some cases our middleware was not so graceful when coming online. When the middleware starts, if a newly created connection to a backend database is dropped for some reason, the middleware's reactor thread can get stuck for about a minute, during which it cannot serve traffic and traffic is lost. So we start the reactor's accept thread to take new traffic only after all backend database connections have been created, which solves the problem, as shown in the following figure. A tiny sketch of that ordering follows.
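A tiny sketch of that start-up ordering; the structure, port number, and method names are illustrative and not the middleware's real bootstrap code:

```java
import java.net.ServerSocket;
import java.net.Socket;

/**
 * Create all backend database connections first, and only then start
 * accepting client traffic.
 */
public class StartupOrder {
    public static void main(String[] args) throws Exception {
        createAllBackendConnections();              // block (and retry) here before serving anything
        try (ServerSocket listener = new ServerSocket(3307)) {
            while (true) {                          // reactor/accept loop starts only at this point
                Socket client = listener.accept();
                handle(client);
            }
        }
    }

    private static void createAllBackendConnections() { /* connect to every backend database */ }

    private static void handle(Socket client) { /* hand the connection to a worker/reactor thread */ }
}
```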