
The Big Trick of Distributed System "Scalability": A Detailed Explanation of Horizontal/Vertical Sharding


If this is the second time you've seen one of my articles, feel free to scan the QR code at the end and subscribe to my personal official account (Cross-border Architect).

This article is about 5,389 words long; a 14-minute read is recommended.

I insist on originality; every article is written with heart.

I didn't expect this article to take so long to write. If you can't digest it all in one sitting, feel free to bookmark it first.


This is the fourth article in the "scalability" series. For newcomers, here is a brief recap of the first three.

The most important first step toward "scalability" is to become "stateless", so that you can scale out horizontally at will, without worrying about inconsistencies when requests switch between replicas. That is what "The Focus of Distributed Systems: Detailed Explanation of 'Stateless'" covered.

However, even after scaling out, the system is still essentially one "big program"; it has merely become replicable.

To eliminate the "big program", we have to split it, and good splitting is inseparable from the core idea of "high cohesion, low coupling". That is what "The Focus of Distributed Systems: Detailed Explanation of 'High Cohesion and Low Coupling'" covered.

Digression: when a single-node application can no longer keep up, Brother Z's general advice is: consider "expanding" first, then consider "cutting". As with code, it is usually easier to add something new than to change something old.

For "expansion", first consider "vertical expansion" (adding hardware, money can not solve the problem), and then consider "horizontal expansion" (stateless transformation + multi-node deployment, which is a minor operation).

The word "cut" is usually "vertical cut" (according to business segmentation, this is a major operation), and "horizontal cut" is occasionally used (which is actually layering in a single application, such as front and rear separation).

The third article, "The Focus of Distributed Systems: Flexible Architecture", discussed two common "loosely coupled" architecture patterns that take an application's "scalability" up another level.

All of the above is work at the application level. In general, surgery at the application level, combined with full use of caching, can support a system's growth for quite a long time, especially in "CPU-intensive" scenarios with small data volumes and large request volumes.

However, in a mature, large-scale project, the further it develops, the more the bottleneck eventually shows up in the database. You may even see prolonged high CPU load, downtime, and similar symptoms.

In such a scenario, the database itself has to go under the knife. This time, Brother Z will talk with you about good ways to make a database "scalable".

Core demand

By the time the database needs surgery, the whole system often looks like this: many stateless application nodes all sharing one database.

As mentioned earlier, the bottleneck at this point usually shows up in the "CPU".

For a database, expanding disk and memory is relatively easy: both can simply be "added".

CPU is different. Once CPU usage spikes, about all you can do is check whether indexes are in place; beyond that, you can basically only stand by and watch.

So the idea naturally becomes: how can the CPU pressure of one database be spread across multiple CPUs, with the option to add more on demand at any time?

That turns it into the same "sharding" exercise as at the application level, and another embodiment of the "divide and conquer" thinking of distributed systems.

Since it is sharding, it is essentially the same as with applications: it divides into "vertical sharding" and "horizontal sharding".

Vertical sharding

Vertical sharding is sometimes called "vertical partitioning".

Like application splitting, it uses the "business" as the dimension of division: databases for different businesses run on different database servers, each performing its own duties.

In general, Brother Z suggests you prioritize "vertical sharding" over "horizontal sharding". Why? Open any SQL file in the project at hand: I bet it contains plenty of "join" and "transaction" keywords. Such associated queries and transactional operations are essentially a form of "relational binding", and once the database is split, they simply stop working.

At that point you have only two choices.

Either remove the unnecessary "relational binding" logic, which means adjusting the business to drop unnecessary "batch operations" or unnecessary strongly consistent transactions. But as you know, some scenarios just cannot be dropped.

Or lift logic such as "merging" and "association" up into the code of the business-logic layer, or even the application layer.

Either way, the change is a big project.

To keep the project as small as possible and get a better cost-performance ratio, we need to stick to one principle: "avoid splitting closely related tables".

The closer the relationship between two tables, the greater their need for "join" and "transaction". Sticking to this principle keeps the same module and its closely related business in the same database, where "join" and "transaction" continue to work.

Therefore, give priority to "vertical sharding".

The idea of "vertical sharding" is very simple. In general, the suggestion is to correspond to the segmented application one by one, neither more nor less.

In practice, doing "vertical sharding" well mainly comes down to familiarity with the "business", so we won't dwell on it here.

The advantages of "vertical segmentation" are:

High cohesion and clear splitting rules, with lower data redundancy than horizontal sharding.

A 1:1 relationship with the applications, which makes maintenance and troubleshooting easy: once abnormal data shows up in a database, you only need to check the programs tied to that database.

But this is not a "once and for all" solution, because nobody can predict how the business will grow. So its most obvious disadvantage remains: tables that are accessed extremely often, or that hold very large amounts of data, still hit performance bottlenecks.

To truly solve that problem, we have to bring out "horizontal sharding".

Digression: avoid "horizontal sharding" unless you have no other choice. After reading the rest of this article, you will understand why.

Next, Brother Z will walk you through "horizontal sharding" properly; it is the focus of this article.

Horizontal sharding

Imagine that after "vertical sharding" is done, you still find a table with more than 1 billion rows in one database.

Now this table needs "horizontal sharding". How do you think the problem through?

The approach Brother Z teaches you is:

First find the "read" field of "most high frequency".

Then, look at how that field is actually used (batch queries or single-row queries, whether it is also an association field of other tables, and so on).

Then, choose a suitable sharding scheme according to those characteristics.

Why find the high-frequency "read" field first?

Because in actual use, "read" operations usually far outnumber "writes". Generally you have to "read" before you "write", and "reads" also have plenty of standalone scenarios of their own. So optimizing for the higher-frequency "read" scenario is bound to deliver more value.

For example, suppose the 1-billion-row table is an order table with the following structure:

Order (orderId long, createTime datetime, userId long)

Let's first look at several ways of doing "horizontal sharding", and then see which scenarios suit which approach.

Range sharding

This is a "continuous" way of sharding.

For example, splitting by time (createTime), we can divide by year and month: order_201901 in one database, order_201902 in another, and so on.

Or splitting by sequence number (orderId): 100000 to 199999 in one database, 200000 to 299999 in another, and so on.
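As a minimal sketch of both routing schemes (the method names are assumptions for illustration; the 100000-wide id buckets follow the example above):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Minimal range-routing sketch for the two schemes described above.
public class RangeRouter {
    private static final DateTimeFormatter YYYYMM = DateTimeFormatter.ofPattern("yyyyMM");

    // Split by time: order_201901, order_201902, ...
    public static String routeByCreateTime(LocalDateTime createTime) {
        return "order_" + createTime.format(YYYYMM);
    }

    // Split by sequence number: 100000-199999 -> db0, 200000-299999 -> db1, ...
    // Assumes ids start at 100000, as in the example above.
    public static String routeByOrderId(long orderId) {
        long bucket = orderId / 100_000 - 1;
        return "db" + bucket;
    }
}
```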

The advantage of this kind of sharding is that the size of a single table is controllable, and expansion requires no data migration.

The disadvantage is equally obvious: in general, the newer the time or the larger the sequence number, the "newer" the data, and the more frequently it is accessed compared with "old" data. As a result, pressure concentrates on the newest database, and the older a database is, the more idle it sits.

Hash sharding

Hash sharding is the opposite of "range sharding": it is a "discrete" way of sharding.

Its advantage is that it fixes the disadvantage of "range sharding": new data is spread across all the nodes, so pressure does not concentrate on a few of them.

Correspondingly, its disadvantage is the flip side of range sharding's advantage: a second expansion inevitably involves data migration. The hash algorithm is fixed, and as soon as the algorithm changes, the data distribution changes with it.

In most cases the hash can be a simple "modulo" operation. It looks like this:

If we split into 10 databases, the formula is orderId % 10: 100000 % 10 = 0 is assigned to db0, 100001 % 10 = 1 is assigned to db1, ..., 100010 % 10 = 0 is assigned to db0, 100011 % 10 = 1 is assigned to db1, and so on.

In fact, in some scenarios we can go further by customizing how the id is generated (see the earlier article "The Necessary Medicine in Distributed Systems: Global Unique Document Number Generation"). Hash sharding then not only breaks up hot data, but can also reduce reliance on a global table to locate specific records.

For example, append the mantissa of userId to orderId, so that orderId and userId route to the same result. Let me give you an example:

A user's userId is 200004. Taking a 4-bit mantissa gives 4, written as 0100.

Then we generate the first 60 bits of the orderId with a custom id algorithm, and append 0100 at the end.

Therefore, orderId and userId share the same low 4 bits. With a shard count that is a power of two, say orderId % 16 and userId % 16, the two always route to the same place. (Strictly speaking, a non-power-of-two modulus such as 10 is not determined by the low bits alone; to make orderId % 10 equal userId % 10, you would append userId's last decimal digit instead.)

Of course, the trick stops working if you want to embed other factors besides userId. In other words, it supports one additional dimension without resorting to a global table.
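Here is a minimal sketch of that idea. The sequence source standing in for the high 60 bits is a hypothetical placeholder (a real system would use something like the generator from the article cited above), and the shard count is assumed to be a power of two so the routes coincide:

```java
// Minimal sketch: embed the low 4 bits of userId as the low 4 bits of
// orderId, so that under a power-of-two modulus both keys route to the
// same shard. nextSequence() is a hypothetical placeholder for a real
// distributed id generator.
public class MantissaOrderId {
    private static long sequence = 0;

    private static synchronized long nextSequence() {
        return ++sequence; // placeholder; not unique across processes
    }

    public static long newOrderId(long userId) {
        long high60 = nextSequence() & ((1L << 60) - 1); // the "first 60 bits"
        long mantissa = userId & 0xF;                    // e.g. 200004 -> 4 -> 0100
        return (high60 << 4) | mantissa;
    }

    public static void main(String[] args) {
        long userId = 200004L;
        long orderId = newOrderId(userId);
        System.out.println(orderId % 16 == userId % 16); // true: same shard
    }
}
```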

The global table has come up twice now, so what is a global table?

The global-table approach stores the partition key used for sharding, together with the id of each specific row, in a separate database or table. For example, add a table like this:

NodeId  orderId
01      100001
02      100002
01      100003
01      100004
...

This way the specific data really is spread across different servers, but the global table itself has a distinctly "centralized" feel.

Because a request cannot directly locate which server holds the data it needs, every operation must first query this global table to learn where the specific data lives.

The side effect of this "centralized" model is that bottlenecks and risks shift onto the global table itself. Its saving grace is the simple logic.
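A minimal sketch of the extra hop every access pays (the in-memory map stands in for a real mapping table kept in its own database; all names here are hypothetical):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Global-table sketch: every read first consults the mapping to find
// which node holds the row, then queries that node. In production the
// mapping would itself live in a database; this map is illustrative.
public class GlobalTable {
    private final Map<Long, String> orderIdToNode = new ConcurrentHashMap<>();

    public void register(long orderId, String nodeId) {
        orderIdToNode.put(orderId, nodeId); // written when the row is created
    }

    public String locate(long orderId) {
        return orderIdToNode.get(orderId);  // the extra lookup every query pays
    }
}
```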

OK, so how should you choose among these sharding schemes?

Brother Z's advice: if hot data is not especially concentrated in your scenario, prefer "range sharding" first; otherwise choose one of the other two.

Between the other two, the larger the data volume, the more you should lean toward hash sharding, because it beats the global-table approach on overall availability and performance, though its implementation cost is higher.

"horizontal segmentation" can really achieve "unlimited expansion", but it also has corresponding disadvantages.

1) Batch queries, paging, and the like need extra work, especially when a table has multiple high-frequency fields used in where, order by, or group by.

2) The splitting rules are not as clear-cut as with "vertical sharding".

So here is one more piece of "nonsense": there is no perfect scheme, only a suitable one; choose according to the specific situation. (Feel free to raise scenarios you are unsure about in the comments and discuss them with Brother Z.)

How to implement

When you actually implement "horizontal sharding", you can operate at two levels: the "table" level or the "database" level.

Split at the table level: the pieces live in the same database, with table names order_0, order_1, order_2, and so on.

This solves the problem of an oversized single table, but not the CPU-load problem. So when CPU pressure is modest but a huge table makes SQL operations slow, this is the way to choose.

Split at the database level: the table name stays unchanged, all of them called order, but there are now 10 databases: db0, db1, db2, and so on.

Much of our earlier discussion assumed this model, so I won't say more about it.
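A minimal routing sketch for this model, assuming ten databases reached through one connection handle each (DataSourceStub is a hypothetical stand-in for a real javax.sql.DataSource):

```java
import java.util.List;

// Database-level routing sketch: same table name everywhere, ten
// databases db0..db9 chosen by modulo on the sharding key.
public class DbRouter {
    interface DataSourceStub { /* stand-in for javax.sql.DataSource */ }

    private final List<DataSourceStub> dataSources; // index i -> dbi

    public DbRouter(List<DataSourceStub> dataSources) {
        this.dataSources = dataSources;
    }

    public DataSourceStub route(long orderId) {
        return dataSources.get((int) (orderId % dataSources.size())); // e.g. % 10
    }
}
```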

Split at both levels: you can also split into databases and then tables, for example 10 databases first, then 10 tables in each database.

This is really a two-level index idea: the first hop, locating the right database, cuts out a certain amount of resource consumption.

For example, split by year first and then by month. Then, if the data you need crosses months but not years, the aggregation can run inside a single database, with no cross-database operation involved.
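A minimal sketch of that year-then-month routing (the database and table naming is assumed for illustration):

```java
import java.time.LocalDateTime;

// Two-level routing sketch: locate the database by year, then the table
// by month, as in the example above. Names are illustrative.
public class TwoLevelRouter {
    public static String route(LocalDateTime createTime) {
        String db = "order_db_" + createTime.getYear();                           // level 1: database
        String table = String.format("order_%02d", createTime.getMonthValue());   // level 2: table
        return db + "." + table; // e.g. order_db_2019.order_03
    }
}
```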

However, whichever way you choose, you will sooner or later face the following two unavoidable problems:

Cross-database joins.

Global aggregation or sort operations.

The best way to handle the first problem is to change your programming mindset: express logic, relationships, constraints, and so on in application code wherever possible, and resist doing them in SQL just because it is convenient.

After all, code can be written "stateless" and scaled at any time, whereas SQL follows the data, and data is "state", which is naturally unfriendly to scaling.
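For instance, a query that used to be an order-user join in SQL can become two shard-local reads plus an in-memory merge. A minimal sketch, with all DAO interfaces and types hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Lifting a join out of SQL into application code: read the orders,
// batch-read the matching users, and stitch them together in memory.
public class AppSideJoin {
    record Order(long orderId, long userId) {}
    record User(long userId, String name) {}
    record OrderView(long orderId, String userName) {}

    interface OrderDao { List<Order> findRecent(int limit); }
    interface UserDao  { List<User> findByIds(Set<Long> ids); }

    public static List<OrderView> recentOrdersWithUser(OrderDao orders, UserDao users) {
        List<Order> recent = orders.findRecent(100);                 // read from order shards, no join
        Set<Long> userIds = recent.stream().map(Order::userId)
                                  .collect(Collectors.toSet());
        Map<Long, String> names = users.findByIds(userIds).stream()  // one batched user lookup
                .collect(Collectors.toMap(User::userId, User::name));
        return recent.stream()                                       // the "join", done in memory
                .map(o -> new OrderView(o.orderId(), names.get(o.userId())))
                .toList();
    }
}
```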

As a second choice, you can also cope by redundantly maintaining a large number of global tables. But that both severely tests your "data consistency" work and carries a large storage overhead.

The solution to the second problem is to turn the original aggregation or sort into two steps: query each node, then merge the results. The traversal of multiple nodes can be done in "parallel".
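A minimal scatter-gather sketch of those two steps, parallel per-shard queries followed by a global merge (the Shard interface is a hypothetical stand-in for per-database access code):

```java
import java.time.LocalDateTime;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Scatter-gather sketch: fan the query out to every shard in parallel,
// then merge and re-sort in a second step.
public class ScatterGather {
    record Order(long orderId, LocalDateTime createTime) {}

    interface Shard { List<Order> latestOrders(int limit); }

    public static List<Order> globalLatest(List<Shard> shards, int n) {
        List<CompletableFuture<List<Order>>> futures = shards.stream()
                .map(s -> CompletableFuture.supplyAsync(() -> s.latestOrders(n)))
                .toList();                                    // step 1: parallel per-shard queries
        return futures.stream()
                .map(CompletableFuture::join)
                .flatMap(List::stream)
                .sorted(Comparator.comparing(Order::createTime).reversed())
                .limit(n)                                     // step 2: global merge and re-sort
                .toList();
    }
}
```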

So how does a program actually use the data once it is sharded? There are two modes: "in-process" and "out-of-process".

"in-process" can be done in the encapsulated DAL access framework, in the ORM framework, or in the database driver. This model is better known for solutions such as Ali's tddl.

"out-of-process" is the proxy mode, and the more well-known solutions are mycat, cobar, atlas, and so on, because this pattern is "low intrusive" to applications and uses like "a database". However, due to one more network communication, there will be more loss in performance.

As per the old rules, let me share some best practices.

Best practices

First up: two tips for re-sharding data without downtime, using hash-based horizontal sharding as the example.

First: when sharding for the first time, bring each new node up as a replica of the original node, in master-slave form, with full real-time synchronization.

Then, on that basis, delete the data that does not belong on each node. (Not deleting it is fine too; it just takes up more space.)

That way, no downtime is required.

Second: if, as time goes by, the shards can no longer keep up and a second split is needed, expand by a factor of 2.

That makes the data migration very simple: it only needs to happen locally, with the same thinking as the first time.
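Why does doubling keep migration local? Under modulo sharding, when the shard count goes from n to 2n, a key on old shard k % n can only stay put or move to shard (k % n) + n, so each old node hands data to exactly one new node. A quick check of that claim:

```java
// Quick check: going from n to 2n shards, a key either stays on its old
// shard (id % n) or moves to old shard + n, never anywhere else, so
// migration is purely local.
public class DoublingCheck {
    public static void main(String[] args) {
        int n = 4; // illustrative old shard count
        for (long id = 0; id < 1_000_000; id++) {
            long oldShard = id % n;
            long newShard = id % (2 * n);
            if (newShard != oldShard && newShard != oldShard + n) {
                throw new AssertionError("unexpected move for id " + id);
            }
        }
        System.out.println("all keys stay local when doubling from " + n + " to " + (2 * n));
    }
}
```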

Of course, if you chose "range sharding", a second split is no trouble at all: new data naturally flows to the newest node. For example, splitting tables by year and month, the March 2019 data naturally lands in the xxxx_201903 table.

Having said all that, Brother Z still wants to emphasize: try not to split at all if you can, and first see whether remedies such as "read-write separation" handle the problem you face.

If you really must split, do "vertical sharding" first, and only then consider "horizontal sharding".

Generally speaking, thinking in this order gives a better cost-performance ratio.

Summary

All right, let's sum up.

This time, Brother Z first introduced the two lines of thought for splitting a database. Popularly put: "vertical splitting" changes the "columns" while the "rows" stay the same; "horizontal splitting" changes the "rows" while the "columns" stay the same.

Then we focused on the three ways to implement "horizontal sharding" and the thinking behind a concrete implementation.

Finally, I shared some practical experience with you.

I hope it will enlighten you.

Related articles:

The Focus of Distributed Systems: Detailed Explanation of "Stateless"

The Focus of Distributed Systems: Detailed Explanation of "High Cohesion and Low Coupling"

The Focus of Distributed Systems: Flexible Architecture

The Necessary Medicine in Distributed Systems: Global Unique Document Number Generation

Author: Zachary

Source: https://www.cnblogs.com/Zachary-Fan/p/databasesegmentation.html

If you like this article, you can tap "like" in the lower left corner.

That gives me some feedback. :)

Thank you for your support.

▶ About the author: Zhang Fan (Zachary, personal WeChat: Zachary-ZF). I insist on polishing every article to be high-quality and original. Welcome to scan the QR code below to follow.

I regularly publish original content on architecture design, distributed systems, product, operations, and assorted reflections.

If you are a junior programmer who wants to advance but doesn't know how, or a programmer of many years who has hit a bottleneck and wants to broaden your horizons, follow my official account "Cross-border Architect" and reply "Technology": I'll send you a mind map I have collected and organized over a long time.

If you are an operator who feels helpless in the face of a changing market, or who wants to understand mainstream operation strategies to enrich your own toolkit, follow my official account "Cross-border Architect" and reply "Operations": I'll send you a mind map I have collected and organized over a long time.
