How to Realize Horizontal Segmentation in a MySQL Database

2025-01-17 Update. From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31 report --

Many newcomers are unclear about how to achieve horizontal segmentation in a MySQL database. To help solve this problem, the following article explains it in detail; readers with this need can learn from it, and we hope you will gain something.

With the wide popularity of Internet applications, the storage and access of massive data has become a bottleneck in system design. For a large Internet application, billions of page views (PV) per day place a heavy load on the database and create serious problems for the stability and extensibility of the system. Improving site performance through data segmentation, scaling the data layer horizontally, has become the preferred approach among architects. Splitting the database horizontally reduces the load on a single machine and minimizes the damage caused by downtime: a load balancing strategy effectively reduces the access load on any single machine and lowers the chance of downtime; a clustering scheme solves the single-point problem in which a database outage makes the whole database inaccessible; and a read-write separation strategy maximizes the speed and concurrency of read operations in the application. At present many large domestic Internet applications (Taobao, Alibaba, Tencent) have adopted such data segmentation schemes, and most of them have implemented their own distributed data access layer (DDAL). By implementation approach and layer, these fall roughly into two levels (taking Java applications as an example): encapsulation at the JDBC layer, and implementation at the ORM framework layer. For direct encapsulation of the JDBC layer, one relatively well-developed domestic project is Amoeba, developed by a research institute of the Ali Group; it is still in the testing stage (beta), and its efficiency and production readiness remain to be studied.

As for implementation at the ORM framework layer, Taobao's distributed data access layer based on ibatis and Spring, for example, has been in use for many years, and its efficiency and production effectiveness have been affirmed by developers and users. This paper discusses a distributed data access layer built at the ORM framework layer. The difficulty of the topic lies in formulating and selecting the routing rules, and in later extensibility: for example, how to expand database capacity (add machine nodes) with the least possible data migration. The core of the work focuses on the routing rules and load balancing strategy for splitting databases and tables.

Chapter 2: Basic Principles and Concepts

2.1 Basic Principles

Human beings tend to approach a problem in this order: what, why, and how.

This paper will discuss and study these three questions in turn.

2.1.1 What Is Data Segmentation?

The word "Shard" means "fragment" in English, and as a database-related technical term, it seems to have first appeared in massively multiplayer online role-playing games. "Sharding" let's call it "fragmentation". Sharding is not a new technology, but a relatively simple software concept. It is well known that data table partitioning was not available until MySQL 5, so before that, many potential users of MySQL had concerns about the scalability of MySQL, and the ability to partition has become a key indicator of whether a database is extensible or not (not the only indicator, of course). Database extensibility is an eternal topic, and promoters of MySQL are often asked: how do you do things like dealing with application data on a single database because of the need for partitioning? The answer is Sharding. Sharding is not an accessory function of a particular database software, but an abstract processing based on specific technical details. It is a solution of horizontal expansion (Scale Out, or horizontal expansion or outward expansion). Its main purpose is to break through the limitation of IMagano capability of single-node database server and solve the problem of database expansibility.

Data is distributed horizontally across different DBs or tables according to a set of segmentation rules, and the specific DB or table a query needs is found through the corresponding DB routing or table routing rules. "Sharding" here usually refers to "horizontal segmentation", which is the focus of this article. What will the segmentation and routing actually look like? At this point readers will inevitably have questions, so let us take a simple example: the log (article) table of a blog application. Among its fields, the one that matters for our purposes is user_id.

Faced with such a table, how do we split it? How do we distribute its data into tables in different databases? By analyzing the blog application, it is not hard to conclude that there are two kinds of users: visitors and blog owners. When a visitor browses a blog, he is browsing under a specific user's blog; when an owner manages his blog, he is likewise operating under a specific user's blog (his own space). This "specific user" is represented in the database by the field user_id, and this user_id is exactly the basis and rule we need for splitting the database. We can proceed like this: put all article rows with user_id from 1 to 10000 in the article table of DB1, all article rows with user_id from 10001 to 20000 in the article table of DB2, and so on up to DBn. In this way the article data is naturally split across the databases, achieving the goal of data segmentation. The next problem to solve is how to find the specific database for a query. The answer is simple and obvious: since we used the distinguishing field user_id when splitting the databases, the database routing process naturally relies on user_id as well. Consider the blog application again: whether we are accessing someone else's blog or managing our own, in the end we need to know whose blog it is, that is, we know its user_id. Using this user_id and the splitting rules in reverse, we can locate the specific database: for example, if user_id is 234, the rule locates DB1; if user_id is 12343, it locates DB2. Using the splitting rules to route back to a specific DB in this way is a process we call "DB routing".
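A minimal sketch of this "DB routing" step in Java (the document's example language), assuming the range width of 10,000 user_ids per database used in the example above; the class and method names are illustrative:

```java
// Range-based DB routing sketch: user_id 1..10000 -> DB1, 10001..20000 -> DB2, ...
public class DbRouter {
    static final int RANGE_SIZE = 10_000; // assumed width of each user_id range

    // Returns the 1-based database index for a user_id.
    public static int routeByRange(long userId) {
        if (userId < 1) throw new IllegalArgumentException("user_id must be positive");
        return (int) ((userId - 1) / RANGE_SIZE) + 1;
    }
}
```

With this rule, user_id 234 routes to DB1 and user_id 12343 routes to DB2, matching the example in the text.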

Of course, a DB design that takes data segmentation into account is bound to be an unconventional, unorthodox design. So what counts as an orthodox DB design?

Essentially, it is the rules we usually follow. Normally we consciously design our databases according to normal form, and under high load we may use replication to improve read/write throughput and performance; this already satisfies many needs, but the shortcomings of that mechanism are still obvious (discussed below). The phrase "consciously design according to normal form" matters here: a DB design for data segmentation will violate these usual rules and constraints. In order to split, we have to introduce redundant fields into the tables, distinguishing fields or shard tag fields, such as the user_id field in the article example above. (Of course, that example does not illustrate the redundancy of user_id very well, since user_id would appear even without sharding, so we got it for free.) Redundant fields do not appear only in sharding scenarios; in many large applications redundancy is also necessary, but that belongs to efficient DB design and will not be elaborated here.

2.1.2 Why Should Data Be Split?

The above gives a general description of what data segmentation is, and readers may wonder why it is needed. Isn't a mature, stable database like Oracle enough to support the storage and querying of massive data? Why do we need data slicing at all? Indeed, Oracle is mature and stable, but its high cost of use and the high-end hardware it requires are not affordable for every company. Imagine usage costs of tens of millions a year plus minicomputers costing tens of millions as hardware support: can an ordinary company afford that? And even if it could, if there is a better solution, cheaper and with better horizontal scaling, why not choose it?

However, things are never entirely satisfactory. Normally we design our databases according to normal form, and under high load we may use replication to improve read/write throughput and performance. This already satisfies many needs, but its shortcomings are obvious. First, its effectiveness depends on the proportion of read operations: the master easily becomes a bottleneck, because write operations must be queued and executed in sequence. If the master is overloaded, the data synchronization delay on the slaves may grow large, and replication consumes considerable CPU, because every write executed on the master must also be replayed on every slave. At this point replication runs out of steam, so why does sharding work where replication cannot? The reason is simple: sharding scales well. Every machine, no matter how well configured, has a physical upper limit, so when our application reaches or far exceeds some limit of a single machine, we can only seek help from other machines or keep upgrading our hardware, and the common solution is to scale out, sharing the pressure by adding more machines. We also have to consider whether our capacity can grow linearly as the business grows. Sharding easily distributes computation, storage, and I/O across multiple machines in parallel, making full use of their combined processing power, avoiding single points of failure, improving system availability, and providing good fault isolation.

Considering the above factors, data segmentation is necessary, and the data segmentation discussed here takes MySQL as its background. For cost reasons, many companies choose the free and open-source MySQL. Developers who know MySQL may recall that table partitioning was not available until MySQL 5, so before that many potential users had concerns about MySQL's extensibility, and the availability of partitioning became a key indicator (though not the only one) of whether a database is scalable. Database scalability is an eternal topic, and MySQL's promoters are often asked: if all the application data lives in a single database, how do you handle the cases that call for partitioning? The answer, again, is sharding, which is exactly the data segmentation scheme we discuss.

By using free MySQL and cheap servers or even PCs in clusters, we achieve the effect of minicomputers plus large commercial DBs while cutting capital investment and operating costs dramatically, so why not? Thus we choose sharding and embrace sharding.

2.1.3 How to Achieve Data Segmentation

Having described what data segmentation is and why it is needed, we now elaborate in detail on its methods and forms.

Data segmentation can be physical: data is distributed across different DB servers according to a set of segmentation rules, and routing rules direct each access to a specific database, so that each access faces not a single server but N servers, reducing the load pressure on any single machine.

Data segmentation can also occur within a single database: according to a set of segmentation rules, data is distributed across different tables in one database. For example, article can be split into sub-tables such as article_001 and article_002, which together logically form the complete article table. The purpose is simple. Suppose the article table has 50 million (5000w) rows and we need to insert a new row; after the insert, the database must update the table's index, and the overhead of maintaining an index over 50 million rows cannot be ignored. But if we split this table into 100 tables, from article_001 to article_100, and spread the 50 million rows evenly, each sub-table holds only 500,000 rows, and the time to index an inserted row in a 500,000-row table drops by orders of magnitude, greatly improving the DB's runtime efficiency and concurrency. And the benefits of splitting tables are not limited to this: operations that take locks, such as writes, also benefit noticeably.

To sum up: splitting into multiple databases reduces the load on a single machine, while splitting into multiple tables improves the efficiency of data operations, especially writes. So far we have not addressed how the split is actually decided; next we elaborate on the segmentation rules.
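On the table-splitting side, the target sub-table name can be derived from the tag field. The sketch below assumes 100 sub-tables and a simple user_id modulo mapping, which the text does not prescribe; the names are illustrative:

```java
// Sub-table naming sketch: maps a user_id to one of article_001 .. article_100.
public class SubTableNamer {
    static final int TABLE_COUNT = 100; // assumed number of sub-tables

    public static String subTableFor(long userId) {
        int idx = (int) (userId % TABLE_COUNT) + 1; // 1..100
        return String.format("article_%03d", idx);  // zero-padded suffix, e.g. article_035
    }
}
```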

As mentioned above, to achieve horizontal data segmentation, every table must have a redundant field as the segmentation basis and tag field. Generally we choose user_id as the distinguishing field, and on that basis there are three common ways of splitting the database (there are of course others):

(1) Split by number segment:

Using user_id as the distinguishing field: user_id 1~1000 maps to DB1, 1001~2000 to DB2, and so on.

Advantages: expansion requires only partial data migration.

Disadvantages: data may be unevenly distributed.

(2) Split by hash modulo:

Hash the user_id (or use the value directly if user_id is numeric), then take the result modulo a specific number. For example, to split into four databases, we take the hash of user_id modulo 4 (user_id % 4), giving four possible results: 1 maps to DB1, 2 to DB2, 3 to DB3, and 0 to DB4. In this way data is distributed evenly across the four DBs.

Advantages: data is distributed evenly.

Disadvantages: data migration is troublesome, and load cannot be apportioned according to machine performance.
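The hash-modulo rule above can be sketched as follows; per the text, a remainder of 0 maps to DB4:

```java
// Hash-modulo DB routing sketch for four databases, as in the example:
// user_id % 4 == 1 -> DB1, 2 -> DB2, 3 -> DB3, 0 -> DB4.
public class HashModRouter {
    static final int DB_COUNT = 4;

    public static int route(long userId) {
        int remainder = (int) (userId % DB_COUNT);
        return remainder == 0 ? DB_COUNT : remainder; // remainder 0 maps to the last DB
    }
}
```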

(3) Save the user-to-database mapping in a dedicated configuration database:

That is, set up a separate DB that stores only the mapping between user_id and DB. Every database access first queries this mapping database to obtain the specific DB information, and only then performs the query it actually needs.

Advantages: strong flexibility; a one-to-one mapping between user and database.

Disadvantages: every query is preceded by an extra lookup, which degrades performance considerably.
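A sketch of this lookup-based routing, with an in-memory map standing in for the separate mapping database (in practice the mapping would live in a table and usually be cached to soften the extra-lookup cost):

```java
import java.util.HashMap;
import java.util.Map;

// Lookup-table DB routing sketch; the map stands in for the mapping database.
public class LookupRouter {
    private final Map<Long, Integer> userToDb = new HashMap<>();

    // In a real system this would be an INSERT into the mapping database.
    public void register(long userId, int db) {
        userToDb.put(userId, db);
    }

    // In a real system this would be a (cached) query against the mapping database.
    public int route(long userId) {
        Integer db = userToDb.get(userId);
        if (db == null) throw new IllegalStateException("no mapping for user_id " + userId);
        return db;
    }
}
```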

These are the three approaches we usually choose in development; complex projects may mix them. The description above should give a basic understanding of the splitting rules. Of course better and more complete splitting schemes exist, and we must keep exploring and discovering them.

Chapter 3: Outline of This Research

The text above follows the human pattern of cognition, what, why, and how, to explain some concepts and the significance of database segmentation, and briefly introduces some conventional segmentation rules. But the distributed data layer discussed in this topic is more than that: it is a complete data layer solution. What does that mean? The following text elaborates the complete idea and implementation of this research topic.

The distributed data scheme provides the following functions:

(1) Provide sharding rules and routing rules (RouteRule, abbreviated RR). The three segmentation rules described above are embedded directly in the system; the specific embedding is explained and discussed in detail below.

(2) Introduce the concept of a Group to ensure high availability of data.

(3) Introduce a load balancing policy (LoadBalancePolicy, abbreviated LB).

(4) Introduce an availability detection mechanism for cluster nodes, regularly probing the availability of individual machines to guarantee correct LB decisions and high system stability.

(5) Introduce read/write separation to improve query speed.
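The five pieces above might fit together roughly as in the following interface sketch; all names here are illustrative, not taken from the actual project:

```java
// Hypothetical sketch of how the data layer's pieces relate; names are illustrative.
public class DataLayerSketch {
    // (1) Sharding/routing rules: which Group does this user_id belong to?
    public interface RouteRule { int groupFor(long userId); }

    // (3) Load balancing policy: which slave serves this read?
    public interface LoadBalancePolicy { String pick(String[] slaves); }

    // (4) Availability detection: is this node still usable?
    public interface AvailabilityDetector { boolean isAlive(String server); }

    // (2) + (5): a Group is one master for writes plus N slaves for reads.
    public static class Group {
        public final String master;
        public final String[] slaves;

        public Group(String master, String[] slaves) {
            this.master = master;
            this.slaves = slaves;
        }

        // Read/write separation: writes go to the master, reads to a slave chosen by the LB.
        public String serverFor(boolean write, LoadBalancePolicy lb) {
            return write ? master : lb.pick(slaves);
        }
    }
}
```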

A data layer design consisting only of split databases and tables is not complete. What happens when the DB server on one node goes down? Having adopted sharding, our complete DB consists of N machines; if one machine goes down, only 1/N of the data becomes inaccessible. That is acceptable, and at least far better than before sharding, when an outage could make the entire DB unreachable. In ordinary applications such partial unavailability after a machine failure may be tolerable, but suppose our system is a high-concurrency e-commerce site? The economic loss caused by a single node going down would be severe. In other words, the scheme still has problems, and its fault tolerance cannot stand the test. Of course, every problem has a solution: we introduce the concept of a cluster, which here I call a Group. For each shard node we introduce multiple machines, each storing the same data. Normally these machines share the load, and when one goes down, the load balancer distributes the load that would have gone to the failed machine among the rest. In this way the fault tolerance problem is solved, so we introduce the cluster concept and embed it in the framework as one of its parts.

As shown in the figure above, the entire data layer consists of three clusters, Group1, Group2, and Group3; these three clusters are the result of horizontal data segmentation, and together they form a DB containing the complete data. Each Group includes one master (there can of course be more than one) and N slaves, and the data on master and slaves is consistent. If one slave in Group1 goes down, for instance, the other two slaves can still serve. Such a model will never leave part of the data inaccessible unless every machine in the Group is down at once, and the probability of that is very small (barring a power outage, it is unlikely to happen).

Before introducing clustering, one of our queries proceeded roughly like this: request the data layer, passing the necessary shard distinguishing field (usually user_id); the data layer routes to the specific DB according to that field; the data operation is performed within that DB. With clustering introduced, what changes? As figure 1 shows, the rules and policies on our router can only route to a specific Group, that is, to a virtual Group rather than a concrete physical server. The next step is to find a concrete physical DB server on which to perform the data operation. For this step we introduce the concept of a load balancer (LB). The load balancer's job is to locate a concrete DB server, as follows: it analyzes the read/write characteristics of the current SQL; if it is a write operation, or an operation requiring real-time data, it sends the query directly to the master; if it is a read operation, it allocates a slave via the load balancing policy.

The main subject of our load balancer research is the load distribution strategy. Load balancing usually includes random load balancing and weighted load balancing. Random load balancing is easy to understand: pick one slave at random out of N. Such random balancing ignores machine performance, assuming every machine performs the same; if that is actually the case, there is nothing wrong with it. But what if it is not? When the physical performance and configuration of the slaves differ, performance-blind random balancing is unscientific: it puts unnecessary load on weaker machines, even risking downtime, while high-performance database servers cannot exercise their full capability. Based on this consideration we introduce weighted load balancing: through an interface in our system, each DB server is assigned a weight, and at runtime the LB allocates load to each server in proportion to its weight within the cluster. Of course, introducing this concept increases the system's complexity and maintenance burden. There are gains and losses, and there is no escaping that.
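The weighted strategy just described might look like this minimal sketch (names are illustrative; a real implementation would also handle zero total weight and dynamic weight changes):

```java
import java.util.Random;

// Weighted load balancing sketch: each slave receives load in proportion to its weight.
public class WeightedBalancer {
    private final String[] slaves;
    private final int[] weights;
    private final int totalWeight;
    private final Random random;

    public WeightedBalancer(String[] slaves, int[] weights, Random random) {
        this.slaves = slaves;
        this.weights = weights;
        this.random = random;
        int total = 0;
        for (int w : weights) total += w;
        this.totalWeight = total;
    }

    // Picks a slave with probability weights[i] / totalWeight.
    public String pick() {
        int r = random.nextInt(totalWeight); // uniform in [0, totalWeight)
        for (int i = 0; i < slaves.length; i++) {
            r -= weights[i];
            if (r < 0) return slaves[i];
        }
        return slaves[slaves.length - 1]; // defensive; unreachable when weights sum to totalWeight
    }
}
```

Setting all weights equal reduces this to plain random load balancing, which is why the weighted form subsumes the simple one.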

With shard libraries, clusters, and load balancers, is everything fine? Things are not as simple as we thought. These pieces basically ensure that our data layer can withstand great pressure, but the design still cannot completely avoid the harm of a database outage. If slave2 in Group1 goes down and the system's LB does not know it, the situation is actually dangerous: since the LB does not know, it still considers slave2 available and keeps assigning load to it. The problem then surfaces on the client side as failed data operations, errors or exceptions, which is very unfriendly! How do we solve it? We introduce an availability detection mechanism for cluster nodes, or alternatively an availability data push mechanism. What is the difference between the two? First the detection mechanism: as the name suggests, the data layer client periodically probes the availability of each database in the cluster, the principle being to tentatively open a connection to, or access, the database port. One can also use JDBC and Java's exception mechanism to judge availability, as mentioned below. And the data push mechanism? This comes from real application scenarios: in general, if an application's database goes down, the DBA will surely know; the DBA then manually pushes the database's current state to the client, that is, the application side of the distributed data layer, updating a local list of DB statuses and telling the LB that this database node is unusable, so no load should be assigned to it. One mechanism is active monitoring, the other passive notification; each has its strengths, but both achieve the same effect. With them, the hypothetical problem above will not happen, and even if it does, its probability is minimized.
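The probing mechanism, tentatively connecting to the database port, can be sketched as follows; the host, port, and timeout values are illustrative:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Availability probe sketch: tentatively opens a TCP connection to the database port.
// Any failure (refused, unreachable, timed out) marks the node as down.
public class NodeProbe {
    public static boolean isAlive(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A scheduler would call this for every node at a fixed interval and feed the results to the LB's availability list.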

We have not yet said much about the master and slaves mentioned above. As figure 1 shows, a Group consists of one master and N slaves. Why? The master carries the write load: all write operations go to the master, while reads are allocated to the slaves, which greatly improves read efficiency. In typical Internet applications, surveys put the read/write ratio at about 10:1, that is, the great majority of operations are reads, which is why we have multiple slaves. But why separate reads from writes at all? Developers familiar with DBs know that write operations involve locks, whether row locks, table locks, or block locks, all of which reduce execution efficiency. Our separation concentrates writes on one node while reads proceed on the other N nodes, effectively improving read efficiency and ensuring high availability. Read/write separation does raise a new question: how do we keep the data on the master synchronized and consistent with the other slave machines in the cluster? This is not something we need to worry much about: MySQL's own replication mechanism can do this for us, and since it is not closely related to this topic,

I won't go into details here.
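The load balancer's read/write analysis of the current SQL, mentioned earlier, could be sketched naively like this; a production implementation needs a real parser (for instance, SELECT ... FOR UPDATE should also go to the master):

```java
import java.util.Locale;

// Naive read/write classification of a SQL statement for read-write separation.
// Writes are routed to the master; everything else may go to a slave.
public class SqlKind {
    public static boolean isWrite(String sql) {
        String head = sql.trim().toLowerCase(Locale.ROOT);
        return head.startsWith("insert") || head.startsWith("update")
            || head.startsWith("delete") || head.startsWith("replace");
    }
}
```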

To sum up, these are the general functions of the distributed data layer studied in this topic. The above discusses the basic principles; next we carry out deeper analysis and research at the level of system design.

Chapter 4: System Design

4.1 Selection of the System Implementation Level

As mentioned in the introduction, there are two choices for the system's implementation level: one based on the JDBC layer, the other based on an existing data persistence framework such as Hibernate or ibatis. Each level has its own advantages and disadvantages. Implementing at the JDBC level greatly raises the difficulty of development and later use, and greatly increases development and maintenance costs. The orientation of this topic is to build an upper-layer encapsulation on top of the mature ibatis persistence framework, rather than modify the ibatis source code directly, so that the system is minimally intrusive to the existing framework and more flexible to use. The reasons for choosing ibatis are as follows:

(1) ibatis has a very low learning cost; a proficient Java programmer can use it skillfully within a very short time.

(2) ibatis is a lightweight ORM that simply completes the OR/RO mapping. Its query statements are configured at the native SQL level through the sql-map.xml configuration file; that is, it does not introduce a concept like Hibernate's HQL, which enhances the controllability of the SQL. An excellent DBA can optimize at the SQL level, giving the application's data layer strong controllability. Hibernate, though very powerful, wraps the OR mapping heavily and introduces HQL, which makes it inconvenient for a DBA team to control and tune the SQL statements.

For the above two reasons, this topic chooses the easy-to-learn, lightweight persistence framework ibatis among ORM products. The following discussions are likewise specific to ibatis.

4.2 Choice of Other Open Source Frameworks

In large Java applications we usually use open-source frameworks like Spring, especially its IoC (DI) container, to help developers manage object dependencies and layering and to reduce coupling between the system's layers. Spring's advantages and usefulness are well known to developers and will not be repeated here. The data layer of this topic also uses Spring as its IoC (DI) framework.

4.3 Development Techniques and Tools

Development language: Java JDK1.5

Integrated development environment: Eclipse 3.3.4

Test server for the Web environment: JBoss 4.2

Build tool: Antx, a build tool developed by Taobao (similar to Maven); Maven can of course be used instead.

Dependent open-source JARs: Spring 2.0, ibatis, commons-configuration (for reading configuration files), log4j, junit, etc.

Chapter 5: System Analysis

1. What key data in the system needs to be split?

The system has static data and dynamic data.

Static data is usually small and does not need to be split, so a separate static configuration library makes sense. Since much SQL must join against static data, static data will inevitably be replicated into the dynamic databases as well; at the same time, a separate static library makes managing static data very convenient.

Dynamic data may be split horizontally only, or vertically and then horizontally. For example: split vertically into user and blog, then split user and blog horizontally.

The granularity of segmentation must be judged accurately, or it will cause great difficulty in development and maintenance.

2. Globally unique ID

Whether horizontal splitting is done by hashing or by a geographic attribute, the ID of a user or blog must be meaningful: it must be possible, literally or algorithmically, to identify from the ID itself which shard library the data belongs to.
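One common way to make IDs algorithmically meaningful is to encode the shard number into the ID itself; this is a generic sketch, not the document's prescribed scheme:

```java
// Shard-aware ID sketch: the low digits of the ID carry the shard number,
// so the shard can be recovered from the ID alone, without a lookup.
public class ShardedId {
    static final long SHARD_COUNT = 4; // assumed number of shard libraries

    // Builds an ID from a per-shard sequence number and the shard number.
    public static long make(long sequence, long shard) {
        return sequence * SHARD_COUNT + shard;
    }

    // Recovers the shard number from the ID.
    public static long shardOf(long id) {
        return id % SHARD_COUNT;
    }
}
```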

3. Integrating segmented data (joins)

A huge database necessarily contains many models: a user model, a data model, and so on. The question is whether all the data belonging to a user can be found in that user's own database. Arranging this reduces development difficulty as much as possible; it is unrealistic to require developers to avoid table joins entirely.

4. Management of sub-table

After horizontal database splitting, there will inevitably be sub-tables as well.

Does a shard library hold one sub-table, or several? Which sub-tables sit in which library, and which shard library each sub-table belongss to, all need to be managed. If a shard library held only a single sub-table of each type, database resources would be wasted.

5. Connection management after splitting the database

After database splitting, the number of databases increases, which inevitably increases the number of connections.

Should one application connect to all the shard libraries, or should each functional application connect only to the specific shard libraries it needs?

If you use Oracle RAC, you must also consider RAC's gc buffer busy waits.

6. Coordinating SQL during development

If the system is small, sql+xml configuration is workable, but systems that need sharding are generally very large. With sql+xml configuration alone, many problems arise, and development difficulty and friction rise sharply. SQL should be delegated to developers, and the tables referenced in the SQL should be managed through an abstract, logical naming scheme. This improves development efficiency and also manages the SQL that must execute in the shard libraries.

7. Business parameters

If, whenever the business logic is complex, every function or method must explicitly pass the sharding parameters, the sharding logic and the application become completely entangled, making later maintenance and upgrades extremely difficult. It is better to pass the business parameters at the page or interface level and consume them later at the bottom of the ORM layer.
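One way to pass the sharding parameter from the page or interface layer down to the ORM layer without threading it through every method signature is a thread-local context; this is a generic sketch, not the document's actual design:

```java
// Thread-local shard context sketch: the page/interface layer sets the
// sharding parameter (e.g. user_id), and the ORM layer reads it later
// on the same thread, without every intermediate method passing it along.
public class ShardContext {
    private static final ThreadLocal<Long> USER_ID = new ThreadLocal<>();

    public static void set(long userId) { USER_ID.set(userId); }

    public static Long get() { return USER_ID.get(); }

    // Must be called when the request ends, to avoid leaking state
    // between requests on pooled threads.
    public static void clear() { USER_ID.remove(); }
}
```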

8. Business trade-offs

Previously the application could query by name; after sharding, such a query is no longer possible and must instead be made by the sharding attribute, which changes the business or its operation.

Generally speaking, some business operations will change as the sharding proceeds.

9. Distributed transactions

Once the data is split across multiple databases, business operations inevitably span several of them, which raises the problem of distributed transactions (distributed transactions do not arise only across multiple databases). On a busy database, cross-database operations must also avoid cross-library deadlock: a database cannot detect a deadlock cycle that involves another library as well as itself.

We hope that reading the above has been helpful to you.
