This article mainly introduces what MySQL database sharding (data segmentation) is. It goes into some detail and has a certain reference value; interested readers are encouraged to read it!
Preface to MySQL Database Sharding
Scaling out through the MySQL Replication capability is always constrained by the size of a single database. Once the database grows too large, and especially once writes become too frequent for one host to support, we still face a scaling bottleneck. At that point we must turn to other technical means, namely the data sharding techniques introduced in this chapter.
What is data segmentation?
Many readers may have seen articles about data splitting online or in magazines, some of which simply call it data Sharding. Whether it is called data Sharding or data splitting, the concept is the same.
Put simply, it means that, according to some specific condition, data that would otherwise be stored in the same database is distributed across multiple databases (hosts), thereby spreading the load of a single device. Data sharding also improves the overall availability of the system: when a single device crashes, only part of the data becomes unavailable rather than all of it.
Based on the type of splitting rule used, data sharding can be divided into two modes.
One splits different tables (or Schemas) into different databases (hosts); this is called vertical splitting. The other splits the data of the same table across multiple databases (hosts) according to some condition on the logical content of the rows; this is called horizontal splitting.
The most important characteristic of vertical splitting is that the rules are simple and implementation is straightforward. It is especially suitable for systems with very low coupling between businesses, little interaction between modules, and clear business logic. In such a system it is very easy to move the tables used by different business modules into different databases. Because the split follows table boundaries, the impact on the application is smaller and the splitting rules are simpler and clearer.
Horizontal splitting is somewhat more complicated than vertical splitting. Because different rows of the same table are split into different databases, the splitting rule itself is more complex for the application than a split by table name, and later data maintenance is also more involved.
When the data volume and query load of one (or a few) tables is so large that even a dedicated host obtained through vertical splitting cannot meet the performance requirements, we must combine vertical and horizontal splitting: split vertically first, then horizontally, to solve the performance problem of such an extremely large table.
In the following, we analyze the architecture of vertical, horizontal, and combined data splitting, as well as how the data is integrated again after being split.
Vertical segmentation of data
Let's first look at what vertical data splitting is. Vertical splitting can also be called longitudinal splitting. Think of a database as consisting of many "data blocks" (tables); cutting these blocks apart vertically and spreading them across multiple database hosts is vertical data splitting.
In a well-architected application system, the overall functionality is composed of many functional modules, and the data needed by each module corresponds to one or more tables in the database.
In architecture design, the more unified the interaction points between functional modules, the lower the coupling of the system and the better the maintainability and extensibility of its modules; such a system is also easier to split vertically.
The clearer the functional modules and the lower the coupling, the easier it is to define the rules for vertical splitting. Data can be split by functional module, with the data of different modules stored on different database hosts; this easily avoids cross-database Joins while keeping the system architecture very clear.
Of course, it is very hard for a system to reach the point where the tables used by all functional modules are completely independent, never needing to access another module's tables or Join the tables of two modules. In that case we must evaluate and trade off based on the actual application scenario: either the database makes a concession and all the related tables that need Joins stay in the same database, or the application does extra work, obtaining the data from the different databases through module interfaces and completing the Join in the program.
Generally speaking, if the system's load is relatively low and table associations are very frequent, letting the database concede and keeping the closely related modules together, thereby sparing the application a lot of work, is a feasible plan.
Of course, by letting the database concede and allowing multiple modules to share a data source centrally, we implicitly accept increased coupling between the modules, which may make the architecture worse and worse. In particular, when development reaches the stage where the database truly cannot bear the pressure of these tables and the data has to be split again, the cost of re-architecting may be far higher than doing it at the start.
So when splitting a database vertically, how to split and how far to go is a question that really tests people. Only by weighing the costs and benefits of all aspects in the actual application scenario can we arrive at a split plan that really suits us.
As an example, let's briefly analyze the example database of the sample system used in this book and design a simple rule for a vertical split.
The system's functionality can be divided into four modules: user, group discussion, photo album, and event, corresponding roughly to the following tables (a routing sketch follows the list):
1. User module tables: user, user_profile, user_group, user_photo_album
2. Group discussion tables: groups, group_message, group_message_content, top_message
3. Photo module tables: photo, photo_album, photo_album_relation, photo_comment
4. Event module tables: event
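To make the vertical split concrete, here is a minimal sketch of how an application-side routing layer might map each module's tables to its own database. The host names, JDBC URLs, and the dataSourceFor helper are illustrative assumptions, not part of the original sample system.

```java
import java.util.Map;

// A minimal, illustrative table-to-database mapping for the vertical split.
// The JDBC URLs and host names below are assumptions for demonstration only.
public class VerticalShardRouter {

    // Each table is owned by exactly one module-specific database.
    private static final Map<String, String> TABLE_TO_DB = Map.ofEntries(
        // user module
        Map.entry("user", "jdbc:mysql://db-user:3306/user_db"),
        Map.entry("user_profile", "jdbc:mysql://db-user:3306/user_db"),
        Map.entry("user_group", "jdbc:mysql://db-user:3306/user_db"),
        Map.entry("user_photo_album", "jdbc:mysql://db-user:3306/user_db"),
        // group discussion module
        Map.entry("groups", "jdbc:mysql://db-group:3306/group_db"),
        Map.entry("group_message", "jdbc:mysql://db-group:3306/group_db"),
        Map.entry("group_message_content", "jdbc:mysql://db-group:3306/group_db"),
        Map.entry("top_message", "jdbc:mysql://db-group:3306/group_db"),
        // photo module
        Map.entry("photo", "jdbc:mysql://db-photo:3306/photo_db"),
        Map.entry("photo_album", "jdbc:mysql://db-photo:3306/photo_db"),
        Map.entry("photo_album_relation", "jdbc:mysql://db-photo:3306/photo_db"),
        Map.entry("photo_comment", "jdbc:mysql://db-photo:3306/photo_db"),
        // event module
        Map.entry("event", "jdbc:mysql://db-event:3306/event_db")
    );

    // Look up which database a table lives in after the vertical split.
    public static String dataSourceFor(String table) {
        String url = TABLE_TO_DB.get(table);
        if (url == null) {
            throw new IllegalArgumentException("Unknown table: " + table);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(dataSourceFor("group_message")); // the group module's database
    }
}
```

Because the mapping follows table boundaries, the application only needs this lookup to pick a data source; any Join that crosses module databases has to be handled through module interfaces instead.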
At first glance, no module can exist completely independently of the others; there are relationships between all of the modules. Does that mean the data cannot be split?
Of course not. Looking a little deeper, we find that although the tables used by the modules are related to one another, the relationships are fairly clear and simple:
◆ The group discussion module is related to the user module mainly through user or group relationships; generally, the association goes through the user's id or nick_name and the group's id. Handling it through the interface between the modules does not cause much trouble.
◆ The photo module is related only to the user module, and that association is basically through the user id: simple and clear, with a clean interface.
◆ The event module may be related to every other module, but it only cares about the IDs of the objects in each module, so it can be split very easily.
So our first step is to split the database vertically according to the tables of each functional module. The tables involved in each module go into their own dedicated database, and table associations between modules are handled through interfaces on the application side, as shown in the figure below.
After such a vertical split, a service that previously could only be provided by one database is now provided by four databases, so the service capacity naturally increases several times over.
Advantages of vertical splitting
◆ The database split is simple and clear, with well-defined splitting rules.
◆ Application modules are clear and well-defined, making integration easy.
◆ Data maintenance is straightforward and problems are easy to locate.
Disadvantages of vertical splitting
◆ Some table associations cannot be completed at the database level and must be finished in the application.
◆ Tables with extremely frequent queries and very large data volumes may still face performance problems and may not meet the requirements.
◆ Transaction handling becomes relatively more complex.
◆ Once the split reaches a certain point, further scalability hits its limits.
◆ Over-splitting may make the system overly complex and hard to maintain.
For the data-association and transaction problems that vertical splitting may run into, it is very difficult to find a good solution at the database level. In practical cases, most vertical splits follow the modules of the application system: the data sources of the same module are kept in the same database, which solves data association within the module, while between modules the required data is provided to each other by the application in the form of service interfaces.
Although this does increase the total number of database operations, it is worthwhile for overall system scalability and architectural modularity. The response time of some individual operations may rise slightly, but the overall performance of the system is likely to improve to some degree. As for the scaling bottleneck itself, it can only be overcome by the horizontal sharding architecture described in the next section.
Horizontal segmentation of data
The previous section analyzed the vertical splitting of data; this section examines horizontal splitting. Vertical splitting can be understood simply as dividing the data by table according to module, whereas horizontal splitting no longer follows tables or functional modules. Generally speaking, simple horizontal splitting mainly means spreading one very large table across multiple tables or databases according to some rule on one of its fields, with each piece containing a portion of the data.
Put simply, we can understand horizontal sharding as splitting the rows of a table: some rows are placed in one database, while other rows are placed in other databases. Of course, to make it easy to determine which database each row belongs to, the split must always follow certain rules.
For example, the rule may be the modulus of a numeric field, a range on a time field, or the hash value of a character field. If most of the core tables in the whole system can be associated through one field, that field is naturally the top choice for the horizontal partition key. Of course, if the situation is special and no such field can be used, something else has to be chosen.
Generally speaking, for Web 2.0 sites of the kind popular on the Internet today, most of the data can be related through member (user) information, so many core tables are well suited to horizontal sharding by member ID.
Forum and community systems are even easier to split; they can very easily be sharded horizontally by forum ID.
After the split, there is basically no interaction between the resulting databases.
Take our sample system: all of the data is associated with users, so we can split horizontally by user and place different users' data in different databases. The one exception is the groups table in the user module, which is not directly related to a user, so groups cannot be split horizontally by user. A special table like this can simply be kept independent and placed in its own database.
In fact, that approach borrows from the "vertical splitting" described in the previous section. The combined use of vertical and horizontal splitting at the same time is described in more detail in the next section.
So, for our sample database, most tables can be split horizontally by user ID. For example, take each user's ID modulo 2 and store the two resulting groups of users in two different databases (a routing sketch follows below).
Every table that carries the user ID can be split this way. As a result, essentially all of the data related to a given user lives in the same database, and even when an association is needed it can be done very simply.
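As a minimal sketch of this rule, assuming two shards with placeholder JDBC URLs and credentials (none of which come from the original sample system), the routing might look like this:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative horizontal sharding by user ID: user_id % 2 picks one of two databases.
// The JDBC URLs and credentials are placeholder assumptions.
public class UserShardRouter {

    private static final String[] SHARD_URLS = {
        "jdbc:mysql://db-user-0:3306/user_db_0",
        "jdbc:mysql://db-user-1:3306/user_db_1"
    };

    // The sharding rule: route by user ID modulo the number of shards.
    public static String shardFor(long userId) {
        int index = (int) (userId % SHARD_URLS.length);
        return SHARD_URLS[index];
    }

    // All user-related tables carry the user ID, so a per-user query stays on one shard.
    public static String loadNickName(long userId) throws SQLException {
        String sql = "SELECT nick_name FROM user WHERE id = ?";
        try (Connection conn = DriverManager.getConnection(shardFor(userId), "app", "secret");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```

Because every user-related table is split by the same key, any per-user query can be answered from a single shard without touching the others.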
The figure below shows horizontal splitting more intuitively.
Advantages of horizontal splitting
◆ Table associations can basically still be completed on the database side.
◆ There is no bottleneck caused by a single table with a very large data volume and heavy load.
◆ Changes to the overall application architecture are relatively small.
◆ Transaction handling is relatively simple.
◆ As long as the splitting rule is well defined, it is basically hard to hit scalability limits.
Disadvantages of horizontal splitting
◆ The splitting rules are relatively more complex, and it is very difficult to abstract one rule that satisfies the whole database.
◆ Data maintenance becomes harder later on, and locating data manually is more difficult.
◆ The coupling between the modules of the application system is higher, which may make later data migration and separation more difficult.
Combined use of vertical and horizontal splitting
In the previous two sections we learned how "vertical" and "horizontal" splitting are implemented and what the architecture looks like afterwards, and we analyzed the advantages and disadvantages of each. In real application scenarios, however, apart from systems whose load is not too heavy and whose business logic is relatively simple, which can solve their scalability problems with one of these two methods alone, most systems with somewhat more complex business logic and heavier load cannot achieve good scalability with either method by itself. They need to combine the two, using different methods in different scenarios.
In this section I will combine the strengths and weaknesses of vertical and horizontal splitting to further refine our overall architecture and further improve the scalability of the system.
Generally speaking, it is very unlikely that all of the tables in our database can be associated through one (or a few) fields, so horizontal splitting alone can hardly solve all of the problems. Vertical splitting, for its part, only solves part of the problem: for systems under very heavy load, even a single table may be more than a single database host can bear.
We must combine "vertical" and "horizontal" splitting, using them at the same time to make full use of their strengths and avoid their weaknesses.
The load of every application system grows step by step. When the first performance bottleneck appears, most architects and DBAs choose to split the data vertically first, because the cost is lowest and it best matches the maximum return on investment pursued at that stage. However, as the business keeps expanding and the system load keeps growing, after a period of stability the vertically split database cluster may once again be overwhelmed and hit a performance bottleneck.
What should we choose at that point? Subdivide the modules further, or look for another way? If we keep subdividing the modules and splitting the data vertically as we did at the beginning, we are likely to run into the same problem again in the near future, and as the modules get ever finer the architecture of the application system becomes more and more complex and the whole system risks getting out of control.
At this point we must use the strengths of horizontal splitting to solve the problem. Moreover, when applying horizontal splitting we do not need to discard the result of the earlier vertical split; we build on top of it, using horizontal splitting's strengths to avoid vertical splitting's weaknesses and to keep the system's complexity from growing further.
Meanwhile, the weakness of horizontal splitting (rules that are hard to unify) has already been mitigated by the earlier vertical split, so the horizontal split can be carried out easily.
For our sample database, suppose we performed a vertical split at the beginning, and that with the continuous growth of the business the database system hit a bottleneck, so we chose to restructure the architecture of the database cluster. How should we restructure it? Given that the data has already been split vertically and the module structure is clear,
and that the momentum of business growth is getting stronger and stronger, even if we subdivided the modules again now it would not last long.
We choose to split horizontally on top of the vertical split.
Each database cluster produced by the vertical split contains only one functional module, and basically all of the tables within each module can be associated through one field. For example, everything in the user module can be split by user ID, the group discussion module by group ID, and the photo album module by album ID. The event notification table, given the time-bound nature of its data (usually only a recent period of events is queried), can instead be split by time. A sketch of this module-aware routing follows below.
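Below is a minimal sketch of how such combined routing might look at the application layer: the vertical split selects a module cluster, and each module then applies its own horizontal rule. The shard counts, host names, and helper methods are assumptions for illustration only.

```java
// Illustrative routing for the combined split: first by module (vertical),
// then by each module's own key (horizontal). Shard counts and URL patterns are assumptions.
public class CombinedShardRouter {

    private static final int USER_SHARDS = 2;
    private static final int GROUP_SHARDS = 2;
    private static final int PHOTO_SHARDS = 2;

    public static String userShard(long userId) {
        return "jdbc:mysql://db-user-" + (userId % USER_SHARDS) + ":3306/user_db";
    }

    public static String groupShard(long groupId) {
        return "jdbc:mysql://db-group-" + (groupId % GROUP_SHARDS) + ":3306/group_db";
    }

    public static String photoShard(long albumId) {
        return "jdbc:mysql://db-photo-" + (albumId % PHOTO_SHARDS) + ":3306/photo_db";
    }

    // Event data is time-bounded, so it is split by month rather than by ID.
    public static String eventShard(java.time.YearMonth month) {
        return "jdbc:mysql://db-event:3306/event_"
                + month.getYear() + String.format("%02d", month.getMonthValue());
    }
}
```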
The following figure shows the entire architecture after sharding:
In fact, in many large-scale application systems, vertical and horizontal splitting basically coexist, and they often alternate continuously in order to keep increasing the scalability of the system. When dealing with different application scenarios, we should also take full account of the limitations and strengths of the two splitting methods and use different combinations at different stages (levels of load pressure).
Advantages of combined splitting
◆ It makes full use of the strengths of vertical and horizontal splitting while avoiding their respective shortcomings.
◆ It maximizes the scalability of the system.
Disadvantages of combined splitting
◆ The database architecture is more complex than a traditional single database and harder to maintain.
◆ The application architecture is also relatively more complex.
Data segmentation and integration scheme
From the preceding sections it should be very clear that sharding the database can greatly improve the scalability of the system. However, once the data has been split vertically and/or horizontally and stored on different database hosts, the biggest problem facing the application system is how to integrate these data sources well; this is probably also an issue many readers care about. The main content of this section is to analyze the overall solutions that can help us achieve data splitting and data integration.
Data integration is very hard to achieve by relying on the database itself. MySQL does have the Federated storage engine, which can solve some similar problems, but it is very hard to use well in real application scenarios. So how do we integrate data sources scattered across multiple MySQL hosts?
In general, there are two ways to solve the problem:
1. Each application module configures and manages the data source(s) it needs, accesses each database directly, and completes the data integration inside the module.
2. All data sources are managed uniformly through an intermediate proxy layer, and the back-end database cluster is transparent to the front-end application.
Faced with these two options, probably more than 90% of people will lean toward the second, especially as the system keeps getting bigger and more complex.
Indeed, this is a very sound choice: although the cost may be relatively higher in the short term, it is very helpful for the scalability of the whole system.
Therefore I will not analyze the first option in much detail here. Below, I focus on some of the solutions within the second approach.
★ Developing the intermediate proxy layer in-house
After deciding to integrate data sources through an intermediate database proxy layer, many companies (or enterprises) choose to develop their own proxy-layer application tailored to their specific application scenarios.
A self-developed intermediate proxy layer can be adapted to one's own application to the greatest extent, maximizing customization for many personalized requirements and responding flexibly to change. This is the biggest advantage of developing the proxy layer in-house.
Of course, while enjoying the benefits of maximum customization, self-development naturally requires a lot of additional cost for up-front research and development and for continuous upgrades and improvement later on, and the technical bar may also be higher than for a simple Web application. It is therefore best to make a comprehensive assessment before deciding to build in-house.
Since in-house development is mostly about adapting to one's own application systems and business scenarios, it is hard to analyze in general terms; below we mainly analyze several popular data-source integration solutions.
★ Using MySQL Proxy to implement data splitting and integration
MySQL Proxy is an official database proxy-layer product provided by MySQL. Like MySQL Server, it is an open-source product released under the GPL. It can be used to monitor, analyze, or forward the communication between clients and servers. Its flexibility allows you to use it to the fullest; its main current capabilities are connection routing, Query analysis, Query filtering and rewriting, load balancing, and basic HA mechanisms.
In fact, MySQL Proxy itself does not provide all of the above features; rather, it provides the foundation for implementing them.
To implement these functions, we also need to write our own Lua scripts.
MySQL Proxy effectively sits between the client requests and MySQL Server, establishing a connection pool in the middle. All client requests are sent to MySQL Proxy, which analyzes each request, determines whether it is a read or a write operation, and forwards it to the corresponding MySQL Server; for a multi-node Slave cluster it can also provide load balancing. The following is the basic architecture diagram of MySQL Proxy:
From the architecture diagram above, we can clearly see where MySQL Proxy sits in a real deployment and the basic things it can do for us.
More specific usage of MySQL Proxy is introduced and demonstrated with very concrete examples in the official MySQL documentation; interested readers can download it for free from the official MySQL site or read it online, so I will not repeat it here.
★ Using Amoeba to implement data splitting and integration
Amoeba is a Java-based open-source framework, released under the GPLv3 license, that focuses on proxy programs for integrating the data sources of a distributed database. Amoeba currently provides Query routing, Query filtering, read/write splitting, load balancing, HA mechanisms, and related features.
Amoeba mainly solves the following problems:
1. Integration of complex data sources after data splitting
2. Providing the data-splitting rules and reducing the impact of those rules on the database
3. Reducing the number of connections between the database and the clients
4. Read/write splitting and routing
As we can see, what Amoeba does is exactly what we need in order to improve database scalability through data sharding.
Amoeba is not itself a proxy-layer Proxy program but a development framework for building database proxy-layer Proxy programs. Two Proxy programs are currently built on Amoeba: Amoeba for MySQL and Amoeba for Aladin.
Amoeba for MySQL targets MySQL databases specifically: both the protocol used by the front-end application and the back-end data-source databases must be MySQL. To any client application, Amoeba for MySQL is indistinguishable from a MySQL database; any client request using the MySQL protocol can be parsed by Amoeba for MySQL and handled accordingly. The figure below shows the architecture of Amoeba for MySQL (from the Amoeba developer's blog):
Amoeba for Aladin is more widely applicable and more powerful.
It can connect to data sources of different database types at the same time to serve the front-end application, but it still only accepts requests from client applications that use the MySQL protocol. In other words, once the front-end application connects through the MySQL protocol, Amoeba for Aladin analyzes the Query statement and, based on the data requested in it, determines on which physical host the data source is located and what type of database it is. The figure below shows the architecture of Amoeba for Aladin (from the Amoeba developer's blog):
At first glance the two look exactly the same. On closer inspection, the main difference is only what happens after the MySQL Protocol Adapter: based on the analysis result, the data-source database is determined, and then a specific JDBC driver and the corresponding protocol are selected to connect to the back-end database.
In fact, the two architecture diagrams above already reveal Amoeba's nature: it is essentially a development framework. Besides the two products it already provides, For MySQL and For Aladin, we can also carry out secondary development on top of it according to our own needs and obtain a Proxy program better suited to our own application's characteristics.
When only MySQL databases are involved, both Amoeba for MySQL and Amoeba for Aladin work well. But considering that the more complex a system is, the more performance it loses and the higher its maintenance cost, I recommend using Amoeba for MySQL when you only need MySQL databases.
Amoeba for MySQL is very simple to use. All of its configuration files are standard XML files; there are four configuration files in total:
◆ amoeba.xml: the main configuration file, configuring all data sources and Amoeba's own parameter settings.
◆ rule.xml: configures all Query routing rules.
◆ functionMap.xml: configures the Java implementation classes used to parse functions in Queries.
◆ ruleFunctionMap.xml: configures the implementation classes of the specific functions used in routing rules.
If your rules are not too complex, you basically only need the first two of these four configuration files to get all the work done. Commonly used proxy features such as read/write splitting and load balancing are configured in amoeba.xml. In addition, Amoeba already supports routing for both vertical and horizontal sharding of data; the routing rules can be set in rule.xml.
What Amoeba currently lacks most is an online management capability and support for transactions. I have previously raised suggestions with the developers, hoping for a command-line management tool that supports online maintenance; the feedback was that a dedicated management module is already on the development schedule. As for transaction support, Amoeba cannot provide it for the time being: even if the client application includes transaction information in the request it submits to Amoeba, Amoeba will ignore the transaction-related information. Of course, as Amoeba matures, I am sure transaction support will be a feature its developers focus on adding.
More specific usage of Amoeba is available to readers through the Amoeba developer's blog (http://amoeba.sf.net), so I will not go into the details here.
★ Using HiveDB to implement data splitting and integration
Like MySQL Proxy and Amoeba above, HiveDB is a Java-based open-source framework for sharding and integrating the data of MySQL databases, but at present HiveDB only supports horizontal sharding.
It mainly addresses database scalability and high-performance querying under large data volumes, and it also supports data redundancy and basic HA mechanisms.
HiveDB's implementation differs from that of MySQL Proxy and Amoeba: it does not use MySQL's Replication feature for data redundancy but implements its own redundancy mechanism, and its lower layer is mainly based on Hibernate Shards for data sharding.
In HiveDB, data is distributed across multiple MySQL Servers through the partition keys defined by the user (which in effect define the data-splitting rules). When executing a Query request, HiveDB analyzes the filter conditions, reads data from multiple MySQL Servers in parallel, merges the result sets, and returns them to the client application.
Purely in terms of features, HiveDB may not be as powerful as MySQL Proxy or Amoeba, but its approach to data sharding is not essentially different from theirs. In addition, HiveDB is not merely content shared by open-source enthusiasts; it is an open-source project backed by a commercial company.
The picture below, taken from the official HiveDB site, describes the basics of how HiveDB organizes data. Although it does not show much architectural detail, it conveys HiveDB's distinctive approach to data sharding.
★ MyCAT data integration: see http://www.songwie.com/articlelist/11 for details
★ Other solutions for data splitting and integration
In addition to the overall solutions described above, there are many other solutions for data splitting and integration, such as HSCALE, which is built on MySQL Proxy; SpockProxy, built for Rails; Pyshards, based on Python; and so on.
Whichever solution you choose, the overall design idea should not change: enhance the overall service capability of the database through vertical and horizontal sharding so that the overall scalability of the application system is improved as much as possible and scaling out is as convenient as possible.
As long as we solve the problems of data splitting and data-source integration through a middle-tier Proxy application, the database can scale out almost as conveniently as our application: simply by adding cheap PC servers we can linearly increase the overall service capacity of the database cluster, so that the database no longer easily becomes the performance bottleneck of the application system.
Possible problems in data segmentation and integration
At this point we should have a fair understanding of how data splitting and integration are implemented. Many readers may already have selected the scheme that suits their own application scenario based on the characteristics of the various solutions, and the remaining work is mainly preparation for implementation.
Before implementing a data-splitting scheme, we still need to analyze some problems that may arise.
Generally speaking, the main problems we may encounter are the following:
◆ The introduction of distributed transactions
◆ Cross-node Joins
◆ Cross-node merging, sorting, and pagination
1. The problem of introducing distributed transactions
Once the data is split and stored on multiple MySQL Servers, no matter how perfect our sharding rules are (in fact there are no perfect sharding rules), the data involved in some formerly single-database transactions may no longer live on the same MySQL Server.
In such a scenario, if the application sticks to the old approach, it inevitably has to introduce distributed transactions to solve the problem. In MySQL, only versions from 5.0 onward provide support for distributed transactions, and currently only InnoDB provides it. Worse still, even if we happen to use a MySQL version that supports distributed transactions and also use the InnoDB storage engine, distributed transactions themselves consume a lot of system resources, their performance is not very high, and introducing them brings more uncontrollable factors into exception handling.
What to do? In fact, we can solve this problem in a flexible way. The first thing to consider is: is the database the only place where transactions can be handled? Not at all; we can completely combine the database and the application to solve it. Each database handles its own transaction, and the transactions across the multiple databases are controlled by the application.
In other words, as long as we want to, we can split one distributed transaction spanning multiple databases into multiple small transactions, each on top of a single database, and control each small transaction through the application program.
Of course, the requirement is that our application must be sufficiently robust, and it does bring some technical difficulty to the application side; a sketch of the idea follows below.
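As a minimal sketch of this idea (assuming a hypothetical points transfer between two users on different shards, with placeholder URLs and credentials), the application opens one local transaction per database and commits or rolls back each of them itself. Note this is a best-effort pattern, not a truly atomic distributed transaction: if the second commit fails after the first succeeds, the application has to compensate.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: split one logical "distributed" transaction into two local transactions,
// each committed on its own database and coordinated by the application.
public class AppCoordinatedTransaction {

    // Hypothetical shard lookup; URLs and credentials are placeholders.
    private static String shardFor(long userId) {
        return "jdbc:mysql://db-user-" + (userId % 2) + ":3306/user_db";
    }

    public static void moveUserPoints(long fromUserId, long toUserId, int points) throws SQLException {
        try (Connection from = DriverManager.getConnection(shardFor(fromUserId), "app", "secret");
             Connection to = DriverManager.getConnection(shardFor(toUserId), "app", "secret")) {
            from.setAutoCommit(false);
            to.setAutoCommit(false);
            try {
                update(from, "UPDATE user_profile SET points = points - ? WHERE user_id = ?", points, fromUserId);
                update(to, "UPDATE user_profile SET points = points + ? WHERE user_id = ?", points, toUserId);
                from.commit();
                to.commit(); // if this commit fails, the application must run a compensating action
            } catch (SQLException e) {
                from.rollback();
                to.rollback();
                throw e;
            }
        }
    }

    private static void update(Connection conn, String sql, int points, long userId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, points);
            ps.setLong(2, userId);
            ps.executeUpdate();
        }
    }
}
```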
2. The problem of cross-node Joins
The above described the distributed transactions that sharding may introduce; now let's look at the problem of queries that require cross-node Joins.
After the data is split, some old Join statements may no longer be usable, because the data sources used by the Join may have been split across multiple MySQL Servers.
What to do? From the MySQL database's point of view, if the problem has to be solved directly on the database side, the only option is MySQL's special storage engine, Federated. The Federated storage engine is MySQL's answer to problems similar to those solved by Oracle's DBLink.
The main difference between Federated and Oracle DBLink is that Federated keeps a copy of the remote table's definition locally. At first glance, Federated does look like a very good solution to cross-node Joins. However, we should also be clear that if the remote table structure changes, the local table definition does not change with it; if the local Federated table definition is not updated when the remote table structure is updated, Query execution is very likely to fail or return incorrect results.
For this kind of problem, I still recommend handling it in the application: first fetch the driving result set from the MySQL Server where the driving table is located, then, based on that result set, fetch the corresponding data from the MySQL Server where the driven table is located (a sketch follows below). Many readers may worry that this hurts performance; yes, it does have some negative impact, but there is basically no better solution.
Moreover, once the database has been scaled out well, the load of each MySQL Server can be controlled better. For a single Query the response time may be a little higher than before the split, so the negative impact on performance is not too great. Besides, cases that truly need cross-node Joins are not that many and may account for only a very small part of the overall work, so occasionally sacrificing a little for the sake of overall performance is actually worth it. After all, system optimization is itself a process of many trade-offs and balances.
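A minimal sketch of this driving-table-first approach is shown below, assuming the user tables live on one server and group_message on another; all host, table, and column names are illustrative assumptions.

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Sketch of an application-side "join": fetch the driving rows first,
// then query the driven table on the other server with the collected keys.
public class CrossNodeJoin {

    // Step 1: driving result set (user ids of a group's members) from the user shard.
    // Step 2: driven rows (their messages) from the group-message shard, via an IN list.
    public static List<String> recentMessagesOfGroup(long groupId) throws SQLException {
        List<Long> memberIds = new ArrayList<>();
        try (Connection userDb = DriverManager.getConnection("jdbc:mysql://db-user:3306/user_db", "app", "secret");
             PreparedStatement ps = userDb.prepareStatement(
                     "SELECT user_id FROM user_group WHERE group_id = ?")) {
            ps.setLong(1, groupId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    memberIds.add(rs.getLong(1));
                }
            }
        }
        if (memberIds.isEmpty()) {
            return List.of();
        }

        StringJoiner placeholders = new StringJoiner(",", "(", ")");
        memberIds.forEach(id -> placeholders.add("?"));
        String sql = "SELECT content FROM group_message WHERE author_id IN " + placeholders
                   + " ORDER BY created_at DESC LIMIT 20";

        List<String> messages = new ArrayList<>();
        try (Connection msgDb = DriverManager.getConnection("jdbc:mysql://db-group:3306/group_db", "app", "secret");
             PreparedStatement ps = msgDb.prepareStatement(sql)) {
            for (int i = 0; i < memberIds.size(); i++) {
                ps.setLong(i + 1, memberIds.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    messages.add(rs.getString(1));
                }
            }
        }
        return messages;
    }
}
```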
3. The problem of cross-node merging, sorting, and pagination
Once the data is split horizontally, it is not only cross-node Joins that stop working properly: the data source of some sorted, paginated Query statements may also be split across multiple nodes. The direct consequence is that these sorted, paginated Queries can no longer execute normally. This is really the same situation as a cross-node Join: the data source exists on multiple nodes, and trying to solve it with one Query is the same kind of operation as a cross-node Join. Federated can likewise partially solve it, and of course the risks are the same.
Same problem; what should we do? In the same way, I still suggest solving it through the application.
How? The solution is broadly similar to that for cross-node Joins, but not quite the same. In a Join there is usually a driving/driven relationship, so the reads from the multiple tables involved generally have to happen in a certain order. Sorted pagination is different: its data source is basically a single table (or a single result set), with no inherent ordering between the pieces, so the process of fetching data from the multiple data sources can be completely parallel.
In this way, the data for sorted pagination can be fetched more efficiently than a cross-database Join, so the performance loss is relatively small, and in some cases it may even be more efficient than in the original, unsplit database. A sketch of this parallel fetch-and-merge approach follows below.
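A minimal sketch of the parallel fetch-and-merge, assuming two photo shards and an illustrative sort on created_at: each shard is asked for its own first offset + pageSize rows already sorted, and the application merges the partial results and cuts out the requested page. All shard URLs, tables, and columns are assumptions.

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.*;

// Sketch of cross-node sorted pagination: fetch the first (offset + pageSize) rows
// from every shard in parallel, merge the sorted partial results, then cut the page.
public class CrossNodePaging {

    private static final List<String> SHARD_URLS = List.of(
            "jdbc:mysql://db-photo-0:3306/photo_db",
            "jdbc:mysql://db-photo-1:3306/photo_db");

    public static List<Long> pageOfPhotoIds(int offset, int pageSize)
            throws InterruptedException, ExecutionException {
        int perShardLimit = offset + pageSize; // each shard must return enough rows to cover the page
        ExecutorService pool = Executors.newFixedThreadPool(SHARD_URLS.size());
        try {
            List<Future<List<long[]>>> futures = new ArrayList<>();
            for (String url : SHARD_URLS) {
                futures.add(pool.submit(() -> fetchTop(url, perShardLimit)));
            }
            // Merge the already-sorted partial results, newest first, then apply the page window.
            List<long[]> merged = new ArrayList<>();
            for (Future<List<long[]>> f : futures) {
                merged.addAll(f.get());
            }
            merged.sort(Comparator.comparingLong((long[] row) -> row[1]).reversed());
            List<Long> page = new ArrayList<>();
            for (int i = offset; i < Math.min(offset + pageSize, merged.size()); i++) {
                page.add(merged.get(i)[0]);
            }
            return page;
        } finally {
            pool.shutdown();
        }
    }

    // Each shard sorts locally and returns only its own top rows: (photo id, created_at as epoch millis).
    private static List<long[]> fetchTop(String url, int limit) throws SQLException {
        String sql = "SELECT id, UNIX_TIMESTAMP(created_at) * 1000 FROM photo ORDER BY created_at DESC LIMIT ?";
        List<long[]> rows = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(url, "app", "secret");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, limit);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new long[]{rs.getLong(1), rs.getLong(2)});
                }
            }
        }
        return rows;
    }
}
```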
Of course, whether it is cross-node Joins or cross-node sorting and pagination, our application servers will consume a lot of additional resources, especially memory, because during query reads and result-set merging they have to handle much more data than before.
At this point in the analysis, many readers may have noticed that essentially all of the problems above are solved through the application. Some may be muttering to themselves: is it because the author is a DBA that so many things are left to application architects and developers?
In fact, that is not the case at all. First, because of the nature of applications, they can achieve very good scalability quite easily, but a database cannot: it has to scale out by other means, and in that scaling process it is very hard to avoid situations that a centralized database could handle but that become difficult once the data is split into a database cluster.
To maximize the overall scalability of the system, we can only have the application take on more work and solve the problems that the database cluster cannot solve well.
Through data sharding, one large MySQL Server is split into several smaller MySQL Servers. This both overcomes the write-performance bottleneck and once again improves the scalability of the whole database cluster. Whether vertical or horizontal, splitting makes the system less likely to hit a bottleneck; and when vertical and horizontal splitting are combined, there is in theory no scaling bottleneck at all.
That is all of the content on MySQL database sharding. Thank you for reading, and I hope it serves as a useful reference.