Under what circumstances should we consider sub-database and sub-table?

2025-01-15 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >


This article explains when to consider sub-database and sub-table (splitting databases and tables). The explanation is kept simple and clear so it is easy to learn and understand; follow along step by step to study when sub-database and sub-table are worth the cost.

Database bottleneck

Whether the bottleneck is IO or CPU, it eventually drives up the number of active database connections until it approaches or reaches the limit the database can sustain. From the business service's point of view, few or no database connections are available, and the consequences are easy to imagine: degraded concurrency, degraded throughput, or outright crashes.

IO bottleneck

First: disk read IO bottleneck. There is too much hot data to fit in the database cache, so each query generates a large amount of disk IO and slows queries down. -> sub-database and vertical sub-table

Second: network IO bottleneck. Requests return too much data and network bandwidth is insufficient. -> sub-database

CPU bottleneck

First: SQL problems. For example, SQL that includes join, group by, order by, or conditions on non-indexed fields increases the CPU work. -> SQL optimization: build appropriate indexes, and move business computation into the business service layer.

Second: the single table holds too much data, so queries scan too many rows, SQL efficiency drops, and CPU work increases. -> horizontal sub-table

Horizontal sub-library

1. Concept: based on a field and according to a certain strategy (hash, range, etc.), split the data of one database across multiple databases.

2. Results:

The structure of each library is the same

The data in each library is different and there is no intersection.

The data union of all libraries is full data.

3. Scenario: the system's absolute concurrency has risen to the point where sub-table alone cannot fundamentally solve the problem, and there is no obvious business boundary along which to split databases vertically.

4. Analysis: with more databases, IO and CPU pressure are relieved in proportion to the number of databases sharing the load.
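As an illustration of the hash and range strategies mentioned above, here is a minimal routing sketch. The `user_id` key, the database names, and the shard counts are assumptions for illustration, not from this article:

```python
# Illustrative sketch: route a row to one of N databases by its
# sharding key. All databases share the same schema; their data is
# disjoint, and the union of all databases is the full data set.

DB_COUNT = 4  # assumed number of physical databases

def route_db(user_id: int) -> str:
    """Hash strategy: pick the database by key modulo DB_COUNT."""
    return f"db_{user_id % DB_COUNT}"

def route_db_by_range(user_id: int, range_size: int = 1_000_000) -> str:
    """Range strategy: ids 0..999999 -> db_0, 1000000..1999999 -> db_1, ..."""
    return f"db_{user_id // range_size}"
```

The same key always routes to the same database, which is what lets reads find the rows that writes placed there.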

Horizontal subtable

1. Concept: based on a field and according to a certain strategy (hash, range, etc.), split the data of one table across multiple tables.

2. Results:

The structure of each table is the same

The data of each table is different, there is no intersection, and the union of all tables is full data.

3. Scenario: the system's absolute concurrency has not risen, but a single table holds too much data, which hurts SQL efficiency and burdens the CPU until it becomes the bottleneck. Horizontal sub-table can then be considered.

4. Analysis: with less data per table, a single SQL statement executes more efficiently, which naturally reduces the CPU burden.
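Horizontal sub-table follows the same routing idea, but the key selects a table suffix within one database. A hedged sketch, with the `orders` table name and shard count as assumptions:

```python
# Illustrative sketch: map a row to one of several physical tables
# (orders_0 .. orders_15) that share the same structure.

TABLE_SHARDS = 16  # assumed number of physical tables

def route_table(order_id: int, base: str = "orders") -> str:
    """Pick the physical table for a given sharding key."""
    return f"{base}_{order_id % TABLE_SHARDS}"

def build_query(order_id: int) -> str:
    """Rewrite a logical query against the routed physical table."""
    return f"SELECT * FROM {route_table(order_id)} WHERE order_id = %s"
```

Sharding middleware such as Sharding-jdbc or MyCAT performs this kind of SQL rewriting transparently, so application code keeps querying the logical table name.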

Vertical sub-library

1. Concept: based on tables, assign different tables to different databases according to their business attribution.

2. Results:

The structure of each library is different.

The data in each library is also different, and there is no intersection.

The union of all libraries is full data.

3. Scenario: the system's absolute concurrency has risen, and distinct business modules can be abstracted out.

4. Analysis: at this point the system can basically move toward services. For example, as the business grows there are more and more common configuration tables, dictionary tables, and so on; these can be split into a separate database, or even turned into a service. Likewise, once a set of business models has matured, its related tables can be split into their own database, or even made into a service.

Vertical subtable

1. Concept: based on fields, split a table's columns into different tables (a main table and an extended table) according to how "hot" each field is.

2. Results:

The structure of each table is different.

The data of each table is also different. Generally, the tables share at least one column, usually the primary key, which is used to associate the rows.

The union of all tables is full data.

3. Scenario: the system's absolute concurrency has not risen and the table does not hold many records, but it has many fields, with hot and cold data mixed together. A single row takes up a lot of storage, so the database caches fewer rows; queries then fall back to disk and generate a large amount of random read IO, creating an IO bottleneck.

4. Analysis: list pages versus detail pages are a helpful mental model. The splitting principle of vertical sub-table is to put hot data (fields likely to be queried frequently) together in the main table and cold data in the extended table. More hot data can then be cached, reducing random read IO. After the split, getting all the data requires associating the two tables.

But remember not to use join: join not only burdens the CPU but also couples the two tables together (they must live on the same database instance). Do the association in the service layer instead: fetch the main-table and extended-table rows separately, then join them in memory on the association field.

Sub-library and table tool

Sharding-jdbc (Dangdang)

TSharding (Mogujie)

Atlas (Qihoo 360)

Cobar (Alibaba)

MyCAT (based on Cobar)

Oceanus (58.com)

Vitess (Google)

Examine the advantages and disadvantages of each tool for yourself.

Problems caused by sub-database and sub-table

Sub-database and sub-table can effectively relieve the performance bottlenecks and pressure of a single machine and a single table, and break through the limits of network IO, hardware resources, and connection counts, but they also bring problems of their own. These problems and their solutions are described below.

Transaction consistency problem

Distributed transaction

When an update touches data in different databases at the same time, cross-database transactions inevitably arise. A cross-shard transaction is a distributed transaction, and there is no simple solution; generally the "XA protocol" with "two-phase commit" is used.

Distributed transactions preserve the atomicity of database operations as far as possible. However, committing a transaction requires coordinating multiple nodes, which postpones the commit point, prolongs the transaction's execution time, and increases the probability of conflict or deadlock when transactions access shared resources. As the number of database nodes grows, this trend worsens and becomes a constraint on horizontal scaling at the database level.

Final consistency

For systems with high performance requirements but low consistency requirements, real-time consistency is often not demanded; as long as eventual consistency is reached within the allowed window, transaction compensation can be used. Unlike immediately rolling back a failed transaction, transaction compensation is an after-the-fact remedy. Common implementations include data reconciliation, log-based comparison, and periodic synchronization against a standard data source.

Cross-node association query join problem

Before splitting, many list and detail queries in the system could be completed with a join; after splitting, the data may be distributed across different nodes, and join becomes troublesome. For performance, try to avoid join queries. Some solutions:

Global table

Global tables, also called "data dictionary tables", are tables that every module in the system may depend on. To avoid cross-database join queries, a copy of these tables can be kept in every database. They are rarely modified, so consistency is seldom a concern.

Field redundancy

A typical anti-normalization design that trades space for time and avoids join queries for performance. For example, the order table stores a redundant copy of userName alongside userId, so querying order details does not require a lookup in the buyer's user table. This approach only suits limited scenarios, with few dependent fields, and it is hard to keep the redundant fields consistent.

Data assembly

At the service layer, split the query in two: the first query's result set yields the associated ids, then a second request fetches the associated data by those ids, and finally the fields are assembled. This is a fairly common approach.
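The two-step pattern described above can be sketched as follows. The order and user tables, their field names, and the in-memory dictionaries standing in for real shard queries are all hypothetical:

```python
# Illustrative sketch of "data assembly": fetch the orders first,
# collect the associated user ids, fetch those users in a second
# query, and stitch the fields together in the service layer
# (no cross-shard JOIN).

orders = [                                   # stand-in for query 1
    {"order_id": 1, "user_id": 10, "amount": 99},
    {"order_id": 2, "user_id": 11, "amount": 45},
]
users = {                                    # stand-in for the user shard
    10: {"user_id": 10, "user_name": "alice"},
    11: {"user_id": 11, "user_name": "bob"},
}

def assemble_orders():
    user_ids = {o["user_id"] for o in orders}        # collect associated ids
    found = {uid: users[uid] for uid in user_ids}    # second query by ids
    return [{**o, "user_name": found[o["user_id"]]["user_name"]}
            for o in orders]                         # field assembly
```

In a real system the second step would be a single `WHERE user_id IN (...)` query against the user shard rather than a dictionary lookup.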

ER fragmentation

In a relational database, if the relationship between tables (such as an order table and an order-detail table) is known in advance, and the associated records are stored on the same shard, cross-shard join can be avoided entirely: the join runs within one shard. For 1:1 or 1:n relationships, rows are usually sharded by the primary table's ID.
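A minimal sketch of ER fragmentation, assuming a hash shard strategy and a hypothetical order/order-detail pair. The key point is that child rows are routed by the parent's ID, not by their own:

```python
# Illustrative sketch: co-locate parent and child rows on one shard
# by routing BOTH on the parent key (order_id), so a local join works.

SHARDS = 8  # assumed shard count

def shard_of(order_id: int) -> int:
    return order_id % SHARDS

def place_order(order_id: int):
    """Parent row: routed by its own id."""
    return ("orders", shard_of(order_id))

def place_detail(order_id: int, item_id: int):
    """Child row: routed by the PARENT order_id, never by item_id."""
    return ("order_detail", shard_of(order_id))
```

Because both routing functions use the same key, an order and all of its detail rows are guaranteed to land on the same shard.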

Cross-node paging, sorting, function problems

Querying across nodes and multiple databases raises problems such as limit paging and order by sorting. Paging must sort by a specified field: when the sort field is the shard field, the sharding rules make it easy to locate the target shard; when it is not, things become much more complex.

The data must first be sorted within each shard node and returned; the result sets from the different shards are then merged, sorted again, and finally returned to the user.

Taking only the first page of data this way has little impact on performance. But if the page number is large, the situation becomes much more complicated: because the data on each shard node may be arbitrarily distributed, accurate sorting requires fetching the first N pages of data from every node, merging them, and sorting the whole set, which consumes CPU and memory. The larger the page number, the worse the system performs.

When using functions such as Max, Min, Sum, and Count, the corresponding function must likewise be executed on each shard first; the per-shard result sets are then merged and computed once more.

The global primary key uniqueness problem

With sub-database and sub-table, a table's data lives in several databases at once, so ordinary primary key auto-increment is useless: an ID generated by one partition database cannot be guaranteed globally unique. A global primary key must therefore be designed separately to avoid cross-database primary key collisions. Here are some strategies:

UUID

The standard form of a UUID is 32 hexadecimal digits divided into 5 segments, 36 characters in the pattern 8-4-4-4-12.

UUID is the simplest solution: it is generated locally, performs well, and costs no network round trip. But its disadvantages are obvious: it takes up a lot of storage space, and it performs poorly as an indexed primary key and in index-based queries. Under the InnoDB engine in particular, the disorder of UUIDs causes frequent changes in index position, which seriously hurts performance.

Maintain the primary key ID table with the database

Create the sequence table in the database:

CREATE TABLE `sequence` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `stub` char(1) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM;

The stub field carries a unique index, so for a given stub value there is only one record in the sequence table, and it can generate global IDs for multiple tables at the same time. Using the MyISAM engine instead of InnoDB gives higher performance: MyISAM uses table locks, so reads and writes against the table are serialized, and there is no risk of reading the same ID twice under concurrency. When a globally unique ID is needed, execute:

REPLACE INTO sequence (stub) VALUES ('a'); SELECT LAST_INSERT_ID();

This scheme is relatively simple, but its drawback is obvious: it is a single point with a strong dependence on the DB, and when the DB fails the whole system is unavailable. Configuring a master and a slave can increase availability. In addition, the performance ceiling is the read/write performance of a single MySQL instance.

There is another primary key generation strategy, similar to the sequence-table scheme, that better addresses the single-point and performance-bottleneck problems. The overall idea is to set up two or more servers dedicated to global ID generation, each hosting a single database, with each database holding a sequence table that records its current global ID.

The increment step in each table equals the number of databases, and the starting values are staggered, so ID generation is spread across the databases.

In this scheme the ID-generation load is spread evenly across the machines, and fault tolerance is gained: if the first fails, the system can automatically switch to the second to obtain IDs. But there are drawbacks: adding machines makes horizontal expansion complex; every ID fetch still reads the DB, which stays under heavy pressure; and performance can only be improved by adding more machines.
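The staggered-start, fixed-step scheme can be sketched as follows, with two hypothetical ID servers (step 2), so each owns a disjoint arithmetic sequence:

```python
# Illustrative sketch: server 1 issues 1, 3, 5, ... and server 2
# issues 2, 4, 6, ... (step = number of servers, staggered starts),
# so the two servers can never hand out the same ID.

class IdServer:
    def __init__(self, start: int, step: int):
        self.current = start - step  # so the first next_id() returns start
        self.step = step

    def next_id(self) -> int:
        self.current += self.step
        return self.current

server1 = IdServer(start=1, step=2)
server2 = IdServer(start=2, step=2)
```

Adding a third server later requires re-planning the step and start values, which is exactly the awkward horizontal expansion noted above.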

Snowflake distributed self-increasing ID algorithm

Twitter's snowflake algorithm meets the need of distributed systems to generate a global ID. It produces a 64-bit long number composed of:

The first bit is unused.

The next 41 bits are a millisecond timestamp; 41 bits can represent about 69 years.

The next 10 bits are a 5-bit datacenterId and a 5-bit workerId, supporting deployment of up to 1024 nodes.

The last 12 bits are a per-millisecond counter; the 12-bit sequence number supports 4096 IDs per node per millisecond.
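A minimal snowflake-style generator following the bit layout above. The epoch constant and class shape are assumptions, and production implementations also handle clock rollback:

```python
import time

# Sketch of the layout: 1 unused sign bit | 41-bit ms timestamp |
# 5-bit datacenterId | 5-bit workerId | 12-bit per-ms sequence.

EPOCH = 1288834974657  # assumed fixed epoch in ms; any fixed value works

class Snowflake:
    def __init__(self, datacenter_id: int, worker_id: int):
        assert 0 <= datacenter_id < 32 and 0 <= worker_id < 32
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ts = -1

    def next_id(self) -> int:
        ts = int(time.time() * 1000)
        if ts == self.last_ts:
            self.sequence = (self.sequence + 1) & 0xFFF  # 4096 per ms
            if self.sequence == 0:           # sequence exhausted:
                while ts <= self.last_ts:    # spin until the next ms
                    ts = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ts = ts
        return ((ts - EPOCH) << 22) | (self.datacenter_id << 17) \
               | (self.worker_id << 12) | self.sequence
```

IDs generated this way are unique per node and roughly time-ordered, which also makes them friendlier to InnoDB indexes than UUIDs.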

Problems of data migration and expansion

When the business grows quickly and hits performance and storage bottlenecks, a sharding design will be considered, and at that point migrating historical data is unavoidable. The usual practice is to read out the historical data and then write it to the shard nodes according to the chosen sharding rules. In addition, capacity planning should be done from the current data volume and the pace of business growth to estimate the number of shards needed (it is generally recommended that a single table on a single shard stay under 10 million rows).

When to consider sub-database and sub-table

If you can avoid splitting, don't split.

Not every table needs to be split; it mainly depends on how fast its data grows. Splitting raises the complexity of the business to some degree, so do not reach lightly for the "big move" of sub-database and sub-table; keep it as a last resort and avoid "over-design" and "premature optimization". Before splitting, optimize as much as you can: upgrade hardware, upgrade the network, separate reads and writes, tune indexes, and so on. Only when the data volume reaches the single-table bottleneck should sub-database and sub-table be considered.

The amount of data is too large, and normal operation and maintenance affect business access.

Operation and maintenance here means, for example: database backup, where an oversized single table requires a large amount of disk IO and network IO; or DDL, where MySQL locks the whole of a large table for a long time, during which the business cannot access the table, with great impact.

When large tables are accessed and updated frequently, lock waits are more likely to occur.

As the business develops, some fields need to be split vertically

No examples here. In real business you may find that fields which are accessed or updated infrequently should be separated from the large table.

The amount of data is growing rapidly

As the business develops rapidly, the data volume of a single table keeps growing. When performance approaches the bottleneck, consider horizontal splitting and do sub-database and sub-table.

Thank you for reading. That covers "when to consider sub-database and sub-table"; after working through this article you should have a deeper understanding of the topic, though the specifics of applying it still need to be verified in practice.
