
A Summary of Data Table Splitting


Tags: sharding, database splitting, table splitting

Contents: background; side effects brought by splitting databases and tables; table-splitting strategy; some points for attention

Background

We recently completed the project to split our databases and tables, so here is a brief summary.

The split mainly covered orders and coupons, both of which span all business lines of every subsidiary in the group. With the company's rapid business growth, storage requirements as well as write and read loads had basically reached warning levels.

Orders are the core of transactions, and coupons are the core of marketing; these two areas are essentially the true core of the whole platform. To support rapid growth over the next three to five years, we needed to split the data.

There are many mature solutions for splitting databases and tables; it is no longer an advanced technique but largely a pure engineering exercise. Still, the chance to actually carry it out is rare, so a summary is well worth writing.

Since splitting databases and tables involves many technical choices and methods, this article does not try to enumerate and compare them all; instead it summarizes some of our experience from carrying out the split.

Based on our business scenarios, we mainly performed a horizontal split plus a logical DB split. With a future database write bottleneck in mind, a group of sharded tables can later be migrated directly into its own database.

Side effects brought by splitting databases and tables

Splitting databases and tables brings many side effects and makes the overall system architecture more complex. Whether the split turns out well depends largely on how the sharding key is chosen. If the sharding key happens to coincide with the dividing line of the business dimension, it directly improves performance and keeps complexity down; otherwise all kinds of scaffolding is needed to prop things up, and the system becomes complicated.

For example, user ID, order type, merchant ID, and channel ID in the order system, or batch ID, channel ID, and organization ID in the coupon system, are all potential sharding keys.

If such a sharding key exists, later routing is straightforward; otherwise you will need some large, all-encompassing index tables to handle OLAP-style queries.

Once we shard, the first problem we face is sorting and paging at query time.

Merge sort

Sorting and paging used to be easy within a single database table, but after sharding there are multiple data sources; below we refer to these data sources collectively as shards.

To sort and page across multiple shards, we need to sort the data from each shard together, which calls for a merge sort. The data within each shard can be kept ordered (each shard's output is ordered), but the data as a whole is unordered.

Let's look at a simple example:

Shard node 1: {1, 3, 5, 7, 9}
Shard node 2: {2, 4, 6, 8, 10}

These two shards split the data by parity. Assume the page size is 4; the parameters for page 1 are as follows:

PageParameter: pageSize: 4, currentPage: 1

In the most optimistic case, we only need to read the first two items from each of the two shard nodes:

Shard node 1: {1, 3}
Shard node 2: {2, 4}

After sorting, these happen to be {1, 2, 3, 4}, but such a scenario is basically impossible in practice. Assume instead the following shard node data:

Shard node 1: {7, 9, 11, 13, 15}
Shard node 2: {2, 4, 6, 8, 10, 12, 14}

Reading only the first two items from each node would clearly be wrong, because in the most pessimistic, and realistic, case all of the data on a page may come from a single shard after sorting. So we need to read pageSize items from each node to guarantee correctness.

This example also assumes that the data matching our query conditions happens to interleave evenly; in reality the data is whatever set the various query conditions filter out, and it will not be arranged that way. The realistic case is the latter layout.

Following the same logic, what happens when currentPage is 1000? We would need to read pageSize × currentPage = 4000 records from every shard node and sort them, because in the most pessimistic case all of the data could come from a single shard node.

If users can page indefinitely, the machine doing the sorting and paging will eventually exhaust its memory, and even if it does not, it will hit a performance bottleneck.

This simple example illustrates the practical problems that sorting and paging cause after sharding; it also explains why distributed systems impose a maximum paging limit when sorting and paging across multiple nodes.
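To make the reasoning above concrete, here is a minimal Java sketch (the class and method names are ours, not from the project): each shard is assumed to return its rows already sorted, every shard contributes pageSize × currentPage rows, and the requested page is cut out of the merged result.

import java.util.*;
import java.util.stream.*;

// Minimal sketch of merge-sort paging across shards (illustrative only).
// Each shard is assumed to return its own rows already sorted ascending.
public class ShardPager {

    // To build page `currentPage` correctly, every shard must contribute
    // up to pageSize * currentPage rows, because in the worst case the
    // whole page comes from a single shard.
    public static List<Long> page(List<List<Long>> shards, int pageSize, int currentPage) {
        int needed = pageSize * currentPage;
        List<Long> merged = shards.stream()
                .flatMap(shard -> shard.stream().limit(needed)) // pull `needed` rows per shard
                .sorted()                                       // merge step (simplified as a full sort)
                .collect(Collectors.toList());
        int from = (currentPage - 1) * pageSize;
        int to = Math.min(from + pageSize, merged.size());
        return from >= merged.size() ? List.of() : merged.subList(from, to);
    }

    public static void main(String[] args) {
        List<Long> node1 = List.of(7L, 9L, 11L, 13L, 15L);
        List<Long> node2 = List.of(2L, 4L, 6L, 8L, 10L, 12L, 14L);
        System.out.println(page(List.of(node1, node2), 4, 1)); // [2, 4, 6, 7]
    }
}

The `needed` variable is exactly why deep pages get expensive: it grows linearly with currentPage on every single shard.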

Deep paging performance issues: re-paging by changing query conditions

A large data set can be split in a variety of ways: into different data sources by organization, time, channel, and so on. The usual deep-paging problem can generally be solved by changing the query conditions, but this approach does not cover every business scenario.

For example, take our order list: the data volume a C-end user sees when querying their own orders is not large, but the operations back office may face all order data for the whole platform, which is very large.

There are two ways to change the query conditions. The first is explicit: narrow the query scope as much as possible, which is generally the first choice, e.g. by time range, payment status, delivery status, and so on. Stacking several conditions can filter the data down to a small set.

The second is an implicit condition. For example, an order list is usually sorted by order creation time, so when paging we can turn that time into a constraint.

Sharding node 1:
orderID    createDateTime
100000     2018-01-10 10:10:10
200000     2018-01-10 10:10:11
300000     2018-01-10 10:10:12
400000     2018-01-10 10:10:13
500000     2018-01-20 10:10:10
600000     2018-01-20 10:10:11
700000     2018-01-20 10:10:12

Sharding node 2:
orderID    createDateTime
110000     2018-01-11 10:10:10
220000     2018-01-11 10:10:11
320000     2018-01-11 10:10:12
420000     2018-01-11 10:10:13
520000     2018-01-21 10:10:10
620000     2018-01-21 10:10:11
720000     2018-01-21 10:10:12

Assume the above is an order list; do not worry about the ordering of orderID. After sharding, all orderIDs are issued centrally by a number dispatcher and fetched concurrently by multiple consumers across multiple clusters, and since orders are created at different speeds, sequential ordering of IDs no longer exists.

The order numbers of the two sharding nodes above are basically interleaved, so if you sort by time, node 1 and node 2 have to supply data alternately.

For example, our query conditions and paging parameters:

where createDateTime > '2018-01-11 00:00:00'
pageParameter: pageSize: 5, currentPage: 1

The result set obtained is:

orderID    createDateTime
100000     2018-01-10 10:10:10
200000     2018-01-10 10:10:11
300000     2018-01-10 10:10:12
400000     2018-01-10 10:10:13
110000     2018-01-11 10:10:10

The first four records are from node 1 and the last one is from node 2. The whole sorted set is:

Sharding node 1:
orderID    createDateTime
100000     2018-01-10 10:10:10
200000     2018-01-10 10:10:11
300000     2018-01-10 10:10:12
400000     2018-01-10 10:10:13
500000     2018-01-20 10:10:10

Sharding node 2:
orderID    createDateTime
110000     2018-01-11 10:10:10
220000     2018-01-11 10:10:11
320000     2018-01-11 10:10:12
420000     2018-01-11 10:10:13
520000     2018-01-21 10:10:10

If you keep paging like this, every page turn requires fetching 5 more records from both node 1 and node 2. Instead, we can turn each page turn into a fresh query by modifying the query condition:

where createDateTime > '2018-01-11 10:10:13'

This works because we can be sure that all data before '2018-01-11 10:10:13' has already been queried. The reason the time does not start from '2018-01-21 10:10:10' is that we have to account for concurrency: multiple orders can arrive within the same second.
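As a minimal illustration of this implicit-condition approach (the table name t_order, the column names, and the helper class are our assumptions, not the project's code), each new "page" is simply a fresh query bounded by the last createDateTime already returned, instead of an ever-growing offset:

import java.sql.*;
import java.util.*;

// Illustrative keyset ("seek") pagination: the next page starts after the last
// createDateTime the client has already seen, so no shard needs a deep offset.
public class OrderKeysetPager {

    public static List<Long> nextPage(Connection conn, Timestamp lastSeen, int pageSize) throws SQLException {
        String sql = "SELECT order_id, create_date_time FROM t_order "
                   + "WHERE create_date_time > ? "
                   + "ORDER BY create_date_time ASC LIMIT ?";
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, lastSeen);   // boundary taken from the previous page
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong("order_id"));
                }
            }
        }
        return ids;
    }
}

Because the time filter bounds every shard's scan, each shard only needs to return pageSize rows regardless of how deep the user has paged.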

This implicit-condition method is the simplest to implement and needs no external computation. Its one problem: to recompute a page without losing data, you need to retain the original boundary so you know where to restart from; otherwise some rows only show up on the next page. In real deep-paging scenarios this can usually be ignored, because few people page one by one all the way to page 500; they jump straight to the last few pages, and then there is no problem.

If you must control this deviation precisely, you need to remember the interval, or use other means such as a full lookup table, a sharding index table, or the maximum order TPS value to assist the calculation.

Data synchronization middleware can be used to build single-table multi-level indexes and multi-table multi-dimensional indexes to assist these calculations. The data synchronization tools we use are DataX, yugong, otter, and canal, which cover both full and incremental synchronization.

Table-splitting strategy

There are many table-splitting approaches, such as mod, range, presharding, and custom routing, each with its own emphasis.

We mainly use mod + presharding. The biggest problem this brings is data migration when nodes change later, which can be addressed by borrowing the virtual-node idea from consistent hashing.

Splitting data tables differs somewhat from cache sharding. A cache can tolerate cache misses and rebuild its data through passive caching, but a database has no such thing as an acceptable "select miss".

In cache sharding, consistent hashing eliminates the pressure on adjacent shards when shard nodes are added or removed. But once a database migrates data, it is absolutely unacceptable for queries to come back empty. So, to allow smooth data migration later, we set up a virtual-node to real-node mapping.

Physical nodes: node1, node2, node3, node4
Virtual nodes: node1, node2, node3 ... node20
Node mapping:
virtual node1  ~ node5   {physical node1}
virtual node6  ~ node10  {physical node2}
virtual node11 ~ node15  {physical node3}
virtual node16 ~ node20  {physical node4}

To reduce the cost and latency of rehashing during future data migration, the hashed value is stored in the table itself, so it can be queried directly and imported quickly later.
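A minimal sketch of this virtual-node mapping, assuming 20 virtual nodes spread over 4 physical nodes (the hash function and all names are illustrative, not the project's actual routing code):

// Illustrative virtual-node routing: 20 virtual nodes mapped onto 4 physical nodes.
// Storing virtualNode alongside each row lets data be re-bucketed later without rehashing.
public class VirtualNodeRouter {

    private static final int VIRTUAL_NODES = 20;
    private static final int PHYSICAL_NODES = 4;

    // virtual node 0..4 -> physical 0, 5..9 -> physical 1, and so on
    public static int virtualNode(long shardingKey) {
        return (int) (Math.abs(shardingKey) % VIRTUAL_NODES);
    }

    public static int physicalNode(int virtualNode) {
        int perPhysical = VIRTUAL_NODES / PHYSICAL_NODES; // 5 virtual nodes per physical node
        return virtualNode / perPhysical;
    }

    public static void main(String[] args) {
        long orderId = 420000L;
        int v = virtualNode(orderId);        // value saved with the row
        int p = physicalNode(v);             // physical database/table actually hit
        System.out.printf("order %d -> virtual node %d -> physical node %d%n", orderId, v, p);
    }
}

Because the virtual-node value is stored with each row, adding a physical node later only means remapping whole virtual-node ranges and moving their rows, with no hashes recomputed.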

The power-of-two size problem of hash tables

In the HashMap we are all familiar with, to reduce collisions and keep lookups fast, the number of hash buckets is set to a power of two, and the slot is computed with a bitwise AND, hash & (length - 1). This works because a power of two in binary is a single 1 followed by zeros, so length - 1 is all ones in the low bits, and the AND can be computed very quickly.
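A quick illustration of that point in plain Java (not project code): when the table length is a power of two, the bitwise AND yields the same bucket as the modulo for non-negative hashes, only cheaper.

// For a power-of-two table size, (hash & (length - 1)) equals (hash % length) for non-negative hashes.
public class PowerOfTwoBuckets {
    public static void main(String[] args) {
        int length = 16;                       // 2^4, so length - 1 = 0b1111
        for (int hash : new int[]{1, 17, 35, 123456}) {
            int byAnd = hash & (length - 1);   // what HashMap effectively does
            int byMod = hash % length;
            System.out.printf("hash=%d  and=%d  mod=%d%n", hash, byAnd, byMod);
        }
    }
}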

We do not need to copy this trick when sharding a database. That principle is mainly for the internal use of hash tables; externally we simply take hash mod to determine the sharding node.

Taking a plain modulo can distribute data unevenly; on top of it we can add a custom parity route so that data is spread evenly across both sides.
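One plausible reading of such a parity route, sketched with purely illustrative names (the real routing logic is business specific): the key's parity chooses the side, and a modulo chooses the table within that side.

// Illustrative parity routing: even keys go to the "even" group of tables, odd keys
// to the "odd" group, and a modulo picks the table within the group.
public class ParityRouter {

    private static final int TABLES_PER_GROUP = 8;

    public static String route(long shardingKey) {
        String group = (shardingKey % 2 == 0) ? "even" : "odd";
        long index = (shardingKey / 2) % TABLES_PER_GROUP;  // spread keys inside the group
        return "t_order_" + group + "_" + index;            // hypothetical physical table name
    }

    public static void main(String[] args) {
        System.out.println(route(100000L)); // t_order_even_0
    }
}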

Some points for attention

1. Integrating Sharding-JDBC into an existing project raises a few small issues. Sharding-JDBC does not support batch insert, so if the project already uses many batch insert statements they need to be rewritten, or you can use an auxiliary hash to compute the physical table name yourself (a sketch follows).
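A hedged sketch of the "auxiliary hash" workaround (the table count, table names, and helper are hypothetical): group the rows of a batch by the physical table computed from each row's sharding key, then issue one ordinary batch insert per physical table, bypassing the logical-table rewrite.

import java.util.*;
import java.util.stream.*;

// Hypothetical workaround for batch inserts: compute the physical table name per row
// with the same hash the sharding rule uses, group the rows, and batch-insert per table.
public class BatchInsertHelper {

    private static final int TABLE_COUNT = 8;

    // Must match the sharding algorithm configured in Sharding-JDBC.
    public static String physicalTable(long orderId) {
        return "t_order_" + (orderId % TABLE_COUNT);
    }

    public static Map<String, List<Long>> groupByTable(List<Long> orderIds) {
        return orderIds.stream().collect(Collectors.groupingBy(BatchInsertHelper::physicalTable));
    }

    public static void main(String[] args) {
        Map<String, List<Long>> batches = groupByTable(List.of(100001L, 100002L, 100009L));
        batches.forEach((table, ids) ->
                System.out.println("INSERT INTO " + table + " ... for " + ids)); // one batch per table
    }
}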

2. The original project's data layer used Druid + MyBatis. After integrating Sharding-JDBC, Sharding-JDBC wraps Druid, so SQL statements that Sharding-JDBC does not support basically cannot get through.

3. When integrating Sharding-JDBC with Spring Boot, the IncrementIdGenerator needs to be set while the bean is being loaded, and there was a classloader problem:

IncrementIdGenerator incrementIdGenerator = this.getIncrementIdGenerator(dataSource);
ShardingRule shardingRule = shardingRuleConfiguration.build(dataSourceMap);
((IdGenerator) shardingRule.getDefaultKeyGenerator()).setIncrementIdGenerator(incrementIdGenerator);

private IncrementIdGenerator getIncrementIdGenerator(DataSource druidDataSource) {
    ...
}

It later turned out that Spring Boot's classloader was RestartClassLoader, which is why the cast kept failing. Removing the spring-boot-devtools package fixes it; RestartClassLoader exists to support hot restart.

4. dao.xml reverse engineering: many MyBatis generation tools produce mappers using physical table names, but once Sharding-JDBC is used we work only with logical table names, so the generation tool needs an option to set the logical table name.

5. The SqlSessionFactory provided to MyBatis needs to be the one wrapped by Sharding-JDBC on top of Druid.

6. Sharding-JDBC's DefaultKeyGenerator uses the snowflake algorithm by default, but we cannot use it directly; we need to assign the workerId segment from datacenterId + workerId via ZooKeeper.

(Snowflake's worker segment is 10 bits, decimal maximum 1023; it is split into a 5-bit datacenterId, decimal maximum 31, and a 5-bit workerId, decimal maximum 31.)
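For illustration only (the bit layout follows the parenthetical above; the class is ours, not from the project): a 5-bit datacenterId and a 5-bit workerId, for example assigned via ZooKeeper, can be packed into the 10-bit worker segment like this.

// Illustrative only: pack a 5-bit datacenterId (0..31) and a 5-bit workerId (0..31)
// into the 10-bit worker segment (0..1023) that snowflake-style generators expect.
public class SnowflakeWorkerId {

    public static long combine(long datacenterId, long workerId) {
        if (datacenterId < 0 || datacenterId > 31 || workerId < 0 || workerId > 31) {
            throw new IllegalArgumentException("datacenterId and workerId must be in [0, 31]");
        }
        return (datacenterId << 5) | workerId;   // 5 high bits + 5 low bits = 10 bits, max 1023
    }

    public static void main(String[] args) {
        System.out.println(combine(31, 31)); // 1023
    }
}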

7. Because we use MySQL's com.mysql.jdbc.ReplicationDriver, which natively implements read-write splitting, handling read-write separation is much more convenient. If you are not using it, you need to set a datasource hint manually to handle it.
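For reference, a minimal sketch in the style of the documented Connector/J replication driver usage (host names, schema, and credentials are placeholders): Connection.setReadOnly decides whether statements go to the master or are balanced across the slaves.

import java.sql.Connection;
import java.util.Properties;
import com.mysql.jdbc.ReplicationDriver;

// Minimal ReplicationDriver sketch (hosts and credentials are placeholders).
public class ReplicationDriverDemo {
    public static void main(String[] args) throws Exception {
        ReplicationDriver driver = new ReplicationDriver();

        Properties props = new Properties();
        props.put("autoReconnect", "true");
        props.put("roundRobinLoadBalance", "true");   // spread read-only load over the slaves
        props.put("user", "app");
        props.put("password", "secret");

        Connection conn =
            driver.connect("jdbc:mysql:replication://master-host,slave-host-1,slave-host-2/order_db", props);

        conn.setReadOnly(false);   // writes (and reads that must see them) hit the master
        // ... execute INSERT/UPDATE statements here ...

        conn.setReadOnly(true);    // reads are now balanced across the slave hosts
        // ... execute SELECT statements here ...

        conn.close();
    }
}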

8. When using MyBatis DAO mappers, multiple logical tables are needed, because some data-source tables do not need to go through sharding; a custom ShardingStrategy handles this branching logic.

9. Several approaches to generating global IDs.

9.1 If you use ZooKeeper for distributed IDs, be aware that an expired session can lead to duplicate workerIds; either lock, or accept a certain degree of parallelism (ordered sequence numbers are only guaranteed within a bounded period of time and space).

9.2 Use a centralized number-generator service, with a pre-generated table plus an increment plugin in the main DB (the classic number-segment allocator; the TRX_SYS_TRX_ID_STORE transaction number in the InnoDB storage engine works the same way). A hedged sketch follows item 9.3.

9.3 A fixed-length dispatcher plus a business-rule dispatcher. This requires a business context and has to be pre-configured; each request then carries an acquisition context describing which kind of number it wants.
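A hedged sketch of the pre-generated table + increment pattern mentioned in 9.2 (the id_segment table and its columns are assumptions, not the project's schema): each call reserves a block of IDs by bumping max_id inside a transaction in the main DB, and the application then hands IDs out from memory until the segment is exhausted.

import java.sql.*;

// Hedged sketch of a segment-based number generator (table and column names are assumed).
public class SegmentIdAllocator {

    // Returns the upper bound of a freshly reserved segment [upper - step + 1, upper].
    public static long reserveSegment(Connection conn, String bizTag, int step) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement update = conn.prepareStatement(
                 "UPDATE id_segment SET max_id = max_id + ? WHERE biz_tag = ?");
             PreparedStatement select = conn.prepareStatement(
                 "SELECT max_id FROM id_segment WHERE biz_tag = ?")) {
            update.setInt(1, step);
            update.setString(2, bizTag);
            update.executeUpdate();            // atomic bump inside the transaction
            select.setString(1, bizTag);
            try (ResultSet rs = select.executeQuery()) {
                rs.next();
                long upper = rs.getLong(1);
                conn.commit();
                return upper;
            }
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}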

10. Some parts of the project sorted by the auto-increment ID, and this has to be changed after the tables are split, because the size ordering of IDs no longer exists. Code that relied on ID ordering to fetch the latest data needs to be changed to sort by a time field.

Author: Wang Qingpei (Senior JAVA architect, Hujiang Group)
