2025-02-28 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/02 Report --
Guide: The data bus (DBus) focuses on real-time data collection and real-time distribution. It aggregates the data that IT systems generate in the course of business, converts and processes it into a unified JSON data format (UMS), and offers it to different data consumers for subscription and consumption, serving as the data source for the data warehouse platform, big data analysis platform, real-time reporting, real-time marketing, and other businesses.
From the perspective of data sharding, this article describes the sharding strategy and principles DBus uses during data extraction, along with the problems we ran into in the process and how we solved them.
I. Sharding strategy
For traditional relational databases, DBus meets users' data collection needs by providing both full data pulls and incremental data acquisition. The DBus data extraction process is shown in the following figure (taking MySQL as an example):
The main principle of full data extraction is to choose the sharding column based on the primary key, unique indexes, ordinary indexes, and similar information. These columns are preferred because they are well indexed in the database, which improves the efficiency of scanning the data.
Based on the selected sharding column, we break the data into shards, determine the upper and lower bounds of each shard, and then pull the data with a concurrency of about 6 to 8 according to those bounds. (The concurrency of 6 to 8 is an empirical value obtained from extensive testing: the results show that this level of concurrency does not put excessive pressure on the source database while still maximizing the efficiency of the full extraction.)
Schematic diagram of DBus sharding strategy:
Schematic diagram of DBus pull strategy:
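As a rough illustration of the concurrent pull described above, the sketch below submits one pull task per shard to a fixed pool of 6 workers. The class name, the `long[]` shard representation, and the placeholder task body are assumptions for illustration; DBus's actual pulling logic issues real SQL against the source database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardPuller {
    // Empirical value from the article: 6-8 concurrent pulls balance
    // source-database load against extraction throughput.
    static final int CONCURRENCY = 6;

    // Pull all shards concurrently. Each shard is a {lower, upper} bound pair
    // on the sharding column; the task body here is a placeholder for the real
    // "SELECT ... WHERE splitcol >= ? AND splitcol < ?" issued to the source.
    static List<String> pullAll(List<long[]> shards) {
        ExecutorService pool = Executors.newFixedThreadPool(CONCURRENCY);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (long[] s : shards) {
                final long lower = s[0], upper = s[1];
                futures.add(pool.submit(() -> "pulled [" + lower + "," + upper + ")"));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // preserve shard order in the result
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```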
So, what column types does DBus support as sharding columns, and what is the sharding strategy for each type?
As for the sharding strategy, DBus borrows from Sqoop's split design and supports the following column types as sharding columns:
BigDecimal/Numeric
Boolean
Date/Time/Timestamp
Float/Double
Integer/Smallint/Long
Char/Varchar/Text/NText
The sharding principle is roughly the same in every case: the upper and lower bounds of each shard are computed from the maximum and minimum values of the sharding column and the configured shard size. But the implementation details vary considerably. For the Text/NText types in particular, we found a number of problems while adopting and applying this design, and we made some adjustments and optimizations.
The main purpose of this article is to share the pitfalls we encountered and how we solved them.
II. Sharding principle

2.1 Numeric sharding columns
Let's first introduce the sharding principle using the simplest and clearest case, a numeric column, as an example.
As mentioned earlier, we pick the sharding column according to the priority primary key -> unique index -> ordinary index. If the table has a primary key, we use the primary key as the sharding column; if there is no primary key but there is a unique index, we use the unique index; and so on. If the key or index we find is a composite primary key or composite index, we take its first column as the sharding column. If no suitable column is found, all the data is pulled in one piece without sharding (and cannot benefit from the efficiency gains of concurrent pulling).
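The selection priority described above can be sketched as follows. The class, method, and parameter names are hypothetical; real code would read this metadata from the database catalog.

```java
import java.util.Arrays;
import java.util.List;

public class SplitColumnChooser {
    // Choose the sharding column by priority: primary key -> unique index ->
    // ordinary index. Each argument is the (possibly composite) column list of
    // that key/index, empty if the table has none; for composite keys the first
    // column is used. Returns null when no candidate exists, which means the
    // table must be pulled in one piece without sharding.
    static String chooseSplitColumn(List<String> primaryKey,
                                    List<String> uniqueIndex,
                                    List<String> ordinaryIndex) {
        for (List<String> candidate : Arrays.asList(primaryKey, uniqueIndex, ordinaryIndex)) {
            if (candidate != null && !candidate.isEmpty()) {
                return candidate.get(0);
            }
        }
        return null;
    }
}
```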
Once a column has been selected as the sharding column according to these rules, the upper and lower bounds of each shard are computed from the minimum and maximum values of that column and the configured shard size:
1) Get the MIN() and MAX() of the split field:
"SELECT MIN(" + qualifiedName + "), MAX(" + qualifiedName + ") FROM (" + query + ") AS " + alias
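Once MIN and MAX are known, the bounds of each shard follow from the configured shard size. The sketch below shows the numeric case with illustrative names, not DBus's actual code (the per-type variations are described next):

```java
import java.util.ArrayList;
import java.util.List;

public class NumericSplitter {
    // Generate shard bounds for a numeric sharding column. Each long[]{lower,
    // upper} maps to the condition
    //   splitcol >= lower AND splitcol < upper
    // except the last shard, which is closed on the right:
    //   splitcol >= lower AND splitcol <= upper
    // The min/max values come from the boundary query:
    //   SELECT MIN(splitcol), MAX(splitcol) FROM (<query>) AS t
    static List<long[]> split(long min, long max, long shardSize) {
        if (shardSize <= 0) shardSize = 1; // guard against a bad configuration
        List<long[]> ranges = new ArrayList<>();
        long lower = min;
        while (lower + shardSize < max) {
            ranges.add(new long[]{lower, lower + shardSize});
            lower += shardSize;
        }
        ranges.add(new long[]{lower, max}); // final, right-closed shard
        return ranges;
    }
}
```

For example, min 2, max 10, and shard size 3 yield the ranges [2, 5), [5, 8), and [8, 10], matching the int example discussed below.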
2) Adopt different splitting methods according to the types of MIN and MAX.
Date, Text, Float, Integer, Boolean, NText, BigDecimal, and other types are supported. Taking numbers as an example: the step size is (max - min) / number of mappers, and the generated intervals are [min, min + step), [min + step, min + 2 * step), ..., [max - step, max]. Each interval produces a condition like: splitcol >= min AND splitcol < min + step.

2.2 String-type sharding columns

The case of a numeric sharding column is easy to understand. But what if the sharding column is a string type such as char/varchar? How should the upper and lower bounds of each shard be computed?

The principle is still the same: find the minimum and maximum values of the column, compute the split points from the shard size, and generate the upper and lower bounds of each shard. What differs is the technical detail of how those split points and bounds are computed.

If the sharding column is of type int with min 2, max 10, and shard size 3, the shards are easy to understand:

Split [2, 5), Split [5, 8), Split [8, 10]

But if the sharding column is varchar(128) with min 'abc' and max 'xyz', how do we compute the split points?

Sqoop's splitting mechanism maps the "string" to a "number", computes the numeric upper and lower bounds of each shard, and then maps those numeric bounds back to strings, which become the shard's upper and lower bounds. It works as follows:

Map the string to a number: a/65536 + b/65536^2 + c/65536^3
Split the numeric range to compute the split points, generating interpolated values
Map the interpolated values back to strings

In practice, however, this splitting mechanism ran into various problems. Below we share our experience in encountering and solving them.

III. Sharding experience

3.1 First, pulling data with the sharding above sometimes hung.

1) Phenomenon

There was no error output, but the full-extraction process hung after emitting some shards, producing no further output. On inspection, we found that after 30 seconds the Storm worker was inexplicably restarted.

2) Analysis

The default value of nimbus.task.timeout.secs is 30 seconds: when Nimbus finds a worker unresponsive, it restarts the worker. But why was the worker unresponsive? The interpolated strings can be anything, for example: splitcol >= 'abc' AND splitcol < 'fxxx'xx' (note the single quote inside the interpolated value, which corrupts the concatenated SQL).

3) Solution

Use bind variables instead of concatenating strings into the SQL:

SELECT * FROM T WHERE splitcol >= ? AND splitcol < ?

3.2 After this fix, we ran into a new problem: an Illegal mix of collations exception was reported.
1) Phenomenon

The following exception was displayed:

[ERROR] Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8mb4_general_ci,COERCIBLE) for operation '
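To make the Sqoop-style string mapping from section 2.2 concrete, here is a simplified sketch. It treats each character as a base-65536 "digit" and keeps only a 3-character prefix, for which a double is sufficient; Sqoop itself uses BigDecimal to handle longer strings, and this is an illustration of the idea, not Sqoop's or DBus's actual code.

```java
public class TextSplitPoints {
    // Map a short string to a number in [0, 1):
    //   s[0]/65536 + s[1]/65536^2 + s[2]/65536^3
    static double stringToNumber(String s) {
        double result = 0.0;
        double place = 65536.0;
        for (int i = 0; i < s.length() && i < 3; i++) {
            result += s.charAt(i) / place;
            place *= 65536.0;
        }
        return result;
    }

    // Map a number back to a string by extracting base-65536 "digits".
    static String numberToString(double value) {
        StringBuilder sb = new StringBuilder();
        double remainder = value;
        for (int i = 0; i < 3 && remainder > 0; i++) {
            remainder *= 65536.0;
            int digit = (int) remainder;
            if (digit == 0) break;
            sb.append((char) digit);
            remainder -= digit;
        }
        return sb.toString();
    }

    // An interpolated split point between two bounds: map both to numbers,
    // take the midpoint, and map it back to a string.
    static String midpoint(String low, String high) {
        return numberToString((stringToNumber(low) + stringToNumber(high)) / 2.0);
    }
}
```

The interpolated values produced this way are arbitrary character sequences (they may contain quotes, or characters outside the column's charset), which is exactly why the concatenation hang in 3.1 and the collation clash in 3.2 arise.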