Introduction
With the wide popularity of Internet applications, the storage and access of massive data has become a bottleneck in system design. For a large Internet application, billions of page views per day place a heavy load on the database, creating serious problems for the stability and scalability of the system. Splitting the data to improve performance, and scaling out the data layer, has become the preferred approach for architects.
Horizontal database splitting reduces the load on any single machine and limits the damage when one machine goes down; a load balancing strategy reduces the access load on a single machine and lowers the chance of downtime; a clustering scheme keeps the database reachable when a single node fails; and a read-write separation strategy maximizes read throughput and concurrency in the application.
Why do we need data segmentation?
The above gives a summary description of what data segmentation is, and readers may wonder why it is needed. Isn't a mature and stable database like Oracle enough to support the storage and querying of massive data? Why do we need data sharding at all?
Indeed, Oracle's database is mature and stable, but the high cost of using it, and the high-end hardware it demands, are not affordable for every company. Imagine paying tens of millions in fees per year, plus tens of millions of yuan for minicomputers as the hardware platform: can an ordinary company afford that? And even if it could, why not choose a cheaper solution with better scale-out characteristics if one exists?
We know that every machine, no matter how well configured, has a physical upper limit, so when our application reaches or far exceeds that single-machine limit, we can only enlist other machines or keep upgrading our hardware. The common solution is to scale out, sharing the load by adding more machines. As the business logic keeps growing, we also have to consider whether our capacity can grow linearly with it. Sharding can easily distribute computing, storage, and I/O across multiple machines in parallel, making full use of their combined processing power, avoiding single points of failure, improving system availability, and providing good fault isolation.
Considering the above factors, data segmentation is very necessary. We can use free MySQL on cheap servers, or even PCs, to build clusters that achieve the effect of a minicomputer plus a large commercial database, greatly reducing capital investment and operating costs. Why not?
In large and medium-sized projects, when designing the database, the database or its tables are usually split horizontally according to the maximum amount of data the database is expected to bear, to reduce the pressure on any single database or table. This article introduces two table-splitting methods commonly used in projects. Both rely on the application code using a simple routing technique to reach the specific table. First we have to decide on the field that horizontal splitting is based on. In our system (an SNS), the user's UID runs through the whole system, is unique, and auto-increments, so splitting tables on this field could not be more suitable.
Method 1: use MD5 hash
The approach is to hash the UID with MD5 (hashing, not encryption), take the first few characters of the hash (here we take the first two), and thereby distribute different UIDs into different user tables (user_xx).
function getTable($uid) {
    $ext = substr(md5($uid), 0, 2); // first two hex characters of the MD5 hash
    return "user_" . $ext;
}
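As a quick usage sketch (the sample UIDs are ours; the actual table suffix depends on the MD5 hash of the UID string):

echo getTable(10001) . "\n"; // "user_" plus the first two hex chars of md5("10001")
echo getTable(10002) . "\n"; // very likely a different table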
Through this technique, we can distribute different UIDs across 256 user tables, namely user_00, user_01, ..., user_ff. Because the UID is numeric and auto-incrementing, and MD5 output is effectively uniform, the user data is divided almost evenly across the tables.
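To check the even-distribution claim empirically, a small test sketch (hypothetical code, not from the original article) can hash a run of sequential UIDs and count the rows that land in each table:

// Hash one million sequential UIDs and count rows per table.
$counts = array();
for ($uid = 1; $uid <= 1000000; $uid++) {
    $t = getTable($uid);
    $counts[$t] = (isset($counts[$t]) ? $counts[$t] : 0) + 1;
}
// With 256 tables the average is 1000000 / 256 ≈ 3906 rows per table;
// min and max should both be close to that average.
echo min($counts) . " .. " . max($counts) . "\n";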
But the problem here is that as the system gains more and more users, each table accumulates more and more data, and the number of tables cannot be extended under this algorithm, which brings us back to the problem at the start of the article.
Method 2: use bit shifting
The specific method is:
function getTable($uid) {
    // Shift right by 20 bits: every 2^20 = 1,048,576 consecutive UIDs share one table.
    return "user_" . sprintf("%04d", $uid >> 20);
}
Here we shift the UID 20 bits to the right, so that roughly the first million users (2^20 = 1,048,576) are stored in the first table, user_0000, the second million in user_0001, and so on. If we get more and more users, we simply add more user tables. Since the table suffix is kept at four digits, we can add up to 10,000 user tables, namely user_0000, user_0001, ..., user_9999. With 10,000 tables holding 1 million rows each, we can store 10 billion user records. And if you have more user data than that, it doesn't matter: just widen the reserved table suffix to increase the number of expandable tables. For example, with 100 billion records at 1 million per table, you need 100,000 tables, so you simply keep the table suffix at six digits.
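To make the bucket boundaries concrete, a minimal usage sketch (the sample UIDs are ours):

echo getTable(1) . "\n";       // "user_0000" (1 >> 20 == 0)
echo getTable(1048575) . "\n"; // "user_0000" (the last UID in the first bucket)
echo getTable(1048576) . "\n"; // "user_0001" (2^20 lands in the second bucket)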
The above algorithm can also be written flexibly:
/**
 * Table-sharding algorithm based on UID.
 * @param int $uid  user ID
 * @param int $bit  how many digits the table suffix keeps
 * @param int $seed how many bits to shift right
 */
function getTable($uid, $bit, $seed) {
    return "user_" . sprintf("%0{$bit}d", $uid >> $seed);
}
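Calling it with a four-digit suffix and a 20-bit shift reproduces the fixed version above, while widening the suffix prepares for more tables (a minimal usage sketch):

echo getTable(1048576, 4, 20) . "\n"; // "user_0001"
echo getTable(1048576, 6, 20) . "\n"; // "user_000001" (six-digit suffix, room for up to 1,000,000 tables)

Note that once data has been written, $bit and $seed must stay fixed, since changing either alters the table name that an existing UID maps to.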
Summary
Both of the above methods require us to estimate the maximum amount of user data in the current system and the optimal capacity of a single table in the database.
For example, in the second scheme, if we estimate that the system will have 10 billion users and the optimal size of a single table is 1 million rows, then we need to shift the UID right by 20 bits so that each table holds about 1 million rows (2^20 = 1,048,576), and keep a four-digit table suffix (user_xxxx) so the scheme can expand to 10,000 tables.
Likewise, in the first scheme, with 1 million rows per table, taking the first two characters of the MD5 hash yields only 256 tables, so the system's total capacity is 256 × 1 million rows. If the total amount of data in your system exceeds this, you must take the first three, four, or even more characters of the MD5 hash.
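If you want to compute how many hex characters of the MD5 prefix a given capacity needs, a hypothetical helper (our own sketch, not from the original article; each extra hex character multiplies the table count by 16) might look like this:

// Hypothetical: smallest MD5 hex-prefix length n such that
// 16^n tables * $rowsPerTable can hold $totalRows.
function md5PrefixLength($totalRows, $rowsPerTable) {
    $tables = (int)ceil($totalRows / $rowsPerTable);
    return max(1, (int)ceil(log($tables, 16)));
}

echo md5PrefixLength(300000000, 1000000) . "\n";    // 3 (256 tables are no longer enough)
echo md5PrefixLength(100000000000, 1000000) . "\n"; // 5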
Both methods split the data horizontally into different tables, and the second method is more scalable than the first method.