
How MySQL sharding solved the problem of storing 50 billion records

2025-03-31 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This is a technical account of how we split our data across multiple MySQL servers. We completed this sharding system in early 2012, and it is still the system we use to store our core data today.

Before we discuss how to split the data, let's take a look at our data. Mood lighting, chocolate strawberries, Star Trek quotes...

Pinterest is the discovery engine for everything you are interested in. From a data perspective, Pinterest is the largest collection of human interests in the world: more than 50 billion Pins saved by Pinners on 1 billion boards. Users re-Pin and like other people's Pins (roughly, a simple copy), follow other Pinners, boards and interests, and then view a home feed of everything posted by the Pinners they follow. Great! Now make it scale!

The pain of growth

We were a success story in 2011; by some accounts we were growing faster than other startups. By September 2011, every piece of our infrastructure was overloaded. We had applied several NoSQL technologies, all with disastrous results. Meanwhile, the large number of MySQL read slaves generated plenty of annoying bugs, especially around caching. We rebuilt our entire data storage model, and to make it work, we carefully formulated our requirements.

Business requirements

All of our systems needed to be very stable, easy to operate and easy to scale. We wanted the database to start with a small amount of storage and grow along with the business.

All content generated by Pinners must be accessible on the site at any time.

Support requesting N Pins from a board in a deterministic order (such as by creation time, or in a user-specified order). Lists of a Pinner's liked Pins and of a Pinner's own Pins must also be returned in a specific order.

For simplicity, updates would generally be best-effort. To get eventual consistency you need extra machinery, such as a distributed transaction log. This is an interesting and (not so) simple topic.

Solution ideas and key points

Because the solution must distribute a huge number of data slices across multiple database instances, we cannot rely on relational joins, foreign keys or indexes to assemble the whole data set. Think about it: correlated subqueries cannot span different database instances.

Our solution required load-balanced data access. We hate data migration, especially record-by-record migration, which is very error-prone and adds unnecessary complexity because of the relationships in the data. If data must be migrated, it is best to move a logical set of nodes as a whole.

To implement the scheme quickly and reliably, we needed the easiest and most robust technical building blocks available on our distributed data platform.

All data on each instance is fully replicated to a slave instance for backup, and we also back up to S3 using our highly available MapReduce (distributed computing) cluster. Our front-end business logic accesses only the master database instance. Never let your front-end business read or write a slave instance: slaves lag the master, which causes baffling errors. Once the data is sharded and distributed, there is no reason for your front end to touch the slaves at all.

Finally, we needed a well-designed scheme for generating and parsing globally unique identifiers (UUIDs) for all of our data objects.

Our sharding scheme

Whatever it was going to be, we needed a data distribution design that was robust, high-performing and maintainable; in other words, not immature (not widely tested). So we based the design on MySQL: choose mature technology. At the outset we deliberately steered away from database products advertising new auto-scaling capabilities, such as MongoDB, Cassandra and Membase, because they seemed simple to adopt but were a poor fit (frequent inexplicable errors leading to crashes).

Aside: it is strongly recommended to start from the basics and avoid the fashionable and the shiny: learn and use MySQL well, step by step. Believe me, every word here was paid for in tears.

MySQL is a mature, stable, easy-to-use relational database. Not only do we use it; many well-known large companies also use it as backing storage for huge amounts of data. (Translator's note: some years ago, MySQL passed to Oracle along with Sun. Out of concern for its continued openness, companies such as Google and Facebook have since moved to MariaDB, an open source database developed by MySQL's original authors.) MySQL supports our technical requirements: sequential data requests, queries over specified data ranges, and row-level transactions. It has many more features, but we don't need them. MySQL itself is a monolithic solution, however, so we had to shard our data ourselves. (Translator's note: a single instance managing a huge data set is bound to hit performance limits. Slicing a huge whole into individual data sets then requires a solid scheme for integrating the pieces back into a whole, improving performance without introducing errors.) Here is our design:

We started with eight EC2 servers, each running one MySQL instance:

Each MySQL server is master-master replicated onto one redundant host for disaster recovery. Our front-end business reads/writes only from the primary service instance. I suggest you do the same: it simplifies everything and avoids replication-lag bugs. (Translator's note: master-master replication is a MySQL feature in which two machines back each other up. Unlike master-slave replication, the data on the two machines is identical and kept synchronized in the background; each machine has its own IP and both can serve reads and writes. The original author nevertheless stresses that although the pair is mutually redundant and both sides are accessible, you should logically designate one as master and one as slave and always read/write only one of them. In the figure, for example, MySQL001A and MySQL001B are master-master replicated, but you read/write only MySQL001A. Also: they use 16 machines in total; the 8 standbys may or may not be EC2.)

Each MySQL instance can have multiple databases:

Notice how each database is uniquely named: db00000, db00001, up through dbNNNN. Each database is one shard of our data. We made a design decision that once a piece of data is assigned to a shard, it never moves out of that shard. However, you can get more capacity by moving entire shards to other machines (we'll discuss this later).

We maintain a configuration table that records which machine each shard database lives on:

[{"range": (0, 511), "master": "MySQL001A", "slave": "MySQL001B"},
{"range": (512, 1023), "master": "MySQL002A", "slave": "MySQL002B"},
...
{"range": (3584, 4095), "master": "MySQL008A", "slave": "MySQL008B"}]

This configuration table changes only when a shard database is migrated or a host is replaced. For example, if a master instance host goes down, we promote its slave and then bring in a new machine as the slave as soon as possible. The configuration lives in ZooKeeper and, when it changes, is pushed by script to the services that maintain the shards. (Translator's note: notice how the original author keeps stressing the advantage of the front-end business reading and writing only the logical primary instance.)
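The lookup against this configuration table can be sketched as follows. This is an illustrative sketch only: the host names and ranges mirror the example configuration above (with only a subset of entries shown), and the function name is hypothetical.

```python
# Hypothetical sketch of the shard-to-host lookup described above.
# Only a subset of the example configuration entries is shown.
SHARD_CONFIG = [
    {"range": (0, 511), "master": "MySQL001A", "slave": "MySQL001B"},
    {"range": (512, 1023), "master": "MySQL002A", "slave": "MySQL002B"},
    {"range": (3072, 3583), "master": "MySQL007A", "slave": "MySQL007B"},
    {"range": (3584, 4095), "master": "MySQL008A", "slave": "MySQL008B"},
]

def master_for_shard(shard_id):
    """Return the master host responsible for a given shard ID."""
    for entry in SHARD_CONFIG:
        lo, hi = entry["range"]
        if lo <= shard_id <= hi:
            return entry["master"]
    raise KeyError("no host configured for shard %d" % shard_id)
```

With this table, shard 3429 resolves to MySQL007A, matching the translator's note later in the article.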

Each shard database contains the same set of tables with the same structure: pins, boards, users_has_pins, users_likes_pins, pin_liked_by_user and so on, created in step at deployment time.

Distributing data to the shard servers

We combine a shard ID, a data type identifier, and a local ID into one 64-bit globally unique ID. The shard ID takes 16 bits, the type identifier 10 bits, and the local ID 36 bits. Sharp-eyed readers will notice that this only adds up to 62 bits. My past experience distributing and integrating data taught me that a few spare bits are priceless, so I keep 2 bits (set to 0). (Translator's note: per the operations below, the 64-bit unique ID of any object has its top 2 bits always 0, followed by the 16-bit shard ID, then the 10-bit type ID, and finally the 36-bit local ID in the lowest bits. A 36-bit local ID can represent 2^36, over 60 billion IDs; 10 bits of type can represent 2^10 = 1024 object types; and a 16-bit shard ID can address 2^16 = 65536 shard databases. The scheme described here starts with 4096 shard databases.)

For example, decomposing the Pin with global ID 241294492511762325:

Shard ID = (241294492511762325 >> 46) & 0xFFFF = 3429
Type ID = (241294492511762325 >> 36) & 0x3FF = 1
Local ID = (241294492511762325 >> 0) & 0xFFFFFFFFF = 7075733

You can see that this Pin object lives in shard database 3429. The Pin data type ID is 1, and its record is row 7075733 of the pins table in shard database 3429. Supposing shard 3429 lives on MySQL012A, we can fetch its record like this: (Translator's note: the original author is giving a generic example here; by the earlier configuration, shard 3429 would actually be on MySQL007A.)

conn = MySQLdb.connect(host="MySQL012A")
conn.execute("SELECT data FROM db03429.pins WHERE local_id=7075733")
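The bit layout above can be sketched as a pair of pack/unpack helpers. This is a minimal sketch assuming the layout described in the text (2 reserved bits, 16-bit shard, 10-bit type, 36-bit local); the function names are illustrative.

```python
SHARD_BITS, TYPE_BITS, LOCAL_BITS = 16, 10, 36  # top 2 of 64 bits reserved as 0

def make_id(shard_id, type_id, local_id):
    """Pack shard, type and local IDs into one 64-bit global ID."""
    return (shard_id << (TYPE_BITS + LOCAL_BITS)) | (type_id << LOCAL_BITS) | local_id

def split_id(global_id):
    """Unpack a 64-bit global ID back into (shard_id, type_id, local_id)."""
    shard_id = (global_id >> (TYPE_BITS + LOCAL_BITS)) & ((1 << SHARD_BITS) - 1)
    type_id = (global_id >> LOCAL_BITS) & ((1 << TYPE_BITS) - 1)
    local_id = global_id & ((1 << LOCAL_BITS) - 1)
    return shard_id, type_id, local_id
```

Round-tripping the example Pin ID reproduces the shard, type and local values worked out above.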

There are two kinds of data: objects and relationships. An object table holds the details of the object itself, such as a Pin.

Database table in which objects are stored

Each record in an object table represents an object in our front-end business, such as Pins, users, boards and comments. Each table has an ID field (the record's auto-incrementing primary key, the "local ID" mentioned earlier) and a blob data field that holds the object's data as JSON.

CREATE TABLE pins (
  local_id INT PRIMARY KEY AUTO_INCREMENT,
  data TEXT,
  ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB

For example, a Pin object looks like this:

{"details": "New Star Wars character", "link": "http://webpage.com/asdf", "user_id": 241294629943640797, "board_id": 241294561224164665, …}

To create a new Pin object, we gather all its data into a JSON blob, then decide its shard ID (we prefer to place a Pin in the same shard as its board, though this is not a hard rule). The Pin data type ID is 1. We connect to the database shard indicated by the shard ID and insert the Pin's JSON into the pins table; MySQL returns the auto-incrementing primary key, which is this Pin's local ID. With shard, type and local ID in hand, we can build the Pin's 64-bit global ID. (Translator's note: as the author says, every front-end object is stored in an object table and located by a globally unique ID, but because of sharding the global ID is not the table's local ID. The author keeps returning to this design and its rationale: it improves performance on sharded mass data while remaining usable, maintainable and extensible, as the following sections explain in turn.)
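The create flow above can be sketched end to end. This is an illustrative sketch only: it substitutes Python's in-memory sqlite3 for a MySQL shard, and the shard choice and bit layout follow the examples in the text.

```python
import json
import sqlite3

# Sketch of the create-Pin flow, with an in-memory SQLite database
# standing in for a MySQL shard. Table and field names follow the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pins (local_id INTEGER PRIMARY KEY AUTOINCREMENT, data TEXT)"
)

pin = {"details": "New Star Wars character", "link": "http://webpage.com/asdf"}
shard_id, type_id = 3429, 1  # shard chosen up front; Pin type is 1

cur = conn.execute("INSERT INTO pins (data) VALUES (?)", (json.dumps(pin),))
local_id = cur.lastrowid  # auto-incrementing primary key = local ID

# Pack shard / type / local into the 64-bit global ID.
global_id = (shard_id << 46) | (type_id << 36) | local_id
```

The global ID built this way decomposes back into the same shard, type and local values.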

To edit a Pin object, we use a MySQL transaction to read-modify-write the JSON data field of the Pin's record:

> BEGIN
> SELECT blob FROM db03429.pins WHERE local_id=7075733 FOR UPDATE
[modify json blob]
> UPDATE db03429.pins SET blob='' WHERE local_id=7075733
> COMMIT

To delete a Pin object you could, of course, simply delete its row from the MySQL table. But think carefully: it may be better to add a field such as "active" to the object's JSON and leave the filtering to the front-end business logic.

(Translator's note: anyone who has studied relational databases knows that auto-increment primary keys keep rows contiguous, and deleting rows leaves holes; enough holes will degrade database performance. Here the database is responsible only for storing data and for fast reads, writes and queries; the relationships between data are maintained entirely by the well-designed global object IDs through middleware logic. This design idea runs through the whole article; grasp it and you grasp the core of the piece.)
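The soft-delete idea above can be sketched in a few lines. This is a hypothetical sketch: the "active" field name comes from the text, but the helper name and blob handling are illustrative.

```python
import json

def deactivate(blob):
    """Soft-delete an object by flipping its 'active' flag inside the JSON
    blob, leaving the row (and the auto-increment sequence) intact."""
    obj = json.loads(blob)
    obj["active"] = False
    return json.dumps(obj)
```

The front-end logic then simply skips objects whose "active" flag is false, and no holes are ever created in the table.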

Relational mapping database table

A relationship mapping table represents a relationship between front-end business objects, such as which Pins are on a board, or which board a Pin is on. The MySQL table for such a relationship has three fields: a 64-bit "from" ID, a 64-bit "to" ID, and a sequence number, each indexed for fast lookup. The record is stored on the shard indicated by the shard ID deconstructed from the "from" field's ID.

CREATE TABLE board_has_pins (
  board_id INT,
  pin_id INT,
  sequence INT,
  INDEX (board_id, pin_id, sequence)
) ENGINE=InnoDB

(Translator's note: "relationship mapping" here means that relationships between front-end objects are operated and maintained via plain database tables, not via the relational database's own mapping facilities mentioned in my earlier note. As the author said at the start, sharding makes cross-instance relational mappings such as one-to-one, one-to-many and many-to-many impossible.)

Relationship tables are unidirectional: the board_has_pins table makes it easy to query which Pins are on a given board ID. If you also need to query which board a Pin belongs to by Pin ID, create a separate table, pin_owned_by_board, in which the sequence field represents the Pin's position on the board. (Because data is spread across shards, our IDs cannot themselves encode order.) When we add a new Pin to a board we usually set its sequence to the current system time. Sequence can be any integer; using the current time guarantees that a new object's sequence is always larger than an old one's, which is convenient and simple. You can query an object set from a relationship table like this:

SELECT pin_id FROM board_has_pins WHERE board_id=241294561224164665 ORDER BY sequence LIMIT 50 OFFSET 150

The statement returns up to 50 pin_ids, which can then be used to look up the Pin objects themselves.

We map these relationships only at the application layer: board_id -> pin_ids -> pin objects. A great feature of this design is that each mapping can be cached separately. We cache the pin_id -> pin object mapping on memcache clusters, and the board_id -> pin_ids mapping on redis clusters. This fits our cache-optimization strategy very well.
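The two-level lookup described above can be sketched with plain dicts standing in for the redis and memcache tiers. This is a hypothetical sketch: the function and variable names, and the injected db_query callback, are illustrative, not Pinterest's actual code.

```python
# Plain dicts stand in for the two cache tiers described in the text:
# redis for board_id -> pin_ids, memcache for pin_id -> pin object.
board_to_pins_cache = {}   # stand-in for redis
pin_object_cache = {}      # stand-in for memcache

def fetch_board_pins(board_id, db_query):
    """Resolve board_id -> pin_ids -> pin objects, caching each mapping
    separately so either tier can be invalidated on its own."""
    if board_id not in board_to_pins_cache:
        board_to_pins_cache[board_id] = db_query("pin_ids", board_id)
    pins = []
    for pin_id in board_to_pins_cache[board_id]:
        if pin_id not in pin_object_cache:
            pin_object_cache[pin_id] = db_query("pin", pin_id)
        pins.append(pin_object_cache[pin_id])
    return pins
```

Because the two mappings are cached independently, editing one Pin invalidates a single pin_object_cache entry without touching the board's Pin list.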

Increase service capacity

In our system there are three main ways to add capacity. The easiest is to upgrade the machines (more space, faster disks, more RAM, whatever removes the bottleneck).

The second is to open up a new range of shards. We initially created only 4096 shard databases, leaving plenty of headroom against the 16-bit shard ID, which can address 65536. At some point we can bring up 8 more machines running 8 more MySQL instances, hosting shards 4096 through 8191; from then on, new data lands only in that new range. With 16 databases working in parallel, capacity necessarily grows.

Finally, we can move shard databases onto new shard hosts (local shard rebalancing). For example, to spread the MySQL001A shard host from the earlier example (holding shards 0 through 511) across 2 hosts, we create a new master-master pair as the new shard host (named MySQL009A and B) and copy the data over wholesale from MySQL001A.

When replication completes, we change the shard configuration: MySQL001A now serves only shards 0 through 255, and MySQL009A serves shards 256 through 511. Each of the 2 hosts now handles half the load the single host used to carry, and capacity goes up.
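After such a split, the affected entries in the configuration table might look like this. This is a hypothetical fragment following the text's naming; the remaining entries are unchanged.

```python
# Hypothetical configuration after splitting MySQL001A's shard range:
[
    {"range": (0, 255), "master": "MySQL001A", "slave": "MySQL001B"},
    {"range": (256, 511), "master": "MySQL009A", "slave": "MySQL009B"},
    # ... remaining entries unchanged ...
]
```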

Some notable features

For business objects created in the old system, the design leaves it to you to decide where each one goes, i.e. which shard database to place it in. (You can plan how old data is laid out across the shards.) But when you insert an object into a shard, the database returns the local ID only upon insertion; only then can you build the object's UUID.

(Translator's note: when migrating, the relationships between business objects should be re-established through these UUIDs.)

Anyone who has run schema-altering commands (ALTERs), such as adding a column, against tables that already hold a lot of data knows it is a long and painful process. Our design never uses ALTER-level commands on MySQL once data is in place; on our business system, Pinterest, we ran our last ALTER statement about 3 years ago. To add a new attribute to an object in an object table, simply add the field to the object's JSON blob. You can give the new attribute a default value, and when you read an old object that lacks the attribute, apply the default. For relationship tables, just create a new mapping table to satisfy the new need. That's all there is to it. Set sail!
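The default-on-read idea above can be sketched in a few lines. This is a hypothetical sketch: the field names in PIN_DEFAULTS (including "repin_count") are illustrative, not fields the article specifies.

```python
import json

# Sketch of schema evolution without ALTER TABLE: attributes added later
# get a default value when old blobs are read. Field names are illustrative.
PIN_DEFAULTS = {"active": True, "repin_count": 0}

def load_pin(blob):
    """Decode a Pin blob, filling in defaults for attributes added later."""
    obj = json.loads(blob)
    for field, default in PIN_DEFAULTS.items():
        obj.setdefault(field, default)
    return obj
```

Old rows never need to be rewritten: the defaults are applied at read time, and new rows simply include the new fields.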

Mod sharding

The name "mod shard" sounds like Mod Squad, but it is something completely different.

Some business objects need to be looked up by something other than an ID (here meaning the 64-bit UUIDs of the earlier design). For example, if a Pinner signs in to our platform with a registered Facebook account, we need to map that Facebook ID to our Pinner's ID; to our system, a Facebook ID is just a string of bits. (Translator's note: the implication is that we cannot deconstruct another platform's IDs the way we deconstruct our own, let alone shard by them; we simply store them and map them to our IDs.) So we need to store them, and store them in their own shard databases. Other examples of what we call mod sharding include IP addresses, user names and email addresses.

Mod sharding is similar to the ID sharding of our business system described above, except that you must query by the raw input. Hashing and a modulo operation determine the shard: the hash function turns an arbitrary string into a fixed-size number, and taking that number modulo the count of existing mod shards always lands on some existing shard database, where the data is stored. For example:

shard = md5("1.2.3.4") % 4096

(Translator's note: I searched everywhere for a more accurate, authoritative Chinese rendering of "mod shard", without success: "mod" can mean module, and it is also the modulo operator (%). I translated it according to the sense of the original; corrections to any inadequacies are welcome. Note also that the author's example keys on an IP address and hashes with md5, which is old but performs well compared with the alternatives.)

In this example the shard is 1524. We maintain a configuration file similar to the one for ID sharding:

[{"range": (0, 511), "master": "msdb001a", "slave": "msdb001b"},
{"range": (512, 1023), "master": "msdb002a", "slave": "msdb002b"},
{"range": (1024, 1535), "master": "msdb003a", "slave": "msdb003b"},
…]

So, to find data with an IP of 1.2.3.4, we will do this:

conn = MySQLdb.connect(host="msdb003a")
conn.execute("SELECT data FROM msdb001a.ip_data WHERE ip='1.2.3.4'")
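The hash-and-mod step above can be sketched concretely. The article writes it loosely as md5(...) % 4096; this sketch makes one assumption explicit, that the md5 digest is first interpreted as an integer before taking the modulus, and the function name is illustrative.

```python
import hashlib

NUM_SHARDS = 4096  # size of the mod-shard space, per the text

def mod_shard(key):
    """Map an arbitrary string key (IP, username, email) to a shard number
    by hashing with md5 and taking the result modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Any string key deterministically lands on one of the existing shards, so the same IP or username always resolves to the same database.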

You do lose some nice sharding properties, such as data locality. You must set the shard key from the beginning (it will not create one for you), and it is best to use immutable IDs to represent objects in the system: that way, when users change their user names, you don't have to update a mass of references.

A final reminder

This system has served as Pinterest's data backbone for three and a half years and looks set to keep running. Designing and implementing such a system is intuitive and easy; getting it running, and above all migrating old data into it, is not. If your platform is suffering growing pains and you want to shard your own database, consider building a cluster of background workers (we prefer pyres) to script the migration of old data into the shards and automate it. I guarantee that no matter how carefully you plan and how hard you try, you will lose data or lose links between data. I truly hate those gremlins hiding in complex data relationships. So migrate, repair, and migrate again. It takes great patience and effort, until you are sure you no longer need to touch the shards for old-data migration.

This system does its best at distributing and sharding data. It cannot itself give you Atomicity, Consistency or Isolation from the ACID properties of database transactions. Sounds terrible, right? Don't worry: you may not be able to lean on the database's own guarantees, but everything is under your control, and you simply make it meet your needs. Simple and direct design is king, mostly because it runs so fast. If you are worried about A (atomicity), I (isolation) and C (consistency), write to me; I have a pile of experience to help you beat these problems.

One last question: what about disaster recovery? We built another service to maintain the shard databases, and we store the shard configuration in ZooKeeper. When a master server goes down, a script automatically promotes the corresponding slave, and then we bring up a replacement slave machine as fast as possible. To this day, we have never used anything like automatic disaster recovery.

That is the detail of how MySQL sharding solved the problem of storing 50 billion records.
