This article takes a detailed look at why Uber chose to migrate from Postgres to MySQL, in the hope of helping readers facing the same decision find a simpler, more workable path.
Uber's early architecture consisted of a monolithic back-end application written in Python that used Postgres as its data store. Since then, Uber's architecture has changed dramatically, moving to a model of microservices and new data platforms. Specifically, where we previously used Postgres, we now use Schemaless, a database sharding layer built on top of MySQL. In this article, we explore some of Postgres's shortcomings and explain why we built Schemaless and other back-end services on top of MySQL.
1. Postgres architecture
We have encountered many limitations of Postgres:
Inefficient architecture for writes
Inefficient data replication
Data corruption issues
Poor replica MVCC support
Difficulty upgrading to newer releases
We will explore these limitations by analyzing how Postgres's tables and indexes are represented on disk and compare them with MySQL's InnoDB storage engine. Please note that our analysis is mainly based on our experience with the older Postgres 9.2 release series. To the best of our knowledge, the internal architecture discussed in this article has not changed significantly in the newer Postgres releases, and the basic design of version 9.2 has not changed significantly since at least the Postgres 8.3 release (now nearly 10 years old).
Disk representation
A relational database must be able to perform some key tasks:
Provide insert, update, and delete capabilities
Provide the ability to modify the schema
Support for MVCC, so that different database connections have their own transaction views.
How these features work together shapes, to a large degree, how a database's data is represented on disk.
One of Postgres's core design decisions is that data rows are immutable. These immutable rows are called "tuples" in Postgres parlance, and each tuple is uniquely identified by a ctid. Conceptually, a ctid represents the tuple's location on disk (that is, its physical disk offset). Multiple ctids can describe a single row (for example, when multiple versions of the row exist for MVCC purposes, or when old versions of the row have not yet been reclaimed by the autovacuum process). A collection of tuples forms a table. The tables themselves have indexes, organized as data structures (typically B-trees) that map index fields to ctids.
Typically these ctids are transparent to users, but knowing how they work helps in understanding the on-disk structure of a Postgres table. To see the ctid for a row, you can add "ctid" to the column list of a query:
uber@[local] uber=> SELECT ctid, * FROM my_table LIMIT 1;
-[ RECORD 1 ]--+-----------------
ctid           | (0,1)
...other fields...
We explain this through a simple user table. For each user, we have an automatically incrementing user ID primary key, the user's first and last name, and the user's year of birth. We also define a compound secondary index for the user's full name (first name and last name) and another secondary index for the user's year of birth. The DDL that creates the table might look like this:
CREATE TABLE users (
    id SERIAL,
    first TEXT,
    last TEXT,
    birth_year INTEGER,
    PRIMARY KEY (id)
);
CREATE INDEX ix_users_first_last ON users (first, last);
CREATE INDEX ix_users_birth_year ON users (birth_year);
Three indexes are defined here: one primary key index and two secondary indexes.
We insert the following data into the table, including some influential historical mathematicians:
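As an illustration (only the al-Khwārizmī row, which later examples fix as id 4 with birth year 780, is confirmed by the text; the other rows are assumed sample values), the data could be loaded like this:
INSERT INTO users (first, last, birth_year) VALUES
    ('Blaise',    'Pascal',       1623),
    ('Gottfried', 'Leibniz',      1646),
    ('Emmy',      'Noether',      1882),
    ('Muhammad',  'al-Khwārizmī',  780);
-- with a SERIAL id, al-Khwārizmī ends up as id 4, matching the later examples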
As mentioned earlier, each of these rows has an implicit, unique ctid, so the internal representation of the table is simply the collection of tuples together with their ctids.
The primary key index maps each id to its ctid. It is a B-tree defined on the id field, and each entry in it holds a ctid value. Note that because id is auto-incrementing, the ordering of entries in this B-tree happens to match the ordering of rows in the table, but this is not necessarily the case in general.
The secondary indexes look similar; the main difference is that their fields are stored in a different order, since a B-tree must be organized lexicographically. The (first, last) index is ordered alphabetically by first name, and the birth_year index is likewise sorted in ascending order of birth year. In both cases, the ctid values in the secondary indexes are not monotonically increasing, unlike in the auto-incrementing primary key case.
Suppose we need to update a record in this table; say, we want to correct al-Khwārizmī's year of birth. As mentioned earlier, row tuples are immutable, so to update the record Postgres adds a new tuple to the table. The new tuple has a new, opaque ctid, which we will call I, and Postgres must be able to distinguish it from the old tuple at ctid D. Internally, Postgres keeps a version field in each tuple, along with a pointer to the previous tuple where one exists, so the table now contains both versions of the row.
As long as both versions of the al-Khwārizmī row exist, the indexes must hold entries for both of them. For brevity we leave out the primary key index and consider only the secondary indexes here. Postgres uses the version field to determine which tuple is the most recent, and on that basis the database decides which tuple is visible to a transaction that is not allowed to see the newer version of the data.
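These version fields are exposed as hidden system columns; as an illustration (this query is ours, not from the original article), they can be inspected alongside the ctid:
-- xmin / xmax record which transactions created and invalidated each tuple version
SELECT ctid, xmin, xmax, * FROM users WHERE id = 4;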
In Postgres, both the primary and secondary indexes point directly to the tuple offset on disk. When the tuple location changes, all indexes must be updated.
Replication
When we insert a new row into a table, Postgres needs to replicate it if streaming replication is enabled. For crash recovery, the database already maintains a write-ahead log (WAL) and uses it to implement two-phase commit. The database has to maintain this WAL even if streaming replication is not enabled, because the WAL is what provides the atomicity and durability guarantees of ACID.
To better understand the WAL, imagine what would happen if the database crashed unexpectedly, say in a sudden power outage. The WAL represents a ledger of changes the database plans to make to the on-disk contents of tables and indexes. When the Postgres daemon starts, it compares the WAL data with the actual data on disk. If the WAL contains data that is not reflected on disk, the database corrects the tuple or index data, and it rolls back any data that appears in the WAL but belongs to transactions that were never committed.
Postgres implements streaming replication by sending the primary database's WAL to the replicas. Each replica effectively acts as if it were in constant crash recovery, continuously applying WAL updates. The only difference between streaming replication and actual crash recovery is that a replica in "hot standby" mode can serve queries while applying the WAL, whereas a Postgres database that is genuinely in crash recovery mode typically refuses to serve queries until the recovery process completes.
Because WAL is actually designed for crash recovery, it contains the underlying disk update information. WAL contains the disk representation of tuples and their disk offsets (that is, row ctid). If the replica is fully synchronized with the primary database, and the primary database and replica of Postgres are paused, the disk contents of the replica will be exactly the same as those of the primary database. Therefore, if the replica is out of sync with the primary database, it can be repaired with a tool such as rsync.
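As a rough sketch of how this is wired up (the parameters below are a 9.2-era assumption with illustrative values, not configuration taken from the original article):
# postgresql.conf on the primary (illustrative values)
wal_level = hot_standby        # write enough WAL detail for replicas
max_wal_senders = 5            # allow replicas to stream WAL
wal_keep_segments = 256        # retain WAL so lagging replicas can catch up

# recovery.conf on each replica
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replicator'

# postgresql.conf on each replica
hot_standby = on               # serve read-only queries while applying WAL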
2. The consequences of the design of Postgres
These aspects of Postgres's design resulted in inefficiencies for Uber's data and caused us a good deal of trouble.
Write amplification
The first problem with Postgres is write amplification. Usually, write amplification refers to a problem with writing data to SSD disks: a small logical update (say, a few bytes) becomes a much larger, more expensive write once it is translated to the physical layer. The same kind of problem shows up in Postgres. In our earlier example, when we make a small logical update to al-Khwārizmī's year of birth, we have to issue at least four physical updates:
Write a new row tuple to the tablespace
Update primary key index
Update (first,last) index
Update the birth_year index.
In fact, these four updates only reflect the writes made to the main tablespace; each of them also has to be reflected in the WAL, so the total number of writes to disk is even larger.
Updates 2 and 3 above are worth dwelling on. When we update al-Khwārizmī's year of birth, we do not actually change his primary key, nor do we change his first or last name. Yet new index entries pointing at the new row tuple still have to be created. For tables with a large number of secondary indexes, these extra steps can cause enormous inefficiencies. For example, if a table has twelve indexes defined on it, an update to a field covered by only one of them must be propagated into all twelve so that they reflect the ctid of the new row.
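For reference, the single logical change that triggers all of these physical writes is nothing more than the following (a sketch against the users table defined earlier):
-- one small logical update...
UPDATE users SET birth_year = 770 WHERE id = 4;
-- ...yet Postgres writes a new heap tuple, new entries in all three indexes,
-- and the corresponding WAL records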
Replication
This write amplification problem naturally translates into the replication layer, because replication happens at the level of on-disk changes. Instead of replicating a small logical record such as "change the birth year of ctid D to 770," the database propagates WAL entries for all four of the writes described above across the network. The write amplification problem therefore also becomes a replication amplification problem, and the Postgres replication data stream quickly becomes extremely verbose, potentially occupying a large amount of bandwidth.
If Postgres replication happened only within a single data center, replication bandwidth might not be a problem. Modern network gear and switches can handle a great deal of bandwidth, and many hosting providers offer free or cheap intra-data-center bandwidth. However, when replication has to happen between data centers, the problems escalate quickly. For example, Uber originally used physical servers in a West Coast hosting center. For disaster recovery purposes, we added servers in an East Coast hosting center, giving us a primary Postgres instance (plus replicas) in the West Coast data center and a set of replicas in the East Coast one.
Cascading replication keeps the inter-data-center bandwidth requirement down to the amount needed between the primary and a single replica, even though there are many replicas in the second data center. However, because the Postgres replication protocol is so verbose, a database that uses many indexes still produces an enormous volume of replication data. Buying very high-bandwidth cross-country links is expensive, and even when money is no object, it is impossible to get cross-country bandwidth that behaves like local bandwidth. The bandwidth problem also caused trouble for WAL archiving. Besides sending all WAL updates from the West Coast to the East Coast, we archived all WAL to a file storage service so that we could restore data in the event of a disaster. During early peak traffic, our bandwidth to the storage service simply was not fast enough to keep up with the rate at which WAL was being written.
Data corruption
While routinely promoting the primary database to increase capacity, we ran into a Postgres 9.2 bug. Some of the replicas followed a timeline switch incorrectly, causing them to misapply a small number of WAL records. Because of this bug, some records that should have been marked as inactive by the versioning mechanism were not actually marked inactive.
The following query shows how this error will affect our user table:
SELECT * FROM users WHERE id = 4
This query returned two records: the original al-Khwārizmī row with the 780 CE birth year and the new al-Khwārizmī row with the 770 CE birth year. If we added ctid to the column list of the SELECT, we would see different ctid values for the two returned records.
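An illustrative form of that query (the actual ctid values will of course vary):
SELECT ctid, * FROM users WHERE id = 4;
-- returns two rows with different ctids: one with birth_year 780, one with 770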
This bug was extremely vexing. To begin with, we had no idea how many rows were affected. The duplicated results returned from the database caused application logic to fail in a number of cases. We ended up adding defensive programming statements to detect which tables were known to have the problem. The bug affected all of the servers, but the corrupted rows were different on different replica instances: on one replica, row X might be bad and row Y good, while on another replica row X might be good and row Y bad. We had no way of determining how many replicas had corrupted data, or whether the problem had affected the primary.
As far as we know, the problem affected only a small number of rows in each database, but we were very worried that, because replication happens at the physical level, we could end up with completely corrupted database indexes. An essential aspect of B-trees is that they must be periodically rebalanced, and these rebalancing operations can completely change the structure of the tree as subtrees are moved to new on-disk locations. If the wrong data is moved, large parts of the tree can become completely invalid.
In the end we identified the bug and used it to determine that the newly promoted primary did not have any corrupted rows. We fixed the corruption on the replicas by resyncing all of them from a fresh snapshot of the primary, a laborious process.
The bug we encountered affected only certain releases of Postgres 9.2 and has long since been fixed. However, it worries us that this class of bug can happen at all: a new version of Postgres could be released at any time with a bug of this kind, and because of the way replication works, the problem could spread to every database in the replication hierarchy.
Replica MVCC
Postgres does not offer true MVCC support on replicas. The fact that replicas simply apply WAL updates means that, at any point in time, they hold an exact copy of the primary's on-disk data. This design caused problems for Uber.
The problem is that replicas still need the old row versions for MVCC. If a streaming WAL update conflicts with a transaction that is in progress on the replica (for example, it affects rows the transaction still holds open), applying the update is blocked: Postgres pauses the WAL application thread until the transaction ends. This is a problem if the transaction runs for a long time, because the replica can then lag seriously behind the primary. Postgres therefore applies a timeout in this situation: if a transaction blocks WAL application for a configured amount of time, Postgres kills that transaction.
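The timeout is governed by replica-side settings; a hedged sketch of the relevant 9.2-era parameters (the values are illustrative, not Uber's):
# postgresql.conf on a hot-standby replica (illustrative values)
max_standby_streaming_delay = 30s   # how long WAL apply may wait before
                                    # cancelling conflicting queries
hot_standby_feedback = off          # if on, the replica tells the primary which
                                    # rows it still needs, delaying vacuum instead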
This design means that replicas can routinely lag several seconds behind the primary, which makes it easy to write code that ends up with killed transactions. For example, suppose a developer writes code that needs to e-mail a receipt to a user. Depending on how it is written, the code may implicitly keep a database transaction open until the e-mail finishes sending. While it is bad form to hold a database transaction open while performing unrelated blocking I/O, most engineers are not database experts and may not be aware of the problem, especially when using an ORM framework that hides the low-level details.
Upgrade Postgres
Because replication happens at the physical level, data cannot be replicated between different general-availability releases of Postgres: a Postgres 9.3 primary cannot replicate to a Postgres 9.2 replica, nor can a 9.2 primary replicate to a 9.3 replica.
We follow these steps to upgrade from one Postgres GA version to another:
Shut down the primary database.
Run the pg_upgrade command against the primary, which rewrites the primary's data in place (see the sketch after this list). For large databases this usually takes hours, and no data can be read from the primary while it runs.
Start the master database again.
Create the latest snapshot of the primary database. This step completely replicates all the data in the primary database, so it can take hours for a large database.
Erase all replicas and restore the latest snapshot from the primary database to the replica.
Bring the replicas back into the replication hierarchy and wait for them to catch up with all of the updates made to the primary in the meantime.
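As a hedged sketch of step 2, a pg_upgrade invocation looks roughly like this (the paths are placeholders, not Uber's actual layout):
pg_upgrade \
    --old-bindir=/usr/lib/postgresql/9.1/bin \
    --new-bindir=/usr/lib/postgresql/9.2/bin \
    --old-datadir=/var/lib/postgresql/9.1/main \
    --new-datadir=/var/lib/postgresql/9.2/main \
    --link    # use hard links instead of copying data files where possible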
We started with Postgres 9.1 and successfully completed this upgrade process to move to Postgres 9.2. However, the process took so many hours that we could not afford to run it again. By the time Postgres 9.3 came out, Uber's growth had substantially increased our dataset, so the upgrade would have taken even longer. For that reason, our Postgres instances still run 9.2 to this day, even though Postgres 9.5 has already been released.
If you are running Postgres 9.4 or later, you can use something like pglogical, which implements a logical replication layer for Postgres. With pglogical you can replicate data between different Postgres versions, which means it is possible to upgrade, say, from 9.4 to 9.5 without extensive downtime. This is still problematic, though, because pglogical has not been integrated into the Postgres mainline tree, and for anyone running an older Postgres release, pglogical is not an option.
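As a heavily hedged sketch of what a pglogical-based setup looks like (the function names come from the pglogical extension; node names and connection strings here are placeholders):
-- on the old (provider) cluster, e.g. 9.4
SELECT pglogical.create_node(
    node_name := 'provider',
    dsn       := 'host=old-db dbname=uber');
SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

-- on the new (subscriber) cluster, e.g. 9.5
SELECT pglogical.create_node(
    node_name := 'subscriber',
    dsn       := 'host=new-db dbname=uber');
SELECT pglogical.create_subscription(
    subscription_name := 'upgrade',
    provider_dsn      := 'host=old-db dbname=uber');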
3. MySQL architecture
The sections above described some of Postgres's limitations. Next, we explain why MySQL became the new tool of choice for Uber Engineering storage projects such as Schemaless. In many cases, we found MySQL better suited to our usage scenarios. To understand the differences, we examined MySQL's architecture and compared it with that of Postgres, focusing specifically on MySQL's InnoDB storage engine.
Disk representation of InnoDB
Like Postgres, InnoDB supports advanced features such as MVCC and variable-length data. A full discussion of InnoDB's on-disk format is beyond the scope of this article; here we focus on its main differences from Postgres.
The chief architectural difference is that, while Postgres maps index records directly to on-disk locations, InnoDB maintains a secondary structure: an InnoDB secondary index entry holds a pointer to the row's primary key value rather than a pointer to its on-disk location (the role played by ctid in Postgres). Secondary index lookups in MySQL therefore go through the primary key.
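As a hedged illustration (the original article does not show MySQL DDL; the column types and lengths here are assumptions), the same users table under InnoDB might be declared like this:
CREATE TABLE users (
    id          INT AUTO_INCREMENT,
    first       VARCHAR(64),
    last        VARCHAR(64),
    birth_year  INT,
    PRIMARY KEY (id),
    KEY ix_users_first_last (first, last),
    KEY ix_users_birth_year (birth_year)
) ENGINE=InnoDB;
-- the secondary indexes store the indexed columns plus the primary key value;
-- a lookup then uses that id to probe the clustered primary key index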
To run a query against the (first, last) index, two lookups are actually needed. The first searches the (first, last) index to find the record's primary key. Once the primary key is found, the second searches the primary key index to find the row's location on disk.
Consequently, InnoDB is at a slight disadvantage relative to Postgres when doing a secondary-index lookup, since it must search two indexes where Postgres searches only one. However, because the data is normalized in this way, updating a row only touches the index records actually changed by the update. In addition, InnoDB usually updates row data in place; to support MVCC, if an old transaction still needs to reference a row, MySQL copies the old row into a special area called the rollback segment.
Let's look at what happens when we update al-Khwārizmī's year of birth. If there is space, the birth_year field in the row with id 4 is updated in place (and in fact this particular update always happens in place, since the birth year is an integer occupying a fixed amount of space). The birth_year index is also updated to reflect the new value, and the old row data is copied into the rollback segment. The primary key index does not need to be updated, and neither does the (first, last) index. Even if the table had a large number of indexes, only those containing the birth_year field would need updating. If we had indexes on fields such as signup_date or last_login_time, we would not need to update them, whereas Postgres would have to.
This design also makes vacuuming and compaction more efficient. The rows eligible for cleanup are available directly in the rollback segment; by contrast, Postgres's autovacuum process has to perform full table scans to identify which rows can be reclaimed.
MySQL thus uses an extra layer of indirection: secondary index records point to primary index records (primary key values), and the primary index itself holds the on-disk locations of the rows. If a row's on-disk offset changes, only the primary index needs to be updated.
Replication
MySQL supports many different replication modes:
Statement-based replication will replicate logical SQL statements (it will literally copy SQL statements, for example: UPDATE users SET birth_year = 770 WHERE id = 4)
Row-based replication will replicate changed row records
Mixed-mode replication blends the two.
These modes each have advantages and disadvantages. Statement-based replication is usually the most compact, but it can require replicas to apply expensive statements in order to update only small amounts of data. Row-based replication, akin to Postgres WAL replication, is more verbose, but it is more predictable and efficient for replicas to apply.
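The mode is selected with the binlog_format server setting; a hedged my.cnf sketch (the values are illustrative):
# my.cnf (illustrative)
[mysqld]
server_id     = 1             # required for replication topologies
log_bin       = mysql-bin     # enable the binary log that replicas read
binlog_format = MIXED         # STATEMENT, ROW, or MIXED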
In MySQL, only the primary index holds pointers to the on-disk offsets of rows. This has an important consequence for replication: the MySQL replication stream only needs to contain logical update information about rows. For an update such as "change the timestamp of row X from timestamp 1 to timestamp 2," the replica automatically infers which indexes need to be modified.
By contrast, the Postgres replication stream contains physical changes, such as "write byte XYZ at disk offset 8382491". When using Postgres, every physical change to the disk needs to be included in the WAL stream. Minor logical changes (such as updating timestamps) also require a lot of disk changes: Postgres must insert new tuples and update all indexes to point to this tuple, so many changes will be put into the WAL stream. This design difference means that MySQL replication binary logs are more compact than PostgreSQL WAL streams.
The replication mode also has important consequences for MVCC on replicas. Because the MySQL replication stream consists of logical updates, replicas can have true MVCC semantics, and read queries against a replica do not block the replication stream. By contrast, the Postgres WAL stream contains physical on-disk changes, so Postgres replicas cannot apply replication updates that conflict with in-progress read queries and cannot offer MVCC.
MySQL's replication architecture also means that even if a bug causes table corruption, a catastrophic failure is less likely. Because replication happens at the logical layer, operations such as rebalancing a B-tree can never corrupt an index on a replica. A typical MySQL replication problem is a statement being skipped (or applied twice), which can leave data missing or invalid, but it will not take down the database.
Finally, MySQL's replication architecture makes it straightforward to replicate between different versions of MySQL. Its logical replication format also means that on-disk changes in the storage engine layer do not affect the replication format. The typical way to upgrade MySQL is to apply the update to one replica at a time and, once all replicas have been upgraded, promote one of them to be the new primary. Upgrades can therefore be done with almost zero downtime, and keeping MySQL up to date is easy.
4. Other advantages of MySQL
So far we have described the on-disk architectures of Postgres and MySQL. There are other important aspects of MySQL's architecture that give it markedly better performance than Postgres for our workload.
Buffer pool
First, the two databases cache data differently. Postgres allocates some memory for internal caches, but those caches are usually small compared with the total memory on the machine. To improve performance, Postgres lets the kernel automatically cache recently accessed disk data via the page cache. For example, our largest Postgres machines have 768 GB of memory available, but only about 25 GB of that is actually RSS memory used by Postgres processes, leaving more than 700 GB to the Linux page cache.
The problem with this design is that accessing data through page caching is actually more expensive than accessing RSS memory. To find data from disk, the Postgres process issues lseek and read system calls to locate the data. Each of these system calls causes a context switch, which is more expensive than accessing data from the main memory. In fact, Postgres is not even fully optimized in this respect: Postgres does not take advantage of the pread system call, which combines seek and read operations into a single system call.
By contrast, the InnoDB storage engine implements its own LRU in what it calls the buffer pool. Logically this is similar to the Linux page cache, but it is implemented in user space. Although InnoDB's buffer pool design is considerably more complicated than Postgres's approach, it has some advantages:
It is possible to implement a custom LRU design. For example, it is possible to detect pathological access patterns that would blow out the LRU and prevent them from doing too much damage.
Fewer context switches. Data accessed through the InnoDB buffer pool does not require any user/kernel context switches. The worst-case behavior is a TLB miss, which is relatively cheap and can be minimized by using huge pages.
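A hedged sketch of how the two caches are typically sized (the numbers are illustrative and only echo the memory figures above; they are not actual Uber configuration):
# my.cnf -- InnoDB manages its cache itself, in user space
[mysqld]
innodb_buffer_pool_size = 512G

# postgresql.conf -- Postgres keeps its own buffers small and leans on
# the Linux page cache for everything else
shared_buffers = 25GB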
Connection processing
MySQL handles concurrent connections by spawning a thread per connection. The overhead of this is relatively low: each thread has some memory for its stack plus heap-allocated buffers for connection-specific state. It is not uncommon to run around 10,000 concurrent connections with MySQL; in fact, some of our existing MySQL instances approach that number.
Postgres, however, uses a process-per-connection design, which is considerably more expensive than a thread per connection. Forking a new process takes much more memory than spawning a new thread. Furthermore, IPC between processes is far more expensive than between threads: instead of using lightweight futexes, Postgres 9.2 performs IPC via System V IPC primitives. Futexes are faster than System V IPC because, in the common case, a futex is uncontended and no context switch is needed.
Beyond the memory and IPC overhead, Postgres simply does not seem to handle large numbers of connections well, even when there is plenty of memory available. We ran into significant problems with just a few hundred active connections in Postgres. The Postgres documentation recommends an out-of-process connection pooling mechanism for handling large numbers of connections, though it does not explain in detail why, so we used pgbouncer to do connection pooling for Postgres. Even so, our back-end services occasionally had bugs that caused them to open more active connections than they should have, and those bugs caused extended downtime for us.
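A hedged sketch of a pgbouncer setup of the kind described above (database names, hosts, and limits are placeholders):
; pgbouncer.ini (illustrative)
[databases]
uber = host=127.0.0.1 port=5432 dbname=uber

[pgbouncer]
listen_port       = 6432
pool_mode         = transaction   ; reuse server connections per transaction
max_client_conn   = 2000          ; many application connections...
default_pool_size = 100           ; ...funneled into far fewer Postgres backends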
Postgres served us well in the early days of Uber, but we ran into scaling problems as the company grew. Today we still have a few legacy Postgres instances, but the bulk of our databases are built on top of MySQL (typically using the Schemaless layer) or, in some special cases, NoSQL databases like Cassandra.
That concludes the analysis of why Uber gave up Postgres and chose to migrate to MySQL. We hope it is of some help to you.