How DigitalOcean cut its database connections from 15,000 to fewer than 100

2025-04-28 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 Report

This article explains how DigitalOcean reduced the number of direct connections to its database from more than 15,000 to fewer than 100.

Where does it all start?

From the beginning, DigitalOcean has been obsessed with simplicity. It is one of our core values: strive for simple and elegant solutions. This applies not only to our products but also to our technical decisions, and nowhere was it more visible than in the initial system design.

Like GitHub, Shopify, and Airbnb, DigitalOcean started as a Rails application in 2011. The Rails application, internally known as Cloud, managed all user interactions in both the UI and the public API. It was assisted by two Perl services: Scheduler and DOBE (DigitalOcean BackEnd).

Scheduler scheduled and assigned Droplets to hypervisors, while DOBE was responsible for creating the actual Droplet virtual machines. While Cloud and Scheduler ran as stand-alone services, DOBE ran on every server in the fleet.

Cloud, Scheduler, and DOBE did not talk to one another directly. They communicated through a MySQL database. This database served two purposes: storing data and brokering communication. All three services used a single database table as a message queue to pass information.

Each time a user created a new Droplet, Cloud inserted a new event record into the queue. Scheduler polled the database every second for new Droplet events and scheduled their creation on an available hypervisor.

Finally, each DOBE instance waited for newly scheduled Droplets and carried out the task of creating them. To detect new changes, every one of these servers had to poll the database for new records in the table.
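The table-as-queue pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the shared MySQL instance, and the table layout, column names, and function names are hypothetical rather than taken from DigitalOcean's schema.

```python
import sqlite3

# In-memory SQLite stands in for the shared MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " droplet_name TEXT NOT NULL,"
    " status TEXT NOT NULL DEFAULT 'new',"
    " hypervisor TEXT)"
)


def enqueue_create(droplet_name):
    """Producer side (Cloud): insert a new event row into the queue table."""
    cur = conn.execute(
        "INSERT INTO events (droplet_name) VALUES (?)", (droplet_name,)
    )
    conn.commit()
    return cur.lastrowid


def poll_new_events():
    """Consumer side (Scheduler): one polling pass over unprocessed rows."""
    return conn.execute(
        "SELECT id, droplet_name FROM events WHERE status = 'new' ORDER BY id"
    ).fetchall()


def schedule(event_id, hypervisor):
    """Mark an event as scheduled on a hypervisor so DOBE can pick it up."""
    conn.execute(
        "UPDATE events SET status = 'scheduled', hypervisor = ? WHERE id = ?",
        (hypervisor, event_id),
    )
    conn.commit()


def poll_scheduled(hypervisor):
    """Consumer side (DOBE on one host): find work assigned to this host."""
    return conn.execute(
        "SELECT id, droplet_name FROM events"
        " WHERE status = 'scheduled' AND hypervisor = ? ORDER BY id",
        (hypervisor,),
    ).fetchall()
```

In a real deployment each `poll_*` function would run in a loop with a sleep between passes, which is exactly why every additional poller adds steady load to the database whether or not there is new work.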

In terms of system design, infinite loops and a direct database connection for every server may have been rudimentary, but the approach was simple and it worked, especially for an understaffed technical team facing tight deadlines and a fast-growing user base.

For four years, the database message queue formed the backbone of the DigitalOcean technology stack. During this period, we adopted a microservice architecture, replaced HTTPS with gRPC for internal traffic, and replaced Perl with Golang for the back-end services. However, all roads still led to that MySQL database.

It is important to note that being legacy does not make something dysfunctional or mean it should be replaced. Bloomberg and IBM have legacy services written in Fortran and COBOL that generate more revenue than entire companies. On the other hand, every system has a scaling limit, and we were about to face ours.

From 2012 to 2016, DigitalOcean's user traffic grew by more than 10,000%. We added more products to our catalog and more services to our infrastructure. This increased the number of events on the database message queue.

The increased demand for Droplets meant that Scheduler was working overtime to assign all of them to servers. Unfortunately for Scheduler, the number of available servers was not static.

To keep up with the growing demand for Droplets, we added more and more servers to handle the traffic. Each new hypervisor meant another persistent connection to the database. By early 2016, the database had more than 15,000 direct connections, each querying for new events every one to five seconds.
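To put those figures in perspective, a rough back-of-the-envelope calculation helps. It assumes, for simplicity, that every connection polls at a fixed interval and that polls are spread evenly over time; the function name is illustrative.

```python
def queries_per_second(connections, poll_interval_seconds):
    """Average query rate when every connection polls once per interval."""
    return connections / poll_interval_seconds


CONNECTIONS = 15_000  # figure quoted in the text

fastest = queries_per_second(CONNECTIONS, 1)  # every connection polls each second
slowest = queries_per_second(CONNECTIONS, 5)  # every connection polls every 5 seconds
print(f"{slowest:.0f} to {fastest:.0f} polling queries per second")
```

Under these assumptions the database was answering somewhere between 3,000 and 15,000 polling queries per second before doing any useful work.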

As if that were not bad enough, the SQL query that each hypervisor used to fetch new Droplet events had grown more and more complex. It had become a colossus of more than 150 lines, joining across 18 tables. It was as impressive as it was precarious and difficult to maintain.

Unsurprisingly, it was around this period that the cracks began to show. A single point of failure with thousands of dependencies competing for shared resources inevitably led to periods of chaos. Table locks and query backlogs caused outages and performance drops.

And because of the tight coupling in the system, there was no clear or simple fix. Cloud, Scheduler, and DOBE were all bottlenecks. Patching only one or two components would only shift the load to the remaining ones. So, after much deliberation, the engineers came up with a three-pronged plan:

Reduce the number of direct connections to the database.

Refactor Scheduler's ranking algorithm to improve availability.

Relieve the database of its message queue responsibilities.

Start refactoring

To tackle the database dependencies, DigitalOcean engineers created Event Router. Event Router served as a regional proxy that polled the database on behalf of each DOBE instance in each data center. Instead of thousands of servers each querying the database, only a handful of proxies would do the querying.

Each Event Router proxy fetched all the active events in a specific region and delegated each event to the appropriate hypervisor. Event Router also broke up the massive polling query into ones that were smaller and easier to maintain.
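The proxy idea can be sketched as follows. This is a hypothetical illustration, not the actual Event Router: the class name, the `fetch_events` callable, the event fields, and the region and hypervisor names are all assumptions made for the example.

```python
from collections import defaultdict


class EventRouter:
    """Hypothetical regional proxy: one poll on behalf of many hypervisors."""

    def __init__(self, region, fetch_events):
        self.region = region
        # fetch_events(region) is an assumed callable that runs the single query.
        self.fetch_events = fetch_events
        self.polls = 0

    def poll_once(self):
        """Run one poll and group the active events by target hypervisor."""
        self.polls += 1
        routed = defaultdict(list)
        for event in self.fetch_events(self.region):
            routed[event["hypervisor"]].append(event)
        return dict(routed)


# Fake event source standing in for the database query (illustrative data only).
def fake_fetch(region):
    events = [
        {"id": 1, "region": "nyc", "hypervisor": "hv-01"},
        {"id": 2, "region": "nyc", "hypervisor": "hv-02"},
        {"id": 3, "region": "ams", "hypervisor": "hv-09"},
        {"id": 4, "region": "nyc", "hypervisor": "hv-01"},
    ]
    return [e for e in events if e["region"] == region]


router = EventRouter("nyc", fake_fetch)
deliveries = router.poll_once()
```

The point of the design is that the number of database connections now scales with the number of regions rather than with the number of hypervisors: two hypervisors received work here, but the database saw a single query.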

When Event Router went live, it slashed the number of database connections from more than 15,000 to fewer than 100.

Next, the engineers set their sights on the next target: Scheduler. As mentioned earlier, Scheduler was a Perl script that determined which hypervisor would host a new Droplet. It did this by using a series of queries to rank and sort the servers. Every time a user created a Droplet, Scheduler updated the table row with the best machine.

Simple as it sounds, Scheduler had several drawbacks. Its logic was complex and hard to work with. It was single-threaded, and its performance suffered during peak traffic. Finally, there was only one Scheduler instance, and it had to serve the entire fleet, which made it an unavoidable bottleneck. To solve these problems, the engineering team created Scheduler V2.

The updated Scheduler completely overhauled the ranking system. Instead of querying the database for server metrics, it aggregated them from the hypervisors and stored them in its own database. In addition, the Scheduler team used concurrency and replication to keep the new service performant under load.
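The article does not describe the actual ranking criteria, so the following is only a sketch of the general shape: hypervisors report metrics, the scheduler keeps them in its own store, and ranking runs against that store instead of the shared database. The metric names and the ordering rule (most free memory first, then fewest Droplets) are assumptions.

```python
class MetricsStore:
    """Hypothetical local store that the scheduler fills from hypervisor reports."""

    def __init__(self):
        self._latest = {}

    def report(self, hypervisor, free_memory_mb, droplet_count):
        """Record the most recent metrics reported by one hypervisor."""
        self._latest[hypervisor] = {
            "free_memory_mb": free_memory_mb,
            "droplet_count": droplet_count,
        }

    def rank(self, required_memory_mb):
        """Return hypervisors that fit the request, best candidate first."""
        candidates = [
            (name, metrics)
            for name, metrics in self._latest.items()
            if metrics["free_memory_mb"] >= required_memory_mb
        ]
        # Assumed ordering: most free memory first, then fewest Droplets.
        candidates.sort(
            key=lambda item: (-item[1]["free_memory_mb"], item[1]["droplet_count"])
        )
        return [name for name, _ in candidates]
```

Because ranking reads only from the scheduler's own store, several scheduler replicas could each hold a copy and rank concurrently without adding load to the central database.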

Event Router and Scheduler V2 were both great successes and addressed many of the architectural weaknesses. Even so, one glaring flaw remained. In early 2017, the centralized MySQL message queue was still in use, bustling even. It was handling up to 400,000 new records per day and 20 updates per second.

Unfortunately, removing the message queue from the database was no easy feat. The first step was to prevent services from accessing it directly. The database needed an abstraction layer, and that layer needed an API to aggregate requests and perform queries on the services' behalf. If any service wanted to create a new event, it would have to do so through the API. And so Harpoon was born.
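An abstraction layer of this kind can be pictured as a small facade that owns all writes to the event store. The sketch below is hypothetical and is not Harpoon's real interface: the class name, method name, event fields, and the set of allowed event types are invented for illustration, and a plain list stands in for the database.

```python
class EventAPI:
    """Hypothetical facade: the only component allowed to touch the event store."""

    ALLOWED_TYPES = {"create", "destroy", "resize"}  # illustrative event types

    def __init__(self, store):
        self._store = store  # list-like stand-in for the database table
        self._next_id = 1

    def create_event(self, service, event_type, droplet_id):
        """Validate a request and write the event on the caller's behalf."""
        if event_type not in self.ALLOWED_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        event = {
            "id": self._next_id,
            "service": service,
            "type": event_type,
            "droplet_id": droplet_id,
        }
        self._next_id += 1
        self._store.append(event)
        return event["id"]
```

Once every service writes through a single entry point like `create_event`, the storage behind it can be swapped out without touching the callers, which is what makes the later move away from the database queue possible.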

Getting buy-in takes longer than you think

However, building the interface for the event queue was the easy part. Getting buy-in from other teams proved more difficult. Integrating with Harpoon meant that teams had to give up their database access, rewrite parts of their code base, and ultimately change how they had always done things. That was not an easy sell.

Team by team and service by service, the Harpoon engineers migrated the entire code base onto their new platform. It took about a year, but by the end of 2017, Harpoon was the sole publisher to the database message queue.

Now the real work began. Having full control of the event system meant that Harpoon was free to redesign the Droplet workflow.

Harpoon's first task was to move the message queue responsibilities out of the database and into itself. To do this, Harpoon created its own internal message queue, made up of RabbitMQ and asynchronous workers. As Harpoon pushed new events onto one side of the queue, the workers pulled them off the other.

Because RabbitMQ replaced the database queue, the workers were free to communicate directly with Scheduler and Event Router. So instead of Scheduler V2 and Event Router polling the database for new changes, Harpoon pushed updates to them directly. At the time of writing in 2019, this is where the Droplet event architecture stands.
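The shift from polling to pushing can be illustrated with a minimal producer and worker. This sketch uses Python's standard-library `queue.Queue` as a stand-in for RabbitMQ and plain lists as stand-ins for the downstream consumers (such as Scheduler and Event Router); the event fields and function names are assumptions.

```python
import queue
import threading

event_queue = queue.Queue()  # stand-in for the RabbitMQ queue (assumption)
received = {"scheduler": [], "event_router": []}  # stand-ins for the consumers
SENTINEL = None  # tells the worker to stop


def publish(event):
    """Producer side: push a new event onto one end of the queue."""
    event_queue.put(event)


def worker():
    """Worker side: pull events off the queue and push them to their consumer."""
    while True:
        event = event_queue.get()
        if event is SENTINEL:
            break
        # Hand the update to the target directly; nothing polls a database.
        received[event["target"]].append(event["droplet"])


t = threading.Thread(target=worker)
t.start()
publish({"target": "scheduler", "droplet": "web-1"})
publish({"target": "event_router", "droplet": "web-1"})
publish(SENTINEL)
t.join()
```

The worker blocks on `get()` until an event arrives, so consumers receive updates as they happen and idle periods cost nothing, in contrast to a polling loop that queries on every tick.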

In the past seven years, DigitalOcean has grown from its garage-band roots into the established cloud provider it is today. Like other technology companies in transition, DigitalOcean regularly deals with legacy code and technical debt. Whether breaking apart monoliths, creating multi-regional services, or removing single points of failure, DigitalOcean engineers are always working toward elegant and simple solutions.
