How do queries over 1.3 trillion rows of data respond in milliseconds?


As the largest knowledge-sharing platform in China, we currently have 220 million registered users, 30 million questions, and more than 130 million answers on the site.

As the user base grew, so did the data size of our application. Our Moneta application stores about 1.3 trillion rows of data (the posts that users have read).

With roughly 100 billion new rows accumulating each month, and that rate still growing, the total will reach 3 trillion within about two years. We therefore face serious challenges in scaling the backend while maintaining a good user experience.

In this article, I'll delve into how we maintain millisecond query response times over this much data, and how TiDB, an open-source, MySQL-compatible NewSQL Hybrid Transactional/Analytical Processing (HTAP) database, gives us the support to gain real-time insights into our data.

I'll cover why we chose TiDB, how we use it, what we've learned, good practices, and some ideas for the future.

Our pain points

This section describes the architecture of our Moneta application, the ideal architecture we are trying to build, and database scalability as our main pain point.

System architecture requirements

The well-known Post Feed service is a key system through which users are exposed to content posted on Zhihu.

The Moneta app on the back end stores posts that users have read and filters them out of the stream of posts on the recommended pages.

The Moneta application has the following characteristics:

·Need for highly available data: Post Feed is the first screen users see and plays an important role in driving user traffic to Zhihu.

·Handling huge write volumes: for example, more than 40,000 records are written per second at peak hours, and nearly 3 billion new records are added per day.

·Long-term storage of historical data: Currently, approximately 1.3 trillion records are stored in the system. With about 100 billion records accumulating each month and growing, historical data will reach 3 trillion records in about two years.

·Processing high-throughput queries: During peak hours, the system processes queries executed on an average of 12 million posts per second.

·Query response times limited to 90 milliseconds or less: this applies even to the long-tail queries that take the longest to execute.

·Tolerance of false positives: this means the system can still surface plenty of interesting posts for users even if some posts are incorrectly filtered out (see the sketch below).
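The article doesn't show the filtering code itself, so the following is only a minimal sketch of the read-filtering step described above, with hypothetical names and a plain Python set standing in for the store of read posts; the tolerance for false positives is what would allow a probabilistic structure such as a Bloom filter to replace the exact lookup.

```python
# Hypothetical sketch of the read-post filtering step described above.
# In the real Moneta service the "read" records live in the database; a plain
# Python set stands in for that lookup here. Because the product tolerates
# false positives, a probabilistic structure (e.g. a Bloom filter) could
# replace the exact set to save memory, at the cost of occasionally hiding an
# unread post.

def filter_unread(candidate_post_ids, read_post_ids):
    """Return candidate posts the user has not read yet, preserving order."""
    return [pid for pid in candidate_post_ids if pid not in read_post_ids]

if __name__ == "__main__":
    candidates = [101, 102, 103, 104, 105]   # posts proposed by the recommender
    already_read = {102, 105}                # looked up from the read-post store
    print(filter_unread(candidates, already_read))  # -> [101, 103, 104]
```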

Given these facts, we need an application architecture that:

·High availability: it is a poor user experience if a user opens the recommendation page and finds a large number of posts they have already read.

·Excellent system performance: our application requires high throughput and has stringent response-time requirements.

·Easy to scale: as our business grows and our applications evolve, we want the system to scale easily.

Exploration

To build the ideal architecture with the above capabilities, we integrated three key components into the previous architecture:

·Proxy: forwards user requests to available nodes and ensures high availability of the system.

·Cache: temporarily handles requests in memory, so we don't always have to go to the database. This improves system performance. (A minimal read-through sketch follows this list.)

·Storage: before TiDB, we managed our business data on standalone MySQL. As data volumes exploded, standalone MySQL was no longer enough.
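As a hedged illustration of the cache layer above (not Zhihu's actual code), here is a minimal read-through pattern using the redis-py client: serve a request from Redis when the entry exists, otherwise fall back to the database and populate the cache. The key format and TTL are assumptions.

```python
import json

import redis  # assumes the redis-py package; any cache client works similarly

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # assumed expiry; tune to the workload


def get_read_posts(user_id, fetch_from_db):
    """Read-through cache: try Redis first, fall back to the database."""
    key = f"read_posts:{user_id}"   # hypothetical key format
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    rows = fetch_from_db(user_id)   # the storage layer is hit only on a miss
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(rows))
    return rows
```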

We then adopted a MySQL sharding plus Master High Availability Manager (MHA) solution, but this solution could not cope when 100 billion new records were flooding into our database each month.

Disadvantages of MySQL Sharding and MHA

MySQL sharding and MHA are not a good solution because both have significant drawbacks.

Disadvantages of MySQL sharding:

·Application code becomes complex and difficult to maintain (see the routing sketch after this list).

·Changing an existing sharding key is cumbersome.

·Upgrading application logic affects application availability.
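To make the first drawback concrete, here is a minimal sketch, under assumed shard counts, keys, and connection strings, of the routing logic that application-side MySQL sharding forces into every data-access path; none of it comes from the Moneta codebase.

```python
# Minimal illustration of application-side MySQL sharding; the shard key,
# shard count, and connection strings are assumptions for the example.

SHARD_DSNS = [
    "mysql://app@mysql-shard-0/moneta",
    "mysql://app@mysql-shard-1/moneta",
    "mysql://app@mysql-shard-2/moneta",
    "mysql://app@mysql-shard-3/moneta",
]


def shard_for(user_id: int) -> str:
    """Route by user_id: the application has to know the cluster topology."""
    return SHARD_DSNS[user_id % len(SHARD_DSNS)]


# Every data-access path repeats this routing step, and changing the shard key
# (say, from user_id to post_id) or the shard count means rewriting it and
# redistributing the existing rows.
print(shard_for(220_000_001))  # -> "mysql://app@mysql-shard-1/moneta"
```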

Disadvantages of MHA:

·We need to script or use third-party tools to implement virtual IP (VIP) configuration.

·MHA monitors only the master database.

·To configure MHA, we need to configure passwordless Secure Shell (SSH), which could introduce security risks.

·MHA does not provide read load balancing for slave servers.

·MHA can only monitor the availability of the master server, not the slave servers.

Until we discovered TiDB and migrated our data from MySQL to TiDB, database scalability remained the weak point of the entire system.

What is TiDB?

The TiDB platform is a set of components that, when used together, become a NewSQL database with HTAP functionality.

TiDB Platform Architecture

Inside the TiDB platform, the main components are as follows:

·The TiDB server is a stateless SQL layer that processes users' SQL queries, accesses data in the storage layer, and returns the corresponding results to the application. It is MySQL-compatible and sits on top of TiKV.

·The TiKV server is the distributed transactional key-value storage layer where the data persists. It uses the Raft consensus protocol for replication to ensure strong data consistency and high availability.

·The TiSpark cluster also sits on top of TiKV. It is an Apache Spark plugin that works with the TiDB platform to support complex online analytical processing (OLAP) queries for business intelligence (BI) analysts and data scientists.

·The Placement Driver (PD) server is a metadata cluster backed by etcd that manages and schedules TiKV.

In addition to these main components, TiDB also has an ecosystem of tools, such as Ansible scripts for rapid deployment, Syncer and TiDB Data Migration for migrating from MySQL, and TiDB Binlog, which collects the logical changes made to a TiDB cluster and provides incremental backup and replication to downstream systems (TiDB, Kafka, or MySQL).

The main functions of TiDB include:

·Horizontal scalability.

·MySQL-compatible syntax.

·Distributed transactions with strong consistency.

·Cloud-native architecture.

·Minimal extract, transform, load (ETL) thanks to HTAP.

·Fault tolerance and Raft recovery.

·Online schema changes.

How we use TiDB

In this section, I'll show you how to run TiDB in Moneta's architecture, along with performance metrics for Moneta applications.

TiDB in our architecture

TiDB Architecture in Moneta Applications

We deployed TiDB in the system, and the overall architecture of the Moneta application changed to:

·Top layer: stateless and scalable client APIs and proxies. These components are easy to scale out.

·Middle tier: soft-state components and a layered Redis cache as the main components. If service is interrupted, these components can restore service by recovering the data saved in the TiDB cluster.

·Bottom layer: the TiDB cluster stores all stateful data. Its components are highly available, and it can recover its services on its own if a node crashes.

In this system, all components are self-recoverable, and the whole system has a global fault-monitoring mechanism. We use Kubernetes to orchestrate the entire system and ensure high availability of the service as a whole.

TiDB performance indicators

With TiDB in production, our system is highly available and easy to scale, and system performance has improved significantly. As an example, here is a set of performance metrics for the Moneta application from June 2019.

Writing 40,000 rows of data per second at peak times:

Rows of data written per second (thousands)

Checking 30,000 queries and 12 million posts per second during peak hours:

Queries per second during peak hours

The 99th percentile response time is approximately 25 milliseconds, and the 99.9th percentile response time is approximately 50 milliseconds. In fact, the average response time is much lower than these numbers, even for the long-tail queries that require stable response times.

99th percentile response time

99.9th percentile response time

What we learned

Our migration to TiDB didn't go entirely smoothly, and here we'd like to share some lessons learned.

Import data faster

We used TiDB Data Migration (DM) to collect the incremental MySQL binlog files and then used TiDB Lightning to quickly import the full data into the TiDB cluster.

To our surprise, it took only four days to import these 1.1 trillion records into TiDB. If we had written the data into the system logically, it could have taken a month or more. With more hardware resources, we could have imported the data even faster.

Reduce query latency

After the migration, we tested a small amount of read traffic. When the Moneta application first went live, we found that query latency did not meet our requirements. To address the latency issues, we worked with PingCAP engineers to tune system performance.

In the process, we have accumulated valuable data and data processing knowledge:

·Some queries are sensitive to query latency and others are not. We deployed a separate TiDB database to handle the latency-sensitive queries, while other, non-latency-sensitive queries are processed in a different TiDB database.

This way, large queries and latency-sensitive queries are processed in different databases, and the execution of the former does not affect the latter.

·For queries that did not have an ideal execution plan, we wrote SQL hints to help the execution engine select a better plan (a hedged sketch follows this list).

·We used low-precision Timestamp Oracle (TSO) and prepared statements to reduce network round trips.
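The article doesn't reproduce the hints we wrote, so the following is only a sketch of where such a hint sits in a query, using pymysql as an assumed MySQL-compatible driver and hypothetical table and index names; USE_INDEX is one of TiDB's comment-style optimizer hints.

```python
import pymysql  # assumed MySQL-compatible driver; TiDB speaks the MySQL protocol

conn = pymysql.connect(host="tidb.internal", port=4000, user="moneta",
                       password="***", database="moneta")

# Hypothetical table and index names. The /*+ ... */ comment is a TiDB
# optimizer hint nudging the execution engine toward a specific index when the
# default plan is not ideal.
SQL = """
SELECT /*+ USE_INDEX(read_posts, idx_user_created) */ post_id
FROM read_posts
WHERE user_id = %s AND created_at >= %s
"""

with conn.cursor() as cur:
    cur.execute(SQL, (42, "2019-06-01 00:00:00"))
    rows = cur.fetchall()

# Low-precision TSO and server-side prepared statements are enabled through
# TiDB session variables and driver settings not shown here; this sketch only
# shows where a hint sits in the query text.
```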

Assess resources

Before we tried TiDB, we didn't analyze how much hardware resources we needed to support the same amount of data on the MySQL side.

To reduce maintenance costs, we had deployed MySQL in a single-master, single-slave topology. In contrast, the Raft protocol implemented in TiDB requires at least three replicas.

Therefore, we needed more hardware resources to support the business data in TiDB, and we needed to prepare machine resources in advance (see the rough calculation below).
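As a rough, back-of-the-envelope illustration of why TiDB needs more machines for the same data (the 30 TB figure is an assumption, not our measured footprint):

```python
# Back-of-the-envelope raw-storage comparison with illustrative numbers only.
logical_data_tb = 30       # assumed logical data size in TB
mysql_copies = 2           # one master plus one slave
tidb_raft_replicas = 3     # Raft needs at least three replicas

print(logical_data_tb * mysql_copies)        # 60 TB of raw space on MySQL
print(logical_data_tb * tidb_raft_replicas)  # 90 TB of raw space on TiDB,
# before accounting for compression differences between the two stacks
```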

Once our data center was set up correctly, we were able to complete the evaluation of TiDB quickly.

TiDB 3.0 expectations

Basically, the anti-spam and Moneta applications share the same architecture. We tried Titan and table partitioning from the release-candidate versions of TiDB 3.0 (TiDB 3.0.0-rc.1 and TiDB 3.0.0-rc.2) in our anti-spam application with production data.

Titan reduced latency

The anti-spam application had been plagued by severe query and write latency.

We had heard that TiDB 3.0 would introduce Titan, a key-value storage engine that reduces write amplification in RocksDB (the underlying storage engine in TiKV) when large values are used. To try it out, we enabled Titan after TiDB 3.0.0-rc.2 was released.

The following graphs show the write and query latencies of RocksDB and Titan, respectively:

Writing and query latency in RocksDB and Titan

The statistics show that both write and query latency dropped dramatically after we enabled Titan. This was truly amazing; when we saw the numbers, we could hardly believe our eyes.

Table partitioning improves query performance

We also used TiDB 3.0's table partitioning feature in our anti-spam application. With this feature, we can divide a table into partitions by time.

When a query arrives, it is executed only on the partitions that cover the target time range. This greatly improves our query performance.
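As a minimal sketch of this idea (the schema is hypothetical and the syntax follows MySQL-style range partitioning, which TiDB supports), the table below is partitioned by an integer day bucket, so a query restricted to the last two days only touches the matching partitions:

```python
import pymysql  # assumed MySQL-compatible driver

conn = pymysql.connect(host="tidb.internal", port=4000, user="antispam",
                       password="***", database="antispam")

# Hypothetical schema: rows carry an integer "day bucket" so the table can be
# range-partitioned on time. A query that filters on recent buckets only
# touches the partitions covering that range (partition pruning).
DDL = """
CREATE TABLE IF NOT EXISTS spam_events (
    event_id   BIGINT NOT NULL,
    day_bucket INT    NOT NULL,
    payload    VARCHAR(255),
    PRIMARY KEY (event_id, day_bucket)
)
PARTITION BY RANGE (day_bucket) (
    PARTITION p18050 VALUES LESS THAN (18051),
    PARTITION p18051 VALUES LESS THAN (18052),
    PARTITION p18052 VALUES LESS THAN (18053),
    PARTITION pmax   VALUES LESS THAN MAXVALUE
)
"""

QUERY = "SELECT COUNT(*) FROM spam_events WHERE day_bucket >= %s"  # last 48 hours

with conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(QUERY, (18051,))
    print(cur.fetchone())
```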

Let's consider what happens if we implement TiDB 3.0 in Moneta and anti-spam applications in the future.

TiDB 3.0 in the Moneta app

TiDB 3.0 has features such as gRPC message batching, a multi-threaded Raftstore, SQL plan management, and TiFlash. We believe these will be a great boon to the Moneta application.

Batch messages in gRPC and multithreaded Raftstore

Moneta's write throughput exceeds 40,000 transactions per second (TPS). TiDB 3.0 can send and receive Raft messages in batches and can process Region Raft logic in multiple threads. We believe these features will significantly improve our system's concurrency.

SQL Plan Management

As mentioned above, we wrote a number of SQL hints to enable the query optimizer to select the best execution plan.

TiDB 3.0 adds a SQL plan management feature that binds queries to specific execution plans directly in the TiDB server. With this feature, we do not need to modify the query text to inject hints.
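To the best of my understanding of TiDB 3.0's SQL plan management (treat the exact syntax as an assumption to verify against the TiDB documentation), a binding pairs the unmodified query text with a hinted version, roughly as in this sketch with hypothetical table and index names:

```python
import pymysql  # assumed MySQL-compatible driver

conn = pymysql.connect(host="tidb.internal", port=4000, user="moneta",
                       password="***", database="moneta")

# Sketch of a plan binding (table and index names hypothetical): the binding
# pins the hinted plan to the unmodified query text, so the application does
# not have to carry the /*+ ... */ hint in its own SQL.
BINDING = """
CREATE GLOBAL BINDING FOR
    SELECT post_id FROM read_posts WHERE user_id = 42
USING
    SELECT /*+ USE_INDEX(read_posts, idx_user_created) */ post_id
    FROM read_posts WHERE user_id = 42
"""

with conn.cursor() as cur:
    cur.execute(BINDING)
```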

TiFlash

At TiDB DevCon 2019, I first heard about TiFlash, an extended analytics engine for TiDB.

It uses column-oriented storage to achieve high data compression ratios and applies the extended Raft consensus algorithm in data replication to ensure data safety.

Because we have massive amounts of data with high write throughput, we cannot use ETL every day to replicate the data to Hadoop for analysis. But with TiFlash, we are optimistic that we will be able to analyze our massive data volumes easily.

TiDB 3.0 in the anti-spam app

Compared with the Moneta application and its huge volume of historical data, the anti-spam application has even higher write throughput.

However, it queries only data stored in the last 48 hours. In this application, 8 billion records and 1.5 terabytes of data are added daily.

Because TiDB 3.0 can send and receive Raft messages in batches, and because it can handle Region Raft logic in multiple threads, we can manage our application with fewer nodes.

Previously, we used seven physical nodes, but now we need only five. These features improve performance even though we use commodity hardware.

What's next

TiDB is a MySQL-compatible database, so we can use it just as we use MySQL.
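As a small illustration of that compatibility (pymysql is an arbitrary driver choice, and the host and credentials are placeholders), an ordinary MySQL client stack can point at TiDB's SQL port without code changes:

```python
import pymysql  # any MySQL driver should work; TiDB speaks the MySQL wire protocol

# Point an ordinary MySQL connection at a TiDB server; 4000 is TiDB's default
# SQL port, and the host, credentials, and schema here are placeholders.
conn = pymysql.connect(host="tidb.internal", port=4000, user="app",
                       password="***", database="moneta", charset="utf8mb4")

with conn.cursor() as cur:
    cur.execute("SELECT tidb_version()")  # TiDB-specific helper; ordinary MySQL
    print(cur.fetchone())                 # statements work the same way
```

The only TiDB-specific details here are the default SQL port, 4000, and the tidb_version() helper used to confirm which server we are talking to.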

Thanks to TiDB's horizontal scalability, we are now free to scale our database even though we have over a trillion records to cope with.

So far, we have used quite a bit of open source software in our applications. We also learned a lot about using TiDB to handle system problems.

We have decided to take part in developing open source tools and in the long-term development of the community. Through our joint efforts with PingCAP, TiDB will become even stronger.

Author: Sun Xiaoguang

Profile: Sun Xiaoguang is the head of Zhihu's search backend, currently responsible for Zhihu's search backend architecture design and engineering team management. He has worked on private-cloud product development for many years, focuses on cloud-native technology, and is a committer on the TiKV project.

Source: itindex.net/

Original link: dzone.com/articles/lesson-learned-from-queries-over-13-trillion-rows-1
