This article introduces how to use Cassandra to store hundreds of millions of messages online every day. Many teams run into exactly this situation in practice, so let the editor walk you through how to handle it. I hope you read it carefully and get something out of it!
Discord and its user-generated content are growing much faster than we expected. As more and more users join, they bring more and more chat messages.
In July 2016, there were about 40 million messages a day.
In December 2016, there were more than 100 million per day.
As of this writing (January 2017), there were more than 120 million messages a day.
We decided early on to keep the chat history of all users permanently so that users can find their data on any device at any time. This is a growing amount of data with high concurrent access and needs to be kept highly available.
How did we solve this? Our answer was to choose Cassandra as the database.
What we are doing
The original version of Discord was built in just two months in 2015. At that stage, MongoDB was one of the best databases for fast iteration. All Discord data was stored in a single MongoDB cluster, but we deliberately designed everything so that all data could easily be migrated to a new database (we were not going to use MongoDB sharding because it is complicated to operate and not known for stability).
This is actually part of our engineering culture: build quickly to prove out a product feature, but always leave a path to a more robust solution.
Messages were stored in MongoDB with a single compound index on channel_id and created_at. Around November 2015, when about 100 million messages had been stored, the problems we expected began to appear: the index and data no longer fit in RAM and latency became unpredictable. It was time to migrate to a database better suited to the task.
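For readers unfamiliar with that setup, here is a minimal, hypothetical sketch of such a compound index and the typical read path using PyMongo; the database and collection names and the example channel ID are assumptions, not Discord's actual code.

```python
# Hypothetical sketch of the compound index described above, using PyMongo.
# Database/collection names and the example channel ID are assumptions.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
messages = client["discord"]["messages"]

# A single compound index over (channel_id, created_at) supports
# "recent messages in a channel" queries.
messages.create_index([("channel_id", ASCENDING), ("created_at", ASCENDING)])

# Typical read path: the newest 50 messages in a channel.
recent = (
    messages.find({"channel_id": 123456789})
    .sort("created_at", -1)
    .limit(50)
)
print(list(recent))
```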
Choosing the right database
Before selecting a new database, we had to understand our read/write patterns and why the existing solution was becoming a problem.
It quickly became clear that our reads were extremely random and that our read/write ratio was about 50/50.
Voice chat servers: they send only a handful of messages every few days; a server like this is unlikely to reach 1,000 messages in a year. The problem is that even this small amount of data is hard to serve efficiently: returning 50 messages to a single user can cause many random seeks on disk and evict other data from the disk cache.
Private text chat servers: they send a decent number of messages, easily reaching 100,000 to 1 million messages a year, and the data requested is usually recent. The problem is that these servers typically have few members, so the data is requested infrequently, is scattered, and is unlikely to be in the disk cache.
Large public chat servers: they send a huge number of messages. Thousands of members send thousands of messages a day, easily adding up to millions of messages a year. They almost always request messages from the last hour, and they request them often, so the data is usually in the disk cache.
We also expected to give users even more ways to read data randomly in the coming year: viewing your mentions from the last 30 days and jumping to that point in history, viewing and jumping to pinned messages, and full-text search. All of this means even more random reads!
Let's define the requirements:
Linear scalability - We don't want to have to reconsider the solution in a few months or manually re-shard the data.
Automatic failover - We don't want our sleep interrupted; when something goes wrong, the system should heal itself as much as possible.
Low maintenance - It should just work once it's set up, and as the data grows we should only need to add more machines.
Proven technology - We like trying new technology, but not technology that is too new.
Predictable performance - We don't want to be alerted because the API's 95th-percentile response time has gone above 80ms, and we don't want to have to add Redis or Memcached caching of messages.
Not a blob store - We write thousands of messages per second, and we don't want to have to read and deserialize blobs and append to them before writing data back.
Open source - We want to control our own destiny and not depend on a third party.
Cassandra was the only database that met all of the requirements above.
We can add nodes to scale it, the process has no impact on the application, and the loss of nodes can be tolerated. Large companies such as Netflix and Apple run thousands of Cassandra nodes. Related data is stored contiguously on disk, which minimizes seeks and makes the data easy to distribute around the cluster. It is backed by DataStax, but remains open source and community-driven.
After making a choice, we need to prove that it is actually feasible.
Data model
The best way to describe Cassandra to a newcomer is as a KKV store, where the two Ks together form the primary key.
The first K is the partition key, which determines which node the data is stored on and where it is located on disk. A partition contains many rows, and the position of a row within the partition is determined by the second K, the clustering key. The clustering key acts as a primary key inside the partition and controls how the rows are sorted, so you can think of a partition as an ordered dictionary. Together, these properties allow for very powerful data modeling.
As mentioned earlier, messages were indexed in MongoDB by channel_id and created_at. channel_id became the partition key, because all message queries operate on a channel, but created_at did not make a good clustering key, since two messages can have the same creation time.
Fortunately, every ID on Discord is produced by a generator similar to Twitter's Snowflake [1] and is therefore roughly ordered by time, so we could use the message ID instead. The primary key became (channel_id, message_id), with message_id generated by the Snowflake generator. When loading a channel we can tell Cassandra exactly the range of message IDs to scan.
The following is a simplified schema of our message table.
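The original schema listing is not reproduced in this article, so here is a minimal sketch of what such a table could look like, expressed as CQL and created through the DataStax Python driver; the keyspace name, contact point, and the non-key columns (author_id, content) are assumptions for illustration.

```python
# A minimal sketch of a simplified messages table, assuming the
# (channel_id, message_id) primary key described above. The keyspace,
# contact point, and non-key columns are illustrative assumptions.
from cassandra.cluster import Cluster

CREATE_MESSAGES = """
CREATE TABLE IF NOT EXISTS messages (
    channel_id bigint,
    message_id bigint,
    author_id  bigint,
    content    text,
    PRIMARY KEY (channel_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
"""

cluster = Cluster(["127.0.0.1"])      # contact point (assumption)
session = cluster.connect("discord")  # keyspace name (assumption)
session.execute(CREATE_MESSAGES)
```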
Cassandra schemas are very different from relational database schemas: they are cheap to alter and altering them imposes no temporary performance impact. We get the best of a blob store and a relational store.
When we started importing our existing messages into Cassandra, we immediately saw warnings in the logs that partitions had grown larger than 100MB. What happened?! Cassandra advertises support for partitions up to 2GB! Apparently, just because something can be done doesn't mean it should be.
Large partitions put heavy GC pressure on Cassandra during compaction, cluster expansion, and other operations. A large partition also means its data cannot be distributed across the cluster. It was clear we had to bound partition size, because a single channel can exist for many years and keep growing.
We decided to bucket our messages by time. By analyzing the largest channels, we determined that about 10 days of messages per bucket would comfortably stay under 100MB. Buckets had to be derivable from the message_id or from a timestamp.
Cassandra partition keys can be compound, so our new primary key became ((channel_id, bucket), message_id).
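As a rough sketch of how a bucket could be derived, assuming Discord-style Snowflake IDs (a millisecond timestamp since a custom epoch stored in the high bits, shifted left by 22) and 10-day buckets; the epoch constant and helper names are illustrative assumptions, not Discord's actual code.

```python
# Sketch: deriving a time bucket from a Snowflake ID or a timestamp.
# Assumes Discord-style Snowflakes, where the millisecond timestamp since
# a custom epoch lives in the high bits (id >> 22). Names are illustrative.
DISCORD_EPOCH_MS = 1420070400000            # 2015-01-01 UTC (assumed epoch)
BUCKET_SIZE_MS = 1000 * 60 * 60 * 24 * 10   # ten days per bucket

def bucket_from_timestamp_ms(timestamp_ms: int) -> int:
    """Map a Unix timestamp in milliseconds to its 10-day bucket number."""
    return timestamp_ms // BUCKET_SIZE_MS

def make_bucket(snowflake_id: int) -> int:
    """Map a Snowflake ID (message_id or channel_id) to the same buckets."""
    timestamp_ms = (snowflake_id >> 22) + DISCORD_EPOCH_MS
    return bucket_from_timestamp_ms(timestamp_ms)
```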
To make querying the most recent messages easy, we generate a range of buckets from the current time back to the channel_id (which is also a Snowflake, and therefore older than the first message in the channel). Then we query the partitions in order until enough messages have been collected. The downside of this approach is that a rarely active channel has to walk multiple buckets to gather enough messages. In practice this has proved fine, because for an active channel the first bucket usually returns enough data.
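A hedged sketch of that walk, reusing the hypothetical make_bucket() helper above and assuming the table has been rebuilt with the ((channel_id, bucket), message_id) primary key and descending clustering on message_id; this is illustrative, not Discord's actual query code.

```python
# Sketch of collecting recent messages by walking buckets newest-to-oldest,
# reusing the hypothetical helpers above. Assumes the bucketed table with
# primary key ((channel_id, bucket), message_id) and message_id DESC order.
import time

def recent_messages(session, channel_id: int, limit: int = 50):
    rows = []
    newest_bucket = bucket_from_timestamp_ms(int(time.time() * 1000))
    oldest_bucket = make_bucket(channel_id)  # channel_id is also a Snowflake

    for bucket in range(newest_bucket, oldest_bucket - 1, -1):
        result = session.execute(
            "SELECT message_id, author_id, content FROM messages "
            "WHERE channel_id = %s AND bucket = %s LIMIT %s",
            (channel_id, bucket, limit - len(rows)),
        )
        rows.extend(result)
        if len(rows) >= limit:
            break
    return rows
```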
Importing our existing messages into Cassandra went smoothly, and we were ready to try it in production.
Dark launch
Introducing a new system into production is always scary, so it's best to test it without impacting users. We set up our code to double read/write to MongoDB and Cassandra.
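A purely illustrative sketch of what such a double read/write shim could look like; the class, method names, and store objects are hypothetical stand-ins, not Discord's actual code.

```python
# Illustrative sketch of double reading/writing during the migration.
# The store objects and their methods are hypothetical stand-ins.
import logging

log = logging.getLogger("migration")

class DualMessageStore:
    def __init__(self, mongo_store, cassandra_store):
        self.mongo = mongo_store          # still the source of truth
        self.cassandra = cassandra_store  # the candidate system

    def save(self, message: dict) -> None:
        self.mongo.save(message)
        try:
            self.cassandra.save(message)
        except Exception:
            # The new system must never break the user-facing write path.
            log.exception("cassandra write failed")

    def recent(self, channel_id: int, limit: int = 50):
        primary = self.mongo.recent(channel_id, limit)
        try:
            shadow = self.cassandra.recent(channel_id, limit)
            if shadow != primary:
                log.warning("read mismatch for channel %s", channel_id)
        except Exception:
            log.exception("cassandra read failed")
        return primary
```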
As soon as we turned this on, we started getting errors in our bug tracker telling us that author_id was null. How could it be null? It's a required field! Before explaining the problem, some background.
Eventual consistency
Cassandra is an AP database, meaning it trades strong consistency (C) for availability (A), which is exactly what we wanted. Read-before-write is an anti-pattern in Cassandra (reads are more expensive than writes), so everything Cassandra does is essentially an upsert, even if you only provide certain columns.
You can also write to any node, and conflicts are resolved automatically, per column, using a "last write wins" strategy. So how did this affect us?
Example of an edit/delete race condition
In the scenario where one user edits a message at the same time as another user deletes it, we end up with a row that is missing all of its data except the primary key and the text column, since every Cassandra write is an upsert.
There are two possible solutions to this problem:
Write the whole message back when editing. This risks resurrecting deleted messages and adds more chances for conflicts on concurrent writes to other columns.
Detect that the message row is corrupt and delete it from the database.
We chose the second option: we designated a required column (in this case author_id) and delete the message if that column is null when it is read.
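A sketch of that check at read time, again with hypothetical names and the bucketed table layout assumed above: if the required author_id column comes back null, the row is deleted.

```python
# Sketch: author_id is treated as required. If a read returns a row
# without it, the row is considered corrupt and is deleted.
def fetch_message(session, channel_id: int, bucket: int, message_id: int):
    row = session.execute(
        "SELECT message_id, author_id, content FROM messages "
        "WHERE channel_id = %s AND bucket = %s AND message_id = %s",
        (channel_id, bucket, message_id),
    ).one()

    if row is not None and row.author_id is None:
        # Corrupt row left behind by an edit/delete race: remove it.
        session.execute(
            "DELETE FROM messages "
            "WHERE channel_id = %s AND bucket = %s AND message_id = %s",
            (channel_id, bucket, message_id),
        )
        return None
    return row
```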
While handling this problem, we noticed that our writes were also very inefficient. Because Cassandra is designed to be eventually consistent, it cannot delete data immediately; the delete has to be replicated to the other nodes, even if some of them are temporarily unavailable.
Cassandra handles this by treating a delete as a special form of write called a "tombstone". On read, tombstones are simply skipped over. A tombstone lives for a configurable amount of time (10 days by default) and is permanently removed during compaction once that period has expired.
Deleting a column and writing null to a column are exactly the same thing: both produce a tombstone. And because every write in Cassandra is an upsert, even the very first insert of a null value generates a tombstone.
In practice, our message table has 16 columns, but the average message sets only about 4 of them. Most of the time we were inserting a new row and writing 12 tombstones into Cassandra for no reason.
The solution to this problem is simple: only write non-null values to the Cassandra database.
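A sketch of that fix, building the INSERT dynamically so that only columns with actual values are written; the helper name is an assumption and the non-key column names are illustrative.

```python
# Sketch: build the INSERT dynamically so that only columns with real
# values are written, and unset columns never generate tombstones.
# Column names are assumed to come from trusted application code.
def insert_message(session, message: dict) -> None:
    columns = [name for name, value in message.items() if value is not None]
    query = "INSERT INTO messages ({}) VALUES ({})".format(
        ", ".join(columns),
        ", ".join(["%s"] * len(columns)),
    )
    session.execute(query, tuple(message[name] for name in columns))

# Example: only the columns that have values are written.
insert_message(session, {
    "channel_id": 1,
    "bucket": 1715,
    "message_id": 2,
    "author_id": 3,
    "content": "hello",
    "attachment_url": None,   # omitted entirely instead of writing null
})
```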
Performance
Cassandra is known for having faster writes than reads, and that is exactly what we observed: writes were sub-millisecond and reads were under 5 milliseconds.
We monitored this over a week of testing, and performance remained consistent regardless of which data was being accessed. Unsurprisingly, we got exactly the database we expected.
(Figure: read/write latency monitored in Datadog)
Speaking of fast, consistent read performance, here is an example of jumping to a message from a year ago in a channel with millions of messages:
(Figure: performance of jumping back one year in a channel's chat history)
The big surprise
Everything went smoothly, so we made Cassandra our primary database and retired MongoDB within a week. It kept working well... until one day, about six months later, Cassandra suddenly became unresponsive. We noticed it was hitting 10-second stop-the-world GC pauses, but we had no idea why.
We started digging and found a channel that took 20 seconds to load. A public server called "Puzzles & Dragons Subreddit" was the culprit. Since it was open to everyone, we joined it to take a look.
To our surprise, the channel contained only a single message. We then learned that they had deleted millions of messages through our API, leaving just that one behind.
Remember how Cassandra handles deletes with tombstones (mentioned under Eventual consistency)? When a user loaded this channel, even though it held only one message, Cassandra had to scan past millions of message tombstones, generating garbage faster than the JVM could collect it.
We took the following measures to solve the problem (a sketch of the tombstone setting appears after the list):
We reduced the tombstone lifespan from 10 days to 2 days, because we run Cassandra repairs (an anti-entropy process) on our message cluster every night.
We changed our query code to track empty buckets and avoid them in the future for a given channel. This means that if a user triggers this query again, Cassandra will at worst be scanning only the most recent bucket.
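The tombstone lifespan corresponds to the per-table gc_grace_seconds option; a minimal sketch of lowering it to 2 days, reusing the driver session from the earlier sketches (table name assumed):

```python
# Sketch: tombstone lifespan is the per-table gc_grace_seconds option.
# Lowering it to 2 days (172800 seconds) is only safe because repairs
# run every night, well inside that window. Reuses the earlier session.
session.execute("ALTER TABLE messages WITH gc_grace_seconds = 172800")
```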
The future
We are currently running a 12-node cluster with a replication factor of 3 and will keep adding Cassandra nodes as demand requires. We believe this will carry us for a long time, but as Discord continues to grow, there is a distant future in which we store billions of messages per day.
Netflix and Apple run clusters of thousands of nodes, so we know there is little to worry about at this stage. Still, we like to have a few ideas prepared in advance.
Near-term work
Upgrade our message cluster from Cassandra 2 to Cassandra 3. Cassandra 3 has a new storage format that can reduce the storage size by more than 50%.
Newer versions of Cassandra are better at handling more data on a single node. We currently store nearly 1TB of compressed data per node, and we believe we can safely raise that to 2TB and reduce the number of nodes in the cluster.
Long-term work
Explore Scylla [4], a Cassandra-compatible database written in C++. In normal operation our Cassandra nodes don't use much CPU, but during off-peak hours, when we run repairs (an anti-entropy process), they become quite CPU-bound, and the repair duration grows with the amount of data written since the last repair. Scylla advertises significantly shorter repair times.
Archive unused channels to flat files on Google Cloud Storage and load them back on demand. We would rather avoid doing this, so the plan may never be implemented.
Ref-01: http://ju.outofmemory.cn/entry/311011
This is the end of "How to use Cassandra to store hundreds of millions of messages online every day". Thank you for reading. If you want to learn more about the industry, keep following the site; the editor will keep publishing practical, high-quality articles for you!