
How to understand the Cassandra database


This article introduces how to understand the Cassandra database. The content is fairly detailed; interested readers are welcome to use it as a reference, and I hope it will be helpful to you.

The Cassandra database actually has a lot of technical details worth introducing, because many of its implementation ideas differ from those of relational databases and even other NoSQL databases. These differences are rooted in its design philosophy, so many of the features derived from them are not easy to compare directly with other databases.

I classify these features into four categories:

The first category is open source, which needs no discussion. The remaining three categories, seven features in all, form the core outline of this piece.

The second category is high availability, fault tolerance, and configurable consistency, which revolve around multi-node data redundancy. In other words, if each row of Cassandra data had only a single copy with no replicas, this category of features would not exist.

The third category is distributed, decentralized, and scalable. These three characteristics revolve around the partitionability of the database and the ability of each node to run independently. If you install only a single-node Cassandra, this category of features does not exist either.

The fourth category is row storage, which describes the most basic storage structure of the data at the bottom of the database, and it is also the first feature I will dig into.

Row storage structure

Any database design and optimization always revolves around one core thing: query optimization. Query is always the core requirement when working with data. Why INSERT? So that this data can be queried later. Why DELETE? Because the data will no longer be queried, and removing it lets other data be queried faster. Why UPDATE? Because the latest value is needed at query time. Whether it is the storage structure of the database itself, such as Oracle's design of segments, extents, and blocks, or auxiliary structures such as indexes, it all ultimately exists to find the needed data faster. Cassandra is no exception. By understanding its storage structure, you can better understand how it improves query performance, even though it is a database that claims to be better at INSERT.

In the early days of Cassandra, database tables were called ColumnFamily (column families), which I long understood as "a collection of columns." So for a while I thought of Cassandra as a column-store database. Why, then, is Cassandra's data model considered row-oriented (ROW-ORIENTED), while its early tables were called ColumnFamily? Fundamentally, Cassandra is neither a strict row store nor a column store; its data is stored as a sparse matrix. That may sound abstract, so let me first explain why it is not a classic row store.

In any traditional row-store database, once DDL defines how many columns a table has, every row must store a value for all of those columns. Even if a column has no value, a NULL must be stored, or the application stores a space or 0 to indicate "no value." The storage slot for that column must exist, although VARCHAR types or compression will make it as small as possible.

Cassandra, however, allows any given row to contain only a few of the columns, not all of them (except, of course, the KEY columns). This dynamism in column-value storage simply does not exist in traditional row-store databases, and I suspect it is the root of the early ColumnFamily concept.
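
To make the sparseness concrete, here is a minimal CQL sketch (the table and column names are hypothetical, not from the original article): two rows in the same table can physically store different sets of columns.

```cql
-- Hypothetical table; only the PRIMARY KEY columns are mandatory per row.
CREATE TABLE user_profile (
    user_id  text,
    login_ts timestamp,
    email    text,
    nickname text,
    city     text,
    PRIMARY KEY (user_id, login_ts)
);

-- This row stores three non-key columns...
INSERT INTO user_profile (user_id, login_ts, email, nickname, city)
VALUES ('u1', '2020-01-01 00:00:00', 'a@example.com', 'alice', 'Beijing');

-- ...this row stores only one. No NULL placeholders are written for the
-- missing columns; those cells simply do not exist on disk.
INSERT INTO user_profile (user_id, login_ts, nickname)
VALUES ('u2', '2020-01-02 00:00:00', 'bob');
```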

As mentioned earlier, for a while I thought of Cassandra as a column-store database, but at best an incomplete, restricted one. Incomplete in what way? Most column-store databases are built for OLAP, and their advantage is unbeatable performance when aggregating over a particular column.

For example, if I have a table with 100 columns and ask for a SUM over one of them, a column store can perfectly bypass the other 99 columns and read only the one needed for the SUM. However, anyone who has used Cassandra knows that a full-table aggregation over any column is simply catastrophic, because Cassandra distributes different KEYs to different database nodes / partitions (PARTITION, note that this differs from traditional database partitioning), so any column-level operation must fan out across multiple nodes. What's more, when the CQL statement arrives, the engine cannot even know in advance whether the column being aggregated exists on a given row. Therefore, Cassandra does not have the characteristics of a column-store database.

Why, in the end, is Cassandra still described as ROW-ORIENTED?

First of all, its storage revolves around the Key, the unique identifier of a row of data. A row lives where its Partition Key says it lives and is partially ordered by its Clustering Key, so its ROW-ORIENTED character is still very distinct. What a database's storage structure looks like determines what it is good at. What is a database like DB2, which stores rows sorted by primary key, best at? Fast single-record lookups by primary key (Select by Key) in an OLTP system, which is also the most common form of CQL in Cassandra. And the storage structure likewise determines which operations will be limited.

Once you understand Cassandra's sparse-matrix, ROW-ORIENTED storage, the syntax restrictions of CQL become easy to understand. For example: in a Select statement, the Where condition must supply the Partition Key (absent a secondary index). If it does not, the statement must syntactically add ALLOW FILTERING.

Why? As mentioned earlier, the Partition Key determines where the data is stored; it is like a pointer straight to the physical location of the row. ALLOW FILTERING means the Cassandra database obtains the records by scanning and filtering, not by direct location.

Compared with a traditional database, sending a Where condition on the Partition Key is like locating records through a HASH index, while ALLOW FILTERING is like doing a TABLE SCAN first, reading a large number of records and then filtering out those that meet the WHERE condition. Now look at the Clustering Key: CQL requires range searches, ORDER BY, and similar constructs to use the Clustering Key, which is also easy to understand. Once the Partition Key has fixed the location, all data sharing that Partition Key is stored in Clustering Key order, so range scans and sorting on this ordered key column require no sorting by the engine, just as in a traditional database, when the ORDER BY columns match an index, the execution plan contains no real sort.
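
Continuing the hypothetical user_profile table above, a few CQL statements illustrate these rules; this is a sketch, and the exact error behavior varies by version.

```cql
-- Direct location: the Partition Key pins down where the row lives.
SELECT * FROM user_profile WHERE user_id = 'u1';

-- No Partition Key: rejected unless you explicitly opt in to a scan.
SELECT * FROM user_profile WHERE nickname = 'alice';                 -- error
SELECT * FROM user_profile WHERE nickname = 'alice' ALLOW FILTERING; -- scan

-- Range search and ORDER BY ride on the Clustering Key, because rows
-- within a partition are already stored in login_ts order.
SELECT * FROM user_profile
WHERE user_id = 'u1' AND login_ts > '2020-01-01'
ORDER BY login_ts DESC;
```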

Having figured out Cassandra's storage structure, let's look at how Cassandra performs create, read, update, and delete on a single node. However distinctive Cassandra's multi-node characteristics are, on a single node, data reading and writing is always the foundation of database performance. No matter how many nodes there are, if single-node read and write performance is poor, the database as a whole will never be fast. So let's look at how Cassandra reads and writes data.

First, a translated passage from "Cassandra: The Definitive Guide": "In Cassandra, writing data is very fast, because its design of memtables and SSTables means that an insert performs no disk reads or seeks, the operations that slow a database down. All writes in Cassandra are append-only."

Let's walk through Cassandra's write steps to see where this write advantage comes from.

Step one: write the Commit Log. This step is nothing new; I think of it as roughly the same as a traditional database's REDO log. In any database, writes to this log are append-only. However, diagrams often show the Commit Log being written directly to the hard drive, and I don't think that description is accurate. Whether it is a traditional database's REDO LOG or Cassandra's Commit Log, it goes to memory first and is then FLUSHed to disk, with the flush strategy determined by parameters such as commitlog_sync. This is very similar to traditional databases, so I won't dwell on it; just realize that the more frequently the FLUSH happens, the less data is lost when the system crashes, at the cost of some insert performance. It is just like innodb_flush_log_at_trx_commit=1 in MySQL: the safest setting, and also the slowest.
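
As a sketch of that trade-off (assuming the Cassandra 3.x cassandra.yaml option names):

```yaml
# cassandra.yaml sketch: how aggressively the commit log is FLUSHed.
# "periodic" acknowledges writes before fsync, trading a small
# durability window for throughput.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Stricter alternative, analogous to MySQL's
# innodb_flush_log_at_trx_commit=1 (safest, slowest):
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 2
```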

Step two, Add to memtable, is the critical step, and in Cassandra it is a purely in-memory action. In a traditional database, the same write would need to do something like this:

Search the index layer by layer, and if the index block is not in DATA BUFFER, trigger the disk IO.

Locate the data block through the index, and trigger the disk IO if the data block is not in the DATA BUFFER.

Modify the index block and the data block; if concurrent modifications are heavy, lock contention and so on may occur.

Of course, with a heap-table design like Oracle's, pure INSERTs rarely trigger the IO of step 2, but in UPDATE scenarios (which in Cassandra are also just Inserts), none of this overhead can be avoided. Why can Cassandra do pure append-only writes to the Memtable? It is inseparable from Cassandra's per-record Timestamp concept: no matter how many times you write, the database simply honors the record with the latest Timestamp. This removes the need to lock the record. Such a design not only has no lock conflicts, it even saves the cost of finding the record that would need to be locked, and this is exactly where the speed comes from.

However, this speed comes at a price: data consistency. Consider a simple requirement: before the data is written, check whether it already exists, and refuse the insert if it does (CQL's IF NOT EXISTS syntax), or make an UPDATE conditional on the current data (UPDATE ... IF column = 'value'). Once such conditional CQL is used, you can infer that these advantages evaporate.
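
For example (again using the hypothetical user_profile table), conditional CQL like the following forces a read, and a coordination round among replicas, before the write:

```cql
-- Insert only if the row does not already exist (a lightweight
-- transaction): the append-only fast path no longer applies.
INSERT INTO user_profile (user_id, login_ts, nickname)
VALUES ('u3', '2020-01-03 00:00:00', 'carol')
IF NOT EXISTS;

-- Update only if the current column value matches a condition.
UPDATE user_profile
SET city = 'Shanghai'
WHERE user_id = 'u1' AND login_ts = '2020-01-01 00:00:00'
IF nickname = 'alice';
```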

Look at step 3: if this row is in the Row Cache, invalidate it. Note that the Row Cache entry is invalidated, not updated in place. So the Row Cache is useful only for reading exceptionally hot data, and is not suited to scenarios where hot data is modified with high concurrency.

A regular write transaction returns success after completing these three steps; it does not need to wait for the Memtable to be flushed. In other words, the steps that directly affect transaction latency are already over, which is not much different from a traditional database. The remaining steps do not directly affect the database's write throughput.

Step four flushes the Memtable to disk; this is usually asynchronous, and SSTable storage will be expanded in detail later. Step five belongs to the multi-node side: it is the handling process for node failures.

To sum up: a write in a traditional database (INSERT, UPDATE, DELETE) is usually a read-then-write process. Cassandra's write skips the initial read, and that is the root of its speed; conversely, once syntax such as IF NOT EXISTS is used, its write performance is impaired accordingly.

Next, let's look at Cassandra reads, which are multi-node, multi-replica reads. Here, again, let's first look at the situation on a single node.

Step one: if this row is in the Row Cache, return the data directly. Easy to understand.

Step two: check the Key Cache, which can be understood here as a primary-key index storing the information used to locate the row in Memtables or SSTables (the book's original wording is "offset location"). Note that the values found in this step may be used in both steps 3 and 4, not just step 3. At this point you may notice a real problem with the usual diagram: it does not show how the Key Cache is maintained. Caching is an option that can be configured when creating a table, and it can be inferred that if the key cache is set to ALL when the table is built, a newly written key should be added to the Key Cache when it goes into the Memtable.
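
For reference, caching is a per-table option in CQL; a sketch against the hypothetical table (the row cache additionally requires row_cache_size_in_mb to be enabled in cassandra.yaml):

```cql
-- Cache every partition key of this table, plus up to 100 rows per
-- partition in the Row Cache.
ALTER TABLE user_profile
WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};
```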

Step three: note that for a given table (or column family), a node has only one active Memtable, so the search through it is linear. The Memtable holds the data that has not yet been FLUSHed into SSTables; a query must read its contents together with the contents of the SSTables, and on a single node the Memtable's contents are the most recent.

The big difference from the write path is that the key step in reading is step 4, reading the SSTables, which I will expand on later. Step 5: if the Row Cache is enabled, add this record to it. The Row Cache holds whole rows and, as mentioned earlier, is suited to hot-read data.

All databases share four routine operations, the classic "create, read, update, delete." Having covered writes and reads, here is a brief introduction to Cassandra's deletes and updates. In a nutshell, a Cassandra delete is an update, and a Cassandra update is a write, so Cassandra really only writes and reads. But if Cassandra only ever writes data, won't storage explode? Here we introduce three concepts that are new relative to traditional databases: Tombstones, Timestamps, and Compaction.

That a Cassandra delete is an update is easy to understand. Many business tables implement logical deletes in order to keep an audit trail: a flag is modified to indicate the record is deleted or invalid, but the row is not actually removed from the database. Likewise, Cassandra does not remove a row the moment it receives a Delete command. Instead, it records a Tombstone for the row, indicating that it has been deleted.

That all Cassandra updates are writes was foreshadowed earlier: Cassandra is fast precisely because it does not locate the data first. On a traditional database, any Update command must read the record into memory and lock it. On Cassandra, the Update command becomes an INSERT, but doesn't that leave duplicate keys in the system? This is where the Timestamp on each record comes in.

Each Cassandra query reads out all versions of a duplicated KEY but always honors the latest Timestamp. That solves the problem created by turning all updates into writes, but it creates two obvious new ones. First, the data expands without bound, eating disk. Second, that expansion increases the number of duplicate versions a query must read, so read performance degrades without bound as well.
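
The WRITETIME function makes this visible; a sketch with the hypothetical table:

```cql
-- Two writes to the same primary key are two timestamped versions.
INSERT INTO user_profile (user_id, login_ts, city)
VALUES ('u1', '2020-01-01 00:00:00', 'Beijing');
INSERT INTO user_profile (user_id, login_ts, city)
VALUES ('u1', '2020-01-01 00:00:00', 'Shenzhen');

-- The read resolves to the newest Timestamp ('Shenzhen'), and
-- WRITETIME exposes the timestamp used for that resolution.
SELECT city, WRITETIME(city) FROM user_profile
WHERE user_id = 'u1' AND login_ts = '2020-01-01 00:00:00';
```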

And so we arrive at Compaction. Note that this is not the data compression we usually talk about, the familiar Compress. Many translated documents render Compaction as "compression," but I personally think it should be rendered as data reorganization. Compaction runs asynchronously in the database background and, following from the above, removes tombstoned data, discards versions with older timestamps, rewrites the SSTable storage files, and so on. This is how the two problems above are solved; in a sense the action even resembles DB2's REORG. Different tables can choose different CompactionStrategy settings (a per-table option). This, too, is often translated as a compression algorithm, and again I think "reorganization strategy" fits better; "compression algorithm" should stay aligned with the concept of Compress. After all, Compaction does not run-length-encode consecutive values in the data file or build dictionaries, right? With that covered, let's face the biggest bottleneck of any database.
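
For reference, a sketch of choosing a strategy per table; the class names are the standard ones shipped with Cassandra:

```cql
-- Size-tiered is the default and favors write-heavy tables; leveled
-- trades more compaction IO for more predictable reads.
ALTER TABLE user_profile
WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
-- Alternative: WITH compaction = {'class': 'LeveledCompactionStrategy'};
```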

As long as a database is not an in-memory database, it always faces the same ultimate performance bottleneck: disk IO. Many of the concepts mentioned earlier, such as Cache, column storage, and indexing, all point at the same essence of performance optimization: reducing disk IO. When walking through reads and writes I skipped step 4; reading the SSTables is in fact the key step that affects performance.

Let's look at what an SSTable is and what reading one involves. In access order, an SSTable in version 3.0 consists of the following files:

Filter.db: this is the SSTable's Bloom filter. In a nutshell, it tells you whether the Key you want might be here. A Bloom filter works by mapping the values in a dataset into a bit array, using hash functions to condense a large dataset into a compact summary, which by construction uses far less memory than the raw data. It is fast and can give false positives, but never false negatives: it may tell you a key exists when it does not, but it will never tell you a key does not exist when it does. Note one important point: Cassandra keeps a copy of the Bloom filter in memory, so this step may involve no actual IO at all. The book also mentions that giving it more memory reduces the Bloom filter's false-positive rate.

Summary.db, which is a sample of the index, is used to speed up reading.

Index.db, which provides row and column offsets in Data.db.

CompressionInfo.db provides metadata about the compression of the Data.db file. Note that this one really does use the word Compression, and I would guess that if Data.db uses a compression algorithm, such as dictionary compression, then the dictionary, or similar Compress-related metadata, is stored in this file. That is why this file cannot be bypassed on the access path: once the Data.db data is compressed, decompressing it depends on this metadata. This metadata is kept in memory, so it is fast.

Data.db is the file storing the actual data, and the only file retained by Cassandra's backup mechanism. It is the only real data; everything else is auxiliary. Indexes can be rebuilt, dictionaries can be rebuilt, and so on.

Digest.adler32 is used for Data.db verification.

Statistics.db stores statistics about SSTable used by the nodetool tablehistograms command.

TOC.txt lists the file components of this SSTable.

Files 1-5 are the ones relevant to the performance of SSTable data access. If the key cache is set to ALL, Cassandra can usually go straight from memory to the offset of the specific SSTable file and data. Compared with a traditional B-tree index descending level by level, paying IO for every index block not already in the buffer, this performance should still be very respectable.

By this point, I wonder if you have sensed an essential trait of Cassandra: there are no locks, that is, no conflicts or scrambles over resources. The Timestamp concept lets multiple versions of the same Key coexist without ever being locked. Although everything so far has covered only a single-node database, this use of Timestamps is exactly what gives Cassandra the flexibility for what comes next: distributed, decentralized, scalable, highly available, fault-tolerant, and configurably consistent.

Distributed, decentralized, scalable

Earlier we divided the remaining six features into two groups. Distributed, decentralized, and scalable revolve around the independence of the KEY, and the Partition Key in particular is extremely independent. Because of that independence, in theory the data for any two different Partition Keys can be placed on different machines and served independently, which is what makes Cassandra distributed, decentralized, and scalable. Let's look at these features one by one.

"Distributed": the Baidu encyclopedia entry defines it as a software system built on top of a network, with four main characteristics:

Distribution. A distributed system consists of multiple computers that are geographically dispersed, across a campus, a city, a country, or even the globe. The system's functions are spread across the nodes, so data processing is distributed too. In Cassandra, a single logical table is stored across multiple Nodes, and records with different key values are served, in a decentralized fashion, by different Cassandra nodes.

Autonomy. Each node in a distributed system has its own processor and memory and can process data independently. The nodes are usually peers: they work autonomously while using shared communication links to exchange information and coordinate tasks. In Cassandra, "primary" versus "replica" is meaningful only in the sense that the Partition Key determines which Node a piece of data belongs to, say, the key ranges stored by Node1, with the other copies only nominally called replicas. In reality, Cassandra has multiple masters of equal status, all capable of processing data independently and cooperating on tasks; this is not the traditional notion of primary and standby data.

Parallelism. A large task can be divided into subtasks executed on different hosts. Each Node naturally serves its own Keys, independently and in parallel with the others. Different CQL statements may be completed by different Nodes, and several Nodes may participate in a single CQL statement, completing it essentially in parallel.

Globality. A distributed system must have a single, global inter-process communication mechanism, so that any process can talk to any other with no distinction between local and remote communication, along with a global protection mechanism; there is one uniform set of system calls on every machine, adapted to the distributed environment, and running the same kernel on every CPU makes coordination easier. Cassandra fits this definition perfectly: Coordinator nodes are not fixed; every node can accept any CQL statement and act as the coordinator. What matters is that an application or client never cares how Cassandra stores and queries the data behind the scenes; from the outside it always sees one complete logical table.

On this distributed foundation, Cassandra runs across multiple Nodes, and those Nodes can be deployed in genuinely different data centers and on different racks, which is what makes it decentralized. On the same foundation, data replication can be defined per data center using Cassandra's multi-datacenter replication strategy, NetworkTopologyStrategy.
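
A sketch (the keyspace and data-center names are hypothetical, and the data-center names must match what the cluster's snitch reports):

```cql
-- Three replicas in data center dc1, two in dc2.
CREATE KEYSPACE app_data
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 2
};
```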

Actually, the word "scalable" on its own is not precise; the emphasis is on horizontal scalability. In short, adding Nodes to the ring improves Cassandra's processing capacity. This is closely tied to its distributed nature: Cassandra's splitting granularity is extremely fine, in theory down to a single Partition KEY. In other words, each Partition KEY can be seen as the smallest unit that can be split off and served independently. Data can keep flowing in while Nodes are being added, which makes its horizontal scalability very good.

As an extreme thought experiment: even if Cassandra kept only one copy of each piece of data, the independence of Keys would still let different Keys be assigned to different machines, which could still be called distributed, decentralized, and scalable. But such a system would be imperfect and incomplete: the more machines the data is spread over, the more certainly the service becomes partial the moment any one machine fails.

High availability, fault tolerance, configurable consistency

Next, let's move on to the other three features: high availability, fault tolerance, and configurable consistency, which revolve around data redundancy.

Behind any high availability there must be data redundancy. Traditional databases usually favor an active/standby model: when the serving database node goes down, the standby node takes over. Failure detection and failover time then become the focus of attention; a good setup can switch over in a minute or a few dozen seconds. But in today's 7x24x365 environment, a one-minute recovery window usually does not satisfy users, and compressing the switchover time runs into a contradiction: the failure-detection threshold. Set it too long and failover is slow; set it too short and a bit of network jitter can trigger a needless, mistaken failover.

Cassandra's data replication (replicas) is unlike traditional standby data; it is closer to multiple copies of master data, all serving all the time. In other words, when one database node goes down, no failover time is needed at all; with sufficient resources the failure can be almost imperceptible (say, one broken replica out of seven).

How many replicas Cassandra keeps and how they are placed can be set per Keyspace, giving you a flexible choice: decide the replication strategy according to how each set of tables is actually used.

Data redundancy is routine in distributed systems; parameter data, for example, is commonly handled through redundancy. But wherever there is redundancy there is synchronization, and the data-consistency problem always accompanies redundant data. Fortunately, Cassandra has Timestamps to resolve consistency, and fault tolerance is essentially a derivative of it: put simply, when Cassandra finds stale data with an older timestamp, it repairs it in the background.

Configurable consistency is a particularly important feature of Cassandra, because its impact goes beyond high availability; it directly affects database performance. In traditional databases, likewise, whether commits wait for synchronous persistence directly affects OLTP transaction performance (Redis included). In principle, the more Nodes Cassandra has to wait on for a write, the slower the response, and that response time is set by the slowest Node; for faster transactions, more of the work must be asynchronous. So Cassandra usually does not wait for all Nodes to respond. How many Nodes to wait for, and which ones, is configurable consistency.

Cassandra's consistency levels for writing and reading data are:

ANY (write only), ONE, TWO, THREE, QUORUM, ALL

LOCAL_ONE, LOCAL_QUORUM, EACH_QUORUM

These levels are easy to understand and need no one-by-one explanation. High availability and strong consistency are a classic case of the fish and the bear's paw: you cannot have both. Suppose our system writes the fastest way possible, writing at ANY and reading at ONE. Then the chance that a read returns data that is not the most current rises sharply. Say six nodes hold the data: if any one write succeeds, the program reports success; suppose the other five have not finished writing, and a read at ONE happens to land on one of those five and returns, and you have an inconsistency. To achieve strong consistency, the read and write levels must be set to satisfy this condition:

W + R > RF, where W is the write consistency level, R is the read consistency level, and RF is the replication factor (number of replicas).

This design of Cassandra is very clever, and it provides excellent tuning flexibility. The essence of database tuning is nothing more than robbing surplus to cover shortage: taking performance from where it is more than good enough to shore up where it is lacking.

In any database, or any data, some functions are rarely used and do not need to be fast; call that surplus. Others are used constantly, are the main workload, and the faster the better; call that shortage. For example, many systems access a given table in only a limited set of patterns. One table may be inserted into 100 times for every single read, like log-style data; another may see one insert per 100 reads, like parameter data. There is great flexibility here: we can sacrifice the performance of the unpopular operations to protect the main ones. For a read-dominated table, say, we can set write consistency to ALL and read consistency to ONE, and obtain very efficient overall performance.

Note that the replication factor is defined on the Keyspace, that is, on the storage side, while read consistency is chosen by the client. The same data can be read at different consistency levels in different scenarios: when freshness matters, read at QUORUM or ALL; when it matters less, read at ONE.
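
In cqlsh, for instance, the consistency level is a session setting, not part of the schema; a sketch:

```cql
-- cqlsh session commands, not schema: dial consistency per use case.
CONSISTENCY QUORUM;  -- freshness matters: pay for more replica responses
SELECT * FROM user_profile WHERE user_id = 'u1';

CONSISTENCY ONE;     -- staleness tolerable: cheapest read
SELECT * FROM user_profile WHERE user_id = 'u1';
```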

Summary

So far, I have walked through Cassandra's distributed, decentralized, scalable, highly available, fault-tolerant, configurably consistent, and row-storage features.

Looking back: we first covered the row-storage structure on a single Cassandra node, then introduced distribution, decentralization, and scalability around the independence of Cassandra's data Keys, and finally discussed the high availability, fault tolerance, and configurable consistency brought by Cassandra's multi-replica data.

Of course, Cassandra has many more topics worth discussing, such as Secondary Indexes, Tokens, Hinted Handoff, and so on. Day-to-day use also brings monitoring, backup and recovery, performance tuning, security, and other topics worth studying. I won't introduce them one by one here; if there is a chance later, let's do a sequel.
