Data cleaning and recovery of tables in Cassandra Cluster


This article explains how to clean up and recover the data of a table in a Cassandra cluster. The explanation is simple and clear and easy to follow.

Objective: the project team needs to clean up the data of a table in a production Cassandra cluster, and this experiment verifies the feasibility of TRUNCATE.

1. Environment preparation

Build a three-node cluster with 2 replicas in an Aliyun environment:

172.26.99.152

172.26.99.153

172.26.99.154

Install the Java JDK:

If an old version is left over, it needs to be removed first.

(1) Check whether the system already has a JDK installed:

yum list installed | grep java

If a JDK is already installed, uninstall the system's own Java environment:

yum -y remove java-1.7.0-openjdk*

yum -y remove tzdata-java.noarch

(2) List the Java packages available in the yum repository:

yum -y list java*

(3) Use yum to install the Java environment (JDK 1.8.0 is installed here; with 1.7, Cassandra reports an error on startup):

yum install java-1.8.0

(4) Check the version of the newly installed Java:

java -version

mkdir /CAS

cd /CAS

tar xzvf apache-cassandra-3.11.1-bin.tar.gz

mv apache-cassandra-3.11.1 cassandra

useradd cassandra

passwd cassandra

chown -R cassandra:cassandra /CAS

chmod -R 755 /CAS/cassandra

su - cassandra

cd /CAS/cassandra/conf

$ vi cassandra.yaml

- seeds: "172.26.99.152" -- changed from 127.0.0.1 to the IP of one or more nodes in the cluster. Listing every node as a seed is not recommended, because repair is relatively complicated when a seed node is damaged.

listen_address: 172.26.99.152 -- changed to the IP of the current node

rpc_address: 172.26.99.152 -- changed to the IP of the current node

$ vi cassandra-env.sh

Parameters that need to be modified in cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=172.26.99.152" -- commented out by default; remove the comment and change the hostname to the current node's IP.

Configure $JAVA_HOME (the Java environment variable) and $CASSANDRA_HOME (the Cassandra environment variable).

Generally speaking, a JDK installed through yum lands under /usr/lib/jvm/ (here, for example, /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-0.b15.el6_8.x86_64).

(1) Open the environment variable configuration file and append:

cat >> /etc/profile <<EOF
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-0.b15.el6_8.x86_64
export CASSANDRA_HOME=/CAS/cassandra
EOF

source /etc/profile

(2) Start Cassandra on each node as the cassandra user, connect with cqlsh, and check the existing keyspaces:

cd /CAS/cassandra/bin

./cassandra

./cqlsh $HOSTNAME

cqlsh> SELECT * FROM system_schema.keyspaces;

 keyspace_name      | durable_writes | replication
--------------------+----------------+--------------------------------------------------------------------------------------
        system_auth |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
      system_schema |           True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
 system_distributed |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
             system |           True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
      system_traces |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}

(5 rows)

cqlsh>

cqlsh> CREATE KEYSPACE dbrsk WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 2};

cqlsh> SELECT * FROM system_schema.keyspaces;

 keyspace_name      | durable_writes | replication
--------------------+----------------+----------------------------------------------------------------------------------------
        system_auth |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
      system_schema |           True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
 system_distributed |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
             system |           True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
              dbrsk |           True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '2'}
      system_traces |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}

(6 rows)

Check the structure of the test table in the source database and export its data:

cqlsh:dbrsk> desc t_card_info

CREATE TABLE dbrsk.t_card_info (
    bankcard text PRIMARY KEY,
    bankname text,
    cardname text,
    cardtype text,
    city text,
    province text,
    updatetime bigint
) WITH bloom_filter_fp_chance = 0.00075
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = 'card information'
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 0.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 86400
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

cqlsh:dbrsk> COPY t_card_info TO '/tmp/t_card_info.csv';

Using 16 child processes

Starting copy of dbrsk.t_card_info with columns [bankcard, bankname, cardname, cardtype, city, province, updatetime].

Processed: 2726962 rows; Rate: 5524 rows/s; Avg. Rate: 57918 rows/s

2726962 rows exported to 1 files in 47.165 seconds.
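For large tables, cqlsh's COPY accepts tuning options. A sketch of a few commonly useful ones (the values below are illustrative, not the settings used in this experiment):

cqlsh:dbrsk> COPY t_card_info TO '/tmp/t_card_info.csv'
         ... WITH NUMPROCESSES = 8 AND PAGESIZE = 1000 AND MAXATTEMPTS = 5;

NUMPROCESSES controls the number of worker processes, PAGESIZE the page size used when fetching results, and MAXATTEMPTS the number of retries per failed chunk.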

Create the table in the target cluster and import the data:

cqlsh> use dbrsk;

cqlsh:dbrsk> COPY t_card_info FROM '/tmp/t_card_info.csv';

Using 1 child processes

Starting copy of dbrsk.t_card_info with columns [bankcard, bankname, cardname, cardtype, city, province, updatetime].

Processed: 690000 rows; Rate: 10883 rows/s; Avg. Rate: 11617 rows/s

Processed: 1410000 rows; Rate: 13012 rows/s; Avg. Rate: 11813 rows/s

Processed: 2115000 rows; Rate: 10324 rows/s; Avg. Rate: 11783 rows/s

Processed: 2726962 rows; Rate: 5305 rows/s; Avg. Rate: 11893 rows/s

2726962 rows imported from 1 files in 3 minutes and 49.299 seconds (0 skipped).
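Since the keyspace was created with 'datacenter1': 2, each partition is now stored on two of the three nodes. Which two replicas own a particular row can be checked with nodetool (the bank-card value below is a hypothetical key):

# prints the replica IPs that own the given partition key
./nodetool getendpoints dbrsk t_card_info 6222021234567890123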

Before importing data:

[root@node2 data]# du -sh *

408K commitlog

1.4M data

4.0K hints

4.0K saved_caches

After importing the data:

[root@node2 data]# du -sh *

155M commitlog

98M data

4.0K hints

4.0K saved_caches

Perform the truncate operation and view the results:

cqlsh:dbrsk> truncate table t_card_info;

[root@node2 dbrsk]# cd t_card_info-9e129520c31c11eab89c515b68839f7c/

[root@node2 t_card_info-9e129520c31c11eab89c515b68839f7c]# ls

backups  snapshots

[root@node2 t_card_info-9e129520c31c11eab89c515b68839f7c]# du -sh *

4.0K backups

103M snapshots

[root@node2 t_card_info-9e129520c31c11eab89c515b68839f7c]# cd snapshots/

[root@node2 snapshots]# ls

truncated-1594434747140-t_card_info

[root@node2 snapshots]# cd truncated-1594434747140-t_card_info/

[root@node2 truncated-1594434747140-t_card_info]# ls

manifest.json                 mc-10-big-Statistics.db       mc-11-big-Filter.db           mc-12-big-Data.db        mc-12-big-TOC.txt            mc-9-big-Statistics.db
mc-10-big-CompressionInfo.db  mc-10-big-Summary.db          mc-11-big-Index.db            mc-12-big-Digest.crc32   mc-9-big-CompressionInfo.db  mc-9-big-Summary.db
mc-10-big-Data.db             mc-10-big-TOC.txt             mc-11-big-Statistics.db       mc-12-big-Filter.db      mc-9-big-Data.db             mc-9-big-TOC.txt
mc-10-big-Digest.crc32        mc-11-big-CompressionInfo.db  mc-11-big-Summary.db          mc-12-big-Index.db       mc-9-big-Digest.crc32        schema.cql
mc-10-big-Filter.db           mc-11-big-Data.db             mc-11-big-TOC.txt             mc-12-big-Statistics.db  mc-9-big-Filter.db
mc-10-big-Index.db            mc-11-big-Digest.crc32        mc-12-big-CompressionInfo.db  mc-12-big-Summary.db     mc-9-big-Index.db

The space footprint on the other nodes is consistent:

[cassandra@node3 t_card_info-9e129520c31c11eab89c515b68839f7c]$ du -sh *

4.0K backups

101M snapshots

The data has been moved into the snapshots folder.

Running the repair command does not clean up the data in snapshots:

./nodetool repair dbrsk

Deleting the snapshots folder at the operating-system level was also tried; the database still works normally after the deletion.
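Rather than deleting snapshot directories by hand, nodetool provides a supported way to list and clear them. A sketch, reusing the snapshot tag from the listing above:

# list all snapshots known to this node
./nodetool listsnapshots

# clear the truncate snapshot for the dbrsk keyspace only
./nodetool clearsnapshot -t truncated-1594434747140-t_card_info -- dbrsk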

Re-import the data and delete the table:

cqlsh:dbrsk> drop table t_card_info;

[root@node2 t_card_info-9e129520c31c11eab89c515b68839f7c]# du -sh *

4.0K backups

103M snapshots

[root@node2 t_card_info-9e129520c31c11eab89c515b68839f7c]# ls

backups  snapshots

[root@node2 t_card_info-9e129520c31c11eab89c515b68839f7c]# cd snapshots/

[root@node2 snapshots]# ls

dropped-1594435864327-t_card_info

As you can see, after DROP and TRUNCATE the data is placed under that table's snapshots/dropped-xxxxxx and snapshots/truncated-xxxxxx directories, respectively.

So how can the data be recovered?
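sstableloader streams SSTables from a local directory whose path must end in <keyspace>/<table>. A minimal staging sketch, assuming the dropped snapshot shown above holds the data to restore (the data path below follows this install's default layout and is an assumption):

# stage the snapshot SSTables in a keyspace/table-named directory
mkdir -p /tmp/dbrsk/t_card_info
cp /CAS/cassandra/data/data/dbrsk/t_card_info-9e129520c31c11eab89c515b68839f7c/snapshots/dropped-1594435864327-t_card_info/mc-* /tmp/dbrsk/t_card_info/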

[cassandra@node2 bin]$ ./sstableloader -d 172.26.99.152 /tmp/dbrsk/t_card_info

WARN  11:04:46,472 Only 31.813GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots

Established connection to initial hosts

Opening sstables and calculating sections to stream

Skipping file mc-21-big-Data.db: table dbrsk.t_card_info doesn't exist

Skipping file mc-22-big-Data.db: table dbrsk.t_card_info doesn't exist

Skipping file mc-23-big-Data.db: table dbrsk.t_card_info doesn't exist

Skipping file mc-24-big-Data.db: table dbrsk.t_card_info doesn't exist

Summary statistics:

Connections per host: 1

Total files transferred: 0

Total bytes transferred: 0.000KiB

Total duration: 2934 ms

Average transfer rate: 0.000KiB/s

Peak transfer rate: 0.000KiB/s

As the output shows, if the target table does not exist, sstableloader skips the files, so the table has to be created manually first:
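Rather than retyping the DDL, the schema.cql file that Cassandra stores inside each snapshot (visible in the snapshot listing earlier) can be replayed, assuming the dbrsk keyspace already exists:

# replay the DDL captured in the snapshot
./cqlsh $HOSTNAME -f /CAS/cassandra/data/data/dbrsk/t_card_info-9e129520c31c11eab89c515b68839f7c/snapshots/dropped-1594435864327-t_card_info/schema.cql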

[cassandra@node2 bin]$ ./cqlsh --request-timeout=90000 $HOSTNAME

Connected to Test Cluster at node2:9042.

[cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4]

Use HELP for help.

cqlsh> use dbrsk;

cqlsh:dbrsk> CREATE TABLE dbrsk.t_card_info (
         ...     bankcard text PRIMARY KEY,
         ...     bankname text,
         ...     cardname text,
         ...     cardtype text,
         ...     city text,
         ...     province text,
         ...     updatetime bigint
         ... ) WITH bloom_filter_fp_chance = 0.00075
         ...     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
         ...     AND comment = 'bank card information data'
         ...     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
         ...     AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
         ...     AND crc_check_chance = 0.0
         ...     AND dclocal_read_repair_chance = 0.0
         ...     AND default_time_to_live = 0
         ...     AND gc_grace_seconds = 86400
         ...     AND max_index_interval = 2048
         ...     AND memtable_flush_period_in_ms = 0
         ...     AND min_index_interval = 128
         ...     AND read_repair_chance = 0.0
         ...     AND speculative_retry = '99PERCENTILE';

cqlsh:dbrsk> exit

[cassandra@node2 bin]$ ./sstableloader -d 172.26.99.152 /tmp/dbrsk/t_card_info

WARN  11:05:57,753 Only 31.813GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots

Established connection to initial hosts

Opening sstables and calculating sections to stream

Streaming relevant part of /tmp/dbrsk/t_card_info/mc-21-big-Data.db /tmp/dbrsk/t_card_info/mc-22-big-Data.db /tmp/dbrsk/t_card_info/mc-23-big-Data.db /tmp/dbrsk/t_card_info/mc-24-big-Data.db to [/172.26.99.154, /172.26.99.152, /172.26.99.153]

progress: [/172.26.99.154] 40% [/172.26.99.152] 46% [/172.26.99.153] 3%  (avg: 1.172MiB/s)

progress: [/172.26.99.154] 40% [/172.26.99.152] 46% [/172.26.99.153] 3%  (avg: 2.578MiB/s)

……

progress: [/172.26.99.154] 48% [/172.26.99.152] 16% [/172.26.99.153] 14% (avg: 8.540MiB/s)

progress: [/172.26.99.154] 51% [/172.26.99.152] 16% [/172.26.99.153] 25% (avg: 8.651MiB/s)

progress: [/172.26.99.154] 58% [/172.26.99.152] 16% [/172.26.99.153] 16% (avg: 9.406MiB/s)

……

progress: [/172.26.99.154] 100% [/172.26.99.152] 100% [/172.26.99.153] 100% (total: 100%)

Summary statistics:

Connections per host: 1

Total files transferred: 8

Total bytes transferred: 133.156MiB

Total duration: 12314 ms

Average transfer rate: 10.813MiB/s

Peak transfer rate: 17.530MiB/s

Conclusion: the TRUNCATE operation is feasible, and the data can be recovered if something goes wrong, but recovery takes a long time.

Neither Cassandra's TRUNCATE TABLE nor DROP TABLE frees disk space immediately; the data is moved under that table's snapshots folder.
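This snapshot-on-truncate/drop behaviour is controlled by the auto_snapshot option in cassandra.yaml (true by default); disabling it trades the safety net for immediate space reclamation. For reference:

# cassandra.yaml -- whether to take a snapshot before each truncate or drop (default: true)
auto_snapshot: true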

Thank you for reading. The above covers data cleaning and recovery of tables in a Cassandra cluster; after studying this article you should have a deeper understanding of the problem, though the specific procedure still needs to be verified in practice.
