What is the maintenance work of GaussDB T distributed cluster database? 07/13 Update SLTechnology News&Howtos

What is the maintenance work of GaussDB T distributed cluster database?

2025-07-13 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

In this issue, Xiaobian will bring you about the maintenance work of GaussDB T distributed cluster database. The article is rich in content and analyzed and described from a professional perspective. After reading this article, I hope you can gain something.

We started GaussDB T daily maintenance must-do things. A new day starts with turning on the host. After turning on the virtual machine, it is found that the database installed last time is not self-starting. The relevant processes started by all nodes are only cm_agent processes:

At this point we have to pull up ETCD:

OK, ETCD pulled up successfully, next we pull up the whole cluster:

Cluster pull up successful.

Later we will automatically pull ETCD and cluster to join the self-start, the following start back to the opening theme, daily maintenance began.

I. Cluster status check

The first thing, of course, is to check the resource status of each node in the cluster. As for what to see, we use a diagram to understand the main points:

1. Check whether each node resource is ON LINE, including CM,CN,DN,ETCD, etc. If not, further check the reason.

2. Check whether each node is involved in node switching compared with yesterday, and check the HOST corresponding to the node. If there is an abnormality, the cause needs to be further verified.

Second, check the host resource usage (all hosts)

1. Host directory utilization rate

df -h

CPU, memory and IO usage

There are many ways to check this. Here, vmstat,iostat,free is used. Please pay attention to the position marked by the red box below.

The id column represents the CPU idle rate, and the free column represents free memory in pages.

Explanation: rMB/s and wMB/s are read/write per second, %util all IO processing time in the statistical time, divided by the total statistical time. For example, if the count interval is 1 second and the device has 0.8 seconds to process IO and 0.2 seconds to idle, then %util = 0.8/1 = 80% for the device, so this parameter implies how busy the device is. If this parameter is 100%, it means that the device is running close to full capacity (of course, if it is multi-disk, even if %util is 100%, because of the concurrency capacity of the disk, disk usage may not reach the bottleneck).

Focus on free and available.

Note: The resource check in this section needs to be compared with the baseline. If the discrepancy is too large, the reasons need to be further verified.

III. Verify the status of each node database

Make sure CN and DN are in open state, note that standby DN is in mount state.

IV. Tablespace Usage Check

Before we do the usage check, let's talk about how tablespaces are created.

1. Connect to CN

zsql omm/gaussdb_123@127.0.0.1:8000 –q

2. Create a table space

CREATE TABLESPACE tbs_test1 DATAFILE 'tbs_test1' size 100m SHARD;

Note: When creating a table space, SHARD keyword supports automatic delivery of table space creation statements to CN and DN nodes, and only relative paths are supported; if SHARD keyword is not used, absolute paths can be used, and tables can only be created under this table space normally after this table space is created on all CN and primary DN nodes.

3. Check the data file, we will find that the corresponding table space and data file are created in CN and DN.

Note: Connect the primary DN using the following command.

zsql / as sysdba -D /gaussdb/data/data_dn1 -q

4. Check the utilization of table space

set line 300 set pages 2000 set timing off col tablespace_name for a25 col sum_GB for a15 col free_GB for a15 col use_precent for a15 select b.tablespace_name, round(sum(b.bytes) / 1024 / 1024 / 1024, 0) sum_GB, round(sum(nvl(a.bytes, 0)) / 1024 / 1024 / 1024, 0) free_GB, round((sum(b.bytes) - sum(nvl(a.bytes, 0))) / sum(b.bytes), 4) * 100 use_precent, count(*) from (select tablespace_name, file_id, sum(bytes) bytes from adm_free_space group by tablespace_name, file_id) a, adm_data_files b where a.file_id(+) = b.file_id and a.tablespace_name(+) = b.tablespace_name group by b.tablespace_name having round((sum(b.bytes) - sum(nvl(a.bytes, 0))) / sum(b.bytes), 4) * 100 >= 0 order by 4 desc;

Note: Tablespace usage checks need to be run on all primary CNs and primary DNs.

V. Abnormal waiting event checking

col event form a38

select event,count(*) from DV_SESSIONS where LOCK_WAIT = 'Y' group by event order by 2 desc;

Note: Check for exception waiting events at all primary DNs.

As shown in the figure, there is TX waiting, we can see what the lock source is doing through the following SQL:

select SID,SERIAL#,USERNAME,CURR_SCHEMA,CLIENT_IP,CLIENT_PORT,OSUSER,MACHINE,PROGRAM, STATUS,LOCK_WAIT,EVENT,MODULE,CURRENT_SQL from dv_sessions where sid in (select WAIT_SID from v$session where event like '%TX%');

If the session state is found to be inactive and connected by the application, you can contact the application to check whether it is normal. For example, you can kill us by running ALTER SYSTEM KILL SESSION 'SID,SERIAL#'; to kill the session.

VI. Log check

In the process of database operation, a large number of logs such as operation, audit, DEBUG and alarm will be generated for daily maintenance of database. These logs can be used for problem location and database recovery in the event of a database failure.

Here is a brief introduction to commonly used log types:

1. Operation log

Print GaussDB T database running information, if the database fails, please check zengine.rlog.

Log directory: default to " $GSDB_DATA/log/run/zengine.rlog" or the path run subdirectory corresponding to the parameter log_home. If you want to modify its path, restart takes effect.

CN node:

DN node:

View the log as follows:

2. Slow query log

Prints SQL information for GaussDB 100 database execution times exceeding the threshold (controlled by the LONGSQL_TIMEOUT parameter) to the zengine.lsql log file.

Log directory: Default is " $GSDB_DATA/log/longsql/zengine.lsql".

3. Alarm log

Print GaussDB 100 database operation alarm information. For alarm information, see zenith_alarm.log.

Log directory: " $GSDB_DATA/log/zenith_alarm.log".

4. Operation log

Record user operation information on GaussDB 100 database through ZSQL tool. If you need to know about the action log, check zsql.olog.

Log directory: " $GSDB_DATA/log/oper/zsql.olog".

5. TRACE log

Records information about database session deadlocks. For session deadlock information, see zengine_00003_xx.trc.

Log directory: " $GSDB_DATA/trc/zengine_00003_xx.trc".

Common error codes:

GS-00716：Found %s deadlock in session (%u)

Error cause: concurrent cross-operation of the same batch of data in different sessions, resulting in deadlock.

Solution:

View trace log or run log (deadlock log location varies according to database version);

According to the specific information recorded in the log, including deadlock types, SQL statements, etc., check the business statements.

GS-00715：The snapshot was outdated.

Error cause: snapshot is too old.

Solution:

rerun SQL;

Optimize or split long-running, costly SQL.

GS-00713：No free undo page

Error cause: UNDO table space is insufficient.

Solution:

Increase the size of the UNDO table space;

Release UNDO by killing large transactions.

GS-00305：%s timeout

Error cause: network api timeout.

Solution:

Please ensure that the host network is normal.

GS-00774：Failover in progress, can not be connected

Error cause: When the standby is doing failover, the log sending thread of the host connects to the standby.

Solution:

Stop the main engine, wait for the standby engine to be upgraded to the main engine, and then reduce the original engine to standby.

GS-00839：Flush redo file:%s, offset:%u, size:%lu failed

Error cause: Failed to write redo log file, usually file system or disk problems.

Solution:

Check the operating system or disk.

GaussDB T database maintenance work is a lot, in addition to the above daily must do things, there are session connection failure, buffer flush failure, CN/DN node status exception, CM Server node status exception, active and standby DN node log synchronization delay is too large and so on. Many of these alerts can be processed using Database Manager analytics or implemented using scripts we develop ourselves.

The goal of maintenance is to make the system more stable, and the simpler the maintenance, the less likely the maintainer is to make mistakes.

The above is what is the maintenance work of GaussDB T distributed cluster database shared by Xiaobian. If there is a similar doubt, please refer to the above analysis for understanding. If you want to know more about it, please pay attention to the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.