[RAC] basic knowledge related to RAC

1. Introduction to CRS

Starting with Oracle 10g, Oracle provides its own complete cluster management solution, Cluster Ready Services (CRS), which includes frameworks for cluster connectivity, messaging and locking, load management, and so on. RAC can therefore run without third-party clusterware, although CRS can also be used together with third-party clusterware.

(1) CRS processes

CRS consists mainly of three components, all of which run as daemons:

CRSD: the main engine for resource high availability. It performs high availability recovery and management operations, such as maintaining OCR and managing application resources, and it stores cluster state information and the OCR configuration. This process runs as root.

EVMD: the event management daemon. It is also responsible for starting the racgevt process to handle FAN server-side callouts. It runs as root.

OCSSD: the cluster synchronization service daemon. It manages cluster node membership and is started in fatal mode, so a failure of this process causes the node to restart in order to prevent data corruption. CSS also provides the basic locking function within the cluster and monitors the voting disk to guard against split-brain. It runs as the oracle user.

In addition, there is the OPROCD process, the cluster process monitor, which appears only when the platform does not use vendor clusterware. No matter how many instances are running, there is only one set of these background processes per node.

Take a look at these daemons in /etc/inittab:

rac1-> cat /etc/inittab
......
# Run xdm in runlevel 5
x:5:respawn:/etc/X11/prefdm -nodaemon
h2:35:respawn:/etc/init.d/init.evmd run >/dev/null 2>&1 </dev/null
h3:35:respawn:/etc/init.d/init.cssd fatal >/dev/null 2>&1 </dev/null
h4:35:respawn:/etc/init.d/init.crsd run >/dev/null 2>&1 </dev/null

A simple script, date.sh, is also used here to keep the node clocks synchronized:

rac1-> cat date.sh
#!/bin/sh
while true
do
  rdate -s rac2 >/dev/null 2>&1
  sleep 10
done

3. RAC management and maintenance

Compared with a single instance, the management and maintenance of RAC is a little more complicated. 10g provides a powerful EM management tool that makes much of this work simple and GUI-driven, and we should get used to using EM to work more efficiently. The following parts of this article do not discuss EM management for the time being and focus on the command-line approach.

1. CRS management and maintenance

(1) CRS-related command-line interfaces

CRS plays an important role in a 10g RAC system, and Oracle provides several command-line interfaces for diagnosing and maintaining it.

CRS_*

Under 10g RAC, there is a set of crs_* commands for maintaining CRS resources.

[root@rac2 bin]# ls $ORA_CRS_HOME/bin | grep "crs_" | grep -v bin

crs_getperm crs_profile crs_register crs_relocate crs_setperm crs_start crs_stat crs_stop crs_unregister

Let's go through them one by one.

Cluster resource query: CRS_STAT

It can be used to view the health status of the resources on each node in the RAC, the attributes of the resources, and so on.

For example, use the -t option to check the status of the resources:

[root@rac1 ~]# crs_stat -t

Name           Type        Target   State    Host
------------------------------------------------------
ora.demo.db    application ONLINE   ONLINE   rac2
ora....o1.inst application ONLINE   ONLINE   rac1
ora....o2.inst application ONLINE   ONLINE   rac2
ora....SM1.asm application ONLINE   ONLINE   rac1
ora....C1.lsnr application ONLINE   ONLINE   rac1
ora.rac1.gsd   application ONLINE   ONLINE   rac1
ora.rac1.ons   application ONLINE   ONLINE   rac1
ora.rac1.vip   application ONLINE   ONLINE   rac1
ora....SM2.asm application ONLINE   ONLINE   rac2
ora....C2.lsnr application ONLINE   ONLINE   rac2
ora.rac2.gsd   application ONLINE   ONLINE   rac2
ora.rac2.ons   application ONLINE   ONLINE   rac2
ora.rac2.vip   application ONLINE   ONLINE   rac2

Use the -p option to get the resource configuration attributes.

[root@rac2 bin]# crs_stat -p ora.rac2.vip

NAME=ora.rac2.vip

TYPE=application

ACTION_SCRIPT=/opt/oracle/product/10.2.0/crs_1/bin/racgwrap

ACTIVE_PLACEMENT=1

AUTO_START=1

CHECK_INTERVAL=60

DESCRIPTION=CRS application for VIP on a node

...

USR_ORA_STOP_MODE=immediate

USR_ORA_STOP_TIMEOUT=0

USR_ORA_VIP=192.168.18.112

Use the -ls option to view resource permissions.

[root@rac2 bin]# crs_stat -ls | grep vip

ora.rac1.vip root oinstall rwxr-xr--
ora.rac2.vip root oinstall rwxr-xr--

The main options are -t/-v/-p/-ls/-f, etc. For more information, see crs_stat -h.

Cluster resources start / stop CRS_START/CRS_STOP

This set of commands is responsible for starting / stopping resources. They can act on all resources (for example, crs_stop -all stops the resources on all nodes) or on a specific resource on a node (for example, crs_start ora.rac2.ons starts the ONS on node rac2).
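For example (a minimal sketch, reusing the resource names from the crs_stat output above):

[root@rac1 ~]# crs_stop -all
[root@rac1 ~]# crs_start -all
[root@rac1 ~]# crs_stop ora.rac2.ons
[root@rac1 ~]# crs_start ora.rac2.ons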

Cluster resource configuration CRS_REGISTER/CRS_UNREGISTER/CRS_PROFILE/CRS_SETPERM

This set of commands is mainly responsible for adding, deleting and configuring cluster resources.

CRS_PROFILE: used to generate the profile file for a resource (it can also be edited by hand or generated from an existing one). The default path is the $ORA_CRS_HOME/crs/profile directory; add the -dir parameter to specify a directory manually. The default file name is resource_name.cap.

crs_profile -create resource_name -t application -a ... -r ... -o ...
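For example, a profile for a hypothetical user-defined application resource might be created like this (the resource name, action script path and -o attribute values are made up for illustration):

[root@rac1 ~]# crs_profile -create myapp -t application -a /opt/oracle/scripts/myapp.sh -r ora.rac1.vip -o ci=30,ra=3
[root@rac1 ~]# cat $ORA_CRS_HOME/crs/profile/myapp.cap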

Table 3.1 lists the main configurable parameters of crs_profile (there are more; only some are shown). The third column shows the option used with "crs_profile -create resource_name"; the -o attributes are given as name=value pairs.

Parameter name      Description                                            How to set it (create example)
NAME                Resource name                                          crs_profile -create resource_name
TYPE                Resource type (application, generic)                   -t ...
ACTION_SCRIPT       Script used to manage the HA scenario                  -a ...
ACTIVE_PLACEMENT    Placement / node preference of the resource            -o ap=...
AUTO_START          Whether the resource starts automatically              -o as=...
CHECK_INTERVAL      Resource monitoring interval                           -o ci=...
FAILOVER_DELAY      Wait time before resource failover                     -o fd=...
FAILURE_INTERVAL    Interval between resource restart attempts             -o fi=...
FAILURE_THRESHOLD   Number of resource restart attempts (up to 20)         -o ft=...
HOSTING_MEMBERS     Preferred nodes for resource startup or failover       -h ...
PLACEMENT           Node selection policy for startup or failover          -p ...
                    (balanced, favored, restricted)
REQUIRED_RESOURCES  Resources that the current resource depends on         -r ...
RESTART_ATTEMPTS    Restart attempts before the resource is relocated      -o ra=...
SCRIPT_TIMEOUT      Time to wait for ACTION_SCRIPT to return               -o st=...
USR_ORA_VIP         VIP address                                            crs_profile -create vip_name -t application
                                                                           -a $ORA_CRS_HOME/bin/uservip -o oi=...,ov=...,on=...

crs_profile -update resource_name ... is used to update an existing profile (it only updates the profile file; it does not change the attributes of a resource that is already registered in CRS).

crs_register is responsible for registering resources in OCR. The way to register is to generate the profile first, then run the

crs_register resource_name [-dir ...] command. crs_register can also update a resource: update the profile file corresponding to the resource and then run crs_register -u resource_name [-dir ...], or directly issue crs_register -update resource_name ...

For example, change the VIP on node rac1 to start manually:

[root@rac1 crs]# crs_register -update ora.rac1.vip -o as=0

[root@rac1 crs]# crs_stat -p ora.rac1.vip | grep AUTO_START

AUTO_START=0

crs_unregister is responsible for removing a resource from OCR. Add the -f parameter to force it if necessary.

crs_setperm is used to set the permissions of a resource (such as its owner and the read/write rights of users): -o changes the owner, -g changes the group, and -u changes a user's rights. No detailed examples are given here, but see the sketch below.
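A minimal, hypothetical sketch of crs_setperm usage, assuming the option letters described above (-o owner, -g group, -u user ACL):

[root@rac1 ~]# crs_setperm ora.rac1.vip -o oracle
[root@rac1 ~]# crs_setperm ora.rac1.vip -g dba
[root@rac1 ~]# crs_setperm ora.rac1.vip -u user:oracle:r-x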

CRSCTL

Check the health of crs with crsctl check crs.

[root@rac1 ~] # crsctl check crs

CSS appears healthy

CRS appears healthy

EVM appears healthy

Using crsctl to control CRS service

Crsctl start | stop | enable | disable crs

Start / stop resource with crsctl

[root@rac1 ~] # crsctl stop resources

Stopping resources.

Successfully stopped CRS resources

[root@rac1 ~] # crsctl start resources

Starting resources.

Successfully started CRS resources

crsctl can also be used to check, add and remove voting disks; these operations are covered in the voting disk section below.

See crsctl help for more.

SRVCTL

SRVCTL is a powerful administration and configuration tool for CRS and the RDBMS. For usage details, refer to srvctl -h; a few typical invocations are sketched after the list below.

(1) srvctl add/delete ...: add and delete resources. For example, when migrating a single-instance database to RAC, this tool can be used to manually register the database or ASM instances in OCR.

(2) srvctl status ...: monitor the state of resources.

(3) srvctl start/stop ...: start / stop resources; can be used interchangeably with crs_start/crs_stop.

(4) srvctl modify ...: redefine the attributes of a resource.

...
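A few typical invocations, as a sketch (the database name demo and the node names follow the earlier examples):

rac1-> srvctl status database -d demo
rac1-> srvctl stop instance -d demo -i demo1
rac1-> srvctl start asm -n rac1
rac1-> srvctl status nodeapps -n rac1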

(2) Management and maintenance of OCR

OCR status verification:

You can use the ocrcheck tool to verify the status and space usage of OCR. Under Linux, the /etc/oracle/ocr.loc file records the devices used by OCR.

[root@rac1] # ocrcheck

Status of Oracle Cluster Registry is as follows:

Version: 2

Total space (kbytes): 497896

Used space (kbytes): 3996

Available space (kbytes): 493900

ID: 958197763

Device/File Name: / dev/raw/raw5

Device/File integrity check succeeded

Device/File not configured

Cluster registry integrity check succeeded

Add / remove an ocrmirror online

OCR supports one mirror copy. Adding / removing the mirror can be done online by executing the command on an online node.

[root@rac1]# ocrconfig -replace ocrmirror /dev/raw/raw5

[root@rac1 oracle] # cat / etc/oracle/ocr.loc

# Device/file getting replaced by device / dev/raw/raw5

Ocrconfig_loc=/dev/raw/raw1

Ocrmirrorconfig_loc=/dev/raw/raw5

It can be seen that ocr.loc is updated automatically.

To remove the OCR or its mirror, simply run ocrconfig -replace without specifying a device path.

When a CRS has both an OCR and a mirror and the OCR is removed, the mirror is automatically promoted to the OCR role.

[root@rac1]# ocrconfig -replace ocr

[root@rac1] # cat / etc/oracle/ocr.loc

# Device/file / dev/raw/raw1 being deleted

Ocrconfig_loc=/dev/raw/raw5

You can see that the current ocrconfig_loc automatically becomes the previous ocrmirrorconfig_loc device.

Logical backup / restore

Backup command:

ocrconfig -export [path]

Restore command:

ocrconfig -import [path]

When restoring OCR, you need to stop the CRS service on every node. When the restore is complete, restart CRS. (If necessary, remember to modify the device recorded in ocr.loc on each node.)
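A rough sketch of the sequence, assuming a hypothetical backup path:

[root@rac1 ~]# ocrconfig -export /backup/ocr_logical.bak
(stop CRS on every node: crsctl stop crs)
[root@rac1 ~]# ocrconfig -import /backup/ocr_logical.bak
(restart CRS on every node: crsctl start crs)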

Physical backup / restore

CRSD backs up OCR automatically every 4 hours. The default backup path is $ORA_CRS_HOME/cdata/<cluster name>.

You can use ocrconfig -showbackup to list the backups; to change the physical backup location, use ocrconfig -backuploc [path].

Physical recovery command:

ocrconfig -restore [path]

Similarly, when restoring OCR you need to stop the CRS service on every node, and restart CRS when the restore is complete. (If necessary, remember to modify the device recorded in ocr.loc on each node.)
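For example (a sketch; the backup file name is only illustrative of what ocrconfig -showbackup typically lists):

[root@rac1 ~]# ocrconfig -showbackup
[root@rac1 ~]# ocrconfig -restore $ORA_CRS_HOME/cdata/crs/backup00.ocr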

Ocrdump

ocrdump can export the OCR contents to an ASCII text file, which is used to provide diagnostic information to Oracle Support.

The command is as follows:

ocrdump

(3) Voting disk management and maintenance

The maintenance of Voting disk is relatively simple.

Votingdisk status query

[root@rac1] # crsctl query css votedisk

0 / dev/raw/raw2

Located 1 votedisk (s).

Add and remove votingdisk online

Oracle recommends configuring an odd number of voting disks. Adding / removing them can be done online by executing the command on an online node.

Add a voting disk:

crsctl add css votedisk [path] -force

Delete a voting disk:

crsctl delete css votedisk [path] -force

Votingdisk backup recovery

The dd command is used for backup and recovery. When restoring, remember to stop the CRS service on every node.
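A sketch of voting disk backup and recovery with dd, using the device shown above and a hypothetical backup file:

[root@rac1 ~]# dd if=/dev/raw/raw2 of=/backup/votedisk.bak
(to restore, with CRS stopped on all nodes)
[root@rac1 ~]# dd if=/backup/votedisk.bak of=/dev/raw/raw2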

2. RDBMS management and maintenance

(1) spfile and related parameters

Most commonly, the nodes share a single spfile placed on shared storage, while each node has a pfile in the corresponding local directory that simply points to the spfile on shared storage.

When we need to modify a parameter on one node only, we have to specify the sid explicitly, for example:

SQL > alter system set sga_target=1024M scope=spfile sid='rac1'

System Altered.

In this way, the sga_target parameter is modified only for instance rac1 without affecting the settings of the other nodes. If sid is not specified, the default is sid='*', which takes effect on all nodes.

Under RAC, some parameters differ from a single instance, as listed below (a combined pfile-style example follows the list):

① cluster_database

In general, this parameter should be set to true for each instance of rac. In some special cases, such as upgrade, this parameter needs to be set to false.

② db_name/db_unique_name/instance_name

The db_name of each node must be the same, and so must db_unique_name (this differs from a standby configuration). instance_name is set to the instance name of each node.

③ instance_number

This parameter represents the instance number of the instance on the node.

④ thread

This parameter is used to indicate the redo thread used by the instance. The thread number is not directly associated with the node number / instance number.

⑤ local_listener

This parameter is used to register with the listener manually. It can be set to resolve ORA-12514 errors.

⑥ remote_listener

This parameter is used for server-side load balancing configuration.

⑦ cluster_interconnects

This parameter specifies the network used for IPC (interconnect) communication in the cluster. If there are multiple high-speed interconnect networks in the cluster, you need to configure this parameter, separating multiple IP addresses with colons. The interconnect addresses currently in use can be queried from the view gv$cluster_interconnects or viewed with oradebug ipc.

⑧ max_commit_propagation_delay

This parameter controls the SCN generation scheme. Under RAC there are two SCN synchronization modes: (1) Lamport Scheme, in which GES manages the propagation and synchronization of SCNs, and max_commit_propagation_delay represents the maximum time allowed for SCN synchronization. In this mode the global SCN is not fully synchronized, which may have some impact on applications in highly concurrent OLTP systems. (2) Broadcast on Commit scheme, in which as soon as a transaction commits on any instance, the SCN is immediately synchronized globally.

Under 10g R1 the default value of this parameter is 700, i.e. the Lamport Scheme. Under 10g R2 the default value is 0, i.e. the Broadcast on Commit scheme (any value less than 700 selects this mode). You can confirm which mode is in use from alert.log. The value must be the same on all nodes.
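A combined pfile-style sketch of how these instance-specific parameters are typically set (the values, tablespace names and listener aliases are illustrative and follow the demo database used later in this article):

*.cluster_database=true
*.db_name='demo'
demo1.instance_name='demo1'
demo2.instance_name='demo2'
demo1.instance_number=1
demo2.instance_number=2
demo1.thread=1
demo2.thread=2
demo1.undo_tablespace='UNDOTBS1'
demo2.undo_tablespace='UNDOTBS2'
demo1.local_listener='LISTENER_DEMO1'
demo2.local_listener='LISTENER_DEMO2'
*.remote_listener='LISTENERS_DEMO'
demo1.cluster_interconnects='10.10.10.145'
demo2.cluster_interconnects='10.10.10.146'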

(2) Redo/Undo management

Redo management under RAC

As with a single-instance system, each node instance needs at least 2 groups of logfiles. Each node instance has its own independent redo log thread (defined by the initialization parameter thread), for example:

SQL> select b.THREAD#, a.GROUP#, a.STATUS, a.MEMBER, b.BYTES, b.ARCHIVED, b.STATUS
     from v$logfile a, v$log b where a.GROUP#=b.GROUP#;

THREAD# GROUP# STATUS MEMBER                                       BYTES    ARCHIVED STATUS
------- ------ ------ -------------------------------------------- -------- -------- --------
      1      1 STALE  +DATA/demo/onlinelog/group_1.257.660614753   52428800 YES      INACTIVE
      1      2        +DATA/demo/onlinelog/group_2.258.660614755   52428800 NO       CURRENT
      2      3        +DATA/demo/onlinelog/group_3.265.660615545   52428800 NO       CURRENT
      2      4 STALE  +DATA/demo/onlinelog/group_4.266.660615543   52428800 YES      INACTIVE

Redo logs must be deployed on shared storage and must be accessible to all node instances in the cluster. When a node instance performs instance / media recovery, that instance must be able to apply the redo log files of all node instances in the cluster (if necessary), thus ensuring that recovery can be performed from any available node.

The difference between alter system switch logfile and alter system archive log current under RAC:

alter system switch logfile only switches and archives the redo thread on the node where the command is issued.

alter system archive log current switches and archives the redo threads of all node instances in the cluster (when a node instance is available it archives to that node's own archive destination; when a node is not available, its thread is archived to the archive destination of the node issuing the command).

Undo management under RAC

Each node instance under RAC also needs its own separate undo tablespace, specified by the initialization parameter *.undo_tablespace. Like redo, undo tablespaces must be deployed on shared storage. Although each node uses its undo independently, it must be accessible to the other node instances in the cluster in order to construct read-consistent (CR) blocks.

SQL > alter system set undo_tablespace=undo1 sid='demo1'

SQL > alter system set undo_tablespace=undo2 sid='demo2'

(3) Archivelog / flashback configuration management

Under RAC, archivelogs can be placed on local disks or on shared storage. Their placement needs to be planned sensibly; placing them on local disks increases the complexity of backup and recovery.

The flash recovery area must be deployed on shared storage. Before enabling it, you need to configure parameters such as db_recovery_file_dest, db_recovery_file_dest_size, db_flashback_retention_target, and so on.

Let's enable archiving and flashback on a database that currently has neither.

Change the related parameters:

SQL > alter system set log_archive_dest_1='location=/archive/demo1' sid='demo1'

System altered

SQL > alter system set log_archive_dest_1='location=/archive/demo2' sid='demo2'

System altered

SQL > alter system set db_recovery_file_dest_size=512M

System altered

SQL > alter system set db_recovery_file_dest='+DG1'

System altered

Stop all node instances. The enabling process is completed on one instance.

rac1-> srvctl stop instance -d demo -i demo1

rac1-> srvctl stop instance -d demo -i demo2

Rac1- > sqlplus / nolog

SQL*Plus: Release 10.2.0.1.0-Production on Sun Aug 3 22:06:50 2008

Copyright (c) 1982, 2005, Oracle. All rights reserved.

SQL > conn / as sysdba

Connected to an idle instance.

SQL > startup mount

ORACLE instance started.

Total System Global Area 167772160 bytes

Fixed Size 1218316 bytes

Variable Size 100665588 bytes

Database Buffers 62914560 bytes

Redo Buffers 2973696 bytes

Database mounted.

SQL > alter database archivelog

Database altered.

SQL > alter database flashback on

Database altered.

SQL > alter database open

Database altered.

SQL > select NAME,LOG_MODE,FLASHBACK_ON from v$database

NAME LOG_MODE FLASHBACK_ON

DEMO ARCHIVELOG YES

Under 10g, it is no longer necessary to set the initialization parameter cluster_database=false as in 9i in order to enable archiving and flashback, which undoubtedly simplifies the operation.

(4) RAC Management under ASM

Parameter files under ASM

Under RAC, an ASM instance runs on each node, and the RDBMS instance on that node depends on it. The ASM instance is local; like an RDBMS instance, it needs a parameter file, which lives in the corresponding directory on each node.

The following is the pfile file under my ASM instance:

cluster_database=true
background_dump_dest=/opt/oracle/admin/+ASM/bdump
core_dump_dest=/opt/oracle/admin/+ASM/cdump
user_dump_dest=/opt/oracle/admin/+ASM/udump
instance_type=asm
large_pool_size=12M
remote_login_passwordfile=exclusive
asm_diskgroups='DG1'
+ASM2.instance_number=2
+ASM1.instance_number=1

A brief introduction to several important parameters of an ASM instance:

instance_type: indicates whether the instance is of type ASM or RDBMS.

asm_diskgroups: the ASM disk groups to mount automatically when the ASM instance starts.

asm_diskstring: describes the disk devices on which disk groups can be created. The default is NULL.

asm_power_limit: sets the number of ARBx processes and controls the speed of rebalance operations. Values range from 0 to 11; the default is 1.
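For example, these can be adjusted like any other parameter (a sketch, assuming an spfile is in use; the disk string pattern is illustrative):

SQL> alter system set asm_diskstring='/dev/raw/raw*' scope=spfile sid='*';
SQL> alter system set asm_power_limit=5;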

Data dictionary views that record ASM instance information:

V$ASM_DISK / V$ASM_DISK_STAT: record the disks that the ASM instance can discover; these disks are not necessarily in use by the instance.

V$ASM_DISKGROUP / V$ASM_DISKGROUP_STAT: record the disk group information under ASM.

V$ASM_ALIAS: records alias information for files in the disk groups.

V$ASM_FILE: records the files in the disk groups.

V$ASM_OPERATION: records long-running operations currently executing in the ASM instance.

V$ASM_TEMPLATE: records the disk group templates.

V$ASM_CLIENT: records the RDBMS instances that use the disk groups of this ASM instance.
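For example, a quick check of which database instances are using this ASM instance (a sketch):

SQL> select group_number, instance_name, db_name, status from v$asm_client;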

ASM disk group / file management under RAC

Adding and deleting disk groups online under RAC

If you create a disk group on one node, the other nodes in the cluster will not automatically mount the newly created disk group; this has to be done manually.

Node 1:

SQL > show parameter asm_diskgroups

NAME TYPE VALUE

Asm_diskgroups string DATA, DG1

SQL> CREATE DISKGROUP DATA2 NORMAL REDUNDANCY
     FAILGROUP DATA2_gp1 DISK '/dev/raw/raw6' FAILGROUP DATA2_gp2 DISK '/dev/raw/raw7';

Diskgroup created.

SQL > show parameter asm_diskgroups

NAME TYPE VALUE

Asm_diskgroups string DATA, DG1, DATA2

At this point, observe node 2: the newly created disk group is not mounted there.

SQL > show parameter asm_diskgroups

NAME TYPE VALUE

Asm_diskgroups string DATA, DG1

SQL> select group_number, state, type, total_mb, free_mb from v$asm_diskgroup_stat;

GROUP_NUMBER STATE       TYPE   TOTAL_MB FREE_MB
------------ ----------- ------ -------- -------
           1 CONNECTED   EXTERN     5726    4217
           2 CONNECTED   EXTERN      415     297
           0 DISMOUNTED                0       0

SQL> alter diskgroup DATA2 mount;

When deleting a disk group, leave it mounted on one node, dismount it on all the other nodes, and then execute the drop command on the node where it is still mounted (see the sketch below).
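A sketch of the sequence, using the DATA2 disk group created above:

-- on all other nodes
SQL> alter diskgroup DATA2 dismount;
-- on the node where it remains mounted
SQL> drop diskgroup DATA2 including contents;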

Adding and deleting disks online

Adding and deleting a disk online under RAC is no different from a single instance. Note that these actions cause the disk group to rebalance, and when deleting a disk make sure the disk group still has enough free space.

Node 1:

SQL> alter diskgroup dg6 add disk '/dev/raw/raw7' name dg6_disk7;

Diskgroup altered.

Query on Node 2:

SQL> select GROUP_NUMBER, path, NAME, MOUNT_STATUS, HEADER_STATUS, MODE_STATUS, STATE from v$asm_disk_stat where NAME is not null;

GROUP_NUMBER PATH NAME MOUNT_S HEADER_STATU MODE_ST STATE

--

1 / dev/raw/raw3 DATA_0000 CACHED MEMBER ONLINE NORMAL

2 / dev/raw/raw4 DG1_0000 CACHED MEMBER ONLINE NORMAL

3 / dev/raw/raw6 DG6_0001 CACHED MEMBER ONLINE NORMAL

3 / dev/raw/raw7 DG6_DISK7 CACHED MEMBER ONLINE NORMAL

Deleting a disk is done the same way, on whichever node; no example is given here.

More ASM administration commands are not enumerated here.

3. Database backup / restore

Backup and recovery under RAC is essentially not much different from that of a single instance. The point to note is whether all data files / archive logs are visible to the node performing the backup / restore. In a RAC system where data files and archive logs are all on shared storage, the backup and recovery process is almost the same as for a single instance. If the archive logs are on local disks, extra care is needed. Let's simulate both backup and restore scenarios.

(1) Backup / restore when archivelogs are visible to every node

In this mode, backup and restore can be performed on any available node and is not much different from a single instance.

Back up the database:

RMAN> run {
  allocate channel orademo type disk;
  backup database format '/backup/database/db_%s_%p_%t'
    plus archivelog format '/backup/database/arch_%s_%p_%t' delete input;
  backup current controlfile format '/backup/database/contr_%s_%p_%t';
}

Allocated channel: orademo

Channel orademo: sid=130 instance=demo2 devtype=DISK

Starting backup at 03-MAY-08

Current log archived

Channel orademo: starting archive log backupset

Channel orademo: specifying archive log (s) in backup set

Input archive log thread=1 sequence=5 recid=70 stamp=661823848

Input archive log thread=1 sequence=6 recid=72 stamp=661823865

... ..

Finished backup at 03-MAY-08

Released channel: orademo

Add data to test the effect of recovery:

SQL > create table kevinyuan.test_b as select * from dba_data_files

Table created

SQL > alter system switch logfile

System altered

SQL > insert into kevinyuan.test_b select * from dba_data_files

6 rows inserted

SQL > commit

Commit complete

SQL > select count (*) from kevinyuan.test_b

COUNT(*)
----------
        12

Simulate failure / recovery:

RMAN> run {
  restore controlfile from '/backup/database/contr_16_1_661823935';
  sql 'alter database mount';
  restore database;
  recover database;
  sql 'alter database open resetlogs';
}

Starting restore at 04-MAY-08

Allocated channel: ORA_DISK_1

. ..

Archive log filename=+DATA/demo/onlinelog/group_4.266.660615543 thread=2 sequence=11

Archive log filename=+DATA/demo/onlinelog/group_3.265.660615545 thread=2 sequence=12

Media recovery complete, elapsed time: 00:00:00

Finished recover at 04-MAY-08

Sql statement: alter database open resetlogs

After the recovery, verify the data:

SQL> select count(*) from kevinyuan.test_b;

COUNT(*)
----------
        12

(2) Backup / restore when archivelogs are not visible to every node

If archivelogs are kept on local disks, each archive log is visible only to its own node. When backing up archivelogs with a scheme like the one above, some archive logs would inevitably raise errors because they cannot be accessed. The backup method below allocates a channel on each node so that every archive log recorded in the data dictionary can be reached.

For recovery, copy the backup pieces/sets and the archive log files generated on the other nodes to the node to be restored; the restore/recover operation can then be performed on that one node.

Simulate this operation.

SQL > alter system set log_archive_dest_1='location=/archive/demo1/' sid='demo1'

System altered

SQL > alter system set log_archive_dest_1='location=/archive/demo2/' sid='demo2'

System altered

(1) Back up the database

RMAN> run {
  allocate channel orademo1 type disk connect sys/kevinyuan@demo1;
  allocate channel orademo2 type disk connect sys/kevinyuan@demo2;
  backup database format '/backup/database/db_%s_%p_%t'
    plus archivelog format '/backup/database/arch_%s_%p_%t' delete input;
  backup current controlfile format '/backup/database/contr_%s_%p_%t';
}

allocated channel: orademo1
channel orademo1: sid=133 instance=demo1 devtype=DISK
allocated channel: orademo2
channel orademo2: sid=151 instance=demo2 devtype=DISK

Starting backup at 04-MAY-08
current log archived
channel orademo2: starting archive log backupset
channel orademo2: specifying archive log(s) in backup set
input archive log thread=2 sequence=4 recid=89 stamp=661826286
......
channel orademo1: finished piece 1 at 04-MAY-08
piece handle=/backup/database/contr_28_1_661826504 tag=TAG20080504T004130 comment=NONE
channel orademo1: backup set complete, elapsed time: 00:00:09
Finished backup at 04-MAY-08
released channel: orademo1
released channel: orademo2

(2) Copy the backup pieces / archive log files from node 2 into the corresponding directories on node 1.

rac2-> scp /backup/database/* rac1:/backup/database/

rac2-> scp /archive/demo2/* rac1:/archive/demo1

(3) Restore the database

RMAN> run {
  restore controlfile from '/backup/database/contr_28_1_661826504';
  sql 'alter database mount';
  restore database;
  recover database;
  sql 'alter database open resetlogs';
}

Starting restore at 04-MAY-08
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=147 instance=demo1 devtype=DISK
channel ORA_DISK_1: restoring control file
channel ORA_DISK_1: restore complete, elapsed time: 00:00:20
......
archive log filename=+DATA/demo/onlinelog/group_3.265.660615545 thread=2 sequence=7
archive log filename=+DATA/demo/onlinelog/group_4.266.660615543 thread=2 sequence=8
media recovery complete, elapsed time: 00:00:06
Finished recover at 04-MAY-08
sql statement: alter database open resetlogs

At this point, the recovery is complete.

Backing up a production database requires careful planning and rehearsal, and different types of databases need different schemes. For a database, nothing is more important than backups; never take chances.

Service, Failover and Load Balance

1. Service

Service is a very important concept in a RAC system; it provides highly available and diversified solutions for applications. In practice, we can create different kinds of services to meet the different needs of our applications.

Under 10gR2, services can be created in the following ways.

(1) Using dbca

(2) Using srvctl

node1-> srvctl add service -d demo -s srv_1 -r node1 -a node2

node1-> srvctl start service -d demo -s srv_1

node1-> crs_stat -t

Name           Type        Target   State    Host
ora.demo.db    application ONLINE   ONLINE   node1
ora....o1.inst application ONLINE   ONLINE   node1
ora....o2.inst application ONLINE   OFFLINE
ora....rv_1.cs application ONLINE   ONLINE   node1
ora....mo1.srv application ONLINE   ONLINE   node1

SQL > show parameter service

NAME TYPE VALUE

Service_names string demo,srv_1

(3) Using the dbms_service package

10g provides dbms_service for managing services and extending their functionality.

SQL> EXEC DBMS_SERVICE.CREATE_SERVICE(SERVICE_NAME => 'srv_2', NETWORK_NAME => 'srv_2');

PL/SQL procedure successfully completed

SQL> exec DBMS_SERVICE.START_SERVICE(service_name => 'srv_2', instance_name => 'demo1');

PL/SQL procedure successfully completed

SQL > show parameter service

NAME TYPE VALUE

Service_names string demo,srv_2

(4) Other methods

Whichever way is used, the essence is to modify service_names so that the service is dynamically registered with the listener.

2. Failover and load balance

RAC provides high-performance and highly available services for applications. From the user's point of view, the core functions are failover and load balance.

(1) Failover

In the 10gR2 version, there are two ways to implement Failover, one is TAF (Transparent Application Failover), the other is FCF (Fast Connection Failover).

TAF and its implementation:

TAF is a transparent failover at the net layer and is a passive failover mechanism that depends on the VIP. TAF policies can be configured on either the client side or the server side.

Client-side TAF configuration

Here is a simple tnsnames.ora entry with TAF enabled:

Demo =

(DESCRIPTION =

(FAILOVER=ON)

(ADDRESS= (PROTOCOL=TCP) (HOST=10.194.129.145) (PORT=1521))

(ADDRESS= (PROTOCOL=TCP) (HOST=10.194.129.146) (PORT=1521))

(CONNECT_DATA =

(SERVICE_NAME = demo)

(SERVER=DEDICATED)

(FAILOVER_MODE= (TYPE=SELECT)

(METHOD=BASIC)

(RETRIES=50)

(DELAY=5)

)

)

)

Parameters that control the TAF policy:

FAILOVER   Failover switch (on/off). If off, failover is not provided, but the addresses in the list are still tried in turn when connecting until an available one is found.

TYPE       Two types: session / select.
           session: provides session-level failover.
           select: provides select-level failover, transparent to query statements; transactions in progress still have to be rolled back.

METHOD     Two types: basic / preconnect.
           basic: the client connects to only one node at a time and jumps to another node on failover.
           preconnect: used together with BACKUP; the client connects to both the primary and the backup node.

BACKUP     The backup connection configuration used in preconnect mode.

RETRIES    Number of retries on failover.

DELAY      Interval between retries on failover.
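To verify which TAF policy a session is actually using, a query like the following can be run (a sketch):

SQL> select username, failover_type, failover_method, failed_over from v$session where username is not null;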

Server-side TAF configuration

10gR2 also allows TAF to be configured on the server side; you need to call the dbms_service package on the instance to make the change.

SQL> exec dbms_service.modify_service(service_name => 'DEMO', failover_method => 'BASIC', failover_type => 'SELECT', failover_retries => 180, failover_delay => 5);

The client connection string can be modified as follows:

Demo =

(DESCRIPTION =

(ADDRESS= (PROTOCOL=TCP) (HOST=10.194.129.145) (PORT=1521))

(ADDRESS= (PROTOCOL=TCP) (HOST=10.194.129.146) (PORT=1521))

(CONNECT_DATA =

(SERVICE_NAME = demo)

(SERVER=DEDICATED)

)

)

FCF and its implementation

FCF is a new failover mechanism introduced in 10g. It relies on the ONS process on each node, which broadcasts FAN events describing what is happening on each node; it is a proactive mechanism and supports JDBC/OCI/ODP.NET.

(1) ONS configuration

Use the onsctl tool to configure the local / remote nodes and ports of each node. The configuration file path is $ORACLE_HOME/opmn/ons.config.

Use onsctl debug to check whether the ons process is running properly.
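A rough sketch of what an ons.config typically contains (the port numbers are illustrative):

localport=6113
remoteport=6200
nodes=rac1:6200,rac2:6200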

(2) Configure the connection pool (taking JDBC as an example)

The connection pool needs to support the Implicit Connection Cache, and FastConnectionFailoverEnabled must be set to true.

ojdbc14.jar / ons.jar etc. are added to the CLASSPATH. Specific code can be found in the online documentation or the related Metalink notes.

(2) Load Balance

Load balancing in 10g is a great functional improvement over previous versions: based on the Load Balancing Advisory, it offers a Runtime Connection Load Balancing strategy. I think there is also a trade-off here: the finer-grained the load balancing mechanism, the more likely it is to aggravate cache fusion, which tests the overall performance of the RAC.

There are two main ways to implement load balancing: Connection Load Balancing (CLB) and Runtime Connection Load Balancing (RCLB).

CLB is divided into client-side and server-side.

Client-side CLB is implemented by adding LOAD_BALANCE=ON to tnsnames.ora and provides a load balancing scheme based on the number of connections.

Server-side CLB requires modifying the remote_listener parameter so that each listener can see all the nodes in the cluster and collect node load information through PMON.
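A sketch of a typical server-side setup, assuming a tnsnames.ora alias named LISTENERS_DEMO that lists the VIP addresses of both nodes (the alias and host names are illustrative):

SQL> alter system set remote_listener='LISTENERS_DEMO' scope=both sid='*';

LISTENERS_DEMO =
  (ADDRESS_LIST =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac2-vip)(PORT = 1521))
  )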

FCF supports RCLB by default, and RCLB provides better service to connections through load balancing advisory events. RCLB offers two load balancing goals, one based on service time and one based on throughput; the specific goal can be set through dbms_service.

SQL> exec dbms_service.modify_service(service_name => 'TEST', goal => DBMS_SERVICE.GOAL_SERVICE_TIME);

As for the concrete difference between the two goals, it was not obviously reflected in my tests.

This part of Load Balanc is where I have the most doubts. I have consulted a lot of documents with different opinions, and there is no detailed case proof. I also hope that friends who have studied here can make corrections.

Other RAC maintenance tasks and cases

This section focuses on some practical cases related to RAC maintenance, with the following examples:

1. Changing hostnames in the cluster

2. Changing IP addresses in the cluster

3. Adding / deleting nodes in a cluster

4. Upgrade: 9i RAC to 10g RAC

5. Building RAC + DG

6. Others

Changing hostnames in the cluster

The following is an actual example; the list below shows the host names before and after the change.

Hostname:     node1/node2           --> td1/td2
Private_name: node1_priv/node2_priv --> td1_priv/td2_priv
Vip_name:     node1_vip/node2_vip   --> td1_vip/td2_vip

1. Generate the cap files of the listeners

node1-> crs_stat -p ora.node1.LISTENER_NODE1.lsnr > /home/oracle/ora.node1.LISTENER_NODE1.lsnr.cap

node1-> crs_stat -p ora.node2.LISTENER_NODE2.lsnr > /home/oracle/ora.node2.LISTENER_NODE2.lsnr.cap

2. Stop all resources, back up OCR and the voting disk, and reformat them

Back up OCR

[root@node1 backup] # ocrcheck

Status of Oracle Cluster Registry is as follows:

Version: 2

Total space (kbytes): 104176

Used space (kbytes): 4424

Available space (kbytes): 99752

ID: 2042344313

Device/File Name: / dev/raw/raw1

Device/File integrity check succeeded

Device/File not configured

Cluster registry integrity check succeeded

[root@node1 init.d]# ocrconfig -export /backup/ocr_1.bak

Back up the voting disk

[root@node1 ~] # crsctl query css votedisk

0. 0 / dev/raw/raw110

Located 1 votedisk (s).

[root@node1 ~] # dd if=/dev/raw/raw110 of=/backup/votedisk.bak

Reformat

[root@td01 ~] # dd if=/dev/zero of=/dev/raw/raw1 bs=1024k count=1

[root@td01 ~] # dd if=/dev/zero of=/dev/raw/raw110 bs=1024k count=1

3. Modify the hostname and edit the related files at the OS level, then restart the hosts (steps omitted)

4. Reconfigure user equivalence (mutual trust) between the cluster nodes (steps omitted)

5. Edit the $ORA_CRS_HOME/install/rootconfig file and modify the following to match your actual environment:

ORA_CRS_HOME=/opt/oracle/product/10.2.0/crs_1

CRS_ORACLE_OWNER=oracle

CRS_DBA_GROUP=oinstall

CRS_VNDR_CLUSTER=false

CRS_OCR_LOCATIONS=/dev/raw/raw1

CRS_CLUSTER_NAME=crs

CRS_HOST_NAME_LIST=td1,1,td2,2

CRS_NODE_NAME_LIST=td1,1,td2,2

CRS_PRIVATE_NAME_LIST=td1-priv,1,td2-priv,2

CRS_LANGUAGE_ID='AMERICAN_AMERICA.WE8ISO8859P1'

CRS_VOTING_DISKS=/dev/raw/raw110

CRS_NODELIST=td1,td2

CRS_NODEVIPS='td1/td1-vip/255.255.255.0/eth0,td2/td2-vip/255.255.255.0/eth0'

Execute it in turn on each node:

[root@td2 install]# /opt/oracle/product/10.2.0/crs_1/install/rootconfig

Checking to see if Oracle CRS stack is already configured

Setting the permissions on OCR backup directory

Setting up NS directories

Oracle Cluster Registry configuration upgraded successfully

WARNING: directory '/opt/oracle/product/10.2.0' is not owned by root

WARNING: directory '/opt/oracle/product' is not owned by root

WARNING: directory '/opt/oracle' is not owned by root

WARNING: directory '/opt' is not owned by root

Clscfg: EXISTING configuration version 3 detected.

Clscfg: version 3 is 10G Release 2.

Successfully accumulated necessary OCR keys.

Using ports: CSS=49895 CRS=49896 EVMC=49898 and EVMR=49897.

Node:

Node 1: td1 td1-priv td1

Node 2: td2 td2-priv td2

Clscfg: Arguments check out successfully.

NO KEYS WERE WRITTEN. Supply -force parameter to override.

-force is destructive and will destroy any previous cluster configuration.

Oracle Cluster Registry for cluster has already been initialized

Startup will be queued to init within 30 seconds.

Adding daemons to inittab

Expecting the CRS daemons to be up within 600 seconds.

CSS is active on these nodes.

Td1

Td2

CSS is active on all nodes.

Waiting for the Oracle CRSD and EVMD to start

Oracle CRS stack installed and running under init (1m)

Running vipca (silent) for configuring nodeapps

Creating VIP application resource on (2) nodes...

Creating GSD application resource on (2) nodes...

Creating ONS application resource on (2) nodes...

Starting VIP application resource on (2) nodes...

Starting GSD application resource on (2) nodes...

Starting ONS application resource on (2) nodes...

If the version is 10.2.0.1, an exception will be thrown because of a VIP bug when rootconfig is executed on the last node, and the VIPCA graphical interface then has to be run on that node.

After that, gsd, ons and vip are all registered in OCR.

[root@td1 install] # crs_stat-t

Name Type Target State Host

Ora.td1.gsd application ONLINE ONLINE td1

Ora.td1.ons application ONLINE ONLINE td1

Ora.td1.vip application ONLINE ONLINE td1

Ora.td2.gsd application ONLINE ONLINE td2

Ora.td2.ons application ONLINE ONLINE td2

Ora.td2.vip application ONLINE ONLINE td2

6. Configure the public / private network using oifcfg

td1-> oifcfg setif -global eth0/10.194.129.0:public

td1-> oifcfg setif -global eth2/10.10.10.0:cluster_interconnect

7. Register the other resources in the cluster

(1) Register the listeners in the cluster

Modify the listener configuration file listener.ora and edit the previously generated cap files (renamed as needed), changing the host names used in them.

td1-> crs_register ora.td1.LISTENER_TD1.lsnr -dir /home/oracle

td1-> crs_register ora.td2.LISTENER_TD2.lsnr -dir /home/oracle

Alternatively, use the netca graphical interface to configure the listeners.

(2) Register the ASM instances in the cluster (if ASM is used)

td1-> srvctl add asm -n td1 -i +ASM1 -o $ORACLE_HOME

td1-> srvctl add asm -n td2 -i +ASM2 -o $ORACLE_HOME

(3) Register the instances/database in the cluster

td1-> srvctl add database -d demo -o $ORACLE_HOME

td1-> srvctl add instance -d demo -i demo1 -n td1

td1-> srvctl add instance -d demo -i demo2 -n td2

Verify:

Td1- > crs_stat-t

Name Type Target State Host

-

Ora.demo.db application ONLINE ONLINE td1

Ora... .o1.inst application ONLINE ONLINE td1

Ora... .o2.inst application ONLINE ONLINE td2

Ora... .SM1.asm application ONLINE ONLINE td1

Ora... .D1.lsnr application ONLINE ONLINE td1

Ora.td1.gsd application ONLINE ONLINE td1

Ora.td1.ons application ONLINE ONLINE td1

Ora.td1.vip application ONLINE ONLINE td1

Ora... .SM2.asm application ONLINE ONLINE td2

Ora... .D2.lsnr application ONLINE ONLINE td2

Ora.td2.gsd application ONLINE ONLINE td2

Ora.td2.ons application ONLINE ONLINE td2

Ora.td2.vip application ONLINE ONLINE td2

Log in to the database to check the interconnect being used by the database:

SQL> select * from v$cluster_interconnects;

NAME IP_ADDRESS IS_PUBLIC SOURCE

-

Eth2 10.10.10.145 NO Oracle Cluster Repository

If OCFS is used as the shared file system, check the corresponding OCFS configuration before starting the database and make sure OCFS can be mounted and used properly.

2. Changing IP addresses in the cluster

Changing IP addresses is relatively easier than changing hostnames. For public/private IP changes within the same network segment, no special action is required. For a different network segment, oifcfg must be used. Because the VIP is registered in OCR as a resource, any change to the VIP must be handled by calling the relevant commands.

All of the above operations must be performed with the cluster resources stopped.

The following are the public/private/virtual IPs before and after the change:

Public IP   10.194.129.145/146 -> 10.194.128.145/146
Private IP  10.10.10.145/146   -> 10.10.1.145/146
Virtual IP  10.194.129.147/148 -> 10.194.128.147/148

1. Stop all resources

Shut down the database normally and stop the remaining resources with crs_stop.

2. Modify the NIC IP / gateway / hosts file, then restart the network and other related services

3. Use oifcfg to change the public/private IPs

View the current configuration:

td1-> oifcfg getif

eth2 10.10.10.0 global cluster_interconnect
eth0 10.194.129.0 global public

Delete the current configuration:

td1-> oifcfg delif -global eth0

td1-> oifcfg delif -global eth2

Re-add:

td1-> oifcfg setif -global eth0/10.194.128.0:public

td1-> oifcfg setif -global eth2/10.10.1.0:cluster_interconnect

4. Update the VIP configuration registered in OCR (as root)

[root@td1]# crs_register -update ora.td1.vip -o oi=eth0,ov=10.194.128.147,on=255.255.255.0

[root@td1]# crs_register -update ora.td2.vip -o oi=eth0,ov=10.194.128.148,on=255.255.255.0

Or use (as root):

[root@td1]# srvctl modify nodeapps -n td1 -A 10.194.128.147/255.255.255.0/eth0

[root@td1]# srvctl modify nodeapps -n td2 -A 10.194.128.148/255.255.255.0/eth0

5. If OCFS is used, modify the OCFS configuration file (/etc/ocfs/cluster.conf) and verify that it works after the change

6. Modify the listener configuration file

7. Start the resources on each node of the cluster and verify

td1-> crs_start -all

Log in to the database and verify the interconnect in use:

SQL> select * from v$cluster_interconnects;

NAME IP_ADDRESS IS_PUBLIC SOURCE

-

Eth2 10.10.1.145 NO Oracle Cluster Repository

3. Deleting / adding nodes in a cluster

Compared with node deletion / addition in 9i, it is somewhat more laborious in 10g, but the procedure is more standardized.

Because of clusterware, a series of interface commands must be called to add / remove resources from OCR. This article does not describe this case in detail; see the official Oracle online document "RAC - Adding and Deleting Nodes and Instances on UNIX-Based Systems".

4. Upgrade and Migration

Migrating and upgrading RAC is not much more complex than for a single instance. A newcomer to RAC does not need to treat it as something complicated, provided there is enough solid single-instance knowledge and a deep understanding of it.

For example, with an RMAN backup we can easily migrate an instance running on a single node into a RAC environment. All that is needed is to migrate the database (think of it as single instance to single instance), then edit the parameter file, add the redo threads and undo tablespaces the other nodes need to open the database, and register the database resources with the cluster for management.

If the migration or upgrade has downtime limits, in most cases the focus is not on the fact that a RAC architecture is being operated on, but on how to develop your MAA strategy. For example, you may need features such as transportable tablespaces or advanced replication / Streams to compress the downtime; the RAC architecture may make the whole exercise somewhat harder, but most of the time that is not the point.

Next is a case of a silent upgrade from 9i RAC to 10g RAC on the same machine. For details, see my blog article http://www.easyora.net/blog/9i_rac_upgrade_10g_rac.html.

5. Highly available architecture: RAC + DG

It should be said that the RAC + DG enterprise architecture still has a considerable market in terms of high availability and disaster recovery.

In terms of construction and management, the process for RAC (primary) + DG (standby) is not much different from that of a single instance. The following aspects need attention (but are not limited to these):

1. Log gap detection

Make sure the fal_server and fal_client parameters are configured correctly; in particular, when the RAC primary archives to each node's own destination, the FAL servers configured on the standby side need to cover every primary node.

2. Considerations for switchover/failover

When doing any switchover, keep only one instance alive on the RAC side. After the other instances are shut down, the switchover procedure is basically the same as for a single-node DG.

3. Standby logfiles

If LGWR is used to transfer logs, standby logfiles must be added on the standby side. Note that the threads of the added standby logfiles should match the primary: if the primary has 3 instances, you need to add standby log groups for 3 threads with the same thread numbers as the RAC primary, and at least 2 groups per thread. See the sketch below.
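A sketch of adding standby logfiles on the standby for a two-thread primary (the group numbers and size are illustrative):

SQL> alter database add standby logfile thread 1 group 11 size 50M;
SQL> alter database add standby logfile thread 1 group 12 size 50M;
SQL> alter database add standby logfile thread 2 group 13 size 50M;
SQL> alter database add standby logfile thread 2 group 14 size 50M;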

VI. RAC monitoring and optimization

1. Optimization approach and explanation of wait events

Given the complexity of a RAC system, RAC tuning is more difficult and demanding than single-instance tuning. In most cases, the tuning methods for a single instance still apply under the RAC architecture.

Two core issues of RAC optimization:

(1) Reduce the pressure on the shared pool: reduce contention on the data dictionary and reduce hard parsing.

Because the row cache / library cache is global, frequent data dictionary contention / hard parsing has more serious performance consequences in a RAC environment than in a single instance.

(2) Reduce the global block transfers and contention caused by Cache Fusion.

Frequent Cache Fusion brings global contention on a series of data blocks. Reducing logical reads and data sharing between instances is a new requirement that RAC places on application design and deployment.

Cache Fusion performance is a very important factor in overall RAC performance. avg global cache cr block receive time and avg global cache current block receive time are two important Cache Fusion metrics. The thresholds Oracle gives for these two metrics are as follows:

Name                                              Lower Bound  Typical  Upper Bound
avg global cache cr block receive time (ms)               0.3        4           12
avg global cache current block receive time (ms)          0.3        8           30

Global wait events under RAC:

SQL> select * from v$event_name where NAME like 'gc%' and WAIT_CLASS='Cluster';

There are more than 40 non-idle global wait events under 10g R2. The most common and noteworthy ones are as follows:

gc current/cr request

This wait event represents the time it takes for a block to travel from a remote instance to the local instance. The occurrence of this event by itself does not mean anything; if the wait time is too long, it may indicate a problem with the interconnect or serious block contention.

gc buffer busy

The global extension of buffer busy waits. This wait usually indicates a block contention problem.

Enqueue class

In RAC, the common enqueues are enq: HW - contention, enq: TX - index contention, and so on, which easily appear in environments with highly concurrent inserts across nodes.

Events such as gc current/cr block 2-way/3-way and gc current/cr grant simply record the details or results of block and message transfers and generally do not require much attention.

2. Performance diagnosis

Performance tuning is hard to reduce to a formula, but the guiding approach is largely the same.

AWR/ASH reports can be used as powerful performance collection and diagnostic tools in a RAC system. Compared with the single-instance report, the RAC Statistics section of AWR provides detailed GES and GCS performance samples which, combined with the global wait events, help locate the symptoms of cluster problems.

Under RAC, the Segment Statistics section deserves extra attention. If you are still used to STATSPACK for performance collection, it is recommended to set the collection level to at least 7. This section gives detailed segment-level activity, which helps locate globally hot tables / indexes and analyze the root cause of global resource queuing contention.

Pay attention to the DBA_HIST_* series of views, which help pin down the problem in more detail, even to the SQL level. Cases of inefficient SQL dragging down system performance abound, and this is even more common under RAC. dba_hist_active_sess_history is a good entry point: for example, join dba_hist_sqltext to get the statement text, join dba_hist_sql_plan to get the execution plan tree, and sometimes you find the culprit behind a wait event directly. See the sketch below.
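A sketch of such a query, linking recent cluster-class waits back to their SQL text:

SQL> select h.sample_time, h.event, h.sql_id, t.sql_text
     from dba_hist_active_sess_history h, dba_hist_sqltext t
     where h.sql_id = t.sql_id
       and h.event like 'gc%'
     order by h.sample_time;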

Common contention and remedies in RAC:

① Sequence and index contention

Sequences are a common source of contention in RAC, especially when sequence values are used as index keys: with highly concurrent inserts from multiple nodes this easily causes contention on index blocks and cross-instance transfer of CR copies.

Increase the sequence's cache value as much as possible and set the sequence to noorder, as sketched below.
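For example (a sketch; the sequence name is illustrative):

SQL> alter sequence seq_order_id cache 1000 noorder;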

② Undo block considerations

Constructing CR blocks under RAC is more expensive than on a single instance: if the active transactions of a block are spread across several instances, the undo on those instances has to be merged to construct the required CR block. Highly concurrent inserts into indexed keys in particular can cause undo block contention.

Try to use small transactions.

③ HW (high water mark) considerations

Highly concurrent inserts across nodes can cause contention on the high water mark, which can be alleviated by using larger extents, ASSM, and partitioning.

④ Hot blocks

Globally hot blocks have a large impact on a RAC system; reduce concurrent changes to the same blocks across instances as much as possible, and use appropriate partitioning to alleviate the contention.

A good application design is an important prerequisite for RAC to show its strength. Deploying different applications to different nodes can effectively reduce contention for global resources and is also very important for stable RAC performance.

Introduction to cluster concepts and terminology

Service hardware: hardware that provides computing services, such as PCs and PC servers.

Service entity: usually refers to the service software plus the service hardware.

Node: an independent host running the Heartbeat process. Nodes are the core components of HA; each node runs an operating system and the Heartbeat software service.

Resource: an entity that a node can control and that can be taken over by other nodes when a node fails, such as disk partitions, file systems, IP addresses, application services, and shared storage.

Event: something that can happen in a cluster, such as a node system failure, loss of network connectivity, a NIC failure, or an application failure. These events cause the node's resources to be transferred, and HA testing is based on these events.

What is a cluster

A cluster is a group of computers that, as a whole, provides users with a set of network resources; each individual computer system is a cluster node. Clusters provide the following key features.

(1) scalability. The performance of the cluster is not limited to a single service entity, and new service entities can be dynamically added to the cluster to enhance the performance of the cluster.

(2) High availability. Clusters protect clients from "out of service" warnings through redundancy of service entities. When one node server fails, the applications running on that server are automatically taken over on another node server. Eliminating a single point of failure is important to enhance data availability, accessibility, and reliability.

(3) load balancing. Load balancing can distribute tasks evenly to computing and network resources in the cluster environment, in order to improve data throughput.

(4) error recovery. If one of the servers in the cluster is unavailable due to a failure or maintenance need, resources and applications are transferred to the available cluster nodes. This process in which resources in one node cannot work and resources in another available node can transparently take over and continue to complete the task is called error recovery.

The connections and differences between distributed systems and clusters are as follows:

(1) A distributed system spreads different services across different places.

(2) A cluster concentrates several servers on delivering the same service.

(3) Every node of a distributed system can itself be a cluster, while a cluster is not necessarily distributed. In a narrow sense, a distributed system is similar to a cluster, but its organization is looser: in a cluster, if one server goes down, another server can take over, whereas in a distributed system each node handles a different service, so if a node goes down, that service becomes inaccessible.

Clusters are mainly divided into three categories:

HA: high availability Cluster (High Availability Cluster).

LBC: load balancing Cluster / load balancing system (Load Balance Cluster)

HPC: scientific computing cluster (High Performance Computing Cluster) / high performance computing (High Performance Computing) cluster.

Why set up a database cluster

With the rapid development of economy, the rapid expansion of enterprise scale, the explosive growth of the number of enterprise users and the amount of data, it has posed a severe test to the database. For all databases, in addition to recording the correct processing results, they also face the following challenges.

How to improve the processing speed and balance the load of the database. How to ensure the availability of the database, data security, and how to achieve the scalability of the data cluster. How to solve these problems comprehensively has become the focus of many enterprises.

Databases are clustered for the same reasons, specifically the following:

(1) As an enterprise grows, its business volume increases and the access volume and data volume of the database grow rapidly, so the required processing capacity and computing speed grow accordingly, until a single device can no longer bear the load.

(2) under the above circumstances, if you throw away the existing equipment and do a large number of hardware upgrades, it is bound to result in a waste of existing resources, and the next time the business volume increases, you will face another high investment in hardware upgrading. Therefore, people hope to set up clusters through several small and medium-sized servers to achieve load balancing and continuous expansion of the database; when higher database processing speed is needed, as long as a simple increase in the database server can be expanded.

(3) Database, as the core of the information system, plays a very important role. A single device can not guarantee the continuous operation of the system at all. If a system failure occurs, it will seriously affect the normal operation of the system, and even bring huge economic losses. Therefore, people hope to set up a database cluster to achieve the high availability of the database. When a node fails, the system will automatically detect and transfer the application of the fault node to ensure the continuous work of the database.

(4) the database of the enterprise holds the important information of the enterprise, and some core data is even related to the lifeblood of the enterprise. A single device can not guarantee the security of the database at all. Once lost, it is difficult to find it again. Therefore, people hope that through the establishment of database clusters, to achieve the redundancy of data sets, by backing up data to ensure security.

Classification of database clusters

Database cluster technology is a technology that combines multiple servers to form a cluster to achieve better comprehensive performance than a single large server. this technology can not only meet the needs of applications, but also greatly save the cost of investment. Database cluster technology belongs to two types of systems: the cluster technology based on database engine and the cluster technology based on database gateway (middleware). In the aspect of database cluster products, it mainly includes Oracle RAC, Microsoft MSCS, IBM DB2UDB, Sybase ASE based on database engine cluster technology, and ICX-UDS based on database gateway (middleware) cluster technology.

Generally speaking, the direction that database cluster software focuses on and the problems it tries to solve are divided into three categories:

(1) Load balancing clusters (Load Balance Cluster, LBC) focus on the horizontal expansion of the database to improve its performance.

(2) High availability clusters (High Availability Cluster, HAC) focus on ensuring the continuity of database applications; most database clusters focus on this.

(3) High security clusters (High Security Cluster, HSC) focus on disaster recovery.

Only Oracle RAC can achieve the above three aspects.

Scalable distributed database architecture

(1) Oracle RAC

The most important feature of its architecture is shared storage (Shared-storage). The whole RAC cluster is built on shared storage devices, and the nodes are interconnected with a high-speed network. Oracle RAC provides very good high-availability features, such as load balancing and Transparent Application Failover (TAF). Its biggest advantage is that it is completely transparent to applications, which can be switched to a RAC cluster without modification. However, the scalability of RAC is limited. First, because the whole cluster depends on the underlying shared storage, the capacity and availability of that shared storage determine what the whole cluster can provide. For I/O-intensive applications, this mechanism means that subsequent expansion can only be Scale Up (upward expansion), and the hardware cost, the demands on developers and the maintenance costs are all relatively high. Oracle is obviously aware of this problem: in Oracle's MAA (Maximum Availability Architecture), ASM is used to integrate multiple storage devices, so that the shared storage underneath RAC can scale linearly and the whole cluster no longer depends on the processing capacity and availability of a single large storage system.

Another problem of RAC is that as the number of nodes increases, the cost of communication between nodes also increases. Beyond a certain point, adding nodes may not improve performance and can even degrade it. The main reason is that Oracle RAC is transparent to applications, which can connect to any node in the cluster for processing; when applications on different nodes compete for resources, the communication overhead between RAC nodes seriously affects the processing capacity of the cluster. There are therefore two suggestions for using Oracle RAC:

(1) use a high-speed interconnect network for communication between nodes; (2) distribute different applications on different nodes as far as possible.

For this reason, Oracle RAC usually achieves good scalability in DSS (Decision Support System) environments, because DSS workloads are easy to distribute across different compute nodes, whereas for OLTP (On-Line Transaction Processing) applications, Oracle RAC is more often used to improve availability than scalability.
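One common way to keep different applications on different nodes is to define RAC services with preferred and available instances. Below is a minimal sketch using srvctl; the database name racdb, the instance names racdb1/racdb2 and the service names are assumptions, not values taken from this article.

# OLTP module prefers instance 1, can fail over to instance 2
srvctl add service -d racdb -s oltp_srv -r racdb1 -a racdb2
# reporting/DSS module prefers instance 2
srvctl add service -d racdb -s dss_srv -r racdb2 -a racdb1
srvctl start service -d racdb -s oltp_srv
srvctl status service -d racdb -s oltp_srv

Clients then connect through the service name, so each workload normally lands on its preferred node.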

(2) MySQL Cluster

MySQL Cluster is completely different from Oracle RAC: it uses a shared-nothing architecture. The whole cluster is composed of management nodes (ndb_mgmd), processing nodes (mysqld) and storage nodes (ndbd), and there is no shared storage device. MySQL Cluster is mainly implemented through the NDB storage engine, an in-memory storage engine that requires all data to be loaded into memory. Data is automatically distributed across the different storage nodes in the cluster, and each storage node stores only a fragment of the complete data set. At the same time, users can arrange for the same data to be saved on several different storage nodes, so that a single point of failure does not cause data loss. MySQL Cluster is mainly composed of three parts:

(1) SQL server nodes (mysqld); (2) NDB data storage nodes (ndbd); (3) monitoring and management nodes (ndb_mgmd).

Such layering is also related to the architecture in which MySQL itself separates SQL processing from storage. The advantage of MySQL cluster is that it is a distributed database cluster, processing nodes and storage nodes can be increased linearly, the whole cluster has no single point of failure, availability and scalability can be very high, more suitable for OLTP applications. But the problem is:

(1) The NDB storage engine: NDB is an in-memory storage engine with high availability and good data consistency, but it requires all data to be loaded into memory, which is quite limiting; newer versions of NDB improve on this by requiring only the index data to be in memory, while the row data can be kept on disk. (2) Performance: the current performance of MySQL Cluster is not ideal, because data is distributed across the storage nodes by a hash of the primary key; if the application does not access data through the primary key, all storage nodes must be scanned and the results returned to the processing node for further work. Moreover, a write has to be sent to several storage nodes at the same time, which places high demands on the network between nodes.
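As a rough illustration of the three node types, the sketch below writes a minimal cluster configuration file for the management node; all host names, paths and memory sizes are assumptions made for the example, not values from this article.

# minimal sketch of a config.ini for ndb_mgmd (hypothetical hosts and paths)
cat > /var/lib/mysql-cluster/config.ini <<'EOF'
[ndbd default]
NoOfReplicas=2          # each data fragment kept on two storage nodes
DataMemory=512M
IndexMemory=64M
[ndb_mgmd]
HostName=mgm1
[ndbd]
HostName=data1
DataDir=/var/lib/mysql-cluster
[ndbd]
HostName=data2
DataDir=/var/lib/mysql-cluster
[mysqld]
HostName=sql1
EOF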

Although the current performance of MySQL Cluster is not ideal, the shared-nothing architecture is surely the trend of the future. Since taking over MySQL, Oracle has also put great effort into developing MySQL Cluster, and I have high expectations for its prospects.

(3) distributed database architecture

Sharding of data tables came to prominence after MySQL 5. Sharding is not a feature attached to a particular database product but an abstraction built on specific technical details; it is a Scale Out (horizontal, outward expansion) solution whose main purpose is to break through the I/O limits of a single database server and solve the problem of database scalability. Oracle RAC, for example, uses a shared storage mechanism: for I/O-intensive applications the bottleneck easily falls on storage, and that mechanism means subsequent expansion can only be Scale Up (upward expansion), with relatively high hardware costs, demands on developers and maintenance costs. Sharding is basically a scalability solution for open source databases; it is rare to hear of commercial databases doing Sharding. The current trend in the industry is to embrace Scale Out and gradually move away from Scale Up.

The advantage of the Sharding architecture is that the cluster scales almost linearly and its overall availability is very high: the failure of some nodes does not affect the service provided by the other nodes. Sharding is simple in principle and easy to implement, so it is a very good answer to database scalability. It is not a silver bullet, however, and it has unsuitable scenarios, for example transactional applications, where it may complicate the application architecture or limit system functionality; this is its defect.

Read-write separation is another important idea in building distributed systems. The overall processing capacity of many systems cannot keep pace with business growth, bottlenecks inevitably appear, and simple hardware upgrades cannot solve the problem once and for all. According to the characteristics of the business, a series of adjustments have to be made at the architectural level, such as splitting business modules and splitting the database. Centralized and distributed are two opposing models, and the characteristics of different industries determine the architectural approach. Portal sites in the Internet industry, for technical and cost reasons, mostly use open source database products (such as MySQL), because most of their requests read far more than they write, which is what made MySQL and its replication technology popular. Traditional transaction-intensive industries such as telecommunications and finance, considering single-point processing capacity, reliability and stability, are more likely to use commercial databases such as DB2 and Oracle. At the database level, most core systems in traditional industries adopt a centralized architecture, with highly configured minicomputers as the host; because the database and the host are powerful enough to carry the business, the read-write separation architecture is less common with Oracle than with MySQL.

The read-write separation architecture uses the database's replication technology to distribute reads and writes to different processing nodes, improving availability and scalability. The most common approach uses MySQL Replication: a Master DB handles write operations and replicates data changes to multiple Slave DBs, which handle read operations. This architecture suits read-intensive applications, and read performance grows linearly as Slave DBs are added. To avoid a single point of failure on the Master DB, the cluster generally uses two Master DBs as a hot-standby pair, so the read and write availability of the whole cluster is very high. Besides MySQL, Oracle provides Active Standby (Active Data Guard) from 11g onward, so it also has the foundation for a read-write separation architecture. The drawback of read-write separation is that every node, Master or Slave, must keep a complete copy of the data; with a large data volume, the scalability of the cluster is still limited by the storage capacity of a single node, and the architecture is not suitable for write-intensive applications.

Using the Oracle read-write separation idea, a Writer DB and a Reader DB are kept in real-time sync with log replication software; the Writer DB handles real-time queries and transaction processing, while the Reader DB handles read-only access, non-real-time transaction details, statement and summary queries, and so on. To meet availability and scalability requirements, the writer and reader sides should each be extended appropriately, for example by running the Writer DB in an HA or RAC configuration. Apart from the cluster products of the database vendors themselves, there are two main approaches to database scalability: data sharding and read-write separation. Data sharding splits the data horizontally, similar in principle to hash partitioning. Multiple sets of access routes and Reader DBs can be handled in the application architecture, and the read load can be shared effectively through load balancing or business separation.
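To make the hash-partition analogy concrete, the sketch below routes a customer id to one of N shards with a simple modulo rule; the shard count and shard naming are purely hypothetical.

#!/bin/sh
# minimal sketch: route a key to a shard the way a hash partition would
NSHARDS=4
for ID in 1001 1002 1003 1004 1005; do
    SHARD=$(( ID % NSHARDS ))          # the same key always maps to the same shard
    echo "customer $ID -> shard_$SHARD"
done

In a real system the application (or a routing layer) would use the computed shard number to pick the right database connection.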

For the database architecture mode of Shared-nothing, one of the core problems is the real-time synchronization of read-write libraries; in addition, although Reader DB is only responsible for business queries, it does not mean that the database is functionally read-only. Read-only is from the application point of view, in order to ensure data consistency and conflict considerations, because the query business module may involve some intermediate processing, if it needs to be processed in the database (depending on application requirements and design), so Reader DB still needs to be writable functionally. Let's talk about the technology selection of data synchronization:

There are many technologies for real-time data synchronization: the OS layer (for example VERITAS VVR), storage replication (supported by most mid-range and high-end storage), application-level distribution, or database-layer technology. Because the synchronization may not cover an entire database but involve selecting business data and merging multiple sources, OS replication and storage replication are not the first choice for read-write separation. Log-based Oracle replication can be done either with Oracle's own components or with mature commercial software. The choice between a commercial product and Oracle's own components depends on many factors: the technical and operational capability of the team, the project budget, the load on the business system, and so on.

Oracle's own component options are essentially Logical Standby, Streams and, in 11g, Physical Standby (Active Data Guard). By comparison, Streams is the most flexible but the least stable; 11g Physical Standby supports parallel recovery and read-only access, but because it is not a logical log-apply mechanism it is the most limited in read-write separation scenarios. If the technical team knows the technology well enough and the chosen option can keep up with the data synchronization requirements, using Oracle's own components is feasible. Choosing a commercial product is more a matter of stability and processing capacity. There are several mature Oracle replication products on the market: the veteran Shareplex, the domestic DSG RealSync and the DDS of the Nine Bridge company, and the newcomer Oracle GoldenGate are all candidates. With Oracle's acquisition and promotion of GoldenGate, I believe GoldenGate will become popular for disaster recovery, data distribution and synchronization. Of course, building a reliable distributed read-write separation system still requires a great deal of design on the application side, which is beyond the scope of this article.

(4) CAP and BASE theory

Distributed domain CAP theory:

(1) Consistency: data updates are consistent; all data changes are synchronized.

(2) Availability: good response performance.

(3) Partition tolerance: reliability when the network is partitioned.

Theorem: any distributed system can only satisfy two points at the same time, not all three.

Advice: instead of wasting energy on how to design a perfect distributed system that meets all three, architects should make choices.

The ACID model of relational databases pursues high consistency and availability and is difficult to partition:

(1) Atomicity: all operations in a transaction complete, or none of them do.

(2) Consistency: the database is in a consistent state both when the transaction begins and when it ends.

(3) Isolation: a transaction behaves as if it were the only one operating on the database; transactions do not see each other.

(4) Durability: once a transaction completes, its changes are permanent and cannot be undone.

(5) Cross-database transactions

2PC (two-phase commit) is an anti-scalability pattern (Pat Helland). In other words, it is very difficult for traditional relational databases to form a distributed database cluster, and their scalability is very limited. In recent years the growing NoSQL (non-relational database) movement has sacrificed strong consistency, adopted the BASE model and designed distributed systems around eventual consistency, so that systems can achieve high availability and scalability. So, is it possible to implement a distributed database cluster that guarantees both availability and consistency while also providing good scalability?

The main way the BASE idea is implemented is to shard the database by function. BASE emphasizes basic availability: if you need high availability, that is, pure high performance, you have to sacrifice consistency or fault tolerance. BASE-style schemes still have potential in terms of performance.

(1) Common ground: both are alternatives to relational SQL databases. With data distributed, either model can persist data, separate data processing from data storage, and separate reads from writes; storage can be synchronous or asynchronous, depending on the degree of consistency required. (2) Differences: Key-Value NoSQL stores are ready-made products that sit alongside relational databases and also suit non-Java stacks such as PHP and Ruby. "Domain model + distributed cache + storage" is a complex architectural approach rather than a product, but it is more flexible and is something architects should master.

At present there are many distributed database products, but most of them target DSS applications, because DSS applications are easier to scale out than OLTP applications; Greenplum, based on PostgreSQL, is a good example, solving availability and scalability well and providing strong parallel computing capability. OLTP applications have demanding requirements: high availability, consistency, short response times, support for transactions and joins, and so on. As for databases versus NoSQL: as more and more NoSQL products appear, they offer many features that relational databases lack and can do very well in availability and scalability.

First, the application scenarios of NoSQL are very limited, and a certain type of NoSQL is only designed for specific types of application scenarios. Relational databases are much more general-purpose, and you have to figure out whether your application scenario is appropriate to use NoSQL.

Second, the use of relational database with application architecture, such as Sharding and read-write separation technology, can also build a highly available and scalable distributed database cluster.

Third, relational database manufacturers are still very strong, there are a large number of users around the world, demand is bound to promote the emergence of new products.

Fourth, the rapid development of hardware, such as flash memory technology continues to mature, the future flash memory can be used as the cache between disk and memory, or completely replace the disk. While the price of memory is getting lower and lower, the capacity is getting larger and larger, and the application of In-memory cache or database is more and more widely, which can improve the performance of the application by an order of magnitude. The IO problem faced by the database will be greatly improved.

Three High availability Cluster schemes of Oracle

1 RAC (Real Application Clusters)

Multiple Oracle servers form a shared cache and share network-attached storage. Such a system can tolerate the failure of one or several machines. However, the nodes need a high-speed interconnect, which in practice means everything sits in one computer room, or one data center. If something goes wrong with that computer room, for example its network, the whole system is down. Therefore RAC alone cannot meet the needs of the important business of a typical Internet company, which requires multiple computer rooms so that the failure of a single room can be tolerated.

2 Data Guard. (the main function is redundancy)

Data Guard is suitable for multiple computer rooms: one room holds the production database and another deploys a standby database. A standby database can be physical or logical. A physical standby is mainly used for switchover after a production failure, while a logical standby can also share the production database's read load in normal operation.
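A quick way to see which role a given database is playing is to query v$database on each site; a minimal sketch, assuming local SYSDBA access:

sqlplus -s / as sysdba <<'EOF'
select database_role, open_mode, protection_mode from v$database;
EOF

On the production site DATABASE_ROLE reports PRIMARY; on the standby it reports PHYSICAL STANDBY or LOGICAL STANDBY.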

3 MAA

MAA (Maximum Availability Architecture) is not an independent third, but a combination of the first two to provide the highest availability. RAC clusters are deployed in each computer room, and multiple computer rooms are synchronized with Data Guard.

Overview of RAC

Shared-disk file systems (such as NFS) or cluster file systems (such as OCFS2) are mainly used with storage area networks: all nodes directly access the storage of the shared file system, so the failure of one node does not affect access to the file system from the other nodes. In general, shared-disk file systems are used in high-availability clusters.

The core of Oracle RAC is the shared disk subsystem. All nodes in the cluster must be able to access all data, redo log files, control files and parameter files. Data disks must be available globally, allowing all nodes to access the database. Each node has its own redo log and control files, but other nodes must be able to access them in order to recover in the event of a system failure on that node.

Oracle RAC runs on top of a cluster and provides the highest level of availability, scalability and low-cost computing power for Oracle databases. If one node in the cluster fails, Oracle continues to run on the remaining nodes. Oracle's main innovation is a technology called Cache Fusion, which lets the nodes in the cluster efficiently synchronize their memory caches over the high-speed cluster interconnect, minimizing disk I/O. An equally important advantage is that it lets all nodes in the cluster share access to all the data on disk; the data does not need to be partitioned between nodes. Oracle is the only vendor to provide an open-system database with this capability; other database software that claims to run on clusters requires the data to be partitioned, which is impractical. The enterprise grid is the data center of the future, built on large configurations of standardized commodity components: processors, network and storage. Oracle RAC's Cache Fusion technology provides the highest levels of availability and scalability. Oracle Database 10g and Oracle RAC 10g significantly reduce operating costs and increase flexibility, giving the system greater adaptability, foresight and agility. Dynamic provisioning of nodes, storage, CPU and memory achieves the required service levels while continuously reducing costs through higher utilization.
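From any node you can confirm how many instances are open across the cluster by querying the GV$ views; a minimal sketch, assuming local SYSDBA access:

sqlplus -s / as sysdba <<'EOF'
select inst_id, instance_name, host_name, status from gv$instance;
EOF

Each instance of the RAC database appears as one row, with STATUS showing OPEN while it is serving the shared database.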

RAC integrated clusterware management

Oracle RAC 10g provides a fully integrated clusterware management solution on every platform Oracle Database 10g runs on. This clusterware functionality includes cluster connectivity, messaging and locking, cluster control and recovery, and a workload management framework (discussed below). The integrated clusterware management of Oracle RAC 10g has the following advantages:

(1) the cost is low. Oracle provides this feature free of charge.

(2) support from a single manufacturer. The problem of mutual prevarication has been eliminated.

(3) it is easier to install, configure and maintain continuously. Oracle RAC 10g Clusterware uses standard Oracle database management tools for installation, configuration, and maintenance. No other integration steps are required for this process.

(4) the quality of all platforms is consistent. Oracle tests the new software version more rigorously than third-party products.

(5) The functionality is consistent on all platforms. For example, some third-party clusterware products limit the number of nodes a cluster can support; with Oracle RAC 10g all platforms can support up to 64 nodes. Users also get a consistent response experience on all platforms, effectively addressing high-availability challenges including server node failures, interconnect failures and I/O fencing.

(6) support advanced functions. This includes the integration of monitoring and notification capabilities to enable rapid and coordinated recovery between the database and the application layer in the event of a failure.

The architecture of RAC

RAC is a cluster solution for Oracle databases, which has the ability to coordinate the operation of two or more database nodes. The RAC structure diagram shown in the following figure:

The Cluster Manager integrates the other modules of the cluster system and provides communication between cluster nodes over the high-speed interconnect. The nodes are connected by the heartbeat link; the information carried on it determines the logical membership of the cluster and membership changes, as well as the running state of each node at a given moment, ensuring that the cluster runs normally. The communication layer manages communication between nodes: it is responsible for the configuration and interconnection of node information, uses the information produced by the heartbeat mechanism in the cluster manager, and guarantees that messages arrive correctly. There are also cluster monitoring processes that constantly verify the health of different parts of the system; for example, heartbeat monitoring constantly verifies that the heartbeat mechanism is working well.

In an application environment, all servers use and manage the same database, so as to spread the workload across servers. The hardware requires at least two servers plus a shared storage device, and two kinds of software are needed: the cluster software and the RAC component of the Oracle database. The operating system on all servers must be of the same type. According to the load balancing policy, when a client sends a request to a service's listener, the server may pass the request to its local RAC component for processing or to the RAC component of another server; after processing the request, RAC accesses the shared storage device through the cluster software. Logically, each node participating in the cluster has a separate instance, and these instances access the same database. The nodes communicate with each other through the communication layer (Communication Layer) of the cluster software. At the same time, to reduce I/O consumption, there is a global cache service, so each database instance keeps an identical copy of the database cache. The features of RAC are as follows:

(1) Each node's instance has its own SGA.

(2) Each node's instance has its own background processes.

(3) Each node's instance has its own redo logs.

(4) Each node's instance has its own undo tablespace.

(5) All nodes share the datafiles and controlfiles.

The structure and mechanism of RAC

Before Oracle 9i, RAC was called OPS (Oracle Parallel Server). A big difference between RAC and OPS is that RAC uses Cache Fusion technology: a data block read by one node can be updated by another node before it is ever written to disk, and only the final version is written to disk. In OPS, a block requested by another node first had to be written to disk before the requesting node could read it. With Cache Fusion, data blocks are transferred between the buffer caches of the RAC nodes over a high-speed, low-latency internal network. The following figure is a schematic diagram of a typical RAC deployment; an Oracle RAC cluster contains the following parts:

(1) Cluster nodes (Cluster Node): 2 to N nodes or hosts running Oracle Database Server.

(2) Private network (Network Interconnect): a high-speed private interconnect is needed between the RAC nodes to handle communication and Cache Fusion.

(3) Shared storage (Shared Storage): RAC requires shared storage devices so that all nodes can access the data files.

(4) Production network (Production Network): the network through which clients and applications access the RAC service.
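The first two items can be checked from any node with the clusterware command-line tools; a minimal sketch (the output of course depends on the cluster):

olsnodes -n     # the cluster nodes and their node numbers
oifcfg getif    # which interface is public and which is the private interconnect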

RAC background processes

Oracle RAC has its own unique background processes, which play no role in a single-instance configuration. The following figure shows some of the background processes a RAC instance runs; their functions are described below.

(1) LMS (Global Cache Service process) mainly manages access to data blocks in the cluster and transmits block images between the buffer caches of different instances. It copies a block directly from the cache of the holding instance and sends a copy to the requesting instance, while ensuring that a data block is mirrored only once across the buffer caches of all instances. The LMS process coordinates access to a block by passing messages between instances. When an instance requests a block, its LMD process issues a request for the block resource, which is directed to the LMD process of the instance that masters the block; the LMS process of the instance that owns the resource then creates a consistent-read image of the block and passes it to the buffer cache of the requesting instance. LMS ensures that only one instance at a time is allowed to update a block and maintains the mirroring record of the block (including the status flag of updated blocks). RAC provides up to 10 LMS processes (0-9), and their number increases as the volume of data passed between nodes grows. (2) LMON (Lock Monitor Process) is the Global Enqueue Service monitor. The LMON processes of the instances communicate regularly to check the health of each node in the cluster. When a node fails, LMON is responsible for cluster reconfiguration, GRD recovery and related operations; the service it provides is called Cluster Group Services (CGS).

LMON mainly relies on two kinds of heartbeat mechanism to complete the health examination.

(1) Network heartbeat between nodes (Network Heartbeat): it can be imagined that nodes send ping packets regularly to detect the status of nodes. If a response is received within a specified time, the other party's status is considered normal.

(2) Disk heartbeat via the control file (controlfile heartbeat): the CKPT process of each node updates a block of the control file every 3 seconds. This block is called the Checkpoint Progress Record; because the control file is shared, the instances can check whether the others are updating it in time.

(3) LMD (the global enqueue service daemon, lock management daemon) is a background process, also called the global enqueue service daemon, responsible for managing the resources that control access to blocks and global enqueues. Inside each instance, the LMD process manages incoming remote resource requests (that is, lock requests from other instances in the cluster). It is also responsible for deadlock detection and monitoring conversion timeouts.

(4) LCK (the lock process) manages non-Cache-Fusion lock requests, which are local resource requests. The LCK process manages resource requests and cross-instance call operations for shared resources. During recovery it builds a list of invalid lock elements and validates lock elements. Because LMS handles the primary lock management function, only a single LCK process exists in each instance.

(5) DIAG (the diagnosability daemon, diagnostic daemon) is responsible for capturing information about process failures in the RAC environment. The tracking information is written out for failure analysis, and the information generated by DIAG is very useful in working with Oracle Support technology to find the cause of the failure. Only one DIAG process is required per instance.

(6) GSD (the global service daemon) interacts with the RAC management tools dbca, srvctl and OEM to carry out administrative tasks such as instance startup and shutdown. To ensure that these management tools work properly, gsd must first be started on every node; one GSD process can support multiple RAC databases. The gsd binary lives in the $ORACLE_HOME/bin directory of the node, and its log file is $ORACLE_HOME/srvm/log/gsdaemon.log. The GCS and GES processes maintain the status information of each data file and cached block through the Global Resource Directory (GRD). When one instance accesses data and caches it, the other instances in the cluster obtain a corresponding block image, so that they do not need to read from disk when accessing that data but can read the cache in the SGA directly. The GRD exists in the memory structures of every active instance, which makes the SGA of a RAC instance larger than that of a single-instance database. The other processes and memory structures differ little from a single-instance database.
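On Linux you can see these RAC-specific background processes directly in the process list; a minimal sketch, with the instance name racdb1 assumed for the example:

ps -ef | grep -E 'ora_(lms|lmon|lmd|lck|diag)[0-9]*_racdb1' | grep -v grep

Typical names look like ora_lms0_racdb1, ora_lmon_racdb1, ora_lmd0_racdb1, ora_lck0_racdb1 and ora_diag_racdb1.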

RAC shared storage

RAC requires shared storage that is independent of any one instance; the OCR and voting disk mentioned above, as well as the data files, are all kept on this shared storage. Possible storage methods include OCFS, OCFS2, RAW, NFS, ASM and so on. OCFS (Oracle Cluster File System) and OCFS2 are simply file systems that, like NFS, provide a shared file system in a cluster environment. RAW (raw devices) are another storage method, supported by RAC in releases before Oracle 11g. Before Oracle 9i, OPS/RAC could only use this method: the shared storage is mapped to raw devices and the data Oracle needs is placed on them. Compared with a file system, however, RAW is not intuitive and not easy to manage, and the number of raw devices is limited, so RAW clearly needed a replacement; hence cluster file systems such as OCFS appeared. OCFS is simply a cluster file system implemented by Oracle itself; file systems from other vendors can also be used as storage choices. ASM is only a database storage solution, not a cluster solution, so ASM is not on the same conceptual level as RAW and OCFS/OCFS2: RAW and OCFS/OCFS2 can hold not only database files but also the Clusterware files needed by CRS, whereas ASM stores only database files and, strictly speaking, is just a node application (nodeapps) within RAC. ASM cannot hold the OCR and voting disk required for the clusterware installation, because ASM itself needs an instance while CRS sits completely outside that architecture; this is one of the reasons why, even when ASM is chosen as the storage solution, OCFS/OCFS2 or RAW must still be added. The various RAC shared storage methods compare as follows:

(1) Cluster file system: OCFS/OCFS2 on Windows and Linux, GPFS on AIX. The advantage is easy, intuitive management; the disadvantage is that it is based on file-system management software and still goes through the OS cache, so it is not ideal for production use. It can hold both CRS clusterware files and database files.

(2) RAW devices: shared storage provided by the hardware is used directly as raw devices. It can hold both clusterware files and database files.

(3) Network File System (NFS): shared storage through NFS, but an Oracle-certified NFS is required. It can hold both CRS clusterware files and database files.

(4) ASM: combines the high performance of RAW with the manageability of a cluster file system; introduced as shared storage in Oracle 10g. ASM itself needs the support of an Oracle instance, so ASM can hold only database files, not CRS files.

The difference between a RAC database and a single-instance database

For all instances in RAC to access the database, all datafiles, control files, PFILE/SPFILE and redo log files must be stored on shared disks that all nodes can access simultaneously, whether raw devices or a cluster file system. Structurally, a RAC database differs from a single instance in that each instance is configured with at least one additional redo thread; for example, a cluster of two instances needs at least four redo log groups, two per instance. An UNDO tablespace must also be prepared for each instance.

1. Redo and undo: when an instance changes the database it uses its own redo thread and undo segments and locks the data it modifies, so the operations of different instances stay relatively independent and data inconsistency is avoided. Redo logs and archive logs in this layout need special handling when backing up or restoring, which is discussed later.

2. Memory and processes: each node's instance has its own memory structures and process structures, and the structures of the nodes are basically the same. Through Cache Fusion, RAC synchronizes the cached data in the SGAs of the nodes to improve access speed and ensure consistency.
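The per-instance redo threads and undo tablespaces mentioned above are created with ordinary SQL; below is a minimal sketch for adding a second instance's redo thread and undo tablespace, where the disk group '+DATA', the sizes and the SID racdb2 are assumptions made for the example:

sqlplus -s / as sysdba <<'EOF'
-- a second redo thread with two groups for instance 2
alter database add logfile thread 2 group 3 ('+DATA') size 50m;
alter database add logfile thread 2 group 4 ('+DATA') size 50m;
alter database enable public thread 2;
-- a dedicated undo tablespace for instance 2
create undo tablespace undotbs2 datafile '+DATA' size 500m;
alter system set undo_tablespace=undotbs2 sid='racdb2';
-- check: at least one redo thread with two groups per instance
select thread#, group#, status from v$log order by thread#, group#;
EOF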

How RAC works and related components

Oracle RAC is a configuration that extends multiple single instances: two or more nodes (instances) use one common shared database (that is, one database is installed and opened by several instances at the same time). Each instance has its own CPU and physical memory, and its own SGA and background processes. Compared with a traditional Oracle instance, the most obvious difference in the System Global Area (SGA) and background processes is the addition of the GRD. This memory area records the cluster database's resources and information about its data blocks, because in a RAC architecture each data block can have a copy in each SGA, and RAC must know the location, version, distribution and current state of these blocks; that information is stored in the GRD. The GRD is only responsible for storing this information, not managing it; management is left to the background processes GCS and GES. Multiple Oracle instances access one common shared database, and each has its own SGA, PGA and background processes, which should look familiar, because in a RAC configuration each instance needs these background processes to run. The working principle and mechanism of RAC can be understood from the following aspects.

(1) SCN

SCN is the mechanism Oracle uses to track the order of changes inside the database; think of it as a high-precision clock. Every redo log entry, undo data block and data block carries an SCN, and Oracle's consistent read, current read and multi-version block mechanisms are all built on the SCN. In RAC, GCS is responsible for maintaining the SCN globally. The default is the Lamport SCN generation algorithm: every message exchanged between nodes carries an SCN, each node compares the received SCN with its local SCN and, if the local one is smaller, raises it to the received value; if there is little traffic between nodes, they also report to each other periodically. Therefore some redo is generated even when a node is idle. There is also a broadcast algorithm (Broadcast): after every commit, the node broadcasts the SCN to the other nodes. This places some load on the system, but it guarantees that every node sees the SCN immediately after a commit. The two algorithms each have advantages and disadvantages: Lamport has low overhead but allows a delay between nodes, while broadcast has overhead but no delay. Oracle 10g RAC defaults to the broadcast algorithm, and you can see the relevant line in the alert.log: Picked broadcast on commit scheme to generate SCNS
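You can watch the SCN advance in step across the instances with a simple query; a minimal sketch, where the alert log path in the last line is only an assumed 10g-style example:

sqlplus -s / as sysdba <<'EOF'
select inst_id, current_scn from gv$database;
select dbms_flashback.get_system_change_number as scn from dual;
EOF
grep -i "broadcast on commit" $ORACLE_BASE/admin/racdb/bdump/alert_racdb1.log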

(2) GES/GCS principle of RAC

The Global Enqueue Service (GES) is mainly responsible for maintaining the consistency of the dictionary cache and the library cache. The dictionary cache holds data dictionary information in each instance's SGA for fast access; because this information is kept in memory, changes made to the dictionary on one node, such as DDL, must be propagated immediately to the dictionary cache of every node, and GES is responsible for handling this and eliminating differences between instances. For the same reason, library cache locks are taken on database objects while the SQL statements that reference them are parsed; these locks must be maintained across instances, and the Global Enqueue Service must ensure that no deadlock occurs among multiple instances requesting access to the same object. The LMON, LCK and LMD processes work together to implement the Global Enqueue Service. Beyond the maintenance and management of data blocks themselves (done by GCS), GES is the important service that regulates the other resources shared between the nodes of a RAC environment. To keep the instances in the cluster synchronized, two virtual services are implemented: the Global Enqueue Service (GES), which controls access to locks, and the Global Cache Service (GCS), which controls access to blocks.

GES is an extension of the Distributed Lock Manager (DLM), the mechanism used to manage the locks and data blocks of Oracle Parallel Server. In a cluster environment you need to restrict access to database resources that in a single-instance database are protected by latches or locks; for example, objects in the dictionary cache are protected by implicit locks, and objects in the library cache must be protected by pins while they are referenced. In a RAC cluster these objects are resources protected by global locks. GES is the RAC component responsible for inter-instance global lock communication in the cluster; each resource has a master instance that records its current state, and the resource's current state is also recorded on every instance interested in it. GCS, the other RAC component, coordinates access to data blocks by the different instances. Block accesses and updates are recorded in the Global Resource Directory (GRD), a virtual memory structure that spans all instances. Each block has a master instance, which manages access to it, and the block's current state information is recorded in the GRD. GCS is the mechanism Oracle uses to implement Cache Fusion. The blocks and locks managed by GCS and GES are called resources, and access to them must be coordinated among all instances of the cluster. This coordination happens at both the instance level and the database level: resource coordination at the instance level is called local resource coordination, and coordination at the database level is called global resource coordination.

The mechanism of local resource coordination is similar to that of single-instance oracle, including block-level access, space management, dictionary cache, library cache management, row-level locking, and SCN occurrence. Global resource coordination is for RAC, using additional memory components, algorithms, and background processes in SGA. GCS and GES are designed to be transparent to the application. In other words, you don't need to modify the application because the database is running on RAC, and the parallel mechanism on a single-instance database is reliable on RAC.

Background processes that support GCS and GES use private network heartbeats to communicate between instances. This network is also used by clustered components of Oracle and possibly by clustered file systems such as OCFS. GCS and GES run independently of the Oracle cluster components. However, GCS and GES rely on these cluster components to obtain the status of each instance in the cluster. If this information is not available from an instance, the instance will be closed. The purpose of this shutdown is to protect the integrity of the database, because each instance needs to know about the other instances, so that coordinated access to the database can be better determined. GES controls all library cache and dictionary cache locks in the database. These resources are local in the single instance database, but become global resources in the RAC cluster. Global locks are also used to protect the structure of data and manage transactions. In general, transactions and table locks are consistent in a RAC environment or a single instance environment.

All levels of Oracle use the same GES functions to acquire, transform, and release resources. When the database starts, the number of global queues will be counted automatically. GES uses the background processes LMD0 and LCK0 to perform most of its activities. In general, various processes communicate with local LMD0 background processes to manage global resources. The local LMD0 background process communicates with the LMD0 process on other instances.

The LCK0 background process obtains the locks needed by the whole instance; for example, LCK0 is responsible for maintaining dictionary cache locks. Shadow processes (server processes) communicate with these background processes through AST (Asynchronous Trap) messages. Asynchronous messages are used so that background processes are not blocked while waiting for a reply from a remote instance. Background processes can also send BAST (Blocking Asynchronous Trap) messages to lock-holding processes, asking them to downgrade the lock they currently hold to a lower-level mode. Resources are memory structures that represent components in the database to which access must be restricted or serialized; in other words, a resource can only be accessed serially, by one process or one instance at a time. If the resource is currently in use, other processes that want to access it must wait in a queue until it becomes available. Enqueues are memory structures responsible for serializing access to special resources. If a resource is required only by the local instance, the enqueue can be obtained locally without coordination; but if the resource is requested by a remote instance, the local enqueue must be globalized.
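The effect of this coordination is visible in the global cache statistics each instance keeps; a minimal sketch of checking them (the statistic names below are the standard 10g ones):

sqlplus -s / as sysdba <<'EOF'
select name, value from v$sysstat
 where name in ('gc cr blocks received', 'gc current blocks received',
                'gc cr block receive time', 'gc current block receive time');
EOF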

ClusterWare architecture

In a stand-alone environment Oracle runs on top of the OS kernel, which manages the hardware and provides hardware access interfaces; Oracle does not manipulate the hardware directly but has the OS kernel carry out its hardware requests. In a cluster environment the storage is shared, but the OS kernel is designed for a single machine and can only arbitrate between processes on that machine; relying on OS kernel services alone cannot guarantee coordination among multiple hosts, so an additional control mechanism is needed. In RAC this mechanism is the Clusterware sitting between Oracle and the OS kernel: it intercepts requests before the OS kernel, negotiates with the Clusterware on the other nodes, and finally completes the upper layer's request. Before Oracle 10g, the clusterware RAC needed came from hardware vendors such as Sun, HP and Veritas. From Oracle 10.1, Oracle shipped its own cluster product, Cluster Ready Services (CRS), and from then on RAC no longer depends on any vendor's cluster software. In Oracle 10.2 the product was renamed Oracle Clusterware. So in the whole RAC setup there are actually two cluster layers: one cluster formed by the Clusterware software and one formed by the Database.

(I) the main processes of Clusterware

A) CRSD: responsible for the cluster's high-availability operations. The resources managed by CRS include the database, instances, listeners, virtual IPs, ONS, GSD and so on, and the operations include starting, stopping, monitoring and failing them over. This process is managed and started by the root user; if crsd fails, it is restarted automatically.

B) cssd, which manages the relationship between nodes, is used for communication between nodes, and nodes notify the cluster when they join or leave the cluster. The process is run and managed by oracle users. Cssd also automatically restarts the system in the event of a failure.

C) oprocd: the cluster's process monitor, used for I/O fencing to protect shared data.

D) oprocd runs only when vendor clusterware is not used.

E) evmd: the event detection process, run and managed by the oracle user.

Cluster Ready Services (CRS) is the basic program for managing high-availability operations in the cluster. Anything managed by CRS is called a resource: a database, an instance, a listener, a virtual IP (VIP) address, an application process and so on. CRS manages these resources according to the resource configuration information stored in OCR, including start, stop, monitor and failover operations. When the state of a resource changes, the CRS process generates an event. When RAC is installed, the CRS process monitors the Oracle instances, listeners and so on and automatically restarts these components when they fail. By default CRS restarts a resource five times; if it still fails to start, no further attempts are made. Event Management (EVM): the background process that publishes events created by CRS. Oracle Notification Service (ONS): the publish-and-subscribe service for Fast Application Notification (FAN) events. RACG: extends clusterware functionality to support Oracle-specific requirements and complex resources; it runs server-side callout scripts when FAN events occur. Process Monitor Daemon (OPROCD): this process is locked in memory to monitor the cluster and provide I/O fencing. OPROCD performs its checks, and if it wakes up later than the expected interval, it resets the processor and reboots the node; an OPROCD failure causes Clusterware to reboot the node.

Cluster Synchronization Services (CSS): the cluster synchronization service, which manages cluster configuration and membership (who is a member, who joins, who leaves) and notifies the members; it is the basis of inter-process communication in the cluster environment. CSS is also used to handle the interaction between ASM instances and regular RDBMS instances in a single-instance environment. In a cluster environment CSS additionally provides group services, that is, dynamic information about which nodes and instances make up the cluster at any given time, as well as static information such as node names (modified when nodes are added or removed). CSS maintains basic locking functionality within the cluster (although most locks are maintained by the distributed lock management integrated in the RDBMS). Among other things, CSS maintains the heartbeat between the nodes and monitors the voting disk for split-brain failures. In the final stage of installing clusterware you are asked to run the root.sh script on each node; this script adds the three processes to the end of the /etc/inittab file so that clusterware starts automatically every time the system boots. If the EVMD or CRSD process fails, the system restarts it automatically; if the CSSD process fails, the system reboots immediately.
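Once the stack is up, the state of these daemons can be checked with crsctl from any node; a minimal sketch:

crsctl check crs     # overall health of CSS, CRS and EVM
crsctl check cssd
crsctl check crsd
crsctl check evmd
crs_stat -t          # the resources (VIPs, listeners, instances, services) CRS manages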

Note:

1. Voting Disk and OCR must be stored on the storage device for each node to access.

2. Voting Disk, OCR and network must be specified or configured during or before installation. After the installation is complete, you can configure and modify it through some tools.
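The tools referred to in note 2 include the following; a minimal sketch of checking where OCR and the voting disk live after installation:

ocrcheck                    # integrity and location of OCR
ocrconfig -showbackup       # automatic OCR backups kept by clusterware
crsctl query css votedisk   # location of the voting disk(s)
cat /etc/oracle/ocr.loc     # file from which clusterware reads the OCR location (Linux)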

RAC software structure

RAC software structure can be divided into four parts.

(1) operating-system-dependent software; (2) the RAC shared-disk components; (3) RAC-specific background processes and instance processes; (4) the Global Cache Service and Global Enqueue Service in RAC.

(1) Operating System-Dependent (OSD)

RAC accesses the operating system and some cluster-related service processes through operating-system-dependent software. The OSD software may be provided by Oracle (on Windows) or by the hardware vendor (on Unix). OSD consists of three sub-parts:

(1) The Cluster Manager (CM): monitors communication between nodes and coordinates node operations through the interconnect. It provides a unified view of all nodes and instances in the cluster and also controls cluster membership.

(2) The Node Monitor: reports the status of the various resources within a node, including the node itself, the interconnect hardware and software, and the shared disks.

(3) The Interconnect: carries the heartbeat between nodes (two heartbeat mechanisms: a network heartbeat over the private network and a disk heartbeat on the voting disk).

(2) Real Application Clusters Shared Disk Component

These components are no different from those of a single-instance Oracle database: one or more control files, several redo log files, optional archive log files, data files and so on. Using a server parameter file (SPFILE) in RAC simplifies parameter management, because global and instance-specific parameters can be stored in the same file.

(3) Real Application Cluster-Specific Daemons and Instance Processes. These include the following:

The Global Service Daemon (GSD): runs on each node and receives management messages from clients such as DBCA and EM, carrying out the corresponding management tasks, for example instance startup and shutdown.

RAC-specific instance processes:

Global Cache Service Processes (LMSn): control the flow of messages to remote instances and manage access to global data blocks; they are also used to pass block images between the buffer caches of different instances.

Global Enqueue Service Monitor (LMON): monitors global enqueues and resource interaction across the cluster and performs global enqueue recovery operations.

Global Enqueue Service Daemon (LMD): manages global enqueue and global resource access; for each instance, LMD handles resource requests coming from remote instances.

Lock Processes (LCK): manage requests for non-Cache Fusion resources, such as data files, control files, data dictionary views, and library and row cache requests.

Diagnosability Daemon (DIAG): captures diagnostic data when a process fails in the instance.

(4) The Global Cache and Global Enqueue Service

Global Cache Service (GCS) and Global queue Service (GES) are integrated components of RAC that coordinate simultaneous access to shared databases and shared resources within databases.

GCS and GES include the following features:

Application transparency.

Distributed global resource directory: as long as at least one node survives, GCS and GES guarantee the integrity of the global resource directory even if one or more nodes fail.

Resource mastering: GCS and GES choose one instance to manage all the information for a given resource; this instance is called the resource master. GCS and GES periodically evaluate and change the resource master according to how the data is accessed, which reduces network traffic and resource acquisition time.

Interaction between GCS/GES and CM: GCS and GES are independent of CM, but they rely on the instance status information that CM provides for each node. Once the information for an instance can no longer be obtained, Oracle immediately shuts down the unresponsive instance to protect the integrity of the whole RAC.

Cluster Registry (OCR)

The "amnesia" problem arises because each node keeps a copy of the configuration information, and configuration changes made on one node are not synchronized to the others. The solution Oracle adopted is to put the configuration file on shared storage; this file is called the OCR disk. The configuration information of the entire cluster is stored in OCR as key-value pairs. Before Oracle 10g this file was called the Server Manageability Repository (SRVM); in Oracle 10g this component was redesigned and renamed OCR. During Oracle Clusterware installation the installer prompts for the OCR location, and the location specified is recorded in /etc/oracle/ocr.loc (Linux) or /var/opt/oracle/ocr.loc (Solaris). In Oracle 9i RAC the equivalent is the srvConfig.loc file. When Oracle Clusterware starts, it reads the OCR content from the location recorded there.
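
As a hedged illustration (the raw-device path below is hypothetical), ocr.loc simply records where OCR lives, and the ocrcheck utility verifies it:

$ cat /etc/oracle/ocr.loc
ocrconfig_loc=/dev/raw/raw1
local_only=FALSE

$ ocrcheck

ocrcheck reports the OCR version, total and used space, the configured location(s) and the result of an integrity check.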

(1) OCR key

The OCR information is a tree structure with three major branches: SYSTEM, DATABASE and CRS. Each branch has many sub-branches, and these records can only be modified by the root user.
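
To look at the key tree you can dump OCR to a text file with ocrdump (run as root); the output path below is only an example:

# ocrdump /tmp/ocr_dump.txt
# grep SYSTEM /tmp/ocr_dump.txt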

(2) OCR process

Oracle Clusterware stores the cluster configuration information in OCR, so the OCR content is very important, and every operation on OCR must preserve its integrity. For that reason not every node is allowed to write to the OCR disk while Oracle Clusterware is running. Each node keeps a copy of the OCR content in memory, called the OCR cache, and each node has an OCR process that reads and writes this cache; however, only one node's OCR process is allowed to read and write the OCR disk. That node is called the OCR master node, and its OCR process is responsible for refreshing the OCR cache of the local node and of all other nodes. Other processes that need OCR content, such as OCSSD and EVM, are called client processes; they do not access the OCR cache directly but send requests to the local OCR process and obtain the content through it. When OCR content needs to be modified, the local node's OCR process submits the request to the master node's OCR process, which performs the physical read/write and then synchronizes the OCR cache on all nodes.

Oracle arbitration disk (Voting Disk)

The voting disk records node membership status. When split brain occurs, it decides which partition gains control; the other partitions must be evicted from the cluster. You are also prompted to specify this location when installing Clusterware. After installation the voting disk location can be viewed with the following command: $ crsctl query css votedisk
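
A hedged example of what this looks like (the device paths are hypothetical); note that in 10g adding or removing a voting disk normally requires the CRS stack to be down and the -force option:

$ crsctl query css votedisk
0.     0    /dev/raw/raw2
located 1 votedisk(s).

# crsctl add css votedisk /dev/raw/raw3 -force
# crsctl delete css votedisk /dev/raw/raw2 -force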

Network connection of cluster

I. Private network

Each cluster node is connected to all other nodes through a dedicated high-speed network, also known as the cluster interconnect or high-speed interconnect (HSI). Oracle's Cache Fusion technology uses this network to effectively combine the physical memory (RAM) of each host into one cache. Oracle Cache Fusion allows any instance to access data held in the cache of another Oracle instance by transferring it over the private network; it also maintains data integrity and cache consistency by transferring locks and other synchronization information between the cluster nodes. Private networks are usually built on Gigabit Ethernet, but for high-volume environments many vendors provide proprietary low-latency, high-bandwidth solutions designed specifically for Oracle RAC. Linux also provides a way to bond multiple physical NICs into one virtual NIC (not covered here) to increase bandwidth and improve availability.

II. Public network

To maintain high availability, each cluster node is assigned a virtual IP address (VIP). If the host fails, the IP address of the failed node can be reassigned to an available node, allowing the application to continue to access the database through the same IP address.

3. Virtual lP (VIP)

That is, the virtual IP. Oracle recommends that clients connect through the specified virtual IP; this is also a new feature of Oracle 10g. Its essential purpose is to keep applications running without interruption (small gaps remain, but it is very close to that goal). Users connect to the virtual IP. This IP is not permanently bound to a network card but is managed by Oracle processes. Once the instance behind the virtual IP a user is connected to goes down, Oracle automatically relocates the virtual IP to a healthy instance, so user access to the database is not affected and the application does not need to be modified. Oracle's TAF is built on VIP technology. The difference between IP and VIP is that a plain IP relies on the TCP-layer timeout, while a VIP gives an immediate response at the application layer. A VIP is a floating IP: when one node has a problem, it automatically moves to another node.

Transparent application handover (TAF)

Transparent Application Failover (TAF) is provided by the Oracle database and is widely used in RAC environments; it can also be used with Data Guard primary/standby configurations and with traditional HA implementations. The words Transparent and Failover point out the two main characteristics of this high-availability feature:

TAF is used for failover, that is, switching. When a session connected to Oracle becomes unavailable due to a database failure, the session can automatically switch to another available node in RAC, or to Standby, or to another available node in HA mode. The failover of TAF is transparent to the application, and the application system can fail over automatically without special processing.

But is TAF perfect? Can applications really switch over seamlessly with TAF? Does it place any requirements on applications and the database? To answer these questions you need to understand TAF thoroughly; to use something well you must first master the principles behind it. First, look at failover. There are two kinds: connect-time failover and run-time failover. The former means that when the application (client) connects to the database and cannot reach it because of a network or instance failure, it can connect to another instance of the database. The latter means that for a session that is already established, if the instance serving it terminates abnormally, the application (client) can reconnect to another instance of the database (or to the standby database).
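
A minimal client-side TAF entry in tnsnames.ora might look like the following sketch (host names, service name and retry values are illustrative):

ORCL_TAF =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac1-vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac2-vip)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = orcl)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
    )
  )

TYPE=SELECT lets a query that was in progress resume on the surviving instance, and METHOD=BASIC creates the backup connection only at failover time (PRECONNECT opens it in advance); uncommitted transactions are still rolled back, since TAF does not fail over DML.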

Connection load balancing

Load balancing here refers to balancing connections. RAC load balancing mainly means that when a new session connects to the RAC database, the cluster decides, according to the CPU load of the server nodes, which node the new connection should be directed to. Oracle RAC provides dynamic services, and load balancing comes in two forms: client-side (connection-based) and server-side.
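
Two hedged examples of the knobs involved. For client-side balancing, add LOAD_BALANCE to the address list of the client's tnsnames.ora entry:

(ADDRESS_LIST =
  (LOAD_BALANCE = yes)
  (ADDRESS = (PROTOCOL = TCP)(HOST = rac1-vip)(PORT = 1521))
  (ADDRESS = (PROTOCOL = TCP)(HOST = rac2-vip)(PORT = 1521))
)

For server-side balancing, each instance must register with the listeners of all nodes through the REMOTE_LISTENER parameter (LISTENERS_ORCL is assumed to be an alias defined in tnsnames.ora):

SQL> alter system set remote_listener='LISTENERS_ORCL' scope=both sid='*';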

The principle and characteristics of VIP

Oracle's TAF is built on VIP technology. The difference between IP and VIP is that a plain IP relies on the TCP-layer timeout, while a VIP responds immediately at the application layer. A VIP is a floating IP: when one node has a problem, it automatically moves to another node. Suppose you have a two-node RAC; in normal operation each node has a VIP, namely VIP1 and VIP2. When node 2 fails, for example an abnormal shutdown, RAC does the following:

(1) after CRS detects the failure of the rac2 node, it triggers Clusterware reconfiguration; finally the rac2 node is removed from the cluster and a new cluster is formed by node 1 alone.

(2) the Failover mechanism of RAC transfers the VIP of node 2 to node 1, and then there are three IP addresses on the PUBLIC network card of node 1: VIP1,VIP2, PUBLIC IP1.

(3) the user's connection request to VIP2 will be routed to node 1 by IP layer.

(4) because there is an VIP2 address on node 1, all packets will pass smoothly through the routing layer, network layer, and transport layer.

(5) however, only the two addresses VIP1 and public IP1 are being listened on at node 1. Nothing listens on VIP2, so there is no program at the application layer to receive the packets, and an error is returned immediately.

(6) the client receives the error immediately and then re-initiates a connection request to VIP1.

VIP features:

VIP is created by the VIPCA script.

VIP is registered in OCR as a CRS resource of type Nodeapps, and its state is maintained by CRS.

VIP is bound to the node's public NIC, so the public NIC carries two addresses.

When a node fails, CRS relocates the failed node's VIP to another node.

The listener on each node listens on both the public IP and the VIP of the public NIC.

The client's tnsnames.ora is generally configured to point at each node's VIP.

Log system

Redo Thread

A RAC environment has multiple instances, and each instance needs its own set of redo log files; such a set is called a redo thread. A single instance also has a redo thread, although the term is rarely used there. Giving each instance its own redo thread avoids the performance bottleneck that resource contention would otherwise cause. There are two kinds of redo thread: private threads, created with ALTER DATABASE ADD LOGFILE THREAD n, and public threads, created with ALTER DATABASE ADD LOGFILE. In RAC each instance has a THREAD initialization parameter, which defaults to 0. If the parameter is left at the default 0, the instance picks an available public redo thread at startup and then uses it exclusively. Each instance in RAC needs one redo thread, each redo thread needs at least two redo log groups, the members of a log group should be equal in size, each group should preferably have two or more members, and the members should be placed on different disks to avoid a single point of failure.

Note: in a RAC environment, redo log groups are numbered at the database level. If instance 1 has log groups 1 and 2, the log groups of instance 2 should start from 3 rather than reusing the numbers 1 and 2. In a RAC environment the online redo logs of all instances must be placed on shared storage, because if a node shuts down abnormally, one of the remaining nodes must perform crash recovery for it, and the node doing crash recovery must be able to read the online logs of the failed node. Only shared storage satisfies this requirement.
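
For example, adding and enabling a redo thread for a second instance might look like this (group numbers, file locations and sizes are illustrative):

SQL> alter database add logfile thread 2
       group 3 ('+DATA/orcl/redo_t2_g3.log') size 50M,
       group 4 ('+DATA/orcl/redo_t2_g4.log') size 50M;
SQL> alter database enable public thread 2;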

Archive log

Each instance in RAC generates its own archive logs, which are needed only for media recovery, so archive logs do not have to be placed on shared storage; each instance may keep them locally. However, if you back up archive logs on one node or perform media recovery there, that node must be able to access the archive logs of all instances. In a RAC environment the archive log configuration can be chosen from the following options.

Use NFS

Archive directly to storage over NFS. For example, with two nodes, each node has two directories, Arch2 and Arch3, corresponding to the archive logs generated by instance 1 and instance 2 respectively. Each instance is configured with one archive location, archives locally, and the other node's directory is mounted locally via NFS.

Inter-instance archiving (Cross Instance Archive CIA)

Cross-instance archiving is a variation of the previous approach and is also a common configuration. Both nodes create the two directories Arch2 and Arch3, corresponding to the archive logs generated by instance 1 and instance 2 respectively. Each instance is configured with two archive locations: location 1 is the local archive directory, and location 2 points to the other instance's directory.
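
A hedged sketch of the corresponding parameters on instance 1 (directory names follow the text above; the mount point /node2_arch and the NFS export are assumptions):

SQL> alter system set log_archive_dest_1='LOCATION=/Arch2' sid='rac1';
SQL> alter system set log_archive_dest_2='LOCATION=/node2_arch/Arch2' sid='rac1';

Here /Arch2 is the local directory for instance 1's logs, while /node2_arch/Arch2 is node 2's Arch2 directory mounted via NFS, so each node ends up holding the archive logs of both instances.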

Use ASM

Archive to shared storage through the ASM provided by Oracle. ASM hides the complexity described above, but the principle is the same.

Trace log

Diagnosing Oracle Clusterware relies on its logs and trace files, and its log system is fairly complex. alert.log: $ORA_CRS_HOME/log/hostname/alert.log; this is the first file to check.

Clusterware background process log

crsd.log: $ORA_CRS_HOME/log/hostname/crsd/crsd.log

ocssd.log: $ORA_CRS_HOME/log/hostname/cssd/ocssd.log

evmd.log: $ORA_CRS_HOME/log/hostname/evmd/evmd.log

Nodeapp log location

$ORA_CRS_HOME/log/hostname/racg/ contains the nodeapps logs, including ONS and VIP, for example ora.rac1.ons.log.

Tool execution logs: $ORA_CRS_HOME/log/hostname/client/

Clusterware provides a number of command-line tools such as ocrcheck, ocrconfig, ocrdump, oifcfg and clscfg; the logs generated by these tools are placed in this directory. Logs are also written to $ORACLE_HOME/log/hostname/client/ and $ORACLE_HOME/log/hostname/racg/.

Cache Fusion principle

The background processes of RAC were introduced earlier. To understand how they work in more depth, we first need to see how multiple nodes in RAC manage access to the shared data files. The key concept at the core of RAC is Cache Fusion. For Cache Fusion to be worthwhile, one prerequisite must hold: accessing a block over the interconnect must be faster than reading it from disk, otherwise there is no point in introducing Cache Fusion. In practice, interconnects of 100 Mb and faster are commonplace today.

What is Cache Fusion?

Cache Fusion transfers data blocks between the SGAs of the cluster nodes over the high-speed private interconnect. This is the core working mechanism of RAC: it virtualizes the SGAs of all instances into one large SGA. Whenever different instances request the same data block, the block is shipped between instances across the private interconnect, avoiding the inefficient OPS-style implementation of first pushing the block to disk and then re-reading it into the cache of the other instance. When a block is read into the cache of an instance in a RAC environment, the block is assigned a lock resource (different from a row-level lock) so that other instances know the block is in use. If another instance then requests a copy of the block and the block is already in the first instance's cache, the block is passed directly to the other instance's SGA over the interconnect. If the block in memory has been changed but the change has not been committed, a CR (consistent read) copy is shipped instead. This means that, whenever possible, blocks move between instance caches without being written back to disk, avoiding the extra I/O otherwise needed to synchronize the caches of multiple instances. Clearly, different instances may cache different data: before an instance accesses a particular block for the first time, it either obtains it through Cache Fusion from another instance or reads it from disk. The GCS (Global Cache Service) and GES (Global Enqueue Service) processes manage the synchronized transfer of data blocks between cluster nodes over the interconnect.
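
To watch Cache Fusion at work you can compare how many blocks each instance received over the interconnect; the statistic names below are the common 10g ones, shown here as a hedged example:

SQL> select inst_id, name, value from gv$sysstat
     where name in ('gc cr blocks received', 'gc current blocks received');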

There are still some questions to think about:

When no instance has read a block yet and the first instance reads it, what lock is taken and how? If another instance tries to read the same block at almost the same moment, how does Oracle arbitrate so that one reads it from disk and the other obtains it from the former's cache through Cache Fusion? If a block has already been read in by another instance, how does this instance find that out? If one instance changes the block, is the change passed to the other instances, or do the other instances learn of it and refresh their status? If an instance wants to swap a block out of its cache while other instances also hold cached copies of the block (modified or unmodified, modified by this instance or by others), what happens? How do TRUNCATE and DROP of a table differ from the single-instance case? How should applications be designed so that RAC really delivers, rather than introducing contention that weakens the system?

How locking is implemented in RAC

Locks are resources reserved in the SGA of each instance and are typically used to control access to database blocks. Each instance usually holds or controls a certain number of locks corresponding to a range of blocks. When an instance requests a block, the block must be covered by a lock, and the lock must come from the instance that currently controls it; that is, locks are distributed over the different instances, and to obtain a specific lock you may have to get it from another instance. Locks are not fixed to one instance, however: they are re-mastered onto the instance that uses them most frequently, which improves efficiency. Allocating, reallocating and controlling these resources is expensive, which is why RAC places relatively high demands on application design. When an instance crashes or joins, there is a lengthy process of resource reallocation; resources are also redistributed during normal operation to use them more efficiently, and again when an instance starts or joins. You can see this information in the alert log. Because Cache Fusion and this resource control require a fast interconnect, you should watch the interconnect message statistics to measure its traffic and response time. Some of the questions above can be answered with two further concepts: the Global Cache Service and the Global Enqueue Service.

Global caching service (GCS): to be understood in conjunction with Cache Fusion. Global caching involves data blocks. The global cache service is responsible for maintaining cache consistency in the global cache, ensuring that an instance can obtain a global lock resource when it wants to modify a block at any time, thus avoiding the possibility that another instance will modify the block at the same time. The modified instance will have the current version of the block (both committed and uncommitted) as well as the post image of the block. If another instance also requests the block, the global caching service is responsible for tracking the instance that owns the block, what version of the block is owned, and what mode the block is in. The LMS process is a key component of the global caching service.

(A guess: today's Cache Fusion ships the block and builds a copy in the SGA of the accessing instance, mainly because reading from local memory is faster than going over the interconnect again, so a repeated access can be served locally. The downside is that multiple copies of a block exist on multiple nodes, so the management overhead is considerable. Perhaps later versions of Oracle will offer a better solution, if interconnect speed permits.)

Global Enqueue Service (GES): mainly responsible for maintaining consistency of the dictionary cache and the library cache. The dictionary cache holds data dictionary information in the SGA of an instance for fast access. Because this information is kept in memory, dictionary changes made on one node, such as DDL, must be propagated immediately to the dictionary cache of every node; GES handles this and eliminates differences between instances. For the same reason, library cache locks are taken on database objects while the SQL statements that reference them are parsed; these locks must be maintained across instances, and GES must ensure that no deadlock occurs among multiple instances requesting the same object. The LMON, LCK and LMD processes work together to implement the Global Enqueue Service. Besides the management of the data blocks themselves (handled by GCS), GES is the important service that coordinates the other resources between nodes in a RAC environment.

SQL > select * from gv$sysstat where name like 'gcs%'

Here you can see the number of GCS and GES messages sent. (If the database was not created with DBCA, you need to run the CATCLUST.SQL script as SYSDBA to create the RAC-related views and tables.)

What is high availability?

Oracle Fail Safe, Data Guard and RAC are all high-availability (HA) solutions provided by Oracle, but they differ greatly. HA stands for High Availability. HA is really a concept rather than a specific technology or set of technologies, much like grid; when you plan a system, high availability is one of the aspects you evaluate, and OS-level two-node hot standby is one example. RAC stands for Real Application Clusters, the technology of running one database on multiple hosts, that is, one database with multiple instances. Its advantage is that several machines of modest capability can be combined into a cluster with good overall performance and load balancing, and when a node fails, the services on it are automatically transferred to another node, often without users noticing anything.

The difference between FAILSAFE and RAC

1. Operating system:

Fail Safe is limited to the Windows platform and must work together with MSCS (Microsoft Cluster Server). RAC was first launched on UNIX platforms and has since been extended to Linux and Windows; it interacts with the operating system through OSD (operating system dependent) components. For high-end RAC applications, UNIX is still the preferred platform.

2. System structure:

FAILSAFE adopts a shared-nothing structure: several servers form a cluster connected to a shared disk system, but at any moment only one server accesses the shared disk and provides service, and another server takes over the shared disk only when that server fails. RAC adopts shared-everything: every server in the cluster can access the shared disk and provide service. In other words, FAILSAFE can use only one server's resources at a time, while RAC can use the resources of multiple servers in parallel.

3. Operating mechanism:

Each server in a FAILSAFE cluster has its own IP, the cluster as a whole has an IP, and a separate IP is assigned to the FAILSAFE GROUP (the last two are virtual IPs, so clients can access the database transparently as long as they know the cluster IP). During operation only one server (the preferred owner) provides service while the others stand by; when the active server fails, another takes over, including the FAILSAFE GROUP IP and the cluster IP, and FAILSAFE starts the database service, listener and other services on it. Clients simply reconnect without any changes. In a RAC cluster each server has its own IP, instance and so on and can provide service independently, but they all operate on the same database on the shared disk. When a server fails, users only need to modify the network configuration (such as tnsnames.ora) to reconnect to a surviving server; when used together with TAF, even that network change can be made transparent.

4. Cluster capacity:

The former is usually limited to two nodes, while the latter can be expanded to eight nodes on some platforms.

5. Disk format:

The disks used by a FAILSAFE database must be formatted as NTFS, while RAC is more flexible: raw devices are usually required, but several operating systems now provide cluster file systems that RAC can use directly. To sum up, FAILSAFE suits systems with high reliability requirements, relatively small applications and modest performance requirements, while RAC suits larger applications with higher demands on reliability, scalability and performance.

The difference between RAC and OPS

RAC is the successor to OPS and inherits the concept of OPS, but RAC is brand new, and the CACHE mechanism is completely different from OPS. RAC solves the conflict problem caused by two nodes writing the same BLOCK at the same time in OPS. In terms of products, RAC and OPS are completely different products, but we can think of them as different versions of the same product.

The difference between dual-computer Hot standby, RAC and Data Guard

Data Guard is Oracle's remote replication technology; it comes in physical and logical forms, but in general it requires an independent system at another site. The two systems may have different hardware configurations, but their software structure must be kept consistent, including software version and directory layout, and the data is kept in sync (not truly real-time); all that is needed is a network connection between the two systems. It is a disaster-recovery solution for a remote site. RAC, by contrast, is a local high-availability cluster in which the nodes share different or the same applications, solving the problems of limited compute capacity and single-node failure. It is made up of several servers with the same or different hardware plus a SAN (shared storage area). The comparison of Oracle high availability products is shown in the following table:

Communication between nodes (Interconnect)

Usually in a RAC environment, in addition to the public network, two private networks are configured for the interconnect between nodes. In the HACMP/ES resource definition these two networks should be defined as "private". During instance startup, RAC automatically identifies and uses these private networks, and if there is a "public" network RAC identifies it as well. When RAC detects multiple networks, it uses TNFF (Transparent Network Failover Failback): under TNFF, all nodes communicate over the first private network, and the second (or third, and so on) serves as a backup after the first fails. Communication between RAC nodes is shown in the following figure.

CLUSTER_INTERCONNECTS is an optional initialization (init.ora) parameter in Oracle RAC. It specifies which network is used for interconnect communication between nodes; if several networks are specified, RAC automatically load-balances across them. However, when CLUSTER_INTERCONNECTS is set, TNFF no longer works, which lowers the availability of RAC: the failure of any of the configured interconnect networks will cause one or more RAC nodes to fail. For the physical connection of the interconnect NICs there are two choices: through a switch, or directly with a network cable. The drawback of a direct connection is that when the interconnect NIC on one node fails, the OS reports an abnormal NIC state on both nodes to Oracle, which can bring down both instances. When the interconnect link has a problem, Oracle normally starts a competition mechanism to decide which instance goes down; if the evicted instance happens to be the healthy one, both instances end up failing. In 9i, Oracle waits for a period of time before activating the competition mechanism, giving the OS a chance to report the NIC state; if Oracle learns before the timeout which instance has the failed NIC, that instance is taken down and the healthy instance keeps serving, otherwise the competition mechanism is still used.
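
If you do decide to pin the interconnect manually, the parameter is set per instance in the spfile and takes effect after a restart; the addresses below are hypothetical:

SQL> alter system set cluster_interconnects='10.0.0.1' scope=spfile sid='rac1';
SQL> alter system set cluster_interconnects='10.0.0.2' scope=spfile sid='rac2';

Keep the trade-off described above in mind: with CLUSTER_INTERCONNECTS set, TNFF no longer protects you if that network fails.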

To sum up, the communication between nodes is divided into two situations:

(1) Connected through a switch: under normal circumstances the healthy instance keeps serving, but if the OS does not report the NIC state to Oracle in time, both nodes may go down.

(2) Directly connected: both instances will go down.

CSS heartbeat

OCSSD is the most critical Clusterware process; if it fails, the node restarts. This process provides the CSS (Cluster Synchronization Service), which monitors cluster status in real time through several heartbeat mechanisms and provides basic cluster services such as split-brain protection.

CSS has two heartbeat mechanisms: the network heartbeat over the private network and the disk heartbeat through the voting disk. Each heartbeat has a maximum allowed delay: for the disk heartbeat it is called IOT (I/O Timeout) and for the network heartbeat it is called MC (Misscount), both expressed in seconds. By default IOT is greater than MC. Both parameters are determined automatically by Oracle, and adjusting them is not recommended. You can view their values with the following commands:

$crsctl get css disktimeout

$crsctl get css misscount

The communication protocols used between Oracle RAC nodes are shown in the following table.

Lock

A lock is a data structure used to control concurrency. If two processes try to modify the same data at the same time, locks control the order of access to prevent confusion and accidents: the holder of the lock accesses the data first, and the other process cannot obtain the lock, and hence access the data, until the first one releases it. Broadly, there are two kinds of locks in RAC: locks between processes on the same host and locks between processes on different hosts. The local locking mechanisms are of two types: locks and latches.

Global locks are RAC locks, that is, locks between different hosts. Oracle uses the DLM (Distributed Lock Management) mechanism. In Oracle RAC data is shared globally: every process sees the same data blocks, and blocks can be passed between machines. The GRD (Global Resource Directory) describes this structure.

Mode, Role and the past-image indicator together form the basic structure of a RAC lock.

Mode can be N (null), S (shared) or X (exclusive). Role can be Local or Global. The third element distinguishes PI and XI: generally 0 means XI and 1 means PI. The GRD is involved in global memory management, database file reads in RAC, Cluster Ready Services (CRS) and global resource directory consistency management.

Data consistency and concurrency describe how Oracle maintains data consistency in a multi-user database environment. In a single-user database, a user can modify data without worrying about other users modifying the same data at the same time. In a multi-user database, statements in transactions executed simultaneously can modify the same data, and concurrent transactions must still produce meaningful and consistent results. Therefore, in a multi-user database, control of data concurrency (many users can access data at the same time) and data consistency (each user sees a consistent view of the data) is very important. The ANSI/ISO SQL standard (SQL92) defines four transaction isolation levels, which affect transaction performance in different ways. These isolation levels were defined with three phenomena in mind that must be avoided when transactions run concurrently. The three phenomena are:

Dirty read: a transaction reads data that has been written by another transaction but not yet committed.

Non-repeatable (fuzzy) read: a transaction re-reads data it has previously read and finds that the data has been modified or deleted by another, committed transaction.

Phantom read: a transaction re-runs a query and the returned set of rows includes additional rows inserted by another committed transaction.

SQL92 defines the four isolation levels (read uncommitted, read committed, repeatable read and serializable) in terms of these phenomena: each successive level blocks more of them, and serializable blocks all three.
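
Oracle itself implements read committed (the default) and serializable, plus a read-only mode; for example:

SQL> set transaction isolation level serializable;
SQL> alter session set isolation_level = serializable;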

OCR structure

(1) OCR KEY is a tree structure.

(2) OCR PROCESS: each node keeps a copy of OCR in its OCR cache, and the OCR master node is responsible for writing updates to the OCR disk.

Oracle Clusterware background process

Defined in the auto-start script /etc/inittab:

OCSSD (Cluster Synchronization Service) provides the heartbeat mechanism that monitors cluster status.

DISK HEARTBEAT

NETWORK HEARTBEAT

CRSD (Cluster Ready Service) provides high-availability operations on resources: intervention, shutdown, restart and relocation.

The resources it manages include nodeapps and database-related resources: the former only need to run on one node to work normally, while the latter are tied to the database rather than to a particular node and can exist in multiple copies.

EVMD: this process is responsible for publishing events generated by CRS and serves as a bridge between CRS and CSS services.

RACGIMON: this process checks the health of the database and handles starting, stopping and failing over the database services. It keeps a persistent connection and checks the SGA periodically.

OPROCD (Process Monitor Daemon) detects CPU hang (for use on non-Linux platforms)

Concurrency Control of RAC

DLM distributed lock management.

Non-Cache Fusion resources: data files, control files, data dictionary views, library cache, row cache.

Cache Fusion resources: ordinary data blocks, index blocks, segment headers, UNDO blocks.

GRD (Global Resource Directory): records the distribution map of each data block across the cluster, divided between master node and shadow node.

PCM lock: mode, role, past image.

The main background processes of RAC:

LMS0 (Lock Manager Service): the corresponding service is GCS (Global Cache Service); mainly responsible for transferring data blocks between instances for Cache Fusion; the number of processes is controlled by the parameter GCS_SERVER_PROCESSES.

LMD: the corresponding service is GES (Global Enqueue Service); mainly responsible for lock management during block transfer.

LCK: responsible for synchronized access to non-Cache Fusion resources; one process per instance.

LMON: communicates with the other instances periodically; the corresponding service is CGS (Cluster Group Service). It provides node monitoring, recorded as a 0/1 bitmap in the GRD (0: node down, 1: node running normally), communicates periodically through the CM layer, and uses two heartbeat mechanisms: the network heartbeat and the control-file disk heartbeat, every 3 seconds.

DIAG: monitors the status of the instance and writes to alert.log.

GSD: provides a management interface for users.

Trigger conditions for RAC reconfiguration

(1) NM (NODE MANAGEMENT) group

(2) triggers for cluster reconfiguration: a node joins or leaves the cluster (triggered by NM); a network heartbeat abnormality, that is, LMON, GCS or GES communication failure (triggered by IMR, Instance Membership Reconfiguration, through the control-file heartbeat).

Advantages and disadvantages of RAC

RAC advantages

(1) Multi-node load balancing

(2) provide high availability, fault tolerance and seamless handover functions to minimize the impact of hardware and software anomalies.

(3) shortens transaction response time through parallel execution; this is usually exploited in data analysis systems.

(4) increases transactions per second and the number of connections through scale-out; this is commonly exploited in OLTP systems.

(5) saves hardware costs: multiple cheap PC servers can replace minicomputers and mainframes, along with the corresponding maintenance costs.

(6) it has good scalability, and it is convenient to add and delete nodes and expand hardware resources.

Shortcomings of RAC

(1) management is more complex and demands more of administrators

(2) when the system is poorly planned and designed, performance may even be worse than that of a single node

(3) it may increase software costs (Oracle is licensed per CPU)

Shared storage

When you need to map a LUN (logical unit number) to multiple nodes to provide a shared storage volume for the cluster, the LUN ID of that LUN must be the same on every host. For example:

(1) when creating a VMFS volume for multiple ESX nodes

(2) when creating shared storage in a dual-computer HA cluster

Time consistency

In a cluster the nodes must cooperate, so the clocks of the hosts must agree; the time difference between nodes must not exceed the allowed threshold (generally, a difference of more than about 30 seconds is likely to cause a node to restart), so the nodes' clocks must be synchronized. For example, you can configure an NTP time server and have every RAC node synchronize against it, or let the nodes synchronize time among themselves; the latter keeps the nodes consistent with each other but cannot guarantee that the absolute time of the RAC database is accurate.
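
A minimal sketch on Linux, assuming a reachable time server at 192.168.1.10 (hypothetical address); the -x option makes ntpd slew the clock gradually instead of stepping it, which is what a cluster prefers:

# /etc/ntp.conf
server 192.168.1.10 prefer
driftfile /var/lib/ntp/drift

# /etc/sysconfig/ntpd  (Red Hat style)
OPTIONS="-x -u ntp:ntp -p /var/run/ntpd.pid"

# service ntpd restart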

Interconnect (private network, heartbeat line)

The cluster relies on the internal interconnect for data communication and the heartbeat. Whether you use ordinary Ethernet or another high-speed network (such as InfiniBand) therefore matters, and some clusters even use serial lines to transmit heartbeat information. In addition, the network parameters you choose have a great influence on the overall performance and robustness of the cluster.

Case study:

XX City, 4-node Oracle 10g RAC

The operating system is RHEL 4. Following the default installation document, the network parameters were set to the following values:

net.core.rmem_default = 262144

net.core.rmem_max = 262144

A query statement took 11 minutes to execute. After the parameters were changed to:

net.core.rmem_default = 1048576

net.core.rmem_max = 1048576

the same query took only 16.2 seconds.

Consistency of firmware, driver, upgrade package

Case study:

XX city, an HPC cluster running LS-DYNA (a general-purpose explicit nonlinear finite element analysis program).

Description of the cluster storage environment: the three I/O nodes of the storage system are connected to a shared storage array through an FC SAN switch.

The FC HBA in each node is a QLogic QLE2460, the fibre switch is a Brocade 200E, and the disk array is a Dawning DS8313FF.

Fault phenomenon

After the cluster arrived, it was found that the disk array could be connected directly to a host, but neither device could be connected through the 200E switch. Testing showed that the switch's IOS version could not recognize the fibre port of the disk array; the switch supplier was contacted and the IOS was updated twice, after which the array's port was recognized normally, but the array still could not be seen from the host through the switch. Today's tests showed that the firmware versions of the HBA cards in the three I/O nodes were inconsistent: the I/O1 node, which had earlier been connected to the fibre switch and directly to the disk array, had firmware v4.03.02, while the two I/O nodes brought out today had firmware v4.06.03. In the last two tests the disk array, hosts and switch communicated normally, and no anomalies have been found so far. From the current situation, the fault was caused by an incompatibility between the QLE2460 HBA firmware v4.03.01 and the 200E IOS V5.3.1; the stability of the new HBA firmware v4.06.03 with 200E IOS V5.3.1 still needs further testing.

Diagnosis and treatment result

After the firmware of the HBA card in the I/O1 node was upgraded to v4.06.03, the problem of the disk array not being visible through the 200E was resolved. It was in fact a problem caused by inconsistent firmware versions of the FC HBA cards.

Shared files: OCR and Voting Disk

Oracle Cluster Registry (OCR): records the configuration information of the node members, such as databases, ASM, instances, listeners, VIPs and other CRS resources; it can be stored on a raw device or a cluster file system. Voting disk: the arbitration disk, which stores node membership information, including which members the cluster originally contained and information about added and removed nodes. When multiple voting disks are configured, their number must be odd, and a node must be able to reach more than half of the voting disks at any time in order to survive.
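
Because these two files are so important, they should be backed up regularly. A hedged example of the usual 10g commands (the paths and device name are illustrative):

# ocrconfig -showbackup
# ocrconfig -export /backup/ocr.exp
# dd if=/dev/raw/raw2 of=/backup/votedisk.bak bs=4k

ocrconfig -showbackup lists the automatic OCR backups kept by CRS, -export takes a logical OCR export, and in 10g the voting disk is typically backed up with dd.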

Installation

In Oracle RAC, the software, including CRS_HOME and ORACLE_HOME and especially the CRS software, is not recommended to be installed on a shared file system. Installing it on the local file system allows rolling upgrades, which reduce planned downtime when upgrading the software or installing patches and patch sets; installing it on a shared file system would also add a single point of failure. If ASM storage is used, install a separate Oracle software home for ASM, with its own ORACLE_HOME; this makes management and maintenance easier, for example a patch for an ASM bug can be applied without affecting the RDBMS files and software.

Split brain

In a shared-storage cluster, if the heartbeat is lost but each node keeps operating on the shared storage at the same time, the consequences are disastrous. Oracle RAC solves the problem with a voting algorithm: each node has one vote. Consider a cluster of three nodes A, B and C. When node A cannot communicate with nodes B and C for whatever reason, the cluster splits into two domains: node A forms one domain with one vote, and nodes B and C form another domain with two votes. In that case the B/C domain gains control of the cluster and node A is kicked out, enforced through I/O fencing. For a two-node cluster an arbitration disk is introduced: when the two nodes cannot communicate, the node that reaches the arbitration disk first gains control of the cluster. Split brain only occurs after network problems (interconnect down), time inconsistency, misscount timeouts and so on, and the split-brain handling exists to protect the whole cluster from the problematic node. Oracle uses server fencing, that is, it restarts the problematic node in an attempt to repair the problem. Of course, many problems cannot be repaired automatically, for example inconsistent time with no NTP, or a broken network cable; these require manual intervention, and the symptom then is that the problematic node reboots repeatedly.

Cluster software

Starting with Oracle 10g, Oracle provides its own cluster software, called Oracle Clusterware (CRS), which is a prerequisite for installing Oracle RAC; the third-party clusterware mentioned above becomes optional. At the same time Oracle provides a new feature called ASM, which can manage the shared disks used by RAC and implements striping and mirroring of data files to improve performance and safety (S.A.M.E.: stripe and mirror everything), so building a RAC system no longer depends on third-party storage software. In particular, Oracle 11gR2 no longer supports raw devices; Oracle is pushing ASM hard and abandoning support for third-party cluster components.

Oracle Clusterware's heartbeat

Oracle Clusterware uses two heartbeat devices to verify the status of members and ensure the integrity of the cluster.

One is the voting disk heartbeat: the ocssd process writes a heartbeat message to the votedisk every second. The other is the network heartbeat over the private Ethernet between nodes.

Both heartbeat mechanisms have a corresponding timeout, called misscount and disktimeout, respectively:

Misscount defines the timeout of the network heartbeat between nodes, in seconds. Disktimeout, 200 seconds by default, defines the timeout of the connection between the css process and the voting disk.

Reboottime: after split brain occurs and a node is kicked out, that node restarts within the reboottime, which defaults to 3 seconds. The actual values of these parameters can be viewed with the following commands:

# crsctl get css misscount

# grep misscount $CRS_HOME/log/hostname/cssd/ocssd.log

When the following two situations occur, css kicks out the node to ensure the integrity of the data:

(1) when the private-network I/O time exceeds misscount, split brain occurs and the cluster divides into several sub-clusters, which vote on which one survives. Nodes are evicted according to the following rules: the sub-cluster with more nodes survives; if the sub-clusters have the same number of nodes, the one containing the node with the lower node ID survives.

(2) when the voting disk I/O time exceeds disktimeout, nodes are evicted according to the following rule: a node that loses contact with more than half of the voting disks restarts within the reboottime. For example, with 5 voting disks, a node restarts when its connections to 3 or more of them time out because of network or storage problems; when only one or two voting disks are damaged, cluster operation is not affected.

How to view the configuration of an existing system

For an existing system, you can confirm the heartbeat configuration of the database instance in the following ways, including the name of the network card, the IP address, and the network protocol used.

The easiest way is to look in the database alert log. You can also use oradebug:

SQL > oradebug setmypid

Statement processed.

SQL > oradebug ipc

Information written to trace file.

SQL > oradebug tracefile_name

/ oracle/admin/ORCL/udump/orcl2_ora_24208.trc

Find the line corresponding to the trace file: socket no 7 IP 10.0.0.55 UDP 16878

From the data dictionary:

SQL > select * from v$cluster_interconnects

SQL > select * from x$ksxpia

Heartbeat tuning and settings

To prevent the heartbeat network from becoming a single point of failure of the system, we can simply bond NICs at the operating-system level and use the bonded interface as Oracle's heartbeat network. Taking AIX as an example, EtherChannel can be used: suppose the system has four NICs ent0-ent3 and we bind ent2 and ent3 for the heartbeat. The corresponding technologies on HP-UX and Linux are called APA and bonding respectively; a Linux bonding sketch is shown below.
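
A minimal sketch of Linux bonding in the RHEL 4 era (interface names and the address are hypothetical); mode=1 is active-backup, which is the usual choice for a heartbeat network:

# /etc/modprobe.conf
alias bond0 bonding
options bond0 miimon=100 mode=1

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=10.0.0.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth2  (repeat for eth3)
DEVICE=eth2
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none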

UDP parameter tuning: when UDP is used as the communication protocol for Cache Fusion between database instances, the relevant operating-system parameters need to be adjusted to improve UDP transmission efficiency and to avoid errors caused by exceeding OS limits when transferring large amounts of data:

(1) UDP packet sending buffer: the size is usually set to be larger than (db_block_size * db_multiblock_read_count) + 4k

(2) UDP packet receiving buffer: the size is usually set to 10 times the sending buffer.

(3) maximum UDP buffer: set as large as possible (usually greater than 2 MB); it must be greater than the previous two values.
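
As a quick worked example of the sizing rules above: with db_block_size = 8 KB and db_multiblock_read_count = 16, the send buffer should exceed 8192 * 16 + 4096 = 135168 bytes, so 262144 (256 KB) is a comfortable choice; the receive buffer would then be around 10 times that, for example 1310720 or 2621440, and the UDP maximum must be at least as large as the larger of the two.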

The corresponding view and modify commands for each platform are as follows:

Solaris

View: ndd /dev/udp udp_xmit_hiwat udp_recv_hiwat udp_max_buf

Modify: ndd -set /dev/udp udp_xmit_hiwat 262144

ndd -set /dev/udp udp_recv_hiwat 262144

ndd -set /dev/udp udp_max_buf 2621440

AIX

View: no -a | egrep "udp_|tcp_|sb_max"

Modify: no -p -o udp_sendspace=262144

no -p -o udp_recvspace=1310720

no -p -o tcp_sendspace=262144

no -p -o tcp_recvspace=262144

no -p -o sb_max=2621440

Linux

View: file /etc/sysctl.conf

Modify: sysctl -w net.core.rmem_max=2621440

sysctl -w net.core.wmem_max=2621440

sysctl -w net.core.rmem_default=262144

sysctl -w net.core.wmem_default=262144

HP-UX: no changes needed.

HP Tru64

View: /sbin/sysconfig -q udp

Modify: edit /etc/sysconfigtab

inet: udp_recvspace = 65536

udp_sendspace = 65536

Windows: no changes needed.

About Me


● This article was compiled from material on the Internet.

● The article is published synchronously on itpub (http://blog.itpub.net/26736162), the cnblogs blog (http://www.cnblogs.com/lhrbest) and the personal WeChat public account (xiaomaimiaolhr).

● itpub address of this article: http://blog.itpub.net/26736162/abstract/1/

● cnblogs address of this article: http://www.cnblogs.com/lhrbest

● PDF version of this article and the author's cloud drive: http://blog.itpub.net/26736162/viewspace-1624453/

● Database written-test and interview question bank with answers: http://blog.itpub.net/26736162/viewspace-2134706/

● QQ group: 230161599; WeChat group: contact privately

● To contact me, please add QQ friend 646634621 and state the reason for adding

● Completed in Shanghai between 09:00 on 2017-06-02 and 22:00 on 2017-06-30.

● The content of this article comes from the author's study notes, and part of it was compiled from the Internet. Please forgive any infringement or inaccuracy.

● Copyright reserved. You are welcome to share this article; please retain the source when reposting.

