
Oracle 10g real application clusters introduction (RAC principle)

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

1. What is a cluster

A cluster is made up of two or more independent servers connected through a network. Over the years, hardware vendors have offered Cluster solutions with differing goals. Some Clusters are designed purely for high availability: when the currently active node fails, work is transferred to a secondary node. Others are designed to provide distributed connectivity and workload scalability. Another common feature of a Cluster is that, to an application, it appears as a single server. Likewise, administering several servers should be as close as possible to administering a single server; the Cluster manager software provides this capability.

Because the nodes are separate servers, files must be stored in a location that every node can access. Several different topologies exist to solve this data access problem, and the choice depends mainly on the primary goals of the Cluster design.

The nodes are also connected to each other: a physical network connection, the interconnect, carries direct communication between the Cluster nodes.

In short, a Cluster is a group of independent servers that work together to form a single system.

2. What is Oracle Real Application Clusters (RAC)

RAC is software that lets you exploit Cluster hardware by running multiple Instances against the same Database. The database files are stored on disks that are physically or logically connected to every node, so that each active Instance can read from and write to them.

The RAC software manages access to the data, so that change operations are coordinated between the Instances and every Instance sees a consistent image of the information and data.

Because the RAC structure provides redundancy, applications can still access the Database through the other Instances even when one system crashes or becomes unreachable.

3. Why use RAC

RAC makes efficient use of standard Cluster hardware and reduces cost by using modular commodity servers.

RAC automatically provides workload management for services. Application services can be grouped or classified into business components that carry out application tasks. Services in RAC enable continuous, uninterrupted Database operation and support multiple services on multiple Instances. A service can be configured to run on one or more Instances, with alternate Instances serving as backups. If the primary Instance fails, Oracle moves the service from the failed Instance to an active alternate Instance. Oracle also automatically balances the connection load.

RAC uses multiple inexpensive computers to provide Database services collectively, as if they were one large computer, serving workloads that would otherwise require a large SMP machine.

RAC is based on a shared-disk architecture and can grow or shrink on demand without artificially partitioning data across the Cluster. Servers can simply be added to or removed from the Cluster.

4. Clusters and scalability

If your application scales transparently on a symmetric multiprocessing (SMP) machine, you can expect RAC to scale it in the same way, without any changes to the application code.

When a node fails, RAC can isolate the failed Database Instance and the node itself, thus ensuring the integrity of the Database.

Here are some examples of scalability:

* more concurrent batch processing.

* more concurrent (parallel) execution.

* a greatly increased number of connected users in OLTP systems.

1) Levels of scalability: there are four main levels.

* scalability of the hardware: the interconnect is the key, and it generally depends on high bandwidth and low latency.

* scalability of the OS: the synchronization methods within the OS can determine the scalability of the system. In some cases, the potential scalability of the hardware is lost because the OS cannot handle many resources being requested concurrently.

* scalability of the Database management system: a key factor in a parallel architecture is whether parallelism is carried out by internal or external processes. The answer to this question affects the synchronization mechanism.

* scalability at the application level: applications must be explicitly designed to scale. A bottleneck may occur when, for most of the workload, every session updates the same data. This applies not only to RAC but also to single-Instance systems.

To be clear, if any one of these levels fails to scale, parallel Cluster processing may fail, no matter how scalable the other levels are. The typical cause of poor scalability is access to a shared resource, which forces concurrent operations to serialize on that bottleneck. This is not a limitation specific to RAC but one common to all architectures.

2) scaleup and speedup

* scaleup is the ability to maintain the same level of performance (response time) when both workload and resources increase proportionally:

Scaleup = (volume parallel) / (volume original)

* speedup is the effect of proportionally reducing execution time by adding resources to complete a fixed workload:

Speedup = (time original) / (time parallel)

In both cases the parallel figures include the cost of ipc, the abbreviation of interprocess communication.
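To make the two formulas concrete, here is a small Python sketch. The workload and timing numbers are invented for illustration; the parallel time is assumed to already include IPC overhead.

```python
# Illustrative scaleup/speedup arithmetic. All numbers are invented.

def speedup(time_original, time_parallel):
    # Speedup = (time original) / (time parallel); time_parallel is
    # assumed to already include interprocess communication (IPC) cost.
    return time_original / time_parallel

def scaleup(volume_parallel, volume_original):
    # Scaleup = (volume parallel) / (volume original), measured at the
    # same response time.
    return volume_parallel / volume_original

# A job that takes 600 s on one node takes 180 s on four nodes
# (including IPC overhead): the speedup is ~3.33, not a perfect 4.
print(speedup(600, 180))
# Four nodes handle 3.2x the transaction volume at the same response
# time: the scaleup is 3.2.
print(scaleup(3.2, 1.0))
```

The gap between the ideal factor (4) and the measured one (3.33) is exactly the serialization and IPC cost discussed above.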

RAC Architecture and Concepts

1. The principle of RAC software

In a RAC Instance, you will see several background processes that do not exist in an ordinary Instance. They maintain the consistency of the Database across Instances and manage global resources, as follows:

* LMON: Global Enqueue Service Monitor

* LMD0: Global Enqueue Service Daemon

* LMSx: Global Cache Service processes, where x ranges from 0 to j

* LCK0: Lock process

* DIAG: Diagnosability process

At the Cluster layer, you find the main processes of the Cluster Ready Services software, which provide standard Cluster interfaces and high-availability operations on all platforms. On each Cluster node you can see the following processes:

* CRSD and RACGIMON: the engines for high-availability operations.

* OCSSD: provides access to node membership and group services.

* EVMD: the event detection process, run and managed by the oracle user.

* OPROCD: the Cluster process monitor.

In addition, several tools manage the various global resources in the Cluster: the ASM Instances, the RAC Databases, the Services, and the CRS application nodes. The main tools used in this book are Server Control (SRVCTL), DBCA, and Enterprise Manager.

2. Storage principle of RAC software.

The RAC installation of Oracle 10g is divided into two phases. The first phase installs CRS; the second installs the Database software with the RAC components and creates the Cluster database. The Oracle home used by the CRS software must be different from the home used by the RAC software. Although the CRS and RAC software could be shared across the Cluster by using a Cluster file system, by convention the software is always installed on the local file system of each node. This supports rolling patch upgrades and eliminates the software as a single point of failure. Two files, however, must be stored on a shared storage device:

* the voting file: in essence, it is used by the Cluster Synchronization Services daemon to monitor node membership. Its size is about 20MB.

* the Oracle Cluster Registry (OCR) file: also a key component of CRS. It maintains information about the highly available components in the Cluster, such as the list of Cluster nodes, the Instance-to-node mapping of the Cluster databases, and the list of CRS application resources (Services, virtual interconnect protocol addresses, and so on). This file is maintained automatically by administrative tools such as SRVCTL. Its size is about 100MB.

The voting file and the OCR file cannot be stored in ASM, because they must be accessible before any Oracle Instance starts. Both must be kept on redundant, reliable storage, such as RAID. The recommended best practice is to place these files on raw devices.

3. The structure of OCR

The configuration information of the Cluster is maintained in the OCR. The OCR relies on a distributed shared-cache architecture to optimize queries against the Cluster repository. Each node in the Cluster maintains an in-memory copy of the OCR, accessed by its OCR process. However, only one OCR process in the Cluster actually reads from and writes to the OCR file on shared storage. This process is responsible for refreshing its own local cache as well as the OCR caches of the other nodes. For queries against the Cluster repository, OCR clients communicate directly with the local OCR process. When a client needs to update the OCR, its local OCR process forwards the request to the process that performs the reads and writes of the OCR file.

OCR client applications include Oracle Universal Installer (OUI), SRVCTL, Enterprise Manager (EM), DBCA, DBUA, NetCA, and the Virtual IP Configuration Assistant (VIPCA). The OCR also maintains the dependency and state information for the application resources defined within CRS, in particular Databases, Instances, Services, and nodes.

The name of the configuration file is ocr.loc, and the configuration-file variable is ocrconfig_loc. The location of the Cluster repository is not limited to raw devices: the OCR can be placed on shared storage managed by a Cluster file system.

Note: the OCR is also used as a configuration file by ASM in single-Instance setups, with one OCR per node.

4. RAC Database storage principle

The main difference from single-Instance Oracle storage is that in RAC all data files must be stored on shared devices (raw devices or a Cluster file system), so that the Instances accessing the same Database can share them. At least two redo log groups must be created for each Instance, and all redo log groups must also be stored on shared devices for crash-recovery purposes. The online redo log groups of an Instance are called that Instance's online redo thread.

In addition, an undo tablespace must be created for each Instance to use the automatic undo management feature recommended by Oracle. Each undo tablespace must reside on shared storage accessible to all Instances, primarily for recovery purposes.

Archive logs cannot be stored on raw devices, because their names are generated automatically and differ from one another; they therefore need a file system. With a Cluster file system (CFS), the archived files are accessible at any time from any node. Without CFS, the archived logs must be made available to the other Cluster members during recovery by some other means, such as the Network File System (NFS). If the recommended flash recovery area feature is used, it must reside in a shared location that all Instances can access (an ASM disk group or a CFS).

5. RAC and shared storage technology

Storage is a key component of grid technology. Traditionally, storage was attached directly to each individual Server (direct attached storage, DAS). In the past few years, more flexible storage has emerged, accessed mainly through storage area networks or over regular Ethernet. These new storage methods let multiple Servers access the same set of disks, simplifying access in a distributed environment.

The storage area network (SAN) represents the current stage in the evolution of data storage technology. Traditionally, in client/server (C/S) systems, data was stored inside the Server or on devices attached directly to it. Then came the network attached storage (NAS) stage, which separated the storage devices from the Server, connecting them through a network. SAN takes this principle further by letting storage devices exist on their own network and exchange data directly over high-speed media. Users access the data on the storage devices through Server systems connected to both the local area network (LAN) and the SAN.

The choice of file system is critical for RAC. Traditional file systems do not support being mounted by several systems in parallel. Files must therefore be stored either on raw volumes without any file system or on a file system that supports concurrent access by multiple systems.

Therefore, the three main approaches to shared storage for RAC are:

* raw volumes: directly attached raw devices, accessed in block mode.

* a Cluster file system: also accessed in block mode. One or more Cluster file systems can hold all RAC files.

* Automatic Storage Management (ASM): a lightweight, dedicated Cluster file system, optimized for Oracle Database files.

6. Oracle Cluster file system

Oracle Cluster File System (OCFS) is a shared file system designed specifically for Oracle RAC. OCFS removes the need to place Oracle Database files on raw logical disks and allows all nodes to share a single Oracle Home instead of keeping a local copy on each node. OCFS volumes can span one or more shared disks for redundancy and better performance.

The kinds of files that can be placed in OCFS are:

* Oracle software installation files: in 10g this is supported only on Windows 2000. Later versions reportedly add support on Linux, but I have not verified this yet.

* Oracle files (control files, data files, redo log files, BFILEs, etc.)

* the shared server parameter file (spfile)

* Files created by Oracle while Oracle is running.

* voting and OCR files

Oracle Cluster File System is free for developers and users and can be downloaded from Oracle's official website.

7. Automatic Storage Management (ASM)

ASM is a new feature of 10g. It provides a vertically integrated file system and volume manager dedicated to Oracle Database files. ASM can manage storage for a single SMP machine or across the nodes of an Oracle RAC Cluster.

ASM removes the need for manual I/O tuning: it automatically spreads the I/O load across all available resources to optimize performance. It helps the DBA manage a dynamic database environment by allowing the Database to grow without shutting it down to adjust the storage allocation.

ASM can maintain redundant copies of the data, improving fault tolerance, or it can be built on top of reliable, vendor-supplied storage mechanisms.

8. Select RAW or CFS

* advantages of CFS: simple installation and management of RAC; support for Oracle Managed Files (OMF) with RAC; a single Oracle software installation; autoextend of Oracle data files; uniform access to archive logs when a physical node fails.

* use of raw devices: generally chosen when CFS is not available or not supported by Oracle; raw devices offer the best performance, with no intermediate layer between Oracle and the disk; autoextend fails on raw devices if space is exhausted; ASM, logical storage managers, or logical volume managers can simplify working with raw devices, allow space to be added to raw devices online, and let raw devices be named for ease of management.

9. Typical Cluster stack of RAC

Each node in the Cluster needs a supported interconnect protocol to carry inter-Instance traffic, plus TCP/IP for CRS polling. All UNIX platforms use the user datagram protocol (UDP) over Gigabit Ethernet as the main protocol for the IPC traffic between RAC Instances. Other supported vendor-specific protocols include the remote shared memory protocol for SCI and SunFire interconnects and the hyper messaging protocol for Hyperfabric interconnects. In every case, the interconnect must be certified by Oracle for the platform.

Using Oracle's own clusterware reduces installation and support complexity. However, if you use non-Ethernet interconnects, or run applications on RAC that depend on vendor clusterware, the vendor clusterware may still be required.

As with the interconnect, the shared storage solution must be certified by Oracle for the platform. If CFS is available on the target platform, both the Database area and the flash recovery area can be created on CFS or ASM. If CFS is not available, the Database area can be created either on ASM or on raw devices (with a volume manager), and the flash recovery area must be created on ASM.

10. RAC certification matrix: it is designed to answer certification questions. You can use the matrix to resolve any RAC-related certification question. The steps are as follows:

* Connect and log in to http://metalink.oracle.com

* Click the "certify and availability" button in the menu bar

* Click the "view certifications by product" link

* Select RAC

* choose the right platform

11. Necessary global resources

In a single-Instance environment, locks coordinate access to a shared resource, such as a row in a table. Locking prevents two processes from modifying the same resource at the same time.

In a RAC environment, inter-node synchronization is critical: it keeps processes on different nodes consistent and prevents them from modifying the same resource data at the same time. Inter-node synchronization guarantees that each Instance sees the most recent version of a block in its buffer cache. The figure above shows what happens when locking does not exist.

1) Coordination of global resources

Cluster operation requires synchronization among all Instances to control access to shared resources. RAC uses the Global Resource Directory (GRD) to record how resources are used within the cluster Database. The Global Cache Service (GCS) and the Global Enqueue Service (GES) manage the information in the GRD.

Each Instance maintains a portion of the GRD in its local SGA. GCS and GES designate one Instance to manage all the information for a particular resource; that Instance is called the resource's master. Every Instance knows which Instance masters which resource.
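The idea that every resource has exactly one master Instance, computable by every node, can be sketched with a simple hash. This is only an illustration of the concept; Oracle's actual GRD mastering algorithm is internal and not shown here, and the instance numbers and resource names are invented.

```python
# Conceptual sketch: deterministically mapping each resource to a
# master Instance, so that every node independently agrees on the
# master. NOT Oracle's real algorithm; purely illustrative.
import zlib

INSTANCES = [1, 2, 3]  # hypothetical 3-node cluster

def master_of(resource_name: str) -> int:
    # A stable hash gives the same answer on every node.
    h = zlib.crc32(resource_name.encode())
    return INSTANCES[h % len(INSTANCES)]

# Any node asking about the same resource computes the same master.
print(master_of("file 7, block 1024"))
print(master_of("file 7, block 1024") == master_of("file 7, block 1024"))
```

The point is not the hash itself but the property it provides: no lookup round-trip is needed to find which Instance holds the directory information for a given resource.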

Maintaining cache coherency is an important part of RAC's activity. Cache coherency is the technique of keeping multiple versions of a block consistent across different Oracle Instances. GCS implements cache coherency through the so-called cache fusion algorithm.

GES manages all inter-Instance resource operations outside cache fusion and tracks the state of the Oracle enqueue mechanism. The main resources GES controls are the dictionary cache locks and the library cache locks. It also performs deadlock detection for all deadlock-sensitive enqueues and resources.

2) Global cache coordination: example

Suppose a data block has been modified by the first node and become dirty, and that clusterwide there is only one copy of the block, identified by its SCN. The steps are as follows:

① The second Instance, intending to modify the block, sends a request to GCS.

② GCS forwards the request to the holder of the block; here, the first Instance is the holder.

③ The first Instance receives the message and sends the block to the second Instance. The first Instance keeps the dirty buffer for recovery purposes; this dirty image of the block is called the block's past image. A past image block cannot be modified further.

④ On receiving the block, the second Instance informs GCS that it now holds the block.

3) Write-to-disk coordination: example

The caches of the Instances in the cluster may hold different modified versions of the same block. The write protocol managed by GCS ensures that only the most recent version is written to disk; it must also ensure that the previous versions are purged from the other caches. A write request can be initiated by any Instance, whether it holds the current or a past image of the block. Suppose the first Instance holds a past image of the block and requests that Oracle write the buffer to disk, as in the figure above. The process is as follows:

① The first Instance sends a write request to GCS.

② GCS forwards the request to the second Instance, the current holder of the block.

③ On receiving the write request, the second Instance writes the block to disk.

④ The second Instance informs GCS that the write has completed.

⑤ On being notified, GCS orders all holders of past images to discard them; the past images are no longer needed for recovery.
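The message sequence above can be condensed into a toy simulation. Everything here (class names, block labels, method names) is invented for illustration; the only property it demonstrates is the one stated in the protocol: the current holder writes the latest version, and past images are discarded once the write completes.

```python
# Toy model of the GCS write protocol described above. All names are
# invented; this is a conceptual sketch, not Oracle's implementation.
class Instance:
    def __init__(self, name):
        self.name = name
        self.current = None      # current version of the block, if held
        self.past_images = []    # past images kept only for recovery

class GCS:
    def __init__(self, instances):
        self.instances = instances

    def write_request(self, requester):
        # Route the write to whichever Instance holds the current block
        # (the requester may only hold a past image).
        holder = next(i for i in self.instances if i.current is not None)
        on_disk = holder.current   # only the most recent version is written
        holder.current = None
        # Once the block is on disk, past images are no longer needed
        # for recovery, so every holder discards them.
        for inst in self.instances:
            inst.past_images.clear()
        return on_disk

a, b = Instance("inst1"), Instance("inst2")
a.past_images.append("block v1 (dirty past image)")
b.current = "block v2 (latest)"
gcs = GCS([a, b])

on_disk = gcs.write_request(a)   # inst1 asks; inst2, the holder, writes
print(on_disk)                   # the latest version reaches disk
print(a.past_images)             # the past image has been discarded
```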

12. RAC and Instance/crash recovery

1) When an Instance fails and the failure is detected by another Instance, that second Instance performs the following recovery steps:

① In the first phase of recovery, GES remasters its enqueues.

② GCS then remasters its resources. The GCS process remasters only the resources that lose their master. During this time, all GCS resource requests and write requests are temporarily suspended; however, transactions can continue to modify data blocks as long as they have already acquired the necessary resources.

③ After the enqueues are reconfigured, one of the surviving Instances takes the Instance recovery enqueue. Then, while the GCS resources are being remastered, SMON determines the set of blocks that needs recovery. This set is called the recovery set. Because, under cache fusion, an Instance ships the contents of a block to the requesting Instance without writing it to disk, the on-disk versions of blocks may not contain the modifications made by other Instances' processes. This means SMON needs to merge the redo logs of all failed Instances to determine the recovery set. Another reason is that a failed thread may leave a hole in the redo stream that must be filled for a given block, so the failed Instance's redo thread cannot simply be applied serially. The redo threads of the surviving Instances do not need to be applied, because SMON can use the past and current images of the buffers in their caches.

④ Buffer space for recovery is allocated, and the resources identified by reading the redo logs are claimed as recovery resources. This prevents other Instances from accessing those resources.

⑤ All resources required by the subsequent recovery operations are now obtained, and the GRD is no longer frozen. Any data block that does not need recovery can now be accessed, so the system is already partially available. For a block that does need recovery, if past or current images of it exist in the caches of other Instances in the cluster Database, the most recent image is the starting point of recovery. If neither a past image nor a current image of the block is present in any surviving Instance's cache, SMON performs a merge of the failed Instances' redo logs. SMON recovers and writes each block identified in step ③, releasing the recovery resources immediately afterwards, so that more and more blocks become available as recovery proceeds.

When all the blocks have been recovered and the recovery resources released, the system is fully available again.

Note: during recovery, the cost of the log merge is proportional to the number of failed Instances and to the size of each Instance's redo logs.

2) Instance recovery and Database availability

The figure above shows the availability of the database at each step during an Instance recovery:

a. RAC is running on multiple nodes.

b. A node failure is detected.

c. The enqueue portion of the GRD is reconfigured; resource management is redistributed to the surviving nodes. This step executes quickly.

d. The buffer cache portion of the GRD is reconfigured, and SMON reads the redo logs of the failed Instance to identify the set of blocks that needs recovery.

e. SMON issues requests to the GRD to obtain all the Database blocks in the recovery set. When the requests complete, all other blocks become accessible.

f. Oracle performs roll-forward recovery: the redo logs of the failed thread are applied to the Database, and fully recovered blocks become accessible immediately.

g. Oracle performs rollback recovery: undo blocks are applied to the Database for all uncommitted transactions.

h. Instance recovery is complete and all data is accessible.

13. Efficient inter-node row-level locking

Oracle supports efficient row-level locks. These row-level locks are created mainly during DML operations such as UPDATE and are held until the transaction commits or rolls back. Any process that requests a lock on the same row waits.

The block transfers of the cache fusion algorithm are independent of these user-visible row-level locks. The transfer of a block by GCS is a low-level operation that begins without waiting for row-level locks to be released: a block may be transferred from one Instance to another while some of its rows are still locked.

GCS provides access to data blocks, allowing multiple transactions to proceed concurrently.

14. Additional memory requirements for RAC

Most of the RAC-specific memory is allocated from the shared pool when the SGA is created. Because blocks may be cached across Instances, larger caches are also required. Therefore, when migrating a single-Instance Database to RAC while keeping the per-Instance workload comparable to the single-Instance case, increase the buffer cache by about 10% and the shared pool by about 15% on each Instance running RAC. These values are initial trial values drawn from experience with RAC sizing; the real requirement is often larger.

If you use the recommended automatic memory management feature, you can apply this by adjusting the SGA_TARGET initialization parameter. On the other hand, since the same user population is now spread across several nodes, the memory requirement of each individual Instance may be lower.
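As a quick sanity check on the sizing rule above, here is a back-of-the-envelope sketch. The starting sizes are invented; only the +10%/+15% factors come from the text.

```python
# Back-of-the-envelope sizing for migrating a single Instance to RAC,
# applying the rule of thumb above: buffer cache +10%, shared pool +15%.
# The single-instance starting sizes are invented for illustration.
def rac_sizing(buffer_cache_mb, shared_pool_mb):
    return {
        "buffer_cache_mb": buffer_cache_mb * 1.10,
        "shared_pool_mb": shared_pool_mb * 1.15,
    }

sizes = rac_sizing(buffer_cache_mb=2048, shared_pool_mb=512)
print(sizes["buffer_cache_mb"])  # roughly 2253 MB
print(sizes["shared_pool_mb"])   # roughly 589 MB
```

Remember these are only starting points: the text notes real requirements are often larger, and spreading users across nodes may pull per-Instance needs back down.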

The actual resource usage of the GCS and GES entities in each Instance can be checked by querying the CURRENT_UTILIZATION and MAX_UTILIZATION columns of the V$RESOURCE_LIMIT view, as follows:

SELECT resource_name, current_utilization, max_utilization FROM v$resource_limit WHERE resource_name LIKE 'g%s%';

15. RAC and concurrent execution

Oracle's optimizer is based on execution cost; it takes the cost of parallel execution into account as one component when choosing the optimal execution plan.

In a RAC environment, the optimizer's parallelism choices cover both intra-node and inter-node parallelism. For example, if a particular query requires six parallel query processes and six parallel execution slave processes are idle on the local node, the query is executed using only local resources. This demonstrates efficient intra-node parallelism, with no need for inter-node parallel query coordination. If only two parallel execution slave processes are available on the local node, those two plus four processes from other nodes execute the query together. In that case both intra-node and inter-node parallelism are used to speed up the query.

In real-world decision-support applications, queries cannot be divided evenly among the query servers, so some parallel execution servers finish their work and become idle before others. Oracle's parallel execution technology dynamically detects idle processes and reassigns work from the queues of overloaded processes to the idle ones. In this way Oracle effectively redistributes the query workload across all processes. RAC extends this efficiency to the entire cluster.
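The local-first slave allocation described above can be sketched as follows. Node names, idle counts, and the allocation function are all invented for illustration; Oracle's actual placement logic is internal.

```python
# Conceptual sketch of local-first allocation of parallel execution
# slaves, as described in the text. All names and numbers are invented.
def allocate_slaves(needed, idle_by_node, local_node):
    # Prefer idle slaves on the local node, then spill to other nodes.
    plan = {}
    order = [local_node] + [n for n in idle_by_node if n != local_node]
    for node in order:
        if needed == 0:
            break
        take = min(needed, idle_by_node[node])
        if take:
            plan[node] = take
            needed -= take
    return plan

# 6 slaves needed, 6 idle locally: purely intra-node execution.
print(allocate_slaves(6, {"node1": 6, "node2": 4}, "node1"))
# Only 2 idle locally: 2 local + 4 remote, inter-node parallelism.
print(allocate_slaves(6, {"node1": 2, "node2": 4}, "node1"))
```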

16. Global dynamic performance view

The global dynamic performance view displays all information related to the Instances that opens and accesses RAC Database. The standard dynamic performance view shows only information about the local Instance.

For every V$ view there is a corresponding GV$ view, apart from a few special cases. In addition to the columns of the V$ view, each GV$ view contains an extra column, INST_ID, which shows the Instance number in the RAC. GV$ views can be queried from any open Instance.

To query a GV$ view, the PARALLEL_MAX_SERVERS initialization parameter on each Instance must be set to at least 1. This is because GV$ queries use a special form of parallel execution: the parallel execution coordinator runs on the Instance the client is connected to, and one slave is allocated on every Instance to query its underlying V$ view. If PARALLEL_MAX_SERVERS is set to 0 on one Instance, no information can be obtained from that node; likewise, if all parallel servers are busy, no result is obtained. In both cases you receive no warning or error message.

17. RAC and Services

18. Virtual IP address and RAC

When a node fails completely, the virtual IP address (VIP) matters for every affected application. When a node fails, its VIP is automatically failed over to another node in the cluster. When this happens:

* CRS binds the VIP to the MAC address of the other node's NIC, transparently to users. Clients that were connected directly receive errors.

* Subsequent packets sent to the VIP are directed to the new node, which replies with RST packets, so clients quickly receive the error and retry their connection against another node.

Without VIP, connections to a failed node would wait out a TCP timeout of about ten minutes.
