

Oracle 12c Column Storage (In-Memory Theory)

2025-02-22 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

With the introduction of the In-Memory component in Oracle 12c, the Oracle database gains a dual-format data store that supports mixed workloads: data kept in row format serves OLTP applications, while data kept in column format serves query-centric OLAP applications. The In-Memory component can be used together with other database components, making it easy to add real-time analytics on top of an existing database without developing or modifying the application separately. This article introduces the In-Memory component, covering the following topics:

- Basic knowledge of column storage

- Data consistency of the In-Memory area

- Accessing data in the In-Memory area

- Integration of In-Memory and RAC

1. Basic knowledge of column storage.

1.1 memory structure

Traditional databases use row storage: when a transaction occurs, Oracle operates on one or more rows of data, that is, the unit of operation is a row, even if only a few columns actually need to be accessed. This layout is well suited and efficient for DML-heavy OLTP applications. In an OLAP system, however, the workload is dominated by queries over large volumes of data, and those queries usually touch only a few specific columns of a table. Moreover, data changes mostly arrive through bulk loading; data is rarely modified after it has been loaded into the database, so organizing the data by column is clearly the better choice. Because each storage format has its own strengths and weaknesses, neither one alone can satisfy a database serving mixed workloads.

Oracle therefore introduced a dual-format data store: data is kept in row format on disk (in the data files) and in the database buffer cache, while a separate memory area (the In-Memory area) holds the same data in column format to serve OLAP-style queries. A major reason Oracle chose a separate memory area for the column-format copy is that OLAP workloads are mostly queries, with most data changes arriving through bulk loads; accordingly, Oracle also fills the In-Memory area through bulk data loading, which keeps the column copy current. The following sections introduce the In-Memory component from two angles: the memory structure and the data loading (populate) process.
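As a concrete sketch of how the In-Memory area is enabled (the 2G figure below is an arbitrary example, not a recommendation): the area is sized by the INMEMORY_SIZE parameter, which is static in 12c, so changing it requires an instance restart.

```sql
-- Reserve 2 GB of the SGA for the In-Memory area (illustrative size).
-- INMEMORY_SIZE is a static parameter in 12c, so a restart is required.
ALTER SYSTEM SET inmemory_size = 2G SCOPE = SPFILE;
SHUTDOWN IMMEDIATE
STARTUP
SHOW PARAMETER inmemory_size
```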

First of all, the In-Memory area is a separate memory space independent of the traditional SGA and PGA, composed of a 1M pool and a 64K pool. The 1M pool stores the data itself in column format, with the IMCU (In-Memory Compression Unit) as its basic storage unit; the 64K pool stores the metadata describing the IMCUs, with the SMU (Snapshot Metadata Unit) as its basic unit. Readers can inspect both pools with a query like the one below.
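The query itself did not survive extraction from the original post; a query along the following lines against the V$INMEMORY_AREA view shows the two pools and their usage:

```sql
-- Each row describes one pool (1MB POOL / 64KB POOL) of the In-Memory area.
SELECT pool, alloc_bytes, used_bytes, populate_status
  FROM v$inmemory_area;
```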

The IMCU is the basic unit for storing column-format data in memory. Oracle tries to keep each IMCU at about 1MB, and each IMCU consists of the two parts shown in Figure 1.

Figure 1

The SMU mainly stores the metadata of its IMCU, such as the pointer to the IMCU, the extent ranges and DBA ranges contained in the IMCU, the pointer to the journal table, the rowid bitmap, and so on.

1.2 data loading (populate)

Now that we have seen how the In-Memory area keeps data in memory, let's look at how data gets loaded into it. As introduced earlier, data is saved in row format in the data files, so a mechanism is needed to load that data into the In-Memory area, converting it from row format to column format along the way.

First, Oracle supports specifying the in-memory attribute on tables, partitions, or tablespaces; that is, the in-memory attribute applies to physical database objects, not logical ones. For example, the following statements assign the in-memory attribute to database objects:

SQL> ALTER TABLE sales INMEMORY NO MEMCOMPRESS PRIORITY CRITICAL;

SQL> ALTER TABLESPACE ts_data INMEMORY;

SQL> ALTER TABLE sales MODIFY PARTITION SALES_201501 INMEMORY;

Note that because the In-Memory component mainly targets OLAP applications, where most operations are queries and usually only one or a few columns of a table matter, the in-memory attribute can also be restricted so that only specific columns of a table are loaded into the In-Memory area.
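As an illustration (the table and column name here are hypothetical), the column-level clause might look like this, excluding a wide descriptive column from the column store:

```sql
-- Load sales in column format, but keep the prod_desc column out
-- of the In-Memory area (table and column names are examples only).
ALTER TABLE sales
  INMEMORY MEMCOMPRESS FOR QUERY
  NO INMEMORY (prod_desc);
```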

Because the size of the In-Memory area is limited, host memory is limited, and the database is often larger than the available memory, Oracle recommends loading performance-critical tables into the In-Memory area while leaving less critical tables on flash or disk. Of course, if memory is plentiful and the database is small with a mostly query-based workload, you can also consider loading all tables into the In-Memory area. Also because of resource constraints, Oracle lets users assign in-memory load priorities to different tables. The basic principle is that higher-priority objects are loaded into the In-Memory area first, and lower-priority objects are loaded only after the higher-priority ones. Oracle provides five in-memory load priorities, detailed in Table 1.

Table 1. In-memory load priority levels:

- CRITICAL: the object is populated immediately after the database opens
- HIGH: populated after all CRITICAL objects
- MEDIUM: populated after all CRITICAL and HIGH objects
- LOW: populated after all CRITICAL, HIGH, and MEDIUM objects
- NONE (default): populated only when the object is first scanned

In addition, because In-Memory mainly serves query-oriented OLAP and decision-support systems, where most data is not changed after being loaded into the database, compressing the data during loading saves memory space and can even improve query efficiency (mainly because the queried columns often contain many duplicate values). The In-Memory component therefore provides a rich set of compression options, allowing users to specify a compression method along with the in-memory option on an object. Table 2 lists the supported compression levels:

Table 2. In-memory compression levels:

- NO MEMCOMPRESS
- MEMCOMPRESS FOR DML
- MEMCOMPRESS FOR QUERY LOW (default)
- MEMCOMPRESS FOR QUERY HIGH
- MEMCOMPRESS FOR CAPACITY LOW
- MEMCOMPRESS FOR CAPACITY HIGH

The compression ratio in the table above increases from top to bottom. The following SQL statement declares that table sales has the highest load priority and is compressed with MEMCOMPRESS FOR QUERY LOW when loaded into the In-Memory area:

SQL> ALTER TABLE sales INMEMORY MEMCOMPRESS FOR QUERY LOW PRIORITY CRITICAL;

If you want to know the compression ratio each option would achieve before choosing one, you can estimate it with Oracle Compression Advisor (the DBMS_COMPRESSION package).
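A sketch of such an estimate is shown below; the schema, table, and scratch-tablespace names are placeholders, and the exact parameter names of DBMS_COMPRESSION.GET_COMPRESSION_RATIO can differ slightly between releases:

```sql
-- Estimate the MEMCOMPRESS FOR QUERY LOW ratio for a hypothetical SH.SALES table.
SET SERVEROUTPUT ON
DECLARE
  l_blkcnt_cmp    PLS_INTEGER;
  l_blkcnt_uncmp  PLS_INTEGER;
  l_row_cmp       PLS_INTEGER;
  l_row_uncmp     PLS_INTEGER;
  l_cmp_ratio     NUMBER;
  l_comptype_str  VARCHAR2(100);
BEGIN
  DBMS_COMPRESSION.GET_COMPRESSION_RATIO(
    scratchtbsname => 'USERS',
    ownname        => 'SH',
    objname        => 'SALES',
    subobjname     => NULL,
    comptype       => DBMS_COMPRESSION.COMP_INMEMORY_QUERY_LOW,
    blkcnt_cmp     => l_blkcnt_cmp,
    blkcnt_uncmp   => l_blkcnt_uncmp,
    row_cmp        => l_row_cmp,
    row_uncmp      => l_row_uncmp,
    cmp_ratio      => l_cmp_ratio,
    comptype_str   => l_comptype_str);
  DBMS_OUTPUT.PUT_LINE('Estimated compression ratio: ' || l_cmp_ratio);
END;
/
```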

Finally, the loading process is carried out cooperatively by the background process IMCO and the worker processes (W00n). After the database starts, or after the in-memory option is enabled on some objects, the IMCO process creates loading tasks and assigns them to several worker processes as needed; each worker is responsible for loading part of the data, and when all workers have finished loading their parts, they notify IMCO that loading is complete.
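Population progress can be watched from the data dictionary; for instance, the V$IM_SEGMENTS view reports, per segment, how much has been populated:

```sql
-- BYTES_NOT_POPULATED drops to 0 once a segment is fully loaded.
SELECT segment_name, populate_status, bytes, inmemory_size, bytes_not_populated
  FROM v$im_segments;
```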

2. Data consistency of In-Memory

If our database were read-only, things would be much simpler, because there would be no data consistency problems; but that is not the reality. In most databases, transactions happen all the time, so the consistency of the data must be guaranteed, and the In-Memory component is no exception. If the data modified by a DML statement has not been loaded into the In-Memory area, the modification is confined to the SGA; conversely, if the modified data has been loaded into the In-Memory area, a mechanism is needed to ensure consistency: uncommitted data must not be visible to other sessions, while the session performing the change must see the latest data.

Oracle ensures data consistency through the journal table. Each IMCU has its own journal table: if data modified by a DML statement is contained in an IMCU, the modified data is recorded in that journal table, and is called a private journal; after the transaction commits, the corresponding journal records are marked as a shared journal. This ensures that queries accessing the IMCU get consistent data. If the data a query needs cannot be found in the journal table, Oracle automatically uses the rowid bitmap recorded in the IMCU to map to the corresponding location in the buffer cache and finds the qualifying data there. Figure 2 depicts the basic relationship between the journal table and the IMCU.

Figure 2

However, if DML statements keep coming, more and more data accumulates in the journal table; eventually most of the data in an IMCU may be stale, with the current data living in the journal table, which is very harmful to in-memory query performance. Oracle therefore defines a threshold: when the proportion of stale data in an IMCU reaches it, a repopulation process is triggered. The IMCO background process checks periodically (every 2 minutes by default) whether any IMCU meets the repopulation conditions, and if it finds one, it notifies the W00n worker processes to repopulate that IMCU. Because repopulation is relatively expensive and may affect running statements, Oracle repopulates IMCUs gradually, processing only a portion of the qualifying IMCUs at a time; the extent of this can be tuned with the INMEMORY_TRICKLE_REPOPULATE_SERVERS_PERCENT parameter.

It is difficult to quantify how much extra load transaction-generated journal tables put on the system, because too many factors are involved, such as the compression method used during loading and the way the application accesses the data. Still, some basic principles can minimize the impact of data changes on In-Memory. Because data is repopulated into the In-Memory area in units of extents, if changes are randomly spread across all the extents of a table, repopulation becomes expensive, since a large number of IMCUs must be rebuilt. If changes can be concentrated in a specific range of extents, or if most changes are inserts and direct-path loads, the repopulation cost drops greatly. Another suggestion is to store data in partitioned tables wherever possible, which helps confine changes to specific partitions, and to use no compression, or a lightweight option such as MEMCOMPRESS FOR DML, for those partitions.

3. Accessing data in the In-Memory area

3.1 Single-table access

Once data has been loaded into the In-Memory area, it can be accessed through SQL statements. Analytic queries have distinctive traits: they care about a few specific columns of a table rather than all of them, those columns often contain many duplicate values, and the columns used as conditions are usually common data types (numbers, strings, dates). The In-Memory component is designed around these traits to improve the performance of such queries.

First, each column in an IMCU carries its own dictionary and storage index information. During loading, the worker process builds a dictionary of the distinct values of each column in the IMCU, assigns a key value to each row of the column, and stores that key value instead of the actual value; this saves space and also prepares the data for later queries using the CPU's SIMD instructions. The storage index (Storage Index) is actually a common data-warehouse technique: by recording the maximum and minimum values of a column, it avoids reading large amounts of data that cannot match the conditions. The maximum and minimum values of each column in the IMCU, together with their offsets, are stored in the column's header inside the IMCU. When querying, Oracle can quickly filter out non-matching data by comparing against these minimum and maximum values, and when a data change affects the storage index, the corresponding location can be found quickly. Note, however, that storage indexes do not necessarily apply to all WHERE conditions (predicates).

In addition, because the data already resides in memory, most of the work is done by the CPU and little or no I/O occurs (unless part of the queried table has not yet been loaded into the In-Memory area). Using CPU resources efficiently therefore becomes a decisive performance factor, so Oracle employs SIMD (Single Instruction, Multiple Data) technology, letting the CPU process multiple values with a single instruction. Because the set of SIMD instructions is limited, this also explains why Oracle creates dictionary information for each column when building an IMCU. Figure 3 depicts the basic idea of accessing data with SIMD.

Figure 3

In the figure above, the sales table has been loaded into the In-Memory area, and the header of the PROMO_ID column in the IMCU contains the column's dictionary; each row's value in the column has been converted to a key value. When the query condition is PROMO_ID=9999, SIMD lets the CPU compare multiple rows of data at a time, greatly improving query performance.

Finally, we can confirm that a table is accessed through the in-memory option by looking for "TABLE ACCESS INMEMORY FULL" in the execution plan. For example:
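The plan itself did not survive extraction; a minimal sketch of what it looks like (the TEST table and the exact plan shape are illustrative):

```sql
SQL> SELECT COUNT(*) FROM test;
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR());

-----------------------------------------------
| Id | Operation                   | Name |
-----------------------------------------------
|  0 | SELECT STATEMENT            |      |
|  1 |  SORT AGGREGATE             |      |
|  2 |   TABLE ACCESS INMEMORY FULL| TEST |
-----------------------------------------------
```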

3.2 Multi-table joins

Besides optimizing single-table access, the In-Memory component also improves table joins, chiefly through two features: Bloom filters and in-memory aggregation.

Bloom filters (Bloom Filters) should be familiar: their main use is to test whether a value appears in another set, or to find the common elements of large data sets. Oracle has used Bloom filters for table joins in some SQL statements since 10g. If the tables involved in a join have the in-memory attribute and have been loaded into the In-Memory area, the optimizer first picks one table in the join (usually the smaller one), applies a series of hash functions to its join-condition columns, and produces a result bitmap; it then applies the same hash functions to the other table's data in batches and compares the results against that bitmap. The Bloom filter makes in-memory join processing more efficient, because the whole process generates no I/O and SIMD can be used during the comparison.

The CBO decides automatically whether to use a Bloom filter when building the execution plan; the user does not need to specify it manually. If you see information like the following in the execution plan, a Bloom filter is being used.
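The plan did not survive extraction; an illustrative fragment (operation names as Oracle prints them; the step numbers depend on the full statement) would contain JOIN FILTER CREATE and JOIN FILTER USE lines referencing the filter:

```sql
HASH JOIN
  JOIN FILTER CREATE           :BF0000
    TABLE ACCESS INMEMORY FULL TEST_SMALL
  JOIN FILTER USE              :BF0000
    TABLE ACCESS INMEMORY FULL TEST_BIG
```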

The execution plan above shows:

1. First, the table TEST_SMALL is accessed in the In-Memory area (step 5 of the plan), and the Bloom filter BF0000 used by the join is built (step 4).

2. The table TEST_BIG is then accessed in the In-Memory area (step 7 of the plan), and the previously built filter is applied.

3.3 In-memory aggregation

In data-warehouse applications built on analytic queries, besides simple table joins there are often multi-table joins, frequently combined with aggregation and grouping operations, such as the star queries typical of data warehouses. For this kind of query, Oracle offers the vector grouping (Vector Group By) feature to improve SELECT performance. Vector grouping is a two-phase process:

Phase 1: the CBO finds the small dimension tables (Dimension tables) in the query, joins the qualifying columns with the large fact table (Fact table), and generates key vectors (Vector Groups). The vectors are then combined with the fact-table columns that need grouping or aggregation to form a multi-dimensional array and several temporary tables.

Phase 2: the key vectors generated in the previous phase are applied to the fact table, and the values of the columns whose grouping or aggregation results are needed are added to the temporary tables. Finally, the data in these temporary tables is applied to the multi-dimensional array to compute the final grouping or aggregation results.

Throughout the process, both building the key vectors and comparing them against the fact table happen in memory, and SIMD is used as well, so this kind of query can be sped up substantially. Of course, because the work is done in memory, it places heavy demands on memory resources, and the process running such a query needs sufficient PGA space. The following execution plan illustrates the use of vector grouping in a query:
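The original plan did not survive extraction; a heavily abridged sketch of the typical shape (reusing the object names from the walkthrough that follows, with most steps and columns omitted) looks like this:

```sql
TEMP TABLE TRANSFORMATION
  LOAD AS SELECT (KEY VECTOR)        SYS_TEMP_0FD9D6604_116B7C6
    KEY VECTOR CREATE BUFFERED       :KV0000
      TABLE ACCESS INMEMORY FULL     TEST_SMALL_1
  HASH GROUP BY
    VIEW                             VW_VT_F486F43F
      VECTOR GROUP BY
        KEY VECTOR USE               :KV0000
          TABLE ACCESS INMEMORY FULL TEST_BIG
```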

The execution plan above shows:

- First, the tables TEST_SMALL_1 and TEST_SMALL_2 are accessed; of course, they have already been loaded into the In-Memory area.

- The key vectors, KV0000 and KV0001, are then built, and after they are combined with the columns that need grouping, the temporary tables SYS_TEMP_0FD9D6604_116B7C6 and SYS_TEMP_0FD9D6604_116B7C6 are created.

- The table TEST_BIG is accessed, the key vectors are applied to it, and the grouping results are added to the temporary tables.

- The result in the multi-dimensional array, VW_VT_F486F43F, is produced, and the final grouping is completed with a HASH GROUP BY.

4. Integration of In-Memory and RAC

In keeping with the usual pattern for new Oracle features, the In-Memory feature is also compatible with existing database components such as RAC, providing high availability and scalability. Because RAC is a classic share-everything architecture in which multiple nodes open the same database simultaneously, the same database object can be loaded (populated) into multiple nodes, provided the database instances on those nodes have an In-Memory area configured (the INMEMORY_SIZE parameter is non-zero). Since data can be loaded into multiple nodes, two questions arise:

- Question 1: how is data distributed across multiple nodes?

- Question 2: should the data be kept redundant in the In-Memory area to ensure high availability?

For data distribution, Oracle provides two methods: distributing by the data's ROWID range, or by the table's partitions (or subpartitions). The first method divides the table's data into several pieces by rowid range and loads the pieces evenly across different nodes. This distribution suits tables whose data is unevenly distributed and whose access pattern is roughly uniform across instances. For example:

SQL> ALTER TABLE test INMEMORY DISTRIBUTE BY ROWID RANGE;

The second method suits partitioned tables: Oracle loads each partition into the In-Memory area of a different node according to the partition definition, which is better for tables with evenly distributed data. If the application's access to the table is uniform across instances, hash-partitioned tables are especially suitable. For example:

SQL> ALTER TABLE lineorder INMEMORY DISTRIBUTE BY PARTITION;
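On RAC, the resulting placement can be checked from any instance with the GV$ version of the in-memory segment view; each row shows how much of a segment one instance holds:

```sql
-- One row per (instance, segment); the sizes reveal the distribution.
SELECT inst_id, segment_name, populate_status, inmemory_size
  FROM gv$im_segments
 ORDER BY segment_name, inst_id;
```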

As for whether the data should be kept redundant in the In-Memory area: on an ordinary RAC database the data is not kept redundant, whereas on an Exadata machine the In-Memory area can be configured with redundancy. The reason for this choice is that the private-network configurations of non-Exadata RAC systems vary widely. With redundancy enabled, once an instance goes down, large amounts of data must be shipped across the nodes' private network to restore redundancy; if the private network's performance cannot be guaranteed, that transfer may consume a great deal of time and network resources and lead to serious consequences. The Exadata private network, by contrast, runs over a high-speed interconnect using the RDS protocol, with data transfer reaching tens of gigabits per second, so even the heavy private-network traffic caused by a node failure does not disrupt normal cluster operation.

Moreover, since hardware-level high-availability technology is now very mature, most incidents in which a database instance or node goes down are one-off and quickly recoverable. Therefore, upon detecting a failed instance or node, Oracle does not immediately trigger data redistribution; it waits for a period so the problem node or instance can restart and repopulate its own data. Only after the waiting period expires do the other nodes trigger redistribution, spreading the failed node's in-memory data across the healthy nodes. Given this design, it is recommended to reserve some free space in each node's In-Memory area when using the in-memory option on RAC, so that enough room remains for data redistribution.

Figures 4 and 5 below illustrate how the In-Memory area stores data in Exadata and non-Exadata environments.

Figure 4-Exadata environment

Figure 5-non-Exadata environment

From the figures above, it is easy to see that in a RAC environment no single node holds all of a table's data. Therefore, in RAC, Oracle's automatic parallel query (Auto DOP) must be enabled to access tables loaded into the In-Memory area through the in-memory path. Note also that in a multi-instance parallel query it is not IMCUs that travel between instances: each node runs the same SQL statement against its local data and sends its own result set to the instance that initiated the statement, which assembles the final result and returns it to the user. For example, on a 4-node RAC database where table sales has been loaded into the In-Memory area, run the following query:

SQL> SELECT SUM(stock) FROM sales WHERE store_id IN (100, 200, 300) AND order_date = TO_DATE('2016-01-01', 'yyyy-mm-dd');

The CBO first computes the cost of an in-memory scan, and if that cost is lowest, it chooses to access the sales table in the In-Memory area. Next, Oracle consults the data dictionary to find which instances the table is loaded into, starts the corresponding parallel slaves on those nodes, and sends the query statement to them. After each instance's parallel processes run their part of the SQL, the resulting partial sums are sent to the instance that initiated the query, where the final sum is produced and returned to the client. Throughout the process, what travels between instances is not IMCUs but aggregate values, so a large amount of private-network traffic is avoided.

The above is the author's brief introduction to the Oracle 12c In-Memory component. I hope it is helpful to those developing against Oracle databases and serves as a reference when building applications that use the In-Memory component.

Reposted from: http://blog.sina.com.cn/s/blog_74a7d3390102wegl.html
