Today I will talk to you about storage knowledge concerning data consistency, tiered storage, hierarchical storage and information lifecycle management. Many people may not know much about these topics, so to help you understand them better, the editor has summarized the following. I hope you can take something away from this article.
I. Overview
Data consistency refers to whether the logical relationships between associated pieces of data are correct and complete. The question can be understood as: is the application's own view of the data's state consistent with the state of the data eventually written to disk? For example, suppose a transaction actually issues five write operations, and the system fails suddenly after only the first three have been written to disk, so the last two never reach it. At this point the application and the disk disagree about the state of the data. When the system is restored and the database program reads the data back from disk, it finds that the data is logically inconsistent and unusable.
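To make the failure scenario concrete, here is a minimal sketch of the redo-log idea (mentioned again below) that lets a system repair exactly this kind of partially applied transaction. The file names and record format are hypothetical, not any particular database's implementation: the intended writes are logged and synced to disk before the data file is touched, so recovery can replay the whole transaction.

```python
import json
import os

LOG_PATH = "redo.log"   # hypothetical file names, for illustration only
DATA_PATH = "data.db"

def write_transaction(writes):
    """Durably log all intended writes, then apply them to the data file.

    `writes` is a list of (offset, text) pairs. If a crash occurs after
    the log record is synced but before every write reaches the data
    file, recover() can replay the log and finish the transaction.
    """
    if not os.path.exists(DATA_PATH):
        open(DATA_PATH, "wb").close()   # create the data file on first use
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(writes) + "\n")
        log.flush()
        os.fsync(log.fileno())   # redo record is on disk before data is touched
    with open(DATA_PATH, "r+b") as db:
        for offset, text in writes:
            db.seek(offset)
            db.write(text.encode())
        db.flush()
        os.fsync(db.fileno())    # now the data itself is durable

def recover():
    """On restart, re-apply every logged transaction (idempotent replay)."""
    if not os.path.exists(LOG_PATH):
        return
    with open(LOG_PATH) as log, open(DATA_PATH, "r+b") as db:
        for line in log:
            for offset, text in json.loads(line):
                db.seek(offset)
                db.write(text.encode())
        db.flush()
        os.fsync(db.fileno())
```

Because the log record either is or is not fully on disk, recovery after a crash can always bring the data file back to a state the application agrees with.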
II. Data consistency problems caused by Cache
One of the main causes of data consistency problems is the various caches (Cache) on the data I/O path, including the database Cache, file system Cache, storage controller Cache, disk cache and so on. Because different system modules process I/O at different speeds, Caches must be added to buffer I/O operations and bridge those speed differences. While these Caches improve the system's processing performance, they can also "hold up" I/O operations, which has negative side effects. If part of the I/O is still "stranded" in a Cache when the system fails, less data will have reached the disk than the application actually wrote, producing an inconsistency. When the system recovers, the data read directly from the hard disk may contain logic errors that prevent the application from starting. Some database systems (such as Oracle and DB2) can regenerate the data from redo logs and repair the logic errors, but this process is very time-consuming and does not succeed every time. For relatively weak databases, such as SQL Server, the problem is even more serious.
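As a small illustration of where an application's control over these Caches ends, the sketch below (Python, assuming POSIX-style fsync semantics) pushes a write through the application buffer and the operating-system page cache; the Caches inside the storage controller and the disk itself sit below this point and must be handled by configuration (for example write-through mode) or battery backup.

```python
import os

# Sketch: push one write through two of the cache layers on the I/O path.
with open("critical.dat", "wb") as f:
    f.write(b"transaction record")   # data lands in the application/library buffer
    f.flush()                        # drain that buffer into the OS page cache
    os.fsync(f.fileno())             # ask the OS to push the page cache to the device
# Storage-controller and on-disk caches are beyond the reach of this call.
```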
There are two ways to resolve such problems: turn off the Cache, or create a snapshot (Snapshot). Although turning off the Cache degrades the system's processing performance, in some applications it is the only option. For example, some high-end disaster recovery solutions (with an RPO of 0) use synchronous mirroring to replicate data between the production center and the disaster recovery center in real time; because the data is replicated synchronously, the Cache must be turned off.
The purpose of a snapshot is to create a view of a data volume's state at a specific point in time. Through this view you see only the data as it existed at creation time; updates to the source volume (new writes) after that point are not reflected in the snapshot view. With this view you can back up or copy the data. So how is the data consistency of the snapshot view guaranteed? It involves several entities (the storage controller and a snapshot agent installed on the host) and a sequence of actions. The typical flow is as follows: when the storage controller wants to create a snapshot of a data volume, it notifies the snapshot agent; on receiving the notification, the agent tells the application to pause its I/O (enter backup mode) and flushes the Caches of the database and the file system, then returns a message to the controller indicating that the snapshot can be created; as soon as the controller receives this message, it creates the snapshot view and notifies the agent that the snapshot is done; finally the agent tells the application to resume normal operation. Because the application paused its I/O and the host's Caches were flushed, data consistency is ensured.
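The flow can be condensed into a short sketch. The class and method names below are hypothetical stand-ins (real vendor snapshot agents differ in detail), but the ordering of the steps is exactly the one just described:

```python
import time

class Application:
    """Stand-in for the database/file-system side on the host."""
    def enter_backup_mode(self):
        print("app: I/O paused (backup mode)")
    def flush_caches(self):
        print("app: database and file-system caches flushed")
    def exit_backup_mode(self):
        print("app: normal I/O resumed")

class StorageController:
    """Stand-in for the array-side controller."""
    def create_snapshot(self, volume):
        print(f"controller: snapshot of {volume} created")
        return f"snap-{volume}-{int(time.time())}"

class SnapshotAgent:
    """Host-side agent coordinating an application-consistent snapshot."""
    def __init__(self, app, controller):
        self.app = app
        self.controller = controller

    def create_consistent_snapshot(self, volume):
        self.app.enter_backup_mode()    # 1. application pauses I/O
        self.app.flush_caches()         # 2. host caches are flushed down to the array
        try:
            return self.controller.create_snapshot(volume)  # 3. cut the view
        finally:
            self.app.exit_backup_mode() # 4. resume I/O whether or not step 3 worked

agent = SnapshotAgent(Application(), StorageController())
print(agent.create_consistent_snapshot("oradata"))
```

The try/finally matters: the application must leave backup mode even if snapshot creation fails, or its I/O would stay paused indefinitely.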
Creating a snapshot has some impact on application performance (taking an Oracle database as an example, entering backup mode takes about 2 minutes and exiting takes about 1 minute; with communication time added, a snapshot takes roughly 4 minutes), so snapshots should not be created too frequently.
III. Data consistency problems caused by unsynchronized timing
Another major cause of data inconsistency is a lack of time synchronization when operating on multiple associated data volumes (for backup, replication and so on). For example, the database files, redo log files and archive log files of an Oracle database may be stored on different volumes. If the association between those volumes is ignored during backup or replication, the volumes produced are bound to be mutually inconsistent.
The solution to this kind of problem is to establish a "volume group" (Volume Group) containing the associated data volumes and, when creating snapshots, to snapshot all volumes in the group at the same moment so that the snapshots are synchronized in time. Replication or backup operations performed from those snapshot views then strictly preserve the consistency of the data, as the sketch below illustrates.
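A sketch of the volume-group idea, building on the hypothetical Application / StorageController / SnapshotAgent stubs from the snapshot sketch above: I/O is quiesced once for the whole group, and every member volume is snapped inside that single window.

```python
# Uses the `agent` object defined in the snapshot sketch above.

def snapshot_volume_group(agent, volumes):
    """Take time-synchronized snapshots of every volume in a group.

    I/O is paused once for the whole group, so the data, redo-log and
    archive-log volumes are all captured at the same instant and stay
    mutually consistent.
    """
    agent.app.enter_backup_mode()
    agent.app.flush_caches()
    try:
        return [agent.controller.create_snapshot(v) for v in volumes]
    finally:
        agent.app.exit_backup_mode()

# Example: snapshot an Oracle volume group as one consistent unit.
print(snapshot_volume_group(agent, ["oradata", "redo", "archlog"]))
```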
IV. Data consistency in file sharing
Dual-host or cluster configurations are usually used to share data among homogeneous and heterogeneous servers, workstations and storage devices; they are mainly used in nonlinear editing and other applications in which multiple hosts must read and write the same disk partition simultaneously.
In a NAS environment, data can be shared through the network sharing protocols NFS or CIFS. Outside a NAS environment, however, multiple hosts reading and writing one disk partition at the same time raises a write-consistency problem: the file system may be corrupted, or other hosts may be unable to read data just written by the current host. Disk partitions can be shared among multiple hosts by using data sharing software, which arbitrates the writes from the different hosts to keep the data consistent.
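To illustrate the kind of arbitration such software performs, here is a minimal single-host sketch using a POSIX advisory lock. Real SAN file-sharing products use their own cluster-wide lock managers rather than flock, but the principle of serializing writers is the same:

```python
import fcntl

def append_record(path, record: bytes):
    """Serialize writers to a shared file with a POSIX advisory lock.

    This single-host sketch only shows the arbitration principle; a
    cluster product coordinates the same exclusion across all hosts.
    """
    with open(path, "ab") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block until we are the only writer
        try:
            f.write(record)
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```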
HSM: Hierarchical Storage Management. It originated in 1978 and was first used in IBM mainframe systems. It is a technology that combines offline storage with online storage: according to a specified policy, it automatically migrates infrequently used data from disk to secondary mass storage devices such as tape libraries, and when that data is needed again, the tiered storage system automatically migrates it from the secondary device back to disk.
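A toy sketch of the HSM policy just described: files idle for longer than a threshold are migrated to a secondary tier and replaced by a stub, and a recall brings them back on demand. The directory names and the 90-day threshold are illustrative policy knobs, not part of any real product, and both directories are assumed to exist.

```python
import os
import shutil
import time

PRIMARY = "/primary"      # hypothetical fast disk tier
SECONDARY = "/secondary"  # hypothetical tape/archive tier
AGE_LIMIT = 90 * 86400    # migrate files idle for 90 days (policy knob)

def migrate_cold_files():
    """Move long-idle files to the secondary tier, leaving a stub behind."""
    now = time.time()
    for name in os.listdir(PRIMARY):
        src = os.path.join(PRIMARY, name)
        if os.path.isfile(src) and now - os.path.getatime(src) > AGE_LIMIT:
            shutil.move(src, os.path.join(SECONDARY, name))
            with open(src + ".stub", "w") as stub:
                stub.write(name)   # breadcrumb pointing at the archived copy

def recall(name):
    """Bring a migrated file back to the primary tier on demand."""
    shutil.move(os.path.join(SECONDARY, name), os.path.join(PRIMARY, name))
    stub = os.path.join(PRIMARY, name + ".stub")
    if os.path.exists(stub):
        os.remove(stub)
```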
ILM: Information Lifecycle Management. It was proposed by StorageTek (later acquired by Sun) in 2001, but the vigorous promotion and practice of ILM was carried out by EMC.
Tiered Storage refers to storing data in layers based on performance, business continuity, security, protection, data retention, regulatory compliance and cost considerations, for example primary disk, backup disk, archive disk, tape archive, optical archive and so on. HP once put forward the concept of TSC (Tiered Storage Classes).
In fact, the meaning of Tiered Storage is similar to that of HSM: select storage devices of appropriate performance and capacity according to actual needs, so as to reduce the total cost of storage. The definitions of, and differences between, tiered storage and information lifecycle management are discussed in detail below.
I. Tiered storage
Tiered storage means storing data on storage devices of different performance levels, with different storage methods, according to indicators such as the data's importance and access frequency. On the one hand this greatly reduces the space that unimportant data occupies on first-tier local disks; on the other it speeds up the storage performance of the whole system. Several storage devices of different performance, and several different forms of storage, are involved.
At present, the storage devices commonly used for data storage are mainly disks (including disk arrays), magnetic tapes (including tape drives and tape libraries) and optical discs (including CD-R, CD-RW, DVD-R, DVD-RW and similar media in optical towers and jukebox devices). In terms of performance, disk is of course the best, followed by optical disc, with tape the worst. In terms of price, the cost per unit of capacity is highest for disk, followed by optical disc, with tape the lowest. This lets different applications pursue the best price-performance ratio, because the different storage media suit different forms of storage: online storage, near-line storage and offline storage.
1. Online storage
With online storage (OnStore), also known as work-level storage, the storage devices and the stored data are always kept in an "online" state and can be read at any time, meeting the computing platform's speed requirements for data access. The disks commonly used in our PCs, for example, are basically this form of storage. Online storage devices are generally disk devices such as disks and disk arrays; they are relatively expensive but offer the best performance.
2. Offline storage
Offline storage (OffStore) is mainly used to back up online data against possible data disasters, so it is also called backup-level storage. The typical offline mass-storage product is tape or a tape library, which is relatively cheap. Data on offline media is read and written sequentially: to read data, the tape must first be wound to the right position, and to modify data that has already been written, everything must be rewritten. Access to offline mass storage is therefore slow and inefficient.
3. Near-line storage
Near-line storage (NearStore) means storing data that is not used often, or not accessed in large volumes, on lower-performance storage devices. These devices are still expected to address quickly and offer a reasonable transfer rate, but otherwise the performance requirements of near-line storage are relatively low; and because infrequently used data makes up the majority of all data, capacity is the first thing a near-line device must guarantee.
II. Information Lifecycle Management
Information is not created equal: different information has different value (compare key business data with logs, for example), and the value of the same information differs at different stages.
From the moment it is produced, information enters a cycle, completing a life cycle through creation, protection, access, migration, archiving and destruction. This process must be managed well; otherwise either too many resources are wasted, or a lack of resources reduces working efficiency.
The goal of ILM is to maximize the value of information at every point of its life cycle at the lowest TCO. ILM is a strategy for aligning the IT infrastructure with business needs based on the changing value of information.
The implementation of an ILM strategy can be divided into three phases:
Phase I: establish infrastructure classifications or service levels and strive to store information in the appropriate storage tier. This phase lets you exploit the value of a tiered infrastructure; although the work is manual, it lays the foundation for any policy-based information management.
Phase II: complete detailed application and data classification, together with links to business policies. Tools can be used to automate the policies for one or more applications, achieving better management and an optimal allocation of storage resources. Applications that consume large amounts of IT resources, or that can quickly realize an ROI from information lifecycle management, are the ideal targets for this phase.
Phase III: add automation to the established policies, extend information lifecycle management to a broader set of enterprise applications, and further optimize the infrastructure. This phase lets you use as many common components and methods as possible, further reducing operational and infrastructure costs.
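As a sketch of the policy automation that Phases II and III aim at, a classification table can map information classes to service levels and storage tiers. The class names, tiers and retention values below are purely illustrative:

```python
# Hypothetical ILM policy table: information class -> storage service level.
ILM_POLICY = {
    "transactional": {"tier": "online-disk",  "replicas": 2, "retention_days": 365},
    "reporting":     {"tier": "nearline",     "replicas": 1, "retention_days": 730},
    "archive":       {"tier": "tape-library", "replicas": 1, "retention_days": 3650},
}

def place(dataset_class: str) -> dict:
    """Return the storage service level a dataset of this class should receive."""
    try:
        return ILM_POLICY[dataset_class]
    except KeyError:
        raise ValueError(f"no ILM policy defined for class {dataset_class!r}")

# e.g. place("reporting") -> {"tier": "nearline", "replicas": 1, "retention_days": 730}
```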
III. The relationship between tiered storage and information lifecycle management
Tiered storage is just one way of storing data. It is an important part of implementing ILM, but not all of it; confusing the two is like confusing backup or archiving with ILM. Tiered storage is a valuable first step in implementing ILM, but only that: by itself it does not solve many of the issues that become increasingly critical as large amounts of data accumulate in the data center, such as how to recover data quickly and how to deliver storage-management services.
IV. The relationship between tiered storage and hierarchical storage management (a point I do not fully understand; comments from experts are welcome)
In a January 2006 piece for the SNIA Data Management Forum on ILM and Tiered Storage, Michael Peterson noted that there are three mechanisms for placing data in tiered storage:
1. Static placement: an application assigns its information to a particular tier.
2. Staged movement of data in batches (such as archiving).
3. Dynamic, automatic data migration (such as hierarchical storage management, or certain services driven by ILM policies).
After reading the above, do you understand data consistency, tiered storage, hierarchical storage and information lifecycle management better? If you want to learn more, please follow the industry information channel. Thank you for your support.