
Design and Implementation of a Multi-Model Database Engine

2025-01-19 Update From: SLTechnology News&Howtos shulou


As businesses become Internet-based and intelligent, and as architectures move toward microservices and the cloud, application systems place new demands on data storage and management. The diversity of data has become a major challenge for database platforms, and it has given rise to a new mainstream direction in the database field.

A multi-model (Multi-Model) database is a single database that supports multiple storage engines, so it can manage structured, semi-structured and unstructured data in a unified way.

1. Cloud databases create the demand for Multi-Model

More and more enterprises connect applications with widely varying requirements to cloud databases. The traditional approach is to offer more than a dozen different database products in a dbPaaS to cover these needs. As the system grows, the cost of overall maintenance and of keeping data consistent across products becomes very high, which affects the usability of the entire system.

Schematic diagram of a multi-model cloud database

To achieve unified business data management and data fusion, a new database needs multi-model (Multi-Model) data management and storage capabilities. Generally speaking, structured data is data stored in tabular form, and typical applications include traditional businesses such as core banking transactions. Semi-structured data is widely used in scenarios such as user profiling, Internet of Things device log collection, and application clickstream analysis. Unstructured data corresponds to the processing of large volumes of pictures, videos and documents, a business that is growing rapidly with the development of financial technology.

Multi-model data management lets the database store and manage data uniformly across departments and businesses, fuse data from multiple services, and support diversified application services. Architecturally, multi-model also targets the needs of cloud databases: one data management system supports multiple data types and therefore multiple business models, which greatly reduces usage and operation and maintenance costs.

2. Multi-Model storage engine architecture

The database is the core of many existing business systems. With the rapid development of data generation and acquisition technology, the amount of data is growing explosively, and the structure of data is becoming more flexible and diverse. With the arrival of big data and artificial intelligence, traditional database management systems based on relational theory face great challenges in cost, performance, scalability, fault tolerance and other areas.

Modern applications have different storage requirements for structured, semi-structured and unstructured data, so the database also needs to adapt to managing multiple types of data.

Two popular solutions are polyglot persistence (Polyglot Persistence) and the multi-model database (Multi-Model Database).

1) Polyglot persistence (Polyglot Persistence)

Polyglot persistence means that users choose an appropriate database for each kind of work, so that a complete system may run many different databases at the same time.

Figure 1: schematic diagram of Polyglot Persistence

A significant advantage of polyglot persistence is better performance for each individual task, but the disadvantage is equally obvious: running several databases side by side makes deployment, use and maintenance more complex and raises the learning cost.

2) Multi-model (Multi-Model)

The multi-model database is the other solution. Multiple data engines live in the same database, so that various types of data are stored and used centrally. Many different types of applications connect to one database at the same time and are managed in the same distributed database, which greatly reduces the cost of application development and later maintenance.

Figure 2: schematic diagram of the architecture of the multimode database engine

The figure is a schematic diagram of a multi-model database. The same database contains multiple data engines, such as relational data, JSON semi-structured data, object data and a full-text search engine. This architecture greatly reduces the difficulty of development and of operation and maintenance. Applications connect to the database in a uniform way, and the data is partitioned, isolated and managed inside the database. An application only needs to connect to the database; there is no need to build a separate data back end for each application.

3. Storage data structure

The requirements of a multi-model database call for new designs in the storage data structures of a distributed database. The following describes how SequoiaDB designs and implements data storage structures and data access for multi-model use, which can serve as a reference for multi-model databases.

3.1 structured and semi-structured data storage

Structured data has a fixed structure in which every row has the same attributes, such as the data in traditional relational database tables. Semi-structured data is self-describing: it contains tags that separate semantic elements and express the hierarchy of records and fields, as in XML and JSON.

Storage structure

How can one data engine manage both structured and semi-structured data? SequoiaDB uses the JSON data model and, inside the database, uses the BSON format to store structured and semi-structured data in collections as documents.

BSON (Binary JSON) is a binary encoding of JSON. Like JSON, BSON supports embedded documents and arrays. Several key-value pairs are stored together as a single entity, which is called a document. BSON includes the data types of JSON and adds some types that JSON lacks, such as Date and BinData. A simple example of the BSON structure is shown in the following figure.
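To make the layout concrete, here is a minimal sketch of an encoder for a small subset of BSON types, following the public BSON specification (a little-endian int32 total length, then typed elements, then a terminating zero byte). The function names are hypothetical, only a few types are covered, and this is not SequoiaDB's own encoder.

```python
import struct


def _cstring(text: str) -> bytes:
    """Encode a key as a NUL-terminated UTF-8 string."""
    return text.encode("utf-8") + b"\x00"


def encode_element(name: str, value) -> bytes:
    """Encode one key-value pair as: type byte, key, value bytes."""
    key = _cstring(name)
    if isinstance(value, bool):  # checked before int: bool is an int subclass
        return b"\x08" + key + (b"\x01" if value else b"\x00")
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return b"\x10" + key + struct.pack("<i", value)  # int32
        return b"\x12" + key + struct.pack("<q", value)      # int64
    if isinstance(value, float):
        return b"\x01" + key + struct.pack("<d", value)      # double
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # String length counts the trailing NUL byte.
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    if isinstance(value, dict):
        return b"\x03" + key + encode_document(value)        # embedded document
    if isinstance(value, list):
        # An array is a document whose keys are "0", "1", ...
        indexed = {str(i): item for i, item in enumerate(value)}
        return b"\x04" + key + encode_document(indexed)
    raise TypeError(f"unsupported type: {type(value).__name__}")


def encode_document(doc: dict) -> bytes:
    """Encode a dict as: int32 total length, elements, terminating 0x00."""
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body
```

Because every element carries its own type byte and key, a reader can walk the document without any external schema, which is the self-describing property the text refers to.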

Figure 3: example of BSON structure

BSON has the following features: it is lightweight, traversable, and efficient. Because the BSON structure contains enough self-describing information, it is a schema-less storage format.

SequoiaDB uses BSON as the storage structure for records. Thanks to this flexibility, there is no need to define the structure of a collection in advance; the fields in each record can be the same or different and can be modified at any time, so structured and semi-structured data can be stored and accessed in a consistent way.

The data management model in SequoiaDB is shown in Figure 4.

Figure 4: SequoiaDB data management model architecture diagram

Data is ultimately persisted in disk files, and the three concepts related to it are as follows:

file (File): a physical file on disk for persistent storage of collection data, indexes, and LOB data.

pages (Page): pages are a basic structure used to organize data in database files. Pages are used in SequoiaDB to manage and allocate space in the file.

data block (Extent): consists of several pages for storing records.

In this model, the three core logical concepts related to structured / semi-structured data storage include:

collection space (Collection Space): an object used to store collections; physically it corresponds to a set of files on disk.

collection (Collection): the logical object that holds the document.

document (Document): records stored in a collection, stored in BSON structure.

A collection contains several extents, all chained together in a linked list. When a document is inserted into the collection, space is allocated from an extent. If the current extent does not have enough space, a new extent is allocated (extending the file if necessary), appended to the collection's extent linked list, and the document is inserted into it. The records in each extent are also organized as a linked list, so that all records in the extent can be read sequentially during a table scan.
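The allocation path described above can be sketched as follows. This is a simplified in-memory model, not SequoiaDB's on-disk format: the class names, the byte-based capacity, and the use of a Python list in place of the per-extent record linked list are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Extent:
    """A block of space holding records; extents form a singly linked list."""
    capacity: int                                        # hypothetical size in bytes
    used: int = 0
    records: List[bytes] = field(default_factory=list)   # stands in for the record chain
    next: Optional["Extent"] = None


class Collection:
    def __init__(self, extent_capacity: int = 64) -> None:
        self.extent_capacity = extent_capacity
        self.first: Optional[Extent] = None
        self.last: Optional[Extent] = None

    def _append_extent(self) -> Extent:
        """Allocate a new extent and hang it on the end of the extent list."""
        extent = Extent(self.extent_capacity)
        if self.last is None:
            self.first = extent
        else:
            self.last.next = extent
        self.last = extent
        return extent

    def insert(self, record: bytes) -> None:
        """Place the record in the current extent, or in a new one if it is full."""
        if len(record) > self.extent_capacity:
            raise ValueError("record larger than one extent")
        current = self.last
        if current is None or current.used + len(record) > current.capacity:
            current = self._append_extent()
        current.records.append(record)
        current.used += len(record)

    def scan(self) -> Iterator[bytes]:
        """Table scan: walk the extent list and read each extent's records in order."""
        extent = self.first
        while extent is not None:
            yield from extent.records
            extent = extent.next

    def extent_count(self) -> int:
        count, extent = 0, self.first
        while extent is not None:
            count, extent = count + 1, extent.next
        return count
```

With an extent capacity of 10 bytes, inserting three 4-byte records fills the first extent with two records and places the third in a newly appended extent; a scan still returns them in insertion order.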

Data access

1) SQL

A large number of database-based applications use SQL to access the database, so SQL support is an indispensable capability. SequoiaDB supports the standard SQL interface and is fully compatible with PostgreSQL and MySQL syntax and protocols. Existing applications can switch their storage system to SequoiaDB smoothly and gain the scalability, performance and reliability improvements of a distributed storage system.

2) API

SequoiaDB provides a rich set of APIs for managing the whole cluster and operating on structured data, and provides drivers for the mainstream programming languages.

Data compression

The nested structure of JSON/BSON gives it a flexible storage structure, but it also causes data expansion. The expansion of stored JSON data was also an important cause of performance bottlenecks in early JSON databases such as MongoDB.

Because SequoiaDB uses JSON/BSON as its data storage structure, a data compression mechanism is built into the data engine to avoid excessive expansion. At present, the SequoiaDB engine provides two types of compression: row compression and table compression. Row compression uses the Snappy algorithm, a fast compression mechanism that does not require a dictionary. Table compression uses the LZW algorithm, a dictionary-based compression mechanism.

Data compression saves storage space and cost on the one hand and improves the efficiency of each unit of I/O on the other. In query scenarios with very high I/O throughput, deep compression based on a data dictionary can greatly reduce I/O overhead and effectively improve query efficiency.
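To illustrate what "dictionary-based" means here, the following is a textbook LZW sketch, not SequoiaDB's table compression code. It emits a list of integer codes rather than a packed bit stream, and the function names are hypothetical.

```python
from typing import List


def lzw_compress(data: bytes) -> List[int]:
    """Replace repeated byte sequences with codes from a growing dictionary."""
    dictionary = {bytes([i]): i for i in range(256)}  # start with all single bytes
    next_code = 256
    current = b""
    codes: List[int] = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                 # keep extending the match
        else:
            codes.append(dictionary[current])   # emit the longest known match
            dictionary[candidate] = next_code   # learn the new sequence
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes: List[int]) -> bytes:
    """Rebuild the same dictionary on the fly and expand each code."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The code refers to the entry being defined in this very step.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        output.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)
```

Records in one collection tend to repeat the same field names, so a dictionary learned from the data can replace those repeated sequences with short codes, which is why this family of algorithms suits JSON-like data.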

3.2 unstructured data storage

Storage structure

Unstructured data is data without a fixed structure, such as documents, pictures, audio and video. This type of data accounts for a growing share of many businesses. In SequoiaDB, large objects (LOB, Large Object) are used to manage this type of data.

Large objects are attached to ordinary collections. When a user uploads a large object, the system assigns it a unique OID value, which can be used to specify subsequent operations on the large object.

Large objects are split into shards when they are stored, and the shards are distributed across the corresponding partition groups by a hash algorithm whose hash space is the same as that of the collection. The shard size is the LOB page size, which is specified when the collection space is created and defaults to 512 KB.

To store and manage LOB data effectively, SequoiaDB separates LOB data into metadata and the data itself, and uses two kinds of files to store them: the LOBM file stores LOB shard metadata, and the LOBD file stores the actual LOB data shards. Their logical structure is shown in the following figure.

Figure 5: logical structure of LOB file

The LOBM file mainly includes:

header: contains some metadata information for this file.

space management segment (SME): used to mark page usage.

bucket management segment (BME): pages occupied by shards with the same hash value are hung on a bucket in the form of a doubly linked list.

page: one-to-one correspondence with the page in LOBD, recording the collection information, OID and sequence values to which the page belongs.

LOBD files mainly include:

header: contains some metadata information for this file.

real data page: used to store LOB shards. A LOB also has some metadata of its own, stored in the shard with sequence 0, including the size of the LOB data, creation time, version number, and so on.

Data access

1) write to LOB

When LOB data is written, it is split into shards on the coordination node, and each shard is assigned a sequence value that represents its position in the original LOB data. The OID of the LOB and the sequence value of the shard therefore uniquely identify the shard.

When a LOB shard is stored, its OID + sequence is used to calculate hash values. First, the collection's partition hash function determines which partition group the shard is stored on. Then the LOB shard hash function determines which bucket it is attached to. Finally, data pages are allocated in the LOBD and LOBM files to complete the write, and the page in LOBM is hung on the corresponding bucket.
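The placement steps above can be sketched as follows. The two hash functions are stand-ins chosen for illustration (SHA-256 and BLAKE2b from the standard library); the source does not say which functions SequoiaDB uses. The function names, the group and bucket counts, and the choice to number data shards from 0 without modelling the LOB metadata are all assumptions.

```python
import hashlib
from typing import List, Tuple


def split_lob(data: bytes, page_size: int) -> List[Tuple[int, bytes]]:
    """Cut the LOB into page-sized shards and give each a sequence value."""
    return [
        (sequence, data[offset:offset + page_size])
        for sequence, offset in enumerate(range(0, len(data), page_size))
    ]


def _shard_key(oid: str, sequence: int) -> bytes:
    """OID + sequence uniquely identifies a shard."""
    return f"{oid}:{sequence}".encode("utf-8")


def partition_hash(oid: str, sequence: int) -> int:
    """Stand-in for the collection's partition hash function (assumption)."""
    digest = hashlib.sha256(_shard_key(oid, sequence)).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_hash(oid: str, sequence: int) -> int:
    """Stand-in for the LOB shard hash function (assumption)."""
    digest = hashlib.blake2b(_shard_key(oid, sequence)).digest()
    return int.from_bytes(digest[:8], "big")


def place_shard(oid: str, sequence: int,
                group_count: int, bucket_count: int) -> Tuple[int, int]:
    """Return (partition group index, bucket index) for one shard."""
    group = partition_hash(oid, sequence) % group_count
    bucket = bucket_hash(oid, sequence) % bucket_count
    return group, bucket
```

Because placement depends only on the OID and the sequence value, any node can recompute where a shard lives without consulting a central directory, which is what makes the read path possible.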

2) read LOB

To read LOB data, the caller specifies its OID value. Using the OID, the engine fetches the shard with sequence value 0 and reads the LOB metadata from it. It then computes the shard layout, determines all of the LOB's shards, and sends requests to every partition group that holds one of them.

After receiving the shard data returned by all partition groups, the coordination node merges the shards in sequence order to restore the complete LOB data.
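The merge step on the coordination node can be sketched as below. The shape of the responses (a mapping from a group name to a list of sequence and payload pairs) and the function name are hypothetical; the sketch only shows that shards arriving in arbitrary order from several groups are reassembled by sequence value.

```python
from typing import Dict, List, Tuple


def merge_shards(responses: Dict[str, List[Tuple[int, bytes]]],
                 shard_total: int) -> bytes:
    """Reassemble a LOB from per-group shard lists, ordered by sequence."""
    by_sequence: Dict[int, bytes] = {}
    for shards in responses.values():
        for sequence, payload in shards:
            by_sequence[sequence] = payload

    # Every sequence value from 0 to shard_total - 1 must be present.
    missing = [s for s in range(shard_total) if s not in by_sequence]
    if missing:
        raise ValueError(f"missing shards: {missing}")

    return b"".join(by_sequence[s] for s in range(shard_total))
```

In this sketch the expected shard count is passed in by the caller; in the flow described above it would be derived from the LOB size recorded in the metadata.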

3) Standard POSIX file system interface

In addition to the LOB API, SequoiaDB currently provides the SequoiaFS file system, which is implemented on FUSE under Linux and supports the general file manipulation API. SequoiaFS uses SequoiaDB collections to store the attribute information of files and directories, and LOB objects to store file contents, thus implementing a distributed network file system similar to NFS. Users can mount a remote SequoiaDB collection on the local node, and then manipulate files and directories under the mount directory through the common file system API.
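From the application's point of view, a mounted directory is used like any other directory. The sketch below uses only ordinary Python file APIs; the function names are hypothetical, and in the demo a temporary directory stands in for a real SequoiaFS mount point, so nothing here exercises SequoiaFS itself.

```python
import tempfile
from pathlib import Path


def store_and_read(mount_dir: str, relative_name: str, payload: bytes) -> bytes:
    """Write a file under the mount directory and read it back."""
    target = Path(mount_dir) / relative_name
    target.parent.mkdir(parents=True, exist_ok=True)  # ordinary directory creation
    target.write_bytes(payload)                       # ordinary file write
    return target.read_bytes()                        # ordinary file read


def demo() -> bytes:
    # A temporary directory stands in for the SequoiaFS mount point (assumption).
    with tempfile.TemporaryDirectory() as mount_dir:
        return store_and_read(mount_dir, "reports/q1.bin", b"\x00\x01report")
```

On a real mount, per the description above, the directory and file attributes would end up in a collection and the file contents in a LOB, with no change to application code.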

4. Summary

According to a Gartner report, Multi-Model has been one of the main technical directions in the database field in recent years. It represents a new approach to managing multiple types of data under a cloud architecture, and it is also a new option for simplifying operation and maintenance and saving development costs.

SequoiaDB's Multi-Model database products have been applied in many industries, which shows that the market is gradually accepting this new database architecture. Databases such as MySQL and PostgreSQL are also beginning to support additional formats such as JSON and are moving toward multi-model as well. We believe that products will continue to innovate and that more multi-model database products will appear.

