Data types and distributed Storage

Updated: 2025-04-22. Source: SLTechnology News&Howtos (Shulou.com), Database section.

Overview:

Data types

1. Structured data

★ Definition:

Structured data, also called row data, is data stored in a database whose logical form can be expressed as a two-dimensional table.

More generally, data that can be represented in a uniform structure, such as numbers and symbols, is called structured data. It follows the traditional relational data model: rows stored in a database and laid out as a two-dimensional table.

In practice, structured data is the data held in a database. Typical scenarios make this concrete: enterprise ERP, financial systems, medical HIS databases, education card systems, government administrative examination and approval, and other core databases. The storage requirements of these applications are essentially high-speed storage, data backup, data sharing, and disaster recovery.

All data in a relational database is structured data.

Structured data: accessed with SQL; TPS (transaction processing) performance is relatively poor; MySQL improves performance through master-slave replication and by splitting databases and tables.
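As a minimal sketch of the "two-dimensional table" idea, the following uses Python's built-in sqlite3 module with an in-memory database. The table name `patients` and its columns are hypothetical and chosen only for illustration; they do not come from any system mentioned above.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
)
conn.executemany(
    "INSERT INTO patients (id, name, age) VALUES (?, ?, ?)",
    [(1, "Alice", 34), (2, "Bob", 51)],
)

# Every row has the same columns, so the result is a two-dimensional table.
rows = conn.execute("SELECT id, name, age FROM patients ORDER BY id").fetchall()
print(rows)
```

The schema is declared before any row is inserted, which is the "first structure, then data" property of structured data.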

2. Unstructured data

★ Definition and function:

Unstructured data includes office documents in all formats, text, pictures, XML, HTML, various reports, images, and audio/video information.

In contrast to structured data (row data stored in a database and expressible as a two-dimensional table), data that cannot conveniently be represented by a two-dimensional logical table is called unstructured data.

An unstructured database is a database whose field lengths are variable and in which each field's record can consist of repeatable or non-repeatable subfields. It can handle structured data (such as numbers and symbols), and it is better suited than a relational database to unstructured data (full text, images, sound, film and video, hypermedia, and so on).

The unstructured Web database was created mainly for unstructured data. Its biggest difference from the relational databases that were popular before it is that it removes the relational constraints of a predefined structure and fixed-length data: it supports repeated fields, subfields, and variable-length fields, and so can process variable-length data and repeated fields and manage variable-length storage of data items. This makes it considerably better suited than a traditional relational database to continuous information (including full text) and unstructured information (including all kinds of multimedia).

Unstructured data: key-value (k-v) storage
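To illustrate key-value storage of unstructured data, here is a toy in-memory store that treats every value as opaque bytes. The class name `BlobStore` is hypothetical, and deriving the key from a SHA-256 digest of the content is an assumption made for this sketch, not something the article prescribes.

```python
import hashlib


class BlobStore:
    """Toy in-memory key-value store: opaque bytes in, opaque bytes out."""

    def __init__(self):
        self._data = {}

    def put(self, blob: bytes) -> str:
        # Assumption for this sketch: the key is the SHA-256 digest of the content.
        key = hashlib.sha256(blob).hexdigest()
        self._data[key] = blob
        return key

    def get(self, key: str) -> bytes:
        return self._data[key]


store = BlobStore()
payload = b"\x89PNG fake image bytes"
key = store.put(payload)
print(len(key), store.get(key) == payload)
```

The store never looks inside the value, which is why pictures, audio, and documents can all be kept the same way.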

3. Semi-structured data

★ Definition and function:

Semi-structured data lies between fully structured data (such as data in relational or object-oriented databases) and fully unstructured data (such as sound and image files). HTML documents are semi-structured data. Such data is generally self-describing, and its structure and content are not clearly separated.

Semi-structured data: JSON, XML (document stores such as MongoDB and Elasticsearch)
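A small sketch of what "self-describing" means, using Python's standard json module. The two documents below are invented for illustration: each one carries its own field names, and the two do not share a schema.

```python
import json

# Two hypothetical documents; no schema was declared in advance.
docs = [
    '{"title": "Intro to CAP", "tags": ["distributed", "theory"]}',
    '{"title": "Release notes", "author": {"name": "Lee"}, "pages": 3}',
]
records = [json.loads(d) for d in docs]

# The structure is discovered from the data itself, record by record.
field_sets = [sorted(r.keys()) for r in records]
print(field_sets)
```

Only `title` is common to both records; a document store accepts both without a table definition.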

4. Data model

Structured data: two-dimensional tables (relational)

Semi-structured data: trees, graphs

Unstructured data: none

The data models used by database management systems include the network (mesh) model, the hierarchical model, and the relational model (Oracle, MySQL, etc.)

NoSQL (non-relational) databases: memcached, Redis, MongoDB

Structured data: the structure is defined first, then the data is stored

Semi-structured data: the data comes first, and the structure is derived from it

CAP theory (consistency, availability, partition tolerance)

CAP theory is widely known in the Internet industry, and many engineers use it as a yardstick for system design. The common understanding is that no distributed system can provide availability, consistency, and partition tolerance all at once; it can fully provide at most two. Designing a distributed system therefore means choosing a trade-off among the three.

★ Definition and function:

C (consistency): data on all nodes is kept synchronized at all times

A (availability): each request can receive a response, regardless of success or failure

P (partition tolerance): the system should be able to provide continuous service, even if messages are lost within the system (partition)

High availability and consistent data is the goal of many system designs, but partitioning is inevitable.

☉ CA without P:

If partition tolerance (P) is not required, that is, partitions are assumed never to occur, then strong consistency (C) and availability (A) can both be guaranteed. In practice, though, partitioning is not a matter of choice; it will always happen eventually. A CA system therefore usually means that, after a partition, each sub-system continues to maintain CA on its own.

☉ CP without A:

If availability (A) is not required, every request must be strongly consistent across servers, and a partition (P) can stretch synchronization time indefinitely; requests simply wait, so CP can still be guaranteed. Distributed transactions in many traditional databases follow this model.

☉ AP without C:

To stay highly available while tolerating partitions, consistency has to be given up. Once a partition occurs, nodes may lose contact with each other; to remain available, each node can serve requests only from its local data, which leads to globally inconsistent data. Many NoSQL systems fall into this category.
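The CP and AP choices can be sketched with a toy two-replica model. This is not a real replication protocol: the `Cluster` and `Replica` classes and the `mode` flag are hypothetical, and a partition is simulated by a boolean. In "CP" mode a write during a partition is refused; in "AP" mode it is accepted locally and the replicas diverge.

```python
class Replica:
    def __init__(self):
        self.value = None


class Cluster:
    """Two replicas; mode is "CP" or "AP" (a toy model, not a real protocol)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: consistency is kept, availability is lost.
                raise RuntimeError("unavailable during partition")
            replica.value = value  # AP: accept locally, replicas may diverge
            return
        replica.value = value
        other.value = value  # no partition: replicate synchronously


cp = Cluster("CP")
cp.write(cp.a, 1)
cp.partitioned = True
try:
    cp.write(cp.a, 2)
    cp_rejected = False
except RuntimeError:
    cp_rejected = True

ap = Cluster("AP")
ap.write(ap.a, 1)
ap.partitioned = True
ap.write(ap.a, 2)

print(cp_rejected, cp.a.value == cp.b.value, ap.a.value, ap.b.value)
```

The CP cluster stays consistent but rejects the request, while the AP cluster answers the request and ends up with different values on the two replicas.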

Distributed Storage Technology and its Application

1. The key links and challenges of massive data

★ Key links in big data:

☉ Generation of massive data

Access log data

Business data

User upload

☉ Application of massive data

Accurate advertising

Personalized customization

Future prediction

☉ Management of massive data

File

Picture

Data

★ Challenges brought by big data

Data acquisition

Data storage

Data search

Data sharing

Data transmission

Data analysis

Data visualization

2. How is big data stored?

According to Did You Know (http://didyouknow.org/), the amount of information currently accessible on the Internet is close to 10^24 (one million billion billion). Large websites likewise store huge amounts of data. Storing this data effectively is a problem every large-website architect must solve, and distributed storage technology was developed to solve it.

★ Problems with traditional storage:

Scale-up is limited by array space

Scale-out is limited by switching equipment

Nodes are limited by the file system

For example, storing images on NFS runs into limits on bandwidth, storage space, and request concurrency.

★ The concept of distributed storage:

Unlike the centralized storage technology in common use today, distributed storage does not keep data on one or a few specific nodes. Instead, it uses the disk space of every machine in the enterprise over the network and combines these scattered storage resources into a single virtual storage device, so that data is spread across every corner of the enterprise.

★ Features of a distributed storage system:

☉ Scalable

A distributed storage system can scale to clusters of hundreds or even thousands of machines, and overall system performance grows linearly as the cluster grows.

☉ Reliable and high-performance

Whether for the whole cluster or for a single server, a distributed storage system is required to deliver high performance.

☉ Low cost (Cheap)

The automatic fault tolerance and automatic load balancing of a distributed storage system make it possible to build it on ordinary PCs. Linear scaling also makes it easy to add and remove machines and to automate operations and maintenance.

☉ Easy to use

A distributed storage system needs to provide easy-to-use external interfaces. It also needs complete monitoring and operations tools, and should integrate easily with other systems, for example by importing data from a Hadoop cloud computing system.

★ Classification of distributed storage by mechanism:

☉ Generic distributed storage:

Distributed storage (no mounting, no complex file system mechanisms, no permission model): MogileFS, FastDFS

☉ Dedicated distributed storage:

Distributed file system (supports mounting): MooseFS, ...

★ Challenges of distributed storage

Communication between nodes

Data storage

Data space balance

Fault tolerance

File system support

★ The core of distributed storage

Metadata storage (efficient)

Data storage (redundant)
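The two core pieces, a metadata map and redundant data storage, can be sketched as follows. Everything here is a simplifying assumption: the node names, the tiny chunk size, the replica count, and the round-robin placement are all invented for illustration and do not describe how any particular system works.

```python
CHUNK_SIZE = 4   # bytes per chunk; tiny on purpose
REPLICAS = 2     # copies kept of each chunk

nodes = {"node0": {}, "node1": {}, "node2": {}}  # data nodes: chunk_id -> bytes
metadata = {}                                     # file name -> [(chunk_id, [node names])]


def put(name, data):
    names = sorted(nodes)
    entries = []
    for offset in range(0, len(data), CHUNK_SIZE):
        index = offset // CHUNK_SIZE
        chunk_id = f"{name}#{index}"
        # Place the replicas on consecutive nodes, round-robin.
        targets = [names[(index + r) % len(names)] for r in range(REPLICAS)]
        for target in targets:
            nodes[target][chunk_id] = data[offset:offset + CHUNK_SIZE]
        entries.append((chunk_id, targets))
    metadata[name] = entries


def get(name, failed=()):
    out = b""
    for chunk_id, targets in metadata[name]:
        alive = [t for t in targets if t not in failed]
        if not alive:
            raise OSError(f"all replicas of {chunk_id} are unavailable")
        out += nodes[alive[0]][chunk_id]
    return out


put("photo.jpg", b"0123456789")
print(get("photo.jpg", failed={"node0"}))
```

Reads go through the metadata first, so it has to be fast; the data itself survives the loss of one node because every chunk has a second copy elsewhere.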

★ There are generally two types of storage:

☉ Centralized:

NAS: Network Attached Storage; file system level, such as NFS, FTP, Samba...

SAN: Storage Area Network; block level, such as IP SAN, FC SAN...

☉ Distributed:

With a central node: each cluster has nodes dedicated to storing metadata, while the other nodes each store part of the data

Without a central node: every node in the cluster stores metadata and part of the data
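One way a design without a central node can avoid a dedicated metadata server is to compute data placement from a hash, so that any node can work out where a key lives. The sketch below uses rendezvous (highest-random-weight) hashing; this technique is swapped in here as an illustration and is not claimed to be what any system named in this article uses. The node names and the `locate` helper are hypothetical.

```python
import hashlib


def score(node, key):
    # Deterministic pseudo-random weight for a (node, key) pair.
    digest = hashlib.sha256(f"{node}:{key}".encode()).hexdigest()
    return int(digest, 16)


def locate(key, nodes, replicas=2):
    """Any node can compute this locally; no metadata server is consulted."""
    return sorted(nodes, key=lambda n: score(n, key), reverse=True)[:replicas]


nodes = ["node0", "node1", "node2", "node3"]
before = locate("photo.jpg", nodes)
# Simulate losing the primary: the former second choice becomes the first.
after = locate("photo.jpg", [n for n in nodes if n != before[0]])
print(before, after)
```

Because the result depends only on the key and the current node list, removing a node moves only the keys that node held.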

★ distributed storage and distributed file systems:

File system: with file system interface

Storage: no file system interface, accessed through API

★ Common implementations of distributed storage and file systems

Google Filesystem

GFS is good at dealing with single large files.

GFS combined with MapReduce (programming model, runtime framework, and API) can split a program to run on multiple nodes, achieving distributed processing.

Hadoop Distributed Filesystem

Modeled on the ideas of GFS; good at handling single large files

GlusterFS

Good at handling single large files

Taobao Filesystem (TFS)

Taobao's open source file system; good at handling large numbers of small files and suited to large-scale scenarios

MogileFS

High-performance distributed storage; good at handling large numbers of small files

Ceph

A PB-scale distributed file system for Linux (in testing)

MooseFS

A distributed file system that is POSIX compatible (through FUSE) and can be mounted and used directly; with many nodes or under high concurrency, its scalability is poor and its performance is average

Lustre

A parallel distributed file system
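The MapReduce programming model mentioned under GFS can be sketched in plain Python as a word count. This is a single-process illustration only: in a real framework each input split would be mapped on a different node, whereas here the splits are processed one after another. The function names are hypothetical.

```python
from collections import defaultdict


def map_phase(split):
    # Emit (word, 1) for every word in one input split.
    return [(word, 1) for word in split.split()]


def shuffle(pairs):
    # Group all emitted values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    return key, sum(values)


splits = ["big data big storage", "data storage data"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

The map step needs only its own split and the reduce step needs only the values for one key, which is what lets the work be spread across nodes.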

3. Specific technology and application

According to its degree of structure, massive data can be roughly divided into structured, unstructured, and semi-structured data.

★ Storage and application of structured data

☉ Definition:

Structured data here means data of a user-defined type that contains a series of attributes, each with its own data type. It is stored in a relational database and can be expressed as a two-dimensional table.

☉ Storage:

Most systems hold a large amount of structured data, generally stored in a relational database such as Oracle or MySQL. When the system grows too large for a single-node database, there are generally two approaches: vertical expansion and horizontal expansion.

◆ Vertical expansion:

Vertical expansion is the easier one to understand. Put simply, it splits the database by function and stores the data of different functions in different databases, so that one large database becomes several small ones and the database as a whole can grow. A well-designed application is built from many loosely coupled functional modules, and the data each module needs corresponds to one or more tables. The less the modules interact and the lower the coupling of the system, the easier vertical splitting is.
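A minimal sketch of splitting by function: each table is assigned to the database of the module that owns it. The table and database names are hypothetical. The helper `can_join_locally` shows why low coupling matters, since a query touching tables in different databases can no longer be a simple local join.

```python
# Hypothetical mapping from table (functional module) to database.
TABLE_TO_DB = {
    "users": "db_account",
    "profiles": "db_account",
    "orders": "db_trade",
    "payments": "db_trade",
    "products": "db_catalog",
}


def database_for(table):
    return TABLE_TO_DB[table]


def can_join_locally(*tables):
    # A join stays inside one database only if all tables live there.
    return len({database_for(t) for t in tables}) == 1


print(
    database_for("orders"),
    can_join_locally("orders", "payments"),
    can_join_locally("orders", "users"),
)
```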

◆ Horizontal expansion:

Put simply, horizontal splitting divides data by row: some rows of a table go to one database and the other rows go to other databases. So that the database holding any given row is easy to determine, the split always follows a rule, such as a range of a numeric field, a range of a time field, or the hash of a field.
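Two of the splitting rules named above, a numeric range and a field hash, can be sketched as routing functions. The shard names, the `rows_per_shard` value, and the choice of CRC32 as the hash are assumptions made for this example.

```python
import zlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical database names


def shard_by_hash(user_id: str) -> str:
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return SHARDS[zlib.crc32(user_id.encode()) % len(SHARDS)]


def shard_by_range(order_id: int, rows_per_shard: int = 1_000_000) -> str:
    # IDs beyond the last range are kept in the last shard in this sketch.
    return SHARDS[min(order_id // rows_per_shard, len(SHARDS) - 1)]


print(shard_by_range(42), shard_by_range(2_500_000), shard_by_hash("alice") in SHARDS)
```

Range rules keep neighboring rows together, which suits range scans; hash rules spread rows evenly but scatter neighboring keys.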

Vertical and horizontal expansion each have advantages and disadvantages; a large system generally combines the two.

★ Storage and application of unstructured data

☉ Definition:

In contrast to structured data, data that cannot conveniently be represented by a two-dimensional logical table is called unstructured data. It includes office documents in all formats, text, pictures, XML, HTML, various reports, images, and audio/video information.

☉ Storage: distributed storage

Distributed file system is the main technology to realize unstructured data storage.

Google File System (GFS)

Hadoop Distributed Filesystem (HDFS)

TFS:Taobao Filesystem

GlusterFS (decentralized design)

Lustre (HPC)

Ceph (kernel-level implementation)

Mogile Filesystem (distributed storage, accessed through an API: PHP, Java, Perl, Python)

Moose Filesystem (MFS)

FastDFS

★ Storage and application of semi-structured data

☉ Definition:

Semi-structured data lies between fully structured data (such as data in relational or object-oriented databases) and fully unstructured data (such as sound and image files). The semi-structured data model has some structure, but it is more flexible than the traditional relational and object-oriented models. It is not based on the strict schema concept of traditional databases, and the data in it is self-describing.

Because semi-structured data has no strict schema definition, it is not well suited to storage in a traditional relational database. Databases suited to this kind of data are called "NoSQL" databases.

☉ Storage: NoSQL database

Often described as the next generation of databases, NoSQL databases are non-relational, distributed, lightweight, and horizontally scalable, and they generally do not guarantee ACID compliance. "NoSQL" is actually a misleading name; "non-relational database" is more accurate. A non-relational database is one that:

Models data logically with loosely typed, extensible data schemas (maps, columns, documents, graphs, and so on) rather than as tuples in a fixed relational schema.

Is designed for horizontal scaling, with a cross-node data distribution model that is subject to the CAP theorem (of consistency, availability, and partition tolerance, at most two can be guaranteed). This implies support for multiple data centers and for dynamic provisioning (transparently adding or removing nodes in a production cluster), that is, elasticity.

Can persist data on disk, in memory, or both, sometimes through pluggable custom storage.

Supports a variety of non-SQL interfaces (usually more than one) for data access.
