2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Overview of Database Segmentation
OLTP and OLAP
In the Internet era, storing and accessing massive amounts of data has become a bottleneck in system design and use. By usage scenario, massive data processing falls into two main types: online transaction processing (OLTP) and online analytical processing (OLAP).
Online transaction processing (OLTP), also called transaction-oriented processing, is characterized by raw data being sent to the computing center for processing immediately, with results returned in a very short time.
Online analytical processing (OLAP) analyzes, queries, and reports on data in a multi-dimensional way. It can be combined with data mining and statistical analysis tools to strengthen decision analysis.
The main differences between the two are summarized in the following table:
Aspect | OLTP | OLAP
System function | Daily transaction processing | Statistics, analysis, reporting
DB design | Oriented to real-time transaction applications | Oriented to statistical analysis applications
Data processing | Current, up-to-date, two-dimensional, discrete | Historical, aggregated, multi-dimensional, integrated, unified
Real-time performance | High requirements for real-time reads and writes | Low requirements for real-time reads and writes
Transactions | Strong transactional consistency | Weak transaction requirements
Analysis requirements | Low, simple | High, complex
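As a rough illustration of the two workloads in the table, the sketch below runs short write transactions (OLTP style) and one aggregate report (OLAP style) against an in-memory SQLite database. The `orders` table, its columns, and the function names are hypothetical and chosen only for this example.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, member_id INTEGER, amount REAL, day TEXT)"
)


def place_order(member_id, amount, day):
    """OLTP style: a short transaction that writes one current row."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "INSERT INTO orders (member_id, amount, day) VALUES (?, ?, ?)",
            (member_id, amount, day),
        )
    return cur.lastrowid


def daily_report():
    """OLAP style: scan historical rows and aggregate them per day."""
    return conn.execute(
        "SELECT day, COUNT(*), SUM(amount) FROM orders "
        "GROUP BY day ORDER BY day"
    ).fetchall()


place_order(1, 10.0, "2024-01-01")
place_order(2, 5.0, "2024-01-01")
place_order(1, 7.5, "2024-01-02")
report = daily_report()
```

The write path touches one row and must return quickly, while the report reads every row and can tolerate more latency. This is the contrast the table draws.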
Relational database and NoSQL database
Many technical schemes exist for implementing these two types of system. On the storage side, databases fall mainly into two categories: relational databases and NoSQL databases.
A relational database is a database built on the relational model; it processes data using mathematical concepts and methods such as set algebra. The mainstream Oracle, DB2, MS SQL Server, and MySQL all belong to this traditional category.
NoSQL stands for Not Only SQL: use a relational database where it fits, and where it does not fit, consider a more suitable data store instead. NoSQL databases are mainly divided into temporary key-value stores (memcached, Redis), persistent key-value stores (ROMA, Redis), document-oriented databases (MongoDB, CouchDB), and column-oriented databases (Cassandra, HBase). Each type has its own usage scenarios and advantages.
Traditional relational databases such as Oracle and MySQL are very mature and widely used commercially, so why use a NoSQL database? Mainly because, as the Internet has developed, data volumes keep growing and performance requirements keep rising, while the traditional database has an inherent defect: a single machine (single database) is a performance bottleneck and is difficult to scale out. It therefore cannot meet the growing demands of massive data storage and performance, which is why a variety of NoSQL products have appeared. The fundamental advantages of NoSQL in the cloud computing era are simplicity, easy large-scale distributed scaling, and very high read and write performance.
The following is an analysis of the characteristics, advantages and disadvantages of the two:
Relational database
1) Characteristics:
- The data model is the relational model, with structured storage and integrity constraints.
- It is based on two-dimensional tables and the relationships between them, and must support data operations such as join, union, intersection, difference, and division.
- Data is read and written using Structured Query Language (SQL).
- Operations require data consistency, which calls for transactions and even strong consistency.
2) Advantages:
- Maintains data consistency (transaction processing).
- Supports complex queries such as joins.
- General-purpose, mature technology.
3) Disadvantages:
- Every read and write must go through SQL parsing, so read and write performance is insufficient for large data volumes and high concurrency.
- Reading and writing data or modifying the data structure requires locks, which affects concurrent operations.
- Cannot adapt to unstructured storage.
- Difficult to scale out.
- Expensive and complex.
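The consistency advantage listed above can be made concrete with a small sketch. It uses an in-memory SQLite database, and the `account` table and `transfer` function are hypothetical names introduced only for this example. An integrity constraint rejects an overdraft, and the transaction rolls back the half-finished transfer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK clause is an integrity constraint.
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move amount from src to dst; both updates succeed or neither does."""
    try:
        with conn:  # one transaction: commit on success, rollback on error
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit that was
        # already applied inside this transaction has been rolled back.
        return False


def balances():
    return dict(conn.execute("SELECT id, balance FROM account"))
```

Once the two accounts live in different databases, a single local transaction like this is no longer available, which is where the distributed transaction problem discussed later comes from.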
NoSQL database
1) Characteristics:
- Unstructured storage.
- Based on a multidimensional relational model.
- Each product has its own specific usage scenarios.
2) Advantages:
- Strong read and write capability under high concurrency and large data volumes.
- Basic support for distribution, so it is easy to extend and scale.
- Simple, weakly structured storage.
3) Disadvantages:
- Weak support for joins and other complex operations.
- Weak transaction support.
- Poor generality.
- Lacks complete constraints, so complex business scenarios are poorly supported.
Although traditional databases have inherent disadvantages in the cloud computing era, NoSQL databases cannot replace them; NoSQL can only supplement the traditional database. Working around the weaknesses of the traditional database is therefore a problem the big data era must solve. If a traditional database were easy to scale out and split, the performance limits of a single machine (single database) could be avoided. However, current open source and commercial traditional databases basically do not support large-scale automatic scaling, so a third party is needed to handle it. That is the data segmentation this book discusses. Let's look at how to split the data.
What is data segmentation?
Put simply, data segmentation means distributing the data stored in one database across multiple databases (hosts) according to specific conditions, in order to spread the load of a single device.
Data sharding can be divided into two modes according to the type of segmentation rule. The first splits data into different databases (hosts) by table (or schema); this is called vertical segmentation. The second splits the rows of a single table across multiple databases (hosts) according to conditions based on the logical relationships of the data in the table; this is called horizontal segmentation.
The main feature of vertical segmentation is that its rules are simple and it is easy to implement. It is especially suitable for systems where the coupling between businesses is very low, interaction is minimal, and business logic is clear. In such a system it is easy to split the tables used by different business modules into different databases. Splitting by table has less impact on the application, and the splitting rules are relatively simple and clear.
Horizontal segmentation is somewhat more complex than vertical segmentation. Because different rows of the same table are split into different databases, the split rule is more complex for the application than one based on table names, and later data maintenance is also more complex.
Vertical segmentation
A database is composed of many tables, and each table corresponds to a different business. Vertical segmentation classifies tables by business and distributes them to different databases, so that the data and the load are spread across those databases, as shown below:
The system is divided into several modules: user, order transaction, and payment.
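A minimal sketch of this kind of split, assuming three databases for the user, order, and payment modules. The table and database names are hypothetical.

```python
# Hypothetical mapping: each business module's tables live in one database.
TABLE_TO_DATABASE = {
    "user": "user_db",
    "user_address": "user_db",
    "orders": "order_db",
    "order_item": "order_db",
    "payment": "pay_db",
}


def route_table(table):
    """Vertical split: the table name alone decides the target database."""
    try:
        return TABLE_TO_DATABASE[table]
    except KeyError:
        raise ValueError(f"no database configured for table {table!r}") from None


def same_database(*tables):
    """A join can stay local only if every table routes to one database."""
    return len({route_table(t) for t in tables}) == 1
```

With this mapping, `same_database("orders", "order_item")` is true and the join runs inside `order_db`, while `same_database("orders", "user")` is false and the join would have to cross databases.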
In a well-designed application system, the overall functionality is composed of many functional modules, and the data each module needs corresponds to one or more tables in the database. In architecture design, the more unified the interaction points between functional modules, the lower the coupling of the system and the better the maintainability and extensibility of its modules. In such a system, vertical segmentation of data is easier to achieve.
However, some tables in a system are often hard to make fully independent, and cross-database joins arise. For such tables a trade-off is needed: either the database gives way to the business and the tables share one data source, or they are split into multiple databases and the businesses call each other through interfaces. In the early stage of a system, when the data volume is small or resources are limited, sharing a data source is the usual choice. Once the data grows to a certain scale and the load becomes heavy, segmentation is necessary.
Generally speaking, a business with complex joins is hard to segment, while an independent business is easy to segment. How to split, and how far, is a hard problem that tests the technical architecture.
Let's analyze the advantages and disadvantages of vertical segmentation:
Advantages:
After the split, the business is clear and the split rules are clear.
Integration and extension between systems are easier.
Data maintenance is simple.
Disadvantages:
Some business tables can no longer be joined, so the problem can only be solved through interfaces, which increases system complexity.
Because each business is limited in different ways, a single database can still become a performance bottleneck, and it is hard to expand the data and improve performance.
Transaction processing is complex.
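The first disadvantage, replacing a join with interface calls, can be sketched as follows. The two in-memory structures stand in for separate user and order databases, and `fetch_users` stands in for a call to another module's interface. All names are hypothetical.

```python
# Hypothetical in-memory stand-ins for two separate databases.
USER_DB = {1: {"id": 1, "name": "alice"}, 2: {"id": 2, "name": "bob"}}
ORDER_DB = [
    {"order_id": 100, "user_id": 1, "amount": 30},
    {"order_id": 101, "user_id": 2, "amount": 12},
    {"order_id": 102, "user_id": 1, "amount": 8},
]


def fetch_users(user_ids):
    """Stand-in for a batch call to the user module's interface."""
    return {uid: USER_DB[uid] for uid in user_ids if uid in USER_DB}


def orders_with_user_names():
    """Application-level join: read orders, then look users up in one batch."""
    orders = list(ORDER_DB)
    users = fetch_users({o["user_id"] for o in orders})
    return [
        {**o, "user_name": users[o["user_id"]]["name"]}
        for o in orders
        if o["user_id"] in users
    ]
```

The join logic that the database used to perform now lives in application code, and it costs an extra round trip. This is the added complexity the list refers to.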
Because vertical segmentation distributes tables to different databases by business, some business tables may still grow too large, hitting the read, write, and storage bottlenecks of a single database. Horizontal segmentation is needed to solve this.
Horizontal segmentation
In contrast to vertical segmentation, horizontal segmentation does not classify tables. Instead, it distributes the data of one table across multiple databases according to a rule on some field, so that each database's table holds part of the data. Put simply, horizontal segmentation splits data by row: some rows of the table go to one database and other rows go to other databases, as shown in the figure:
To split the data, you need to define sharding rules. A relational database is a two-dimensional model of rows and columns, and the first principle of splitting is to find the split dimension. Take a merchant order trading system as an example. From the member's point of view, if the system queries a member's orders for a given day or month, the data should be split by member combined with date, with the data grouped by member ID so that all query joins are resolved within a single database. From the merchant's point of view, to query all of a merchant's orders on a given day, the data should be split by merchant ID. If the system wants to split by member and also by merchant, difficulties arise. Finding suitable sharding rules requires weighing all of these factors.
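To see why the split dimension matters, here is a minimal sketch that assumes orders are split by member ID across four databases. The shard count and function names are hypothetical.

```python
SHARD_COUNT = 4  # assumed number of databases


def shard_for_member(member_id):
    """Orders are split by member ID in this sketch."""
    return member_id % SHARD_COUNT


def shards_for_query(member_id=None, merchant_id=None):
    """Return the shards a query must visit under the member-ID split."""
    if member_id is not None:
        # The query filters on the split key, so one shard is enough.
        return [shard_for_member(member_id)]
    # merchant_id is not the split key: any shard may hold matching rows.
    return list(range(SHARD_COUNT))
```

A query for one member's orders visits a single database, while a query for one merchant's orders has to visit all four and merge the results. Splitting by merchant ID instead would simply reverse the situation.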
Several typical sharding rules include:
Take the user ID modulo the number of databases and distribute the data accordingly, so that all data for the same user lands in one database.
Distribute data by date, putting different months or even days into different databases.
Take the modulo of a specific field, or distribute data to different databases by ranges of that field.
As shown in the figure, the principle of sharding is to find suitable sharding rules for the business and distribute the data to different databases accordingly. Here is an example based on the user ID:
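A minimal sketch of the typical rules listed above, starting with modulo on the user ID. The shard count, the `orders_YYYYMM` naming scheme, and the range boundaries are all assumptions made for this example.

```python
from bisect import bisect_right
from datetime import date

SHARD_COUNT = 4  # assumed number of databases


def by_modulo(user_id, shard_count=SHARD_COUNT):
    """Modulo rule: every row of one user maps to the same shard index."""
    return user_id % shard_count


def by_month(day):
    """Date rule: one target per month (the naming scheme is hypothetical)."""
    return f"orders_{day.year:04d}{day.month:02d}"


# Range rule: exclusive upper bounds of the first three shards (assumed).
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]


def by_range(key):
    """Range rule: keys below 1,000,000 go to shard 0, and so on.

    Keys at or above the last bound all go to the final shard.
    """
    return bisect_right(RANGE_UPPER_BOUNDS, key)
```

For instance, `by_modulo(10)` gives shard 2, `by_month(date(2024, 3, 5))` gives `orders_202403`, and `by_range(1_000_000)` gives shard 1.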
Once the data has been split, there are both advantages and disadvantages.
Advantages:
When the split rules are well abstracted, joins can basically be done inside a single database.
A single database no longer hits the performance bottleneck of large data volumes and high concurrency.
Few changes are needed on the application side.
System stability and load capacity are improved.
Disadvantages:
Split rules are difficult to abstract.
Transaction consistency across shards is difficult to solve.
Repeated expansion of the data is difficult and requires a great deal of maintenance.
Cross-database joins perform poorly.
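The difficulty of repeated expansion can be quantified for the simple modulo rule. The sketch below is an illustration under that assumption, not a description of any particular product. It counts how many keys land on a different shard when the number of shards changes.

```python
def moved_fraction(keys, old_count, new_count):
    """Fraction of keys whose shard changes when the modulo base changes."""
    keys = list(keys)
    moved = sum(1 for k in keys if k % old_count != k % new_count)
    return moved / len(keys)


fraction_4_to_5 = moved_fraction(range(10_000), 4, 5)
fraction_4_to_8 = moved_fraction(range(10_000), 4, 8)
```

Going from 4 to 5 shards relocates 80% of these keys, while doubling from 4 to 8 relocates 50%, and each moved key goes from shard i to shard i + 4. Either way a large share of the data has to be migrated, which is the maintenance cost noted above.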
Having covered the differences, advantages, and disadvantages of vertical and horizontal segmentation, we find that each method has its own shortcomings, but they share the following disadvantages:
Distributed transactions are introduced.
Joins across nodes.
Merging, sorting, and paging across nodes.
Managing multiple data sources.
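The cross-node merge, sort, and paging problem can be sketched in a few lines. Here each inner list stands in for the rows held by one node, and `paged_query` is a hypothetical helper that is not taken from any middleware.

```python
import heapq
from itertools import islice


def paged_query(shards, offset, limit, key=lambda row: row):
    """Emulate a global ORDER BY with OFFSET and LIMIT across shards.

    Each shard must return its first offset + limit rows in sorted order,
    because any of them might belong to the requested global page.
    """
    per_shard = [sorted(rows, key=key)[: offset + limit] for rows in shards]
    merged = heapq.merge(*per_shard, key=key)  # k-way merge of sorted lists
    return list(islice(merged, offset, offset + limit))


shards = [[1, 9, 4], [7, 2, 8], [3, 6, 5]]
page = paged_query(shards, offset=4, limit=3)
```

Note that every node has to supply up to offset + limit rows, so deep pages become more expensive as the offset grows.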
At present, there are two main approaches to data source management:
a. Client mode: each application module configures and manages the one or more data sources it needs, accesses each database directly, and integrates the data within the module.
b. Proxy mode: all data sources are managed uniformly through an intermediate proxy layer, and the back-end database cluster is transparent to front-end applications.
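The difference between the two approaches can be sketched with two small classes. Both are hypothetical illustrations; the class names, the modulo routing, and the string data source names are assumptions, and a real proxy would forward the SQL over the network rather than return it.

```python
class ShardedDataSources:
    """Client mode: the application module holds and picks data sources."""

    def __init__(self, sources):
        # sources maps shard index 0..n-1 to a data source, e.g. {0: "db0"}
        self.sources = sources

    def source_for(self, user_id):
        return self.sources[user_id % len(self.sources)]


class Proxy:
    """Proxy mode: routing lives in one middle layer behind one endpoint."""

    def __init__(self, sources):
        self._router = ShardedDataSources(sources)

    def execute(self, user_id, sql):
        target = self._router.source_for(user_id)
        # A real proxy would send sql to target and return the rows.
        return (target, sql)


proxy = Proxy({0: "db0", 1: "db1"})
routed = proxy.execute(3, "SELECT * FROM orders WHERE user_id = 3")
```

In client mode every application module needs its own copy of the `ShardedDataSources` configuration. In proxy mode the application only calls `execute`, so the back-end layout can change without touching application code.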
Faced with these two approaches, probably more than 90% of people would choose the second, especially as the system grows larger and more complex. This is indeed the right choice: although it may cost more in the short term, it greatly helps the scalability of the whole system.
Mycat overcomes the defects of the traditional database through data segmentation while gaining the easy scalability of NoSQL. Through its intermediate proxy layer, it spares the application from handling multiple data sources and is completely transparent to the application. It also provides solutions to the problems that arise after data segmentation. The following chapter analyzes the origin of Mycat and how it splits the data.
Because joins become difficult after data sharding, here is some experience with data sharding:
The first principle: if you can avoid splitting, do not split.
The second principle: if you must split, choose suitable segmentation rules and plan in advance.
The third principle: when splitting, try to reduce the chance of cross-database joins through data redundancy or table grouping (Table Group).
The fourth principle: database middleware struggles to choose well among join implementations, and high performance is extremely hard to achieve, so business reads should use multi-table joins as little as possible.
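The third principle can be illustrated with a small sketch in which a child table is routed by the same key as its parent, so that a parent-child join never leaves one shard. This shows the general idea only; the table names, the redundant `user_id` column, and the modulo rule are assumptions for the example.

```python
SHARD_COUNT = 4  # assumed number of databases


def shard_of(user_id):
    return user_id % SHARD_COUNT


def place_order(shards, order_id, user_id, items):
    """Write an order and its items to the shard chosen by user_id."""
    shard = shards[shard_of(user_id)]
    shard["orders"].append({"order_id": order_id, "user_id": user_id})
    for sku in items:
        # user_id is stored redundantly on each item so that item rows are
        # routed by the same rule as their parent order.
        shard["order_items"].append(
            {"order_id": order_id, "user_id": user_id, "sku": sku}
        )


def order_with_items(shards, user_id, order_id):
    """Join answered from one shard: parent and child rows are co-located."""
    shard = shards[shard_of(user_id)]
    return [i["sku"] for i in shard["order_items"] if i["order_id"] == order_id]


shards = [{"orders": [], "order_items": []} for _ in range(SHARD_COUNT)]
place_order(shards, 100, 6, ["pen", "ink"])
```

The cost is one redundant column per item row. The benefit is that fetching an order together with its items touches a single database.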
What is Mycat, where does it come from, and how does it solve these problems? Let's analyze that in the next chapter.