2025-01-17 Update — From: SLTechnology News & Howtos > Database
Shulou (Shulou.com) 05/31 Report --
In this piece, the editor shares what NoSQL data modeling techniques are good for. Since most readers may not know much about the topic, this article is offered for your reference; I hope you learn a lot from it. Let's get started!
NoSQL Data Modeling Techniques
NoSQL databases are often evaluated on non-functional qualities such as scalability, performance, and consistency. These characteristics of NoSQL have been studied widely in theory and practice, and the research focus has been on the non-functional, performance- and distribution-related properties. We all know the CAP theorem applies well to NoSQL systems (Chen Hao's note: CAP, that is, Consistency, Availability, Partition Tolerance): in a distributed system you can guarantee at most two of these three at the same time, and NoSQL systems generally give up consistency. On the other hand, NoSQL data modeling has not been studied nearly as well, because it lacks the kind of theoretical foundation that the relational model gives relational databases. This article compares the NoSQL family from the data modeling perspective and discusses several common data modeling techniques.
Before discussing the modeling techniques themselves, it is worth taking a somewhat systematic look at how the NoSQL data models evolved, so that we can see the internal connections between them. The evolutionary map of the NoSQL family below shows this progression: the Key-Value era, the BigTable era, the Document era, the full-text search era, and the graph database era:
NoSQL Data Models
First of all, we should note that SQL and the relational data model have been around for a long time, and their end-user orientation has implications:

End users are usually more interested in aggregated views of the data than in individual data items; SQL does most of that aggregation work.

Users have no manual control over concurrency, integrity, consistency, or data type validation. That is why SQL has to do so much heavy lifting around transactions, schemas (two-dimensional table structures), and foreign-key joins.

On the other hand, software applications are in many cases perfectly able to manage data aggregation, integrity, and validity themselves, and giving up database-enforced consistency and integrity is a great help to performance and distributed storage. This trade-off is the backdrop against which the data models evolved:
Key-Value storage is very simple yet powerful, and many of the techniques below build on it. But Key-Value has one fatal limitation: we cannot query a range of keys. (Chen Hao's note: anyone who has studied the hash-table data structure knows that a hash table is an unordered container; unlike ordered containers such as arrays, linked lists, and queues, we cannot control the order in which data is stored.) The Ordered Key-Value data model was designed to remove this limitation and fundamentally improve how we work with data sets.

The Ordered Key-Value model is also very powerful, but it still prescribes no structure for the Value; in general the Value can only be parsed and interpreted by the application, which is inconvenient. Hence the BigTable class of databases appeared. Their data model is essentially a map of maps of maps — layer upon layer of nested key-value pairs (the value is itself key-value). The Value in these databases is organized through column families and columns, with versions controlled by timestamps. (Chen Hao's note: timestamp-based versioning mainly addresses concurrent writes to stored data — the so-called optimistic locking. For details, see "Multi-Version Concurrency Control (MVCC) in Distributed Systems".)
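As a rough illustration (plain Python dictionaries only, not any particular database's API; the row key "user:42" and family "info" are made-up names), the BigTable-style "map of maps" with timestamped versions might look like this:

```python
# Sketch of the BigTable "map of maps" model:
# row key -> column family -> column -> {timestamp: value}
table = {
    "user:42": {
        "info": {
            "name":  {1700000000: "Alice"},
            "email": {1700000000: "a@example.com",
                      1700000500: "alice@example.com"},
        },
    },
}

def latest(table, row, family, column):
    """Return the most recent version of a cell (timestamp-based versioning)."""
    versions = table[row][family][column]
    return versions[max(versions)]
```

Reading `latest(table, "user:42", "info", "email")` yields the newest email, while older versions remain available for MVCC-style concurrency control.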
Document databases improve on the BigTable model with two meaningful advances. The first is allowing Values with schemes of arbitrary, application-defined structure, rather than just maps of maps; the second is indexes. Full-text search engines can be seen as a variant of document databases: both provide flexible schemas and automatic indexing. The main difference between them is that document databases group indexes by field names, whereas full-text search engines group indexes by field values.

Graph data models can be viewed as a branch of this evolution that grew out of the Ordered Key-Value model. Graph databases let you model data with a graph structure. They are related to document databases in that many graph database implementations allow a value to be a map or a document.
Summary of NoSQL data model
The rest of this article introduces the techniques and patterns of data modeling. Before getting to the techniques themselves, a short preface:

NoSQL data model design usually starts from the concrete queries of the business application, rather than from the relationships within the data:

The relational data model starts by analyzing the structure of the data and the relationships between data items. Its design philosophy is: "What answers do I have?"

The NoSQL data model starts from how the application will access the data — for example, "I need to support this kind of query." Its design philosophy is: "What questions do I have?"

NoSQL data model design demands a deeper understanding of data structures and algorithms than relational database design does. In this article I discuss several well-known data structures that are not exclusive to NoSQL but are very helpful for NoSQL data models.
Data redundancy and de-normalization are first-class citizens.
Relational databases are inconvenient for hierarchical and graph-like data. NoSQL is clearly a very good fit for graph-like data, and almost all NoSQL databases can handle such problems well. That is why this article devotes a chapter to the hierarchical data model.

Below is the classification of NoSQL databases — also the products I experimented with while writing this article:
Key-Value storage: Oracle Coherence, Redis, Kyoto Cabinet
BigTable-like storage: Apache HBase, Apache Cassandra
Document database: MongoDB, CouchDB
Full-text index: Apache Lucene, Apache Solr
Graph databases: neo4j, FlockDB
Conceptual Techniques
This section focuses on the basic principles of the NoSQL data model.
(1) Denormalization
Denormalization can be thought of as copying the same data into multiple documents or tables, so that queries can be simplified and optimized, or so that the data fits the particular model the user needs. Most of the techniques in this article rely on it to some degree.

In general, denormalization requires weighing the following trade-offs:

Queried data volume / query IO vs. total data volume. With denormalization, all the data a single query needs can be combined and stored in one place. But the same data required by other, different queries must then also be stored in their respective places. The resulting redundancy increases the total volume of data.

Processing complexity vs. total data volume. Performing query-time table joins over a normalized schema clearly increases processing complexity, especially in a distributed system. A denormalized data model lets us store data in query-friendly structures and keep query processing simple.
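A minimal sketch of the trade-off (the `users` and order records here are invented for illustration): the denormalized order copies the buyer's name in, so one read answers the query, at the cost of redundant data.

```python
# Denormalization: copy the fields a query needs into the record it reads,
# trading extra storage for a single-lookup query.
users = {"u1": {"name": "Alice", "email": "a@example.com"}}

# Normalized order: showing the buyer's name needs a second lookup (a "join").
order_normalized = {"order_id": "o1", "user_id": "u1", "total": 30}
buyer_name = users[order_normalized["user_id"]]["name"]

# Denormalized order: the name is copied in, so one read is enough --
# but if the user renames, every copy must be updated.
order_denormalized = {"order_id": "o1", "user_id": "u1",
                      "user_name": users["u1"]["name"], "total": 30}
```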
Applicability: Key-Value stores, document databases, BigTable-style databases.
(2) Aggregates
All types of NoSQL databases provide flexible schemas (loose restrictions on data structure and format):

Key-Value stores and graph databases generally place no constraints on the Value at all, so the Value can be in any format. This also lets us model a business entity as a set of keys. For example, a user-account entity could be represented by the keys UserID_name, UserID_email, UserID_messages, and so on. If a user has no email or messages, the corresponding records simply do not exist.

The BigTable model supports flexible schemas through sets of columns, called column families. BigTable can also keep different versions of the same record (distinguished by timestamps).

Document databases are inherently schema-less hierarchical stores, although some allow you to validate the data being saved against a schema.
A flexible schema lets you store a set of related business entities in a nested internal data format (Chen Hao's note: something like JSON-style data encapsulation). This brings two benefits:

It minimizes "one-to-many" relationships — entities can be stored nested, so fewer table joins are needed.

The internal data layout can more closely mirror the business entities, especially heterogeneous ones, whether they are stored as a document collection or a table.
The figure below illustrates both benefits with the product model of an e-commerce site. (Chen Hao's note: I remember discussing the challenges of product-catalog database design for e-commerce in the article "Challenges Everywhere".)

First, every Product has an ID, a Price, and a Description.

Second, different types of goods have different attributes: author is an attribute of a book, length an attribute of jeans. Some attributes may be "one-to-many" or "many-to-many" relationships, such as the tracks on a record.

Third, some business entities cannot use a fixed set of attributes at all. For example, not all brands of jeans share the same attributes; some famous brands have very particular ones.

Designing such a data model for a relational database is not easy, and the result will be far from elegant. NoSQL's flexible schema lets a single aggregate (Product) model all the different kinds of goods and their varying attributes:
Entity Aggregation
In the figure above we can compare relational databases with NoSQL. But note that when it comes to data updates, denormalized storage exacts a real cost in performance and consistency — that is the sacrifice we have to watch for and accept.
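The product aggregate above can be sketched as a list of self-describing records (an illustration with invented field values, not a real database API): heterogeneous products live in one collection, each carrying only the attributes it needs.

```python
# Flexible schema: different product types carry different attributes,
# yet all live in one "products" collection; no ALTER TABLE is needed
# when a new attribute appears.
products = [
    {"id": 1, "price": 12.0, "description": "A novel",
     "author": "B. Author"},                                   # book attribute
    {"id": 2, "price": 40.0, "description": "Jeans",
     "length": 34, "brand_attrs": {"fit": "slim"}},            # brand-specific
    {"id": 3, "price": 9.0, "description": "Record",
     "tracks": ["Intro", "Song A", "Song B"]},                 # one-to-many, nested
]

# Queries simply test for the attributes they care about:
books = [p for p in products if "author" in p]
```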
Applicability: Key-Value stores, document databases, BigTable-style databases.
(3) Application Side Joins
NoSQL databases generally do not support table joins. As we said earlier, NoSQL is "question-oriented" rather than "answer-oriented", and not supporting joins is a consequence of that: joins are constructed at design time rather than assembled at execution time, and joins carry a large runtime cost. (Chen Hao's note: anyone who has written SQL joins knows what a Cartesian product is; see the earlier CoolShell article "Graphical Database Table Joins".) After applying the Denormalization and Aggregates techniques — nested data entities, for example — we mostly do not need joins at all. When you do need to combine data, you do it at the application layer. Here are the major use cases:

Many-to-many entity relationships often still need to be joined.

Aggregates do not suit data whose fields change frequently. Frequently changing fields are better split into a separate table, with the data joined at query time. For example, in a messaging system we might have a User entity with embedded Message entities; but if users keep appending messages, it is better to make Message a separate entity and join User and Message at query time.
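The User/Message split can be sketched like this (a minimal illustration with invented data; in practice `users` and `messages` would be two stores or tables): the application performs the join itself at read time.

```python
# Application-side join: User and Message are stored separately;
# the application stitches them together at query time.
users = {"u1": {"name": "Alice"}}
messages = [
    {"user_id": "u1", "text": "hello"},
    {"user_id": "u1", "text": "world"},
]

def user_with_messages(uid):
    """Join the two entities in application code, not in the database."""
    doc = dict(users[uid])  # copy, so the stored record stays unchanged
    doc["messages"] = [m["text"] for m in messages if m["user_id"] == uid]
    return doc
```

New messages are cheap appends to `messages`; the User record is never rewritten.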
Applicability: Key-Value stores, document databases, BigTable-style databases, graph databases.
General Modeling Techniques

In this section we discuss the common data modeling techniques used across NoSQL.
(4) Atomic Aggregates
Many NoSQL databases (not all) are weak at transaction processing. In some cases transactionality can still be achieved through distributed locks or application-managed MVCC (Chen Hao's note: see "Multi-Version Concurrency Control (MVCC) in Distributed Systems" on this site), but in general only the Aggregates technique can guarantee some of the ACID properties.

This is one reason relational databases need a strong transaction mechanism: relational data is normalized and therefore spread across different places. Aggregation, by contrast, lets us save a whole business entity as one document, one row, or one key-value pair, which we can then update atomically:
Atomic Aggregates
Of course, the Atomic Aggregates data model does not deliver full transaction processing. But if the store supports atomic writes, locks, or test-and-set instructions, Atomic Aggregates is applicable.
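A sketch of the idea, with an in-memory store standing in for the database (the `KVStore` class and "acct:1" key are invented for illustration): because the whole entity lives under one key, a single test-and-set replaces it atomically.

```python
import threading

# Atomic aggregate: the whole business entity lives under one key, so a
# test-and-set (compare-and-swap) on that key updates it atomically.
class KVStore:
    def __init__(self):
        self._data, self._lock = {}, threading.Lock()

    def get(self, key):
        return self._data.get(key)

    def test_and_set(self, key, expected, new):
        """Atomically replace the value only if it still equals `expected`."""
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

store = KVStore()
store.test_and_set("acct:1", None, {"balance": 100, "history": []})
old = store.get("acct:1")
new = {"balance": old["balance"] - 30, "history": old["history"] + ["-30"]}
ok = store.test_and_set("acct:1", old, new)  # whole entity swapped in one step
```

If a concurrent writer had changed the entity in between, `test_and_set` would return False and the application would retry, which is exactly the optimistic-locking pattern mentioned above.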
Applicability: Key-Value stores, document databases, BigTable-style databases.
(5) Enumerable Keys
Perhaps the biggest benefit of unordered Key-Value is that business entities can be easily hashed and partitioned across many servers. Sorted keys complicate that, yet sometimes an application can gain a great deal from sorted keys, even when the database itself does not provide ordering. Consider a data model for email messages:

Some NoSQL databases provide atomic counters that generate sequential IDs. We can then use userID_messageID as a composite key. Knowing the latest message ID, we can step to the previous message, and possibly to the messages before and after it.

Messages can also be bucketed — for example, a daily mailbag — letting us traverse the mail for a specified period of time.
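A toy version of the userID_messageID scheme (an `itertools` counter stands in for a database-side atomic counter; key shapes like `u1_00000002` are invented for illustration):

```python
import itertools

# Enumerable keys: sequential message IDs from an atomic counter make
# userID_messageID keys addressable and scannable in order.
counter = itertools.count(1)
store = {}

def post(user_id, text):
    mid = next(counter)                    # stand-in for a DB atomic counter
    store[f"{user_id}_{mid:08d}"] = text   # zero-padded so keys sort correctly
    return mid

for t in ["hi", "how are you", "bye"]:
    post("u1", t)

# Knowing the latest ID is 3, fetch the previous message directly by key:
prev = store["u1_%08d" % 2]
```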
Applicability: Key-Value stores.
(6) Dimensionality Reduction

Dimensionality reduction is a technique for mapping multidimensional data onto a Key-Value model or another non-multidimensional data model.

Traditional geographic information systems index locations with structures such as the quadtree or the R-tree. These structures must be updated in place, which becomes expensive at large data volumes. A different approach is to traverse the two-dimensional structure and flatten it into a list. A well-known example is Geohash. Geohash scans two-dimensional space along a zigzag route; each move in the traversal can be recorded as a 0 or a 1, producing a bit string as the scan proceeds. The figure below shows the algorithm. (Chen Hao's note: first divide the map into four parts — a longitude bit first, then a latitude bit — so the left half of the longitude range is 0 and the right half is 1, and likewise for latitude, top 1 and bottom 0. Combining the longitude and latitude bits yields four values identifying four regions, and each region can be split into four again, recursively. The resulting string of 0s and 1s is then base32-encoded using the characters 0-9 and b-z, excluding a, i, l, and o, giving a short code. That is the Geohash algorithm.)
Geohash Index
The most powerful feature of Geohash is that simple prefix/bit operations tell you how close two regions are, as the figure shows. (Chen Hao's note: two nearby boxes share a long common prefix — it looks a lot like IP address prefixes.) Geohash transforms a two-dimensional coordinate into a one-dimensional data model; this is the dimensionality reduction technique. For the dimensionality reduction technique over BigTable, see [6.1] at the end of the article. For more on Geohash and related techniques, see [6.2] and [6.3].
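A simplified sketch of the Geohash idea (real Geohash additionally base32-encodes the bits; the coordinates below are rough values for San Francisco, Oakland, and Tokyo, used only to show prefix behavior): recursively halve longitude and latitude, emitting one bit per split, so a 2-D point flattens to a 1-D string whose shared-prefix length reflects proximity.

```python
# Dimensionality reduction, Geohash-style bit interleaving (simplified).
def encode(lat, lon, bits=16):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    out = []
    for i in range(bits):
        # even bits split longitude, odd bits split latitude
        rng, v = (lon_rng, lon) if i % 2 == 0 else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if v >= mid:
            out.append("1"); rng[0] = mid   # keep the upper half
        else:
            out.append("0"); rng[1] = mid   # keep the lower half
    return "".join(out)

sf = encode(37.77, -122.42)
oak = encode(37.80, -122.27)    # nearby city: long shared prefix with sf
tokyo = encode(35.68, 139.69)   # far away: prefixes diverge immediately
```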
Applicability: Key-Value stores, document databases, BigTable-style databases.
(7) Index Table

An index table is a very straightforward technique that gives you the benefits of indexing in a database that does not support indexes — BigTable-style databases are the most prominent case. It requires maintaining a special table whose keys follow the access pattern. For example, suppose a master table stores user accounts, accessed by UserID. A query needs to find all users in a given city, so we add another table whose primary key is the city and whose value is all the UserIDs associated with that city, as shown below:
Index Table Example
As you can see, the city index table must stay consistent with the master user table, so every update to the master table may require updating the index table as well; otherwise the index is refreshed in batch. Either way, the consistency requirement costs some performance.

An index table can be thought of as the equivalent of a view in a relational database.
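A minimal sketch of maintaining such an index in application code (dictionaries stand in for the master and index tables; user IDs and cities are invented):

```python
# Index table: a second table keyed by city, kept in sync by the
# application whenever the master user table changes.
users = {}          # master table: user_id -> record
users_by_city = {}  # index table:  city -> set of user_ids

def put_user(user_id, record):
    old = users.get(user_id)
    if old:  # remove the stale index entry so the index stays consistent
        users_by_city[old["city"]].discard(user_id)
    users[user_id] = record
    users_by_city.setdefault(record["city"], set()).add(user_id)

put_user("u1", {"name": "Alice", "city": "Paris"})
put_user("u2", {"name": "Bob", "city": "Paris"})
put_user("u1", {"name": "Alice", "city": "Lyon"})  # moves u1's index entry
```

The two writes per update are exactly the performance cost the text mentions; a batch job rebuilding `users_by_city` periodically is the eventual-consistency alternative.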
Applicability: BigTable-style databases.
(8) Composite Key Index

Composite keys are a very common technique that pays off greatly when the database keeps keys sorted. Concatenating a secondary sort field into the key lets you build a multidimensional index, much like the dimensionality reduction technique described earlier. For example, to access user statistics by region, we can design keys in the format (State:City:UserID). We can then traverse users grouped by State, then by City — provided the NoSQL database supports range queries over keys (as BigTable-style systems do):

SELECT Values WHERE state="CA:*"
SELECT Values WHERE city="CA:San Francisco*"
Composite Key Index
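The prefix scans above can be emulated over any sorted key space; here is a sketch using Python's `bisect` on a sorted list (the keys are invented examples in the State:City:UserID format):

```python
import bisect

# Composite keys (State:City:UserID) in a sorted key space support
# hierarchical range scans by key prefix.
keys = sorted([
    "CA:San Francisco:u1",
    "CA:San Francisco:u2",
    "CA:Los Angeles:u3",
    "NY:New York:u4",
])

def scan_prefix(keys, prefix):
    """Return all keys starting with `prefix` via binary search."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")  # just past the prefix range
    return keys[lo:hi]

ca_users = scan_prefix(keys, "CA:")                # whole state
sf_users = scan_prefix(keys, "CA:San Francisco:")  # one city
```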
Applicability: BigTable-style databases.
(9) Aggregation with Composite Keys

Composite keys are useful not only for indexing but also for distinguishing different types of data so as to support grouping. Suppose we have a massive stream of log records describing which sites users on the Internet arrived from. We need to count the unique visitors to each site; in a relational database the query might be:

SELECT count(distinct(user_id)) FROM clicks GROUP BY site
We can build the following data model in NoSQL:
Counting Unique Users using Composite Keys
Sorting the data by UserID brings each user's records together, so we can process one user's data in a single pass (a single user does not generate too many events) and deduplicate repeated sites (with a hash table or similar). An alternative technique is to keep one data entity per user and append each site visit to it — which, of course, costs some update performance.
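A sketch of the ordered-scan version of this count (the `site:user` key format and the click data are invented for illustration): sorting the composite keys makes duplicates adjacent, so unique visitors fall out of one pass with no per-site hash sets.

```python
# Composite keys for grouping: clicks keyed as "site:user_id" cluster each
# site's events together in sorted order, so counting unique visitors is a
# single ordered scan that skips adjacent duplicates.
clicks = ["siteA:u1", "siteA:u1", "siteA:u2", "siteB:u1", "siteB:u3", "siteB:u3"]

unique = {}
prev = None
for key in sorted(clicks):  # ordered scan; duplicate keys are adjacent
    if key != prev:
        site = key.split(":")[0]
        unique[site] = unique.get(site, 0) + 1
    prev = key
```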
Applicability: Ordered Key-Value stores, BigTable-style databases.
(10) Inverted Search - Direct Aggregation

This is more a data processing technique than a data modeling technique, yet it still influences the data model. The main idea is to use an index to find the data that meets some criteria, but to aggregate that data with full scans or direct lookups. Reusing the example above: we have many log records tracking Internet users and their visit sources. Assume each record carries a UserID, the user's category (Men, Women, Bloggers, etc.), the user's city, and the site visited. What we want is, for each user category, the number of unique users that satisfy some condition (visit source, city, and so on).

Clearly we first need to search out the users that meet the criteria, and an inverted index makes that easy — e.g. {Category -> [user IDs]} or {Site -> [user IDs]}. With such indexes we can intersect or union two or more UserID lists (which is easy and fast if the UserID lists are sorted). But producing a report grouped by user category is more troublesome if we use a statement like this:

SELECT count(distinct(user_id)) ... GROUP BY category

That SQL is inefficient because there is far too much data per category. To address this we can build a direct index {UserID -> [Categories]} and use it to generate the report:
Counting Unique Users using Inverse and Direct Indexes
Finally, note that issuing a random query for each UserID is very inefficient. The problem can be solved with batch query processing: for some set of users, we precompute the results for the different query conditions.
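Putting both halves together in a sketch (the sites, cities, and categories are invented sample data): inverted indexes select the matching user set, and the direct index aggregates the report without a full scan.

```python
# Inverted indexes select users matching the criteria; the direct index
# (user -> categories) then builds the per-category report.
by_site = {"siteA": {"u1", "u2"}, "siteB": {"u2", "u3"}}   # inverted indexes
by_city = {"Paris": {"u1", "u2"}, "Tokyo": {"u3"}}
categories_of = {"u1": ["Men"],                            # direct index
                 "u2": ["Women", "Bloggers"],
                 "u3": ["Men"]}

matching = by_site["siteA"] & by_city["Paris"]  # intersect the posting sets

report = {}
for uid in matching:
    for cat in categories_of[uid]:
        report[cat] = report.get(cat, 0) + 1
```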
Applicability: Key-Value stores, document databases, BigTable-style databases.
Hierarchy Modeling Techniques

(11) Tree Aggregation

A tree, or an arbitrary graph (after denormalization), can be serialized directly into a single record or document.

This is very efficient when the whole tree is fetched at once (for example, rendering a blog's tree of comments).

Search, and any access to individual entities within the tree, can be problematic.

In most NoSQL implementations, updates are expensive (compared with storing nodes independently).
Tree Aggregation
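As an illustration (the blog-post data is invented), the whole comment tree lives in one nested document, fetched in a single read; the flip side is that updating one deep node means rewriting the record:

```python
# Tree aggregation: the entire comment tree is one nested document.
post = {
    "title": "NoSQL modeling",
    "comments": [
        {"author": "alice", "text": "Nice post", "replies": [
            {"author": "bob", "text": "Agreed", "replies": []},
        ]},
    ],
}

def count_comments(nodes):
    """Walk the nested replies recursively."""
    return sum(1 + count_comments(c["replies"]) for c in nodes)
```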
Applicability: Key-Value stores, document databases.
(12) Adjacency Lists

An adjacency list is a straightforward graph representation: each node is a separate record containing the IDs of all its parents or children. This allows searching by a given parent or child node; traversing the graph, however, takes one query per hop. The technique is inefficient for breadth or depth traversals, and for retrieving the full subtree of a node.
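A sketch of the pattern and of its cost (node names are invented): each record stores only its parent, so fetching a subtree is hop-by-hop, one lookup per level.

```python
# Adjacency list: each node is its own record holding its parent ID.
nodes = {
    "root": {"parent": None},
    "a":    {"parent": "root"},
    "b":    {"parent": "root"},
    "a1":   {"parent": "a"},
}

def children(node_id):
    return [k for k, v in nodes.items() if v["parent"] == node_id]

def subtree(node_id):
    out = [node_id]
    for c in children(node_id):  # one extra query per hop: the inefficiency
        out.extend(subtree(c))
    return out
```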
Applicability: Key-Value stores, document databases.
(13) Materialized Paths
Materialized Paths help avoid recursive traversal (of tree structures, for example). The technique can also be seen as a variant of denormalization. The idea is to attach to each node the identifiers of all its ancestors or descendants, so that a node's ancestors and descendants are known without traversal:
Materialized Paths for eShop Category Hierarchy
This technique is very helpful for full-text search engines, because it lets a hierarchical structure be expressed as a flat document. In the figure above, all the items and subcategories under Men's Shoes can be retrieved with one short query — just the category name.

Materialized Paths can be stored as a collection of IDs or as a single string of concatenated IDs. The latter lets you search for a specific branch of the path with a regular expression. The figure below shows this variant (the path includes the node itself):
Query Materialized Paths using RegExp
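A sketch of the regular-expression variant (the category paths are invented examples): each record stores its full ancestry as one delimited string, so selecting a whole branch is a single pattern match, with no traversal.

```python
import re

# Materialized paths: each record carries its full path from the root.
items = [
    {"name": "Men's Shoes",    "path": "shop/men/shoes"},
    {"name": "Men's Sneakers", "path": "shop/men/shoes/sneakers"},
    {"name": "Women's Shoes",  "path": "shop/women/shoes"},
]

# Match the node itself and everything underneath it.
branch = re.compile(r"^shop/men/shoes(/|$)")
mens_shoes = [i["name"] for i in items if branch.match(i["path"])]
```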
Applicability: Key-Value stores, document databases, search engines.
(14) Nested Sets

Nested sets are a standard technique for tree structures. They are widely used in relational databases and apply equally well to Key-Value stores and document databases. The idea is to store the leaf nodes as an array and map every non-leaf node to a range of leaves, using start and end indexes, as in the figure below:
Modeling of eCommerce Catalog using Nested Sets
This structure is very efficient for immutable data: it takes little memory, and all the leaves under any node can be found quickly without tree traversal. However, inserts and updates carry a high performance cost, because adding a new leaf requires large-scale renumbering of the indexes.
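A minimal sketch of the structure (the catalog data is invented; ranges are half-open `[start, end)` for Python slicing): every non-leaf node is just an index range over the leaf array.

```python
# Nested sets: leaves live in an array; each category stores the
# [start, end) range of leaf indexes it covers.
leaves = ["boots", "sneakers", "sandals", "dresses", "skirts"]
categories = {
    "All":   (0, 5),
    "Shoes": (0, 3),
    "Women": (3, 5),
}

def items_in(category):
    start, end = categories[category]
    return leaves[start:end]  # one slice, no traversal
```

Inserting a new leaf in the middle would shift every range to its right, which is exactly the update cost described above.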
Applicability: Key-Value stores, document databases.
(15) Nested Documents Flattening: Numbered Field Names

Search engines basically work with flat documents, i.e., each document is a flat list of fields and values. Mapping business entities onto such documents can be challenging when an entity has complex internal structure — typically a hierarchical document, such as a document nested inside another document. Consider the following example:
Nested Documents Problem
Each business entity above is a resume, containing a person's name and a list of skills with skill levels. One obvious way to flatten such a hierarchical document is to create separate Skill and Level fields. That model can search for a person by skill or by level, but combined queries like the one shown tend to give false matches. (Chen Hao's note: because you cannot tell whether 'Excellent' applies to Math or to Poetry.)

A solution is given in [4.6] of the references: tag each field with a number, as Skill_i and Level_i, so that each pair can be searched together (the query below uses OR to enumerate all possible i):
Nested Document Modeling using Numbered Field Names
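A toy sketch of the numbered-field scheme (the resume data is invented; a real search engine would express the per-index check as an OR over `skill_i AND level_i` clauses):

```python
# Numbered field names: each (skill, level) pair gets its own index i,
# so a query can require skill_i and level_i to match on the SAME i.
doc = {"name": "Joe",
       "skill_1": "Math",   "level_1": "Low",
       "skill_2": "Poetry", "level_2": "Excellent"}

def has_pair(doc, skill, level, max_i=10):
    """True only if some index i pairs this skill with this level."""
    return any(doc.get(f"skill_{i}") == skill and doc.get(f"level_{i}") == level
               for i in range(1, max_i + 1))
```

Unlike the flat Skill/Level fields, `has_pair` correctly rejects ("Math", "Excellent"), because Math and Excellent never share an index.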
This approach does not really scale; it only increases code complexity and maintenance effort as the problems get more complex.

Applicability: search engines.
(16) Nested Documents Flattening: Proximity Queries

This technique, also described in [4.6], is another way to flatten hierarchical documents. It uses proximity queries to limit how far apart matching words may be. In the figure below, all skills and levels are placed in a single field called SkillAndLevel, and the query requires that "Excellent" and "Poetry" appear immediately next to each other:
Nested Document Modeling using Proximity Queries
Reference [4.3] describes a successful use of this technique in Solr.

Applicability: search engines.
(17) Batch Graph Processing

Graph databases such as neo4j are outstanding for exploring the neighborhood of a node, or the relationships between two or a small number of nodes. They are not efficient for processing large graphs as a whole, however, since performance and scalability at that scale are not what graph databases aim for. Distributed graph processing can instead be done with MapReduce and the Message Passing pattern.
That is all of "What is the use of NoSQL data modeling techniques?" Thank you for reading! I hope this article has helped; if you want to learn more, welcome to follow our industry information channel.