2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article presents a sample analysis of XML indexing technology in a relational database engine. The editor finds it practical and shares it here for reference.
XML (Extensible Markup Language) has become the standard for data representation and data exchange in Web applications. With the rapid development of the Internet, and especially the widespread use of e-commerce, Web services, and similar applications, XML has become a mainstream data format. As a result, XML data management, and XML query technology in particular, has become an active research topic.
Compared with relational data, XML has many advantages, but its biggest disadvantage is efficiency. In a relational data file a field name needs to appear only once, whereas in an XML file element names are repeated at every occurrence, which inevitably slows queries. To improve XML query efficiency as far as possible, the XML type needs index support.
The World Wide Web Consortium (W3C) published XPath 2.0 and XQuery 1.0 as Recommendations on January 23, 2007, ending the earlier competition among query languages. Based on these standards, traditional vendors and various research institutions have proposed implementations of XPath and XQuery (more than ten are mentioned in the literature). Their storage models and query algorithms differ, and each optimization approach has its own strengths. Against this background, Dameng Database Company has put forward its own XML query engine model in line with its development strategy, and the engine is currently under intensive development. Building effective indexes on XML data is an important factor in XML query performance. Based on an in-depth analysis of the index technology in existing database products, a more suitable index structure is being designed for the Dameng XML query engine so that the engine can perform better.
Brief introduction to XML index technology
At present, research on XML falls into two main directions. One is the native XML database for storing, querying, and managing semi-structured data such as XML, in which data and metadata are represented entirely in XML structures, independent of the underlying storage format (such as an object model or a relational model). The other is mapping between XML and relational databases, which uses mature relational database technology to process XML data. Because the latter direction is of more practical significance, it has become the focus of XML research.
Besides the storage scheme, index technology is one of the most important factors that determine the performance of a database system. If no index is built on an XML document, any query against the XML data is likely to traverse the entire document tree, which becomes intolerable as the XML dataset grows. Research on XML index technology therefore has high theoretical and practical value.
Although traditional indexing technology has matured through long-term accumulation, it focuses mainly on locating data records by value (rather than by patterns of structural relationships) and pays little attention to the logical relationships between records. The basic feature of an XML query, by contrast, is that it extracts the data matching an input pattern, that is, a structural relationship described as a regular path expression. The main task of an XML index is therefore to provide techniques suited to pattern matching.
XML index classification
Path-based XML index
A path-based index is built from the path information of the nodes in the XML tree. It applies a reduction so that the reduced tree keeps only distinct paths, and no two nodes in it have the same path. Proposed indexes of this kind include DataGuides, Index Fabric, and the Adaptive Path Index for XML Data (APEX).
A DataGuides index is a concise structural summary of the paths that start from the root node: each string path formed by concatenating edge labels is described only once. DataGuides reduce the number of nodes that must be traversed to answer a path query, which is effective for queries that navigate an XML document from the root. However, for path queries with wildcards or with the descendant-or-self axis defined in the XPath standard, query efficiency is low and the index contains redundant data.
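The idea of describing each root path only once can be sketched in a few lines of Python. This is a simplified path summary in the spirit of DataGuides for tree-shaped data, not the published DataGuides construction; the function name `build_path_summary` and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def build_path_summary(root):
    """Map each distinct root-to-node label path to the elements it reaches."""
    summary = {}
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        # Every label path appears once as a key, however many nodes share it.
        summary.setdefault(path, []).append(node)
        for child in node:
            stack.append((child, path + (child.tag,)))
    return summary

# Hypothetical sample document.
doc = ET.fromstring(
    "<lib><book><title>A</title></book>"
    "<book><title>B</title><year>2005</year></book></lib>"
)
summary = build_path_summary(doc)

# A query anchored at the root is a single dictionary lookup.
titles = sorted(e.text for e in summary[("lib", "book", "title")])
```

A query such as `//title`, by contrast, would have to scan every key of `summary` for paths ending in `title`, which illustrates why descendant-or-self queries are the weak point of this kind of index.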
Index Fabric is an index structure derived from the Patricia trie. It encodes the label path to each element node as a string and inserts these encoded strings into a Patricia trie, turning a path query over XML data into a string lookup. At query time, the query path is first encoded as a string and then searched for in the index tree. The advantages of Index Fabric are that it stores the hierarchical structure information of the XML data, handles XML data with and without schema information uniformly, and makes the time needed to query and update XML data depend on the hierarchy rather than on the length of the index keys. Its disadvantage is that it loses the structural relationships between element nodes, because it retains information only for element nodes that have text values. Like DataGuides, Index Fabric is therefore inefficient at processing partially matching query expressions that use the descendant-or-self axis defined in the XPath standard.
To address this, APEX [14] exploits the distribution of XML queries: the label nodes corresponding to frequently issued queries are pre-stored in a hash structure. It works much like a cache: when a new query arrives, the hash table is searched first for a node set that satisfies it. However, APEX is inefficient at processing query expressions that involve element or attribute values.
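The encode-then-look-up idea can be illustrated with a small Python sketch. It makes several simplifying assumptions: a plain character trie stands in for the compressed Patricia trie, each tag is mapped to a one-character designator in order of first use, and the class `PathTrie` and function `index_document` are hypothetical names, not part of Index Fabric itself.

```python
import xml.etree.ElementTree as ET

class PathTrie:
    """Plain character trie over encoded label paths (not a Patricia trie)."""

    END = "$end"  # multi-character marker, cannot collide with a one-character edge

    def __init__(self):
        self.root = {}
        self.designators = {}  # tag name -> one-character code (assumed scheme)

    def _code(self, tag):
        if tag not in self.designators:
            self.designators[tag] = chr(ord("A") + len(self.designators))
        return self.designators[tag]

    def encode(self, tags, text):
        """Encode a root-to-element tag path plus its text value as one key."""
        return "".join(self._code(t) for t in tags) + "\x00" + text

    def insert(self, key, payload):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node.setdefault(self.END, []).append(payload)

    def lookup(self, key):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return []
        return node.get(self.END, [])

def index_document(trie, root):
    """Insert one key for every element that carries a text value."""
    stack = [(root, [root.tag])]
    while stack:
        elem, tags = stack.pop()
        text = (elem.text or "").strip()
        if text:
            trie.insert(trie.encode(tags, text), elem)
        for child in elem:
            stack.append((child, tags + [child.tag]))

doc = ET.fromstring(
    "<lib><book><title>A</title></book>"
    "<book><title>B</title><year>2005</year></book></lib>"
)
trie = PathTrie()
index_document(trie, doc)
hits = trie.lookup(trie.encode(["lib", "book", "title"], "B"))
```

Only elements with text are indexed, so the sketch also shows the limitation described above: the trie holds no entry for `book` or `lib` themselves, and a partial path such as `//title` cannot be encoded as a single key.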
Node-based index
A node-based index essentially decomposes the XML data into a set of records, one per data unit, and stores in each record the position of that unit within the XML data. Unlike path-based indexes, node-based indexes remove the restriction that nodes must be located through label paths, and they decompose XML data into regular node records. Because they preserve node position information and integrate well into mature relational database management systems, they are currently the most widely used kind of index.
Depending on how location information is encoded, node-based indexes can generally be divided into the following categories:
1. Prefix-based index
The prefix-based index is mainly generated by Dewey [12] coding. The ORDPATH coding of reference [13] adopts a similar approach and also gives a method for compressing ORDPATH codes; it has been applied to index organization in SQL Server 2005.
The basic idea of prefix coding is to use the code of a node's parent directly as the prefix of the node's own code. To decide whether a node v is a descendant of another node u, it is therefore enough to check whether the code of u is a prefix of the code of v. An important property of prefix codes is that they are lexicographically ordered: for any node u in the subtree rooted at node r, the code c(u) is greater than the codes of all nodes in the left sibling subtrees of r and less than the codes of all nodes in its right sibling subtrees. A prefix-based index can therefore efficiently support both containment tests and document-order comparisons.
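A minimal Python sketch of Dewey-style prefix coding follows. Codes are represented as tuples of integers rather than dotted strings, which is a choice made here so that Python's built-in tuple comparison gives the lexicographic order directly; the function names and the sample tree are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def assign_dewey(root):
    """Return {code: tag}; a child's code is its parent's code plus its sibling position."""
    codes = {}
    stack = [(root, (1,))]
    while stack:
        node, code = stack.pop()
        codes[code] = node.tag
        for position, child in enumerate(node, start=1):
            stack.append((child, code + (position,)))
    return codes

def is_ancestor(u, v):
    """u is an ancestor of v exactly when u's code is a proper prefix of v's code."""
    return len(u) < len(v) and v[:len(u)] == u

doc = ET.fromstring("<a><b><d/></b><c/></a>")
codes = assign_dewey(doc)

# Sorting the codes lexicographically yields document order.
document_order = [codes[code] for code in sorted(codes)]
```

Comparing integer tuples avoids a pitfall of comparing dotted strings character by character, where "1.10" would sort before "1.2".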
2. Index based on interval coding
For the interval-coded index, each node in the tree T is assigned an interval code [start, end] such that the interval of a node contains the intervals of all its descendants. That is, node u in T is an ancestor of node v if and only if start(u) < start(v) and end(v) < end(u).
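One common way to obtain such intervals is to number the nodes with a single counter during a depth-first traversal, taking the start value on entry and the end value on exit. The Python sketch below assumes that numbering scheme and assumes the containment test start(u) < start(v) and end(v) < end(u); the function names are illustrative.

```python
import xml.etree.ElementTree as ET

def assign_intervals(root):
    """Return {element: (start, end)} using one counter over a depth-first walk."""
    intervals = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        start = counter          # number taken when the node is entered
        for child in node:
            visit(child)
        counter += 1             # number taken when the node is left
        intervals[node] = (start, counter)

    visit(root)
    return intervals

def is_ancestor(u, v):
    """u contains v when u's interval strictly encloses v's interval."""
    return u[0] < v[0] and v[1] < u[1]

doc = ET.fromstring("<a><b><d/></b><c/></a>")
intervals = assign_intervals(doc)
a, b, c, d = doc, doc.find("b"), doc.find("c"), doc.find("b/d")
```

With this numbering the root `a` receives (1, 8) and the leaf `d` receives (3, 4), so the ancestor test reduces to two integer comparisons regardless of how far apart the nodes are in the tree.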
The earliest interval coding scheme is Dietz coding, in which each node in the tree T is assigned a pair consisting of its preorder traversal number and its postorder traversal number. Since an ancestor u must appear before its descendant v in a preorder traversal and after it in a postorder traversal, u is an ancestor of v if and only if pre(u) < pre(v) and post(v) < post(u).
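The preorder and postorder numbers can be computed in one traversal, and together they classify the relationship between any two distinct nodes. The Python sketch below is an illustration under the assumption that the ancestor test is pre(u) < pre(v) and post(v) < post(u); the names `assign_pre_post` and `relation` are hypothetical.

```python
import xml.etree.ElementTree as ET

def assign_pre_post(root):
    """Return two dicts mapping each element to its preorder and postorder number."""
    pre, post = {}, {}
    pre_counter = post_counter = 0

    def visit(node):
        nonlocal pre_counter, post_counter
        pre_counter += 1
        pre[node] = pre_counter       # numbered before its children
        for child in node:
            visit(child)
        post_counter += 1
        post[node] = post_counter     # numbered after its children

    visit(root)
    return pre, post

def relation(u, v, pre, post):
    """Describe how distinct node u relates to node v."""
    if pre[u] < pre[v] and post[v] < post[u]:
        return "ancestor"
    if pre[v] < pre[u] and post[u] < post[v]:
        return "descendant"
    # Neither contains the other: document order decides.
    return "preceding" if pre[u] < pre[v] else "following"

doc = ET.fromstring("<a><b><d/></b><c/></a>")
pre, post = assign_pre_post(doc)
a, b, c, d = doc, doc.find("b"), doc.find("c"), doc.find("b/d")
```

In the sample tree `b` has numbers (pre 2, post 2) and `c` has (pre 4, post 3), so `b` is neither above nor below `c` and is reported as preceding it.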
Another typical example of an interval-coded index is the XISS index, which assigns each node a pair of numbers (order, size), where order is the extended preorder code and size is the range covered by the node's descendants. For any nodes x and y in a document tree, x is an ancestor of y if and only if order(x) < order(y) <= order(x) + size(x).
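A rough Python sketch of an (order, size) labeling with spare room for later insertions follows. The gap-based numbering, the `gap` parameter, and the function names are assumptions made for illustration and are not the published XISS algorithm; the ancestor test assumed is order(x) < order(y) <= order(x) + size(x).

```python
import xml.etree.ElementTree as ET

def assign_order_size(root, gap=10):
    """Return {element: (order, size)} leaving unused numbers inside each range."""
    labels = {}
    next_order = 0

    def visit(node):
        nonlocal next_order
        next_order += gap
        order = next_order
        for child in node:
            visit(child)
        next_order += gap                 # slack reserved under this node
        labels[node] = (order, next_order - order)

    visit(root)
    return labels

def is_ancestor(x, y):
    """x and y are (order, size) pairs; y must fall inside x's range."""
    (order_x, size_x), (order_y, _) = x, y
    return order_x < order_y <= order_x + size_x

doc = ET.fromstring("<a><b><d/></b><c/></a>")
labels = assign_order_size(doc)
a, b, c, d = doc, doc.find("b"), doc.find("c"), doc.find("b/d")

# A node inserted later under d can take an unused number without relabeling.
new_child_of_d = (35, 0)
```

Because the numbers between 30 and 40 are unused in this example, the hypothetical new child of `d` receives order 35 and every existing label stays valid, which is the practical benefit of leaving gaps in the preorder numbering.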
The XISS index decomposes the original query into sub-expressions, evaluates each sub-expression separately, and then joins the intermediate results to obtain the final result set. This gives better support for queries with wildcards. However, although the approach handles wildcards, joining the intermediate results can be time-consuming, especially for simple expressions with long paths.
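The decompose-and-join strategy can be sketched as follows. For brevity the sketch uses (start, end) interval labels rather than XISS (order, size) pairs and a naive nested-loop join rather than an optimized join algorithm; the function names and the hard-coded label table are assumptions for illustration only.

```python
def structural_join(ancestors, descendants):
    """Keep each descendant interval enclosed by at least one ancestor interval."""
    return [
        d for d in descendants
        if any(a[0] < d[0] and d[1] < a[1] for a in ancestors)
    ]

def evaluate_descendant_path(tags, intervals_by_tag):
    """Evaluate tag1//tag2//...//tagN with one join per step of the path."""
    result = intervals_by_tag.get(tags[0], [])
    for tag in tags[1:]:
        # Every step joins the previous intermediate result with a new node list.
        result = structural_join(result, intervals_by_tag.get(tag, []))
    return result

# Hypothetical labels for the tree <a><b><d/></b><c/></a>.
intervals_by_tag = {
    "a": [(1, 8)],
    "b": [(2, 5)],
    "c": [(6, 7)],
    "d": [(3, 4)],
}

matches = evaluate_descendant_path(["a", "b", "d"], intervals_by_tag)
no_matches = evaluate_descendant_path(["c", "d"], intervals_by_tag)
```

A path with N steps needs N - 1 joins, and each join touches the whole intermediate result, which is why long paths with large intermediate results are the costly case for this strategy.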
Comparison of two indexing mechanisms
A path-based index relies mainly on node merging: using node equivalence and path equivalence, it produces an index structure much smaller than the original document. The structure is still a tree, however, so processing a query basically still requires traversing the entire index tree. Path-based indexes support simple path expressions well but are not very effective for regular path expressions.
A node-based index labels each node with a code, and the structural relationship between two nodes can be determined from their codes in constant time, so it supports regular path expressions well. For long path expressions, however, and especially when a query produces many intermediate results, the join operations on the node index are expensive.
Path-based and node-based indexes each have advantages and disadvantages, and they can complement each other. In practice, node-based indexes are currently more widely used and more thoroughly researched. Dameng's research on XML index structures is therefore based mainly on the node-based index, borrowing from path-based indexes where appropriate to improve it.
Thank you for reading! This is the end of the article "Sample Analysis of XML Index Technology in a Relational Database Engine". I hope the content above has been helpful. If you found the article useful, feel free to share it so that more people can see it.
© 2024 shulou.com SLNews company. All rights reserved.