
Example Analysis of XML processing method VTD-XML

2025-07-03 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/03 Report

This article walks through an example analysis of VTD-XML, an XML processing method. It is practical material, shared here for reference.

problem

When we talk about using XML, the biggest headaches are usually its verbosity and its parsing speed, and both become especially serious with large XML files. The topic here is how to optimize XML processing speed.

When we choose to process XML files, we have roughly two choices:

DOM, which is the W3C standard model, constructs XML structure information in a tree form and provides interfaces and methods for traversing the tree.

SAX, a low-level parser, reads forward through the document element by element and keeps no structural information.

Both options have pros and cons, but neither is a particularly good solution. Their pros and cons are as follows:

DOM

Advantages: Easy to use. All of the XML structure information is held in memory, so traversal is simple and XPath is supported.

Disadvantages: Parsing is slow and memory usage is high (5x to 10x the size of the original file), so it is almost unusable for large files.

SAX

Advantages: Parsing is fast, and memory consumption does not depend on the size of the XML (the document can grow without memory use growing).

Disadvantages: Poor usability. There is no structure information, so you cannot traverse the document, and XPath is not supported. If you need structure, you have to build it yourself piece by piece as you read, so maintainability is particularly poor.

As we can see, DOM and SAX sit at opposite extremes, and neither fits most of our requirements well, so we need another way to handle XML. Note that the efficiency problem is not a problem with XML itself but with how XML is parsed, as the different efficiency trade-offs of the two approaches above show.
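To make the contrast concrete, here is a minimal sketch that counts elements in the same small document with both parsers, using only the JDK's built-in `javax.xml.parsers` APIs. The sample document and the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class DomVersusSax {
    // Hypothetical sample document used only for this illustration.
    static final String XML =
            "<orders><order id=\"1\"/><order id=\"2\"/><order id=\"3\"/></orders>";

    // DOM: the whole tree is built in memory first, then queried freely.
    static int countWithDom(byte[] xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        return doc.getElementsByTagName("order").getLength();
    }

    // SAX: events stream past once; any state we need must be kept by hand.
    static int countWithSax(byte[] xml) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("order".equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = XML.getBytes(StandardCharsets.UTF_8);
        System.out.println("DOM count: " + countWithDom(bytes));
        System.out.println("SAX count: " + countWithSax(bytes));
    }
}
```

Both methods return the same count, but the DOM version can go on to navigate the tree afterwards, while the SAX version has nothing left once parsing ends except the counter it maintained itself.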

thinking

We like DOM-style usage because it lets us traverse the document, which in turn makes XPath support possible and greatly improves ease of use. But DOM is inefficient, and as we already know, the efficiency problem lies in the processing mechanism. So what exactly is it about DOM that hurts its efficiency? Let's do a full autopsy:

On most platforms today that are built on a virtual machine (managed runtimes or any similar mechanism), creating and destroying objects is expensive, with most of the time going to garbage collection. The large number of objects that the DOM mechanism creates and destroys is undoubtedly one reason for its inefficiency, because it causes excessive garbage collection.

Each object needs an extra 32 bits to store its memory address, which is a significant cost when there are as many objects as DOM creates.

The root cause of the two problems above is that DOM and SAX are both extractive parsers, which means both are bound to create (and destroy) a large number of objects. Extractive parsing means that, while parsing, DOM or SAX extracts a portion of the original file (usually as a string) and then processes it in memory, so the output is naturally one or more objects. DOM, for example, turns every element, attribute, processing instruction, comment, and so on into an object and links those objects into a structure. That is extractive parsing.

Another problem caused by extractive parsing is update efficiency. In DOM (SAX is not even worth mentioning here, since it does not support updates), every change requires serializing the object model back into an XML string. Note that this is a complete serialization: the original file is not reused at all, and the whole DOM model is written back out as an XML string. In other words, DOM does not support incremental update.
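The following sketch shows what such an update looks like with the JDK's own DOM and `javax.xml.transform` APIs: one element is removed, and the only way to get XML back is to serialize the entire tree. The sample document and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

final class DomUpdate {
    // Removes the second <order> element, then serializes the WHOLE tree again.
    static String removeSecondOrder(String xml) throws Exception {
        // Step 1: decode and parse the complete document into objects.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // Step 2: the actual change is tiny.
        Node second = doc.getElementsByTagName("order").item(1);
        second.getParentNode().removeChild(second);

        // Step 3: every node is written out again; the original text is not reused.
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(removeSecondOrder(
                "<orders><order id=\"1\"/><order id=\"2\"/><order id=\"3\"/></orders>"));
    }
}
```

The cost of steps 1 and 3 grows with the size of the document, no matter how small the change in step 2 is.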

Another "minor" problem that easily goes unnoticed is XML encoding. Whatever parsing method is used has to handle the document's encoding, that is, decode when reading and encode when writing. This is a further efficiency problem for DOM: even when I want to make only a small change to a large XML file, it must first decode the entire file and then build the structure. That is one more hidden cost.

Let's summarize. DOM's efficiency problem comes mainly from its extractive parsing model (SAX shares the same problem), which leads to a series of related problems. If these bottlenecks can be removed, XML processing efficiency should improve considerably. And if both the ease of use and the processing efficiency of XML improve greatly, the range of applications and usage patterns for XML will broaden, which may lead to XML-based products nobody has thought of before.

way out

VTD-XML is a non-extractive XML parser. Thanks to its design it solves (or avoids) all of the problems raised above, and it also "incidentally" brings the other benefits of the non-extractive approach, such as fast parsing and traversal, XPath support, and incremental update. Here are some figures taken from the official VTD-XML website:

VTD-XML parses at 1.5x to 2.0x the speed of SAX with a NULL content handler. "With a NULL content handler" means that no additional processing logic is plugged into SAX parsing, that is, SAX running at its maximum speed.

The memory footprint of VTD-XML is 1.3x to 1.5x the size of the original XML (1.0x for the original XML itself and 0.3x to 0.5x for the VTD-XML records), while the memory footprint of DOM is 5x to 10x the size of the original XML. For example, if an XML file is 50 MB, reading it with VTD-XML takes between 65 MB and 75 MB of memory, while DOM takes between 250 MB and 500 MB. Given these figures, using DOM to process large XML files is almost never a realistic option.

You might wonder, can you really make XML parsers that are easier to use than DOM and faster than SAX? Don't jump to conclusions, let's take a look at how VTD-XML works!

basic principle

Like most good products, VTD-XML isn't complicated; it's clever. To be non-extractive, it reads the original XML file into memory intact, in binary form and without decoding it, then parses this byte array to find the position of each element and records some information about it. After that, traversal is carried out on these saved records. If you need to extract XML content, the position information in a record is used to decode the relevant part of the original byte array and return a string. It all looks simple, but this simple process has several performance details built into it and hides several capabilities. Let's start with the individual performance details:
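The idea of recording positions instead of extracting strings can be sketched in a few lines. The toy scanner below is not VTD-XML's parser. It assumes a simple, well-formed, ASCII-compatible document with no comments or CDATA sections, and the class and method names are invented for this illustration. It records only an offset and a length for each start-tag name and creates a string only when one is requested.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class OffsetScanner {
    private static boolean isSpace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    // Returns {offset, length} pairs, flattened into one int array,
    // for the name of every start tag found in the raw bytes.
    static int[] scanStartTagNames(byte[] doc) {
        int[] records = new int[16];
        int n = 0;
        for (int i = 0; i < doc.length - 1; i++) {
            byte next = doc[i + 1];
            // Skip end tags, processing instructions, and declarations.
            if (doc[i] != '<' || next == '/' || next == '?' || next == '!') {
                continue;
            }
            int start = i + 1;
            int end = start;
            while (end < doc.length && doc[end] != '>' && doc[end] != '/'
                    && !isSpace(doc[end])) {
                end++;
            }
            if (n + 2 > records.length) {
                records = Arrays.copyOf(records, records.length * 2);
            }
            records[n++] = start;        // where the name begins
            records[n++] = end - start;  // how many bytes it spans
            i = end - 1;
        }
        return Arrays.copyOf(records, n);
    }

    // Decoding happens lazily, only for the token the caller actually wants.
    static String decode(byte[] doc, int offset, int length) {
        return new String(doc, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<a><b x=\"1\">hi</b><c/></a>".getBytes(StandardCharsets.UTF_8);
        int[] r = scanStartTagNames(doc);
        for (int i = 0; i < r.length; i += 2) {
            System.out.println(decode(doc, r[i], r[i + 1])
                    + " at offset " + r[i] + ", length " + r[i + 1]);
        }
    }
}
```

Scanning allocates nothing per token beyond slots in a primitive array, and the original bytes stay untouched for later use.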

To avoid excessive object creation, VTD-XML uses a primitive numeric type as its record type, so no heap object is needed for each record. The record mechanism of VTD-XML is called VTD (Virtual Token Descriptor). VTD removes the performance bottleneck in the tokenization stage, and it is a very clever and careful approach. A VTD is a 64-bit number that records the offset, length, depth, and token type of a token.

Note that a VTD is fixed-length (officially 64 bits). The purpose is performance: because the length is fixed, reading and querying a record is especially efficient (O(1)), and VTDs can be stored in an efficient structure such as an array, which greatly reduces the performance problems caused by heavy use of objects.
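A fixed-length record of this kind can be sketched by packing the four fields into one `long`. The field widths below (4 bits for type, 8 for depth, 20 for length, 32 for offset) are an assumption chosen so the fields add up to 64 bits; they are not taken from the article and may differ from the library's real layout. The class and method names are hypothetical.

```java
final class VtdRecord {
    // Illustrative layout (an assumption), from high bits to low bits:
    // [ type: 4 ][ depth: 8 ][ length: 20 ][ offset: 32 ]
    static long pack(int type, int depth, int length, int offset) {
        return ((long) (type & 0xF) << 60)
                | ((long) (depth & 0xFF) << 52)
                | ((long) (length & 0xFFFFF) << 32)
                | (offset & 0xFFFFFFFFL);
    }

    static int type(long record) {
        return (int) (record >>> 60) & 0xF;
    }

    static int depth(long record) {
        return (int) (record >>> 52) & 0xFF;
    }

    static int length(long record) {
        return (int) (record >>> 32) & 0xFFFFF;
    }

    static int offset(long record) {
        return (int) record;
    }

    public static void main(String[] args) {
        // All records live in one primitive array: no object per token,
        // and record i is found in O(1) by plain array indexing.
        long[] records = {
            pack(0, 0, 6, 1),   // hypothetical token: type 0, depth 0
            pack(0, 1, 5, 9),   // hypothetical token: type 0, depth 1
        };
        long r = records[1];
        System.out.println("type=" + type(r) + " depth=" + depth(r)
                + " length=" + length(r) + " offset=" + offset(r));
    }
}
```

Because every record has the same size, a whole document's tokens fit in a single `long[]`, which the garbage collector treats as one object.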

VTD's superpower (no exaggeration) is that it turns operations on XML's tree-shaped data into simple operations on a byte array. Any operation you can imagine on a byte array can be applied to the XML. This works because the XML is read in as binary (a byte array) and each VTD records the location of a token along with other access information. Once we have found the VTD we want, its offset and length let us operate directly on the original byte array, or we can operate on the VTD itself. For example, to find an element in a large XML file and delete it, I only need to locate the VTD of that element (traversal is discussed later), delete that VTD from the VTD array, and then use the remaining VTDs to write out a new byte array. Because the deleted VTD marked the position of the element to remove, that element does not appear in the newly written byte array. Writing the new byte array from the VTDs is really just a byte-array copy, which is very efficient. This is called incremental update.
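The deletion described above boils down to copying the bytes before and after a recorded range. Here is a minimal sketch of that copy step only. It is not VTD-XML's API; the class and method names are hypothetical, and the offset and length are written in by hand where a real program would take them from a VTD record found by traversal.

```java
import java.nio.charset.StandardCharsets;

final class ByteRangeDelete {
    // Returns a copy of doc without the bytes in [offset, offset + length).
    // Nothing is decoded or re-serialized; the rest is copied verbatim.
    static byte[] deleteRange(byte[] doc, int offset, int length) {
        byte[] out = new byte[doc.length - length];
        System.arraycopy(doc, 0, out, 0, offset);
        System.arraycopy(doc, offset + length, out, offset,
                doc.length - offset - length);
        return out;
    }

    public static void main(String[] args) {
        byte[] doc = "<orders><order id=\"1\"/><order id=\"2\"/><order id=\"3\"/></orders>"
                .getBytes(StandardCharsets.UTF_8);
        // Hand-computed for this sample: the second <order .../> element
        // starts at byte 23 and is 15 bytes long.
        byte[] updated = deleteRange(doc, 23, 15);
        System.out.println(new String(updated, StandardCharsets.UTF_8));
    }
}
```

The work is two array copies whatever the document contains, which is why this style of update is cheap compared with rebuilding and re-serializing a tree.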

As for traversal, VTD-XML uses LC (Location Cache), which is, simply put, a tree-like table structure built from VTDs and organized by depth. Each LC entry is also 64 bits long: the first 32 bits hold the index of a VTD, and the last 32 bits hold the index of that VTD's first child. From this information you can compute any location you want to reach. For the specific traversal methods, see the article on the official website. Because it is built on this traversal scheme, VTD-XML has a different interface from DOM, which is understandable. This traversal can take you where you need to go in very few steps, and its performance is outstanding.
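An LC entry as described here can be sketched the same way as a VTD record: two 32-bit indexes packed into one `long`. The class and method names, and the use of -1 to mean "no child", are assumptions made for this illustration and are not taken from the article.

```java
final class LocationCacheEntry {
    static final int NO_CHILD = -1; // assumed sentinel for a leaf element

    // Upper 32 bits: index of the element's VTD record.
    // Lower 32 bits: index of its first child, as the article describes.
    static long pack(int vtdIndex, int firstChildIndex) {
        return ((long) vtdIndex << 32) | (firstChildIndex & 0xFFFFFFFFL);
    }

    static int vtdIndex(long entry) {
        return (int) (entry >>> 32);
    }

    static int firstChildIndex(long entry) {
        return (int) entry;
    }

    public static void main(String[] args) {
        // Hypothetical entries for two sibling elements at the same depth.
        long[] level = {
            pack(1, 0),         // element at VTD index 1, first child at index 0
            pack(7, NO_CHILD),  // element at VTD index 7, no children
        };
        for (long entry : level) {
            System.out.println("vtd=" + vtdIndex(entry)
                    + " firstChild=" + firstChildIndex(entry));
        }
    }
}
```

Moving from an element to its first child is then a single array lookup followed by a 32-bit extraction, with no tree of node objects involved.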
