How to use DOM and XPath for effective XML processing in Java

2025-04-18 Update. From: SLTechnology News&Howtos (Shulou.com)

This article explains how to use DOM and XPath to process XML effectively in Java. It covers the design of DOM, the coding problems it commonly leads to, and a few helper functions that make DOM easier to use.

Document object model

The DOM specification is designed to work with any programming language. Therefore, it attempts to use a common, core set of features that are available in all languages. The DOM specification also tries to keep its interface definitions language-neutral. This allows Java programmers to apply their DOM knowledge when using Visual Basic or Perl, and vice versa.

The specification also treats each part of the document as a node made up of a type and a value. This provides a conceptual framework for dealing with all aspects of the document. For example, the following XML snippet:

<paragraph align="left">The <it>Italicized</it> portion.</paragraph>

is represented by the following DOM structure:

Figure 1: DOM representation of an XML document

Each Document, Element, Text, and Attr part of the tree is a DOM Node.

The abstraction does come at a price. Consider the XML fragment <tagname>Value</tagname>. You might think that the text value could be represented by a normal Java String object and be accessible through a simple getValue call. In fact, the text is treated as one or more child Nodes under the tagname node. Therefore, to get the text value, you need to traverse the child nodes of tagname, collating each value into a string. There is a good reason for this: tagname may contain other embedded XML elements, in which case getting its text value does not make much sense. In the real world, however, we see frequent coding errors that result from the lack of a convenience function for the 80% of cases where it does make sense.

Design problem

The disadvantage of DOM's language independence is that the working methods and patterns that are customary in each programming language cannot be used. For example, you cannot create a new Element with the familiar Java new construct; developers must use factory methods instead. A collection of Nodes is represented as a NodeList rather than the usual List or Iterator object. These minor inconveniences mean unusual coding practices and additional lines of code, and they force programmers to learn the DOM way of doing things rather than the intuitive way.
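As a rough illustration of these two inconveniences, the sketch below builds a small document with factory methods and walks a NodeList with an indexed loop. It assumes the JAXP parser bundled with the JDK; the class and method names (DomFactoryDemo, buildSample, childNames) and the element names are hypothetical and do not come from the article's sample code.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomFactoryDemo {
    /** Builds <config><header/><body/><footer/></config> using only factory methods. */
    static Document buildSample() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
        // There is no "new Element(...)": elements are created by the owning Document.
        Element root = doc.createElement("config");
        doc.appendChild(root);
        for (String name : new String[] {"header", "body", "footer"}) {
            root.appendChild(doc.createElement(name));
        }
        return doc;
    }

    /** Joins the names of the child nodes with commas. */
    static String childNames(Element parent) {
        NodeList children = parent.getChildNodes();
        StringBuilder names = new StringBuilder();
        // NodeList is neither a List nor Iterable, so an indexed loop is required.
        for (int i = 0; i < children.getLength(); i++) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(children.item(i).getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(childNames(buildSample().getDocumentElement()));
    }
}
```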

DOM uses the abstraction that everything is a node. This means that almost every part of an XML document, such as Document, Element, and Attr, extends the Node interface. This is not only conceptually elegant, but also allows each DOM implementation to expose its own classes through the standard interfaces without the performance loss caused by intermediate wrapper classes.

Because of the number of node types and the lack of consistency in their access methods, the "everything is a node" abstraction loses some of its meaning. For example, the insertData method is used to change the value of a CharacterData node, while the setValue method is used to set the value of an Attr (attribute) node. Because different nodes have different interfaces, the consistency and elegance of the model are reduced, while the learning curve is increased.

JDOM

JDOM is an effort to adapt the DOM API to Java, providing a more natural and easy-to-use interface. Recognizing the awkward nature of the language-independent DOM constructs, JDOM aims to use built-in Java representations and objects, and to provide convenience functions for common tasks.

For example, JDOM directly addresses the "everything is a node" abstraction and the use of DOM-specific constructs (such as NodeList). JDOM defines the different node types (such as Document, Element, and Attribute) as different Java classes, which means that developers can construct them with new and avoid frequent type casts. JDOM represents strings as Java Strings and represents collections of nodes through the normal List and Iterator classes. (JDOM replaces the DOM classes with its own classes.)

JDOM has done quite useful work in providing a better interface. It has been accepted as a JSR (a formal Java Specification Request), and it is likely to be included in the core Java platform in the future. However, because it is not yet part of the core Java API, some people are hesitant to use it. There are also reports of performance issues related to the frequent creation of Iterator and other Java objects (see Resources).

If you are satisfied with the acceptance and availability of JDOM, and you have no immediate need to move Java code and programmers to other languages, JDOM is a good choice to explore. JDOM did not meet the needs of the company behind the projects discussed in this article, so those projects used the widely available DOM. This article does the same.

Common coding problems

Analyses of several large XML projects reveal some common problems in using DOM. Several of them are introduced below.

Code bloat

All the projects we looked at in our study shared one prominent problem: they spent many lines of code to do simple things. In one example, 16 lines of code were used to check the value of an attribute. The same task, with improved robustness and error handling, can be implemented in three lines of code. The low-level nature of the DOM API, incorrect application of methods and programming patterns, and incomplete knowledge of the API all increase the number of lines of code. The following sections describe specific examples of these issues.

Traversing the DOM

In the code we examined, the most common task is to traverse or search the DOM. Listing 1 is a condensed version of code that needs to find a node called "header" in the config section of the document.

In Listing 1, the document is traversed from the root by retrieving the top element, getting its first child node (configNode), and finally checking the child nodes of configNode one by one. Unfortunately, this approach is not only verbose, but also fragile and prone to errors.
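A minimal sketch of the kind of traversal described here is shown below. It is written from the description in the text rather than copied from the original listing, and it is deliberately fragile. The class name FragileTraversal, the parse helper, and the exact statements are assumptions; the sample, config, and header element names come from the text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FragileTraversal {
    /** Hypothetical helper: parses an XML string with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Looks for a header element under config. Fragile on purpose. */
    static boolean hasHeader(Document doc) {
        Element root = doc.getDocumentElement();
        // Assumes the first child is the config element; fails if it is a text node.
        Element configNode = (Element) root.getFirstChild();
        // Fails with a NullPointerException if root has no children at all.
        NodeList childNodes = configNode.getChildNodes();
        for (int i = 0; i < childNodes.getLength(); i++) {
            Node child = childNodes.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && ((Element) child).getTagName().equals("header")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Document doc = parse("<sample><config><header/></config></sample>");
        System.out.println(hasHeader(doc));
    }
}
```

The sketch works only while the input has no whitespace between the sample and config tags and the root has at least one child.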

For example, the second line of code gets the intermediate config node by using the getFirstChild method. Already there are several potential problems. The first child of the root node may not actually be the node the user is searching for. By blindly following the first child, the code ignores the actual name of the tag and may search an incorrect part of the document. A frequent error occurs when the source XML document contains whitespace or a carriage return after the root node's start tag; the first child of the root node is then actually a Node.TEXT_NODE node, not the desired element node. You can try it yourself: download the sample code from Resources and edit the sample.xml file, placing a carriage return between the sample and config tags. The code immediately terminates with an exception. To navigate to the desired node correctly, you need to check each child of the root until you find one that is not a Text node and that has the name you are looking for.

Listing 1 also ignores the possibility that the document structure may differ from what we expect. For example, if root has no child nodes, configNode is set to null, and the third line of the example generates an error. Therefore, to navigate the document correctly, you must not only check each child node individually and verify its name, but also check at each step that every method call returned a valid value. Writing robust, error-free code that can handle arbitrary input requires not only great attention to detail, but also many lines of code.
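The checks described above might look roughly like the following sketch, which skips non-element children, compares names, and tolerates missing nodes. The class name RobustTraversal and the helper childElement are hypothetical; this is one possible shape, not the article's sample code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class RobustTraversal {
    /** Hypothetical helper: parses an XML string with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Returns the first child element with the given name, or null if there is none. */
    static Element childElement(Node parent, String name) {
        if (parent == null) {
            return null;
        }
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            // Skip whitespace text, comments, and elements with other names.
            if (child.getNodeType() == Node.ELEMENT_NODE && child.getNodeName().equals(name)) {
                return (Element) child;
            }
        }
        return null;
    }

    /** True only if the root has a config child that has a header child. */
    static boolean hasHeader(Document doc) {
        Element config = childElement(doc.getDocumentElement(), "config");
        return childElement(config, "header") != null;
    }

    public static void main(String[] args) {
        Document doc = parse("<sample>\n  <config>\n    <header/>\n  </config>\n</sample>");
        System.out.println(hasHeader(doc));
    }
}
```

Even for this small task, most of the code is defensive checking, which is the bloat the text describes.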

Finally, all the functionality of the example in Listing 1 could have been achieved with a single call to the getElementsByTagName function, had the original developer known about it. That function is discussed below.

Retrieving the text value within an element

In the projects we analyzed, the second most common task after DOM traversal is retrieving the text value contained in an element. Consider the XML fragment <sometag>The Value</sometag>. If you have navigated to the sometag node, how do you get its text value (The Value)? An intuitive implementation might be:

sometagElement.getData()

As you may have guessed, the above code does not perform the desired action. Because the actual text is stored as one or more child nodes, you cannot call getData or similar functions on sometag elements. A better way might be:

sometag.getFirstChild().getData()

The problem with the second attempt is that the value may not actually be contained in the first child node; processing instructions or other embedded nodes may be found inside sometag, or the text value may be spread over several child nodes rather than a single one. Given that whitespace is often represented as text nodes, the call to sometag.getFirstChild() may give you only the carriage return between the tag and the value. In fact, you need to traverse all the child nodes, check for nodes of type Node.TEXT_NODE, and collate their values until you have the complete value.
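The following sketch shows one way to collate the text children, together with a case where the first child is not the text at all. The class name TextChildren and the helpers parseRoot and directText are hypothetical; treating CDATA sections as text is an assumption added here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextChildren {
    /** Hypothetical helper: parses an XML string and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Concatenates the direct text (and CDATA) children; does not descend into child elements. */
    static String directText(Element element) {
        StringBuilder text = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Element sometag = parseRoot("<sometag><!-- note -->The <![CDATA[Value]]></sometag>");
        // Here the first child is a comment, so getFirstChild() would not yield the text.
        System.out.println(sometag.getFirstChild().getNodeType() == Node.COMMENT_NODE);
        System.out.println(directText(sometag));
    }
}
```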

Note that JDOM has solved this problem for us with the convenience function getText. DOM Level 3 will also have an answer in the planned getTextContent method. The lesson: it never hurts to use a higher-level API whenever possible.
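In current JDKs the org.w3c.dom.Node interface does include getTextContent, so a short sketch is possible. Note one difference from the manual approach: getTextContent also includes the text of nested elements. The class name TextContentDemo and the parseRoot helper are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TextContentDemo {
    /** Hypothetical helper: parses an XML string and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Simple case: a single text value.
        System.out.println(parseRoot("<sometag>The Value</sometag>").getTextContent());
        // Nested case: text inside the child element b is included as well.
        System.out.println(parseRoot("<sometag>The <b>bold</b> Value</sometag>").getTextContent());
    }
}
```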

getElementsByTagName

The DOM Level 2 interface contains a method for finding child nodes with a given name. For example, the call:

NodeList names = someElement.getElementsByTagName("name");

returns a NodeList, here called names, of the nodes named name that are contained in the someElement node. This is certainly more convenient than the traversal methods discussed so far. It is also the cause of a common set of mistakes.

The problem is that getElementsByTagName traverses the document recursively, returning all matching nodes. Suppose you have a document that contains customer information, company information, and product information. All three items may contain name tags. If you call getElementsByTagName to search for customer names, your program is likely to misbehave, retrieving not only customer names, but also product and company names. Calling the function on a subtree of the document may reduce the risk, but the flexible nature of XML makes it difficult to ensure that the subtree you are working on has the structure you expect and contains no spurious descendant nodes with the name you are searching for.
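The effect is easy to reproduce. In the sketch below, the document layout (order, customer, employer, product) is a made-up example, and the class name TagNameSearch and its helpers are hypothetical. Searching from the root finds every name element, and even searching from the customer subtree finds one more than intended.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TagNameSearch {
    // Made-up sample document: three different kinds of name element.
    static final String XML =
            "<order>"
            + "<customer><name>Ann</name><employer><name>Acme</name></employer></customer>"
            + "<product><name>Widget</name></product>"
            + "</order>";

    /** Hypothetical helper: parses an XML string with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Counts all descendants of scope with the given tag name, at any depth. */
    static int count(Element scope, String tag) {
        return scope.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) {
        Element root = parse(XML).getDocumentElement();
        Element customer = (Element) root.getElementsByTagName("customer").item(0);
        System.out.println("names under root: " + count(root, "name"));
        System.out.println("names under customer: " + count(customer, "name"));
    }
}
```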

Effective use of DOM

Given the limitations imposed by DOM design, how can the specification be used effectively and efficiently? Here are a few basic principles and guidelines for using DOM, as well as libraries to make work easier.

Basic principles

If you follow a few basic principles, your experience with DOM will improve significantly:

◆ Do not use DOM to traverse documents.

◆ Use XPath whenever possible to find nodes or traverse documents.

◆ Use higher-level function libraries to make DOM easier to use.

These principles are derived directly from the study of common problems. As discussed above, DOM traversal is the main cause of errors. But it is also one of the most frequently needed features. How do you traverse a document without using DOM?

XPath

XPath is a language for addressing, searching, and matching parts of a document. It is a W3C Recommendation and is implemented in most languages and XML packages. Your DOM package may support XPath directly or through an add-on. The sample code for this article uses the Xalan package for XPath support.

XPath uses a path notation to specify and match parts of a document, similar to the notation used in file systems and URLs. For example, the XPath /x/y/z searches for a root node x of the document, under which there is a node y, under which there is a node z. This statement returns all nodes that match the specified path structure.

More complex matches can be based both on the structure of the document and on the values of nodes and their attributes. The statement /x/y/* returns any node under a y node whose parent is x. /x/y[@name='a'] matches all y nodes whose parent is x and that have an attribute called name with a value of a. Note that XPath takes care of sifting through the whitespace text nodes to get to the actual element nodes; these expressions return only element nodes.
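The expressions above can be tried with a short sketch. It uses the javax.xml.xpath API that ships with the JDK rather than the Xalan XPathAPI class used by the article's sample code; the class name XPathSelect, its helpers, and the sample document are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSelect {
    // Made-up sample with whitespace between the elements.
    static final String XML =
            "<x>\n  <y name='a'><z/><w/></y>\n  <y name='b'><z/></y>\n</x>";

    /** Hypothetical helper: parses an XML string with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Returns how many nodes the expression selects, starting from the context node. */
    static int count(Node context, String expression) {
        try {
            NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODESET);
            return matches.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println("/x/y/z          -> " + count(doc, "/x/y/z"));
        System.out.println("/x/y/*          -> " + count(doc, "/x/y/*"));
        System.out.println("/x/y[@name='a'] -> " + count(doc, "/x/y[@name='a']"));
        // Raw DOM children of x include the whitespace text nodes; /x/* does not.
        System.out.println("DOM children of x: " + doc.getDocumentElement().getChildNodes().getLength());
        System.out.println("/x/*            -> " + count(doc, "/x/*"));
    }
}
```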

A detailed discussion of XPath and its usage is beyond the scope of this article. See Resources for links to some excellent tutorials. Take some time to learn XPath, and you will find it much easier to work with XML documents.

Function library

One of the things that surprised us when we studied the DOM projects was the amount of copy-and-paste code in them. Why would experienced developers abandon good programming habits and copy and paste instead of creating helper libraries? We believe this is because the complexity of DOM steepens the learning curve and leads developers to grab the first piece of code that does what they need. It takes a long time to develop the expertise needed to write the canonical functions that make up a helper library.

To save you some of that time, here are a few basic helper functions to get your own library up and running.

findValue

When working with XML documents, the most common action is to find the value of a given node. As discussed above, difficulties arise both in traversing the document to find the desired value and in retrieving the value of the node. Traversal can be simplified by using XPath, while the retrieval of values can be coded once and then reused. We implemented the findValue function with the help of two lower-level functions: XPathAPI.selectSingleNode, provided by the Xalan package (which finds and returns the first node that matches a given XPath expression), and getTextContents, which non-recursively returns the contiguous text values contained in a node. Note that the getText function of JDOM, or the getTextContent method that will appear in DOM Level 3, can be used instead of getTextContents. Listing 2 contains a simplified listing; you can access all the functions by downloading the sample code (see Resources).

findValue is invoked by passing in both the node at which to start the search and the XPath statement that specifies the node to search for. The function finds the first node that matches the given XPath and extracts its text value.
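A function with this behavior could be sketched as follows. This is not the article's Listing 2: it swaps Xalan's XPathAPI.selectSingleNode for the JDK's javax.xml.xpath API, and the class name FindValueSketch, the parse helper, the exact signature, and the choice to return null when nothing matches are all assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FindValueSketch {
    /** Hypothetical helper: parses an XML string with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Non-recursively concatenates the text (and CDATA) children of a node. */
    static String getTextContents(Node node) {
        StringBuilder text = new StringBuilder();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString();
    }

    /** Returns the text of the first node matching xpath, or null if nothing matches. */
    static String findValue(Node start, String xpath) {
        Node match;
        try {
            match = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, start, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + xpath, e);
        }
        return match == null ? null : getTextContents(match);
    }

    public static void main(String[] args) {
        Document doc = parse("<sample>\n <config>\n  <header>Intro</header>\n </config>\n</sample>");
        System.out.println(findValue(doc, "/sample/config/header"));
    }
}
```

Because the XPath expression names every step of the path, the whitespace and missing-node cases that broke the manual traversal are handled without extra checks in the caller.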

setValue

Another common operation is to set the value of a node to a desired value, as shown in Listing 3. This function takes a starting node and an XPath statement, just like findValue, and a string to set as the value of the matching node. It looks up the desired node, removes all of its child nodes (and therefore any text and other elements contained in it), and sets its text content to the passed-in string.
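The steps just described can be sketched as follows. Again this is not the article's Listing 3: it uses the JDK's javax.xml.xpath API in place of Xalan's XPathAPI, and the class name SetValueSketch, the boolean return value, and the restriction to element nodes are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SetValueSketch {
    /** Hypothetical helper: parses an XML string with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Returns the first node matching xpath, or null if nothing matches. */
    static Node selectSingleNode(Node start, String xpath) {
        try {
            return (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, start, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + xpath, e);
        }
    }

    /** Replaces all children of the first matching element with one text node. */
    static boolean setValue(Node start, String xpath, String value) {
        Node match = selectSingleNode(start, xpath);
        if (match == null || match.getNodeType() != Node.ELEMENT_NODE) {
            return false; // nothing to update
        }
        // Remove existing text and embedded elements.
        while (match.hasChildNodes()) {
            match.removeChild(match.getFirstChild());
        }
        match.appendChild(match.getOwnerDocument().createTextNode(value));
        return true;
    }

    public static void main(String[] args) {
        Document doc = parse("<sample><config><header>Old <b>text</b></header></config></sample>");
        setValue(doc, "/sample/config/header", "New");
        System.out.println(selectSingleNode(doc, "/sample/config/header").getTextContent());
    }
}
```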

appendNode

While some programs find and modify the values contained in an XML document, others modify the structure of the document itself by adding and removing nodes. This helper function simplifies the addition of nodes to a document, as shown in Listing 4.

The parameters to this function are the node from which to start the search, the name of the new node to be added, and the XPath statement that specifies the location under which the node is to be added (that is, what the parent of the new node should be). The new node is added at the specified location in the document.
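A function along these lines might look like the sketch below. It is not the article's Listing 4: the parameter order, the returned Element, the null result when no parent matches, and the use of javax.xml.xpath instead of Xalan's XPathAPI are all assumptions, as is the class name AppendNodeSketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AppendNodeSketch {
    /** Hypothetical helper: parses an XML string with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /**
     * Adds a new element called name under the first element matching xpath.
     * Returns the new element, or null if no parent element matches.
     */
    static Element appendNode(Node start, String name, String xpath) {
        Node parent;
        try {
            parent = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, start, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + xpath, e);
        }
        if (parent == null || parent.getNodeType() != Node.ELEMENT_NODE) {
            return null;
        }
        // New elements must be created by the document that will own them.
        Element created = parent.getOwnerDocument().createElement(name);
        parent.appendChild(created);
        return created;
    }

    public static void main(String[] args) {
        Document doc = parse("<sample><config/></sample>");
        Element header = appendNode(doc, "header", "/sample/config");
        System.out.println(header.getParentNode().getNodeName());
    }
}
```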
