Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

A tutorial on how to reconstruct Dom trees

2025-02-24 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Share

Shulou(Shulou.com)06/03 Report--

This article mainly explains the "Dom tree reconstruction method tutorial", the article explains the content is simple and clear, easy to learn and understand, now please follow the editor's train of thought slowly in depth, together to study and study the "Dom tree reconstruction method tutorial"!

In this case, the generic crawler is generally divided into several different parts, as shown in the following figure:

Among them, HTML source code rewriting this component will modify the web page source code according to a certain strategy, eliminate extraneous nodes, and merge complex but unnecessary nested nodes. After rewriting, the relative standard and unified HTML is output and transmitted to the downstream information extraction components for content extraction.

This student's question involves rewriting the source code. In fact, using lxml to insert a node in the DOM tree is not a problem at all. Anyone who can use Google can naturally find a large number of solutions by searching lxml html insert element, as shown in the following figure:

However, the strange thing about this problem is that it needs to add child nodes in front of the text node. Dry speaking may be difficult to describe, and I will use an example to illustrate this problem.

Let's take a look at this code first:

From lxml.html import fromstring, Element, etree from html import unescape html ='

Hello

Node = fromstring (html) p_node = node.find ('. / / p') element = Element ('span') element.text =' Qingnan 'p_node.insert (0, element) new_html = unescape (etree.tostring (node). Decode () print (new_html))

Based on our experience with Python lists, if a list an is now ['Hello'], when we execute a.insert (0, 'Qingnan'), the result should be ['Qingnan', 'Hello']. But let's take a look at how the above code works:

You can see that Qingnan is behind you. If you look at the diagram at the beginning of this article, in the example given by the questioner, he wants to insert child nodes in front of the text. In this example, it should be Qingnan Hello.

You can give it a try. No matter how you search on Google, you can't find a way to insert a node in front of the text.

But in fact, as long as you go back to the official documentation, you will find that the solution to the whole problem is not difficult. What we need to use is lxml.html.builder [1].

Again, in the above example, how to get the span tag in front of the text? We use builder to achieve:

From lxml.html import builder from html import unescape html = 'node = fromstring (html) new_node = builder.P (builder.SPAN (' Qingnan'), 'Hello') node.append (new_node) new_html = unescape (etree.tostring (node). Decode () print (new_html))

The running effect is shown in the following figure:

Seeing this, some students may think that I am playing a scoundrel. It's like asking me to write a program to calculate the first five terms of the Fibonacci series, so I wrote the answer print (1, 1, 2, 3, 5) in 5 seconds. In the above code, I directly use builder.P (builder.SPAN ('Qingnan'), 'Hello'), which is the same as writing directly

Hello, Qingnan.

What's the difference? Isn't that cheating?

I know you are not convinced, but this is the real situation. This is what general crawlers do when they rewrite the HTML source code. Because it is very troublesome to rewrite the Dom tree of a web page directly. If you modify the Dom tree directly, it often occurs that you need to find the parent node of a node, and then find the children of the sibling node of the parent node to modify it. Or to determine whether a node has children, whether there are or not, you need two kinds of logic to deal with, in order to prevent the destruction of the Dom tree.

Therefore, instead of modifying the Dom tree directly, we scan the original Dom tree and use builder to rebuild a new Dom tree. The process of rebuilding the Dom tree is much easier than modifying the Dom tree. After all, anyone who has written code knows that it is much easier to write new code than to change other people's code.

Thank you for reading, the above is the "Dom tree reconstruction method tutorial" content, after the study of this article, I believe you have a deeper understanding of the Dom tree reconstruction method tutorial, the specific use of the need for you to practice and verify. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Development

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report