What is the principle of xpath parsing in python?

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains the principle of XPath parsing in Python. It is a practical topic, so it is shared here as a reference.

XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works just as well for searching HTML documents.

XPath's selection capabilities are very powerful. It provides concise path selection expressions, and it also provides more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences. Almost any node we want to locate can be selected with XPath. Note that lxml itself implements XPath 1.0, whose built-in function set is much smaller.

Principle of xpath parsing:

1. Tag positioning: instantiate an etree object and load the page source data to be parsed into that object.

2. Content capture: call the xpath method of the etree object with an xpath expression to locate the tag and capture its content.
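The two steps above can be sketched as follows. This is a minimal illustration rather than a full parser: the page_text string and the greeting class are made-up stand-ins for real page source.

```python
from lxml import etree

# Hypothetical page source standing in for real response data.
page_text = "<html><body><p class='greeting'>hello</p></body></html>"

# Step 1: instantiate an etree object and load the page source into it.
tree = etree.HTML(page_text)

# Step 2: call xpath() with an expression to locate the tag and capture its text.
result = tree.xpath('//p[@class="greeting"]/text()')
print(result)  # ['hello']
```

xpath() always returns a list here, even when only one node matches.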

Installing the environment

pip install lxml

lxml is a Python parsing library that supports both HTML and XML parsing, supports XPath, and parses very efficiently.

How to instantiate an etree object

1. Load the source data from the local html document into the etree object:

    etree.parse(filePath)  # filePath is the path to your file

2. Load source data obtained from the Internet into the object:

    etree.HTML(page_text)  # page_text is the response data fetched from the Internet

xpath expressions

    Expression           Description
    nodename             Selects all children of this node
    /                    Locates starting from the root node; represents one level
    //                   Represents multiple levels; can start locating from any position
    .                    Selects the current node
    ..                   Selects the parent of the current node
    @                    Selects attributes
    *                    Wildcard; selects all element nodes regardless of element name
    @*                   Selects all attributes
    [@attrib]            Selects all elements that have the given attribute
    [@attrib='value']    Selects all elements whose given attribute has the given value
    [tag]                Selects all elements that have the specified direct child element
    [tag='text']         Selects all elements that have the specified child element whose text content is text

The examples below explain these expressions in detail.
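Several of the expressions listed above can be tried on a tiny document. The markup below is invented for illustration (the box id and the a and b classes are assumptions, not part of the article's test page).

```python
from lxml import etree

# Hypothetical markup, used only to exercise the expressions.
html = """
<html><body>
  <div id="box" class="a"><p>one</p><p>two</p></div>
  <div class="b"><span>three</span></div>
</body></html>
"""
tree = etree.HTML(html)

divs = tree.xpath('/html/body/div')          # / walks one level at a time from the root
all_p = tree.xpath('//p')                    # // matches at any depth
with_id = tree.xpath('//div[@id]')           # elements that have an id attribute
class_b = tree.xpath('//div[@class="b"]')    # attribute with a given value
has_span = tree.xpath('//div[span]')         # elements with a direct span child
by_text = tree.xpath('//div[p="two"]')       # elements with a p child whose text is "two"
attrs = tree.xpath('//div[@id="box"]/@*')    # all attribute values of one element
children = tree.xpath('//div[@id="box"]/*')  # all element children
parent_tag = tree.xpath('//span/..')[0].tag  # .. steps up to the parent

print(len(divs), len(all_p), len(with_id), len(class_b), len(has_span), len(by_text))
print(sorted(attrs), len(children), parent_tag)
```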

This is the HTML document.

Test bs4

BaiLi ShouYue

Li Qingzhao

Wang Anshi

Su Shi

Liu Zongyuan

This is span

Song Dynasty is the most powerful dynasty, not because its army was strong, but because its economy was very strong and its people were very rich

It is always the floating clouds that cover the sun; Chang'an is out of sight, which makes one sorrowful

When it was drizzling during the Qingming Festival, the pedestrians on the road were even more depressed. They asked the shepherd boy where there was a restaurant, and he pointed to the distant apricot blossom village

The bright moon and the border pass are those of the Qin and Han dynasties, and the soldiers who marched ten thousand li have not yet returned. As long as Li Guang, the flying general of the Han Dynasty, was still around, the enemy troops would not be allowed to cross Yinshan

I often saw you in King Qi's house and heard your song several times in front of Cui Jiu's hall. Now the south of the Yangtze River is picturesque, and in this season of falling flowers I meet you again

du Fu

du Mu

du Xiaoyue

honeymoon

On the ancient Phoenix terrace the phoenix once roamed; the phoenix left and the terrace stands empty, and only the Yangtze River flows on as before. The palace weeds of the Soochow era buried the secluded trail, and the distinguished families of the Jin Dynasty became ancient tomb hills

Opened in a browser, it looks like this.

For convenience and clarity, we write a test HTML file and read it locally.
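Writing a test file and reading it back locally could look like the sketch below. The markup is a shortened, assumed reconstruction based on the expressions used later in the article (a div with class song holding p tags and an img, a div with class tang holding li items); it is not the author's exact file, and a temporary directory is used so nothing is overwritten.

```python
import os
import tempfile

from lxml import etree

# Assumed, shortened stand-in for the article's test page.
sample = """<html>
<head><title>Test bs4</title></head>
<body>
  <div><p>BaiLi ShouYue</p></div>
  <div class="song">
    <p>Li Qingzhao</p>
    <p>Wang Anshi</p>
    <p>Su Shi</p>
    <p>Liu Zongyuan</p>
    <img src="http://www.baidu.com/meinv.jpg" alt="" />
  </div>
  <div class="tang">
    <ul>
      <li><a href="http://www.sina.com">du Fu</a></li>
      <li><a href="http://www.dudu.com">du Mu</a></li>
    </ul>
  </div>
</body>
</html>"""

# Write the page to a temporary test.html file.
path = os.path.join(tempfile.mkdtemp(), 'test.html')
with open(path, 'w', encoding='utf8') as f:
    f.write(sample)

# etree.parse() uses an XML parser by default; passing an HTMLParser
# makes it tolerant of HTML that is not well-formed XML.
tree = etree.parse(path, etree.HTMLParser())

title = tree.xpath('//title/text()')
div_count = len(tree.xpath('//div'))
print(title, div_count)
```

If the file is strictly well-formed, plain etree.parse(path) also works; the explicit HTMLParser is simply the more forgiving choice for real pages.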

Locating child nodes and descendant nodes: / and //

Let's first look at child nodes and descendant nodes. Reading the document from the top down, we can see that the parent node of div is body, and the parent node of body is html.

Let's locate the div elements of this HTML. Looking at the HTML source above, you can see that there are three div elements.

We output these nodes using three different expressions. All three return the same three Element objects, so for this document the three expressions are equivalent.
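The three forms only coincide when every matching tag sits at the same depth. A small sketch with invented, nested markup shows where / and // start to differ:

```python
from lxml import etree

# Hypothetical markup: one p directly under the outer div, one nested deeper.
html = "<html><body><div><p>outer</p><div><p>nested</p></div></div></body></html>"
tree = etree.HTML(html)

# / follows exactly one level per step, so only the direct child p matches.
child_only = tree.xpath('/html/body/div/p/text()')

# // matches descendants at any depth below the given point.
under_body = tree.xpath('/html/body//p/text()')
anywhere = tree.xpath('//p/text()')

print(child_only)  # ['outer']
print(under_body)  # ['outer', 'nested']
print(anywhere)    # ['outer', 'nested']
```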

    from lxml import etree

    tree = etree.parse('test.html')
    r1 = tree.xpath('/html/body/div')  # go straight down from the root, level by level
    r2 = tree.xpath('/html//div')      # skip the levels between html and div
    r3 = tree.xpath('//div')           # skip every level above div
    r1, r2, r3
    >> ([<Element div at 0x...>, <Element div at 0x...>, <Element div at 0x...>],
        [<Element div at 0x...>, <Element div at 0x...>, <Element div at 0x...>],
        [<Element div at 0x...>, <Element div at 0x...>, <Element div at 0x...>])

Attribute positioning

If I only want the div whose class is song, I can locate it by its attribute.

As expected, a single element is returned.

    r4 = tree.xpath('//div[@class="song"]')
    r4
    >> [<Element div at 0x...>]

Index positioning

Suppose I only want the Su Shi tag inside song.

First locate song; appending /p then returns all the p tags inside it.

    tree.xpath('//div[@class="song"]/p')
    >> [<Element p at 0x...>, <Element p at 0x...>, <Element p at 0x...>, <Element p at 0x...>]

To get Su Shi's p tag on its own, add an index. Note that the index here starts at 1, not 0.
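How the 1-based XPath index relates to ordinary 0-based Python list indexing can be sketched like this. The markup is an assumed fragment shaped like the song block; last() is a standard XPath 1.0 function.

```python
from lxml import etree

# Assumed fragment shaped like the song block of the test page.
html = ('<div class="song"><p>Li Qingzhao</p><p>Wang Anshi</p>'
        '<p>Su Shi</p><p>Liu Zongyuan</p></div>')
tree = etree.HTML(html)

# XPath positions start at 1, so p[3] is the third p.
third_by_xpath = tree.xpath('//div[@class="song"]/p[3]')[0].text

# The returned Python list is 0-based, so the same node is index 2.
paragraphs = tree.xpath('//div[@class="song"]/p')
third_by_python = paragraphs[2].text

# last() selects the final p without knowing how many there are.
last_one = tree.xpath('//div[@class="song"]/p[last()]')[0].text

print(third_by_xpath, third_by_python, last_one)
```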

    tree.xpath('//div[@class="song"]/p[3]')
    >> [<Element p at 0x...>]

Fetching text

For example, I want to get the text content of du Mu.

As above, to locate du Mu's a tag we must first find the li one level above it. du Mu is the a in the fifth li, which gives the expression below. Appending text() to the path returns the element's text content instead of the element itself.

    tree.xpath('//div[@class="tang"]//li[5]/a/text()')
    >> ['du Mu']

You can see that this returns a list. If we want the string inside it, we can take the first item:

    tree.xpath('//div[@class="tang"]//li[5]/a/text()')[0]
    >> 'du Mu'
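Indexing with [0] raises IndexError when nothing matches. One possible way to guard against that is a small helper; first_text is a hypothetical name introduced here, not an lxml function, and the markup is an assumed two-item fragment.

```python
from lxml import etree


def first_text(node, expression, default=None):
    """Return the first string matched by the expression, stripped, or default."""
    matches = node.xpath(expression)
    return matches[0].strip() if matches else default


# Assumed fragment shaped like the tang block, shortened to two items.
html = ('<div class="tang"><ul>'
        '<li><a href="http://www.sina.com">du Fu</a></li>'
        '<li><a href="http://www.dudu.com">du Mu</a></li>'
        '</ul></div>')
tree = etree.HTML(html)

found = first_text(tree, '//div[@class="tang"]//li[2]/a/text()')
missing = first_text(tree, '//div[@class="tang"]//li[9]/a/text()', default='')

print(found)          # du Mu
print(repr(missing))  # ''
```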

Here is a more direct approach: //li navigates straight to the li tags, and //text() extracts all the text under them. Note, however, that all the text under every li tag is extracted, which sometimes includes text you do not want, so it is best to write a more specific path, for example stating which div the li belongs to.

    tree.xpath('//li//text()')
    >> ['When it was drizzling during the Qingming Festival, the pedestrians on the road were even more depressed. They asked the shepherd boy where there was a restaurant, and he pointed to the distant apricot blossom village',
        'The bright moon and the border pass are those of the Qin and Han dynasties, and the soldiers who marched ten thousand li have not yet returned. As long as Li Guang, the flying general of the Han Dynasty, was still around, the enemy troops would not be allowed to cross Yinshan',
        "I often saw you in King Qi's house and heard your song several times in front of Cui Jiu's hall. Now the south of the Yangtze River is picturesque, and in this season of falling flowers I meet you again",
        'du Fu',
        'du Mu',
        'du Xiaoyue',
        'honeymoon',
        'On the ancient Phoenix terrace the phoenix once roamed; the phoenix left and the terrace stands empty, and only the Yangtze River flows on as before. The palace weeds of the Soochow era buried the secluded trail, and the distinguished families of the Jin Dynasty became ancient tomb hills']
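The advice to scope the path can be illustrated with a page that has more than one list. The nav block below is invented to play the role of unwanted text, and clean is a hypothetical helper that drops whitespace-only strings.

```python
from lxml import etree


def clean(texts):
    """Strip each string and drop the ones that are empty after stripping."""
    return [t.strip() for t in texts if t.strip()]


# Hypothetical page: a navigation list plus the list we actually want.
html = """
<div class="nav"><ul><li>Home</li></ul></div>
<div class="tang"><ul>
  <li>
    <a href="http://www.sina.com">du Fu</a>
  </li>
  <li><b>du Xiaoyue</b></li>
</ul></div>
"""
tree = etree.HTML(html)

# Unscoped: text from every li on the page, including the navigation item.
everything = clean(tree.xpath('//li//text()'))

# Scoped: only li items inside the tang block.
scoped = clean(tree.xpath('//div[@class="tang"]//li//text()'))

print(everything)  # ['Home', 'du Fu', 'du Xiaoyue']
print(scoped)      # ['du Fu', 'du Xiaoyue']
```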

For example, suppose I want to fetch the following attribute.

You can fetch attributes directly with @.

    tree.xpath('//div[@class="song"]/img/@src')
    >> ['http://www.baidu.com/meinv.jpg']
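Attribute values can be read either with @ inside the expression or with the get() method on an element that has already been located. A minimal sketch on an assumed fragment of the song block (the link texts are invented):

```python
from lxml import etree

# Assumed fragment shaped like the song block; link texts are made up.
html = ('<div class="song">'
        '<a href="http://www.song.com/">song link</a>'
        '<a href="" class="du">empty link</a>'
        '<img src="http://www.baidu.com/meinv.jpg" alt="" />'
        '</div>')
tree = etree.HTML(html)

# @src inside the expression returns the attribute values as strings.
srcs = tree.xpath('//div[@class="song"]/img/@src')

# //@href collects every href in the document, including empty ones.
hrefs = tree.xpath('//@href')
non_empty = [h for h in hrefs if h]

# Element.get() reads an attribute from a located element and
# returns None when the attribute is absent.
links = [(a.get('href'), a.text) for a in tree.xpath('//a')]
no_title = tree.xpath('//a')[0].get('title')

print(srcs)
print(hrefs, non_empty)
print(links, no_title)
```

Using get() keeps each attribute paired with its element, which helps when a link's text and its href are both needed.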

Or, if I want all the href attributes, the following returns every href in tang and song.

    tree.xpath('//@href')
    >> ['http://www.song.com/', '', 'http://www.baidu.com', 'http://www.163.com', 'http://www.126.com', 'http://www.sina.com', 'http://www.dudu.com', 'http://www..com']

Crawling real estate information from 58.com

    # import the necessary libraries
    import requests
    from lxml import etree

    # url is the target site; for headers, see Figure 1
    url = 'https://sh.58.com/ershoufang/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.7 Safari/537.36'}

    # send a request to the site
    page_test = requests.get(url=url, headers=headers).text
    # load the source data obtained from the Internet into the etree object
    tree = etree.HTML(page_test)
    # see the explanation in Figure 2: there are several li nodes, and li_list is a list of them
    li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')
    # open a file 58.txt to save the information
    with open('58.txt', 'w', encoding='utf8') as fp:
        # iterate over li_list
        for li in li_list:
            # ./ continues from the current li, equivalent to li/div...
            title = li.xpath('./div[2]/h3/a/text()')[0]
            print(title + '\n')
            # write the title to the file
            fp.write(title + '\n')
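The parsing part of the crawler can be separated from the network request so it can be tried offline. The sketch below is one possible arrangement, not the article's code: extract_titles is a hypothetical helper, sample_page is invented markup that only mimics the house-list-wrap structure, and the real site's layout may differ or change.

```python
from lxml import etree


def extract_titles(page_text):
    """Return the listing titles found in the given page source."""
    tree = etree.HTML(page_text)
    titles = []
    for li in tree.xpath('//ul[@class="house-list-wrap"]/li'):
        found = li.xpath('./div[2]/h3/a/text()')
        # Skip li nodes without a title instead of raising IndexError on [0].
        if found:
            titles.append(found[0].strip())
    return titles


# Invented markup that mimics the assumed list structure.
sample_page = """
<ul class="house-list-wrap">
  <li><div>pic</div><div><h3><a>Listing A</a></h3></div></li>
  <li><div>ad only</div></li>
  <li><div>pic</div><div><h3><a> Listing B </a></h3></div></li>
</ul>
"""

titles = extract_titles(sample_page)
print(titles)  # ['Listing A', 'Listing B']
```

In a live run, page_text would come from requests.get(url, headers=headers, timeout=10).text, and the returned titles could then be written to 58.txt as in the code above.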

Figure 1:

Figure 2:

Here we want to extract all the housing information. We can see that the node one level above each small node is the same. What we want to extract is the housing information in the a node under h3; see Figure 3.

The child nodes under each li node have the same structure, so we can find all the li nodes first and then work downward from each one to find the information we want.
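When working downward from each li, the leading ./ matters: a path that starts with // searches the whole document again, even when it is called on a single li element. A minimal sketch with invented markup:

```python
from lxml import etree

# Hypothetical list with the same structure under every li.
html = ('<ul>'
        '<li><h3><a>first</a></h3></li>'
        '<li><h3><a>second</a></h3></li>'
        '</ul>')
tree = etree.HTML(html)
li_list = tree.xpath('//li')

# ./ is relative to the current li, so each call sees only its own title.
relative_results = [li.xpath('./h3/a/text()') for li in li_list]

# // restarts from the document root, so every call returns all titles.
absolute_results = [li.xpath('//h3/a/text()') for li in li_list]

print(relative_results)  # [['first'], ['second']]
print(absolute_results)  # [['first', 'second'], ['first', 'second']]
```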
