
How to use XPath syntax in a Python crawler

Updated 2025-01-19. From: SLTechnology News&Howtos (shulou), Development


Many people are unsure how to use XPath syntax in a Python crawler, so this article summarizes the key points in detail, with clear steps, as a practical reference. Let's take a look.

What is XPath?

XPath is the XML Path Language, a language for addressing parts of an XML document (XML is a subset of the Standard Generalized Markup Language). XPath is based on the tree structure of XML and provides the ability to find nodes in that tree. XPath was originally intended as a common syntax shared by XPointer and XSL, but developers quickly adopted it as a small query language.

To put it simply: XPath (XML Path Language) is a language for finding information in XML and HTML documents, and it can be used to traverse the elements and attributes in those documents.

XPath development tools

Two widely used and convenient tools for writing and testing XPath expressions:

Chrome plug-in XPath Helper (installing it from the Chrome Web Store may require a VPN in some regions)

Firefox plug-in Try XPath

The Chrome plug-in XPath Helper can also be installed from a downloaded installation package using the Plug-in Partner tool. The installation steps are as follows:

Open Plug-in Partner and select the downloaded plug-in.

Choose to extract the plug-in content to the desktop; a new folder will appear on the desktop.

Move the folder to the path where you want to keep it.

Open Google Chrome, go to the Extensions page, turn on developer mode, choose to load the unpacked extension, and select the folder path.

XPath nodes

In XPath, there are seven types of nodes: elements, attributes, text, namespaces, processing instructions, comments, and document (root) nodes. XML documents are treated as node trees. The root of a tree is called a document node or root node.
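As a minimal sketch of how several of these node types show up in practice, the snippet below selects five of the seven kinds with lxml (introduced later in this article). The XML document and its tag names are made up for illustration.

```python
from lxml import etree

# Hypothetical XML document, invented for this illustration.
xml = """<bookstore>
  <!-- a comment node -->
  <?render fast?>
  <book lang="en"><title>Example Title</title></book>
</bookstore>"""

root = etree.fromstring(xml)

elements = root.xpath("//book")                           # element nodes
attributes = root.xpath("//book/@lang")                   # attribute nodes (returned as strings)
texts = root.xpath("//title/text()")                      # text nodes (returned as strings)
comments = root.xpath("//comment()")                      # comment nodes
instructions = root.xpath("//processing-instruction()")   # processing-instruction nodes

print(elements[0].tag, attributes, texts)
print(comments[0].text, instructions[0].target)
```

Note that lxml returns attribute and text nodes as Python strings, while element, comment, and processing-instruction nodes come back as objects.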

XPath syntax

XPath uses path expressions to select nodes or node sets in an XML document. Nodes are selected by following a path made up of steps.

Usage:

Use // to select elements anywhere in the page, then write the tag name, then narrow the match with a predicate in square brackets, for example:

//title[@lang='en']
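A short sketch of this expression evaluated with lxml (introduced later in this article); the bookstore document is a made-up example, not part of the original material.

```python
from lxml import etree

# Hypothetical document used only to exercise the expression.
xml = """<bookstore>
  <book><title lang="en">Harry Potter</title></book>
  <book><title lang="fr">Le Petit Prince</title></book>
</bookstore>"""

root = etree.fromstring(xml)

# //title matches every <title>; the predicate keeps those with lang="en".
english_titles = root.xpath("//title[@lang='en']")
print([t.text for t in english_titles])
```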

Points to note:

The difference between / and //: / selects only direct child nodes, while // selects descendant nodes at any depth. In practice // is used more often, but it depends on the situation.
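The difference can be seen in a small sketch with lxml; the document below is invented for illustration.

```python
from lxml import etree

# Hypothetical document: one <p> is a direct child of <div>, one is nested deeper.
xml = """<div>
  <p>direct child</p>
  <section><p>nested descendant</p></section>
</div>"""

root = etree.fromstring(xml)

children = root.xpath("/div/p/text()")       # only <p> directly under <div>
descendants = root.xpath("/div//p/text()")   # every <p> anywhere below <div>

print(children)
print(descendants)
```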

contains: sometimes an attribute holds multiple values; in that case you can use the contains function, as in the following example:

//title[contains(@lang,'en')]
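A brief sketch of contains on a multi-valued class attribute, using a made-up document. Keep in mind that contains is a plain substring test, so it can also match values you did not intend; the last query shows a commonly used XPath 1.0 idiom for matching a whole space-separated token instead.

```python
from lxml import etree

# Hypothetical items; the class values are invented for illustration.
xml = """<list>
  <item class="card featured">A</item>
  <item class="card">B</item>
  <item class="featured-old">C</item>
</list>"""

root = etree.fromstring(xml)

# An exact comparison fails when the attribute holds several values.
exact = root.xpath("//item[@class='featured']/text()")

# contains() is a substring test: it matches A, but also C.
loose = root.xpath("//item[contains(@class, 'featured')]/text()")

# Token match: pad with spaces so only the whole word "featured" matches.
token = root.xpath(
    "//item[contains(concat(' ', normalize-space(@class), ' '), ' featured ')]/text()"
)

print(exact, loose, token)
```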

Indexes in predicates start at 1, not 0.
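A small sketch contrasting XPath's 1-based positions with Python's 0-based list indexing; the list markup is invented for illustration.

```python
from lxml import etree

root = etree.fromstring("<ul><li>one</li><li>two</li><li>three</li></ul>")

first = root.xpath("//li[1]/text()")        # position 1 is the first <li>
last = root.xpath("//li[last()]/text()")    # last() is the final position
zero = root.xpath("//li[0]/text()")         # no node has position 0

# The Python list returned by xpath() is still 0-based.
first_via_python = root.xpath("//li")[0].text

print(first, last, zero, first_via_python)
```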

The lxml library

lxml is an HTML/XML parser; its main job is parsing HTML/XML data and extracting data from it.

It is a third-party library and needs to be installed with pip, using the following command:

pip install lxml

Basic use:

The examples below use a small HTML file, saved as hello.html, containing a list of five li items (listed here by their text). Save it and practice along.

First

Second

Third

Fourth

Fifth
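Only the item text of the practice file is shown above, so here is a stand-in you can generate yourself. It is a sketch under assumptions: the five link texts and the https://www.yisu.com/ URL come from this article, while the class names, the other href values, and the position of the span are invented so that the queries in the cases below have something to match.

```python
from pathlib import Path
from lxml import etree

# Assumed stand-in for the practice file. Class names, the linkN.html URLs,
# and the placement of <span> are hypothetical.
HELLO_HTML = """<div>
  <ul>
    <li class="item-0"><a href="link1.html">First</a></li>
    <li class="item-1"><a href="link2.html">Second</a></li>
    <li class="item-inactive"><a href="https://www.yisu.com/" class="bold"><span>Third</span></a></li>
    <li class="item-1"><a href="link4.html">Fourth</a></li>
    <li class="item-0"><a href="link5.html">Fifth</a></li>
  </ul>
</div>
"""

# Write the file into the current directory so the later cases can read it.
Path("hello.html").write_text(HELLO_HTML, encoding="utf-8")

# Sanity check: parse it back and list the link texts.
tree = etree.parse("hello.html", etree.HTMLParser())
link_texts = tree.xpath("//li//a//text()")
print(link_texts)
```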

Case 1: parsing a string into an HTML document

from lxml import etree

text = ''  # the HTML source to parse goes between the quotes
html = etree.HTML(text)  # parse the string; returns the root element
print(html)

# serialize the parsed document back to a string
result = etree.tostring(html).decode('utf-8')
print(result)
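To make Case 1 concrete, here is a sketch with an invented fragment in which the last li is deliberately left unclosed. It illustrates that etree.HTML uses a forgiving HTML parser: it wraps the fragment in html and body elements and closes the open tag.

```python
from lxml import etree

# Hypothetical fragment: the second <li> has no closing tag on purpose.
text = """
<div>
  <ul>
    <li class="item-0"><a href="link1.html">First</a></li>
    <li class="item-1"><a href="link2.html">Second</a>
  </ul>
</div>
"""

html = etree.HTML(text)  # root <html> element of the repaired tree
result = etree.tostring(html, pretty_print=True).decode("utf-8")

print(html.tag)
print(result)
```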

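Case 2: reading the HTML code from a file

from lxml import etree

# etree.parse defaults to the strict XML parser, so pass an HTML parser
parser = etree.HTMLParser()
html = etree.parse('hello.html', parser=parser)  # read and parse the file

# serialize the parsed document back to a string
result = etree.tostring(html).decode('utf-8')
print(result)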
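One caveat with etree.parse is that, without an explicit parser, it uses lxml's strict XML parser, which rejects HTML that is not well-formed. The sketch below shows the difference on an invented, deliberately malformed snippet, read from an in-memory buffer rather than a file.

```python
import io
from lxml import etree

# Hypothetical malformed HTML: <br> and both <li> tags are never closed.
broken = b"<ul><li>First<br><li>Second</ul>"

try:
    etree.parse(io.BytesIO(broken))  # default: strict XML parser
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser recovers from the missing closing tags.
tree = etree.parse(io.BytesIO(broken), etree.HTMLParser())
items = tree.xpath("//li/text()")

print(xml_ok, items)
```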

Case 3: using XPath syntax in lxml

from lxml import etree

# parse the file with an HTML parser (the default parser is strict XML)
html = etree.parse('hello.html', etree.HTMLParser())

# get all li tags:
# result = html.xpath('//li')
# print(result)
# for i in result:
#     print(etree.tostring(i))

# get the values of all class attributes of all li elements:
# result = html.xpath('//li/@class')
# print(result)

# get the a tag whose href is https://www.yisu.com/ under the li tag:
# result = html.xpath('//li/a[@href="https://www.yisu.com/"]')
# print(result)

# get all span tags under the li tag:
# result = html.xpath('//li//span')
# print(result)

# get all class attributes in the a tags under the li tag:
# result = html.xpath('//li/a//@class')
# print(result)

# get the value of the href attribute of the a tag in the last li:
# result = html.xpath('//li[last()]/a/@href')
# print(result)

# get the content of the penultimate li element:
# result = html.xpath('//li[last()-1]/a')
# print(result)
# print(result[0].text)

# the second way to get the content of the penultimate li element:
result = html.xpath('//li[last()-1]/a/text()')
print(result)
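The queries from Case 3 can be run together in a self-contained sketch. The markup below is an assumption made for illustration (a three-item list with invented class names and link targets, plus the https://www.yisu.com/ link from the case above), parsed from a string so no file is needed.

```python
from lxml import etree

# Assumed markup; class names and the linkN.html URLs are hypothetical.
text = """<ul>
  <li class="item-0"><a href="https://www.yisu.com/" class="bold"><span>First</span></a></li>
  <li class="item-1"><a href="link2.html">Second</a></li>
  <li class="item-0"><a href="link3.html">Third</a></li>
</ul>"""

html = etree.HTML(text)

# Each entry maps a short description to one of the Case 3 expressions.
queries = {
    "li classes": "//li/@class",
    "link by href": '//li/a[@href="https://www.yisu.com/"]//text()',
    "spans": "//li//span/text()",
    "a classes": "//li/a//@class",
    "last href": "//li[last()]/a/@href",
    "penultimate text": "//li[last()-1]/a/text()",
}

results = {name: html.xpath(expr) for name, expr in queries.items()}
for name, value in results.items():
    print(f"{name}: {value}")
```

Collecting the expressions in a dictionary keeps each query next to a readable label, which is convenient when experimenting with several expressions against the same page.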

That concludes this article on how to use XPath syntax in a Python crawler. I hope the content is helpful to you.
