Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to use Python crawler XPath

2025-01-20 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Share

Shulou(Shulou.com)06/02 Report--

This article mainly shows you "how to use Python crawler XPath", the content is easy to understand, clear, hope to help you solve doubts, the following let the editor lead you to study and learn "how to use Python crawler XPath" this article.

First, the problem description 1. What is XPath?

Xpath is a language for finding information in XML and HTML documents, which can be used to traverse elements and attributes in XML and HTML documents. XPath uses path expressions to select nodes or node sets in XML documents. These path expressions are very similar to those seen in a regular computer file system.

Second, the solution 1.XPath syntax

To learn xpath well, you must first understand the nodes in the html document.

First item second item third item fourth item fifth item # Note that a closed tag is missing here

The above is a random section of html text found on the Internet, it can be observed that under the div tag is the ul tag, while under the ul tag is the li tag, so it is found that the html tag is as tree-like as a tree. Xpath is looking for it in this way. Take life as an example, to determine a person's position, first make sure that he is in China, then determine which province, city, and community he is in, and finally find him.

Expression.

Description

Nodename

Select all child nodes of this node bookstore select all child nodes under bookstore

/

If it is at the front, it represents the selection from the root node. Otherwise, select a node under a node / bookstore select all bookstore nodes under the root element

/ /

Select a node from the global node and find all the book nodes from the global node at any location / / book

@

Select the attribute of a node / / book [@ price] Select all book nodes that have the price attribute

.

Current node

Text ()

Get the text in the label

Peer tags can be obtained by li [1], li [2], and li [3].

2.lxml library

A brief introduction to the lxml library, which will be used next

Lxml is a HTML/XML parser, the main function is how to parse and extract HTML/XML data.

Lxml, like regular, is also implemented in C, is a high-performance PythonHTML/XML parser, you can use the previously learned XPath syntax to quickly locate specific elements and node information.

3. Actual case

Climb a random website and find the html text of the website.

As shown below:

To find title and href, you can find the paths / / div [@ id= "resultList"] / div [@ class= "el"] / p/span/a/@title respectively.

/ / div [@ id= "resultList"] / div [@ class= "el"] / p/span/a/@href

The operation is as follows:

The above is all the content of this article "how to use Python crawler XPath". Thank you for reading! I believe we all have a certain understanding, hope to share the content to help you, if you want to learn more knowledge, welcome to follow the industry information channel!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Development

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report