2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Many readers are unfamiliar with how to use XPath syntax in a Python crawler, so this article summarizes the topic with detailed content and clear steps. I hope you find it a useful reference and get something out of it. Let's take a look.
What is XPath?
XPath is the XML Path Language, a language for addressing parts of an XML document (XML being a subset of the Standard Generalized Markup Language). XPath is based on the tree structure of an XML document and provides the ability to locate nodes in that tree. XPath was originally conceived as a common syntax model shared between XPointer and XSL, but developers quickly adopted it as a small query language in its own right.
To put it simply: XPath (XML Path Language) is a language for finding information in XML and HTML documents, and it can be used to traverse the elements and attributes of those documents.
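As a quick first taste, here is a minimal sketch using the lxml library (covered later in this article); the HTML fragment is invented for illustration:

```python
from lxml import etree

# A tiny HTML fragment, invented for illustration
html = etree.HTML("<div><p class='intro'>Hello</p><p>World</p></div>")

# Select the text of every <p> whose class attribute is "intro"
print(html.xpath("//p[@class='intro']/text()"))  # ['Hello']
```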
XPath development tools
Here are two widely used and convenient tools:

The Chrome extension XPath Helper (installing it from the Chrome Web Store requires a proxy or VPN in some regions)

The Firefox extension Try XPath.

Of course, the Chrome extension XPath Helper can also be installed from a downloaded package via a plug-in helper tool. The steps are as follows:

Open the plug-in helper and select the downloaded extension package

If you choose to extract the extension to the desktop, a new folder will appear there

Move that folder to wherever you want to keep it

Open Chrome, go to the Extensions page, enable Developer mode, click "Load unpacked," and select that folder
XPath node
In XPath, there are seven kinds of nodes: elements, attributes, text, namespaces, processing instructions, comments, and the document (root) node. An XML document is treated as a tree of nodes; the root of the tree is called the document node or root node.
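Several of these node kinds can be selected directly with XPath, as the following short sketch shows (the XML snippet is invented for illustration):

```python
from lxml import etree

# A tiny XML document containing an attribute, a comment, a text node, and an element
doc = etree.fromstring("<root a='1'><!-- note -->text<child/></root>")

print(doc.xpath("//@a"))         # attribute nodes: ['1']
print(len(doc.xpath("//comment()")))  # comment nodes: 1
print(doc.xpath("//text()"))     # text nodes: ['text']
```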
XPath syntax
XPath uses path expressions to select nodes or node-sets in an XML document. A node is selected by following a path or a series of steps.
Usage: use // to select matching elements anywhere in the page, follow it with the tag name, and then narrow the selection by writing a predicate, for example:

//title[@lang='en']
Points to note:

The difference between / and //: / selects only direct children, while // selects all descendants. In practice // is used more often, but it depends on the situation.

contains(): sometimes an attribute contains multiple values; in that case you can use the contains() function, as in the following example:

//title[contains(@lang, 'en')]

Subscripts in predicates start at 1, not 0.
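The three points above can be illustrated with a short lxml sketch (the document here is invented for illustration):

```python
from lxml import etree

# A hypothetical document used only to illustrate the three points above
doc = etree.HTML("""
<div>
  <ul>
    <li lang="en-US"><span>one</span></li>
    <li lang="en-GB"><span>two</span></li>
    <li lang="fr"><span>three</span></li>
  </ul>
</div>
""")

# 1. / selects direct children only, // selects all descendants:
print(doc.xpath("//div/span"))        # [] -- span is not a direct child of div
print(len(doc.xpath("//div//span")))  # 3  -- the spans are descendants of div

# 2. contains() matches attributes that merely contain a value:
print(len(doc.xpath("//li[contains(@lang, 'en')]")))  # 2 (en-US and en-GB)

# 3. Predicate subscripts start at 1, not 0:
print(doc.xpath("//li[1]/span/text()"))  # ['one']
```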
The lxml library

lxml is an HTML/XML parser; its main job is to parse HTML/XML and extract data from it.

lxml is a third-party library, so it needs to be installed with pip, using the following command:

pip install lxml
Basic use:
The following is the hello.html file used in the practice examples; friends in front of the screen can save it and practice along. (The class names and link URLs here are illustrative; any markup with the same shape works.)

<div>
  <ul>
    <li class="item-0"><a href="https://www.yisu.com/">First</a></li>
    <li class="item-1"><a href="link2.html">Second</a></li>
    <li class="item-inactive"><a href="link3.html"><span class="bold">Third</span></a></li>
    <li class="item-1"><a href="link4.html">Fourth</a></li>
    <li class="item-0"><a href="link5.html">Fifth</a></li>
  </ul>
</div>
Case 1: parsing a string into an HTML document

from lxml import etree

text = '<ul><li><a href="link1.html">First</a></li></ul>'  # sample HTML string (a stand-in; the article's original snippet was omitted)
html = etree.HTML(text)  # parse
print(html)
# serialize the element tree back to a string
result = etree.tostring(html).decode('utf-8')
print(result)
Case 2: reading the HTML code from a file:

from lxml import etree

# etree.parse uses an XML parser by default; pass an HTML parser for HTML files
html = etree.parse('hello.html', etree.HTMLParser())
# serialize the parsed tree back to a string
result = etree.tostring(html).decode('utf-8')
print(result)
Case 3: using XPath syntax in lxml

from lxml import etree

html = etree.parse('hello.html', etree.HTMLParser())

# get all li tags:
# result = html.xpath('//li')
# print(result)
# for i in result:
#     print(etree.tostring(i))

# get the values of the class attribute of all li elements:
# result = html.xpath('//li/@class')
# print(result)

# get the a tag under a li tag whose href is https://www.yisu.com/:
# result = html.xpath('//li/a[@href="https://www.yisu.com/"]')
# print(result)

# get all span tags under li tags (// because span may not be a direct child):
# result = html.xpath('//li//span')
# print(result)

# get all class attributes inside the a tags under li tags:
# result = html.xpath('//li/a//@class')
# print(result)

# get the value of the href attribute of the a tag in the last li:
# result = html.xpath('//li[last()]/a/@href')
# print(result)

# get the content of the penultimate li element:
# result = html.xpath('//li[last()-1]/a')
# print(result)
# print(result[0].text)

# a second way to get the content of the penultimate li element:
result = html.xpath('//li[last()-1]/a/text()')
print(result)
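To make the difference between the last two queries explicit: selecting the a tag itself returns element objects whose text is read via .text, while appending text() returns the text directly. A self-contained sketch (the markup is a stand-in for hello.html):

```python
from lxml import etree

# Stand-in markup for hello.html, invented for illustration
html = etree.HTML("<ul><li><a href='1.html'>First</a></li>"
                  "<li><a href='2.html'>Second</a></li>"
                  "<li><a href='3.html'>Third</a></li></ul>")

# Way 1: select the <a> element, then read its .text attribute
elems = html.xpath("//li[last()-1]/a")
print(elems[0].text)  # Second

# Way 2: select the text node directly with text()
print(html.xpath("//li[last()-1]/a/text()"))  # ['Second']
```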
That is all for this article on how to use XPath syntax in a Python crawler. I hope the content shared here has given you a solid understanding and proves helpful. If you want to learn more, please follow the industry information channel.