In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-18 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >
Share
Shulou(Shulou.com)06/02 Report--
What does the lxml module in Python refer to? I believe many inexperienced people don't know what to do about it. Therefore, this paper summarizes the causes and solutions of the problem. Through this article, I hope you can solve this problem.
1. Understand lxml module and xpath syntax
To extract specific content from html or xml text, we need to master the use of lxml modules and xpath syntax.
The lxml module can use xPath rule syntax to quickly locate specific elements in HEML\ XML documents and obtain node information (text content, attribute values).
XPath (XML Path Language) is a language for finding information in HTML\ XML documents and can be used to traverse elements and attributes in HTML | XML documents.
Extracting data from xml and html requires the use of lxml module and xpath syntax.
2. Installation and use of Google browser xpath helper plug-in
Test the xpath syntax rules on the current page in Google browser.
Installation and use of Google browser xpath helper plug-in
We take windows as an example to install xpath helper.
Installation of the xpath helper plug-in:
1) download the Chrome plug-in XPath Helper
It can be downloaded from the Chrome App Mall, or from the link below if it cannot be downloaded.
2) change the suffix name of the file crx to rar, and then extract it to a folder with the same name
3) drag the unzipped folder into the chrome browser extension interface that has been enabled in developer mode
3. The node relationship of xpath
To learn xpath syntax, you need to understand the node relationships in xpath.
3.1Node in xpath what
Every label of html and xml is called a node, and the top node is called the root node. Let's take xml as an example, and html is the same. 、
3.2 Node relationships in xpath
Author is the first sibling node of title.
4. Xpath syntax-the syntax for selecting nodes and extracting attributes or text content
1) XPath uses path expressions to select nodes or node sets in XML documents.
2) these path expressions are very similar to those we see in conventional computer file systems.
3) when you select a tag using the chrome plug-in, the selected tag adds the attribute class= "xh-highlight"
4.1 xpath locates nodes and syntax for extracting attributes or text content
5. Xpath syntax-select the syntax of a specific node
You can get a specific node according to the attribute value of the tag, subscript, and so on.
5.1 Syntax for selecting a specific node
5.2 subscript for xpath
In xpath, the position of the first element is 1
The location of the last element is last ()
The penultimate one is last ()-1.
6. Xpath syntax-syntax for selecting unknown nodes
Elements of unknown html and xml can be selected with wildcards.
6.1. Syntax for selecting unknown nodes
Example of installation and use of 7.lxml module
The lxml module is a third-party module that is used after installation.
Installation of 7. 1 lxml module
The response content in the form of xml or html obtained by sending the request is extracted.
Pip install lxml
7.2 html extracted by crawler
Extract the text content from the label
Extract the value of an attribute in a tag
For example, extract the value of the href attribute in the a tag, get the url, and then continue to initiate the request.
7.3 use of the lxml module
1). Import the etree library of lxml
From lxml import etree
2), use etree.HTML to convert the html string (bytes type or str type) into an Element object, and the Element object has a xpath method that returns the category of the result.
Html = etree.HTML (text) ret_list = html.xpath ("xpath grammar rule string")
3) three cases in which the xpath method returns a list
Return empty list: no elements are located according to the xpath syntax rule string
Returns a list of strings: xpath string rules must match the text content or the value of an attribute
Returns a list of Element objects: the xpath rule string matches the label, and the Element object in the list can continue to xpath
Import requestsfrom lxml import etreeclass Tieba (object): def _ _ init__ (self,name): self.url = "https://tieba.baidu.com/f?ie=utf-8&kw={}".format(name) self.headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64) X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'} def get_data (self,url): response = requests.get (url,headers=self.headers) with open ("temp.html", "wb") as f: f.write (response.content) return response.content def parse_data (self Data): # create element object data = data.decode () .replace ("" "") html = etree.HTML (data) el_list = html.xpath ('/ / li [@ class= "j_thread_list clearfix"] / div/div [2] / div [1] / div [1] / a') # print (len (el_list)) data_list = [] for el in el_list: temp = {} temp ['title'] = el.xpath (". / text ()") [0] temp ['link'] =' http://tieba.baidu.com' + el.xpath (". / @ href") [0] data_list.append (temp) # get the next page url try: next_url = 'https:' + html.xpath (' / / a [contains (text ()) "next page"] / @ href') [0] except: next_url = None return data_list,next_url def save_data (self,data_list): for data in data_list: print (data) def run (self): # url # headers next_url = self.url while True: # send request Get response data = self.get_data (self.url) # extract data (data and url for page flipping) data_list from the response Next_url = self.parse_data (data) self.save_data (data_list) print (next_url) # determine whether to terminate if next_url = = None: breakif _ _ name__ = ='_ main__': tieba = Tieba ("Podcast") tieba.run () 8. Use of etree.tostring function in lxml module
Run the following code to observe and compare the original string of html with the printout
From lxml import etreehtml_str = ""
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.