What does the lxml module in Python refer to

2025-01-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

What does the lxml module in Python refer to? Many people who are new to it are unsure what it does or how to use it. This article summarizes what the module is for and how to use it together with xpath syntax to extract data from HTML and XML.

1. Understand the lxml module and xpath syntax

To extract specific content from HTML or XML text, we need to master the lxml module and xpath syntax.

The lxml module can use XPath rule syntax to quickly locate specific elements in HTML/XML documents and obtain node information (text content, attribute values).

XPath (XML Path Language) is a language for finding information in HTML/XML documents and can be used to traverse the elements and attributes in them.
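As a minimal sketch of how the two fit together, the snippet below parses a small HTML string with lxml and uses xpath to pull out text content and attribute values. The HTML string and the variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
text = """
<html><body>
  <a class="nav" href="/home">Home</a>
  <a class="nav" href="/about">About</a>
</body></html>
"""

# Parse the string into an Element object.
html = etree.HTML(text)

# Text content of every <a> element.
link_texts = html.xpath("//a/text()")
# Value of the href attribute of every <a> element.
link_urls = html.xpath("//a/@href")

print(link_texts)
print(link_urls)
```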

2. Installation and use of the XPath Helper plug-in for Google Chrome

XPath Helper lets you test xpath syntax rules against the current page in Google Chrome.

We take Windows as an example to install XPath Helper.

Installation of the XPath Helper plug-in:

1) Download the Chrome plug-in XPath Helper

It can be downloaded from the Chrome Web Store, or from the link below if it cannot be downloaded.

2) Change the file extension from .crx to .rar, then extract the archive to a folder with the same name

3) Drag the extracted folder onto the Chrome extensions page, with developer mode enabled

3. Node relationships in xpath

To learn xpath syntax, you need to understand the node relationships in xpath.

3.1 What is a node in xpath

Every tag in HTML and XML is called a node, and the topmost node is called the root node. Let's take XML as an example; HTML works the same way.

3.2 Node relationships in xpath

Author is the first sibling node of title.
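The relationships can be checked directly with lxml. The sketch below uses a hypothetical bookstore document (the element names and values are assumptions, not taken from the original example) and reads off the root, parent, child, and sibling nodes of title.

```python
from lxml import etree

# Hypothetical XML document; element names and values are assumptions.
xml = b"""<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

root = etree.fromstring(xml)                      # root node: <bookstore>
title = root.xpath("/bookstore/book/title")[0]

parent_tag = title.getparent().tag                # parent of title
children = [child.tag for child in title.getparent()]   # children of book
next_sibling_tag = title.getnext().tag            # first sibling after title
following = [el.tag for el in title.xpath("following-sibling::*")]

print(root.tag, parent_tag, children, next_sibling_tag, following)
```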

4. Xpath syntax: selecting nodes and extracting attributes or text content

1) XPath uses path expressions to select nodes or node sets in XML documents.

2) These path expressions are very similar to the paths we see in conventional computer file systems.

3) When you select a tag using the Chrome plug-in, the selected tag gains the attribute class="xh-highlight"

4.1 Syntax for locating nodes and extracting attributes or text content
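The standard XPath 1.0 path expressions are / (step from the root or from the parent), // (any descendant), . (current node), .. (parent node), @ (attribute), and text() (text content). A minimal sketch of each, run against a made-up HTML string:

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
text = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div id="main">
      <a href="/first">First</a>
      <p>Some <b>bold</b> text</p>
    </div>
  </body>
</html>
"""
html = etree.HTML(text)

title_text = html.xpath("/html/head/title/text()")  # absolute path from the root
hrefs = html.xpath("//a/@href")                     # // searches anywhere, @ selects an attribute
anchor = html.xpath("//a")[0]
parent_id = anchor.xpath("../@id")                  # .. selects the parent node
own_text = anchor.xpath("./text()")                 # . selects the current node
all_p_text = html.xpath("//p//text()")              # text of p and all of its descendants

print(title_text, hrefs, parent_id, own_text, all_p_text)
```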

5. Xpath syntax: selecting a specific node

You can get a specific node according to the attribute value of the tag, its subscript (position), and so on.

5.1 Syntax for selecting a specific node
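Conditions are written in square brackets after the node name. The sketch below shows a few common ones on a hypothetical list: an exact attribute value, the mere presence of an attribute, a partial match with contains(), and a match on text content.

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
html = etree.HTML("""
<ul>
  <li class="item active" id="first"><a href="/a">A</a></li>
  <li class="item"><a href="/b">B</a></li>
  <li><a href="/c">C</a></li>
</ul>
""")

# The attribute value must equal the whole string "item".
exact = html.xpath('//li[@class="item"]/a/text()')
# Any li that has a class attribute, whatever its value.
has_class = html.xpath('//li[@class]/a/text()')
# The attribute value only needs to contain "active".
partial = html.xpath('//li[contains(@class, "active")]/a/text()')
# Select by text content, then read an attribute.
by_text = html.xpath('//a[text()="C"]/@href')

print(exact, has_class, partial, by_text)
```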

5.2 Subscripts in xpath

In xpath, the position of the first element is 1

The position of the last element is last()

The penultimate one is last()-1.
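A short sketch of these positions on a hypothetical four-item list. Note the contrast with Python, where the list returned by xpath is indexed from 0.

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
html = etree.HTML("<ul><li>one</li><li>two</li><li>three</li><li>four</li></ul>")

first = html.xpath("//li[1]/text()")                # xpath positions start at 1
last = html.xpath("//li[last()]/text()")            # last element
penultimate = html.xpath("//li[last()-1]/text()")   # second to last element
first_two = html.xpath("//li[position()<3]/text()") # positions 1 and 2

# The Python list returned by xpath is still indexed from 0.
first_via_python = html.xpath("//li/text()")[0]

print(first, last, penultimate, first_two, first_via_python)
```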

6. Xpath syntax: selecting unknown nodes

HTML and XML elements whose names are not known in advance can be selected with wildcards.

6.1 Syntax for selecting unknown nodes
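XPath 1.0 provides three wildcards: * matches any element node, @* matches any attribute, and node() matches a node of any kind, including text. A minimal sketch on a made-up snippet:

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
html = etree.HTML('<div id="box" class="c">hello<p>one</p><span>two</span></div>')

# * matches any element: all element children of the div.
child_tags = [el.tag for el in html.xpath("//div/*")]
# @* matches any attribute: all attribute values of the div.
attr_values = html.xpath("//div/@*")
# node() matches any kind of node: the text "hello" plus the two elements.
nodes = html.xpath("//div/node()")

print(child_tags, attr_values, len(nodes))
```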

7. Installation and use of the lxml module

The lxml module is a third-party module, so it must be installed before it can be used.

7.1 Installation of the lxml module

It is used to extract data from the XML or HTML response content obtained by sending a request.

pip install lxml

7.2 HTML extracted by a crawler

Extract the text content of a tag

Extract the value of an attribute of a tag

For example, extract the value of the href attribute of an a tag to get a url, and then continue by sending a request to that url.

7.3 Use of the lxml module

1) Import the etree library of lxml

from lxml import etree

2) Use etree.HTML to convert the HTML string (bytes type or str type) into an Element object. The Element object has an xpath method, which returns a list of results.

html = etree.HTML(text)
ret_list = html.xpath("xpath grammar rule string")

3) The three kinds of list that the xpath method returns

Returns an empty list: no elements are located by the xpath syntax rule string

Returns a list of strings: the xpath rule string matches text content or the value of an attribute

Returns a list of Element objects: the xpath rule string matches tags, and each Element object in the list can in turn call xpath
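The three cases can be seen side by side in the following sketch, which uses a hypothetical one-link HTML string. Because an unmatched rule gives an empty list, indexing the result with [0] without checking raises IndexError.

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
html = etree.HTML('<div><a href="/x">X</a></div>')

empty = html.xpath("//table")       # nothing matches: empty list
strings = html.xpath("//a/@href")   # attribute values: list of strings
elements = html.xpath("//a")        # tags: list of Element objects

# An Element from the result list can run xpath again, relative to itself.
nested = elements[0].xpath("./text()")

print(empty, strings, elements, nested)
```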

import requests
from lxml import etree


class Tieba(object):

    def __init__(self, name):
        self.url = "https://tieba.baidu.com/f?ie=utf-8&kw={}".format(name)
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
        }

    def get_data(self, url):
        # send the request and keep a local copy of the page for inspection
        response = requests.get(url, headers=self.headers)
        with open("temp.html", "wb") as f:
            f.write(response.content)
        return response.content

    def parse_data(self, data):
        # Strip HTML comment markers so that content wrapped in comments is parsed.
        # The replace arguments are garbled in the source; the comment markers
        # used here are an assumption.
        data = data.decode().replace("<!--", "").replace("-->", "")
        # create element object
        html = etree.HTML(data)
        el_list = html.xpath('//li[@class="j_thread_list clearfix"]/div/div[2]/div[1]/div[1]/a')
        # print(len(el_list))
        data_list = []
        for el in el_list:
            temp = {}
            temp['title'] = el.xpath("./text()")[0]
            temp['link'] = 'http://tieba.baidu.com' + el.xpath("./@href")[0]
            data_list.append(temp)
        # get the next page url; "下一页" is the Chinese link text for "next page"
        try:
            next_url = 'https:' + html.xpath('//a[contains(text(),"下一页")]/@href')[0]
        except IndexError:
            # xpath returned an empty list: there is no next page
            next_url = None
        return data_list, next_url

    def save_data(self, data_list):
        for data in data_list:
            print(data)

    def run(self):
        # url
        # headers
        next_url = self.url
        while True:
            # send request, get response (use next_url so that the page actually advances)
            data = self.get_data(next_url)
            # extract data from the response (data and url for page flipping)
            data_list, next_url = self.parse_data(data)
            self.save_data(data_list)
            print(next_url)
            # determine whether to terminate
            if next_url is None:
                break


if __name__ == '__main__':
    tieba = Tieba("Podcast")
    tieba.run()

8. Use of the etree.tostring function in the lxml module

Run the following code, then observe and compare the original HTML string with the printed output

from lxml import etree
html_str = ""
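A minimal sketch of that comparison, assuming a deliberately incomplete HTML fragment (the fragment below is made up and is not the original article's string): it has no html or body tags, and its second li is never closed. etree.HTML repairs the structure while parsing, and etree.tostring serializes the repaired tree to bytes.

```python
from lxml import etree

# Hypothetical, deliberately incomplete HTML: no <html>/<body>,
# and the second <li> is never closed.
html_str = """<div><ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link2.html">second item</a>
</ul></div>"""

html = etree.HTML(html_str)

# tostring returns bytes; decode to get a printable str.
handled_html_str = etree.tostring(html).decode()
print(handled_html_str)
```

The printed output contains the html and body tags and the closing li tag that the input lacked. Xpath rules should therefore be written against the repaired structure rather than the raw string.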
