
How to parse HTML with lxml and pyquery

2025-10-25 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article explains how to parse HTML with lxml and pyquery. The methods introduced here are simple, fast and practical.

lxml

First, let's look at lxml. Several commonly used HTML parsing libraries rely on it: pyquery is built on lxml, and BeautifulSoup can use lxml as its parser.

lxml exposes three classes that matter for HTML parsing: _Element, _ElementTree and HtmlElement.

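_Element

Getting an _Element

    from lxml import etree

    # Assumed sample markup: the original HTML was lost in extraction and only
    # the text "first second third" survives. It is shaped to fit the selectors used below.
    text = '''
    <ul>
        <li class="item-0" rel="li1"><a href="link1.html">first</a></li>
        <li class="item-1" rel="li2"><a href="link2.html" rel="a2">second</a></li>
        <li class="item-0" rel="li3"><a href="link3.html">third</a></li>
    </ul>
    '''

    # lxml.etree._Element
    element = etree.HTML(text)

Common _Element methods

    cssselect(expr)   # select nodes with a CSS selector (needs the cssselect package)
    find(path)        # first match for a tag name or path
    findall(path)     # all matches for a tag name or path
    get(key)          # value of one attribute
    items()           # all attributes as (name, value) pairs
    keys()            # all attribute names
    values()          # all attribute values
    getchildren()     # child nodes (deprecated; list(element) is preferred)
    getparent()       # parent node
    getnext()         # next sibling node
    getprevious()     # previous sibling node
    iter(tag)         # iterate over matching nodes in the subtree
    xpath(path)       # select nodes with an XPath expression

_Element example

    from lxml import etree

    # Assumed sample markup (the original HTML was lost in extraction).
    text = '''
    <ul>
        <li class="item-0" rel="li1"><a href="link1.html">first</a></li>
        <li class="item-1" rel="li2"><a href="link2.html" rel="a2">second</a></li>
        <li class="item-0" rel="li3"><a href="link3.html">third</a></li>
    </ul>
    '''
    element = etree.HTML(text)

    # CSS selector: li nodes with the class item-0
    lis = element.cssselect("li.item-0")
    for li in lis:
        print(li.get("class"))   # class attribute
        print(li.items())        # list of (name, value) tuples
        print(li.keys())         # all attribute names
        print(li.values())       # all attribute values
    print("-")

    ass = element.cssselect("li a")
    for a in ass:
        print(a.text)            # text of the node
    print("-")

    # first li node; a bare "li" would only match direct children of <html>
    li = element.find(".//li")
    # all li nodes
    lis = element.findall(".//li")
    # all a nodes
    lias = element.iter("a")
    for lia in lias:
        print(lia.get("href"))

    # text inside the a nodes
    textStr = element.itertext("a")
    for ts in textStr:
        print(ts)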
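The navigation methods in the list above (getparent, getnext, getprevious, items) are not exercised by the example, so here is a minimal sketch of them. The markup and the variable names are assumptions made for illustration; only lxml calls named in the list are used.

```python
from lxml import etree

# Assumed sample markup, for illustration only.
html = (
    '<ul>'
    '<li class="item-0"><a href="link1.html">first</a></li>'
    '<li class="item-1"><a href="link2.html">second</a></li>'
    '<li class="item-2"><a href="link3.html">third</a></li>'
    '</ul>'
)

root = etree.HTML(html)                 # the parser wraps the fragment in <html><body>
second = root.findall(".//li")[1]       # the middle li

parent_tag = second.getparent().tag                 # tag of the enclosing element
prev_class = second.getprevious().get("class")      # class of the sibling before it
next_class = second.getnext().get("class")          # class of the sibling after it
child_tags = [child.tag for child in second]        # iterating an element yields its children
attrs = dict(second.items())                        # attributes as a dict

print(parent_tag, prev_class, next_class, child_tags, attrs)
```

Iterating over the element directly is used here in place of getchildren(), which lxml keeps only for compatibility.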

XPath is covered separately below.

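_ElementTree

Getting an _ElementTree

    from io import StringIO
    from lxml import etree

    # Assumed sample markup (the original HTML was lost in extraction).
    text = '''
    <ul>
        <li class="item-0" rel="li1"><a href="link1.html">first</a></li>
        <li class="item-1" rel="li2"><a href="link2.html" rel="a2">second</a></li>
        <li class="item-0" rel="li3"><a href="link3.html">third</a></li>
    </ul>
    '''
    parser = etree.HTMLParser()

    # lxml.etree._ElementTree
    elementTree = etree.parse(StringIO(text), parser)
    # you can also read directly from a file
    # elementTree = etree.parse(r'F:\tmp\etree.html', parser)

Common _ElementTree methods

    find(path)
    findall(path)
    iter(tag)
    xpath(path)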

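The _ElementTree methods behave essentially the same as the _Element methods of the same name.

The main difference is that find and findall on an _ElementTree search from the root of the document, so in practice they are called with path expressions such as .//li (lxml's ElementPath, a subset of XPath) rather than bare tag names.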
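A small sketch of that relationship, under the assumption of a two-item list as sample markup: the tree wraps a root element, and a path search on the tree finds the same node as the same search on its root, while a bare tag name only looks at direct children.

```python
from io import StringIO
from lxml import etree

# Assumed sample markup, for illustration only.
html = "<ul><li class='item-0'>first</li><li class='item-1'>second</li></ul>"

tree = etree.parse(StringIO(html), etree.HTMLParser())   # an _ElementTree
root = tree.getroot()                                    # the <html> _Element

tree_hit = tree.find(".//li")     # path search starting at the tree's root
elem_hit = root.find(".//li")     # the same search on the root element
bare_hit = root.find("li")        # bare tag name: direct children of <html> only

print(type(tree).__name__, root.tag)
print(tree_hit.get("class"), elem_hit.get("class"), bare_hit)
```

The bare-name lookup returns None because the li elements sit under html/body/ul, not directly under html.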

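_ElementTree example

    from io import StringIO
    from lxml import etree

    # Assumed sample markup (the original HTML was lost in extraction).
    text = '''
    <ul>
        <li class="item-0" rel="li1"><a href="link1.html">first</a></li>
        <li class="item-1" rel="li2"><a href="link2.html" rel="a2">second</a></li>
        <li class="item-0" rel="li3"><a href="link3.html">third</a></li>
    </ul>
    '''
    parser = etree.HTMLParser()
    elementTree = etree.parse(StringIO(text), parser)

    lis = elementTree.iter("li")
    for li in lis:
        print(type(li))
    print("-")

    # .//li avoids the FutureWarning lxml emits for tree searches that start with "/"
    firstLi = elementTree.find(".//li")
    print(type(firstLi))
    print(firstLi.get("class"))
    print("-")

    ass = elementTree.findall(".//li/a")
    for a in ass:
        print(a.text)

HtmlElement

Getting an HtmlElement

    import lxml.html

    # Assumed sample markup (the original HTML was lost in extraction).
    text = '''
    <ul>
        <li class="item-0" rel="li1"><a href="link1.html">first</a></li>
        <li class="item-1" rel="li2"><a href="link2.html" rel="a2">second</a></li>
        <li class="item-0" rel="li3"><a href="link3.html">third</a></li>
    </ul>
    '''

    # lxml.html.HtmlElement
    htmlElement = lxml.html.fromstring(text)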

HtmlElement inherits from etree.ElementBase and HtmlMixin, and etree.ElementBase inherits from _Element.

Because HtmlElement inherits from _Element, it can use all the _Element methods described above. It can also use the methods defined in HtmlMixin.
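The inheritance chain can be checked directly. This is a minimal sketch; the one-line markup and the variable names are assumptions for illustration.

```python
import lxml.html
from lxml import etree

# Assumed sample markup, for illustration only.
node = lxml.html.fromstring("<p class='note'>hello</p>")

# HtmlElement derives from both HtmlMixin and etree.ElementBase ...
derives_from_mixin = issubclass(lxml.html.HtmlElement, lxml.html.HtmlMixin)
derives_from_base = issubclass(lxml.html.HtmlElement, etree.ElementBase)
# ... and ElementBase derives from _Element.
base_is_element = issubclass(etree.ElementBase, etree._Element)

# So one node offers both the _Element API and the HtmlMixin API.
from_element_api = node.get("class")        # _Element method
from_mixin_api = node.text_content()        # HtmlMixin method

# A node parsed with etree.HTML is a plain _Element without the HtmlMixin methods.
plain = etree.HTML("<p>hello</p>")
plain_has_mixin = isinstance(plain, lxml.html.HtmlElement)

print(derives_from_mixin, derives_from_base, base_is_element)
print(from_element_api, from_mixin_api, plain_has_mixin)
```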

Common HtmlMixin methods

    find_class(class_name)    # nodes with the given class name
    get_element_by_id(id)     # node with the given id
    text_content()            # text content of the node and its descendants
    cssselect(expr)           # nodes matching a CSS selector
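A short sketch of the HtmlMixin methods listed above, run against assumed sample markup (the id and class values are made up for illustration). cssselect is left out because it needs the separate cssselect package.

```python
import lxml.html

# Assumed sample markup, for illustration only.
html = (
    '<div id="menu"><ul>'
    '<li class="item-0 active"><a href="link1.html">first</a></li>'
    '<li class="item-1"><a href="link2.html">second</a></li>'
    '</ul></div>'
)

doc = lxml.html.fromstring(html)            # the <div> HtmlElement

active = doc.find_class("active")           # list of nodes whose class list contains "active"
menu = doc.get_element_by_id("menu")        # raises KeyError when the id is missing ...
missing = doc.get_element_by_id("nope", None)   # ... unless a default is supplied
all_text = doc.text_content()               # text of the node and all its descendants

print([li.get("class") for li in active], menu.tag, missing, all_text)
```

Passing a default to get_element_by_id is a convenient way to avoid a try/except when an id may be absent.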

XPath

XPath is very powerful, and _Element, _ElementTree and HtmlElement all accept XPath expressions, so it is introduced last.

    Expression           Description
    /                    Selects from the root node (absolute path)
    //                   Selects matching descendants of the current node, wherever they are (relative path)
    .                    Selects the current node
    ..                   Selects the parent of the current node
    @                    Selects attributes
    *                    Wildcard: selects all element nodes, whatever their name
    @*                   Selects all attributes
    [@attrib]            Selects all elements that have the given attribute
    [@attrib='value']    Selects all elements whose given attribute has the given value
    [tag]                Selects all elements that have a direct child with the given tag
    [tag='text']         Selects all elements that have a child with the given tag whose text content is text

    Axis                 Example                                  Description
    ancestor             xpath('./ancestor::*')                   Selects all ancestors of the current node
    ancestor-or-self     xpath('./ancestor-or-self::*')           Selects all ancestors of the current node and the node itself
    attribute            xpath('./attribute::*')                  Selects all attributes of the current node
    child                xpath('./child::*')                      Selects all child nodes of the current node
    descendant           xpath('./descendant::*')                 Selects all descendants of the current node (children, grandchildren and so on)
    following            xpath('./following::*')                  Selects all nodes after the closing tag of the current node
    following-sibling    xpath('./following-sibling::*')          Selects the sibling nodes after the current node
    parent               xpath('./parent::*')                     Selects the parent of the current node
    preceding            xpath('./preceding::*')                  Selects all nodes before the start tag of the current node
    preceding-sibling    xpath('./preceding-sibling::*')          Selects the sibling nodes before the current node
    self                 xpath('./self::*')                       Selects the current node
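To make the expressions and axes above concrete, here is a minimal sketch that applies a few of them. The markup and its rel values are assumptions chosen for illustration.

```python
from lxml import etree

# Assumed sample markup, for illustration only.
html = (
    '<ul>'
    '<li rel="li1"><a href="link1.html">first</a></li>'
    '<li rel="li2"><a href="link2.html" rel="a2">second</a></li>'
    '<li rel="li3">third</li>'
    '</ul>'
)

root = etree.HTML(html)
second = root.xpath("//li[@rel='li2']")[0]      # [@attrib='value'] predicate

# Axes, evaluated relative to the second li
ancestors = [e.tag for e in second.xpath("./ancestor::*")]
following = [e.get("rel") for e in second.xpath("./following-sibling::*")]
preceding = [e.get("rel") for e in second.xpath("./preceding-sibling::*")]
parent_tag = second.xpath("./parent::*")[0].tag

# Predicates from the first table
hrefs_with_rel = root.xpath("//a[@rel]/@href")                      # [@attrib] plus @ selection
li_by_child_text = [e.get("rel") for e in root.xpath("//li[a='second']")]   # [tag='text']

print(ancestors, following, preceding, parent_tag)
print(hrefs_with_rel, li_by_child_text)
```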

In many cases an XPath expression can be copied directly from the browser's developer tools.
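lxml can produce a similar absolute path itself through the getpath method of _ElementTree, which is handy for checking an expression copied from elsewhere. The sketch below assumes a three-item list as sample markup.

```python
from io import StringIO
from lxml import etree

# Assumed sample markup, for illustration only.
html = "<ul><li>first</li><li>second</li><li>third</li></ul>"

tree = etree.parse(StringIO(html), etree.HTMLParser())
second = tree.xpath("//li")[1]

path = tree.getpath(second)          # absolute XPath of the node within this tree
round_trip = tree.xpath(path)[0]     # evaluating the path leads back to the same node

print(path, round_trip.text)
```

Paths generated this way are positional, so they tend to break when the page layout changes; a predicate on an attribute is usually more robust.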

Example

    # On lxml 5.2 and later this import needs the separate lxml_html_clean package.
    from lxml.html.clean import Cleaner
    from lxml import etree

    # Assumed sample markup (the original HTML was lost in extraction).
    text = '''
    <ul>
        <li class="item-0" rel="li1"><a href="link1.html">first</a></li>
        <li class="item-1" rel="li2"><a href="link2.html" rel="a2">second</a></li>
        <li class="item-0" rel="li3"><a href="link3.html">third</a></li>
    </ul>
    '''

    # remove css and scripts
    cleaner = Cleaner(style=True, scripts=True, page_structure=False, safe_attrs_only=False)
    print(cleaner.clean_html(text))

    # _Element
    element = etree.HTML(text)

    # all text nodes, returned as a list
    print(element.xpath('//text()'))
    # the text of the document as a single string
    print(element.xpath('string()'))

    # find and findall only accept relative paths, starting with .//
    print(element.findall('.//a[@rel]'))
    print(element.find('.//a[@rel]'))

    # a nodes that have a rel attribute
    print(element.xpath('//a[@rel]'))
    # first li node under each ul element; a list, because there may be several ul elements
    print(element.xpath("//ul/li[1]"))
    # li nodes under ul whose rel attribute is li2
    print(element.xpath("//ul/li[@rel='li2']"))
    # second-to-last li node under ul
    print(element.xpath("//ul/li[last()-1]"))
    # first two li nodes under ul
    print(element.xpath("//ul/li[position()<3]"))


© 2024 shulou.com SLNews company. All rights reserved.
