A case study of xpath data parsing in Python

2025-07-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Many readers are unfamiliar with the key points of xpath data parsing in Python, so this article summarizes them step by step, with complete examples, as a reference. Let's walk through the case study.

Basic concepts of xpath

xpath parsing is one of the most commonly used parsing methods. It is convenient, efficient, and highly general-purpose.

How xpath parsing works

1. Instantiate an etree object and load the page source data to be parsed into it.

2. Call the etree object's xpath method with an xpath expression to locate tags and capture their content.
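The two steps above can be sketched on a small in-memory page. This is a minimal illustration, assuming lxml is installed; the HTML string and variable names are made up for the example and do not come from any of the sites crawled later.

```python
from lxml import etree

# Hypothetical page source; in practice it comes from a file or an HTTP response.
page_text = """
<html>
  <head><title>Demo page</title></head>
  <body><p>Hello, xpath</p></body>
</html>
"""

# Step 1: instantiate an etree object and load the page source into it.
tree = etree.HTML(page_text)

# Step 2: call xpath() with an expression to locate tags and capture content.
titles = tree.xpath('//title/text()')
paragraphs = tree.xpath('//p/text()')

print(titles)
print(paragraphs)
```

Note that xpath() always returns a list here, even when only one node matches, which is why the later examples index the result with [0].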

Environment installation

pip install lxml

How to instantiate an etree object

from lxml import etree

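1. Load the source data from a local HTML file into the etree object:

etree.parse(filePath)

etree.parse uses an XML parser by default. For HTML that is not well-formed XML, pass an HTML parser: etree.parse(filePath, etree.HTMLParser())

2. Load source data obtained from the Internet into the object:

etree.HTML(page_text)

xpath('xpath expression')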

1. /: at the start of an expression it positions from the root node; between steps it represents a single level.

2. //: represents multiple levels; it can start positioning from any location.

3. Attribute location: //div[@class='song'], in general tag[@attrName='attrValue']

4. Index positioning: //div[@class='song']/p[3] (the index starts at 1)

5. Take the text:

/text() gets the direct text content of the tag

//text() gets all text content under the tag, including non-direct text

6. Take an attribute: /@attrName, for example img/@src
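The six rules above can be tried side by side on one small document. The HTML below is invented for illustration; only the div class 'song' is borrowed from the expressions in the list.

```python
from lxml import etree

# Hypothetical markup, written only to exercise each expression type.
html = """
<html><body>
  <div class="song">
    <p>first</p>
    <p>second</p>
    <p>third <b>bold</b> tail</p>
    <img src="/img/a.jpg" alt="cover"/>
  </div>
  <div class="other"><p>ignored</p></div>
</body></html>
"""
tree = etree.HTML(html)

# 1. "/" walks one level at a time, starting from the root node.
root_divs = tree.xpath('/html/body/div')

# 2. "//" matches at any depth, from any location.
all_p = tree.xpath('//p')

# 3. Attribute location: tag[@attrName="attrValue"].
song_divs = tree.xpath('//div[@class="song"]')

# 4. Index positioning is 1-based, so p[3] is the third <p>.
third_p = tree.xpath('//div[@class="song"]/p[3]')[0]

# 5. /text() returns only direct text; //text() returns all nested text.
direct_text = tree.xpath('//div[@class="song"]/p[3]/text()')
all_text = tree.xpath('//div[@class="song"]/p[3]//text()')

# 6. /@attrName returns attribute values.
src = tree.xpath('//div[@class="song"]/img/@src')

print(len(root_divs), len(all_p), len(song_divs))
print(direct_text, all_text, src)
```

The difference between the two text forms shows up in the third paragraph: the direct text skips the word inside the nested b tag, while //text() includes it.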

Example: crawling 58.com second-hand house listings with xpath

Complete code

from lxml import etree
import requests

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://xa.58.com/ershoufang/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each listing is a div directly under the section with class "list"
    div_list = tree.xpath('//section[@class="list"]/div')
    # The with block closes the file even if a listing raises an error
    with open('./58.com second-hand house.txt', 'w', encoding='utf-8') as fp:
        for div in div_list:
            # Relative expression: search only inside the current listing
            title = div.xpath('.//div[@class="property-content-title"]/h4/text()')[0]
            print(title)
            fp.write(title + '\n' + '\n')
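The loop above indexes [0] on every xpath result, which raises IndexError as soon as one listing has no matching title node. A small helper can guard against that. This is a sketch, not part of the original code: first_text is a hypothetical name, and the HTML only mimics the class names used in the example.

```python
from lxml import etree


def first_text(node, expr, default=''):
    """Return the first xpath match stripped of whitespace, or default if none."""
    matches = node.xpath(expr)
    return matches[0].strip() if matches else default


# Hypothetical markup: the second div has no title block at all.
html = """
<section class="list">
  <div><div class="property-content-title"><h4> Flat A </h4></div></div>
  <div><span>advert without a title</span></div>
</section>
"""
tree = etree.HTML(html)

titles = [
    first_text(div, './/div[@class="property-content-title"]/h4/text()',
               default='(no title)')
    for div in tree.xpath('//section[@class="list"]/div')
]
print(titles)
```

Whether to skip such entries or record a placeholder depends on what the output file is used for; the helper only makes the choice explicit.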

Example: parsing and downloading images with xpath

Complete code

import os
import requests
from lxml import etree

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    url = 'https://pic.netbian.com/4kmeinv/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Despite the name, li_list holds the <a> tag inside each <li>
    li_list = tree.xpath('//div[@class="slist"]/ul/li/a')
    if not os.path.exists('./piclibs'):
        os.mkdir('./piclibs')
    for li in li_list:
        detail_url = 'https://pic.netbian.com' + li.xpath('./img/@src')[0]
        detail_name = li.xpath('./img/@alt')[0] + '.jpg'
        # The text was decoded as ISO-8859-1 but the page is GBK: undo and redo the decoding
        detail_name = detail_name.encode('iso-8859-1').decode('GBK')
        detail_path = './piclibs/' + detail_name
        detail_data = requests.get(url=detail_url, headers=headers).content
        with open(detail_path, 'wb') as fp:
            fp.write(detail_data)
        # The original print message is truncated in the source; printing the file name is an assumption
        print(detail_name)
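The encode('iso-8859-1').decode('GBK') line in the code above repairs text that was decoded with the wrong character set. The round trip can be reproduced without any network access. The sample title below is hypothetical; any text that GBK can encode behaves the same way.

```python
# Hypothetical image title.
original = '高清壁纸'

# Bytes as a GBK-encoded page would send them.
raw_bytes = original.encode('gbk')

# What the program sees if those bytes are wrongly decoded as ISO-8859-1.
garbled = raw_bytes.decode('iso-8859-1')

# The repair: turn the garbled text back into the same bytes, then decode as GBK.
repaired = garbled.encode('iso-8859-1').decode('gbk')

print(repr(garbled))
print(repaired)
```

The repair works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost by the wrong decoding. An alternative, assuming the page really is GBK, is to set the encoding attribute of the requests response to 'gbk' before reading .text, which avoids the garbling in the first place.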

Example: crawling the names of cities nationwide with xpath

Complete code

import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://www.aqistudy.cn/historydata/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    # Hot cities: //div[@class="bottom"]/ul/li
    # All cities: //div[@class="bottom"]/ul/div[2]/li
    # The | operator returns the union of both node sets
    a_list = tree.xpath('//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li')
    with open('./citys.txt', 'w', encoding='utf-8') as fp:
        i = 0
        for a in a_list:
            city_name = a.xpath('.//a/text()')[0]
            fp.write(city_name + '\t')
            i = i + 1
            # Start a new line after every six city names
            if i == 6:
                i = 0
                fp.write('\n')
    print('crawled successfully')
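The city example relies on the xpath union operator | to collect two node sets with one call. The sketch below shows the operator on invented markup (the class names 'hot' and 'all' are hypothetical) and a slicing-based variant of the six-per-line layout. The rows helper is an assumption for illustration and, unlike the original loop, it leaves no trailing tab.

```python
from lxml import etree

# Hypothetical markup with two separate city lists.
html = """
<html><body>
  <div class="hot"><ul><li><a>Beijing</a></li><li><a>Shanghai</a></li></ul></div>
  <div class="all"><ul><li><a>Anshan</a></li><li><a>Baoding</a></li></ul></div>
</body></html>
"""
tree = etree.HTML(html)

# One call, two expressions: | merges both node sets in document order.
li_nodes = tree.xpath('//div[@class="hot"]/ul/li | //div[@class="all"]/ul/li')
names = [li.xpath('./a/text()')[0] for li in li_nodes]


def rows(items, per_row):
    """Join items into tab-separated lines holding at most per_row items each."""
    return ['\t'.join(items[i:i + per_row]) for i in range(0, len(items), per_row)]


print(names)
print('\n'.join(rows(names, 3)))
```

Because a union is a set, a node matched by both expressions appears only once. Two different li elements that happen to hold the same city name are still separate nodes, so a name listed in both sections would be written twice.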

Example: crawling resume templates with xpath

Complete code

import os
import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://sc.chinaz.com/jianli/free.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    }
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    tree = etree.HTML(page_text)
    a_list = tree.xpath('//div[@class="box col3 ws_block"]/a')
    if not os.path.exists('./resume template'):
        os.mkdir('./resume template')
    for a in a_list:
        # The href is prefixed with the scheme to build the detail page URL
        detail_url = 'https:' + a.xpath('./@href')[0]
        detail_page_text = requests.get(url=detail_url, headers=headers).content.decode('utf-8')
        detail_tree = etree.HTML(detail_page_text)
        # li[1] selects the first download link on the detail page
        detail_a_list = detail_tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li[1]/a')
        for download_a in detail_a_list:
            download_name = detail_tree.xpath('//div[@class="ppt_tit clearfix"]/h2/text()')[0]
            download_url = download_a.xpath('./@href')[0]
            download_data = requests.get(url=download_url, headers=headers).content
            download_path = './resume template/' + download_name + '.rar'
            with open(download_path, 'wb') as fp:
                fp.write(download_data)
            # The original print message is truncated in the source; printing the file name is an assumption
            print(download_name)
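The code above builds the detail URL by prefixing 'https:' to the href, which only works when the href starts with '//'. The standard library's urljoin handles that form as well as site-relative and absolute links. The href values below are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

# The listing page the hrefs were found on.
page_url = 'https://sc.chinaz.com/jianli/free.html'

# Hypothetical href values in three common forms.
hrefs = [
    '//sc.chinaz.com/jianli/example.htm',        # protocol-relative
    '/jianli/example.htm',                       # site-relative
    'https://sc.chinaz.com/jianli/example.htm',  # already absolute
]

# urljoin resolves each href against the page URL.
resolved = [urljoin(page_url, href) for href in hrefs]

for href, full in zip(hrefs, resolved):
    print(href, '->', full)
```

All three forms resolve to the same absolute URL, so the crawler keeps working if the site changes how it writes its links.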

That covers this case study of xpath data parsing in Python. I hope the examples shared here are helpful to you.
