
How to use BeautifulSoup4 to modify web page content in Python

2025-04-03 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31

This article shows how to use BeautifulSoup4 in Python to modify the content of a web page: finding tags, reading their attributes, and replacing them with new markup.

The example comes from a small project that needs to crawl the resource data referenced by a page and save it locally, keep a copy of the original HTML source file, and then modify the HTML page by replacing some of its tags.

For this kind of HTML manipulation, the BeautifulSoup4 library is a very convenient choice.

The HTML code for the sample is as follows:

The sample mainly consists of <a> tags, each with an <img> tag embedded in it. An <a> tag with class="videoslide" marks content that actually plays as a video. Every <a> tag with class="videoslide" needs to be replaced with the player's markup (a <div> wrapping a <video> tag), and every <a> tag without class="videoslide" needs to be replaced with a <figure> tag.

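That is to say, an <a class="videoslide"> tag wrapping an <img> is replaced with a <div class="video"> player whose <video> element carries this fallback text:

Your browser does not support H5 video, please use a Chrome/Firefox/Edge browser.

An <a> tag without that class is replaced with a <figure> tag:

<figure>
<img src="image-address_compressed.jpg" data-zy-media-id="image-address.jpg">
<figcaption>Text description (if any)</figcaption>
</figure>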
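The split between video links and picture links described above can be sketched with CSS selectors. The sample markup here is hypothetical (it is not the article's own sample HTML), and the built-in html.parser is used only so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one video link and one picture link.
SAMPLE = (
    '<p>'
    '<a class="videoslide" href="v/1.jpg">'
    '<img src="v/1_compressed.jpg" data-zy-media-id="1"></a>'
    '<a href="p/2.jpg"><img src="p/2_compressed.jpg"></a>'
    '</p>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")

# Links that carry class="videoslide" are treated as videos,
# every other link as a picture.
video_links = soup.select("a.videoslide")
picture_links = soup.select("a:not(.videoslide)")

video_hrefs = [a.get("href") for a in video_links]
picture_hrefs = [a.get("href") for a in picture_links]

# class is a multi-valued attribute, so get('class') yields a list
# when the attribute is present and None when it is missing.
first_class = video_links[0].get("class")
print(video_hrefs, picture_hrefs, first_class)
```

Because get('class') returns a list rather than a string, code that compares the class name has to look at an element of that list, or use a selector as shown here.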

Here, the tags are found with BeautifulSoup4's select() method, each tag's attribute values are read with the get() method, and the tag is replaced with replace_with() (replaceWith() is the older, deprecated spelling of the same method).
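The three calls named above can be tried on a very small document first. This is a minimal sketch on hypothetical markup, and it builds the replacement with new_tag() rather than by parsing a second HTML string, which is a different technique from the one used in the article's code.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document, not the article's sample.
soup = BeautifulSoup(
    '<p><a href="pics/cat.jpg"><img src="pics/cat_small.jpg"></a></p>',
    "html.parser",
)

a_tag = soup.select("a")[0]   # select() returns a list of matching tags
href = a_tag.get("href")      # get() returns the attribute value, or None
img = a_tag.find("img")       # the <img> nested inside the link

# Build the replacement element directly in the same tree.
figure = soup.new_tag("figure")
figure.append(soup.new_tag("img", src=img.get("src")))

# replace_with() swaps the <a> element for the new <figure> element.
a_tag.replace_with(figure)

result = str(soup)
print(result)
```

Building tags with new_tag() avoids wrapping a fragment in the extra <html> and <body> elements that a full parse with lxml adds, at the cost of slightly more verbose code.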

First install the BeautifulSoup4 library. The code below uses the lxml parser, which is a separate package that BeautifulSoup4 does not install by itself, so lxml needs to be installed as well.

    pip install beautifulsoup4
    pip install lxml
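Beautiful Soup can also parse with Python's built-in html.parser, so one cautious option is to fall back to it when lxml is missing. The sketch below assumes a hypothetical helper named make_soup; it is not part of the article's code.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise with html.parser.

    make_soup is a hypothetical helper used only for this illustration.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Beautiful Soup raises FeatureNotFound when the requested
        # parser is not available in the current environment.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup('<a href="x.jpg">link</a>')
print(soup.a.get("href"))
```

The two parsers can build slightly different trees for incomplete documents (lxml wraps fragments in <html> and <body>), so results that depend on soup.contents may differ between them.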

The specific code implementation is as follows:

    import os

    from bs4 import BeautifulSoup

    # The sample HTML assigned here did not survive in the source text.
    htmlstr = '''...'''


    def procHtml(htmlstr):
        soup = BeautifulSoup(htmlstr, 'lxml')
        a_tags = soup.select('a')
        for a_tag in a_tags:
            a_tag_src = a_tag.get('href')
            a_tag_filename = os.path.basename(a_tag_src)
            a_tag_path = os.path.join('src', a_tag_filename)
            a_tag['href'] = a_tag_path
            # First node inside the <a> tag; expected to be its <img>.
            next_tag = a_tag.next_element
            # Decide whether this is a video or a picture: an <a> tag with
            # class="videoslide" is a video, anything else is a picture.
            if a_tag.get('class') and 'videoslide' == a_tag.get('class')[0]:
                # Process the video file.
                media_id = next_tag.get('data-zy-media-id')
                if media_id:
                    media_url = 'http://www.test.com/travel/show_media/' + str(media_id) + '.mp4'
                    media_filename = os.path.basename(media_url)
                    media_path = os.path.join('src', media_filename)
                    # Replace the <a> tag with a div.video tag. The tags in this
                    # string were stripped from the source; what follows is a
                    # minimal reconstruction, not the author's exact markup.
                    video_html = ('<div class="video"><video controls src="' + media_path + '">'
                                  'Your browser does not support H5 video, '
                                  'please use the Chrome / Firefox / Edge browser.'
                                  '</video></div>')
                    video_soup = BeautifulSoup(video_html, 'lxml')
                    a_tag.replace_with(video_soup.div)
            else:
                # Get the picture information.
                if 'img' == next_tag.name:
                    img_src = next_tag.get('src')
                    # Skip sources that are already local: data:image and file:
                    if img_src.find('data:image') == -1 and img_src.find('file:') == -1:
                        img_filename = os.path.basename(img_src)
                        img_path = os.path.join('src', img_filename)
                        # Replace the <a> tag with a <figure> tag (markup
                        # reconstructed, as above).
                        figcaption = ''
                        figure_html = ('<figure><img src="' + img_path + '" data-zy-media-id="' + a_tag_path + '">'
                                       '<figcaption>' + figcaption + '</figcaption></figure>')
                        figure_soup = BeautifulSoup(figure_html, 'lxml')
                        a_tag.replace_with(figure_soup.figure)
        html_content = soup.contents[0]
        return html_content


    if __name__ == '__main__':
        pro_html_str = procHtml(htmlstr)
        print(pro_html_str)
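The two path rules used in the code above (skip sources that contain data:image or file:, and map everything else to the src folder by file name) can be isolated and checked on their own. The helper names is_embedded_or_local and to_local_path are hypothetical and exist only in this sketch.

```python
import os


def is_embedded_or_local(src):
    """Return True when src contains 'data:image' or 'file:'.

    Hypothetical helper; it mirrors the two find() checks in procHtml.
    """
    return "data:image" in src or "file:" in src


def to_local_path(url, folder="src"):
    """Map a URL to folder/<file name>, as the href and src rewriting does.

    Hypothetical helper; 'src' is the folder name used in the article's code.
    """
    return os.path.join(folder, os.path.basename(url))


for candidate in (
    "http://www.test.com/travel/a.jpg",
    "data:image/png;base64,AAAA",
    "file:///tmp/a.jpg",
):
    if is_embedded_or_local(candidate):
        print("skip:", candidate[:20])
    else:
        print("rewrite:", to_local_path(candidate))
```

Note that os.path.basename() works on the raw string, so a URL with a query string would keep that query string in the resulting file name.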

Results:

Your browser does not support H5 video, please use Chrome / Firefox / Edge browser.

That concludes this walkthrough of how to use BeautifulSoup4 in Python to modify web page content. Thank you for reading!

