
How to use Python crawler to crawl website pictures

2025-01-31 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to use a Python crawler to download pictures from a website. It should serve as a useful reference; interested readers can follow along, and hopefully you will learn something from it. Now let the editor walk you through it.

The Python 3 script mainly uses requests to fetch pages and BeautifulSoup to parse out the image URLs, which is basically all it takes to crawl pictures.
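
As a rough sketch of that pattern (the URL below is a placeholder, not the site crawled later): fetch the page with requests, set the encoding, hand the HTML to BeautifulSoup, and pick out the tags you need.

from bs4 import BeautifulSoup
import requests

# Minimal sketch of the requests + BeautifulSoup pattern; the URL is a placeholder.
url = 'http://example.com/gallery.html'
headers = {"User-Agent": "Mozilla/5.0"}      # pretend to be a normal browser
resp = requests.get(url, headers=headers)
resp.encoding = 'gb2312'                     # match the site's encoding if needed
soup = BeautifulSoup(resp.text, 'lxml')
for img in soup.find_all('img'):             # every <img> tag on the page
    print(img.get('src'))                    # the image URL, possibly relative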

Of course, most people's first crawler targets pictures of pretty girls, and I am no exception. So first I picked a random website and crawled some photos from it.

from bs4 import BeautifulSoup
import requests

if __name__ == '__main__':
    url = 'http://www.27270.com/tag/649.html'
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"}
    req = requests.get(url=url, headers=headers)
    req.encoding = 'gb2312'                   # the site is encoded in gb2312
    html = req.text
    bf = BeautifulSoup(html, 'lxml')
    # the article links all sit inside the div with class 'w1200 oh'
    targets_url = bf.find('div', class_='w1200 oh').find_all('a', target='_blank')
    for each in targets_url:
        img_req = requests.get(url=each.get('href'), headers=headers)
        img_req.encoding = 'gb2312'
        html = img_req.text
        bf = BeautifulSoup(html, 'lxml')
        img_url = bf.find('div', class_='articleV4Body').find('img')['src']
        name = each.img.get('alt') + '.jpg'   # use the alt text as the file name
        path = r'C:\Users\asus\Desktop\New folder'
        file_name = path + '\\' + name
        try:
            req1 = requests.get(img_url, headers=headers)
            f = open(file_name, 'wb')
            f.write(req1.content)
            f.close()
        except:
            print("some error")

One situation I ran into was that requests to img_url kept failing with a "failed to connect to host" error. At first I thought it was some anti-crawling measure, but when I opened one of the URLs on its own in the browser it loaded without trouble, which left me puzzled. In the end a more experienced friend suggested I try each img_url one by one, in case a particular URL was the problem. Sure enough, the second URL among the generated img_url values could not be accessed, which is why the script kept erroring out. I should have tested a few more instead of judging from one case; a single leaf had blocked my view of the forest.
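
A small sketch of that kind of debugging (assuming the image URLs have already been collected into a list called img_urls, and headers comes from the crawler above) is to request each one separately and print which ones fail:

import requests

# Probe each collected image URL individually to find the broken one.
# 'img_urls' and 'headers' are assumed to come from the crawler above.
for i, u in enumerate(img_urls, 1):
    try:
        r = requests.get(u, headers=headers, timeout=10)
        print(i, r.status_code, u)
    except requests.RequestException as e:
        print(i, 'FAILED:', u, e)    # this is the offending URL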

I am also a Naruto fan, so after getting the basic method down I decided to try crawling Naruto wallpapers. I found a website for that too:

http://desk.zol.com.cn/dongman/huoyingrenzhe/

You can see that the Naruto pictures are organized into galleries, so downloading them takes a few more steps than before.

Looking at the page source, it is easy to see that the gallery links all sit inside li tags with class='photo-list-padding', and that the href values are relative (incomplete) URLs.

Click on one of the links; the http://desk.zol.com.cn/dongman/huoyingrenzhe/ prefix is filled in automatically by the browser, but in the code you have to complete the URL yourself.
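
If you prefer not to concatenate strings by hand (the script below simply concatenates), the standard library's urljoin can complete such relative links; a small sketch, where the relative href is just an illustrative example:

from urllib.parse import urljoin

base = 'http://desk.zol.com.cn/'
relative_href = '/dongman/huoyingrenzhe/'    # hypothetical relative href from an <a> tag
print(urljoin(base, relative_href))          # -> http://desk.zol.com.cn/dongman/huoyingrenzhe/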

On that page you can see the download address of the picture and the link to the next picture in the gallery.

Now that we understand how the site organizes its pictures, we can start writing the code. After filtering out the gallery links, we follow each gallery link to find the download address of the first picture together with the link to the second picture; the second picture's page gives us its download address and the link to the third picture, and so on until the gallery ends, at which point we move on to the next gallery, until every gallery has been downloaded. The code is below; to keep the loop simple, the image-downloading logic is wrapped in a download function and the URL parsing in a parses_picture function.

from bs4 import BeautifulSoup
import requests

def download(img_url, headers, n):
    # download a single picture and save it under a numbered file name
    req = requests.get(img_url, headers=headers)
    name = '%s' % n + '=' + img_url[-15:]
    path = r'C:\Users\asus\Desktop\Naruto wallpaper 1'
    file_name = path + '\\' + name
    f = open(file_name, 'wb')
    f.write(req.content)
    f.close()

def parses_picture(url, headers, n):
    # parse one picture page: download the image, then recurse to the next picture
    url = r'http://desk.zol.com.cn/' + url
    img_req = requests.get(url, headers=headers)
    img_req.encoding = 'gb2312'
    html = img_req.text
    bf = BeautifulSoup(html, 'lxml')
    try:
        img_url = bf.find('div', class_='photo').find('img').get('src')
        download(img_url, headers, n)
        url1 = bf.find('div', id='photo-next').a.get('href')
        parses_picture(url1, headers, n)
    except:
        print(u'%s photo gallery ends' % n)

if __name__ == '__main__':
    url = 'http://desk.zol.com.cn/dongman/huoyingrenzhe/'
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"}
    req = requests.get(url=url, headers=headers)
    req.encoding = 'gb2312'
    html = req.text
    bf = BeautifulSoup(html, 'lxml')
    targets_url = bf.find_all('li', class_='photo-list-padding')
    n = 1
    for each in targets_url:
        url = each.a.get('href')
        parses_picture(url, headers, n)
        n = n + 1

One situation I ran into here was that at the end of every gallery an error was raised, because the link to the next picture could not be found, so I added the try statement to catch it and let the program keep going. BeautifulSoup really is simpler than regular expressions; you can easily find the information you want through tag attributes.
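
As a side note, parses_picture calls itself once per picture, so a very long gallery could in principle hit Python's recursion limit. A sketch of an equivalent while loop (reusing the download helper and the same tags as the code above, so the same assumptions apply) avoids that and keeps the end-of-gallery check explicit:

from bs4 import BeautifulSoup
import requests

def parses_picture_iter(url, headers, n):
    # Iterative sketch: follow the "next picture" link until it runs out.
    # Relies on the download() helper and the page structure from the code above.
    while url:
        page = requests.get('http://desk.zol.com.cn/' + url, headers=headers)
        page.encoding = 'gb2312'
        bf = BeautifulSoup(page.text, 'lxml')
        photo_div = bf.find('div', class_='photo')
        if photo_div is None:                 # layout changed or gallery ended
            break
        download(photo_div.find('img').get('src'), headers, n)
        next_div = bf.find('div', id='photo-next')
        url = next_div.a.get('href') if next_div and next_div.a else None
    print(u'%s photo gallery ends' % n)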

Download results

Thank you for reading this article carefully. I hope this article, "How to use Python crawler to crawl website pictures", has been helpful. Please continue to support us and follow our industry information channel; there is plenty more to learn!
