
How to Crawl Web Page Images with Python

2025-02-23 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/01 Report --

This article introduces how to crawl web page images with Python. Many people run into trouble with this in real-world cases, so let the editor walk you through how to handle these situations. I hope you will read it carefully and get something out of it!

Crawling images from Enterdesk

Today we are going to crawl this website:

https://tu.enterdesk.com/

This site has plenty of crawlable resources, but I'll just write one example; the others can be written following the same idea.

First of all, let's analyze the image acquisition process of this website.

I go to the picture library and pick a tag at random; I'll choose "pets".

Why not visit a page at random and see whether we can get pictures: https://tu.enterdesk.com/chongwu/6.html

Sure enough, it works.
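As a quick aside, the list pages differ only in the trailing number, so a %d format string can generate any page URL. A tiny sketch; page 6 is just the example visited above:

    # Build a page URL from the pattern observed above.
    target_url = 'https://tu.enterdesk.com/chongwu/%d.html'
    print(target_url % 6)  # -> https://tu.enterdesk.com/chongwu/6.html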

The question now is how to find the number of the last page. One option is to loop indefinitely until a page has no image tags (a sketch of that approach follows below); the other is to find the page number in the page source, which means checking whether there are page-number buttons. I scrolled past too quickly just now; slowing down, the page numbers are indeed there.
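For reference, here is a minimal sketch of the first option. It assumes, as the parsing code later in this article does, that the list items carry the class egeli_pic_li, and it further assumes that a page past the end returns none of them; that second assumption is untested here, which is one reason the page-number approach is the safer choice.

    import requests
    from bs4 import BeautifulSoup

    def count_pages(target_url, headers):
        # Walk pages until one has no picture items; the previous page is the last.
        page = 1
        while True:
            text = requests.get(target_url % page, headers=headers, timeout=3).text
            html = BeautifulSoup(text, 'html.parser')
            if not html.find_all(class_='egeli_pic_li'):
                return page - 1
            page += 1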

Press Ctrl+U to view the page source and search it directly.

Select a target picture and check whether its address is the original source; opening it shows that it is not the original image.


Now click through to the picture page; this one is the original image. Then grab the image link from the picture-view tab.

Compare the two links


Thumbnail: edpic_360_360
Original image: edpic_source

So that's the overall idea: grab the thumbnail link, rewrite the URL to form the original-image link, and then download in batches!
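A toy illustration of that rewrite. Only the edpic_360_360 -> edpic_source swap comes from this article; the host and path segments below are made up for the example:

    thumb = 'https://i.enterdesk.com/edpic_360_360/ab/cd/ef/example.jpg'  # hypothetical thumbnail URL
    source = thumb.replace('edpic_360_360', 'edpic_source')
    print(source)  # https://i.enterdesk.com/edpic_source/ab/cd/ef/example.jpg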

Start playing with the code!

First comes class Spider(): we declare a class and then use def __init__ to declare its constructor.

import requests

all_urls = []  # the page links we piece together

class Spider():
    # Constructor: initialize the data
    def __init__(self, target_url, headers):
        self.target_url = target_url
        self.headers = headers

    # Get all the page URLs you want to grab
    def getUrls(self):
        # Request page 1 to locate the last page
        response = requests.get(self.target_url % 1, headers=self.headers).text
        html = BeautifulSoup(response, 'html.parser')
        # Find the last-page tag and extract its link
        res = html.find(class_='wrap no_a').attrs['href']
        page_num = int(re.findall(r'(\d+)', res)[0])  # regex-match the page number
        global all_urls
        # Loop to build the spliced URLs
        for i in range(1, page_num + 1):
            all_urls.append(self.target_url % i)

How the last-page link is extracted: the pagination element with class wrap no_a carries an href pointing at the last page, and the page number is pulled out of it with a regex.

We crawl in a multithreaded way here, so we import the following modules:

from bs4 import BeautifulSoup  # parse HTML
import threading  # multithreading
import re  # regular expressions
import time  # time

A new global variable is added, and since this is a multithreaded operation, we also need a thread lock to avoid errors from writing the shared resource at the same time.

all_img_urls = []  # all image links
g_lock = threading.Lock()  # initialize a lock

Declare a Producer class, which is responsible for extracting the image links and appending them to the global variable all_img_urls.

class Producer(threading.Thread):
    def run(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
        global all_urls
        while len(all_urls) > 0:
            g_lock.acquire()  # lock before touching all_urls
            page_url = all_urls.pop(0)  # pop removes the first element and returns it
            g_lock.release()  # release promptly so other threads can use the list
            try:
                print("analyzing " + page_url)
                response = requests.get(page_url, headers=headers, timeout=3).text
                html = BeautifulSoup(response, 'html.parser')
                pic_link = html.find_all(class_='egeli_pic_li')[:-1]
                global all_img_urls
                g_lock.acquire()  # lock again before writing all_img_urls
                for i in pic_link:
                    link = i.find('img')['src'].replace('edpic_360_360', 'edpic_source')
                    all_img_urls.append(link)
                g_lock.release()
                # time.sleep(0.1)
            except:
                pass

About the thread lock: in the code above, when we call all_urls.pop(0) we don't want other threads operating on the list at the same time, otherwise accidents will happen. So we use g_lock.acquire() to lock the resource, and once we're done we must release it immediately with g_lock.release(); otherwise the resource stays occupied and the program cannot proceed.
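As an aside, the same acquire/release pair can be written with a with statement, which releases the lock automatically even if an exception is raised in between; doing the emptiness check and the pop inside one locked block also avoids the race between the length test and the pop. A minimal sketch using the same g_lock and all_urls as above:

    page_url = None
    with g_lock:  # acquired here, released automatically on exit
        if all_urls:
            page_url = all_urls.pop(0)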

if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
    target_url = 'https://tu.enterdesk.com/chongwu/%d.html'  # rule for the picture-list pages
    print('Start getting the links of all picture pages!')
    spider = Spider(target_url, headers)
    spider.getUrls()
    print('Finished getting all picture pages, start analyzing the image links!')
    threads = []
    for x in range(10):
        gain_link = Producer()
        gain_link.start()
        threads.append(gain_link)
    # join the threads: the main thread blocks and waits for the child threads to finish
    for tt in threads:
        tt.join()

Let's define a DownPic class to download the picture.

class DownPic(threading.Thread):
    def run(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
        global all_img_urls
        while True:
            g_lock.acquire()  # lock before touching all_img_urls
            if len(all_img_urls) == 0:
                g_lock.release()  # release no matter what, then exit
                break
            else:
                t = time.time()
                down_time = str(round(t * 1000))  # millisecond timestamp
                pic_name = 'D:\\test\\' + down_time + '.jpg'
                pic = all_img_urls.pop(0)
                g_lock.release()
                response = requests.get(pic, headers=headers)
                with open(pic_name, 'wb') as f:
                    f.write(response.content)
                print(pic_name + ' download completed!')

You can see that down_time = str(round(t * 1000)) generates a millisecond timestamp to name each image. You could also derive the name from the picture itself and write your own version (a sketch follows below).
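For example, a sketch of naming the file after the image itself instead of the timestamp, by taking the last path segment of the URL; this assumes the image URLs end in a real filename, as the edpic links above appear to:

    from urllib.parse import urlparse
    import os

    def pic_name_from_url(pic_url, folder='D:\\test'):
        # Use the URL's last path segment (e.g. 'example.jpg') as the file name.
        return os.path.join(folder, os.path.basename(urlparse(pic_url).path))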

Then, inside the if __name__ == "__main__": block, add the following code to start the multithreaded download:

print('Analyzing picture links completed, start the multithreaded download!')
for x in range(20):
    download = DownPic()
    download.start()

That's the whole process! Run the code and it's off.

Tip: to run this code you need to create a test folder on the D drive, modify the path in the code yourself, or have the script create the folder, as sketched below.
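A one-line tweak you could drop in at the top of the main block; os.makedirs with exist_ok=True does nothing if the folder already exists:

    import os
    os.makedirs('D:\\test', exist_ok=True)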

Attached is the complete code:

import requests
from bs4 import BeautifulSoup  # parse HTML
import threading  # multithreading
import re  # regular expressions
import time  # time

all_urls = []  # the page links we piece together
all_img_urls = []  # all image links
g_lock = threading.Lock()  # initialize a lock

class Spider():
    # Constructor: initialize the data
    def __init__(self, target_url, headers):
        self.target_url = target_url
        self.headers = headers

    # Get all the page URLs you want to grab
    def getUrls(self):
        # Request page 1 to locate the last page
        response = requests.get(self.target_url % 1, headers=self.headers).text
        html = BeautifulSoup(response, 'html.parser')
        # Find the last-page tag and extract its link
        res = html.find(class_='wrap no_a').attrs['href']
        page_num = int(re.findall(r'(\d+)', res)[0])  # regex-match the page number
        global all_urls
        # Loop to build the spliced URLs
        for i in range(1, page_num + 1):
            all_urls.append(self.target_url % i)

# Responsible for extracting the image links
class Producer(threading.Thread):
    def run(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
        global all_urls
        while len(all_urls) > 0:
            g_lock.acquire()  # lock before touching all_urls
            page_url = all_urls.pop(0)  # pop removes the first element and returns it
            g_lock.release()  # release promptly so other threads can use the list
            try:
                print("analyzing " + page_url)
                response = requests.get(page_url, headers=headers, timeout=3).text
                html = BeautifulSoup(response, 'html.parser')
                pic_link = html.find_all(class_='egeli_pic_li')[:-1]
                global all_img_urls
                g_lock.acquire()  # lock again before writing all_img_urls
                for i in pic_link:
                    link = i.find('img')['src'].replace('edpic_360_360', 'edpic_source')
                    all_img_urls.append(link)
                g_lock.release()
                # time.sleep(0.1)
            except:
                pass

# Responsible for downloading the images
class DownPic(threading.Thread):
    def run(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
        global all_img_urls
        while True:
            g_lock.acquire()  # lock before touching all_img_urls
            if len(all_img_urls) == 0:
                g_lock.release()  # release no matter what, then exit
                break
            else:
                t = time.time()
                down_time = str(round(t * 1000))  # millisecond timestamp
                pic_name = 'D:\\test\\' + down_time + '.jpg'
                pic = all_img_urls.pop(0)
                g_lock.release()
                response = requests.get(pic, headers=headers)
                with open(pic_name, 'wb') as f:
                    f.write(response.content)
                print(pic_name + ' download completed!')

if __name__ == "__main__":
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}
    target_url = 'https://tu.enterdesk.com/chongwu/%d.html'  # rule for the picture-list pages
    print('Start getting the links of all picture pages!')
    spider = Spider(target_url, headers)
    spider.getUrls()
    print('Finished getting all picture pages, start analyzing the image links!')
    threads = []
    for x in range(10):
        gain_link = Producer()
        gain_link.start()
        threads.append(gain_link)
    # join the threads: the main thread blocks and waits for the child threads to finish
    for tt in threads:
        tt.join()
    print('Analyzing picture links completed, start the multithreaded download!')
    for x in range(20):
        download = DownPic()
        download.start()

That's all for "How to Crawl Web Page Images with Python". Thank you for reading. If you want to learn more about the industry, you can follow this site, where the editor will keep putting out high-quality practical articles for you!
