What is the use of Python multithreaded crawler

2025-01-19 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/01 Report --

This article introduces the relevant knowledge of "What is the use of Python multithreaded crawler". Many people run into situations like this in real projects, so let the editor walk you through how to handle them. I hope you read carefully and come away with something!

Basic development environment

Python 3.6

PyCharm

wkhtmltopdf

Use of related modules

re

requests

concurrent.futures

Install Python and add it to the PATH environment variable, then install the required third-party modules with pip.
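
For example, a typical install command, assuming pip is available on the command line (re and concurrent.futures ship with the standard library, and wkhtmltopdf is a standalone tool installed separately):

pip install requests parsel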

First, define the requirements

Who doesn't send a few memes when chatting these days? Memes are not only an important tool in conversation, but also a good helper for bringing friends closer. When a chat turns awkward, a well-placed meme can easily make the embarrassment disappear.

Second, analyze the web page data

As shown in the figure, the image data on Doutu.com is contained in the a tags. You can try requesting this page directly to see whether the image addresses are also included in the data returned by response.

import requests

def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

def main(html_url):
    response = get_response(html_url)
    print(response.text)

if __name__ == '__main__':
    url = 'https://www.doutula.com/photo/list/'
    main(url)
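
A quick sanity check before parsing: requests exposes the HTTP status code on the response, so a small hedged variant of main() (same names as above) can confirm the page actually came back before you search it:

def main(html_url):
    response = get_response(html_url)
    # 200 means the server returned the page successfully
    print(response.status_code)
    print(response.text)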

Press Ctrl + F in the output to search for an image address.

One point worth noting here: in the page content that the Python request returns, the image url addresses appear as:

data-original="image url"

data-backup="image url"

To extract the url addresses, you can use the parsel parsing library or re regular expressions. Parsel has been covered before, so this article will use regular expressions.
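
For comparison, a minimal sketch of the parsel alternative mentioned above (assuming parsel is installed and that response already holds the fetched page; the attribute selector is an assumption about the page structure):

import parsel

selector = parsel.Selector(response.text)
# grab every data-original attribute on the page
urls = selector.css('[data-original]::attr(data-original)').getall()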

urls = re.findall('data-original="(.*?)"', response.text)

Single page crawl complete code

import requests
import re

def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

def save(image_url, image_name):
    image_content = get_response(image_url).content
    filename = 'images\\' + image_name
    with open(filename, mode='wb') as f:
        f.write(image_content)
    print(image_name)

def main(html_url):
    response = get_response(html_url)
    urls = re.findall('data-original="(.*?)"', response.text)
    for link in urls:
        image_name = link.split('/')[-1]
        save(link, image_name)

if __name__ == '__main__':
    url = 'https://www.doutula.com/photo/list/'
    main(url)
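
One practical caveat: save() assumes an images folder already exists next to the script, otherwise open() raises FileNotFoundError. A minimal hedged variant of save() that creates the folder first (using os.makedirs and os.path.join; this safeguard is an addition, not part of the original code):

import os

def save(image_url, image_name):
    # create the output folder if it does not exist yet
    os.makedirs('images', exist_ok=True)
    filename = os.path.join('images', image_name)
    with open(filename, mode='wb') as f:
        f.write(get_response(image_url).content)
    print(image_name)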

Multithreaded crawling of images from the whole site (if your memory is large enough)

3,631 pages of data, all kinds of memes, hey.

import requests
import re
import concurrent.futures

def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

def save(image_url, image_name):
    image_content = get_response(image_url).content
    filename = 'images\\' + image_name
    with open(filename, mode='wb') as f:
        f.write(image_content)
    print(image_name)

def main(html_url):
    response = get_response(html_url)
    urls = re.findall('data-original="(.*?)"', response.text)
    for link in urls:
        image_name = link.split('/')[-1]
        save(link, image_name)

if __name__ == '__main__':
    # ThreadPoolExecutor: thread pool object
    # max_workers: maximum number of worker threads
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    for page in range(1, 3632):
        url = f'https://www.doutula.com/photo/list/?page={page}'
        # submit: add a task to the thread pool
        executor.submit(main, url)
    executor.shutdown()
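
As a side note, ThreadPoolExecutor can also be used as a context manager, which shuts the pool down automatically; a minimal variant of the block above (an alternative form, not the article's original code):

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    for page in range(1, 3632):
        url = f'https://www.doutula.com/photo/list/?page={page}'
        executor.submit(main, url)
# leaving the with block implicitly calls executor.shutdown()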

That's it for "What is the use of Python multithreaded crawler". Thank you for reading. If you want to learn more about related topics, you can follow the site; the editor will keep putting out high-quality, practical articles for you!
