This article is about how to crawl meme images with Python using multiple threads. The editor thinks it is very practical, so it is shared here for you to learn from. I hope you gain something after reading it. Let's take a look.
Highlights of the course

Systematic analysis of the target web page
How to parse data out of html tags
One-click saving of large batches of picture data

Environment

Python 3.8
PyCharm

Modules used

requests >>> pip install requests
parsel >>> pip install parsel
time (built-in module, used to record the run time; see the sketch below)
re and concurrent.futures (built-in modules, used later for filename cleaning and multithreading)
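As a minimal sketch of how the time module is used in this article to record run time (the work being timed here is just a placeholder):

import time

start_time = time.time()  # record the start timestamp
# ... the work to be timed goes here, e.g. sending requests and saving images ...
end_time = time.time()
print('run time:', int(end_time - start_time), 'seconds')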
Process

1. Analyze where the data content we want can be obtained
Meme page >>> picture url address and picture name
Use the browser's developer tools to inspect the page (a quick check that the data really is in the page source is sketched below)
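A quick way to confirm that the image data is present in the html the server actually returns (a minimal sketch, assuming the listing URL pattern used in the article's code below):

import requests

# page 1 of the listing; the URL pattern is taken from the code later in this article
url = 'https://fabiaoqing.com/biaoqing/lists/page/1.html'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url=url, headers=headers)
print(response.status_code)               # 200 means the request succeeded
print('data-original' in response.text)   # True means the image urls are in the static page source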
2. Code implementation steps

1. Send a request
Determine the url address to send the request to
Choose the request method: GET or POST
Request header parameters: hotlink protection (Referer), cookie, etc.
2. Get data
Get the data content returned by the server
response.text gets text data
response.json() gets json dictionary data
response.content gets binary data; saving pictures / audio / video / files in specific formats means getting binary content
3. Parse data
Extract the data content we want
I. Direct parsing
II. json dictionary key-value pairs
III. re regular expressions
IV. css selectors
V. xpath
(a small sketch of several of these options follows this outline)
4. Save data
Text
CSV
Database
Local folder
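As a tiny illustration of parsing options III, IV and V (a minimal sketch; the inline html snippet is illustrative, shaped like the target page's img tags based on the selectors used later in this article):

import re
import parsel

html = '<img class="ui image lazy" data-original="https://example.com/funny.jpg" title="funny meme">'
selector = parsel.Selector(html)

print(selector.css('.ui.image.lazy::attr(title)').get())   # css selector
print(selector.xpath('//img/@data-original').get())        # xpath
print(re.findall('title="(.*?)"', html))                   # regular expression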
Import modules

import requests  # data request module; third-party: pip install requests
import parsel    # data parsing module; third-party: pip install parsel
import re        # regular expression module (built-in)
import time      # time module (built-in)
import concurrent.futures  # thread pool (built-in)

Single-threaded crawl of 10 pages of data
1. Send a request
start_time = time.time()
for page in range(1, 11):
    url = f'https://fabiaoqing.com/biaoqing/lists/page/{page}.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)  # response object; status code 200 means the request succeeded
2. Get data: get text data / the web page source code
    # the tag data shown in the developer tools Elements panel is not always what the server returns,
    # so we must extract from the data the server actually sends back
    # xpath parsing method: the parsel module can call the xpath parsing method
    # print(response.text)
3. Parse data
    # bs4 parses more slowly; to take values directly from string data, only regular expressions will do
    selector = parsel.Selector(response.text)  # convert the acquired html string into a Selector object
    title_list = selector.css('.ui.image.lazy::attr(title)').getall()
    img_list = selector.css('.ui.image.lazy::attr(data-original)').getall()
    # walk the two lists in step, extracting one element from each per loop iteration
    for title, img_url in zip(title_list, img_list):
4. Save data
        # print(title, img_url)
        # img_name_1 = img_url[-3:]  # alternative: slice the string (indexes run from 0 on the left, -1 on the right)
        title = re.sub(r'[\/:*?"|\n]', '_', title)  # replace characters that are illegal in file names, otherwise saving raises an error
        img_name = img_url.split('.')[-1]  # take the file extension via split() and the list index position
        img_content = requests.get(url=img_url).content  # get the binary content of the picture
        with open('img\\' + title + '.' + img_name, mode='wb') as f:
            f.write(img_content)
        print(title)
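The save step above assumes a Windows-style path and an existing img folder. A more portable sketch of the same idea (the helper name save_image is hypothetical, not from the original article):

from pathlib import Path

import requests

def save_image(title, img_url):
    """Hypothetical helper: same save logic, but portable and folder-safe."""
    folder = Path('img')
    folder.mkdir(exist_ok=True)        # create the img folder if it is missing
    img_name = img_url.split('.')[-1]  # file extension, as in the original code
    (folder / f'{title}.{img_name}').write_bytes(requests.get(url=img_url).content)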
Multithreaded crawl of 10 pages of data
def get_response(html_url):
    """Send a request"""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

def get_img_info(html_url):
    """Get the url address and name of every picture on a list page"""
    response = get_response(html_url)
    selector = parsel.Selector(response.text)  # convert the acquired html string into a Selector object
    title_list = selector.css('.ui.image.lazy::attr(title)').getall()
    img_list = selector.css('.ui.image.lazy::attr(data-original)').getall()
    zip_data = zip(title_list, img_list)
    return zip_data

def save(title, img_url):
    """Save data"""
    title = re.sub(r'[\/:*?"|\n]', '_', title)  # replace characters that are illegal in file names, otherwise saving raises an error
    img_name = img_url.split('.')[-1]  # take the file extension via split() and the list index position
    img_content = requests.get(url=img_url).content  # get the binary content of the picture
    with open('img\\' + title + '.' + img_name, mode='wb') as f:
        f.write(img_content)
    print(title)

def main(html_url):
    zip_data = get_img_info(html_url)
    for title, img_url in zip_data:
        save(title, img_url)

if __name__ == '__main__':
    start_time = time.time()
    exe = concurrent.futures.ThreadPoolExecutor(max_workers=10)
    for page in range(1, 11):
        # 1. send the request
        url = f'https://fabiaoqing.com/biaoqing/lists/page/{page}.html'
        exe.submit(main, url)
    exe.shutdown()
    end_time = time.time()
    use_time = int(end_time - start_time)
    print('program run time:', use_time)
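One design note on the script above: exe.shutdown() blocks until all submitted tasks finish, which is why the timing printed afterwards covers the whole crawl. An equivalent pattern (a sketch, not from the original article) uses the executor as a context manager so the shutdown happens automatically:

import concurrent.futures

def crawl_all_pages():
    # same work as the __main__ block above, but the with-statement calls shutdown() for us
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as exe:
        for page in range(1, 11):
            url = f'https://fabiaoqing.com/biaoqing/lists/page/{page}.html'
            exe.submit(main, url)  # main is the per-page worker defined above
    # leaving the with-block waits for every submitted task to finish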
The above is how to crawl meme images with Python using multiple threads. The editor believes there are some knowledge points here that you may see or use in your daily work. I hope you can learn more from this article.