Case study of the Python Ajax crawler method

This article walks through a case study of crawling Ajax-loaded data with Python: scraping street-photography images from Toutiao's search API. The steps are laid out in detail and the logic is clear; I hope you get something out of it. Let's take a look.
1. Capture street photos
The target is Toutiao's street-photo search page (keyword 街拍, "street shot") at so.toutiao.com.
2. Analyze the structure of the street-photo request
Loading a page of results fires an Ajax request whose query string carries the following parameters:
keyword: 街拍 ("street shot")
pd: atlas
dvpf: pc
aid: 4916
page_num: 1
search_json: {"from_search_id": "20220104115420010212192151532E8188", "origin_keyword": "街拍", "image_keyword": "街拍"}
rawJSON: 1
search_id: 202201041159040101501341671A4749C4
Comparing the requests for successive pages reveals the pattern: page_num increments from 1, page by page, while the other parameters stay the same.
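As an aside, search_json travels URL-encoded inside the query string. A minimal standard-library sketch to decode the value captured above and confirm its fields (the percent-encoded string is the one from the page-1 request):

import json
from urllib.parse import unquote

encoded = ('%7B%22from_search_id%22%3A%2220220104115420010212192151532E8188%22'
           '%2C%22origin_keyword%22%3A%22%E8%A1%97%E6%8B%8D%22'
           '%2C%22image_keyword%22%3A%22%E8%A1%97%E6%8B%8D%22%7D')
print(json.loads(unquote(encoded)))
# {'from_search_id': '20220104115420010212192151532E8188',
#  'origin_keyword': '街拍', 'image_keyword': '街拍'}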
3. Organize the code into separate methods, one per task
3.1 Fetch the JSON data of a results page

def get_page(page_num):
    global headers
    headers = {
        'Host': 'so.toutiao.com',
        # 'Referer': 'https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D&pd=atlas&dvpf=pc&aid=4916&page_num=0&search_json={%22from_search_id%22:%22202112272022060101510440283EE83D67%22,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Cookie': 'msToken=S0DFBkZ9hmyLOGYd3_QjhhXgrm38qTyOITnkNb0t_oavfbVxuYV1JZ0tT5hLgswSfmZLFD6c2lONm_5TomUQXVXjen7CIxM2AGwbhHRYKjhg; '
                  '_S_DPR=1.5; _S_IPAD=0; MONITOR_WEB_ID=7046351002275317255; '
                  'ttwid=1%7C0YdWalNdIiSpIk3CvvHwV25U8drq3QAj08E8QOApXhs%7C1640607595%7C720e971d353416921df127996ed708931b4ae28a0a8691a5466347697e581ce8'
    }
    params = {
        'keyword': '街拍',  # "street shot"
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,
        'search_json': '%7B%22from_search_id%22%3A%22202112272022060101510440283EE83D67%22%2C%22origin_keyword%22%3A%22%E8%A1%97%E6%8B%8D%22%2C%22image_keyword%22%3A%22%E8%A1%97%E6%8B%8D%22%7D',
        'rawJSON': 1,
        'search_id': '2021122721183101015104402851E3883D'
    }
    url = 'https://so.toutiao.com/search?' + urlencode(params)
    print(url)
    try:
        # the params are already encoded into url, so they are not passed again
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # print(response.json())
            return response.json()
    except requests.ConnectionError:
        return None

3.2 Extract the street-photo links from the JSON

def get_images(json):
    images = json.get('rawData').get('data')
    for image in images:
        link = image.get('img_url')
        yield link
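A quick way to smoke-test the two helpers above (assuming the request succeeds and the response really nests the image list under rawData → data → img_url, as observed when analyzing the page):

data = get_page(1)
if data:
    for link in get_images(data):
        print(link)  # one image URL per line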
3.3 Name each picture after the MD5 hash of its content and save it
Next, implement a save_image() method to persist a picture; its link argument is one of the image URLs yielded by get_images() above. The method first makes sure the target folder exists, then requests the picture link to obtain the image's binary data, and writes it to a file in binary mode. The file is named after the MD5 hash of its content, which removes duplicates. The relevant code is as follows:
def save_image(link):
    data = requests.get(link).content
    os.makedirs('./image', exist_ok=True)  # create the output folder if it does not exist
    # use the MD5 hash of the image content as the file name
    with open(f'./image/{md5(data).hexdigest()}.jpg', 'wb') as f:
        f.write(data)

3.4 main() calls the other functions

def main(page_num):
    json = get_page(page_num)
    if json is None:  # the request failed, skip this page
        return
    for link in get_images(json):
        # print(link)
        save_image(link)
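Before moving on, why MD5 for the file names? Identical bytes always produce the same digest, so a duplicate download simply overwrites the existing file instead of accumulating copies. A small sketch of that property:

from hashlib import md5

a = md5(b'same image bytes').hexdigest()
b = md5(b'same image bytes').hexdigest()
c = md5(b'other bytes').hexdigest()
print(a == b)  # True: duplicate content maps to the same .jpg name
print(a == c)  # False: distinct content gets a distinct name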
4. Grab 20 pages of Toutiao street-photo data
Here the start and end pages are defined as GROUP_START and GROUP_END, and a multiprocessing pool is used, calling its map() method to run the download routine over the whole page range.
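Pool.map behaves like the built-in map but spreads the calls across worker processes. A minimal, self-contained sketch of the pattern (the work() function is a stand-in for the crawler's per-page job, and the worker count of 4 is an arbitrary choice):

from multiprocessing.pool import Pool

def work(n):
    return n * n  # stand-in for downloading page n

if __name__ == '__main__':
    with Pool(4) as pool:
        print(pool.map(work, range(1, 6)))  # [1, 4, 9, 16, 25]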
And the driver code itself:

if __name__ == '__main__':
    GROUP_START = 1
    GROUP_END = 20
    pool = Pool()
    groups = [x for x in range(GROUP_START, GROUP_END + 1)]
    # print(groups)
    pool.map(main, groups)
    pool.close()
    pool.join()

Putting everything together, the complete code is as follows:

import os
import requests
from urllib.parse import urlencode
from hashlib import md5
from multiprocessing.pool import Pool

def get_page(page_num):
    global headers
    headers = {
        'Host': 'so.toutiao.com',
        # 'Referer': 'https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D&pd=atlas&dvpf=pc&aid=4916&page_num=0&search_json={%22from_search_id%22:%22202112272022060101510440283EE83D67%22,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Cookie': 'msToken=S0DFBkZ9hmyLOGYd3_QjhhXgrm38qTyOITnkNb0t_oavfbVxuYV1JZ0tT5hLgswSfmZLFD6c2lONm_5TomUQXVXjen7CIxM2AGwbhHRYKjhg; '
                  '_S_DPR=1.5; _S_IPAD=0; MONITOR_WEB_ID=7046351002275317255; '
                  'ttwid=1%7C0YdWalNdIiSpIk3CvvHwV25U8drq3QAj08E8QOApXhs%7C1640607595%7C720e971d353416921df127996ed708931b4ae28a0a8691a5466347697e581ce8'
    }
    params = {
        'keyword': '街拍',  # "street shot"
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,
        'search_json': '%7B%22from_search_id%22%3A%22202112272022060101510440283EE83D67%22%2C%22origin_keyword%22%3A%22%E8%A1%97%E6%8B%8D%22%2C%22image_keyword%22%3A%22%E8%A1%97%E6%8B%8D%22%7D',
        'rawJSON': 1,
        'search_id': '2021122721183101015104402851E3883D'
    }
    url = 'https://so.toutiao.com/search?' + urlencode(params)
    print(url)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None

def get_images(json):
    images = json.get('rawData').get('data')
    for image in images:
        link = image.get('img_url')
        yield link

def save_image(link):
    data = requests.get(link).content
    os.makedirs('./image', exist_ok=True)
    # use the MD5 hash of the image content as the file name
    with open(f'./image/{md5(data).hexdigest()}.jpg', 'wb') as f:
        f.write(data)

def main(page_num):
    json = get_page(page_num)
    if json is None:
        return
    for link in get_images(json):
        save_image(link)

if __name__ == '__main__':
    GROUP_START = 1
    GROUP_END = 20
    pool = Pool()
    groups = [x for x in range(GROUP_START, GROUP_END + 1)]
    pool.map(main, groups)
    pool.close()
    pool.join()

These are all the contents of the article "Case study of the Python Ajax crawler method". Thank you for reading! I hope you gained something from it.