

A case study of Python Ajax crawler method

2025-01-14 Update From: SLTechnology News&Howtos


This article walks through a case study of crawling an Ajax-driven page with Python. The content is detailed and the logic is laid out step by step; I hope you get something out of it after reading. Let's take a look.

1. Capture street photos

The target is the Toutiao street photo search page (https://so.toutiao.com/search with keyword 街拍 and pd=atlas), which loads its image data via Ajax.

2. Analyze the structure of the street photo requests

Opening the browser's developer tools and paging through the results shows that each Ajax request carries the following query parameters:

keyword: 街拍 (street shot)
pd: atlas
dvpf: pc
aid: 4916
page_num: 1
search_json: {"from_search_id": "20220104115420010212192151532E8188", "origin_keyword": "街拍", "image_keyword": "街拍"}
rawJSON: 1
search_id: 202201041159040101501341671A4749C4

Comparing requests across pages reveals the rule: page_num increments from 1, while every other parameter stays the same.
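This pattern can be sketched with urllib's urlencode. The `build_url` helper below is hypothetical and keeps only the parameters listed above; the point is that page_num is the only thing that changes between requests:

```python
from urllib.parse import urlencode

def build_url(page_num):
    # hypothetical helper: every parameter is fixed except page_num
    params = {
        'keyword': '街拍',
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,
    }
    return 'https://so.toutiao.com/search?' + urlencode(params)

print(build_url(1))
print(build_url(2))
```

Note that urlencode percent-encodes the Chinese keyword to %E8%A1%97%E6%8B%8D, which matches the URL seen in the browser's developer tools.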

3. Organize the code into separate methods, one per function

3.1 Get the JSON data of a results page

```python
import requests
from urllib.parse import urlencode

def get_page(page_num):
    headers = {
        'Host': 'so.toutiao.com',
        # 'Referer': 'https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D&pd=atlas&dvpf=pc&aid=4916&page_num=0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        # copy the Cookie value from your own logged-in browser session
        'Cookie': 'msToken=S0DFBkZ9hmyLOGYd3_QjhhXgrm38qTyOITnkNb0t_oavfbVxuYV1JZ0tT5hLgswSfmZLFD6c2lONm_5TomUQXVXjen7CIxM2AGwbhHRYKjhg; '
                  'MONITOR_WEB_ID=7046351002275317255; '
                  'ttwid=1%7C0YdWalNdIiSpIk3CvvHwV25U8drq3QAj08E8QOApXhs%7C1640607595%7C720e971d353416921df127996ed708931b4ae28a0a8691a5466347697e581ce8',
    }
    params = {
        'keyword': '街拍',
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,
        'search_json': '{"from_search_id": "202112272022060101510440283EE83D67", '
                       '"origin_keyword": "街拍", "image_keyword": "街拍"}',
        'rawJSON': 1,
        'search_id': '2021122721183101015104402851E3883D',
    }
    url = 'https://so.toutiao.com/search?' + urlencode(params)
    print(url)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # print(response.json())
            return response.json()
    except requests.ConnectionError:
        return None
```

3.2 Extract the street photo links from the JSON

```python
def get_images(json):
    images = json.get('rawData').get('data')
    for image in images:
        link = image.get('img_url')
        yield link
```

3.3 Name each picture by the MD5 hash of its content and save it
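The extraction step can be checked without hitting the network by feeding the generator a miniature dict that mimics the rawData → data → img_url structure of the real response (the URLs below are made up for illustration):

```python
def get_images(json):
    # same logic as the crawler: walk rawData -> data and yield each img_url
    images = json.get('rawData').get('data')
    for image in images:
        yield image.get('img_url')

# hypothetical miniature of the real response structure
fake_json = {
    'rawData': {
        'data': [
            {'img_url': 'https://p.example.com/a.jpg'},
            {'img_url': 'https://p.example.com/b.jpg'},
        ]
    }
}

links = list(get_images(fake_json))
print(links)
```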

Implement a save_image() method to store the pictures, where link is an image URL yielded by the get_images() method above. The method first makes sure the target folder exists, then requests the image link, gets the binary data of the picture, and writes it to a file in binary mode. Using the MD5 hash of the content as the file name removes duplicates.
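A quick sandbox of why MD5 naming deduplicates: identical bytes always hash to the same 32-character name, so downloading the same picture twice just overwrites one file (the byte strings below are stand-ins for real image data):

```python
from hashlib import md5

a = b'\x89PNG fake image bytes'
b = b'\x89PNG fake image bytes'       # a second download of the same picture
c = b'\x89PNG different image bytes'

name_a = md5(a).hexdigest() + '.jpg'
name_b = md5(b).hexdigest() + '.jpg'
name_c = md5(c).hexdigest() + '.jpg'

print(name_a == name_b)   # identical content collapses to one file name
print(name_a == name_c)
```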

The code is as follows:

```python
from hashlib import md5
import os

def save_image(link):
    os.makedirs('./image', exist_ok=True)   # make sure the target folder exists
    data = requests.get(link).content
    # use the MD5 hash of the content as the file name, which deduplicates pictures
    with open(f'./image/{md5(data).hexdigest()}.jpg', 'wb') as f:
        f.write(data)
```

3.4 main() calls the other functions

```python
def main(page_num):
    json = get_page(page_num)
    for link in get_images(json):
        # print(link)
        save_image(link)
```

4. Grab 20 pages of Toutiao street photo data

Here the starting and ending pages, GROUP_START and GROUP_END, are defined. A pool from multiprocessing.pool is created, and its map() method is called to run the download for every page number in parallel.
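The dispatch pattern can be sketched with a ThreadPool, which has the same map()/close()/join() interface as the process Pool but runs safely without a __main__ guard. fetch_page is a hypothetical stand-in for main() that returns a value instead of downloading:

```python
from multiprocessing.pool import ThreadPool

def fetch_page(page_num):
    # stand-in for main(page_num): report the page instead of downloading it
    return f'fetched page {page_num}'

pool = ThreadPool(4)
results = pool.map(fetch_page, range(1, 6))   # pages 1..5
pool.close()
pool.join()
print(results)
```

map() blocks until every page is done and preserves input order, which is why close() and join() can follow immediately.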

The complete program is as follows:

```python
import os
import requests
from urllib.parse import urlencode
from hashlib import md5
from multiprocessing.pool import Pool


def get_page(page_num):
    headers = {
        'Host': 'so.toutiao.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        # copy the Cookie value from your own logged-in browser session
        'Cookie': 'msToken=S0DFBkZ9hmyLOGYd3_QjhhXgrm38qTyOITnkNb0t_oavfbVxuYV1JZ0tT5hLgswSfmZLFD6c2lONm_5TomUQXVXjen7CIxM2AGwbhHRYKjhg; '
                  'MONITOR_WEB_ID=7046351002275317255; '
                  'ttwid=1%7C0YdWalNdIiSpIk3CvvHwV25U8drq3QAj08E8QOApXhs%7C1640607595%7C720e971d353416921df127996ed708931b4ae28a0a8691a5466347697e581ce8',
    }
    params = {
        'keyword': '街拍',
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,
        'search_json': '{"from_search_id": "202112272022060101510440283EE83D67", '
                       '"origin_keyword": "街拍", "image_keyword": "街拍"}',
        'rawJSON': 1,
        'search_id': '2021122721183101015104402851E3883D',
    }
    url = 'https://so.toutiao.com/search?' + urlencode(params)
    print(url)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None


def get_images(json):
    images = json.get('rawData').get('data')
    for image in images:
        link = image.get('img_url')
        yield link


def save_image(link):
    os.makedirs('./image', exist_ok=True)
    data = requests.get(link).content
    # use the MD5 hash of the content as the file name to deduplicate
    with open(f'./image/{md5(data).hexdigest()}.jpg', 'wb') as f:
        f.write(data)


def main(page_num):
    json = get_page(page_num)
    for link in get_images(json):
        save_image(link)


if __name__ == '__main__':
    GROUP_START = 1
    GROUP_END = 20
    pool = Pool()
    groups = [x for x in range(GROUP_START, GROUP_END + 1)]
    # print(groups)
    pool.map(main, groups)
    pool.close()
    pool.join()
```

Those are all the contents of the article "A case study of Python Ajax crawler method". Thank you for reading! I believe you will gain something from it, and if you want to learn more, please follow the industry information channel.
