How to Use Python to Scrape Street-Snap Pictures from Jinri Toutiao

Updated 2025-03-30. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/02

This article shows how to use Python to scrape street-snap (街拍) pictures from the Jinri Toutiao search results. It walks through the request parameters, the code for each step, and a complete script that downloads 20 pages of results.

(1) Goal: scrape the street-snap pictures from Jinri Toutiao.

(2) Analyze the request behind the Jinri Toutiao street-snap picture search.

The search request carries the following query parameters:

keyword: 街拍
pd: atlas
dvpf: pc
aid: 4916
page_num: 1
search_json: {"from_search_id": "20220104115420010212192151532E8188", "origin_keyword": "街拍", "image_keyword": "街拍"}
rawJSON: 1
search_id: 202201041159040101501341671A4749C4

The pattern is easy to spot: page_num counts up from 1, and the other parameters stay the same.

(3) Organize the code by function.
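As a first building block, the paging rule from step (2) can be sketched as a small URL builder. This is only an illustration: build_search_url is a hypothetical helper name, the keyword value 街拍 is the decoded form of %E8%A1%97%E6%8B%8D from the Referer in the article's code, and the IDs are the sample values listed above, which will almost certainly have expired.

```python
import json
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = 'https://so.toutiao.com/search?'


def build_search_url(page_num):
    """Hypothetical helper: build the search URL for one result page."""
    search_json = {
        'from_search_id': '20220104115420010212192151532E8188',
        'origin_keyword': '街拍',
        'image_keyword': '街拍',
    }
    params = {
        'keyword': '街拍',
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,  # the only value that changes between pages
        'search_json': json.dumps(search_json, ensure_ascii=False),
        'rawJSON': 1,
        'search_id': '202201041159040101501341671A4749C4',
    }
    # urlencode() percent-encodes every value exactly once.
    return BASE_URL + urlencode(params)


urls = [build_search_url(n) for n in range(1, 4)]
for url in urls:
    # Parse the URL back to confirm that only page_num differs.
    print(parse_qs(urlparse(url).query)['page_num'])
```

Passing the plain JSON string to urlencode() avoids encoding an already percent-encoded value a second time.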

Fetch the page data in JSON format

def get_page(page_num):
    headers = {
        'Host': 'so.toutiao.com',
        # 'Referer': 'https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D&pd=atlas&dvpf=pc&aid=4916&page_num=0&search_json={%22from_search_id%22:%22202112272022060101510440283EE83D67%22,%22origin_keyword%22:%22%E8%A1%97%E6%8B%8D%22,%22image_keyword%22:%22%E8%A1%97%E6%8B%8D%22}',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        # Session cookie copied from the browser; replace it with your own.
        'Cookie': 'msToken=S0DFBkZ9hmyLOGYd3_QjhhXgrm38qTyOITnkNb0t_oavfbVxuYV1JZ0tT5hLgswSfmZLFD6c2lONm_5TomUQXVXjen7CIxM2AGwbhHRYKjhg; _S_DPR=1.5; _S_IPAD=0; MONITOR_WEB_ID=7046351002275317255; ttwid=1%7C0YdWalNdIiSpIk3CvvHwV25U8drq3QAj08E8QOApXhs%7C1640607595%7C720e971d353416921df127996ed708931b4ae28a0a8691a5466347697e581ce8',
    }
    params = {
        'keyword': '街拍',
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,
        # Plain JSON; urlencode() below percent-encodes it once.
        'search_json': '{"from_search_id":"202112272022060101510440283EE83D67","origin_keyword":"街拍","image_keyword":"街拍"}',
        'rawJSON': 1,
        'search_id': '2021122721183101015104402851E3883D',
    }
    url = 'https://so.toutiao.com/search?' + urlencode(params)
    print(url)
    try:
        # The query string is already part of url, so params is not passed again.
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None
    return None

Extract the street-snap picture links from the JSON data

def get_images(json):
    # Each entry of rawData -> data carries one picture link in img_url.
    images = json.get('rawData').get('data')
    for image in images:
        link = image.get('img_url')
        yield link
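The extraction step can be tried offline on a hand-written dictionary. The sketch below is a more defensive variant, not the article's version: it assumes only the rawData -> data -> img_url nesting used above, the sample dictionary and example.com links are made up, and it skips missing keys and empty links instead of raising.

```python
def get_images(json_data):
    """Yield every non-empty img_url found under rawData -> data."""
    if not json_data:
        return
    raw = json_data.get('rawData') or {}
    for image in raw.get('data') or []:
        link = image.get('img_url')
        if link:
            yield link


# Made-up sample data with the same nesting as the real response.
sample = {
    'rawData': {
        'data': [
            {'img_url': 'https://example.com/a.jpg'},
            {'img_url': None},
            {'img_url': 'https://example.com/b.jpg'},
        ]
    }
}

print(list(get_images(sample)))
print(list(get_images(None)))
```

Because the function is a generator, the caller can start downloading the first link before the rest of the list has been walked.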

Name each picture after the MD5 hash of its content and save it

save_image() takes one picture link yielded by get_images(). It requests the link, reads the binary content of the response, and writes it to a file opened in binary mode under ./image/ (the directory must already exist, because open() does not create it). The file is named after the MD5 hash of its content, so a picture that is downloaded twice ends up in the same file instead of being stored twice. The related code is as follows:

def save_image(link):
    data = requests.get(link).content
    # Use the MD5 hash of the data as the file name.
    with open(f'./image/{md5(data).hexdigest()}.jpg', 'wb') as f:
        f.write(data)
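The effect of naming files by content hash can be checked without any network access. In this minimal sketch, save_bytes is a hypothetical stand-in for save_image() that receives the bytes directly, and the byte strings are placeholders rather than real pictures.

```python
import os
import tempfile
from hashlib import md5


def save_bytes(data, directory):
    """Write data to <directory>/<md5 of data>.jpg and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, md5(data).hexdigest() + '.jpg')
    with open(path, 'wb') as f:
        f.write(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    first = save_bytes(b'same picture bytes', tmp)
    second = save_bytes(b'same picture bytes', tmp)    # same content, same path
    third = save_bytes(b'different picture bytes', tmp)
    saved = sorted(os.listdir(tmp))

print(len(saved))  # prints 2: the repeated content was written to one file
```

Identical content always maps to the same file name, so a repeated download overwrites the earlier copy.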

main() calls the other functions

def main(page_num):
    json = get_page(page_num)
    if not json:
        # get_page() returns None when the request fails.
        return
    for link in get_images(json):
        # print(link)
        save_image(link)

(4) Scrape 20 pages of Toutiao street-snap data

The first and last pages to fetch are defined as GROUP_START and GROUP_END. A process pool (Pool from multiprocessing.pool) is used, and its map() method hands the page numbers to the worker processes so that the pages are downloaded in parallel.

if __name__ == '__main__':
    GROUP_START = 1
    GROUP_END = 20
    pool = Pool()
    groups = list(range(GROUP_START, GROUP_END + 1))
    # print(groups)
    pool.map(main, groups)
    pool.close()
    pool.join()

The complete code is as follows:

import requests
from urllib.parse import urlencode
from hashlib import md5
from multiprocessing.pool import Pool


def get_page(page_num):
    headers = {
        'Host': 'so.toutiao.com',
        # 'Referer': 'https://so.toutiao.com/search?keyword=%E8%A1%97%E6%8B%8D&pd=atlas&dvpf=pc&aid=4916&page_num=0&search_json={%22from_search_id%22:%22202112272022060101510440283EE83D67%22,%22origin_keyword%22:%22%E8%A1%97%E6%8B%8D%22,%22image_keyword%22:%22%E8%A1%97%E6%8B%8D%22}',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        # Session cookie copied from the browser; replace it with your own.
        'Cookie': 'msToken=S0DFBkZ9hmyLOGYd3_QjhhXgrm38qTyOITnkNb0t_oavfbVxuYV1JZ0tT5hLgswSfmZLFD6c2lONm_5TomUQXVXjen7CIxM2AGwbhHRYKjhg; _S_DPR=1.5; _S_IPAD=0; MONITOR_WEB_ID=7046351002275317255; ttwid=1%7C0YdWalNdIiSpIk3CvvHwV25U8drq3QAj08E8QOApXhs%7C1640607595%7C720e971d353416921df127996ed708931b4ae28a0a8691a5466347697e581ce8',
    }
    params = {
        'keyword': '街拍',
        'pd': 'atlas',
        'dvpf': 'pc',
        'aid': '4916',
        'page_num': page_num,
        # Plain JSON; urlencode() below percent-encodes it once.
        'search_json': '{"from_search_id":"202112272022060101510440283EE83D67","origin_keyword":"街拍","image_keyword":"街拍"}',
        'rawJSON': 1,
        'search_id': '2021122721183101015104402851E3883D',
    }
    url = 'https://so.toutiao.com/search?' + urlencode(params)
    print(url)
    try:
        # The query string is already part of url, so params is not passed again.
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # print(response.json())
            return response.json()
    except requests.ConnectionError:
        return None
    return None


def get_images(json):
    images = json.get('rawData').get('data')
    for image in images:
        link = image.get('img_url')
        yield link


def save_image(link):
    data = requests.get(link).content
    # Use the MD5 hash of the data as the file name.
    with open(f'./image/{md5(data).hexdigest()}.jpg', 'wb') as f:
        f.write(data)


def main(page_num):
    json = get_page(page_num)
    if not json:
        # get_page() returns None when the request fails.
        return
    for link in get_images(json):
        # print(link)
        save_image(link)


if __name__ == '__main__':
    GROUP_START = 1
    GROUP_END = 20
    pool = Pool()
    groups = list(range(GROUP_START, GROUP_END + 1))
    # print(groups)
    pool.map(main, groups)
    pool.close()
    pool.join()
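The paging and map() logic can be exercised without touching the network. The sketch below swaps in ThreadPool from multiprocessing.pool, which is a different technique from the article's process pool: downloads are I/O-bound, so threads are often sufficient, and a thread pool does not need the __main__ guard. fake_main is a hypothetical stand-in for main(), and the pool size of 4 is an arbitrary choice.

```python
from multiprocessing.pool import ThreadPool

GROUP_START = 1
GROUP_END = 20


def fake_main(page_num):
    """Stand-in for main(page_num): returns a label instead of downloading."""
    return f'page {page_num} done'


groups = list(range(GROUP_START, GROUP_END + 1))

# map() blocks until every page is processed and keeps the input order.
with ThreadPool(4) as pool:
    results = pool.map(fake_main, groups)

print(len(results), results[0], results[-1])
```

Replacing fake_main with the real main() gives a threaded download loop, and the article's Pool() keeps the same map() call but runs the workers as separate processes.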
