Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How does Python crawl yy site-wide Mini videos

2025-01-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/01 Report--

This article introduces the relevant knowledge of "Python how to crawl yy station small video". In the actual case operation process, many people will encounter such difficulties. Next, let Xiaobian lead you to learn how to deal with these situations! I hope you can read carefully and learn something!

basic development environment

Python 3.6

Pycharm

import osimport requests

Install Python and add it to the environment variables, pip install the relevant modules you need.

I. Determining target needs

Baidu search YY, click classification to select small videos, the Short Video of Miss Selfie inside are the data we need.

As shown in the figure, the url address selected in the box is the playback address of Short Video.

Data request parameters on page 3:

Obviously this is based on page changes in the data parameter.

Build a page turning loop, get the URL of the video and the name of the publisher, and save it locally.

III. Code implementation

1. Request data interface

import requestsurl = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'params = { 'callback': 'jQuery112409962628943012035_1613628479734', 'appId': 'svwebpc', 'sign': '', 'data': '{"uid":0,"page":0,"pageSize":10}', '_': '1613628479737',}headers = { 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}response = requests.get(url=url, params=params, headers=headers)

The question arises, is the returned data JSON data?

As shown in the above figure, many people must think that this is not a json data?

We know by viewing the response. The data returned to us is an extra paragraph jQuery112409962628943012035_1613628479734()

The json data is contained in it, and there are three ways to extract the data.

1, return response.text, use the regular expression to extract the url address and the name of the publisher

video_url = re.findall('"resurl":"(.*?) "', response.text)user_name = re.findall('"username":"(.*?) "', response.text)

2. Return response.text, extract the data in jQuery112409962628943012035_1613628479734() using regular expressions, and then convert the string to json data through the json module, and then traverse the extracted data.

string = re.findall('jQuery112409962628943012035_1613628479734\((.*?)\) ', response.text)[0]json_data = json.loads(string)result = json_data['data']['data']pprint.pprint(result)

3. Delete the callback in the URL address of the request, and you can directly obtain the json data.

import pprintimport requestsurl = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'params = { 'appId': 'svwebpc', 'sign': '', 'data': '{"uid":0,"page":1,"pageSize":10}', '_': '1613628479737',}headers = { 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}response = requests.get(url=url, params=params, headers=headers)json_data = response.json()result = json_data['data']['data']pprint.pprint(result)

2. Save data

for index in result: video_url = index['resurl'] user_name = index['username'] video_content = requests.get(url=video_url, headers=headers).content with open('video\\' + user_name + '.mp4', mode='wb') as f: f.write(video_content) print(user_name)

Note: The user name has special characters, and an error will be reported when saving it.

So you need to replace special characters with regular expressions

def change_title(title): pattern = re.compile(r"[\/\\\:\*\?\ "\\|]") # '/ \ : * ? "> Complete implementation code import reimport requestsimport redef change_title(title): pattern = re.compile(r"[\/\\\:\*\?\ "\\|]") # '/ \ : * ? "

< >

|' new_title = re.sub(pattern, "_", title) #replaced with underscore return new_titlepage = 0while True: page += 1 url = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2' params = { 'appId': 'svwebpc', 'sign': '', 'data': '{"uid":0,"page":%s,"pageSize":10}' % str(page), '_': '1613628479737', } headers = { 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36' } response = requests.get(url=url, params=params, headers=headers) json_data = response.json() result = json_data['data']['data'] for index in result: video_url = index['resurl'] user_name = index['username'] new_title = change_title(user_name) video_content = requests.get(url=video_url, headers=headers).content with open('video\\' + new_title + '.mp4', mode='wb') as f: f.write(video_content) print(user_name)

"Python how to crawl yy station small video" content is introduced here, thank you for reading. If you want to know more about industry-related knowledge, you can pay attention to the website. Xiaobian will output more high-quality practical articles for everyone!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report