Updated 2025-01-19. From: SLTechnology News&Howtos (shulou), Internet Technology
This article explains how to crawl short videos from the YY site with Python. It walks through finding the data interface, handling the wrapped response the interface returns, and saving the videos locally.
Basic development environment
Python 3.6
PyCharm
    import os
    import requests
Install Python and add it to the PATH environment variable, then use pip to install the third-party module used here (requests). The os module is part of the standard library.
I. Determining target needs
Search for YY on Baidu, open the site, click the category menu, and select short videos. The selfie short videos listed there are the data we need.
As shown in the figure, the URL highlighted in the box is the playback address of a short video.
Request parameters for page 3:
The only thing that changes from page to page is the page value inside the data parameter.
Build a paging loop, extract each video's URL and the publisher's name, and save the videos locally.
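The paging described above can be sketched as follows. This is a minimal sketch, assuming the data parameter is a JSON string with the uid, page, and pageSize fields seen in the captured request. build_params is a hypothetical helper name, not part of the original code.

```python
import json


def build_params(page, page_size=10):
    """Build the query parameters for one page (hypothetical helper)."""
    # Assumption: the fields mirror the request captured in this article.
    data = {"uid": 0, "page": page, "pageSize": page_size}
    return {
        "appId": "svwebpc",
        "sign": "",
        # Compact separators produce the same shape as the captured request
        "data": json.dumps(data, separators=(",", ":")),
        "_": "1613628479737",
    }


if __name__ == "__main__":
    # Show the data parameter for the first three pages
    for page in range(1, 4):
        print(build_params(page)["data"])
```

Building the string with json.dumps instead of string formatting keeps the quoting correct if more fields are added later.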
III. Code implementation
1. Request data interface
    import requests

    url = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'
    params = {
        'callback': 'jQuery112409962628943012035_1613628479734',
        'appId': 'svwebpc',
        'sign': '',
        'data': '{"uid":0,"page":0,"pageSize":10}',
        '_': '1613628479737',
    }
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, params=params, headers=headers)
The question is: is the returned data JSON?
As the figure above shows, it looks like JSON at first glance, but it is not quite.
Inspecting the response shows that the payload is wrapped in an extra call: jQuery112409962628943012035_1613628479734().
The JSON data sits inside the parentheses (a JSONP response), and there are three ways to extract it.
1. Take response.text and use regular expressions to pull out the video URLs and the publisher names directly.
    import re

    video_url = re.findall('"resurl":"(.*?)"', response.text)
    user_name = re.findall('"username":"(.*?)"', response.text)
2. Take response.text, use a regular expression to extract what is inside jQuery112409962628943012035_1613628479734(), convert that string to Python data with the json module, and then traverse the result.
    import json
    import pprint
    import re

    # Greedy match up to the last ')' so a ')' inside the JSON cannot cut the payload short
    string = re.findall(r'jQuery112409962628943012035_1613628479734\((.*)\)', response.text, re.S)[0]
    json_data = json.loads(string)
    result = json_data['data']['data']
    pprint.pprint(result)
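The second method hard-codes the callback name, which changes whenever the page generates a new one. A minimal sketch of a more general unwrapping step is shown below, assuming the response always has the form callbackName(JSON). unwrap_jsonp is a hypothetical helper, and the sample string is made up for illustration.

```python
import json
import re


def unwrap_jsonp(text):
    """Return the JSON payload from a JSONP string such as 'cb({...})'."""
    # Match an identifier-like callback name, then capture everything between
    # the first '(' and the last ')'. A trailing ';' is tolerated.
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", text, re.S)
    if match is None:
        raise ValueError("text does not look like a JSONP response")
    return json.loads(match.group(1))


if __name__ == "__main__":
    # Made-up sample shaped like the response described in the article
    sample = 'jQuery1124_1613({"data":{"data":[{"username":"a(b)","resurl":"http://example.com/v.mp4"}]}})'
    print(unwrap_jsonp(sample)["data"]["data"][0]["username"])
```

Because the capture is greedy, parentheses inside a user name do not truncate the payload.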
3. Remove the callback parameter from the request, and the interface returns plain JSON directly.
    import pprint
    import requests

    url = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'
    params = {
        'appId': 'svwebpc',
        'sign': '',
        'data': '{"uid":0,"page":1,"pageSize":10}',
        '_': '1613628479737',
    }
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=url, params=params, headers=headers)
    json_data = response.json()
    result = json_data['data']['data']
    pprint.pprint(result)
2. Save data
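    import os

    os.makedirs('video', exist_ok=True)  # make sure the output folder exists
    for index in result:
        video_url = index['resurl']
        user_name = index['username']
        # Download the video as raw bytes
        video_content = requests.get(url=video_url, headers=headers).content
        with open(os.path.join('video', user_name + '.mp4'), mode='wb') as f:
            f.write(video_content)
        print(user_name)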
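The save step above reads each video fully into memory through .content. For larger files, a streamed download may be preferable. The following is a minimal sketch under that assumption. save_stream and download are hypothetical helper names, and the chunk size and timeout are arbitrary choices.

```python
def save_stream(response, path, chunk_size=64 * 1024):
    """Write a streamed response body to path and return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written


def download(video_url, path, headers=None, timeout=30):
    """Download one video without holding the whole file in memory."""
    # Imported here so save_stream can be reused without requests installed
    import requests

    # stream=True defers fetching the body until iter_content is consumed
    with requests.get(video_url, headers=headers, stream=True, timeout=timeout) as response:
        response.raise_for_status()  # fail early on HTTP errors
        return save_stream(response, path)
```

A call such as download(video_url, 'video/name.mp4', headers=headers) would then replace the requests.get(...).content and open(...) pair in the loop.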
Note: a publisher name may contain characters that are not allowed in file names, which raises an error when the file is saved.
So replace those characters using a regular expression.
    def change_title(title):
        pattern = re.compile(r'[\\/:*?"<>|]')  # \ / : * ? " < > |
        new_title = re.sub(pattern, "_", title)  # replace with an underscore
        return new_title
Complete implementation code
    import os
    import re
    import requests


    def change_title(title):
        pattern = re.compile(r'[\\/:*?"<>|]')  # \ / : * ? " < > |
        new_title = re.sub(pattern, "_", title)  # replace with an underscore
        return new_title


    os.makedirs('video', exist_ok=True)  # make sure the output folder exists
    page = 0
    while True:
        page += 1
        url = 'https://api-tinyvideo-web.yy.com/home/tinyvideosv2'
        params = {
            'appId': 'svwebpc',
            'sign': '',
            'data': '{"uid":0,"page":%s,"pageSize":10}' % str(page),
            '_': '1613628479737',
        }
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
        }
        response = requests.get(url=url, params=params, headers=headers)
        json_data = response.json()
        result = json_data['data']['data']
        if not result:  # assumption: an empty page means there is nothing left
            break
        for index in result:
            video_url = index['resurl']
            user_name = index['username']
            new_title = change_title(user_name)
            video_content = requests.get(url=video_url, headers=headers).content
            with open(os.path.join('video', new_title + '.mp4'), mode='wb') as f:
                f.write(video_content)
            print(user_name)
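Because each file is named after the publisher, a second video from the same publisher would overwrite the first. One possible way around that is sketched below. unique_path is a hypothetical helper, and the _1, _2 suffix scheme is an assumption rather than something the original code does.

```python
import os
import re
import tempfile


def change_title(title):
    """Replace characters that Windows forbids in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', "_", title)


def unique_path(directory, name, ext=".mp4"):
    """Return a path inside directory that does not exist yet."""
    base = change_title(name)
    candidate = os.path.join(directory, base + ext)
    counter = 1
    while os.path.exists(candidate):
        # Append _1, _2, ... until the name is free
        candidate = os.path.join(directory, "{}_{}{}".format(base, counter, ext))
        counter += 1
    return candidate


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        first = unique_path(folder, "a/b")
        open(first, "wb").close()  # pretend the first video was saved
        second = unique_path(folder, "a/b")
        print(os.path.basename(first), os.path.basename(second))
```

In the download loop, open(unique_path('video', user_name), mode='wb') would take the place of the hand-built path.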
That concludes this introduction to crawling YY short videos with Python. Thank you for reading.