Shulou (Shulou.com) Report | SLTechnology News & Howtos | 2025-01-19 Update
This article explains how to crawl Himalayan (Ximalaya) audio data with Python. The method introduced here is simple, fast, and practical. Let's walk through it step by step.
Project goal
Crawl Himalayan audio data
Target address
https://www.ximalaya.com/
Knowledge points covered in this article:
1. Systematic analysis of the nature of the web page
2. Multi-layer data parsing
3. Saving audio data in bulk
Environment:
Python 3.6
PyCharm
requests
parsel
Idea (general crawler workflow):
1. Determine the link address (url) where the data is located
2. Send a request to the url address through code
3. Parse the data (keep what is wanted, filter out the rest)
4. Data persistence (save)
Case analysis:
1. Get the id value of each audio item from the static page data
2. Send a json data request for the specified id value (src)
3. Parse the URL address of the audio from the json data
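To make steps 1-2 concrete before writing the full crawler, here is a minimal sketch of how an audio id is pulled from a page href and turned into the json request URL. The href value below is a made-up example for illustration, not taken from the site:

```python
# Hypothetical href as it would appear in an album page's <li><a> tag
href = '/youshengshu/4256765/16086140'  # trailing segment is the audio id (invented value)
m4a_id = href.split('/')[-1]  # take the last path segment as the id
json_url = 'https://www.ximalaya.com/revision/play/v1/audio?id={}&ptype=1'.format(m4a_id)
print(m4a_id)
print(json_url)
```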
Start writing code. Import the required modules first:

```python
import requests
import parsel  # data parsing module
import re
```
1. Determine the link address (url) where the data is located; reverse-analyze the nature of the web page (static/dynamic)
Open the developer tools, play an audio, and you can find a request in the Media tab.
Copy the URL and search for it.
Continue searching and find the request header parameters.

```python
url = 'https://www.ximalaya.com/youshengshu/4256765/p{}/'.format(page)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
}
```
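As a quick offline check that the pagination format string behaves as expected, this sketch builds the first few page URLs (pages 13-15, matching the page range used later in the complete code):

```python
# Build paginated album URLs from the format string shown above
base = 'https://www.ximalaya.com/youshengshu/4256765/p{}/'
urls = [base.format(page) for page in range(13, 16)]
for u in urls:
    print(u)
```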
2. Send a request to the url address through code

```python
response = requests.get(url=url, headers=headers)
html_data = response.text
```
3. Parse the data (keep what is wanted, filter out the rest); parse the id value of each audio item

```python
selector = parsel.Selector(html_data)
lis = selector.xpath('//div[@class="sound-list _is"]/ul/li')
for li in lis:
    try:
        title = li.xpath('.//a/@title').get() + '.m4a'
        href = li.xpath('.//a/@href').get()
        # print(title, href)
        m4a_id = href.split('/')[-1]
        # print(href, m4a_id)
        # send a json data request for the specified id value (src)
        json_url = 'https://www.ximalaya.com/revision/play/v1/audio?id={}&ptype=1'.format(m4a_id)
        json_data = requests.get(url=json_url, headers=headers).json()
        # print(json_data)
        # extract the audio address
        m4a_url = json_data['data']['src']
        # print(m4a_url)
        # request the audio data
        m4a_data = requests.get(url=m4a_url, headers=headers).content
        new_title = change_title(title)
```
4. Data persistence (save)

```python
        with open('video\\' + new_title, mode='wb') as f:
            f.write(m4a_data)
            print('save complete:', title)
```
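Note that `open('video\\...')` assumes a `video` folder already exists in the working directory, and the `\\` separator is Windows-specific. A small portable sketch that creates the folder first, using pathlib (an addition not in the original; the bytes and file name below are placeholders):

```python
from pathlib import Path

out_dir = Path('video')
out_dir.mkdir(exist_ok=True)   # create the folder if it does not exist

m4a_data = b'\x00\x01'         # placeholder bytes standing in for downloaded audio
new_title = 'episode_1.m4a'    # hypothetical cleaned title
with open(out_dir / new_title, mode='wb') as f:
    f.write(m4a_data)
print('save complete:', new_title)
```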
Finally, we have to deal with illegal characters in the file name.
```python
def change_title(title):
    """Handle illegal characters in a file name."""
    pattern = re.compile(r'[\/\\\:\*\?\"\<\>\|]')  # / \ : * ? " < > |
    new_title = re.sub(pattern, "_", title)  # replace with an underscore
    return new_title
```

Complete code:

```python
import re

import requests
import parsel  # data parsing module


def change_title(title):
    """Handle illegal characters in a file name."""
    pattern = re.compile(r'[\/\\\:\*\?\"\<\>\|]')  # / \ : * ? " < > |
    new_title = re.sub(pattern, "_", title)  # replace with an underscore
    return new_title


for page in range(13, 33):
    print('----- crawling the data on page {} -----'.format(page))
    # 1. Determine the link address (url) where the data is located;
    #    reverse-analyze the nature of the web page (static/dynamic)
    url = 'https://www.ximalaya.com/youshengshu/4256765/p{}/'.format(page)
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    # 2. Send a request to the url address through code
    response = requests.get(url=url, headers=headers)
    html_data = response.text
    # print(html_data)
    # 3. Parse the data; get the id value of each audio item
    selector = parsel.Selector(html_data)
    lis = selector.xpath('//div[@class="sound-list _is"]/ul/li')
    for li in lis:
        try:
            title = li.xpath('.//a/@title').get() + '.m4a'
            href = li.xpath('.//a/@href').get()
            m4a_id = href.split('/')[-1]
            # send a json data request for the specified id value (src)
            json_url = 'https://www.ximalaya.com/revision/play/v1/audio?id={}&ptype=1'.format(m4a_id)
            json_data = requests.get(url=json_url, headers=headers).json()
            # extract the audio address
            m4a_url = json_data['data']['src']
            # request the audio data
            m4a_data = requests.get(url=m4a_url, headers=headers).content
            new_title = change_title(title)
            # 4. Data persistence (save)
            with open('video\\' + new_title, mode='wb') as f:
                f.write(m4a_data)
                print('save complete:', title)
        except Exception:
            pass
```
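A quick standalone check of the `change_title` helper, runnable without any network access (the input string is an invented example):

```python
import re


def change_title(title):
    """Handle illegal characters in a file name."""
    pattern = re.compile(r'[\/\\\:\*\?\"\<\>\|]')  # / \ : * ? " < > |
    return re.sub(pattern, "_", title)  # replace with an underscore


# Colons, quotes, and question marks are all illegal on Windows
print(change_title('01: "Opening"?.m4a'))
```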
Run the code, and the audio files are downloaded and saved one after another.
At this point, you should have a deeper understanding of how Python crawls Himalayan audio data. Try putting it into practice yourself, and follow us to keep learning!