Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How does Python crawl Himalayan audio data

2025-01-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/01 Report--

This article mainly explains "how Python crawls Himalayan audio data". Interested friends may wish to have a look. The method introduced in this paper is simple, fast and practical. Let's let the editor take you to learn "how Python crawls Himalayan audio data".

Project goal

Crawl Himalayan audio data

Victim address

Https://www.ximalaya.com/

Knowledge points of this article:

1. Systematic analysis of the nature of web pages.

2. Multi-tier data parsing

3. Save massive audio data

Environment:

Python 3.6

Pycharm

Requests

Parsel

Idea: (crawler case)

1. Determine the link address (url) where the data is located

two。 Send a request for an url address through code

3. Parsing data (wanted, filtered unwanted)

4. Data persistence (save)

Case study:

1. Get the id value of audio in static data

two。 Send a specified id value json data request (src)

3. Parse the URL address corresponding to audio from json data

Start writing code.

Import the required modules first

Import requestsimport parsel # data parsing module import re

1. Determine the link address where the data is located (url) reverse analyze the nature of the web page (static / dynamic web page)

Open the developer tool, play an audio, and you can find a package in Madie.

Copy URL, search

Continue to search and find the request header parameters

Url = 'https://www.ximalaya.com/youshengshu/4256765/p{}/'.format(page)headers = {' user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

two。 Send a request for an url address through code

Response = requests.get (url=url, headers=headers) html_data = response.text

3. Parse data (want, filter unwanted) parse the id value of audio

Selector = parsel.Selector (html_data) lis = selector.xpath ('/ / div [@ class= "sound-list _ is"] / ul/li') for li in lis: try: title = li.xpath ('. / / a Universe title`). Get () + '.m4a' href = li.xpath ('. / / a Universe ref.). Get () # print (title) Href) m4a_id = href.split ('/') [- 1] # print (href, m4a_id) # send specified id value json data request (src) json_url = 'https://www.ximalaya.com/revision/play/v1/audio?id={}&ptype=1'.format(m4a_id) json_data = requests.get (url=json_url) Headers=headers) .json () # print (json_data) # extract audio address m4a_url = json_data ['data'] [' src'] # print (m4a_url) # request audio data m4a_data = requests.get (url=m4a_url, headers=headers). Content new_title = change_title (title)

4. Data persistence (save)

With open ('video\\' + new_title, mode='wb') as f: f.write (m4a_data) print ('save complete:', title)

Finally, we have to deal with illegal characters in the file name.

Def change_title (title): pattern = re.compile (r "[[\ /\\:\ *\?\"\ |] ") #'/\: *?"

< >

| | new_title = re.sub (pattern, "_", title) # replace with an underscore return new_title complete code import reimport requestsimport parsel # data parsing module def change_title (title): "method for handling illegal characters in a file name"pattern = re.compile (r" [\ /\\:\ *\?\ "\ |]") #'/\: *? "

< >

| 'new_title = re.sub (pattern, "_", title) # replace with an underscore return new_titlefor page in range (13,33): print ('-crawling the data on page {}-'.format (page)) # 1. Determine the link address (url) where the data is located. Reverse analyze the nature of the web page (static / dynamic web page) url = 'https://www.ximalaya.com/youshengshu/4256765/p{}/'.format(page) headers = {' user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'} # 2. Send the request for url address through code response = requests.get (url=url, headers=headers) html_data = response.text # print (html_data) # 3. Parsing data (desired Filter unwanted) parse the id value of the audio selector = parsel.Selector (html_data) lis = selector.xpath ('/ / div [@ class= "sound-list _ is"] / ul/li') for li in lis: try: title = li.xpath ('. / / an is title'). Get () + '.m4a' href = li.xpath ('. / / AGUD @ Href'). Get () # print (title Href) m4a_id = href.split ('/') [- 1] # print (href, m4a_id) # send specified id value json data request (src) json_url = 'https://www.ximalaya.com/revision/play/v1/audio?id={}&ptype=1'.format(m4a_id) json_data = requests.get (url=json_url) Headers=headers) .json () # print (json_data) # extract audio address m4a_url = json_data ['data'] [' src'] # print (m4a_url) # request audio data m4a_data = requests.get (url=m4a_url Headers=headers). Content new_title = change_title (title) # print (new_title) # 4 Data persistence (save) with open ('video\\' + new_title, mode='wb') as f: f.write (m4a_data) print ('save complete:', title) except: pass

Run the code, and the effect is as follows

At this point, I believe you have a deeper understanding of "how Python crawls Himalayan audio data". You might as well do it in practice. Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report