
How to write an audiobook crawler in Python


This article introduces how to write an audiobook crawler in Python. The explanation is detailed but easy to follow, and the steps are simple and quick, so it should serve as a useful reference. After reading it, you should be able to build one yourself; let's take a look.

Book title and chapter list

Click on any book; on its page you can use BeautifulSoup to extract the book title and the list of audio links for all of its individual chapters. Copy the address from the browser's address bar, for example https://www.tingchina.com/yousheng/disp_31086.htm.

from bs4 import BeautifulSoup
import requests
import re
import random
import os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

def get_detail_urls(url):
    # Fetch the book page, create a folder named after the book,
    # and return the book name plus the list of chapter names and URLs.
    url_list = []
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    soup = BeautifulSoup(response.text, 'lxml')
    name = soup.select('.red12')[0].strong.text
    if not os.path.exists(name):
        os.makedirs(name)
    div_list = soup.select('div.list a')
    for item in div_list:
        url_list.append({
            'name': item.string,
            'url': 'https://www.tingchina.com/yousheng/{}'.format(item['href'])
        })
    return name, url_list
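As a quick sanity check — a minimal sketch, assuming the sample page above is still structured as described — the function can be called directly:

# Hypothetical usage; the actual title and chapter count depend on the page.
name, chapters = get_detail_urls('https://www.tingchina.com/yousheng/disp_31086.htm')
print(name, len(chapters))
print(chapters[0])  # e.g. {'name': '...', 'url': 'https://www.tingchina.com/yousheng/...'}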

Audio address

Open the link of a single chapter and use the chapter name as the search term in the Elements panel; at the bottom of the page there is a script block, and that script holds the address of the sound source.
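For illustration, the embedded script presumably looks something like the line below (the exact markup is an assumption; only the fileUrl pattern matters), and a regular expression pulls the path out:

import re

# Made-up inline script, shaped like the one described above.
sample_script = 'var fileUrl= "/yousheng/31086/001.mp3";'
m = re.search('fileUrl= "(.*?)";', sample_script, re.S)
if m:
    print(m.group(1))  # -> /yousheng/31086/001.mp3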

As you can see in the Network panel, the domain of the sound-source URL is different from that of the chapter list; keep this in mind when constructing the download link.

def get_mp3_path(url):
    # Pull the audio path out of the last inline script on a chapter page.
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    soup = BeautifulSoup(response.text, 'lxml')
    script_text = soup.select('script')[-1].string
    fileUrl_search = re.search('fileUrl= "(.*?)";', script_text, re.S)
    if fileUrl_search:
        return 'https://t3344.tingchina.com' + fileUrl_search.group(1)
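Called on one of the chapter URLs from the first step, the function should return a full MP3 address — though, as the next section shows, that address alone is not yet downloadable:

# chapters comes from get_detail_urls() above; the printed URL is illustrative.
audio_url = get_mp3_path(chapters[0]['url'])
print(audio_url)  # e.g. https://t3344.tingchina.com/yousheng/.../001.mp3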

Download

Surprises always come suddenly: paste this https://t3344.tingchina.com/xxxx.mp3 into a browser and you get a 404.

A key parameter must be missing. Going back to the Network panel and taking a closer look at the mp3's URL shows a key parameter appended to it. This key is part of the return value of https://img.tingchina.com/play/h6_jsonp.asp?0.5078556568562795, and it can be extracted with a regular expression.

def get_key(url):
    # Request the key endpoint with a random query string and keep the
    # 42-character key found in the jsonp response.
    url = 'https://img.tingchina.com/play/h6_jsonp.asp?{}'.format(str(random.random()))
    headers['referer'] = url
    response = requests.get(url, headers=headers)
    matched = re.search('(key=.*?)";', response.text, re.S)
    if matched:
        temp = matched.group(1)
        return temp[len(temp) - 42:]
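The slice at the end keeps only the last 42 characters of the match, which appears to be the fixed length of the key. A made-up response shows the mechanics (only the shape of the text matters):

import re

fake_key = 'k' * 42  # pretend the key is 42 characters long
fake_response = 'url = url + "?key={}";'.format(fake_key)
matched = re.search('(key=.*?)";', fake_response, re.S)
if matched:
    temp = matched.group(1)        # 'key=kkk...k'
    print(temp[len(temp) - 42:])   # just the 42-character key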

Finally, tie the pieces above together under __main__.

if __name__ == "__main__":
    url = input("Please enter the address of the browser page:")
    dir, url_list = get_detail_urls(url)
    for item in url_list:
        audio_url = get_mp3_path(item['url'])
        key = get_key(item['url'])
        audio_url = audio_url + '?key=' + key
        headers['referer'] = item['url']
        r = requests.get(audio_url, headers=headers, stream=True)
        with open(os.path.join(dir, item['name']), 'ab') as f:
            f.write(r.content)
            f.flush()

Complete code

from bs4 import BeautifulSoup
import requests
import re
import random
import os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

def get_detail_urls(url):
    # Fetch the book page, create a folder named after the book,
    # and return the book name plus the list of chapter names and URLs.
    url_list = []
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    soup = BeautifulSoup(response.text, 'lxml')
    name = soup.select('.red12')[0].strong.text
    if not os.path.exists(name):
        os.makedirs(name)
    div_list = soup.select('div.list a')
    for item in div_list:
        url_list.append({
            'name': item.string,
            'url': 'https://www.tingchina.com/yousheng/{}'.format(item['href'])
        })
    return name, url_list

def get_mp3_path(url):
    # Pull the audio path out of the last inline script on a chapter page.
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    soup = BeautifulSoup(response.text, 'lxml')
    script_text = soup.select('script')[-1].string
    fileUrl_search = re.search('fileUrl= "(.*?)";', script_text, re.S)
    if fileUrl_search:
        return 'https://t3344.tingchina.com' + fileUrl_search.group(1)

def get_key(url):
    # Request the key endpoint with a random query string and keep the
    # 42-character key found in the jsonp response.
    url = 'https://img.tingchina.com/play/h6_jsonp.asp?{}'.format(str(random.random()))
    headers['referer'] = url
    response = requests.get(url, headers=headers)
    matched = re.search('(key=.*?)";', response.text, re.S)
    if matched:
        temp = matched.group(1)
        return temp[len(temp) - 42:]

if __name__ == "__main__":
    url = input("Please enter the address of the browser page:")
    dir, url_list = get_detail_urls(url)
    for item in url_list:
        audio_url = get_mp3_path(item['url'])
        key = get_key(item['url'])
        audio_url = audio_url + '?key=' + key
        headers['referer'] = item['url']
        r = requests.get(audio_url, headers=headers, stream=True)
        with open(os.path.join(dir, item['name']), 'ab') as f:
            f.write(r.content)
            f.flush()
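One aside on the download loop: it requests with stream=True but then reads r.content, which buffers the whole file in memory before writing. A minimal streamed variant of the loop body (same variables as above; 'wb' writes each file fresh) would be:

r = requests.get(audio_url, headers=headers, stream=True)
with open(os.path.join(dir, item['name']), 'wb') as f:
    # Write the MP3 to disk in 8 KB chunks instead of buffering it all.
    for chunk in r.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)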

This is the end of the article on how to write an audiobook crawler in Python. Thank you for reading! Hopefully you now have a solid grasp of the approach and can adapt it to your own use.
