This article explains how to use Python to collect a daily list of news hot spots. The content is simple and easy to follow; work through the steps below to learn how it is done.
Basic development environment
Python 3.6
PyCharm
import parsel
import requests
import re

Analysis of the target web page
We will crawl the international news column of the China News website (chinanews.com).
Looking at the page, you can find the relevant data interface, which contains the news headlines and the url addresses of the news detail pages.
How to extract the url addresses
1. Convert the response to json and take the value of the "url" key-value pair.
2. Use a regular expression to match the url addresses.
Either method works; it comes down to personal preference (both are sketched below).
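A minimal sketch of both approaches is below. The sample response string and its "docs" structure are assumptions for illustration only; the real interface response should be inspected in the browser's developer tools and may differ.

import json
import re

# toy interface response for illustration only; the real structure may differ
sample = '{"docs": [{"url": "https://www.example.com/news/1.shtml", "title": "demo"}]}'

# Method 1: parse the response as json and read the value of the "url" key
data = json.loads(sample)
urls_from_json = [item['url'] for item in data['docs']]

# Method 2: match the url values directly with a regular expression
urls_from_regex = re.findall('"url": "(.*?)"', sample)

print(urls_from_json, urls_from_regex)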
Paging is controlled by the pager parameter in the interface data url, which corresponds to the page number.
On the details page, the news content sits in p tags inside a div tag, so it can be extracted with an ordinary page-parsing step (a short sketch follows).
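In isolation, that extraction step looks roughly like the sketch below. The toy html string is an assumption for illustration; the CSS selector mirrors the one used in the full code further down.

import parsel

# toy details-page html for illustration; the real page is parsed the same way
page_data = '<div id="cont_1_1_2"><div class="left_zw"><p>First paragraph.</p><p>Second.</p></div></div>'

selector = parsel.Selector(page_data)
paragraphs = selector.css('#cont_1_1_2 div.left_zw p::text').getall()
content = ''.join(paragraphs)
print(content)   # -> First paragraph.Second.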
Saving options
1. Save as txt text files.
2. Save as PDF files (see the sketch below).
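The article only implements the txt option. For the PDF option, one possibility (not from the article) is the third-party pdfkit package, which wraps the wkhtmltopdf command-line tool; a minimal sketch, assuming both are installed:

import pdfkit   # third-party package; also needs the wkhtmltopdf binary installed

# placeholders for the scraped text and headline
content = 'news text ...'
title = 'headline'
pdfkit.from_string(content, title + '.pdf')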
Summary of the overall crawling approach
Open the column list page, click "more news" and capture the interface data url.
From the data returned by that interface url, match the url of each news details page.
Extract the news content by parsing the details page (re, css, or xpath all work).
Save the data.
Code implementation
Get the source code of the web page
def get_html(html_url):
    """
    Get the web page response
    :param html_url: web page url address
    :return: response object
    """
    response = requests.get(url=html_url, headers=headers)
    return response
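Note that get_html references a headers dictionary that the excerpt never defines; a minimal sketch of a typical definition is below (the User-Agent string is an illustrative assumption, not taken from the article).

# request headers; the User-Agent value here is an illustrative assumption
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}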
Get the url address of each news article
def get_page_url(html_data):
    """
    Get the url address of each news article
    :param html_data: response.text
    :return: list of news article url addresses
    """
    page_url_list = re.findall('"url":"(.*?)"', html_data)
    return page_url_list
File names cannot contain special characters, so the news headlines need to be cleaned up first.
def file_name(name):
    """
    File names cannot contain special characters
    :param name: news headline
    :return: headline with special characters replaced
    """
    replace = re.compile(r'[\\/:*?"|]')
    new_name = re.sub(replace, '_', name)
    return new_name
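A quick usage example (with a hypothetical headline) shows the effect:

# hypothetical headline containing characters that Windows file names reject
print(file_name('US/EU: what next?'))   # -> US_EU_ what next_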
Save data
def download(content, title):
    """
    Save the news content to a txt file with open()
    :param content: news content
    :param title: news headline
    :return:
    """
    path = 'News\\' + title + '.txt'
    with open(path, mode='a', encoding='utf-8') as f:
        f.write(content)
        print('saving', title)
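One practical caveat: open() fails if the News folder does not already exist, so it is worth creating it first (a small sketch, not in the original article):

import os

os.makedirs('News', exist_ok=True)   # create the output folder if it is missing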
Main function
def main(url):
    """
    Main function
    :param url: url address of the news list page
    :return:
    """
    html_data = get_html(url).text                                    # interface data, response.text
    lis = get_page_url(html_data)                                     # list of news url addresses
    for li in lis:
        page_data = get_html(li).content.decode('utf-8', 'ignore')   # news details page html
        selector = parsel.Selector(page_data)
        title = re.findall('<title>(.*?)</title>', page_data, re.S)[0]   # get the headline
        new_title = file_name(title)
        new_data = selector.css('#cont_1_1_2 div.left_zw p::text').getall()
        content = ''.join(new_data)
        download(content, new_title)


if __name__ == '__main__':
    for page in range(1):   # crawl the first page; widen the range to crawl more pages
        url_1 = 'https://channel.chinanews.com/cns/cjs/gj.shtml?pager={}&pagenum=9&t=5_58'.format(page)
        main(url_1)

Running effect diagram
Thank you for reading. That covers "how to use Python to make a daily news hot spot". After studying this article you should have a deeper understanding of the topic, though the specifics still need to be verified in practice. The editor will keep publishing more articles on related knowledge points; welcome to follow!