Tutorial: Building a "Today's News" Mini Program with a Python Web Crawler

2025-02-24 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/03

This article explains how to build a "Today's News" mini program with a Python web crawler. Many people run into questions when attempting this, so the editor has consulted various materials and put together a simple, practical approach. I hope it helps answer those questions. Let's get started!

Core code

requests.get downloads the HTML page.

bs4.BeautifulSoup parses the HTML content.

    from requests import get
    from bs4 import BeautifulSoup as bs
    from datetime import datetime as dt

    def Today(style=1):
        date = dt.today()
        if style != 1:
            # literal shown in translation; the original presumably used Chinese date characters
            return f'{date.month} month {date.day}'
        return f'{date.year}-{date.month:02}-{date.day:02}'

    def SinaNews(style=1):
        url1 = 'http://news.***.com.cn/'
        if style == 1:
            url1 += 'world'
        elif style == 2:
            url1 += 'china'
        else:
            url1 = 'https://mil.news.sina.com.cn/'
        text = get(url1)
        text.encoding = 'utf-8'
        soup = bs(text.text, 'html.parser')
        aTags = soup.find_all("a")
        # keep only the links whose markup contains today's date
        return [(t.text, t['href']) for t in aTags if Today() in str(t)]

Crawl the title

    >>> for i, news in enumerate(SinaNews(1)):
    ...     print(f'No{i+1}:', news[0])
    ...
    No1: foreign media: *
    No2: Japanese media: *
    ...
    (The content has been masked.)
    >>>

Since this was my first crawler, I looked for a news website whose pages need no reverse engineering: you can download the page and read the content directly. Three pages (international, domestic, and military news) serve as the content sources. After requests.get downloads a page, the HTML is parsed, and the <a> tags whose markup contains today's date are exactly the ones needed.
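The date filter described above can be tried offline. The following is a minimal sketch, not the author's code: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages, and it checks only the href attribute, whereas the article tests the whole tag markup. The names DatedLinkParser and dated_links, and the example.com URLs, are hypothetical.

```python
from datetime import date
from html.parser import HTMLParser


class DatedLinkParser(HTMLParser):
    """Collect (title, href) pairs for <a> tags whose href contains a date stamp."""

    def __init__(self, stamp):
        super().__init__()
        self.stamp = stamp      # e.g. '2021-09-29'
        self.links = []         # collected (title, href) pairs
        self._href = None       # href of the matching <a> currently open
        self._text = []         # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if self.stamp in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def dated_links(html, stamp=None):
    """Return (title, href) pairs for links dated `stamp` (defaults to today)."""
    parser = DatedLinkParser(stamp or date.today().isoformat())
    parser.feed(html)
    return parser.links


# Hypothetical sample markup standing in for a downloaded news index page.
SAMPLE = '''
<a href="https://news.example.com/w/2021-09-29/doc-1.shtml">Headline one</a>
<a href="https://news.example.com/about.html">About us</a>
<a href="https://news.example.com/c/2021-09-28/doc-2.shtml">Yesterday's story</a>
'''

print(dated_links(SAMPLE, '2021-09-29'))
```

Only the first sample link survives the filter, because it is the only one whose address carries the requested date. Navigation links and older stories drop out without any site-specific parsing.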

Crawl the text

Next, download the article page from its URL. Inspecting the page shows that the div with id='article' holds the body text. get_text() is the key function for extracting the text, which is then lightly formatted.
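The formatting step can be tested on its own before it is wired into a download function. This is a small sketch under the assumption that the goal is simply to trim the ends and collapse runs of blank lines; tidy_text and tidy_text_re are hypothetical helper names, and the sample string is made up.

```python
import re


def tidy_text(text):
    """Strip the ends and collapse runs of three or more newlines to two."""
    text = text.strip()
    while '\n\n\n' in text:
        text = text.replace('\n\n\n', '\n\n')
    return text


def tidy_text_re(text):
    """Same result with a single regular-expression substitution."""
    return re.sub(r'\n{3,}', '\n\n', text.strip())


# Made-up sample standing in for the raw output of get_text().
raw = '\n\nOriginal title: example\n\n\n\n\nFirst paragraph.\n\n\nSecond paragraph.\n'
print(repr(tidy_text(raw)))
```

The loop form mirrors the approach used in the article; the regular-expression form gives the same result in one pass and may be easier to extend.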

    >>> def NewsDownload(url):
    ...     html = get(url)
    ...     html.encoding = 'utf-8'
    ...     soup = bs(html.text, 'html.parser')
    ...     text = soup.find('div', id='article').get_text().strip()
    ...     # string literals appear here in translation; the page itself is in Chinese
    ...     text = text.replace('Click to enter the topic:', 'related topics:')
    ...     # text = text.replace(?, '\n')   # first argument lost in extraction
    ...     while '\n\n\n' in text:
    ...         text = text.replace('\n\n\n', '\n\n')
    ...     return text
    ...
    >>> url = 'https://******/w/2021-09-29/doc-iktzqtyt8811588.shtml'
    >>> NewsDownload(url)
    'original title: *'
    >>>

Interface code

The interface uses the Text, Listbox, Scrollbar, and Button widgets from tkinter, the GUI library bundled with Python. Set their basic properties, position them with place(), bind the commands, and debug until the program is complete.
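As a rough illustration of that widget setup, here is a minimal sketch of a Listbox with a scrollbar above a read-only Text area. It is not the author's layout: all coordinates, the 'Show' button, the '<Double-Button-1>' binding, and the names bottom_left_geometry and build_window are assumptions chosen for the example.

```python
def bottom_left_geometry(width, height, screen_height, margin=100):
    """Build a Tk geometry string placing the window near the bottom-left corner."""
    return f'{width}x{height}+8+{screen_height - height - margin}'


def build_window(titles):
    """Create a fixed-size window: scrolling Listbox on top, read-only Text below."""
    import tkinter as tk  # imported lazily so the helper above works without a display

    win = tk.Tk()
    win.title('Layout sketch')
    win.geometry(bottom_left_geometry(480, 640, win.winfo_screenheight()))
    win.resizable(False, False)

    # Listbox and its vertical scrollbar are linked in both directions.
    list_scroll = tk.Scrollbar(win, orient=tk.VERTICAL)
    listbox = tk.Listbox(win, selectmode=tk.BROWSE, yscrollcommand=list_scroll.set)
    list_scroll.config(command=listbox.yview)
    listbox.place(x=10, y=10, width=440, height=250)      # assumed coordinates
    list_scroll.place(x=450, y=10, width=20, height=250)  # assumed coordinates
    for number, title in enumerate(titles, start=1):
        listbox.insert(tk.END, f'{number:03} {title}')

    text = tk.Text(win)
    text.place(x=10, y=310, width=460, height=320)        # assumed coordinates

    def show(event=None):
        """Copy the selected title into the Text widget."""
        picked = listbox.curselection()
        if not picked:
            return
        # A disabled Text ignores inserts, so enable it, rewrite, then disable again.
        text.config(state=tk.NORMAL)
        text.delete('1.0', tk.END)
        text.insert(tk.END, titles[picked[0]])
        text.config(state=tk.DISABLED)

    tk.Button(win, text='Show', command=show).place(x=10, y=270)
    listbox.bind('<Double-Button-1>', show)
    text.config(state=tk.DISABLED)
    return win
```

To try it, call build_window(['First headline', 'Second headline']).mainloop() on a machine with a display. Keeping the geometry calculation in a separate pure function makes that part checkable without opening a window.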

Source code of News.pyw (the website names involved have been masked):

    from requests import get
    from bs4 import BeautifulSoup as bs
    from datetime import datetime as dt
    from os import path
    import tkinter as tk

    # NOTE: values marked ASSUMED did not survive extraction of this page.
    # They are placeholders, not the author's original values.

    def Today(style=1):
        date = dt.today()
        if style != 1:
            # literal shown in translation; the original presumably used Chinese date characters
            return f'{date.month} month {date.day}'
        return f'{date.year}-{date.month:02}-{date.day:02}'

    def SinaNews(style=1):
        url1 = 'http://news.****.com.cn/'
        if style == 1:
            url1 += 'world'
        elif style == 2:
            url1 += 'china'
        else:
            url1 = 'https://mil.****.com.cn/'
        text = get(url1)
        text.encoding = 'utf-8'
        soup = bs(text.text, 'html.parser')
        aTags = soup.find_all("a")
        # keep only the links whose markup contains today's date
        return [(t.text, t['href']) for t in aTags if Today() in str(t)]

    def NewsList(i):
        # reload one news category and refresh both widgets
        global news
        news = SinaNews(i)
        tList.delete(0, tk.END)
        for idx, item in enumerate(news):
            tList.insert(tk.END, f'{idx+1:03} {item[0]}')
        tText.config(state=tk.NORMAL)
        tText.delete(0.0, tk.END)
        tText.config(state=tk.DISABLED)
        NewsShow(0)

    def NewsList1():
        NewsList(1)

    def NewsList2():
        NewsList(2)

    def NewsList3():
        NewsList(3)

    def NewsShow(idx):
        # idx is 0 for the first item, or an event object when called from the Listbox binding
        if idx != 0:
            idx = tList.curselection()[0]
        title, url = news[idx][0], news[idx][1]
        html = get(url)
        html.encoding = 'utf-8'
        soup = bs(html.text, 'html.parser')
        text = soup.find('div', id='article').get_text().strip()
        # string literals appear here in translation; the page itself is in Chinese
        text = text.replace('Click to enter the topic:', 'related topics:')
        # text = text.replace(?, '\n')   # first argument lost in extraction
        while '\n\n\n' in text:
            text = text.replace('\n\n\n', '\n\n')
        tText.config(state=tk.NORMAL)
        tText.delete(0.0, tk.END)
        tText.insert(tk.END, title + '\n\n' + text)
        tText.config(state=tk.DISABLED)

    def InitWindow(self, W, H):
        y = self.winfo_screenheight()
        winPosition = str(W) + 'x' + str(H) + '+8+' + str(y - H - 100)
        self.geometry(winPosition)
        icoFile = 'favicon.ico'
        f = path.exists(icoFile)
        if f:
            win.iconbitmap(icoFile)
        self.resizable(False, False)
        self.wm_attributes('-topmost', True)
        self.title(bTitle[0])
        SetControl()
        self.update()
        self.mainloop()

    def SetControl():
        global tList, tText
        tScroll = tk.Scrollbar(win, orient=tk.VERTICAL)
        tScroll.place(x=450, y=10, width=20, height=250)      # ASSUMED coordinates
        tList = tk.Listbox(win, selectmode=tk.BROWSE, yscrollcommand=tScroll.set)
        tScroll.config(command=tList.yview)
        for idx, item in enumerate(news):
            tList.insert(tk.END, f'{idx+1:03} {item[0]}')
        tList.place(x=10, y=10, width=440, height=250)        # ASSUMED coordinates
        bX, bY = 95, 270   # coordinates of the button
        tBtn1 = tk.Button(win, text=bTitle[1], command=NewsList1)
        tBtn1.place(x=bX, y=bY)
        tBtn2 = tk.Button(win, text=bTitle[2], command=NewsList2)
        tBtn2.place(x=bX+100, y=bY)                           # ASSUMED offset
        tBtn3 = tk.Button(win, text=bTitle[3], command=NewsList3)
        tBtn3.place(x=bX+200, y=bY)
        tScroll2 = tk.Scrollbar(win, orient=tk.VERTICAL)
        tScroll2.place(x=450, y=310, width=20, height=320)    # ASSUMED coordinates
        tText = tk.Text(win, yscrollcommand=tScroll2.set)
        tScroll2.config(command=tText.yview)
        tText.place(x=10, y=310, width=440, height=320)       # ASSUMED coordinates
        # font name shown in translation; the size is ASSUMED
        tText.config(state=tk.DISABLED, bg='azure', font=('Song style', 12))
        NewsShow(0)
        tList.bind("<Double-Button-1>", NewsShow)             # ASSUMED event name

    if __name__ == '__main__':
        win = tk.Tk()
        bTitle = ('Today News', 'International News', 'domestic News', 'military News')
        news = SinaNews()
        InitWindow(win, 480, 640)

The full code is given above, so I will not analyze it in detail here. If needed, leave a comment to discuss. It runs without errors in my environment, Win7 + Python 3.8.8. The website names in the article have been masked; if you cannot guess them, you can ask me privately.

Software compilation

Use pyinstaller.exe to package the program into a single executable file. Note that the source file should have the .pyw suffix, otherwise a black cmd window will appear. One more small tip: a website's logo icon file can generally be downloaded from its root directory, namely:

http(s)://websiteurl.com(.cn)/favicon.ico
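The icon address can be derived from any page URL with the standard library. This is a small sketch assuming the site follows the root-level favicon.ico convention, which not every site does; favicon_url is a hypothetical helper name and the example.com address is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def favicon_url(page_url):
    """Return the conventional root-level favicon address for a page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop the query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, '/favicon.ico', '', ''))


print(favicon_url('https://www.example.com/news/2021-09-29/story.shtml'))
```

The resulting address can then be saved as favicon.ico next to the script, for example with requests.get(...).content written to a file opened in binary mode.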

The compilation command is as follows:

D:\> pyinstaller --onefile --nowindowed --icon="D:\favicon.ico" News.pyw

After the compilation is completed, a News.exe executable file is generated in the dist folder, about 15 MB in size.

Feel free to take the program and use it as is. Remember to bookmark this article before you leave. Thank you!

That concludes this tutorial on building a "Today's News" mini program with a Python web crawler. I hope it has answered your questions. Pairing theory with practice is the best way to learn, so go and try it! To learn more about related topics, keep following this website; the editor will continue to bring you practical articles.
