
How to Implement a News Crawler with Python Code


This article introduces how to implement a news crawler with Python code. Many people run into this kind of problem in practice, so let the editor walk you through how to handle these situations. I hope you read it carefully and get something out of it!

News source: Reddit

We can submit news links and vote on them through Reddit, so Reddit is a good source of news. The next question is: how do we get the most popular news every day? Before thinking about crawling, we should first check whether the target site provides an API. Using an API is perfectly legitimate and, more importantly, it returns machine-readable data, so there is no need to parse HTML. Fortunately, Reddit does provide an API, and in the API list we can find the function we need: /top, which returns the most popular news on Reddit or on a specified subreddit. The next question is how to use this API. After carefully reading Reddit's documentation, I found the most effective way to use it.

Step 1: create an application on Reddit. Log in and go to the "preferences → apps" page; at the bottom there is a button labeled "create another app...". Click it and create an application of type "script". We don't need to provide an "about url" or "redirect url", because this application is not open to the public and will not be used by others.

After the application is created, you can find App ID and Secret in the application information.

The next question is how to use the App ID and Secret. Since we only need to get the most popular news on a specified subreddit, without accessing any user-related information, we don't need to provide personal information such as a username or password. Reddit offers "Application Only OAuth" for this case, which lets an application access public information anonymously. Run the following command:

```
$ curl -X POST -H 'User-Agent: myawesomeapp/1.0' -d grant_type=client_credentials --user 'OUR_CLIENT_ID:OUR_CLIENT_SECRET' https://www.reddit.com/api/v1/access_token
```

The command returns an access token:

{"access_token": "ABCDEFabcdef0123456789", "token_type": "bearer", "expires_in": 3600, "scope": "*"}

Great! With the access token you can do a lot of things. Finally, if you don't want to write the API access code yourself, you can use the Python client PRAW: https://github.com/praw-dev/praw. Let's first run a test and get the five most popular news items from /r/Python:

```python
>>> import praw
>>> import pprint
>>> reddit = praw.Reddit(client_id='OUR_CLIENT_ID',
...                      client_secret='OUR_SECRET',
...                      grant_type='client_credentials',
...                      user_agent='mytestscript/1.0')
>>> subs = reddit.subreddit('Python').top(limit=5)
>>> pprint.pprint([(s.score, s.title) for s in subs])
[(6555, 'Automate the boring stuff with python - tinder'),
 (4548, 'MS is considering official Python integration with Excel, and is asking for input'),
 (4102, 'Python Cheet Sheet for begineers'),
 (3285, 'We started late, but we managed to leave Python footprint on r/place'),
 (2899, "Python Section at Foyle's, London")]
```

Success!

Grab the news page

The next task is to grab the news pages, which is actually very simple. From the previous step we get Submission objects, whose url attribute is the address of the news item. We can also use the domain attribute to filter out the submissions that point back to Reddit itself (self posts):

```python
subs = [sub for sub in subs if not sub.domain.startswith('self.')]
```

We just need to grab the URL, which can be easily done with Requests:

```python
import requests

for sub in subs:
    res = requests.get(sub.url)
    if (res.status_code == 200 and 'content-type' in res.headers
            and res.headers.get('content-type').startswith('text/html')):
        html = res.text
```

Here we skip any news link whose content type is not text/html, because Reddit users often submit direct links to images, which we don't need.
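In practice a bare requests.get can hang on a slow host or raise an exception on a dead link. A small variation of the loop above (an illustrative sketch, not code from the original article) adds a timeout and simply skips failed fetches:

```python
import requests

pages = {}  # url -> html text
for sub in subs:
    try:
        res = requests.get(sub.url, timeout=10)
    except requests.RequestException:
        continue  # unreachable or misbehaving host: skip it
    if (res.status_code == 200
            and res.headers.get('content-type', '').startswith('text/html')):
        pages[sub.url] = res.text
```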

Extract news content

The next step is to extract the content from the HTML. Our goal is to extract the title and body text of the news and ignore everything that does not need to be read, such as the header, footer and sidebar. This job is difficult, and there is no universal, perfect solution. BeautifulSoup can help us extract the text content, but it will pull in things like the page footer along with it. Fortunately, I find that the structure of websites is much better than it used to be: there are no table layouts, the title and each paragraph are marked with proper heading and <p> tags, and most websites put the title and the body text in the same container element. A typical page is laid out roughly as follows: a site-navigation block, then a container holding the page title and its paragraphs, and finally a sidebar and a copyright footer (see the sketch below).
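A minimal sketch of that kind of layout (the tag names and the id here are illustrative, not taken from any particular site):

```html
<body>
  <nav>Site Navigation</nav>
  <div id="post">
    <h2>Page Title</h2>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
  </div>
  <aside>Sidebar</aside>
  <footer>Copyright ...</footer>
</body>
```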

The container element in the middle of that sketch is the one that holds both the title and the body text, so you can use the following algorithm to find the text:

1. Find the title heading; for SEO reasons there is usually only one main heading on a page (the code below looks for <h2>).
2. Take its parent element and check whether that parent contains enough <p> elements.
3. Repeat step 2, moving up one parent at a time, until you either find an element that contains enough <p> elements or reach the <body> element. If you find an element with enough <p> elements, it is the container of the body text; if you reach <body> first, then, as I have encountered before, the page contains nothing worth reading.

Although this algorithm is very crude and does not take any semantic information into account, it is entirely workable. After all, when the algorithm fails you can simply ignore that article; missing one article is no big deal. Of course, you can implement a more accurate algorithm by looking at semantic markers such as #main, .sidebar and similar elements. With this algorithm, the parsing code is easy to write:

```python
soup = BeautifulSoup(text, 'html.parser')
# find the article title
h2 = soup.body.find('h2')
# find the common parent for <h2> and all the <p>s
root = h2
while root.name != 'body' and len(root.find_all('p')) < 5:
    root = root.parent
if len(root.find_all('p')) < 5:
    return None
# find all the content elements
ps = root.find_all(['h3', 'h4', 'h5', 'h6', 'h7', 'p', 'pre'])
```

Here I use len(root.find_all('p')) < 5 as the filter condition for the body text, because a real news article is unlikely to have fewer than five paragraphs. You can adjust this value as needed.

Convert to an easy-to-read format

The last step is to convert the extracted content into an easy-to-read format. I chose Markdown, but you can write a better converter. In this example I only extract the headings, <p> and <pre> elements, so a simple function is enough:

````python
def tag2md(tag):
    if tag.name == 'p':
        return tag.text
    elif tag.name == 'h2':
        return f'{tag.text}\n{"=" * len(tag.text)}'
    elif tag.name == 'h3':
        return f'{tag.text}\n{"-" * len(tag.text)}'
    elif tag.name in ['h4', 'h5', 'h6', 'h7']:
        return f'{"#" * int(tag.name[1:])} {tag.text}'
    elif tag.name == 'pre':
        return f'```\n{tag.text}\n```'

ps = root.find_all(['h3', 'h4', 'h5', 'h6', 'h7', 'p', 'pre'])
ps.insert(0, h2)  # add the title
content = [tag2md(p) for p in ps]
````

The complete code
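The complete script is not reproduced on this page; what follows is only a rough sketch of how the pieces above might be wired together into one program that writes a Markdown file per article. The function names, the read-only PRAW setup and the filename scheme are illustrative assumptions, not the author's original code:

````python
import re

import praw
import requests
from bs4 import BeautifulSoup


def tag2md(tag):
    """Convert a single content tag to a Markdown fragment (as shown above)."""
    if tag.name == 'p':
        return tag.text
    elif tag.name == 'h2':
        return f'{tag.text}\n{"=" * len(tag.text)}'
    elif tag.name == 'h3':
        return f'{tag.text}\n{"-" * len(tag.text)}'
    elif tag.name in ['h4', 'h5', 'h6', 'h7']:
        return f'{"#" * int(tag.name[1:])} {tag.text}'
    elif tag.name == 'pre':
        return f'```\n{tag.text}\n```'


def parse_article(html):
    """Apply the title/paragraph heuristic above; return (title, markdown) or None."""
    soup = BeautifulSoup(html, 'html.parser')
    h2 = soup.body.find('h2')
    if h2 is None:
        return None
    root = h2
    while root.name != 'body' and len(root.find_all('p')) < 5:
        root = root.parent
    if len(root.find_all('p')) < 5:
        return None
    ps = root.find_all(['h3', 'h4', 'h5', 'h6', 'h7', 'p', 'pre'])
    ps.insert(0, h2)
    return h2.text, '\n\n'.join(tag2md(p) for p in ps)


def scrape(subreddit='Python', limit=5):
    print(f'Scraping /r/{subreddit}...')
    reddit = praw.Reddit(client_id='OUR_CLIENT_ID', client_secret='OUR_SECRET',
                         user_agent='mytestscript/1.0')
    subs = reddit.subreddit(subreddit).top(limit=limit)
    subs = [sub for sub in subs if not sub.domain.startswith('self.')]
    for sub in subs:
        print(f'  - Retrieving {sub.url}')
        res = requests.get(sub.url, timeout=10)
        if not (res.status_code == 200 and
                res.headers.get('content-type', '').startswith('text/html')):
            print('    x fail or not html')
            continue
        parsed = parse_article(res.text)
        if parsed is None:
            continue
        title, markdown = parsed
        filename = re.sub(r'\W+', '-', title.lower()).strip('-') + '.md'
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(markdown)
        print(f'    => done, title = "{title}"')


if __name__ == '__main__':
    scrape()
````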

Run it and see:

```
Scraping /r/Python...
  - Retrieving https://imgs.xkcd.com/comics/python_environment.png
    x fail or not html
  - Retrieving https://thenextweb.com/dd/2017/04/24/universities-finally-realize-java-bad-introductory-programming-language/#.tnw_PLAz3rbJ
    => done, title = "Universities finally realize that Java is a bad introductory programming language"
  - Retrieving https://github.com/numpy/numpy/blob/master/doc/neps/dropping-python2.7-proposal.rst
    x fail or not html
  - Retrieving http://www.thedurkweb.com/sms-spoofing-with-python-for-good-and-evil/
    => done, title = "SMS Spoofing with Python for Good and Evil"
```

Crawled news files:

"how to use Python code to achieve news crawler" content is introduced here, thank you for reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!
