
How to use BeautifulSoup to crawl web content


This article explains how to use BeautifulSoup to crawl web content. The content is simple, clear, and easy to follow; please work through it step by step with the editor and study "how to use BeautifulSoup to crawl web content" together!

Recently I have been working on a food safety project that needs to crawl news. I remembered that crawling with BeautifulSoup was very convenient, so I tried it again today, and it works.

The crawled links are as follows: http://news.sohu.com/1/0903/61/subject212846158.shtml

The structure is as follows:

The format of the link from the second page is: http://news.sohu.com/1/0903/61/subject212846158_1091.shtml

The page number decreases page by page (that is, 1091, then 1090, and so on).
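As a quick illustration (my own sketch, not code from the original article), the list of page URLs could be built like this, assuming the numbering really does run from 1091 downward:

# Sketch only: build a few of the per-page URLs, assuming the numbers count down from 1091.
base = 'http://news.sohu.com/1/0903/61/subject212846158'
page_urls = [base + '.shtml']                         # the first page has no numeric suffix
for n in range(1091, 1088, -1):                       # 1091, 1090, 1089, just for illustration
    page_urls.append(base + '_' + str(n) + '.shtml')  # e.g. subject212846158_1091.shtml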

Required content: title, time, source, author, full text.

Libraries needed: urllib2, BeautifulSoup (the bs4 package), and lxml.

Import these libraries first.

import urllib2
import lxml
from bs4 import BeautifulSoup

First, use the browser's developer tools to get the request headers (strictly speaking, we could do without headers here):

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"}

def sina_news(url, i):
    request = urllib2.Request(url, headers=headers)  # build the request with headers
    response = urllib2.urlopen(request)              # send the request and get the response
    html_doc = response.read()                       # read the HTML document
    soup = BeautifulSoup(html_doc, 'lxml')           # parse the HTML with the lxml parser
    titles = soup.select('td.newsblue1 > a:nth-of-type(' + str(i) + ')')   # use a CSS selector to get the title
    time = soup.select('td.newsblue1 > span:nth-of-type(' + str(i) + ')')  # same for the time
    print titles[0].get_text()  # select() returns a list; its first element is what we want, and get_text() strips the HTML tags and returns the text
    print time[0].get_text()
    print titles[0]['href']

When writing the selectors, I used the developer tools' element locator: after locating the element, right-click and choose Copy > Copy selector. Note, however, that nth-child(x) needs to be changed to nth-of-type(x). Here we use

nth-of-type(' + str(i) + ')

The reason for this expression is that in the page structure the news items are arranged as sibling entries: the first one is nth-of-type(1), the second is nth-of-type(2), and so on.
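To make the nth-child versus nth-of-type distinction concrete, here is a tiny self-contained illustration; it is my own example, not part of the original crawler:

# My own toy example: nth-of-type counts only elements of the given tag.
from bs4 import BeautifulSoup

demo_html = ('<div class="newsblue1">'
             '<a href="u1">first</a><span>t1</span>'
             '<a href="u2">second</a><span>t2</span>'
             '</div>')
demo = BeautifulSoup(demo_html, 'html.parser')
print demo.select('div.newsblue1 > a:nth-of-type(2)')[0].get_text()     # prints "second"
print demo.select('div.newsblue1 > span:nth-of-type(1)')[0].get_text()  # prints "t1"

Now test sina_news against the real page: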

for i in range(1, 201):
    sina_news('http://news.sohu.com/1/0903/61/subject212846158.shtml', i)

The results are as follows:

So far we have only extracted the title, time, and link; we still need the source and the author. But since we already have the link to each piece of news, that is easy to do.

Let's first take a look at the structure of each piece of news:

By the same token, it is easy to extract the source and responsible editor. The code is as follows:

def get_source(url):
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    html_doc = response.read()
    soup = BeautifulSoup(html_doc, 'lxml')
    sources = soup.select('#media_span')    # the source sits in the element with id media_span
    editor = soup.select('#editor_baidu')   # the responsible editor sits in the element with id editor_baidu
    return sources, editor

Add the following code to the original function:

    sources, editor = get_source(titles[0]['href'])
    if sources:
        print sources[0].get_text()
    if editor:
        print editor[0].get_text()

Since not every news item necessarily has a source and a responsible editor, a check is added here. Now let's see how it works.
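Putting the pieces together, the updated sina_news now looks roughly like this (simply the snippets above assembled into one function):

def sina_news(url, i):
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    html_doc = response.read()
    soup = BeautifulSoup(html_doc, 'lxml')
    titles = soup.select('td.newsblue1 > a:nth-of-type(' + str(i) + ')')
    time = soup.select('td.newsblue1 > span:nth-of-type(' + str(i) + ')')
    print titles[0].get_text()
    print time[0].get_text()
    print titles[0]['href']
    sources, editor = get_source(titles[0]['href'])  # fetch the article page for source/editor
    if sources:                                      # not every article has a source
        print sources[0].get_text()
    if editor:                                       # or a responsible editor
        print editor[0].get_text()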

The result looks fine. Next, extract the content of all the pages.

def get_singalpage_all(url):
    for i in range(1, 201):
        sina_news(url, i)

def get_all_page():
    url = 'http://news.sohu.com/1/0903/61/subject212846158'
    for i in range(1091, 990, -1):  # the page numbers count down from 1091
        wholeURL = url + '_' + str(i) + '.shtml'
        get_singalpage_all(wholeURL)

Call:

get_singalpage_all('http://news.sohu.com/1/0903/61/subject212846158.shtml')
get_all_page()

Successfully crawled all the domestic news.
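One practical note (my addition, not from the original article): the code above is Python 2 only, since urllib2 and the print statement no longer exist in Python 3. Under Python 3 the request-and-parse step would look roughly like the sketch below, where fetch_soup is just a hypothetical helper name:

# Rough Python 3 equivalent of the request/parse step (a sketch, not part of the original code).
import urllib.request
from bs4 import BeautifulSoup

def fetch_soup(url, headers):
    request = urllib.request.Request(url, headers=headers)
    html_doc = urllib.request.urlopen(request).read()
    return BeautifulSoup(html_doc, 'lxml')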

That is all the source code. Of course, if you find it troublesome to read it this way, you can download it here:

https://alltoshare.com/product/2747.html

Thank you for reading. That is the content of "how to use BeautifulSoup to crawl web content". After studying this article, I believe you have a deeper understanding of how to use BeautifulSoup to crawl web content, though the specific usage still needs to be verified in practice. The editor will keep pushing more articles on related knowledge points for you; welcome to follow!
