How to crawl the information of a website in big data


How do you crawl information from a website when working with big data? This article presents a detailed analysis and solution, in the hope of helping readers who want to solve this problem find a simpler, easier approach.

Suppose we want to crawl this site, stuu.scnu.edu.cn/articles, for article titles, summaries, and images.

First introduce the library:

from bs4 import BeautifulSoup

import requests

These two libraries are our crawling tools; installation instructions are at the bottom of the article.

Then define url as the address we want to crawl, and use requests.get() to fetch its content into wb_data.

url = "http://stuu.scnu.edu.cn/articles"

wb_data = requests.get(url)
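As a side note, some servers reject requests that lack a browser-like User-Agent, or silently return error pages. Here is a minimal sketch that checks the response before parsing; the header value is just an example and is not part of the original article:

import requests

url = "http://stuu.scnu.edu.cn/articles"
# assumption: a browser-like User-Agent, since some servers block the default one
headers = {"User-Agent": "Mozilla/5.0"}
wb_data = requests.get(url, headers=headers, timeout=10)
wb_data.raise_for_status()  # raises an exception on 4xx/5xx responses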

To parse the response we use lxml, another library, as BeautifulSoup's parser.

soup = BeautifulSoup(wb_data.text, 'lxml')
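If lxml is not installed, BeautifulSoup's built-in parser is a workable fallback (slower, but it needs no extra installation):

soup = BeautifulSoup(wb_data.text, 'html.parser')  # pure-Python fallback parser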

Let's start with an arbitrary column on the page:

Then we use the browser's developer tools to pick out the information we want. Remember the element-picker icon in the upper-left corner (the mouse-and-box icon, shown blue in the screenshot)? Click it, then click on a title.

The HTML panel on the right will jump to the corresponding element:

Select the element one level up, i.e. the parent node (the next small triangle in the tree), as shown in the figure:

Right-click → Copy → Copy selector, and you get this:

#main-wrap-left > div.bloglist-container.clr > article:nth-child(5) > div.home-blog-entry-text.clr

Remove the part from '#' through the first '>', and remove :nth-child(5), which refers to the fifth tag; if you keep it, you will only get the content of that fifth tag.

This gives: div.bloglist-container.clr > article > div.home-blog-entry-text.clr
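To see the difference, you can run both selectors side by side; the :nth-child(5) version matches at most one article, while the trimmed version matches every article on the page (a quick sketch, assuming soup was built as above):

# with :nth-child(5): at most one match; without it: every article on the page
fifth_only = soup.select("#main-wrap-left > div.bloglist-container.clr > article:nth-child(5) > div.home-blog-entry-text.clr")
all_entries = soup.select("div.bloglist-container.clr > article > div.home-blog-entry-text.clr")
print(len(fifth_only), len(all_entries))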

Doing the same thing for the image and for the summary separately, you get:

Image: div.bloglist-container.clr > article > a > div > img

Abstract: div.bloglist-container.clr > article > div.home-blog-entry-text.clr > p

Then assign each result to a variable: variable = soup.select("the selector you just copied")

titles = soup.select("div.bloglist-container.clr > article > div.home-blog-entry-text.clr")

texts = soup.select("div.bloglist-container.clr > article > div.home-blog-entry-text.clr > p")

imgs = soup.select("div.bloglist-container.clr > article > a > div > img")
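soup.select() returns a list of Tag objects, so before combining the three lists it is worth checking that they line up (a small sanity check, not part of the original article):

print(len(titles), len(texts), len(imgs))  # the three lists should have equal lengths
if titles:
    print(titles[0].get_text(strip=True))  # strip=True trims surrounding whitespace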

Here's how to extract the content. Put the results into a dictionary (title, text, and img are new loop variables):

for title, text, img in zip(titles, texts, imgs):
    data = {
        "title": title.get_text(),
        "Summary": text.get_text(),
        "Image": img.get('src')
    }

Then print(data) inside the loop, and you will see output like the following; the extraction is done.
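One caveat: src attributes are often relative paths. If you want absolute image URLs, urljoin from the standard library can resolve them, and collecting the dictionaries in a list keeps the results around for later use (a hedged variation on the loop above):

from urllib.parse import urljoin

results = []
for title, text, img in zip(titles, texts, imgs):
    results.append({
        "title": title.get_text(),
        "Summary": text.get_text(),
        # urljoin turns a relative src such as /uploads/a.jpg into a full URL
        "Image": urljoin(url, img.get('src')),
    })
print(len(results))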

Complete Code:

from bs4 import BeautifulSoup

import requests

url = "http://stuu.scnu.edu.cn/articles"

wb_data = requests.get(url)

soup = BeautifulSoup(wb_data.text, 'lxml')

titles = soup.select("div.bloglist-container.clr > article > div.home-blog-entry-text.clr")

texts = soup.select("div.bloglist-container.clr > article > div.home-blog-entry-text.clr > p")

imgs = soup.select("div.bloglist-container.clr > article > a > div > img")

for title, text, img in zip(titles, texts, imgs):
    data = {
        "title": title.get_text(),
        "Summary": text.get_text(),
        "Image": img.get('src')
    }
    print(data)

If you want more than just these ten entries (one page's worth), you can build functions to fetch more pages.

We notice that the URL of page 2 is stuu.scnu.edu.cn/articles?paged=2

That's the key. We can replace the 2 with other page numbers:

url = "http://stuu.scnu.edu.cn/articles? paged="

And then turn that into a function:

def get_page(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select("div.bloglist-container.clr > article > div.home-blog-entry-text.clr")
    texts = soup.select("div.bloglist-container.clr > article > div.home-blog-entry-text.clr > p")
    imgs = soup.select("div.bloglist-container.clr > article > a > div > img")
    for title, text, img in zip(titles, texts, imgs):
        data = {
            "title": title.get_text(),
            "Summary": text.get_text(),
            "Image": img.get('src')
        }
        print(data)

Then add a function that takes the range of pages you want and calls get_page:

def getmorepage(start, end):
    for i in range(start, end):
        get_page(url + str(i))

Finally, decide how many pages of data you want (note that range excludes the end value, so this fetches pages 1 through 9):

getmorepage(1,10)
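If you crawl many pages, it is polite (and less likely to get you blocked) to pause between requests. Here is a hedged variant of getmorepage with a short delay and basic error handling; the one-second delay is an arbitrary choice, not from the original article:

import time

def getmorepage(start, end):
    for i in range(start, end):
        try:
            get_page(url + str(i))
        except requests.RequestException as e:
            print("page", i, "failed:", e)  # report and skip pages that fail to load
        time.sleep(1)  # wait one second between requests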

Final results:

You can fetch as many pages as you like, quickly and efficiently.

Of course, this is just a starting point; I hope it prompts better ideas.

How to install the library:

First we need three libraries: BeautifulSoup (bs4), requests, and lxml.

If you are using PyCharm, you can add these libraries from File -> Default Settings -> Project Interpreter, as shown below:

Click the + sign on the right and enter the name of the library you want to install.

Linux installation:

1. In PyCharm, follow the same steps described above.

2. Otherwise, do this:

sudo apt-get install python3-packagename

where packagename is the name of the library you want to install (for example, python3-bs4).

Windows installation:

1. First make sure you have pip installed:

Enter pip --version at the command line (Win+R, then type cmd).

If there is no error, pip is installed and you can continue with the following steps.

2. If you already have PyCharm, follow the PyCharm steps described above.

If not, you have two options: install from a downloaded package, or use the pip command. We'll cover the second here.

On the command line type:

pip3 install packageName

packageName is the name of the library you want to install.

If you run into permission issues (on Linux or macOS), enter:

sudo pip3 install packageName

A successful installation ends with the message:

Successfully installed packageName

If you are on Python 2, change pip3 to pip.
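Once installed, a quick import test in a Python shell confirms all three libraries are available (no error means the installation worked):

import bs4
import requests
import lxml  # imported only to confirm it is installed
print(bs4.__version__, requests.__version__)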

That covers how to crawl information from a website in big data. I hope the content above is helpful; if you still have questions, you can follow the industry information channel to learn more.
