How to use Python crawler to grab data

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article explains how to use a Python crawler to scrape data. The editor finds it very practical and shares it here for reference; follow along to see how it works.

Tool installation

First, you need to install Python's requests and BeautifulSoup libraries. We use the requests library to fetch the content of web pages and the BeautifulSoup library to extract data from them.

Install Python

Run pip install requests

Run pip install beautifulsoup4

Crawl a web page

After installing the necessary tools, we can officially start writing our crawler. Our first task is to grab book information from Douban. Let's take https://book.douban.com/subject/26986954/ as an example and first look at how to crawl the content of a single web page.
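A minimal sketch of the fetch step might look like the following. The `fetch_page` helper and the browser-like `User-Agent` header are our own additions for illustration (sites like Douban often reject requests with no User-Agent), not something the original text specifies:

```python
import requests

def fetch_page(url):
    """Download a page and return its HTML text, or None on failure."""
    # A browser-like User-Agent is an assumption; some sites block the default one.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        return response.text
    return None

if __name__ == "__main__":
    html = fetch_page("https://book.douban.com/subject/26986954/")
    if html:
        print(html[:200])
```

In practice you would also handle network errors (for example with `try`/`except requests.RequestException`) and be polite about request frequency.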

Extract content

After fetching the web page, we need to extract the content we want. In our first example, we only need the book title. First we import the BeautifulSoup library; with BeautifulSoup we can very easily extract specific content from the page.
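As a sketch of that extraction step: the sample HTML below only imitates the rough shape of a Douban book page (the real markup differs, so the `h1` structure is an assumption made for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in HTML; the real Douban page structure is more complex.
sample_html = """
<html><body>
  <h1><span property="v:itemreviewed">Norwegian Wood</span></h1>
</body></html>
"""

# Parse the document and pull out the title text.
soup = BeautifulSoup(sample_html, "html.parser")
title = soup.h1.get_text(strip=True)
print(title)
```

The same pattern (`soup.find(...)` or attribute access, then `get_text()`) works for any element once you have inspected the page's actual markup.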

Continuously crawl web pages

So far, we can crawl the content of a single web page; now let's look at how to crawl an entire site. We know that web pages are connected to each other through hyperlinks, and by following those links we can reach the whole site. So we can extract the links on each page that point to other pages, and then repeatedly crawl the newly discovered links.
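The link-extraction part of that loop can be sketched as follows; the sample HTML and URLs are invented for illustration:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(base_url, html):
    """Collect every hyperlink on a page, resolving relative URLs."""
    soup = BeautifulSoup(html, "html.parser")
    return {urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)}

sample = '<a href="/subject/1/">one</a><a href="https://book.douban.com/tag/">tags</a>'
links = extract_links("https://book.douban.com/", sample)
for link in sorted(links):
    print(link)
```

In a full crawler you would keep a set of visited URLs and a queue of pending ones: pop a URL, fetch it, extract its links, and push any unseen links back onto the queue so no page is crawled twice.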

Preparation

IDE: PyCharm

Libraries: requests, lxml

A brief introduction to what these two libraries do for us:

requests: gets the source code of a web page

lxml: extracts the specified data from that source code

Simple and to the point, right? ^_^

Build an environment

"Building the environment" here does not mean setting up a Python development environment; it means creating a new Python project in PyCharm and installing requests and lxml into it. Nothing else is needed. Create a new src folder and create Test.py directly inside it.

Dependent library import

In Test.py, enter:

import requests

At this point, requests will be underlined in red. Put the cursor on requests and press the shortcut Alt+Enter; PyCharm will offer a fix. Select "Install package requests" and PyCharm will install it automatically; we only need to wait a moment. lxml is installed the same way.

After installing these two libraries, the editor will no longer show the red underline.

Get the source code of the web page

Requests makes it easy for us to get the source code of the web page.

Get the source code:

# get the source code
html = requests.get("https://blog.csdn.net/it_xf?viewmode=contents")

# print the source code
print(html.text)

The code is that simple. This html.text is the source code of this URL.

Get specified data

Now that we have the source code of the web page, we need to use lxml to filter out the information we want.

First, we need to analyze the source code. I'm using the Chrome browser here, so I right-click the page and choose Inspect.

Then locate the first article in the source code.

First click the arrow in the upper right corner of the source page, and then select the title of the article in the content of the webpage. At this time, the source code will be located here.

At this time, select the title element of the source code and right-click to copy.

Choose Copy XPath. An XPath is the equivalent of an address: it describes where an element on the web page sits in the source code.

Expression: //*[@id="mainBox"]/main/div[2]/div[1]/h5/a

First of all, // anchors the search: the element that follows it, the one with id "mainBox", is unique on the page, and everything we need is inside it.

Then each / means going down one level; from the picture, the path is clearly div -> main -> div[2] -> div[1] -> h5 -> a

Having traced down to the a element, we add /text() to the end to indicate that we want the element's text content, so our final expression looks like this:

//*[@id="mainBox"]/main/div[2]/div[1]/h5/a/text()

Note that this expression only matches this element on this particular web page.

So how do you use this thing?

All codes:

import requests
from lxml import etree

html = requests.get("https://blog.csdn.net/it_xf?viewmode=contents")
# print(html.text)
etree_html = etree.HTML(html.text)
content = etree_html.xpath('//*[@id="mainBox"]/main/div[2]/div[1]/h5/a/text()')
for each in content:
    print(each)

At this point, the data in each is the data we want.

Print the results:

How to play an ArrayList

The print result still contains line breaks and spaces, so let's remove them.

import requests
from lxml import etree

html = requests.get("https://blog.csdn.net/it_xf?viewmode=contents")
# print(html.text)
etree_html = etree.HTML(html.text)
content = etree_html.xpath('//*[@id="mainBox"]/main/div[2]/div[1]/h5/a/text()')
for each in content:
    # strip newlines and spaces
    replace = each.replace('\n', '').replace(' ', '')
    if replace == '\n' or replace == '':
        continue
    else:
        print(replace)

Print the results:

How to play an ArrayList

Thank you for reading! That concludes this article on "how to use a Python crawler to scrape data". I hope the content above was helpful and taught you something new. If you found the article good, share it so more people can see it!
