2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article mainly introduces how to use Python to crawl web page data. Many people have questions about this in daily work, so the editor has consulted various materials and put together a simple, easy-to-use method. I hope it helps answer your doubts about crawling web page data with Python. Follow along with the editor to study!
Preparation
IDE: PyCharm
Libraries: requests, lxml
Note:
requests: fetches the source code of a web page
lxml: extracts the specified data from the page source
Build the environment
"Building the environment" here does not mean setting up a Python development environment from scratch; it means creating a new Python project in PyCharm and then installing requests and lxml.
Create a new project:
Import the dependencies
Since we are using PyCharm, importing these two libraries is very easy:
import requests
At this point PyCharm underlines requests in red. Place the cursor on requests and press the shortcut Alt+Enter; PyCharm suggests a fix. Select "Install package requests" and PyCharm installs it automatically; we only need to wait a moment and the library is installed. lxml is installed the same way.
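Once both packages are installed, a quick sanity check (a minimal sketch, not from the original article) is to import them and print their versions:

```python
# Verify that both libraries installed correctly by importing them
# and printing their versions.
import requests
from lxml import etree

print(requests.__version__)   # e.g. a version string such as '2.31.0'
print(etree.LXML_VERSION)     # e.g. a tuple such as (4, 9, 3, 0)
```

If either import fails, the package did not install into the interpreter the project is using.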
Get the source code of the web page
As mentioned earlier, requests makes it easy for us to get the source code of a web page.
Take my blog address for example: https://coder-lida.github.io/
Get the source code:
# get the source code
html = requests.get("https://coder-lida.github.io/")
# print the source code
print(html.text)
The code is that simple: html.text is the source code of the URL.
Complete code:
import requests
import lxml

html = requests.get("https://coder-lida.github.io/")
print(html.text)
Output:
Get specified data
Now that we have the source code of the web page, we need lxml to filter out the information we want.
Here I will take getting the list of my blog posts as an example. Open the page, press F12, and copy the element's XPath, as shown in the figure.
We get the content of the web page through XPath syntax.
Get the title of the first article
//*[@id="layout-cart"]/div[1]/a/@title
//: locates from the root node
/: moves down one level
Extract text content: text()
Extract attribute content: @xxxx
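The rules above can be tried out without any network access on a small inline HTML snippet (a self-contained sketch; the snippet, its links, and titles are made up for illustration, reusing the blog's layout-cart id):

```python
from lxml import etree

# A tiny hand-written page to exercise the XPath rules above
snippet = """
<div id="layout-cart">
  <div><a href="/post1" title="First post">First post</a></div>
  <div><a href="/post2" title="Second post">Second post</a></div>
</div>
"""
tree = etree.HTML(snippet)

# text() extracts the text content of every matched <a>
texts = tree.xpath('//*[@id="layout-cart"]/div/a/text()')
print(texts)   # ['First post', 'Second post']

# div[1] selects only the first <div>, and @title extracts its attribute
titles = tree.xpath('//*[@id="layout-cart"]/div[1]/a/@title')
print(titles)  # ['First post']
```

Note that XPath indexing is 1-based: div[1] is the first div, not the second.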
import requests
from lxml import etree

html = requests.get("https://coder-lida.github.io/")
# print(html.text)
etree_html = etree.HTML(html.text)
content = etree_html.xpath('//*[@id="layout-cart"]/div[1]/a/@title')
print(content)
View all article titles
//*[@id="layout-cart"]/div/a/@title
Code:
import requests
from lxml import etree

html = requests.get("https://coder-lida.github.io/")
# print(html.text)
etree_html = etree.HTML(html.text)
content = etree_html.xpath('//*[@id="layout-cart"]/div/a/@title')
print(content)
Output:
['springboot reverse engineering', 'implement a simple version of HashMap', '25 single lines of JavaScript commonly used in development', 'shiro encrypted login password with salt treatment', 'Spring Boot build RESTful API and unit test', 'remember the use of jsoup']
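Because the result above depends on fetching my blog live, here is an offline sketch of the same flow in which the HTTP response is replaced by a hard-coded string shaped like the blog's layout-cart list (the structure and titles here are stand-ins, not the real page):

```python
from lxml import etree

# Stand-in for the html.text that requests.get(...) would return
page_source = """
<div id="layout-cart">
  <div><a title="springboot reverse engineering">post</a></div>
  <div><a title="implement a simple version of HashMap">post</a></div>
</div>
"""

# Same two parsing steps as the real code: build the tree, then query it
etree_html = etree.HTML(page_source)
content = etree_html.xpath('//*[@id="layout-cart"]/div/a/@title')
print(content)  # ['springboot reverse engineering', 'implement a simple version of HashMap']
```

Swapping page_source for html.text from a real requests.get call gives exactly the live version shown above.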
At this point, the study of "how to use Python to crawl web page data" is over. I hope it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it! If you want to continue learning more related knowledge, please keep following the website; the editor will keep working hard to bring you more practical articles!