
How to Use Python's requests and BeautifulSoup Libraries to Crawl a Job Search Website


This article mainly explains how to use Python's requests and BeautifulSoup libraries to crawl a job search website. The content is simple, clear, and easy to learn and understand. Please follow the editor's train of thought to study it.

Contents

I. The requests library

1. Introduction to requests

2. Install the requests library

3. Use requests to obtain web page data

4. A summary of common requests attributes

II. The BeautifulSoup library

1. Introduction to BeautifulSoup

2. Install the BeautifulSoup library

3. Use BeautifulSoup to parse and extract the acquired data

4. BeautifulSoup's methods for extracting data

I. The requests library

1. Introduction to requests

The requests library is a third-party library for initiating requests. Requests allows you to send HTTP/1.1 requests. You don't need to manually add query strings to URLs or form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, powered by urllib3, which is embedded within requests. To put it simply, with this library we can easily send a request to a website and obtain the page data, as well as the response content and status code returned by the server.

Requests Chinese documentation: https://requests.kennethreitz.org/zh_CN/latest/

2. Install the requests library

Some Python distributions (Anaconda, for example) come with this library bundled. If yours does not, you can type the following line at the command line to install it.

pip install requests

3. Use requests to obtain web page data

First, we import the module:

import requests

Then we initiate a request to the website whose data we want, taking QQ Music's official website as an example:

res = requests.get('https://y.qq.com/')  # initiate the request
print(res)  # output the response

The 200 in the output is the response status code. Here is a list of the status code classes that may be returned:

Status code: meaning
1xx: informational, continue sending the request
2xx: request succeeded
3xx: redirect
4xx: client error
5xx: server error
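As a quick illustration (a minimal sketch, not from the original article), you can branch on the status code class before using a response:

import requests

res = requests.get('https://y.qq.com/')
if res.status_code == 200:               # 2xx: request succeeded
    print('fetched', len(res.text), 'characters')
elif 300 <= res.status_code < 400:       # 3xx: redirect (requests follows redirects by default)
    print('redirected')
elif 400 <= res.status_code < 500:       # 4xx: client error, e.g. 404
    print('client error:', res.status_code)
else:                                    # 5xx: server error
    print('server error:', res.status_code)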

Get the source code of QQ Music's home page

res = requests.get('https://y.qq.com/')  # initiate the request
print(res.text)  # res.text is the source code of the web page

4. A summary of common requests attributes

Attribute: meaning
res.status_code: HTTP status code of the response
res.text: response content as text
res.content: response content in binary form
res.encoding: encoding of the response content
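A short, self-contained sketch of these four attributes in action (illustrative only):

import requests

res = requests.get('https://y.qq.com/')
print(res.status_code)    # HTTP status code, e.g. 200
print(res.encoding)       # encoding guessed from the response headers
print(len(res.content))   # size of the raw body in bytes (binary form)
print(res.text[:100])     # first 100 characters of the decoded text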

Now that we have learned how to get the source code of a web page, let's learn how to extract the content we need from it using the BeautifulSoup library.

II. The BeautifulSoup library

1. Introduction to BeautifulSoup

BeautifulSoup is a third-party Python library that is very practical for processing data. With this library, we can purposefully extract data according to the corresponding HTML tags in a page's source code. The BeautifulSoup library is generally used in conjunction with the requests library. If you are unfamiliar with HTML tags, it is worth looking up some commonly used ones first.
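If HTML tags are new to you, here is a minimal self-contained example; the HTML string is made up purely for illustration:

from bs4 import BeautifulSoup

html = '<html><body><h1>Hello</h1><p class="intro">first paragraph</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)                     # Hello
print(soup.find('p', class_='intro'))   # <p class="intro">first paragraph</p>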

2. Install BeautifulSoup library

Similarly, if you do not have this library, you can install it by entering the following code at the command line:

pip install beautifulsoup4

3. Use BeautifulSoup to parse and extract the acquired data

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
res = requests.get('https://y.qq.com/', headers=header)  # headers is an anti-crawler measure
soup = BeautifulSoup(res.text, 'html.parser')  # the first argument is the HTML text; the second, html.parser, is Python's built-in parser
print(soup)  # output the source code of QQ Music's home page

Partial output result

From the output, we can see that we have successfully parsed the page source into a BeautifulSoup object. At this point, someone may ask: res.text already outputs the page code, so why bother converting it into a BeautifulSoup object?

Let's first look at their types through the type() function.

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
res = requests.get('https://y.qq.com/', headers=header)  # headers is an anti-crawler measure
soup = BeautifulSoup(res.text, 'html.parser')
print(type(res.text))
print(type(soup))

Output result

We can see that res.text is of string type, while soup is of BeautifulSoup object type. The BeautifulSoup object has many more methods available than a plain string, which lets us quickly extract the data we need. That is why we take the extra step.
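For instance (a small sketch under the same setup), the BeautifulSoup object exposes tag navigation that a plain string cannot:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://y.qq.com/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.title)      # the page's <title> tag, returned as a Tag object
print(soup.find('a'))  # the first <a> tag on the page
# by contrast, res.text.find('a') is plain str.find and only returns a character index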

4. BeautifulSoup's methods for extracting data

Let's first take a look at the two most commonly used methods

Method: purpose
find(): returns the first piece of data that meets the requirements
find_all(): returns all the data that meets the requirements
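The difference in one self-contained sketch (the HTML string is invented for illustration):

from bs4 import BeautifulSoup

html = '<ul><li>a</li><li>b</li><li>c</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('li'))      # <li>a</li>  (only the first match)
print(soup.find_all('li'))  # [<li>a</li>, <li>b</li>, <li>c</li>]  (every match)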

The arguments passed to these two functions are our filter conditions for the data. So what arguments can we pass to them?

Let's take the following source code snippet, taken from QQ Music's home page, as an example to try the two functions.

(HTML snippet: a playlist-recommendation heading, followed by tab items such as online songs, variety classics, official singles, and love songs.)

The find() function

If we want to get the text of the playlist-recommendation line, we first need to locate its HTML tag. We find that it sits in an <i> tag with class="icon_txt", and we can then extract it with the following method:

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
res = requests.get('https://y.qq.com/', headers=header)  # headers is an anti-crawler measure
soup = BeautifulSoup(res.text, 'html.parser')  # the first argument is the HTML text; the second is Python's built-in parser
print(soup.find('i', class_='icon_txt'))  # find the first <i> tag with class="icon_txt"

Because class is the keyword that defines a class in Python, BeautifulSoup uses class_ to stand for HTML's class attribute.

Output result
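These two calls are equivalent ways to express the same filter (a quick illustration on a made-up snippet):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<i class="icon_txt">playlist recommendation</i>', 'html.parser')
print(soup.find('i', class_='icon_txt'))      # keyword form; class_ avoids the Python keyword
print(soup.find('i', {'class': 'icon_txt'}))  # attrs-dictionary form, same result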

The find_all() function

If we want to extract all the themes in the playlist-recommendation section, we need to use the find_all() function.

Similarly, we find that these themes all sit in <a> tags with class="index_tab__item js_tag". To avoid also matching other tags in the source code that have class="index_tab__item js_tag", we need to add another condition, data-type="playlist". How do we do this?

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
res = requests.get('https://y.qq.com/', headers=header)  # headers is an anti-crawler measure
soup = BeautifulSoup(res.text, 'html.parser')
items = soup.find_all('a', {'class': 'index_tab__item js_tag', 'data-type': 'playlist'})
print(items)

The way to do this is to pass a dictionary as the second argument and add each attribute you want to filter on as a key-value pair.

Output result
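An alternative not used in this article: the same two-condition filter can be written as a CSS selector with select(), which some readers may find more readable:

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0'}  # shortened User-Agent, for brevity only
res = requests.get('https://y.qq.com/', headers=header)
soup = BeautifulSoup(res.text, 'html.parser')
# the class and data-type conditions expressed in one CSS selector
items = soup.select('a.index_tab__item.js_tag[data-type="playlist"]')
print(len(items))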

Through the above two small cases, we find that the find() function returns a Tag object and the find_all() function returns a list of Tag objects. What we usually need is not this whole structure, but only the Tag object's text property, or an attribute such as href (a link). The implementation code is as follows:

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'}
res = requests.get('https://y.qq.com/', headers=header)  # headers is an anti-crawler measure
soup = BeautifulSoup(res.text, 'html.parser')
tag1 = soup.find('i', class_='icon_txt')
print(tag1.text)
items = soup.find_all('a', {'class': 'index_tab__item js_tag', 'data-type': 'playlist'})
for i in items:  # traverse the list
    tag2 = i.text
    print(tag2)

Output result

In this way, we have successfully obtained the text content of the themes. If we want to extract an attribute value from a tag instead, we can get it using the object name ['attribute'], as in the sketch below.
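A minimal sketch of attribute extraction, assuming the matched <a> tags carry an href attribute (the .get() accessor returns None instead of raising an error when an attribute is missing):

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0'}  # shortened User-Agent, for brevity only
res = requests.get('https://y.qq.com/', headers=header)
soup = BeautifulSoup(res.text, 'html.parser')
items = soup.find_all('a', {'class': 'index_tab__item js_tag', 'data-type': 'playlist'})
for i in items:
    print(i.get('href'))  # the link stored in each tag's href attribute, or None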

Thank you for your reading. The above is the content of "How to Use Python's requests and BeautifulSoup Libraries to Crawl a Job Search Website". After studying this article, I believe you have a deeper understanding of the topic. The editor will push more articles on related knowledge points for you; welcome to follow!
