This article mainly introduces the basic libraries of Python crawlers. In everyday work, many people have questions about them, so the editor has consulted various materials and put together a simple, easy-to-follow approach. I hope it helps answer the question "What are the basic libraries of Python crawlers?" Now, please follow along with the editor and study!
Crawlers rely on three basic libraries: Requests, BeautifulSoup, and lxml, which are the ones beginners use most often. Let's take a look at how each of them is used.
1. Requests library
The Requests library sends requests to a website and obtains its web page data.
Code: res = requests.get(url)
Return:
A status code of 200 indicates the request succeeded.
A status code of 404 or 400 indicates the request failed.
Code: res = requests.get(url, headers=headers)
Adding request headers that disguise the crawler as a browser makes it easier to request the data successfully.
Code: res.text
Returns the full text of the web page.
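Putting the pieces above together, here is a minimal sketch of a Requests call; the URL and the User-Agent string are placeholders you would replace with your own.

import requests

# Placeholder values: replace the URL and the User-Agent with your own
url = 'https://www.example.com'
headers = {'User-Agent': 'Mozilla/5.0'}

# Send the request with browser-like headers
res = requests.get(url, headers=headers)

# 200 means success; codes such as 400 or 404 mean failure
print(res.status_code)

# The page source as text (first 200 characters)
print(res.text[:200])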
2. BeautifulSoup library
The BeautifulSoup library is used to parse the web pages fetched with Requests into structured data.
Code: soup = BeautifulSoup(res.text, 'html.parser')
Detailed data extraction:
infos = soup.select('path')
Path extraction method: in the browser's developer tools, right-click the target element, then choose Copy > Copy selector.
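A minimal sketch of this workflow is shown below; the HTML snippet and the 'li.song' selector are made up for illustration and stand in for res.text and the selector copied from the browser.

from bs4 import BeautifulSoup

# A tiny made-up HTML snippet standing in for res.text
html = '<ul><li class="song">Song A</li><li class="song">Song B</li></ul>'

# Parse the page with the built-in html.parser
soup = BeautifulSoup(html, 'html.parser')

# 'li.song' plays the role of the copied selector
infos = soup.select('li.song')
for info in infos:
    print(info.get_text())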
3. Lxml library
lxml is an XML parsing library that can also repair HTML code and build it into a structured HTML tree.
Code:
from lxml import etree
html = etree.HTML(text)
infos = html.xpath('path')
Path extraction method: in the browser's developer tools, right-click the target element, then choose Copy > Copy XPath.
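A minimal sketch of the same idea with lxml; the HTML text and the XPath expression are made-up examples standing in for the real page and the path copied via Copy XPath.

from lxml import etree

# Made-up HTML text standing in for the downloaded page
text = '<div><p class="title">Hello</p><p class="title">World</p></div>'

# etree.HTML repairs the markup and builds an element tree
html = etree.HTML(text)

# This XPath stands in for the path copied from the browser
infos = html.xpath('//p[@class="title"]/text()')
print(infos)  # ['Hello', 'World']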
Practice case:
1. Crawl the TOP500 music information from the KuGou chart.
2. The page has no page-turning control, so we need to find the URL pattern ourselves. The URL of the first page is:
https://www.kugou.com/yy/rank/home/1-8888.html?from=rank
Changing the 1 to 2 yields a new page, and so on, which gives us an iterable pattern of page URLs.
3. Crawl the song name, singer, and other information.
4. The detailed code is as follows:
import requests
from bs4 import BeautifulSoup
import time

headers = {"User-Agent": "xxxx"}

def get_info(url):
    print(url)
    # Get the whole web page via the request headers and the link
    web_data = requests.get(url, headers=headers)
    # print(web_data.text)
    # Parse the returned result
    soup = BeautifulSoup(web_data.text, 'lxml')
    # Locate where each piece of data sits in the page
    ranks = soup.select('span.pc_temp_num')
    titles = soup.select('div.pc_temp_songlist > ul > li > a')
    times = soup.select('span.pc_temp_tips_r > span')
    # Extract the specific text content
    for rank, title, song_time in zip(ranks, titles, times):
        data = {
            'rank': rank.get_text().strip(),
            'singer': title.get_text().split('-')[0],
            'song': title.get_text().split('-')[1],
            'time': song_time.get_text().strip()
        }
        print(data)

if __name__ == '__main__':
    urls = ['https://www.kugou.com/yy/rank/home/{}-8888.html?from=rank'.format(i) for i in range(1, 2)]
    for url in urls:
        get_info(url)
        time.sleep(1)

At this point, the study of "What are the basic libraries of Python crawlers?" is over. I hope it has resolved your doubts. Combining theory with practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow the website; the editor will keep working hard to bring you more practical articles!