This article explains how to write Python code that crawls a web page's internal and external links. The method below is simple and practical; follow along to study it.
Project introduction
Breadth-first search is used to collect all the external links on a website.
Starting from one page, we gather all of its internal and external links, then visit each internal link in turn and gather its internal and external links, repeating until no unvisited internal links remain.
Code outline
1. Use a class to define a queue with first-in, first-out behavior: elements are enqueued at the rear and dequeued at the front.
2. Define four functions: one that crawls a page's external links, one that crawls its internal links, one that walks into the internal links, and a driver function that ties them together.
3. To crawl Baidu Images (https://image.baidu.com/), first define two queues and two lists to store the internal and external links respectively. At startup, crawl the current page's internal and external links, then check each one: if it is not already in the corresponding list, enqueue it and append it to the list.
4. Then call the deepLinks() function, which uses a loop: while the internal-link queue is not empty, dequeue an internal link, visit that page, and call the internal- and external-link crawlers again, checking each discovered link for duplicates; links not seen before are added to the corresponding queue and list, and the iteration continues.
5. Once every internal link on the site has been visited, all the external links found along the way have been stored in their queue and list and can be printed. (A compact sketch of this flow follows the list.)
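To orient yourself before the details, here is a minimal sketch of that flow. It assumes requests and BeautifulSoup are installed, uses collections.deque instead of the hand-written queue, and omits the URL encoding and politeness delays that the full code below handles; the function name bfs_links is illustrative, not part of the final program.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def bfs_links(start_url):
    # breadth-first walk over one site's internal links, collecting external links
    domain = urlparse(start_url).netloc
    to_visit = deque([start_url])    # internal links waiting to be visited
    internal = {start_url}           # internal links already seen
    external = set()                 # external links already seen
    while to_visit:
        page = to_visit.popleft()
        soup = BeautifulSoup(requests.get(page).content, 'html.parser')
        for a in soup.find_all('a', href=True):
            link = urljoin(page, a['href'])       # resolve relative hrefs
            if urlparse(link).netloc == domain:   # same domain -> internal
                if link not in internal:
                    internal.add(link)
                    to_visit.append(link)
            elif link not in external:            # different domain -> external
                external.add(link)
    return internal, external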
Code details

Queue
A queue is a special linear list: a one-way queue can only insert data at one end (the rear) and delete data at the other end (the front).
Like a stack, a queue is a linear list with restricted operations.
The insert operation is called enqueueing (at the rear of the queue), and the delete operation is called dequeueing (at the front).
The data items in a queue are called elements; a queue with no elements is an empty queue.
Because insertion and deletion each happen at only one end, the element that entered the queue first is the one deleted first, so a queue is also known as a first-in, first-out (FIFO) linear list.
Here we use a class to define a queue with FIFO behavior: enqueue at the rear, dequeue at the front. The queue needs the following operations: dequeue, enqueue, check whether it is empty, report the queue length, and return the element at the head of the queue.
class Queue(object):
    # initialize the queue
    def __init__(self):
        self.items = []

    # enqueue: append an element at the rear
    def enqueue(self, item):
        self.items.append(item)

    # dequeue: remove and return the element at the front
    def dequeue(self):
        if self.is_Empty():
            print("The current queue is empty!")
        else:
            return self.items.pop(0)

    # check whether the queue is empty
    def is_Empty(self):
        return self.items == []

    # queue length
    def size(self):
        return len(self.items)

    # return the head element without removing it (None if the queue is empty)
    def front(self):
        if self.is_Empty():
            print("The current queue is empty!")
        else:
            return self.items[0]
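As a quick sanity check of the class above, a minimal usage sketch (not part of the crawler itself):

q = Queue()
q.enqueue('a')
q.enqueue('b')
print(q.size())      # 2
print(q.dequeue())   # a -- the first element in is the first out
print(q.front())     # b -- now at the head of the queue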
Internal links and external links

The difference between the two:
Internal link: a link between content pages under the same domain name of the same website.
External link: a link to your site placed on another website, for example through friendly links or other off-site link building.
In general, an internal link points to the same domain name as the page it sits on, while an external link points to a different domain name.
Working with internal and external links is inseparable from the urllib library, so import it first:

from urllib.parse import urlparse

urlparse() parses a URL link and splits it into six parts:
scheme (protocol), netloc (domain name), path (path), params (optional parameters), query (key-value pairs), fragment (anchor)

url = 'https://image.baidu.com/'
a, b = urlparse(url).scheme, urlparse(url).netloc
print(a)
print(b)

#-------------output-------------#
# https
# image.baidu.com
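Since an internal link shares the start page's netloc while an external link does not, the comparison could look like the following minimal sketch (the helper name is_internal is illustrative, not part of the final program):

from urllib.parse import urlparse

def is_internal(link, site_url):
    # True if the link's domain matches the site's domain (or the link is relative)
    return urlparse(link).netloc in ('', urlparse(site_url).netloc)

print(is_internal('http://image.baidu.com/search', 'https://image.baidu.com/'))   # True
print(is_internal('http://news.baidu.com/', 'https://image.baidu.com/'))          # False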
Request headers

To find the request header, open the page you want to visit in a browser, press F12, click the Network tab (press Ctrl+R if prompted), select the site name under Name, and look at the first Headers panel in the box on the right; under Request Headers you will find the browser's request header. Copy the User-Agent value.
The request header here is:
headers_ = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.68'}
html = requests.get(url, headers=headers_)
Complete code:
# Import libraries
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import requests
import re
import urllib.parse
import time
import random

class Queue(object):
    # initialize the queue
    def __init__(self):
        self.items = []

    # enqueue
    def enqueue(self, item):
        self.items.append(item)

    # dequeue
    def dequeue(self):
        if self.is_Empty():
            print("The current queue is empty!")
        else:
            return self.items.pop(0)

    # check whether the queue is empty
    def is_Empty(self):
        return self.items == []

    # queue length
    def size(self):
        return len(self.items)

    # return the head element (None if the queue is empty)
    def front(self):
        if self.is_Empty():
            print("The current queue is empty!")
        else:
            return self.items[0]

queueInt = Queue()    # queue of internal links
queueExt = Queue()    # queue of external links
externalLinks = []
internalLinks = []

# request header shared by all fetches (see the "Request headers" section)
headers_ = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36 Edg/89.0.774.68'}

# get all external links on the page
def getExterLinks(bs, exterurl):
    # find all links that start with http or www and do not contain the current domain
    for link in bs.find_all('a', href=re.compile('^(http|www)((?!' + urlparse(exterurl).netloc + ').)*$')):
        # the URL standard only allows a subset of ASCII characters; other characters
        # (such as Chinese characters) do not conform, so the link must be encoded
        link.attrs['href'] = urllib.parse.quote(link.attrs['href'], safe='?=&:/')
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                queueExt.enqueue(link.attrs['href'])
                externalLinks.append(link.attrs['href'])
                print(link.attrs['href'])
    # return externalLinks

# get all internal links on the page
def getInterLinks(bs, interurl):
    interurl = '{}://{}'.format(urlparse(interurl).scheme, urlparse(interurl).netloc)
    # find all internal links that start with "/" or contain the current domain
    for link in bs.find_all('a', href=re.compile('^(/|.*' + urlparse(interurl).netloc + ')')):
        link.attrs['href'] = urllib.parse.quote(link.attrs['href'], safe='?=&:/')
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                # startswith() checks whether the string begins with the given prefix
                if link.attrs['href'].startswith('//'):
                    if interurl + link.attrs['href'] not in internalLinks:
                        queueInt.enqueue(interurl + link.attrs['href'])
                        internalLinks.append(interurl + link.attrs['href'])
                elif link.attrs['href'].startswith('/'):
                    if interurl + link.attrs['href'] not in internalLinks:
                        queueInt.enqueue(interurl + link.attrs['href'])
                        internalLinks.append(interurl + link.attrs['href'])
                else:
                    queueInt.enqueue(link.attrs['href'])
                    internalLinks.append(link.attrs['href'])
    # return internalLinks

def deepLinks():
    num = queueInt.size()
    while num > 1:
        i = queueInt.dequeue()
        if i is None:    # the queue has been exhausted
            break
        else:
            print('Visited internal link:')
            print(i)
            print('Newly found external links:')
            # html = urlopen(i)
            html = requests.get(i, headers=headers_)
            time.sleep(random.random() * 3)    # random pause to mimic human behaviour
            domain1 = '{}://{}'.format(urlparse(i).scheme, urlparse(i).netloc)
            bs = BeautifulSoup(html.content, 'html.parser')
            getExterLinks(bs, domain1)
            getInterLinks(bs, domain1)

def getAllLinks(url):
    # html = urlopen(url)
    html = requests.get(url, headers=headers_)
    time.sleep(random.random() * 3)    # random pause to mimic human behaviour
    domain = '{}://{}'.format(urlparse(url).scheme, urlparse(url).netloc)
    bs = BeautifulSoup(html.content, 'html.parser')
    getInterLinks(bs, domain)
    getExterLinks(bs, domain)
    deepLinks()

getAllLinks('https://image.baidu.com/')
Crawl results

Only part of the output is shown here.

All internal links collected in the list:
internalLinks
-------------output-------------
['http://image.baidu.com',
 'https://image.baidu.com/img/image/imageplus/index.html?fr=image',
 'http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1567133149621_R&pv=&ic=0&nc=1&z=0&hd=0&latest=0&copyright=0&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&sid=&word=%25E5%25A3%2581%25E7%25BA%25B8',
 'http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1461834053046_R&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&itg=0&ie=utf-8&word=%25E5%25A4%25B4%25E5%2583%258F%23z=0&pn=&ic=0&st=-1&face=0&s=0&lm=-1',
 'https://image.baidu.com/search/albumslist?tn=albumslist&word=%25E8%25AE%25BE%25E8%25AE%25A1%25E7%25B4%25A0%25E6%259D%2590&album_tab=%25E8%25AE%25BE%25E8%25AE%25A1%25E7%25B4%25A0%25E6%259D%2590&rn=15&fr=searchindex',
 'https://image.baidu.com/search/albumsdetail?tn=albumsdetail&word=%25E5%259F%258E%25E5%25B8%2582%25E5%25BB%25BA%25E7%25AD%2591%25E6%2591%2584%25E5%25BD%25B1%25E4%25B8%2593%25E9%25A2%2598&fr=searchindex_album%2520&album_tab=%25E5%25BB%25BA%25E7%25AD%2591&album_id=7&rn=30',
 'https://image.baidu.com/search/albumsdetail?tn=albumsdetail&word=%25E6%25B8%2590%25E5%258F%2598%25E9%25A3%258E%25E6%25A0%25BC%25E6%258F%2592%25E7%2594%25BB&fr=albumslist&album_tab=%25E8%25AE%25BE%25E8%25AE%25A1%25E7%25B4%25A0%25E6%259D%2590&album_id=409&rn=30',
 'https://image.baidu.com/search/albumsdetail?tn=albumsdetail&word=%25E7%259A%25AE%25E5%25BD%25B1&fr=albumslist&album_tab=%25E8%25AE%25BE%25E8%25AE%25A1%25E7%25B4%25A0%25E6%259D%2590&album_id=394&rn=30',
 'https://image.baidu.com/search/albumsdetail?tn=albumsdetail&word=%25E5%25AE%25A0%25E7%2589%25A9%25E5%259B%25BE%25E7%2589%2587&fr=albumslist&album_tab=%25E5%258A%25A8%25E7%2589%25A9&album_id=688&rn=30',
 'https://image.baidu.com/search/albumsdetail?tn=albumsdetail&word=%25E8%2588%25AA%25E6%258B%258D%25E5%259C%25B0%25E7%2590%2583%25E7%25B3%25BB%25E5%2588%2597&fr=albumslist&album_tab=%25E8%25AE%25BE%25E8%25AE%25A1%25E7%25B4%25A0%25E6%259D%2590&album_id=312&rn=30',
 'https://image.baidu.com/search/albumslist?tn=albumslist&word=%25E4%25BA%25BA%25E7%2589%25A9&album_tab=%25E4%25BA%25BA%25E7%2589%25A9&rn=15&fr=searchindex_album',
 'http://image.baidu.com/static/html/advanced.html',
 'https://image.baidu.com/',
 'http://image.baidu.com/']
All external links collected in the list:
externalLinks
-------------output-------------
['http://news.baidu.com/',
 'https://www.hao123.com/',
 'http://map.baidu.com/',
 'https://haokan.baidu.com/?sfrom=baidu-top/',
 'http://tieba.baidu.com/',
 'https://xueshu.baidu.com/',
 'http://www.baidu.com/more/',
 'https://pan.baidu.com',
 'https://zhidao.baidu.com',
 'https://baike.baidu.com',
 'https://baobao.baidu.com',
 'https://wenku.baidu.com',
 'https://jingyan.baidu.com',
 'http://music.taihe.com',
 'https://www.baidu.com',
 'https://www.baidu.com/',
 'http://www.baidu.com/duty/',
 'http://www.baidu.com/search/image_help.html',
 'http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001',
 'http://help.baidu.com/question',
 'http://www.baidu.com/search/jubao.html',
 'http://www.baidu.com/search/faq_image.html%2305']

This concludes the study of how to write Python code that crawls a web page's internal and external links. Pairing theory with practice helps the material stick, so go and try it out!