In this article, the editor gives a detailed walkthrough of how a Python crawler works, with detailed content, clear steps and careful handling of details. I hope this article can help you resolve your doubts about the process of a Python crawler.
A web crawler starts from the URLs of one or more initial web pages and obtains the URLs found on those pages. While crawling, it continually extracts new URLs from the current page and puts them into a queue until certain stopping conditions are met. A web crawler can be understood simply as a while loop with termination conditions: as long as the condition has not been triggered, the crawler keeps sending a request for each acquired url to get the page data, then parses the urls on the current page and continues to iterate. In the crawl project, the Crawler class completes this process. Rather than performing a strict breadth-first or depth-first traversal, it uses Python's asyncio to suspend the current task when a request fails and reschedule it later. This can be understood as an A*-like search based on network connectivity, which works as follows.
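For intuition, here is a minimal synchronous sketch of that "while loop with termination conditions" idea. It is not part of the crawl project; seed_url and max_pages are illustrative parameters, but the asyncio implementation analysed below follows the same basic shape.

import re
import urllib.parse
import urllib.request

def simple_crawl(seed_url, max_pages=10):
    todo = {seed_url}    # urls waiting to be fetched
    done = set()         # urls already fetched
    while todo and len(done) < max_pages:    # the termination conditions
        url = todo.pop()
        done.add(url)
        try:
            body = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
        except Exception:
            continue     # a failed request is simply skipped in this sketch
        # extract links from the current page and queue the unseen ones
        for link in re.findall(r'(?i)href=["\']?([^\s"\']+)', body):
            link, _frag = urllib.parse.urldefrag(urllib.parse.urljoin(url, link))
            if link.startswith(('http://', 'https://')) and link not in done:
                todo.add(link)
    return done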
An initialized crawler object holds three collections of urls: a todo collection, which stores the urls that have not yet been crawled; a busy collection, which holds the urls currently being fetched and waiting on network data; and a done collection, which stores the urls whose pages have already been crawled. The core of the crawler is its endless loop: the crawler first takes a url from the todo collection, then initializes a Fetcher object for that url, and finally schedules and executes the url request as a task. The code for this process is shown below.
@asyncio.coroutine
def crawl(self):
    """Run the crawler until all finished."""
    with (yield from self.termination):
        while self.todo or self.busy:
            if self.todo:
                url, max_redirect = self.todo.popitem()
                fetcher = Fetcher(url,
                                  crawler=self,
                                  max_redirect=max_redirect,
                                  max_tries=self.max_tries,
                                  )
                self.busy[url] = fetcher
                fetcher.task = asyncio.Task(self.fetch(fetcher))
            else:
                yield from self.termination.wait()
    self.t1 = time.time()
Obviously, a crawler is not composed of just an endless loop. Other modules in the outer layer of crawl are needed to support its operation, including network connections, url acquisition and task scheduling. The scheduling framework of the whole crawl project is organized as follows.
First, a ConnectionPool is created when the crawler is initialized:
self.pool = ConnectionPool(max_pool, max_tasks)
The pool keeps two attributes, connections and queue, which respectively store the set of connections (grouped by host) and the queue of reusable connections for later scheduling. Each connection records the host, port and whether ssl is used, and the underlying connection is obtained through asyncio.open_connection().
self.connections = {}  # {(host, port, ssl): [Connection, ...], ...}
self.queue = []        # [Connection, ...]
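As a rough illustration of the idea (not the project's actual ConnectionPool), a minimal pool keyed by (host, port, ssl) might look like the sketch below, written here with the modern async/await syntax; get_connection and recycle are hypothetical names.

import asyncio

class SimpleConnectionPool:
    """A simplified pool: idle connections are kept per (host, port, ssl) key."""

    def __init__(self):
        self.connections = {}   # {(host, port, ssl): [(reader, writer), ...], ...}

    async def get_connection(self, host, port, ssl=False):
        key = (host, port, ssl)
        idle = self.connections.setdefault(key, [])
        if idle:
            return idle.pop()   # reuse an existing idle connection
        # otherwise open a new connection, as the article describes
        reader, writer = await asyncio.open_connection(host, port, ssl=ssl)
        return reader, writer

    def recycle(self, host, port, ssl, reader, writer):
        # return a finished connection so later requests to the same host reuse it
        self.connections.setdefault((host, port, ssl), []).append((reader, writer))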
When the task is executed, the crawl method is first loaded into the event loop through loop.run_until_complete(crawler.crawl()). Connection objects are saved in and obtained from the ConnectionPool built by the statement above, and the page data is then retrieved through the fetch method of the fetcher object. Each url request task is handled by a Fetcher and scheduled with asyncio.Task: the fetch method yields a suspended generator, which is handed to asyncio.Task for execution.
Through the yield from statement and the asyncio.coroutine decorator, the method becomes a generator at execution time; whenever it is suspended during the execution of fetcher.fetch(), it is resumed later by the scheduler (the event loop).
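This pattern can be seen in isolation in the following minimal sketch, written in the same Python 3.4-era style as the project (@asyncio.coroutine and yield from; on Python 3.5+ one would write async def and await). The urls and the sleep-based "network wait" are purely illustrative.

import asyncio

@asyncio.coroutine
def fake_fetch(url):
    # yield from suspends this generator; while the simulated network wait
    # is pending, the event loop is free to run other tasks.
    yield from asyncio.sleep(0.1)
    return 'body of %s' % url

@asyncio.coroutine
def crawl():
    # each request becomes an asyncio.Task, so many fetches are in flight at once
    tasks = [asyncio.Task(fake_fetch(u)) for u in ('http://a', 'http://b')]
    return (yield from asyncio.gather(*tasks))

loop = asyncio.get_event_loop()
print(loop.run_until_complete(crawl()))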
The fetcher.fetch() method is the core of the web crawler. It is responsible for obtaining page data from the network and loading new urls into the todo collection. The method keeps trying to obtain the page data and gives up when the number of attempts reaches the upper limit. Successfully fetched html data, external links and redirect links are stored; when the number of redirects for a url reaches the limit, following that url is stopped and an error is logged. After that, different page states are handled in different ways.
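The shape of that logic, condensed into a standalone sketch (fetch_once is an assumed helper standing in for one network round trip; max_tries and max_redirect mirror the attributes named above, and this is not the verbatim method):

import logging

logger = logging.getLogger('crawler')

def fetch_with_limits(url, fetch_once, max_tries=4, max_redirect=10):
    for attempt in range(1, max_tries + 1):
        try:
            status, headers, body = fetch_once(url)   # one network round trip
            break
        except IOError as exc:
            logger.warning('try %d for %r failed: %r', attempt, url, exc)
    else:
        # attempts exhausted: give up on this url and log the error
        logger.error('%r failed after %d tries', url, max_tries)
        return None
    if status in (301, 302, 303, 307) and 'location' in headers:
        if max_redirect > 0:
            # follow the redirect with one less hop allowed
            return fetch_with_limits(headers['location'], fetch_once,
                                     max_tries, max_redirect - 1)
        logger.error('redirect limit reached for %r', url)
        return None
    return body   # successful html is kept for link extraction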
The following code is the region of the crawling.py file that starts at line 333 and runs to the end of the corresponding method. It selects a different processing path depending on the page's status, and extracts the url information on the page with a regular expression that matches strings beginning with href. The core url-extraction code is as follows:
# Replace href with (?:href|src) to follow image links.
self.urls = set(re.findall(r'(?i)href=["\']?([^\s"\']+)', body))
if self.urls:
    logger.warn('got %r distinct urls from %r',
                len(self.urls), self.url)
self.new_urls = set()
for url in self.urls:
    url = unescape(url)
    url = urllib.parse.urljoin(self.url, url)
    url, frag = urllib.parse.urldefrag(url)
    if self.crawler.add_url(url):
        self.new_urls.add(url)
From the code it is clear that the regular-expression matches are stored in the urls collection and processed one by one in the for loop, where each new url is added to the todo collection of the crawler object that owns the current fetcher.
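Conceptually, the crawler's add_url() boils down to a scheme check plus deduplication against the three collections. The following is a simplified sketch of that idea, not the project's exact method (which also filters by host and exclusion patterns).

import urllib.parse

def add_url(crawler, url, max_redirect=10):
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme not in ('http', 'https'):
        return False                      # skip mailto:, javascript:, etc.
    if url in crawler.todo or url in crawler.busy or url in crawler.done:
        return False                      # already queued, in flight, or finished
    crawler.todo[url] = max_redirect      # queue it together with its redirect budget
    return True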
Based on the previous analysis, the overall architecture of the crawler can be obtained by further analyzing the main file crawl.py.
In the main file, argparse.ArgumentParser is used first to parse the command line and control console input; in a Windows environment, IOCP can be selected as the event-loop implementation. The main method first obtains the dictionary of command-line data returned by parse_args and prints a prompt if no root url is given. It then configures the log level, which controls the output level of the logs: messages below the configured level are not output.
When the program is entered through the main entry function, the Crawler is first initialized from the command-line parameters, the event loop object is obtained with asyncio, and run_until_complete is executed until the program finishes.
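Put together, the entry point amounts to the following condensed sketch, kept in the project's Python 3.4-era style. Crawler stands for the class analysed above, and the exact option names and defaults are illustrative rather than copied from crawl.py.

import argparse
import asyncio
import logging
import sys

def main():
    parser = argparse.ArgumentParser(description='asyncio web crawler')
    parser.add_argument('roots', nargs='*', help='root url(s) to start from')
    parser.add_argument('--iocp', action='store_true',
                        help='use the IOCP event loop on Windows')
    parser.add_argument('-v', '--verbose', action='count', default=1)
    args = parser.parse_args()
    if not args.roots:
        parser.error('no root url given')   # prompt when the root is missing

    # messages below the configured level are simply not output
    levels = [logging.ERROR, logging.WARN, logging.INFO, logging.DEBUG]
    logging.basicConfig(level=levels[min(args.verbose, len(levels) - 1)])

    if args.iocp and sys.platform == 'win32':
        loop = asyncio.ProactorEventLoop()   # IOCP-based loop, Windows only
        asyncio.set_event_loop(loop)
    else:
        loop = asyncio.get_event_loop()

    crawler = Crawler(args.roots)            # Crawler is the class discussed above
    loop.run_until_complete(crawler.crawl()) # run until the crawl finishes

if __name__ == '__main__':
    main()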
In addition, reporting.py is used to print the execution status of the current task: fetcher_report(fetcher, stats, file=None) prints the working status of a single url (the url attribute of the fetcher), while report(crawler, file=None) prints the working status of all completed urls in the whole project.
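As a rough idea of what these helpers print (the fields used here are limited to attributes mentioned above, namely the fetcher's url and new_urls and the crawler's done collection; stats is accepted only to match the signature and is unused in this sketch):

def fetcher_report(fetcher, stats, file=None):
    # one line per url: the url itself and how many new links it contributed
    print('%-60s new_urls=%d' % (fetcher.url, len(fetcher.new_urls)), file=file)

def report(crawler, file=None):
    # summary for the whole crawl: how many urls were completed
    print('Crawled %d urls' % len(crawler.done), file=file)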
That concludes this introduction to the process of a Python crawler. To really master the material, you still need to practice and apply it yourself. If you want to read more articles like this, please follow the industry information channel.