What are Python multithreaded crawlers and common search algorithms


Today I would like to share some knowledge about Python multithreaded crawlers and the search algorithms commonly used with them. The content is detailed and the logic is clear; I hope you gain something from reading this article. Let's take a look.

Advantages of multithreaded crawlers

Once you have mastered requests and regular expressions, you can start crawling some simple websites.

However, at this point the crawler runs in a single process with a single thread, so it is called a single-threaded crawler. A single-threaded crawler visits only one page at a time and cannot make full use of the computer's network bandwidth. A page is usually a few hundred KB at most, so while the crawler downloads one page, the spare bandwidth and the time between sending the request and receiving the source code are wasted. If the crawler could visit 10 pages at the same time, the crawling speed would increase tenfold. To achieve this, you need multithreading.

Python has a Global Interpreter Lock (GIL). As a result, Python's multithreading is pseudo-multithreading: in essence there is still a single thread of execution, but that thread works on each task for only a few milliseconds, saves its state, switches to another task, works on that for a few milliseconds, and after a full round returns to the first task, restores its state, and continues, switching again and again. Microscopically it is one thread taking turns; macroscopically it looks like several things are being done at once. This mechanism has little impact on I/O-intensive (input/output-intensive) operations, but for CPU-intensive computation it hurts performance severely, because only one CPU core can be used. For computation-intensive programs, therefore, you should use multiple processes, which are not affected by the GIL. A crawler is an I/O-intensive program, so using multithreading can greatly improve crawling efficiency.
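As a rough illustration of this point (the function and task sizes below are my own additions, not from the original article), the following sketch runs the same CPU-bound countdown through a thread pool and a process pool; under the GIL the thread version gains little or nothing, while the process version can use multiple cores:

import time
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool

def count_down(n):
    # A pure-CPU busy loop; under the GIL, threads cannot run it in parallel.
    while n > 0:
        n -= 1

if __name__ == '__main__':
    tasks = [5000000] * 4

    start = time.time()
    with ThreadPool(4) as pool:
        pool.map(count_down, tasks)
    print('4 threads:   %.2f s' % (time.time() - start))

    start = time.time()
    with ProcessPool(4) as pool:
        pool.map(count_down, tasks)
    print('4 processes: %.2f s' % (time.time() - start))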

Multi-process Library: multiprocessing

multiprocessing itself is Python's multi-process library, used to handle operations related to processes. However, because processes cannot directly share memory and stack resources, and starting a new process costs much more than starting a thread, multithreaded crawling has advantages over multi-process crawling.

multiprocessing provides a dummy submodule that lets Python threads use the various methods of multiprocessing.

dummy provides a Pool class, which implements a thread pool.

This thread pool has a map() method that lets all threads in the pool execute a function "at the same time".

For example:

After learning the for loop, you would write something like this to square each number one by one:

for i in range(10):
    print(i*i)

Of course, this way of writing gets the result, but the code computes one number at a time, which is inefficient. If you want to use multithreading to square many numbers at the same time, you can use multiprocessing.dummy:

An example of using multithreading:

from multiprocessing.dummy import Pool

def cal_pow(num):
    return num*num

pool = Pool(3)
num = [x for x in range(10)]
result = pool.map(cal_pow, num)
print('{}'.format(result))

In the code above, a function is defined to compute the square, and then a thread pool with three threads is initialized. The three threads are responsible for computing the squares of the 10 numbers; whichever thread finishes the number in its hands first takes the next number and continues, until all the numbers have been computed.

In this example, the thread pool's map() method takes two parameters: the first is the function name and the second is a list. Note: the first argument is only the name of the function and must not be followed by parentheses. The second parameter is an iterable object; every element of this iterable is passed as an argument to cal_pow(). Besides lists, tuples, sets, and dictionaries can all serve as the second parameter of map().
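As a small demonstration of that last point (the examples below are illustrative additions, not from the original article), map() accepts any iterable; with a dictionary it iterates over the keys:

from multiprocessing.dummy import Pool

def cal_pow(num):
    return num * num

pool = Pool(3)
print(pool.map(cal_pow, (1, 2, 3)))          # tuple: [1, 4, 9]
print(sorted(pool.map(cal_pow, {4, 5})))     # set (unordered): [16, 25]
print(pool.map(cal_pow, {6: 'a', 7: 'b'}))   # dict keys: [36, 49]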

Multithreaded crawler development

Because crawling is an I/O-intensive operation, especially when requesting web page source code, a single-threaded crawler wastes a lot of time waiting for pages to return, so applying multithreading can greatly improve its efficiency. Consider an everyday example: the washing machine needs 50 minutes to wash the clothes, the kettle needs 15 minutes to boil, and memorizing vocabulary takes 1 hour. If you first wait for the washing machine to finish, then boil the water, and only then memorize the words, you need 125 minutes in total.

But viewed another way, the three things can run at the same time. Suppose you could split into three people: one person loads the washing machine and waits for it to finish, another boils the water and waits for it to boil, and you yourself only memorize the words. When the water boils, the person responsible for boiling it disappears; when the washing machine finishes, the person in charge of the laundry disappears; finally you finish memorizing the words on your own. Doing the three things simultaneously takes only 60 minutes.

Of course, this example is not how things work in real life. In reality nobody splits into several people: you memorize the words attentively, the kettle whistles when the water boils, and the washing machine beeps when the clothes are done, so you only act when reminded, with no need to check every minute. These two differences are exactly the differences between the multithreaded model and the event-driven asynchronous model. This section covers multithreaded operations; crawler frameworks that use asynchronous operations will be discussed later. For now, just remember: when the number of actions is small, there is no performance difference between the two approaches, but once the number of actions grows very large, the efficiency of multithreading degrades, sometimes below that of a single thread. At that point, only asynchronous operation is the solution.

The following compares the performance of a single-threaded crawler and a multithreaded crawler fetching the Baidu home page:
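(The original article's benchmark code is not recoverable from this page; what follows is a minimal sketch of the idea, assuming the requests library and 100 fetches of the Baidu home page. Actual timings will vary.)

import time
import requests
from multiprocessing.dummy import Pool

def query(url):
    requests.get(url)

urls = ['https://www.baidu.com'] * 100

# Single thread: fetch the pages one after another.
start = time.time()
for url in urls:
    query(url)
print('1 thread:  %.2f s' % (time.time() - start))

# Five threads: the pool keeps up to five requests in flight "at the same time".
start = time.time()
pool = Pool(5)
pool.map(query, urls)
print('5 threads: %.2f s' % (time.time() - start))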

As the running results show, one thread takes about 16.2 seconds while five threads take about 3.5 seconds, roughly 1/5 of the single-threaded time; the effect of five threads "running at the same time" is visible in the timing. But this does not mean that the bigger the thread pool, the better. The results above also show that five threads take slightly more than 1/5 of the single-threaded time, and that extra bit is the cost of thread switching. This also reflects, from another angle, that Python multithreading is still serial at the micro level. If the thread pool is set too large, the overhead of thread switching may cancel out the gains of multithreading. There is no universally correct pool size; it must be chosen according to the actual situation. Readers can try different sizes in their specific application scenario and compare to find the most suitable value.
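One way to run that comparison empirically (a sketch with a placeholder URL and pool sizes; adapt both to your own scenario):

import time
import requests
from multiprocessing.dummy import Pool

def query(url):
    requests.get(url)

urls = ['https://www.baidu.com'] * 100  # placeholder target

# Time the same workload under several pool sizes and compare.
for size in (1, 2, 5, 10, 20):
    start = time.time()
    with Pool(size) as pool:
        pool.map(query, urls)
    print('pool size %2d: %.2f s' % (size, time.time() - start))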

Common search algorithms for crawlers

Depth-first search

Suppose you need to crawl the course information of an online education website. Starting from the home page, courses are divided into several major categories by language, such as Python, Node.js, and Golang. Each category contains many courses, such as crawlers, Django, and machine learning under Python, and each course is divided into many class hours.

With depth-first search, the crawler follows each branch to the bottom before backtracking: it enters the first category, crawls each course under it in turn, and only then moves on to the next category. (The original article illustrates this route with a figure whose page numbers run in visiting order.)

Breadth-first search

With breadth-first search, the crawler visits level by level: first all the categories, then every course under each category, and finally the class hours of each course. Both visiting orders are sketched in the code below.
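A compact sketch of the two traversal orders on a made-up course tree (the site's real structure is assumed for illustration):

from collections import deque

tree = {
    'home': ['Python', 'Node.js', 'Golang'],
    'Python': ['crawler', 'Django', 'machine learning'],
}

def dfs(node):
    # Depth-first: follow each branch to the end before backtracking.
    print(node)
    for child in tree.get(node, []):
        dfs(child)

def bfs(root):
    # Breadth-first: visit all nodes at one level before the next level.
    queue = deque([root])
    while queue:
        node = queue.popleft()
        print(node)
        queue.extend(tree.get(node, []))

dfs('home')  # home, Python, crawler, Django, machine learning, Node.js, Golang
bfs('home')  # home, Python, Node.js, Golang, crawler, Django, machine learning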

Algorithm selection

For example, suppose you want to crawl all the restaurant information on a website, plus the order information of each restaurant. With the depth-first algorithm, the crawler follows a link to restaurant A and immediately crawls restaurant A's orders. Since there are more than 100,000 restaurants nationwide, the whole crawl might take 12 hours. The problem is that restaurant A's orders might be crawled at 8 a.m. while restaurant B's are crawled at 8 p.m., 12 hours apart. For popular restaurants, 12 hours can mean a revenue difference of millions, so this time gap makes it hard to compare the sales performance of restaurants A and B in later data analysis. The set of restaurants changes far more slowly than the orders do. So with breadth-first search, you could crawl all the restaurants from 0:00 to 12:00, and then crawl every restaurant's orders from 14:00 to 20:00. The order-crawling task would then take only 6 hours, narrowing the differences caused by the time gap. And since it matters little if the store list is refreshed only every few days, the number of requests is also reduced, making the crawler harder for the website to detect.

Another example: to analyze real-time public opinion, you need to crawl Baidu Tieba. A popular forum may have tens of thousands of pages of posts, with the earliest dating back to, say, 2010. With breadth-first search, you would first collect the title and URL of every post in the forum, and then enter each post by its URL to get the content of every floor (reply). But since this is real-time public opinion, posts from seven years ago mean little to the current analysis; what matters more is the new posts, so new content should be grabbed first. Compared with old content, real-time content is the most important. Therefore, for crawling Tieba content, depth-first search is appropriate: when you see a post, enter it immediately and crawl every floor, finish that post, then move to the next. Of course, these two search algorithms are not mutually exclusive; choose flexibly according to the actual situation, and they can often be used together.

That is all for "What are Python multithreaded crawlers and common search algorithms". Thank you for reading! I hope you gained something from this article; to learn more, please follow the industry information channel.
