This article shows how to apply Python multithreading to a web crawler. The content is concise and easy to follow, and I hope you take something away from the detailed walkthrough below.
As a test engineer, you often need to solve the problem of sourcing test data. There are generally three ways to do it:
1. Copy real data directly from the production environment.
2. Crawl data from the Internet.
3. Generate your own data with scripts or tools.
Some time ago, to obtain more test data, I wrote a crawler to collect data from the Internet. It basically met the project's needs, but the crawling efficiency was still not great. As a test engineer who strives for excellence, I decided to study how multithreading could improve the crawler's efficiency.
I. Why multithreading is needed
Know the why as well as the how. Before we learn about multithreading, let's look at why it is needed. Take moving house as an example: a single thread is like hiring one mover who packs, carries, drives, and unloads all by himself, which is obviously slow; multithreading is like hiring four movers: A packs, B loads the boxes onto the truck, C drives to the destination, and D unloads.
So the advantage of multithreading is efficiency and fuller use of resources, and the disadvantage is that the threads must coordinate with each other, or things quickly get messy (like the proverb: one monk carries water to drink, two monks carry water together, and three monks have no water at all). Therefore, to improve crawler efficiency, we must pay special attention to managing the threads.
II. Basic knowledge of multithreading
Process: composed of three parts: the program, the data set, and the process control block. It is one run of the program over the data set; if the same program runs twice on the same data set, two processes are started. The process is the basic unit of resource allocation. In the operating system, each process has its own address space and, by default, one thread of control.
Thread: an entity within a process, the basic unit of CPU scheduling and dispatch, and the smallest unit of execution. Threads reduce the cost of context switching, improve the system's concurrency, and overcome the limitation that a process can only do one thing at a time. Threads are managed by their process, and multiple threads share the parent process's resource space.
The relationship between processes and threads:
A thread belongs to exactly one process; a process can have multiple threads, but at least one.
Resources are allocated to processes, and all threads of the same process share that process's resources (see the sketch after this list).
The CPU is allocated to threads; what actually runs on the CPU is a thread.
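To make the sharing concrete, here is a minimal sketch (an illustrative addition, not from the original article) in which two threads of the same process update one variable in the process's address space, using a lock to coordinate:

import threading

counter = 0  # lives in the process's shared address space
lock = threading.Lock()  # coordinates the threads that share it

def worker():
    global counter
    for _ in range(100000):
        with lock:  # without the lock, concurrent += updates can be lost
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000: both threads saw and modified the same memory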
How threads work:
As shown in the figure below, serial means threads execute on the CPU one after another; parallel means multiple threads run on multiple CPUs at the same time; and concurrency is a kind of "pseudo-parallelism": one CPU can execute only one task at a time, so CPU time is divided into slices, each thread occupies only a very short slice, and the threads take turns. Because the slices are so short, it looks to the user as if all threads run "simultaneously". Concurrency is how most single-CPU multithreading actually works.
[Figure: serial, parallel, and concurrent execution of threads]
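To get a feel for concurrency on a single CPU, here is a small sketch (an illustrative addition, not part of the original article): the two threads' output typically interleaves because each gets only a short time slice before the other runs:

import threading
import time

def task(name):
    for i in range(3):
        print("{}: step {}".format(name, i))
        time.sleep(0.1)  # give up the CPU so the other thread gets a slice

a = threading.Thread(target=task, args=("A",))
b = threading.Thread(target=task, args=("B",))
a.start(); b.start()
a.join(); b.join()  # output usually alternates between A and B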
Working states of a process:
A process has three states: running, blocked, and ready. The transitions among them are shown in the figure below: a running process may voluntarily enter the blocked state because it is waiting for input, or passively enter the ready state because the scheduler picks another process (usually when its allotted CPU time is up); a blocked process enters the ready state when the input it is waiting for arrives; and a ready process enters the running state again when the scheduler selects it.
[Figure: process state transitions among running, blocked, and ready]
III. Examples of multithreaded communication
Coming back to the crawler: when crawling blog posts, we always crawl the list page first, then crawl each article's details according to the list results, and crawling a list page is certainly faster than crawling a detail page.
So we can design thread A to crawl the article list pages, and threads B, C, and D to crawl the article details. A puts the list URLs into a structure that acts like a global variable, and threads B, C, and D take results out of that structure.
In Python, two modules support this: the threading module, which handles creating and starting threads, and the queue module, which provides the shared structure that plays the role of that "global variable".
Let me add one more point here: some readers may ask, why not just use a global variable? Why bother with a queue?
Because a plain global variable is not thread-safe. Suppose the global list holds only one URL: thread B checks that the list is not empty, but before it can take the URL out, the CPU hands the time slice to thread C, and thread C takes the last URL; when thread B gets the CPU again, it tries to take data from an empty list and raises an error.
The queue module implements multi-producer, multi-consumer queues whose put and get operations are thread-safe.
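To contrast the two approaches, here is a hypothetical sketch (not the article's code): the unsafe pattern splits the emptiness check and the removal into two steps, while Queue.get() does both atomically under an internal lock:

import queue

url_list = ["http://testedu.com/0"]  # a shared global list

# Unsafe: check and pop are separate steps, so the scheduler can switch
# threads between them and the pop can hit an already-emptied list.
def take_unsafe():
    if url_list:  # thread B sees a non-empty list here...
        return url_list.pop()  # ...but thread C may have taken the last url by now

# Safe: Queue.get() checks and removes in one thread-safe operation.
url_queue = queue.Queue()
url_queue.put("http://testedu.com/0")
print(url_queue.get())  # no other thread can get the same item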
Enough talk; here is the code:
import threading  # import the threading module
from queue import Queue  # import the queue module
import time  # import the time module

# crawl the article detail pages
def get_detail_html(detail_url_list, id):
    while True:
        url = detail_url_list.get()  # Queue's get method takes an element out of the queue (blocks while it is empty)
        time.sleep(2)  # sleep 2 s to simulate the network request for an article's details
        print("thread {id}: get {url} detail finished".format(id=id, url=url))  # print which thread crawled which url
        detail_url_list.task_done()  # mark the url as processed so the main thread can wait for the queue to drain

# crawl the article list pages
def get_detail_url(queue):
    for i in range(10000):
        time.sleep(1)  # sleep 1 s; crawling the list is faster than crawling details
        queue.put("http://testedu.com/{id}".format(id=i))  # Queue's put method places an element in the queue; it is first-in, first-out, so the first url put is the first url got
        print("get detail url {id} end".format(id=i))  # print which article url was produced

# main function
if __name__ == "__main__":
    detail_url_queue = Queue(maxsize=1000)  # a thread-safe FIFO queue of capacity 1000
    # create the four threads
    thread = threading.Thread(target=get_detail_url, args=(detail_url_queue,))  # thread A is responsible for fetching the list urls
    html_thread = []
    for i in range(3):
        thread2 = threading.Thread(target=get_detail_html, args=(detail_url_queue, i))
        thread2.daemon = True  # B, C and D loop forever, so let them exit with the main thread
        html_thread.append(thread2)  # threads B, C and D crawl the article details
    start_time = time.time()
    # start the four threads
    thread.start()
    for i in range(3):
        html_thread[i].start()
    # join() blocks the parent thread until the child thread (or queue) completes
    thread.join()  # wait for producer A to finish putting all urls
    detail_url_queue.join()  # wait until B, C and D have processed every queued url
    print("last time: {} s".format(time.time() - start_time))  # after threads A, B, C and D are all finished, the main thread computes the total crawl time
Running result:
From the run we can see the threads working in an orderly manner with no errors or warnings, so communicating between threads through a Queue is indeed much safer than sharing a bare global variable. And the multithreaded run takes far less time than a single-threaded one would, so this approach improves crawler efficiency while preserving thread safety, which makes it very practical for gathering test data.
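As a rough sanity check under the simulated delays (1 s per list url, 2 s per detail page): a single thread would need about 10000 × (1 + 2) = 30000 s for all 10000 articles, whereas with one producer and three consumers the 20000 s of detail work is split three ways and overlaps the list crawling, so the whole run is bounded by the producer's roughly 10000 s, about a threefold speedup.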
That is how Python multithreading can be applied in a crawler. Did you pick up any new knowledge or skills? If you want to learn more, you are welcome to follow the industry information channel.