Example Analysis of Python Multithreaded Crawler

2025-05-01 Update. From: SLTechnology News&Howtos (Shulou.com)

This article works through examples of Python multithreading as used in crawlers: how threads and processes work, the threading module, daemon threads, join, locks, semaphores, and the GIL.

How threads and processes work

When a program runs, the operating system creates a process that holds its code and state. Processes are executed by one or more CPUs. Each CPU runs only one process at a time and switches quickly between processes, which makes it feel as though many programs are running at once. In the same way, execution within a process switches between threads, each running a different part of the program. When one thread is blocked waiting (for example, on a network response), the process switches to another thread, so CPU time is not wasted.
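As a rough illustration of the switching described above, the sketch below runs two threads that each wait half a second. The names wait_task and run_two_waiting_threads and the 0.5-second delay are assumptions made for this example, and time.sleep stands in for any blocking wait.

```python
import threading
import time


def wait_task(label, results):
    # time.sleep stands in for a blocking wait (assumed 0.5 s for illustration)
    time.sleep(0.5)
    results.append(label)


def run_two_waiting_threads():
    results = []
    threads = [
        threading.Thread(target=wait_task, args=(name, results))
        for name in ("A", "B")
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until both threads have finished
    elapsed = time.perf_counter() - start
    return sorted(results), elapsed


if __name__ == "__main__":
    labels, elapsed = run_two_waiting_threads()
    print(labels, round(elapsed, 2))
```

Because the two waits overlap, the elapsed time should be close to 0.5 seconds rather than 1 second.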

The threading module

In the Python standard library, the threading module provides multithreading support. It wraps the lower-level _thread module (called thread in Python 2), and in most cases threading is all you need. It is easy to use:

    t1 = threading.Thread(target=run, args=("T1",))  # create a thread instance
    # target is the function to run (pass the function itself, not a call to it);
    # args is a tuple holding the arguments for that function
    t1.start()  # start the thread instance

Normal creation method

Creating a thread is simple:

    import threading
    import time

    def printStr(name):
        print(name + "- python blue light")
        s = 0.5
        time.sleep(s)
        print(name + "- python green light")

    # args must be a tuple, so a single argument needs a trailing comma
    t1 = threading.Thread(target=printStr, args=("Hello!",))
    t2 = threading.Thread(target=printStr, args=("Welcome!",))
    t1.start()
    t2.start()

Custom Thread

The idea is to subclass threading.Thread and override the run method of the Thread class.

    import threading
    import time

    class testThread(threading.Thread):
        def __init__(self, s):
            super().__init__()
            self.s = s

        def run(self):
            # run() is what executes in the new thread after start() is called
            print(self.s + "--python")
            time.sleep(0.5)
            print(self.s + "--blue light")

    if __name__ == '__main__':
        t1 = testThread("test 1")
        t2 = testThread("test 2")
        t1.start()
        t2.start()

Daemon thread

Setting a child thread's daemon flag (t.daemon = True, or the older setDaemon(True), which is deprecated since Python 3.10) makes it a daemon thread. When the main thread ends, the program exits without waiting for its daemon threads, so they are stopped along with it.

    import threading
    import time

    def run(s):
        print(s, "python")
        time.sleep(0.5)
        print(s, "blue light")

    if __name__ == "__main__":
        t = threading.Thread(target=run, args=("Hello!",))
        t.daemon = True  # same effect as the older t.setDaemon(True)
        t.start()
        print("end")

Results:

    Hello! python
    end

When the main thread ends, the daemon thread is stopped automatically, whether or not it has finished its work. That is why "blue light" is never printed.

The main thread waits for the child thread to finish

With the join method, the main thread waits for the child thread to finish before it continues.
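The listing for this step did not survive in the source. The sketch below is an assumed reconstruction modeled on the daemon example, with join() added and a "green light" message chosen to match the results that follow; the main() wrapper is an addition for this sketch.

```python
import threading
import time


def run(s):
    print(s, "python")
    time.sleep(0.5)
    print(s, "green light")


def main():
    t = threading.Thread(target=run, args=("Hello!",))
    t.start()
    t.join()  # block the main thread until t has finished
    print("end")


if __name__ == "__main__":
    main()
```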

Results:

    Hello! python
    Hello! green light
    end

Those are the basic uses of multithreading. The threading module offers more, as the following sections show.

Lock

We already met locks when introducing the diskcache cache, and it is not hard to see why they matter for multithreading as well. When a shared resource is not protected, several threads working on it at once can leave dirty data behind and produce unexpected results. In other words, the code is not thread-safe.

The following example produces unexpected results:

    import threading

    price = 0

    def changePrice(n):
        global price
        price = price + n
        price = price - n

    def runChange(n):
        for i in range(2000000):
            changePrice(n)

    if __name__ == "__main__":
        t1 = threading.Thread(target=runChange, args=(5,))
        t2 = threading.Thread(target=runChange, args=(8,))
        t1.start()
        t2.start()
        t1.join()
        t2.join()
        print(price)

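In theory the result is 0, but it may differ from run to run. The statement price = price + n is a separate read, add, and write, so a thread switch between the read and the write can discard the other thread's update.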
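To make the lost update concrete, the sketch below replays one possible interleaving by hand in a single thread. It is only an illustration: the function name lost_update_demo and the chosen order of steps are assumptions, and a real run may interleave differently or not at all.

```python
def lost_update_demo():
    price = 0

    # Thread A (n=5) and thread B (n=8) both execute "price = price + n".
    read_a = price        # A reads 0
    read_b = price        # B reads 0 before A has written its result
    price = read_a + 5    # A writes 5
    price = read_b + 8    # B writes 8, so A's +5 is lost

    # Both threads then execute "price = price - n" without interference.
    price = price - 5     # A: 8 - 5 = 3
    price = price - 8     # B: 3 - 8 = -5
    return price


if __name__ == "__main__":
    print(lost_update_demo())  # prints -5 instead of the expected 0
```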

This is where a lock is needed:

    import threading
    from threading import Lock

    price = 0

    def changePrice(n):
        global price
        lock.acquire()  # acquire lock
        price = price + n
        print("price:" + str(price))
        price = price - n
        lock.release()  # release lock

    def runChange(n):
        for i in range(2000000):
            changePrice(n)

    if __name__ == "__main__":
        lock = Lock()
        t1 = threading.Thread(target=runChange, args=(5,))
        t2 = threading.Thread(target=runChange, args=(8,))
        t1.start()
        t2.start()
        t1.join()
        t2.join()
        print(price)

The result now matches the theoretical value. A lock allows only one thread at a time to modify the shared data, which is what makes the code thread-safe.
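A Lock can also be used as a context manager, which releases it even if the protected code raises an exception. The sketch below is one possible variant of the example above, not the article's own listing: the snake_case names, the run_demo helper, and the smaller loop count are assumptions made to keep it quick to run.

```python
import threading

price = 0
lock = threading.Lock()


def change_price(n):
    global price
    with lock:  # acquired here, released on exit even if an exception occurs
        price = price + n
        price = price - n


def run_change(n, rounds):
    for _ in range(rounds):
        change_price(n)


def run_demo(rounds=100_000):
    threads = [
        threading.Thread(target=run_change, args=(n, rounds)) for n in (5, 8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return price


if __name__ == "__main__":
    print(run_demo())
```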

Semaphore

The BoundedSemaphore class lets a fixed number of threads work on the data at the same time:

    import threading
    import time

    def work(n):
        semaphore.acquire()
        print("serial number:" + str(n))
        time.sleep(1)
        semaphore.release()

    if __name__ == "__main__":
        semaphore = threading.BoundedSemaphore(5)
        for i in range(100):
            t = threading.Thread(target=work, args=(i + 1,))
            t.start()
        # active_count() returns the number of threads still alive;
        # loop until only the main thread is left
        while threading.active_count() != 1:
            pass
        else:
            print("end")

The output appears in batches of five, with a pause of about one second between batches, until all 100 threads have run.
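One way to check that the semaphore really caps concurrency is to count how many threads are inside the guarded section at once. The sketch below is an illustration under assumed names (LIMIT, active, peak, run_workers) and a shortened 0.05-second delay; none of these come from the original listing.

```python
import threading
import time

LIMIT = 5  # assumed cap, matching BoundedSemaphore(5) in the example above
semaphore = threading.BoundedSemaphore(LIMIT)
counter_lock = threading.Lock()
active = 0  # threads currently inside the guarded section
peak = 0    # highest value of active seen so far


def work():
    global active, peak
    with semaphore:  # at most LIMIT threads get past this line at once
        with counter_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)  # stand-in for real work
        with counter_lock:
            active -= 1


def run_workers(count=20):
    threads = [threading.Thread(target=work) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    print("peak concurrency:", run_workers())
```

The reported peak should never exceed LIMIT, however many threads are started.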

GIL global interpreter lock

No discussion of Python multithreading is complete without the GIL. GIL stands for Global Interpreter Lock, a design decision made early in Python's history to protect the interpreter's internal data. A thread must acquire the GIL before it can execute, and each process has only one GIL, so only the thread holding it can run Python bytecode at any moment. The GIL belongs to the CPython implementation rather than to the language: CPython threads are native operating-system threads whose scheduling the interpreter cannot control, so it relies on the GIL to make sure only one thread touches interpreter data at a time. PyPy also has a GIL, while Jython does not. For a crawler the cost is usually small, because CPython releases the GIL while a thread is blocked on I/O such as a network request.
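Since the article is about crawlers, the sketch below shows why threads still help for I/O-bound work despite the GIL. It uses concurrent.futures.ThreadPoolExecutor from the standard library, which the article itself does not cover. The fake_fetch function, the example.com URLs, and the 0.2-second delay are hypothetical stand-ins for real network requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    # Hypothetical stand-in for a network request; time.sleep releases the
    # GIL while waiting, as blocking I/O does
    time.sleep(0.2)
    return url, len(url)


def crawl(urls, workers=4):
    # pool.map returns results in the same order as the input URLs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(8)]
    start = time.perf_counter()
    results = crawl(urls)
    print(len(results), "pages in", round(time.perf_counter() - start, 2), "s")
```

With eight simulated requests and four workers, the waits overlap, so the run should take roughly 0.4 seconds instead of the 1.6 seconds a sequential loop would need.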
