This article introduces "Python multi-process batch crawling novel code sharing". Many readers run into difficulties with this kind of task in practice, so the editor will walk you through how to handle these situations. I hope you read it carefully and get something out of it!
Using Python to run the same code in multiple processes
Multithreading in Python is not true parallelism: because of the global interpreter lock (GIL), threads cannot make full use of a multicore CPU, so in most cases you need multiple processes instead. Python provides a very useful multi-process package, multiprocessing: you only need to define a function, and Python takes care of everything else. With this package you can easily go from a single process to concurrent execution. multiprocessing supports child processes, communication and data sharing between them, and different forms of synchronization, and provides components such as Process, Queue, Pipe, and Lock. A minimal example follows.
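As a minimal sketch of what the paragraph above describes (the worker function and the number of processes are illustrative only, not part of the article's crawler), turning a single-process task into concurrent execution is just a matter of defining a function and handing it to Process:

from multiprocessing import Process

def worker(n):
    # illustrative task; replace with real work
    print('worker', n, 'running')

if __name__ == '__main__':
    procs = [Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()   # each start() launches a separate OS process
    for p in procs:
        p.join()    # wait for all workers to finish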
1. Process
The class that creates a process: Process([group [, target [, name [, args [, kwargs]]]]]). target is the callable to run, args is the positional-argument tuple for that callable, kwargs is its keyword-argument dictionary, name is an alias for the process, and group is essentially never used.
Methods: is_alive(), join([timeout]), run(), start(), terminate(). A Process is started with start().
is_alive(): returns whether the process is still running.
join([timeout]): the parent process blocks and waits for the child process to exit (or until the timeout elapses). When using a process pool, join() is called after close() or terminate().
run(): called automatically in the child when start() is invoked on the process.
Properties: authkey, daemon (must be set before start()), exitcode (None while the process is running; -N means it was terminated by signal N), name, pid. A daemon process is terminated automatically when its parent process exits and cannot create child processes of its own, which is why daemon has to be set before start(). A short example of this API follows.
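The sketch below only illustrates the methods and properties listed above, using made-up names (task, demo-worker); it is not part of the crawler:

from multiprocessing import Process
import time

def task():
    time.sleep(1)   # stand-in for real work

if __name__ == '__main__':
    p = Process(target=task, name='demo-worker')  # target is the callable, name is an alias
    p.daemon = True                               # daemon must be set before start()
    p.start()                                     # start() runs run() in the child process
    print(p.pid, p.is_alive(), p.exitcode)        # exitcode is None while the child is running
    p.join(5)                                     # parent blocks for at most 5 seconds
    print(p.exitcode)                             # 0 on normal exit, -N if killed by signal N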
The demo below crawls the Biquge novel site (xbiquge.la), fetching only 4 novels and starting four processes at the same time. The way the processes are started is a bit crude; it is written that way to make it easy to time the run. If there is a better way, feel free to leave a message.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019-1-3 17:15
# @Author  : jia.zhao
# @Desc    :
# @File    : process_spider.py
# @Software: PyCharm

from multiprocessing import Process, Lock, Queue
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import requests
from lxml import etree

exitFlag = 0
q = Queue()
chrome_options = Options()
chrome_options.add_argument('--headless')


class scrapy_biquge():

    def get_url(self):
        browser = webdriver.Chrome(chrome_options=chrome_options)
        browser.get('http://www.xbiquge.la/xuanhuanxiaoshuo/')
        # get the list of novels
        content = browser.find_element_by_class_name("r")
        content = content.find_elements_by_xpath('//ul/li/span[@class="s2"]/a')
        for i in range(len(content)):
            # the name of the novel
            title = content[i].text
            # url of the novel
            href = content[i].get_attribute('href')
            print(title + '+' + href)
            # load into the queue
            q.put(title + '+' + href)
            if i == 3:
                break
        browser.close()
        browser.quit()


def get_dir(title, href):
    time.sleep(2)
    res = requests.get(href, timeout=60)
    res.encoding = 'utf8'
    novel_contents = etree.HTML(res.text)
    novel_dir = novel_contents.xpath('//div[@id="list"]/dl/dd/a//text()')
    novel_dir_href = novel_contents.xpath('//div[@id="list"]/dl/dd/a/@href')
    path = 'novel/' + title + '.txt'
    list_content = []
    i = 0
    for novel in range(len(novel_dir)):
        novel_dir_content = get_content('http://www.xbiquge.la' + novel_dir_href[novel])
        print(title, novel_dir[novel])
        list_content.append(novel_dir[novel] + '\n' + ''.join(novel_dir_content) + '\n')
        i = i + 1
        if i == 2:
            # append every two chapters to the novel's text file
            try:
                with open(path, 'a', encoding='utf8') as f:
                    f.write('\n'.join(list_content))
                list_content = []
                i = 0
            except Exception as e:
                print(e)


def get_content(novel_dir_href):
    time.sleep(2)
    res = requests.get(novel_dir_href, timeout=60)
    res.encoding = 'utf8'
    html_contents = etree.HTML(res.text)
    novel_dir_content = html_contents.xpath('//div[@id="content"]//text()')
    return novel_dir_content


class MyProcess(Process):

    def __init__(self, q, lock):
        Process.__init__(self)
        self.q = q
        self.lock = lock

    def run(self):
        print(self.q.qsize(), 'queue size')
        print('Pid: ' + str(self.pid) + ' LoopCount: ')
        while not self.q.empty():
            # take one task off the shared queue under the lock
            self.lock.acquire()
            item = self.q.get()
            self.lock.release()
            print(item)
            title = item.split('+')[0]
            href = item.split('+')[1]
            try:
                get_dir(title, href)
            except Exception as e:
                print(e, 'skip loop with exception')
                continue


if __name__ == '__main__':
    start_time = time.time()
    print(start_time)
    scrapy_biquge().get_url()
    lock = Lock()
    p0 = MyProcess(q, lock)
    p0.start()
    p1 = MyProcess(q, lock)
    p1.start()
    p2 = MyProcess(q, lock)
    p2.start()
    p3 = MyProcess(q, lock)
    p3.start()
    p0.join()
    p1.join()
    p2.join()
    p3.join()
    end_time = time.time()
    print(start_time, end_time, end_time - start_time, 'time difference')
A multiprocessing Queue handles the task distribution and realizes data sharing between the processes. The code should run as-is (the selenium part needs Chrome with a matching chromedriver installed).
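To show the queue-based sharing on its own (a standalone sketch, not the article's crawler; the item names and worker count are made up), a producer fills a Queue in the main process and worker processes drain it:

from multiprocessing import Process, Queue

def consumer(q):
    # each worker pulls items from the same shared queue
    while not q.empty():
        print('got', q.get())

if __name__ == '__main__':
    q = Queue()
    for item in ['novel-1', 'novel-2', 'novel-3', 'novel-4']:
        q.put(item)
    workers = [Process(target=consumer, args=(q,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

Passing the queue through args (or through __init__, as MyProcess does above) works with both the fork and spawn start methods, whereas relying on a module-level queue only works where the child process inherits it via fork.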
That concludes this look at "Python multi-process batch crawling novel code sharing". If you have any questions, feel free to leave a message. Thank you for reading; if you want to learn more, keep following the site, where the editor will continue to publish practical articles.