Python multi-process batch crawling novel code sharing


This article shares the code for batch-crawling novels with multiple processes in Python. In real cases many people run into trouble here, so let the editor lead you through how to handle these situations. I hope you read it carefully and come away with something!

Using Python to run the same code in multiple processes.

Multithreading in Python is not true multithreading: because of the global interpreter lock, only one thread executes Python bytecode at a time, so to make full use of a multicore CPU you usually need multiple processes. Python provides a very handy multi-process package, multiprocessing: you only need to define a function and Python does everything else, making it easy to go from a single process to concurrent execution. multiprocessing supports child processes, communication and data sharing between them, and different forms of synchronization, and it provides components such as Process, Queue, Pipe, and Lock.
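
As a minimal sketch of that idea (the function square below is just an illustration, not part of the crawler): define a function, hand it to Process via target and args, then start() and join().

from multiprocessing import Process

def square(n):
    # each call runs in its own child process, so it can use its own CPU core
    print(n, 'squared is', n * n)

if __name__ == '__main__':
    processes = [Process(target=square, args=(i,)) for i in range(4)]
    for p in processes:
        p.start()   # spawn the child; it calls run(), which invokes square(i)
    for p in processes:
        p.join()    # wait for each child to exit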

1. Process

The class that creates a process: Process([group [, target [, name [, args [, kwargs]]]]]). target is the callable object to run; args is the positional-argument tuple for the callable; kwargs is the keyword-argument dictionary for the callable; name is an alias for the process; group should essentially always be left unset.

Methods: is_alive(), join([timeout]), run(), start(), terminate(). A Process is started with start().

is_alive(): determines whether the process is still alive.

join([timeout]): the calling process blocks and waits for the child process to exit (for a Pool, join() is used after close() or terminate()).

run(): called automatically when the process's start() method is invoked.

Properties: authkey, daemon (must be set before start()), exitcode (None while the process is running; -N means it was terminated by signal N), name, pid. A daemon process is terminated automatically when its parent process ends, and it cannot spawn child processes of its own, so daemon must be set before start() is called, as in the sketch below.
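
A small sketch of these attributes in action (background_task and worker-1 are assumed names for illustration):

import time
from multiprocessing import Process

def background_task():
    time.sleep(1)

if __name__ == '__main__':
    p = Process(target=background_task, name='worker-1')
    p.daemon = True                       # must be set before start()
    p.start()                             # start() runs background_task in the child
    print(p.name, p.pid, p.is_alive())    # while running: is_alive() is True, exitcode is None
    p.join()                              # block until the child exits
    print(p.exitcode)                     # 0 on normal exit, -N if killed by signal N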

The following demo crawls the Biquge novel site (xbiquge.la), fetching only 4 novels and starting four processes at the same time. The startup style is a bit crude; it is written this way in order to time the run. If there is a better way, please leave a message, your guidance is welcome.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019-1-3 17:15
# @Author  : jia.zhao
# @Desc    :
# @File    : process_spider.py
# @Software: PyCharm

from multiprocessing import Process, Lock, Queue
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import requests
from lxml import etree

exitFlag = 0  # unused

q = Queue()

chrome_options = Options()
chrome_options.add_argument('--headless')


class scrapy_biquge():

    def get_url(self):
        browser = webdriver.Chrome(chrome_options=chrome_options)
        browser.get('http://www.xbiquge.la/xuanhuanxiaoshuo/')
        # get the novel list
        content = browser.find_element_by_class_name("r")
        content = content.find_elements_by_xpath('//ul/li/span[@class="s2"]/a')
        for i in range(len(content)):
            # the name of the novel
            title = content[i].text
            # url of the novel
            href = content[i].get_attribute('href')
            print(title + '+' + href)
            # load it into the queue
            q.put(title + '+' + href)
            if i == 3:
                # only crawl the first four novels
                break
        browser.close()
        browser.quit()


def get_dir(title, href):
    time.sleep(2)
    res = requests.get(href, timeout=60)
    res.encoding = 'utf8'
    novel_contents = etree.HTML(res.text)
    novel_dir = novel_contents.xpath('//div[@id="list"]/dl/dd/a//text()')
    novel_dir_href = novel_contents.xpath('//div[@id="list"]/dl/dd/a/@href')
    path = 'novel/' + title + '.txt'  # the novel/ directory must already exist
    list_content = []
    i = 0
    for novel in range(len(novel_dir)):
        novel_dir_content = get_content('http://www.xbiquge.la' + novel_dir_href[novel])
        print(title, novel_dir[novel])
        list_content.append(novel_dir[novel] + '\n' + ''.join(novel_dir_content) + '\n')
        i = i + 1
        if i == 2:
            # append to the file every two chapters
            try:
                with open(path, 'a', encoding='utf8') as f:
                    f.write('\n'.join(list_content))
                list_content = []
                i = 0
            except Exception as e:
                print(e)
    if list_content:
        # flush any leftover chapters
        with open(path, 'a', encoding='utf8') as f:
            f.write('\n'.join(list_content))


def get_content(novel_dir_href):
    time.sleep(2)
    res = requests.get(novel_dir_href, timeout=60)
    res.encoding = 'utf8'
    html_contents = etree.HTML(res.text)
    novel_dir_content = html_contents.xpath('//div[@id="content"]//text()')
    return novel_dir_content


class MyProcess(Process):

    def __init__(self, q, lock):
        Process.__init__(self)
        self.q = q
        self.lock = lock

    def run(self):
        print(self.q.qsize(), 'queue size')
        print('Pid: ' + str(self.pid))
        while not self.q.empty():
            # the lock guards the shared queue read
            self.lock.acquire()
            item = self.q.get()
            self.lock.release()
            print(item)
            title = item.split('+')[0]
            href = item.split('+')[1]
            try:
                get_dir(title, href)
            except Exception as e:
                print(e, 'skip this item on exception')
                continue


if __name__ == '__main__':
    start_time = time.time()
    print(start_time)
    scrapy_biquge().get_url()
    lock = Lock()
    p0 = MyProcess(q, lock)
    p0.start()
    p1 = MyProcess(q, lock)
    p1.start()
    p2 = MyProcess(q, lock)
    p2.start()
    p3 = MyProcess(q, lock)
    p3.start()
    p0.join()
    p1.join()
    p2.join()
    p3.join()
    end_time = time.time()
    print(start_time, end_time, end_time - start_time, 'time elapsed')

Queue handling in multiprocessing is what realizes the data sharing between the processes here. The code should run as-is.
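
For reference, a stripped-down sketch of the same queue pattern (the name worker is illustrative, not from the demo): a parent process fills a Queue and several workers drain it. Using get_nowait() with the Empty exception avoids the race between empty() and get() that the demo guards with a lock.

from multiprocessing import Process, Queue
from queue import Empty

def worker(q):
    while True:
        try:
            # non-blocking get: raises Empty once the queue is drained
            item = q.get_nowait()
        except Empty:
            break
        print('processing', item)

if __name__ == '__main__':
    q = Queue()
    for n in range(8):
        q.put('task-%d' % n)
    workers = [Process(target=worker, args=(q,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()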

That's all for "Python multi-process batch crawling novel code sharing". Thank you for reading, and feel free to leave a message if you have any questions. If you want to learn more about the industry, you can follow this site, where the editor will keep putting out more high-quality practical articles for you!
