
A Python3 web crawler example: crawling a novel

2025-01-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains in detail an example of web crawling with Python3. It is shared as a reference; I hope you will have a better understanding of the topic after reading it.

This example uses Python3 to crawl a novel. The code is a little rough around the edges; here it is in full:

import requests
from lxml import etree
from multiprocessing.dummy import Pool
import os
import re

# Index page of the novel's chapters
chapter_url = "https://www.biqudu.com/43_43821/"

# Requests made with verify=False raise an InsecureRequestWarning
# and sometimes report errors; this statement silences the warning.
requests.packages.urllib3.disable_warnings()

def get_response(url):
    '''
    Fetch the page at the given URL and
    return it as an lxml tree ready for XPath queries.
    '''
    html = requests.get(url, verify=False)
    return etree.HTML(html.text)

def get_chapter_content(selector):
    '''
    Extract the chapter list from the index page.
    Returns a list of {"title": ..., "url": ...} dicts.
    '''
    html = []
    # Chapter titles, selected by XPath
    title = selector.xpath('//*[@id="list"]/dl/dd/a/text()')
    # Chapter links (site-relative), selected by XPath
    href = selector.xpath('//*[@id="list"]/dl/dd/a/@href')
    # Start at index 12 because the first few entries are not real chapters
    for i in range(12, len(title)):
        tit = title[i]
        url = "https://www.biqudu.com" + href[i]
        chapter = {"title": tit, "url": url}
        html.append(chapter)
    return html
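The XPath extraction pattern used above can be exercised offline against a small in-memory page. This is a sketch with made-up markup that merely mimics the structure the crawler expects, not the real biqudu.com page:

```python
from lxml import etree

# Hypothetical miniature index page: <div id="list"><dl><dd><a>...</a></dd>...
page = '''
<html><body>
  <div id="list"><dl>
    <dd><a href="/43_43821/1.html">Chapter 1</a></dd>
    <dd><a href="/43_43821/2.html">Chapter 2</a></dd>
  </dl></div>
</body></html>
'''

selector = etree.HTML(page)
# Same two queries as get_chapter_content: link text and href attribute
titles = selector.xpath('//*[@id="list"]/dl/dd/a/text()')
hrefs = selector.xpath('//*[@id="list"]/dl/dd/a/@href')
print(titles)  # ['Chapter 1', 'Chapter 2']
print(hrefs)   # ['/43_43821/1.html', '/43_43821/2.html']
```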

def save_content(url):
    '''
    Download one chapter and save it to disk.
    The argument is a dict holding the chapter URL ('url')
    and the name of the file to save under ('name').
    '''
    # Folder that receives the downloaded chapters
    folder = 'novel'
    # Get the selector for the chapter page
    html = get_response(url['url'])
    # Extract the chapter text
    con = html.xpath('//*[@id="content"]/text()')
    # Create the folder if it does not exist yet
    if not os.path.exists(folder):
        os.mkdir(folder)
    # Replace characters that are illegal in file names
    fileName = re.sub(r'[\\/:*?"<>|]', '-', url['name'])
    # Save the file
    with open(folder + "/" + fileName + ".txt", "w+", encoding="utf-8") as f:
        # The XPath result is a list; join it into one string
        content = ''.join(con)
        # Write the text in lines of at most 50 characters
        for i in range(0, len(content), 50):
            f.write(content[i:i + 50] + "\n")
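The filename sanitization and the 50-character line wrapping inside save_content can be checked without any network access. A minimal sketch, with the two steps pulled out as hypothetical helper functions:

```python
import re

def sanitize(name, repl='-'):
    # Replace characters that are illegal in file names on Windows
    return re.sub(r'[\\/:*?"<>|]', repl, name)

def wrap_lines(text, width=50):
    # Split text into chunks of at most `width` characters,
    # mirroring the range(0, len(content), 50) loop in save_content
    return [text[i:i + width] for i in range(0, len(text), width)]

print(sanitize('a/b:c?'))          # a-b-c-
print(wrap_lines('a' * 120))       # three chunks: 50 + 50 + 20 chars
```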

def get_content(html):
    '''
    Crawl and save all chapters in parallel.
    '''
    urls = []
    for con in html:
        url = con['url']
        name = con['title']
        urls.append({'name': name, 'url': url})
    # Thread pool of 4 workers
    pool = Pool(4)
    # map applies the save_content crawl function to every entry in parallel;
    # urls is a list holding the chapter URLs and their save names
    pool.map(save_content, urls)
    pool.close()
    pool.join()

def main():
    selector = get_response(chapter_url)
    html = get_chapter_content(selector)
    get_content(html)

if __name__ == '__main__':
    main()
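The multiprocessing.dummy thread-pool pattern used in get_content can be tried on its own with a trivial stand-in worker in place of save_content, so no network is involved:

```python
from multiprocessing.dummy import Pool  # thread-based, not process-based

def worker(item):
    # Hypothetical stand-in for save_content: just label the item
    return "done: " + item['name']

items = [{'name': 'ch1', 'url': 'u1'}, {'name': 'ch2', 'url': 'u2'}]

pool = Pool(4)
# map runs worker over items across 4 threads and preserves input order
results = pool.map(worker, items)
pool.close()
pool.join()
print(results)  # ['done: ch1', 'done: ch2']
```

Because Pool.map preserves the order of its input, the results line up with the chapter list even though the downloads finish in arbitrary order.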

This is the end of the article on crawling with Python3. I hope the above content is helpful to you. If you think the article is good, you can share it for more people to see.

Welcome to subscribe to "Shulou Technology Information" to get the latest news, interesting stories, and hot topics in the IT industry, covering the latest Internet news, technology news, and IT industry trends.
