This article explains in detail the Python crawler video and an example of crawling with Python3. The content is practical, and the editor shares it here for your reference; I hope you will have a better understanding of the relevant knowledge after reading it.
Here I use Python3 to crawl a novel. The code is a little messy; the full listing is given below.
import requests
from lxml import etree
from multiprocessing.dummy import Pool
import os
import re

# the chapter list address of the novel
chapter_url = "https://www.biqudu.com/43_43821/"

# ignore the HTTPS warning; requests sometimes reports errors when verify=False is set,
# and this statement suppresses the warning
requests.packages.urllib3.disable_warnings()


def get_response(url):
    '''
    Get the response data for the given URL,
    returned in xpath selector format
    '''
    html = requests.get(url, verify=False)
    return etree.HTML(html.text)


def get_chapter_content(selector):
    '''
    Take data in xpath selector format and extract the wanted fields,
    returning a list of chapter titles and chapter addresses
    '''
    html = []
    # obtain the titles with xpath
    title = selector.xpath('//*[@id="list"]/dl/dd/a/text()')
    # obtain the urls with xpath
    href = selector.xpath('//*[@id="list"]/dl/dd/a/@href')
    # start traversing from index 12 because the first few entries are not needed
    for i in range(12, len(title)):
        tit = title[i]
        url = "https://www.biqudu.com" + href[i]
        chapter = {"title": tit, "url": url}
        html.append(chapter)
    return html


def save_content(url):
    '''
    Fetch and save the data for the URL passed in;
    the argument is a dictionary holding the chapter address
    and the name of the file it should be saved as
    '''
    # folder where the downloaded files are saved
    folder = 'novel'
    # get the selector
    html = get_response(url['url'])
    # extract the chapter text
    con = html.xpath('//*[@id="content"]/text()')
    # create the folder if it does not already exist
    if not os.path.exists(folder):
        os.mkdir(folder)
    # remove illegal filename characters
    fileName = re.sub(r'[\\/:*?"|]', '-', url['name'])
    # save the file
    with open(folder + "/" + fileName + ".txt", "w+", encoding="utf-8") as f:
        # the xpath result is a list, so join it into a str
        content = ''.join(con)
        # write the string in lines of no more than 50 characters each
        for i in range(0, len(content), 50):
            f.write(content[i:i + 50] + "\n")


def get_content(html):
    '''
    Crawl and save the data in parallel
    '''
    urls = []
    for con in html:
        url = con['url']
        name = con['title']
        urls.append({'name': name, 'url': url})
    # number of threads
    pool = Pool(4)
    # use map for parallel crawling; save_content is the function that fetches and saves
    # urls is a list holding each chapter URL and the corresponding file name
    pool.map(save_content, urls)
    pool.close()
    pool.join()


def main():
    selector = get_response(chapter_url)
    html = get_chapter_content(selector)
    get_content(html)


if __name__ == '__main__':
    main()
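If you want to check the two xpath expressions before kicking off the full crawl, a quick single-request test is enough. The following is a minimal sketch that reuses the same chapter_url and selectors assumed by the script above; the biqudu.com page structure may have changed since this was written, so treat the element ids and paths as assumptions rather than guarantees.

# A minimal sketch to check the chapter-list selectors before a full crawl.
# It assumes the same biqudu.com markup as the script above; adjust the xpath
# expressions if the site structure has changed.
import requests
from lxml import etree

requests.packages.urllib3.disable_warnings()

chapter_url = "https://www.biqudu.com/43_43821/"
selector = etree.HTML(requests.get(chapter_url, verify=False).text)

titles = selector.xpath('//*[@id="list"]/dl/dd/a/text()')
hrefs = selector.xpath('//*[@id="list"]/dl/dd/a/@href')

print("chapters found:", len(titles))
# preview a few entries after the 12 skipped ones
for title, href in list(zip(titles, hrefs))[12:15]:
    print(title, "https://www.biqudu.com" + href)

If the printed count is reasonable and the preview looks like real chapter links, the full script should work unchanged.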
This is the end of the article on the Python crawler video and how to crawl with Python3. I hope the above content is helpful to you and that you learn something new from it. If you think the article is good, feel free to share it so more people can see it.