This article walks through an example of crawling free (or paid) novels with Python. It is a practical exercise, and I hope you get something out of it.
Most of us enjoy reading novels: one chapter pulls you straight into the next. Of course, paid novels are not cheap, and being hit with a charge halfway through a book catches you off guard. For programmers, though, fees are hardly an obstacle: almost everything can be crawled.
What is a web crawler?
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically grabs information from the World Wide Web according to certain rules. Other, less frequently used names include ant, automatic indexer, simulator, and worm.
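As a minimal sketch of the idea, the snippet below (it uses http://example.com purely as a stand-in URL) downloads one page and collects every link it finds, which is the basic operation a crawler repeats as it follows those links:

import re
import urllib.request

# download one page (http://example.com is a placeholder used only for illustration)
page = urllib.request.urlopen("http://example.com").read().decode("utf-8")

# the "certain rules" here are as simple as possible: collect every href on the page
links = re.findall(r'href="(.*?)"', page)
print("found %d links" % len(links))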
Environment: Python 3.6 + Windows
Development tools: use whichever you like, as long as you are happy!
Main ideas:
1. Get the source code of the home page.
2. Get the hyperlinks to the chapters.
3. Get the source code of each chapter page.
4. Extract the novel content.
5. Download: write the content out to files.
Let's look at the Python code. (The exact regular expressions depend on the structure of the site's pages, so treat the patterns below as templates.)
import urllib.request
import re

# 1 get the source code of the home page
# 2 get the hyperlinks to the chapters
# 3 get the source code of each chapter page
# 4 extract the novel content
# 5 download: write the content out to files

# camelCase naming
# get the novel content
def getNovertContent():
    # home page of the novel
    html = urllib.request.urlopen("http://www.quanshuwang.com/book/0/269").read()
    html = html.decode("gbk")
    # .*? is a non-greedy match; without the parentheses (groups) nothing is captured
    # NOTE: the original pattern was mangled during extraction; this reconstruction assumes
    # the chapter list is made of <li><a href="..." title="...">...</a></li> items
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    # compiling the pattern makes repeated use faster
    reg = re.compile(reg)
    urls = re.findall(reg, html)
    # print(urls)  # a list of (url, title) tuples, e.g.
    # [("http://www.quanshuwang.com/book/0/269/78850.html", "Chapter 1 Hillside Village"),
    #  ("http://www.quanshuwang.com/book/0/269/78854.html", "Chapter 2 Qingniu Town")]
    for url in urls:
        # URL of the chapter
        novel_url = url[0]
        # chapter title
        novel_title = url[1]

        chapt = urllib.request.urlopen(novel_url).read()
        chapt_html = chapt.decode("gbk")
        # r marks a raw string, so "\d" does not have to be written as "\\d"
        # NOTE: this pattern is also a reconstruction; it captures the chapter text
        # that sits between the page's surrounding <script> tags
        reg = r'</script>&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<script type="text/javascript">'
        # re.S makes . match newlines as well (multi-line matching)
        reg = re.compile(reg, re.S)
        chapt_content = re.findall(reg, chapt_html)
        # print(chapt_content)  # a list, e.g.
        # ["&nbsp;&nbsp;The two fools, eyes wide open, stared straight at the thatch and mud ..."]

        # replace() takes the string to be replaced and the string to replace it with
        chapt_content = chapt_content[0].replace("&nbsp;", "")
        # print(chapt_content)  # now a plain string
        chapt_content = chapt_content.replace("<br />", "")

        print("Saving %s" % novel_title)
        # 'w' opens the file for writing text ('wb' would write bytes)
        # f = open("{}.txt".format(novel_title), 'w')
        # f.write(chapt_content)

        with open("{}.txt".format(novel_title), 'w') as f:
            f.write(chapt_content)

        # f.close()  # not needed: the with-statement closes the file

getNovertContent()
Running result: the script prints a "Saving ..." line for each chapter and writes one .txt file per chapter.
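The script assumes every request succeeds and that the whole site is served as GBK. Below is a hedged sketch of a slightly more defensive variant; the helper names fetch and crawl, and the pattern arguments, are introduced here only for illustration and are not part of the original tutorial. It adds a timeout, skips chapters whose requests fail, and pauses between requests so the crawl stays polite.

import re
import time
import urllib.request

def fetch(url, encoding="gbk"):
    # return the decoded page, or None if the request fails or times out
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode(encoding, errors="ignore")
    except OSError as exc:  # URLError and socket timeouts are both OSError subclasses
        print("skipping %s: %s" % (url, exc))
        return None

def crawl(index_url, chapter_pattern, content_pattern):
    # chapter_pattern must capture (url, title); content_pattern must capture the chapter text
    html = fetch(index_url)
    if html is None:
        return
    for chapter_url, title in re.findall(chapter_pattern, html):
        chapter_html = fetch(chapter_url)
        if chapter_html is None:
            continue
        matches = re.findall(content_pattern, chapter_html, re.S)
        if not matches:
            continue
        text = matches[0].replace("&nbsp;", "").replace("<br />", "\n")
        with open("{}.txt".format(title), "w", encoding="utf-8") as f:
            f.write(text)
        print("Saving %s" % title)
        time.sleep(1)  # be polite: pause between requests

Calling crawl() with the home-page URL, the two-group chapter pattern, and the one-group content pattern from the script above produces the same one-file-per-chapter output.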
The above is an example analysis of crawling free or paid novels with Python. Some of the points covered here may well come up in your day-to-day work, and I hope this article helps you learn more from them.