How do you use regular expressions to crawl data in Python? This article analyses the problem in detail and walks through a solution, in the hope of helping anyone facing it find a simple, workable approach.
1. The basics of regular expressions
(1) General characters match themselves literally; the pattern cat matches the substring "cat".
(2) Predefined character sets are shorthand classes such as \d (a digit), \D (a non-digit), \s (whitespace), and \w (a word character).
(3) Quantifiers control repetition: * (zero or more), + (one or more), ? (zero or one), and {m,n} (between m and n repetitions).
(4) Boundary matchers anchor a pattern: ^ matches the start of the string, $ the end, and \b a word boundary.
Note: one of the most commonly used crawling patterns, (.*?), matches any characters non-greedily, stopping as early as the rest of the pattern allows.
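A minimal sketch of the four building blocks above (the sample strings are made up for illustration, not taken from the article):

import re

print(re.findall('cat', 'cat catalog'))            # general characters: ['cat', 'cat']
print(re.findall(r'\d+', 'chapter 12, page 345'))  # \d plus a quantifier: ['12', '345']
print(re.findall(r'^\w+', 'hello world'))          # ^ anchors the start: ['hello']
print(re.search(r'<p>(.*?)</p>', '<p>a</p><p>b</p>').group(1))  # non-greedy (.*?): 'a'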
2. How to use the re module
The re module gives Python its full regular expression functionality.
Common function 1: search() matches and extracts the first piece of content that fits the pattern, returning a match object (or None when nothing matches).
Common function 2: findall() matches every piece of content that fits the pattern and returns the results as a list.
Note: findall() is the function usually used when crawling data.
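The difference in one short sketch (the sample text is invented):

import re

text = 'price: 12 yuan, weight: 34 kg'
first = re.search(r'\d+', text)   # stops at the first match
print(first.group())              # '12'
print(re.findall(r'\d+', text))   # every match as a list: ['12', '34']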
The re module also accepts modifiers (flags). The one used later in this article is re.S, which lets the . metacharacter match any character, including newlines.
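Why re.S matters for crawling, in a minimal sketch (the HTML string is invented):

import re

html = '<p>line one\nline two</p>'
print(re.findall('<p>(.*?)</p>', html))        # [] because . will not cross the newline
print(re.findall('<p>(.*?)</p>', html, re.S))  # ['line one\nline two']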
3. Case practice
Case: crawling the full text of the novel "Battle Through the Heavens" (Doupo Cangqiong)
Web link: http://www.doupoxs.com/doupocangqiong/
Crawl ideas:
(1) Open the web page and study its URL structure. Opening the first few chapters reveals the following links (a quick sketch of generating them follows this list):
http://www.doupoxs.com/doupocangqiong/1.html
http://www.doupoxs.com/doupocangqiong/2.html
http://www.doupoxs.com/doupocangqiong/3.html
Clearly, the site pages through the chapters by incrementing the number in the URL.
(2) Crawl the full text and locate the chapter content in the page source.
(3) Store the data in a TXT file.
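The URL pattern from step (1) can be generated with a list comprehension; a quick sanity check (the range endpoints here are illustrative, since the novel has far more chapters):

urls = ['http://www.doupoxs.com/doupocangqiong/{}.html'.format(i) for i in range(1, 4)]
for url in urls:
    print(url)
# http://www.doupoxs.com/doupocangqiong/1.html
# http://www.doupoxs.com/doupocangqiong/2.html
# http://www.doupoxs.com/doupocangqiong/3.html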
4. The detailed code is as follows:
import requests
import re
import time

headers = {'User-Agent': 'request header'}  # replace with a real browser User-Agent string

f = open('doupo.txt', 'a+')

def get_info(url):
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        # The tags around the capture group were lost when this article was extracted;
        # '<p>(.*?)</p>' is an assumed reconstruction matching each paragraph of a chapter.
        contents = re.findall('<p>(.*?)</p>', res.content.decode('utf-8'), re.S)
        for content in contents:
            f.write(content + '\n')
            print(content)
    else:
        pass  # silently skip pages that do not return 200

if __name__ == '__main__':
    urls = ['http://www.doupoxs.com/doupocangqiong/{}.html'.format(i) for i in range(2, 10)]
    for url in urls:
        get_info(url)
        time.sleep(1)  # pause between requests so the site is not hit too quickly
    f.close()
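Two design points worth noting: the output file is opened once in append mode ('a+'), so repeated runs keep adding chapters to doupo.txt rather than overwriting it, and time.sleep(1) spaces the requests one second apart, a basic courtesy to the target server.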
When the script runs, each chapter's text is printed to the console as it is appended to doupo.txt.
This is the answer to the question of how to use regular expressions to crawl data in Python. I hope the above content is of some help; if you still have questions, the topic rewards further reading.