How to Write a Basic Web Crawler in Python

Updated 2025-03-17. From: SLTechnology News&Howtos


Many newcomers are unsure how to write a basic web crawler in Python. This article walks through a simple implementation in detail for readers who need one.

Python is a powerful general-purpose programming language with object-oriented support, and its feature set makes it convenient for developers to apply. Here, let's take a look at how a web crawler can be implemented in Python.

Today I came across a web page I wanted to read. Because I connect to the Internet at home over a phone line, staying online to read it was inconvenient and costly, so I wrote a simple program to grab the pages and read them offline, saving on the phone bill :) The home page of this site links only to pages in the same directory, so the structure is very simple: there is just one layer. For that reason I hard-coded the handling of the link addresses.
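The one-layer structure described above, where the home page links only to pages in its own directory, can also be checked in code rather than assumed. The following is a minimal sketch of that idea for Python 3; the helper name `same_directory_links` and the sample links are hypothetical and are not part of the original program.

```python
from urllib.parse import urljoin, urlparse


def same_directory_links(base_url, hrefs):
    """Return absolute URLs for hrefs that sit directly in base_url's directory.

    Hypothetical helper for illustration; assumes base_url ends with '/'.
    """
    base = urlparse(base_url)
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.netloc != base.netloc:
            continue  # link leaves the site
        if not parsed.path.startswith(base.path):
            continue  # link leaves the directory
        remainder = parsed.path[len(base.path):]
        if remainder and "/" not in remainder:
            result.append(absolute)  # exactly one layer below the home page
    return result


base = "http://www.sinc.sunysb.edu/Clubs/buddhism/JinGangJingShuoShenMo/"
# Sample hrefs are made up to show one accepted and two rejected links.
links = same_directory_links(
    base, ["page1.html", "../other.html", "http://example.com/x.html"]
)
print(links)
```

Only the first sample link survives the filter. With a check like this, simple concatenation of the directory URL and the link name is safe for the links that remain.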

The Python code for the web crawler is as follows:

    #!/usr/bin/env python
    # -*- coding: GBK -*-
    # Note: this is Python 2 code. sgmllib, urllib.urlopen, file() and the
    # print statement are not available in Python 3.
    import urllib
    from sgmllib import SGMLParser


    class URLLister(SGMLParser):
        """Collect the href value of every <a> tag."""

        def reset(self):
            SGMLParser.reset(self)
            self.urls = []

        def start_a(self, attrs):
            href = [v for k, v in attrs if k == 'href']
            if href:
                self.urls.extend(href)


    # Download the home page and save it locally.
    url = r'http://www.sinc.sunysb.edu/Clubs/buddhism/JinGangJingShuoShenMo/'
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()
    # print htmlSource
    f = file('jingangjing.html', 'w')
    f.write(htmlSource)
    f.close()

    # The linked pages live in the same directory, so the base path is hard-coded.
    mypath = r'http://www.sinc.sunysb.edu/Clubs/buddhism/JinGangJingShuoShenMo/'

    # Extract the links from the home page, then fetch and save each one.
    parser = URLLister()
    parser.feed(htmlSource)
    for url in parser.urls:
        myurl = mypath + url
        print "get:" + myurl
        sock2 = urllib.urlopen(myurl)
        html2 = sock2.read()
        sock2.close()
        # Save to file
        print "save as:" + url
        f2 = file(url, 'w')
        f2.write(html2)
        f2.close()
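The listing above targets Python 2: `sgmllib`, `urllib.urlopen`, the `file()` builtin and the `print` statement are all unavailable in Python 3. A rough Python 3 equivalent might look like the sketch below. It is not the author's code: the function names `fetch` and `crawl_one_level`, the `out_dir` and `fetcher` parameters, and the choice to decode the page as GBK are assumptions made for illustration (the original only declares GBK for the source file itself).

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class URLLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def fetch(url):
    """Download url and return the raw bytes."""
    with urlopen(url) as sock:
        return sock.read()


def crawl_one_level(base_url, index_name="jingangjing.html", out_dir=".",
                    fetcher=fetch):
    """Save the home page and every page it links to, one layer deep.

    Hypothetical function; fetcher is injectable so the logic can be
    exercised without network access.
    """
    html = fetcher(base_url)
    with open(os.path.join(out_dir, index_name), "wb") as f:
        f.write(html)

    parser = URLLister()
    # Assumption: the page is GBK-encoded; undecodable bytes are replaced.
    parser.feed(html.decode("gbk", errors="replace"))
    parser.close()

    saved = []
    for href in parser.urls:
        page_url = urljoin(base_url, href)
        # Use only the last path component as the local file name.
        name = os.path.basename(urlparse(page_url).path)
        if not name:
            continue  # e.g. "#top" or a link back to the directory itself
        print("get:", page_url)
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(fetcher(page_url))
        print("save as:", name)
        saved.append(name)
    return saved
```

Calling `crawl_one_level()` with the directory URL from the listing above would reproduce the original behavior. Writing the files in binary mode avoids guessing the encoding when saving, and taking only the last path component keeps a link such as `sub/page.html` from failing on a missing local directory.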

