Most people are not familiar with the topics covered in "How Python crawls housing data", so the editor has summarized them for you below. The content is detailed, the steps are clear, and it has a certain reference value. I hope you get something out of it after reading. Let's take a look at this article on how Python crawls housing data.
First, what is a crawler?
When doing big data analysis or data mining, data can be obtained from websites that publish statistics, or from literature and internal materials, but these sources often fail to meet our needs, and searching the Internet for data by hand takes too much effort. In such cases we can use crawler technology to automatically fetch the content we are interested in from the Internet and bring it back as our own data source, so that we can carry out deeper analysis and extract more valuable information. Before writing a crawler, you should first get to know the libraries it relies on, such as requests or urllib.request, which are designed for fetching data over the network.
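As a minimal sketch of what such a fetch looks like with the standard library (this snippet is illustrative and not part of the original article; the URL http://example.com and the User-Agent string are placeholders):

import urllib.request

# Fetch a single page and decode it as UTF-8 (illustrative only)
url = "http://example.com"  # placeholder URL, not a real crawl target
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(request) as response:
    html = response.read().decode("utf-8")
print(html[:200])  # print the first 200 characters of the downloaded page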
Second, usage steps
(None of the URLs in this article are real, and the code cannot be run directly as-is. Crawling other people's data without permission may be illegal, so please keep this in mind when learning about crawlers!)
1. Import the library
The code is as follows (example):
import os
import urllib.request
import random
import time


class BeikeSpider:
    def __init__(self, save_path="./beike"):
        """Beike crawler constructor
        :param save_path: directory in which web pages are saved
        """

2. Read in data
The code is as follows:
        # URL pattern
        self.url_mode = "http://{}.***.com/loupan/pg{}/"
        # cities to be crawled
        self.cities = ["cd", "sh", "bj"]
        # number of pages crawled per city
        self.total_pages = 20
        # let the crawler sleep for a random 5-10 seconds between requests
        self.sleep = (5, 10)
        # root directory for saving downloaded pages
        self.save_path = save_path
        # set the User-Agent so the crawler is disguised as a browser
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"}
        # proxy IP information
        self.proxies = [
            {"https": "123.163.67.50:8118"},
            {"https": "58.56.149.198:53281"},
            {"https": "14.115.186.161:8118"},
        ]
        # create the save directory if it does not exist
        if not os.path.exists(self.save_path):
            os.makedirs(self.save_path)

    def crawl(self):
        """Perform the crawl task
        :return: None
        """
These URLs are where the crawler requests its data from over the network.
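For illustration, the placeholder pattern above expands into one URL per city and page. A small sketch of that expansion, reusing the same url_mode and city codes as the class (the domain remains a placeholder):

# Illustrative sketch: how url_mode.format() produces per-page URLs
url_mode = "http://{}.***.com/loupan/pg{}/"   # placeholder pattern from the article
for city in ["cd", "sh", "bj"]:
    for page in range(1, 3):
        print(url_mode.format(city, page))
# e.g. http://cd.***.com/loupan/pg1/, http://cd.***.com/loupan/pg2/, ...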
3. Randomly select an IP address to build a proxy server

        for city in self.cities:
            print("City being crawled:", city)
            # the pages of each city are stored in a separate directory
            path = os.path.join(self.save_path, city)
            if not os.path.exists(path):
                os.makedirs(path)

            for page in range(1, self.total_pages + 1):
                # build the complete url
                url = self.url_mode.format(city, page)
                # build a Request object holding the url and the request headers
                request = urllib.request.Request(url, headers=self.headers)
                # randomly select a proxy IP
                proxy = random.choice(self.proxies)
                # build the proxy handler
                proxy_handler = urllib.request.ProxyHandler(proxy)
                # build the opener
                opener = urllib.request.build_opener(proxy_handler)
                # open the web page using the opener we just built
                response = opener.open(request)
                html = response.read().decode("utf-8")
                # file name to save as (including the path)
                filename = os.path.join(path, str(page) + ".html")
                # save the web page
                self.save(html, filename)
                print("Page %d saved successfully!" % page)
                # random sleep
                sleep_time = random.randint(self.sleep[0], self.sleep[1])
                time.sleep(sleep_time)
In addition to randomly selecting proxy IP addresses, this step also limits how fast the data is crawled, to avoid hammering the target site.
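The loop above does not handle failed requests. As a hedged sketch (not part of the original article), one could wrap the proxied request in a small helper, here called fetch_with_proxy, that retries with a different proxy and sleeps between attempts:

import random
import time
import urllib.error
import urllib.request

def fetch_with_proxy(url, proxies, headers, retries=3, sleep_range=(5, 10)):
    """Illustrative helper (not from the original article): try up to `retries`
    randomly chosen proxies, sleeping between attempts to avoid rapid retries."""
    for _ in range(retries):
        proxy = random.choice(proxies)
        opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
        request = urllib.request.Request(url, headers=headers)
        try:
            with opener.open(request, timeout=10) as response:
                return response.read().decode("utf-8")
        except OSError:
            # URLError, timeouts, and socket errors all derive from OSError:
            # wait a random interval, then try another proxy
            time.sleep(random.randint(*sleep_range))
    return None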
4. Run the code

    def save(self, html, filename):
        """Save a downloaded web page
        :param html: web page content
        :param filename: name of the file to save as
        :return: None
        """
        f = open(filename, "w", encoding="utf-8")
        f.write(html)
        f.close()

    def parse(self):
        """Parse the web page data
        :return: None
        """
        pass


if __name__ == "__main__":
    spider = BeikeSpider()
    spider.crawl()
After running, each crawled page is saved as an HTML file in the corresponding city folder under the save directory.
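The parse method in the class above is left as an empty placeholder. As a minimal sketch of what it might do, assuming you only want the <title> of each saved page (the real page structure is not shown in this article), the standard library's html.parser is enough:

import os
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Illustrative parser: collects the text inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Walk the saved pages and print each page's title
# (directory layout as produced by the class above)
for root, _, files in os.walk("./beike"):
    for name in files:
        if name.endswith(".html"):
            parser = TitleParser()
            with open(os.path.join(root, name), encoding="utf-8") as f:
                parser.feed(f.read())
            print(os.path.join(root, name), "->", parser.title.strip())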
That covers the content of this article on "how Python crawls housing data". I hope what has been shared here is helpful to you. If you want to learn more about related topics, please follow the industry information channel.