Today, the editor will share the relevant knowledge points about how to crawl data with Python. The content is detailed and the logic is clear. Since most people don't yet know much about this topic, this article is shared for your reference; I hope you can get something out of it. Let's take a look.
The editor first gives a piece of sample code (taking the Dangdang five-star praise list TOP500 as an example); the introduction that follows is based on this code.
import requests                                   # requests library for page requests
from requests.exceptions import RequestException  # catch exceptions a request may raise
import re                                         # re library for regular matching
import json                                       # json for JSON format conversion


def get(url):
    # Encapsulate the request so a try statement can conveniently catch exceptions
    try:
        response = requests.get(url)      # use requests' get method to fetch the response
        if response.status_code == 200:   # only a 200 response is treated as success
            return response.text          # return the response body as text
        return None                       # otherwise return None
    except RequestException:
        return None


def parse(text):
    # Encapsulate the regular matching to keep the code modular.
    # The pattern (reconstructed from the original) captures the rank, cover image URL,
    # recommendation rate, publisher and rising count of each list entry.
    pattern = re.compile(
        '.*?list_num.*?>(.*?)<.*?pic.*?src="(.*?)".*?name">.*?tuijian">(.*?)<'
        '.*?publisher_info.*?title="(.*?)".*?biaosheng.*?>(.*?)<',
        re.S)                              # re.S lets "." also match newlines
    items = re.findall(pattern, text)      # run the regular matching against the passed-in text
    return items                           # return the matching results


if __name__ == "__main__":
    target_url = "http://bang.dangdang.com/books/fivestars/"  # the target url of the page to crawl
    html = get(target_url)    # fetch the whole target HTML page with the encapsulated get method
    for item in parse(html):  # match the target HTML with the encapsulated parse method, then loop over the results
        print(item)
        # Write the result to a txt file: json.dumps converts the tuple into a JSON string
        with open('book.txt', 'a', encoding='UTF-8') as f:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
The first step in crawler development: Web page analysis
The first step in crawler development is to analyze the target web page: first find out where the data you need lives. Here, using the browser's developer tools to look at the whole page structure, you find that each entry of the target data sits in an li element, so the development idea is to fetch this page and then extract the useful data from those li elements.
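To make the analysis step concrete, here is a minimal sketch (not part of the original article) that fetches the page and peeks at its li blocks. The URL comes from the sample code; whether the request succeeds without extra headers is an assumption.

import re
import requests

# Fetch the target page and count how many <li ...> blocks it contains,
# then peek at the first one to see which class names the markup uses.
html = requests.get("http://bang.dangdang.com/books/fivestars/").text
li_blocks = re.findall(r'<li[\s>].*?</li>', html, re.S)
print(len(li_blocks), "li elements found")
if li_blocks:
    print(li_blocks[0][:300])  # inspect the first block's class names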
The second step in crawler development: Data crawling
After the analysis in the first step, we have a rough idea of the crawler, so what we need to do now is fetch this page. This is where the requests library comes in: using requests' get() method, you can pull down the HTML of the target page. Once you have the target page's HTML (stored in the html string in the code), you can proceed with the next operation.
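As a hedged sketch of this step, the fetch below mirrors the sample's get() but adds a timeout and a User-Agent header; both additions, and the header value itself, are assumptions rather than part of the original code.

import requests
from requests.exceptions import RequestException

def get(url):
    # Same idea as the sample's get(): return the page text on a 200 response,
    # otherwise None. The timeout and User-Agent header are extra assumptions;
    # some sites refuse requests from the default client string.
    headers = {"User-Agent": "Mozilla/5.0"}  # hypothetical header value
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

html = get("http://bang.dangdang.com/books/fivestars/")
print(html is not None)  # True if the page was fetched successfully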
The third step in crawler development: Data processing
Use regular expressions to match the target data in the fetched HTML (that is, the useful data inside the li elements analyzed earlier) and put the matches into the items list. Once this step is done, the crawling and parsing of the data is basically finished; all that remains is to save the results.
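To show how the matching behaves without hitting the network, here is a small sketch that runs the (reconstructed) pattern from the sample code over an invented HTML snippet; the snippet and its values are made up for illustration and only mirror the class names the regex mentions.

import re

# Toy HTML that imitates one list entry: rank, cover image, recommendation
# rate, publisher and rising count.
snippet = '''
<li><div class="list_num">1.</div>
<div class="pic"><img src="http://example.com/cover.jpg"></div>
<div class="name">Some Book</div>
<span class="tuijian">99.8%</span>
<div class="publisher_info"><a title="Some Press">Some Press</a></div>
<div class="biaosheng">10</div></li>
'''

pattern = re.compile(
    '.*?list_num.*?>(.*?)<.*?pic.*?src="(.*?)".*?name">.*?tuijian">(.*?)<'
    '.*?publisher_info.*?title="(.*?)".*?biaosheng.*?>(.*?)<',
    re.S)

items = re.findall(pattern, snippet)
print(items)  # [('1.', 'http://example.com/cover.jpg', '99.8%', 'Some Press', '10')]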
The fourth step in crawler development: Data saving
The editor here uses Python's built-in file reading and writing to save the data in JSON format to a file called book.txt.
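A minimal sketch of the save step, assuming the same book.txt file and append mode as the sample code; the example item is invented, and the read-back at the end only shows that each line parses as one JSON array.

import json

# Each matched tuple is serialized as one JSON array per line
# (the "JSON Lines" style the sample code produces).
item = ('1.', 'http://example.com/cover.jpg', '99.8%', 'Some Press', '10')

with open('book.txt', 'a', encoding='UTF-8') as f:
    f.write(json.dumps(item, ensure_ascii=False) + '\n')

# Read the file back: parse one JSON value per non-empty line.
with open('book.txt', encoding='UTF-8') as f:
    records = [json.loads(line) for line in f if line.strip()]
print(records[-1])  # ['1.', 'http://example.com/cover.jpg', '99.8%', 'Some Press', '10']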
These are all the contents of the article "how to crawl data with Python". Thank you for reading! I believe you will gain a lot from this article. The editor updates different knowledge for you every day; if you want to learn more, please pay attention to the industry information channel.