2025-04-11 Update. From: SLTechnology News & Howtos (shulou)
Shulou (Shulou.com), 06/03 report
This article introduces how to implement a web crawler in Python. It should be a useful reference; interested readers can follow along, and I hope you learn a lot from it.
What is a web crawler?
Simply put, a web crawler is a program that simulates how people visit websites in order to collect valuable data. For a more formal definition, see the Baidu Encyclopedia entry.
Analyzing the crawler's requirements
Determine the goal
Crawl information about the Top 100 currently popular movies, including title, rating, director, screenwriter, starring cast, genre, country of production, language, release date, runtime, IMDb link, and so on.
Analyze the target
1. Analyze the target web page with the help of tools
First, when we open Douban's currently-showing movies page, we see 20 movies displayed, yet none of them appear anywhere in the page's HTML source. Why? Because Douban fetches the movie information via Ajax and loads the data into the page dynamically. So we first need Chrome's developer tools (the Network panel) to find the API that returns the movie information.
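Once the Ajax request is located in the Network panel, its response turns out to be JSON. The sketch below shows roughly what that response looks like and how to pick out each movie; the field names (`subjects`, `title`, `rate`, `url`) are the ones the article's spider code uses later, but the sample values here are purely illustrative, not real API output.

```python
import json

# A trimmed, illustrative stand-in for the JSON that the movie API returns.
sample = '''
{"subjects": [
  {"title": "Movie A", "rate": "9.0",
   "url": "https://movie.douban.com/subject/1/"},
  {"title": "Movie B", "rate": "8.5",
   "url": "https://movie.douban.com/subject/2/"}
]}
'''

data = json.loads(sample)
for subject in data["subjects"]:
    # Each entry already carries the title and rating; the detail page
    # at subject["url"] holds the remaining fields we want to crawl.
    print(subject["title"], subject["rate"])
```

This is why the list pages never show up in the static HTML: the data arrives after page load, so the crawler should call the API directly instead of scraping the list page.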
Then analyze the details page of the movie.
Approach
Implementation
Development environment
Python 3.6
PyCharm
Main dependent library
urllib - basic network operations
lxml - parse HTML pages with XPath syntax
json - handle the JSON data returned by the API
re - regular-expression operations
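To see how lxml's XPath extraction works before reading the full program, here is a minimal sketch. The markup is a stripped-down, illustrative stand-in for the detail page (the real page wraps these fields in a `<div id="info">`, and the `v:genre` / `v:runtime` properties are the ones the article's parser queries); lxml is a third-party package installed with `pip install lxml`.

```python
from lxml import etree

# Illustrative stand-in for part of a movie detail page.
html_text = """
<div id="info">
  <span property="v:genre">Drama</span>
  <span property="v:genre">Crime</span>
  <span property="v:runtime" content="142">142 min</span>
</div>
"""

html = etree.HTML(html_text)
# Select the info block, then pull field text out of it with XPath.
info = html.xpath("//div[@id='info']")[0]
genres = '/'.join(info.xpath("./span[@property='v:genre']/text()"))
runtime = info.xpath(".//span[@property='v:runtime']/text()")[0]
print(genres)   # Drama/Crime
print(runtime)  # 142 min
```

Multi-valued fields (several genre spans) come back as a list of strings, which the crawler joins with `/`; single-valued fields are indexed with `[0]`.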
Code implementation
from urllib import request
from lxml import etree
import json
import re
import ssl

# Globally disable certificate verification
ssl._create_default_https_context = ssl._create_unverified_context


def get_headers():
    """Return the request headers
    :return:
    """
    headers = {
        'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/65.0.3325.181 Safari/537.36"
    }
    return headers


def get_url_content(url):
    """Fetch the response body of the given url
    :param url:
    :return:
    """
    content = ''
    headers = get_headers()
    res = request.Request(url, headers=headers)
    try:
        resp = request.urlopen(res, timeout=10)
        content = resp.read().decode('utf-8')
    except Exception as e:
        print('exception: %s' % e)
    return content


def parse_content(content):
    """Parse the movie detail page
    :param content:
    :return:
    """
    movie = {}
    html = etree.HTML(content)
    try:
        info = html.xpath("//div[@id='info']")[0]
        movie['director'] = info.xpath("./span[1]/span[2]/a/text()")[0]
        movie['screenwriter'] = info.xpath("./span[2]/span[2]/a/text()")[0]
        movie['actors'] = '/'.join(info.xpath("./span[3]/span[2]/a/text()"))
        movie['type'] = '/'.join(info.xpath("./span[@property='v:genre']/text()"))
        movie['initialReleaseDate'] = '/'.join(
            info.xpath(".//span[@property='v:initialReleaseDate']/text()"))
        movie['runtime'] = \
            info.xpath(".//span[@property='v:runtime']/text()")[0]

        def str_strip(s):
            return s.strip()

        def re_parse(key, regex):
            ret = re.search(regex, content)
            movie[key] = str_strip(ret[1]) if ret else ''

        # NOTE: in the original article these patterns match the Chinese
        # field labels on the Douban page; the labels appear here in
        # machine-translated English, and the IMDb pattern was truncated
        # in the source text, so its capture group is restored by analogy.
        re_parse('region', r'producer country: (.*?)')
        re_parse('language', r'language: (.*)')
        re_parse('imdb', r'IMDb link: (.*)')
    except Exception as e:
        print('parsing exception: %s' % e)
    return movie


def spider():
    """Crawl Douban's Top 100 popular movies
    :return:
    """
    recommend_movies = []
    movie_api = 'https://movie.douban.com/j/search_subjects?' \
                'type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend' \
                '&page_limit=100&page_start=0'
    content = get_url_content(movie_api)
    json_dict = json.loads(content)
    subjects = json_dict['subjects']
    for subject in subjects:
        content = get_url_content(subject['url'])
        movie = parse_content(content)
        movie['title'] = subject['title']
        movie['rate'] = subject['rate']
        recommend_movies.append(movie)
    print(len(recommend_movies))
    print(recommend_movies)


if __name__ == '__main__':
    spider()
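Fields such as country of production and language are not wrapped in dedicated tags, which is why the parser falls back to regular expressions over the raw page text, and why the `if ret else ''` guard matters: a missing label must not crash the whole parse. A self-contained sketch of that fallback pattern (English labels and sample text here are purely illustrative; the helper also takes `movie` explicitly instead of closing over it):

```python
import re

# Illustrative page text; the real page uses Chinese field labels.
page_text = "Country of production: USA\nLanguage: English\n"

def re_parse(movie, key, regex):
    # Store the first capture group, stripped, or '' when the label is absent.
    ret = re.search(regex, page_text)
    movie[key] = ret[1].strip() if ret else ''

movie = {}
re_parse(movie, 'region', r'Country of production: (.*?)\n')
re_parse(movie, 'language', r'Language: (.*?)\n')
re_parse(movie, 'imdb', r'IMDb link: (.*?)\n')  # label absent -> ''
print(movie)  # {'region': 'USA', 'language': 'English', 'imdb': ''}
```

The guard turns an absent field into an empty string, so one malformed detail page only loses a field instead of aborting the crawl.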
Result: the script prints the number of movies crawled, followed by the list of movie dictionaries.
Thank you for reading. I hope this walkthrough of implementing a crawler program in Python is helpful to you.