How python crawls movies and downloads them 07/19 Update SLTechnology News&Howtos

How python crawls movies and downloads them

2025-07-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/03 Report--

This article is about how python crawls movies and downloads them. The editor thinks it is very practical, so share it with you as a reference and follow the editor to have a look.

I. Overview

For an otaku, like to watch movies, every time you open the movie website, all kinds of pop-up ads are very troublesome, or you still have to copy and download links to Xunlei and paste and download them, and there is also choice difficulty in this process; this series of actions are very uncomfortable, so it's better to have a good one and just light it. As a python enthusiast, combined with a little knowledge of crawlers, I spent some time on the weekend writing a crawl for the latest movie section on a movie website using python.

Train of thought:

For a movie website, the crawler collects movie names, download links, ratings, and other information; specially prints out movies that are updated that day; at the same time, it calls Thunderbolt to download through ratings, of course, to judge whether it has already been downloaded, and then decide whether to download it; then, you can watch it.

This version is based on python3.x, and you can only call Xunlei ~ linux platform on windows to get relevant information!

Python installation and related module installation are not described here, if you do not understand, please leave me a message.

Run on jupyter as follows:

II. Code

Don't talk more nonsense. Let's talk about the code.

# coding:utf-8# version 20181027 by sanimport re,time,osfrom urllib import requestfrom lxml import etree # python xpath alone using imported import platformimport sslssl._create_default_https_context = ssl._create_unverified_context # cancel global certificate # crawler movies such as class getMovies: def _ _ init__ (self,url Thuder):''instance initialization' 'self.url = url self.Thuder = Thuder def getResponse (self Url): url_request = request.Request (self.url) url_response = request.urlopen (url_request) return url_response # return this object def newMovie (self):''get the latest movie download address and url' 'http_response = self.getResponse (webUrl) # the context object (HTTPResponse object) data after getting the http request = http_response.read (). Decode ('gbk') # print (data) # get web content html = etree.HTML (data) newMovies = dict () lists = html.xpath (' / html/body/div [1] / div/div [3] / div [2] / div [2] / div [1] / div/div [2] / div [2] / ul/table//a') For k in lists: if "app.html" in k.items () [0] [1] or "latest movie download" in k.text: continue else: movieUrl = webUrl + k.items () [0] [1] movieName = k.text.split ('") [1] .split (") [0] NewMovies [k.text.split ('") [1] .split (") [0]] = movieUrl = webUrl + k.items () [0] [1] return newMovies def Movieurl (self Url):''get score and update time' 'url_request = request.Request (url) movie_http_response = request.urlopen (url_request) data = movie_http_response.read (). Decode (' gbk') if len (re.findall (r'Douban score. +. + users',data)): # get the score Return null pingf = re.findall (r 'Douban score. +?. + users',data) [0] .split (' /') [0] .replace ("\ u3000", ":") else: pingf = "Douban score: null" desc = re.findall (r 'simple\ s + introduction. *', data) [0] .replace ("\ u3000", ") .replace ('' "). Split (" src ") [0]. Replace ('& ldquo',"). Replace ('& rdquo',"). Replace ('

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.