Shulou
2025-01-16 Update From: SLTechnology News & Howtos
Shulou(Shulou.com)06/01 Report--
This article introduces how to crawl movies with Python. Many people run into exactly this kind of trouble in practice, so let the editor walk you through how to handle these situations. I hope you read it carefully and come away with something!
Features implemented:
Crawl m3u8 segmented video from a website, decrypt the encrypted ts files, and merge the ts files in two different ways. To keep the IP from being blocked, route requests through a proxy, and finally delete the temporary files.
Environment & dependencies
Win10 64bit
IDE:Pycharm
Python 3.8
Python packages: requests + BeautifulSoup + lxml + m3u8 + pycryptodome (for AES)
Creating a project in PyCharm sets up a per-project environment to hold the interpreter and its packages, so add every package the project needs under Project Interpreter in PyCharm's settings; for this project, that means the packages listed above.
Now for the main course. The first step of crawling is to analyze the target site: find the address of the video we need, and press F12 to open the developer tools.
Unfortunately, the videos on this site are served as m3u8 segmented streams.
A quick primer: an m3u8 file is essentially a playlist, either a Media Playlist or a Master Playlist; whichever kind it is, its text is encoded in UTF-8.
When an m3u8 file is used as a Media Playlist, it records a series of media segment resources that, played in sequence, reconstruct the full multimedia stream.
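To make the Media Playlist idea concrete, here is a minimal sketch that pulls the segment list out of a playlist. The playlist text is a made-up example in the same shape as the files discussed below; real files come from the site.

```python
# A made-up Media Playlist, shaped like the article's layer-2 file
SAMPLE = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:2.000000,
clip000.ts
#EXTINF:2.000000,
clip001.ts
#EXT-X-ENDLIST
"""

def media_segments(playlist_text):
    """Return (duration, uri) pairs for each media segment in a playlist."""
    segments = []
    duration = None
    for line in playlist_text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            # "#EXTINF:2.000000," -> 2.0
            duration = float(line[len("#EXTINF:"):].rstrip(","))
        elif line and not line.startswith("#"):
            # Non-tag lines are the segment URIs themselves
            segments.append((duration, line))
    return segments

print(media_segments(SAMPLE))
# [(2.0, 'clip000.ts'), (2.0, 'clip001.ts')]
```

The m3u8 package used later does this parsing (and much more) for us; this sketch only shows what the format holds.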
OK, in line with the principle that "there are no insurmountable difficulties", we continue: still in the developer tools, switch from the Elements tab to the Network tab and filter out the unneeded entries. We find two m3u8 files, a key file, and the ts files.
Clicking each entry reveals its corresponding address.
OK, now that we have the address, we can start our data download.
First initialize: set up the paths and a disguised request header, then download all the ts files in a loop. How do we know how many iterations to run? After downloading the m3u8 file, parse it to get the list of all ts files, splice each name onto the base address, and loop over the list to fetch every segment.
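As an alternative to the manual string splicing the code below uses, the standard library's `urljoin` resolves a segment name against the playlist's URL. The URL and segment name here are made up for illustration:

```python
from urllib.parse import urljoin

# Hypothetical playlist URL and segment name, just to show the splicing rule
playlist_url = "https://example.com/500kb/hls/index.m3u8"
segment_uri = "IsZhMS5924000.ts"

# A relative segment URI resolves against the playlist's directory
segment_url = urljoin(playlist_url, segment_uri)
print(segment_url)  # https://example.com/500kb/hls/IsZhMS5924000.ts
```

This also handles absolute segment URIs correctly, which hand-rolled `"%s/%s"` splicing does not.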
Layer 1:

```
#EXTM3U
#EXT-X-STREAM-INF:...
xxxxxxxxxx.m3u8
```
Observe the data: this is not the real media path. The second-layer path can be seen on the third line; combining it with our analysis of the site's source code, we splice the string again and request:
Layer 2:

```
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-KEY:METHOD=AES-128,URI="key.key"
#EXTINF:2.000000,
IsZhMS5924000.ts
#EXTINF:2.000000,
IsZhMS5924001.ts
#EXT-X-ENDLIST
```
Then we loop over the ts list and download each clip from the spliced address. But the problem is not that simple: the ts files we download will not play. Analyzing the m3u8 file downloaded at the second layer, we find this line:
#EXT-X-KEY:METHOD=AES-128,URI="key.key"
This website encrypts all its ts files with AES, where:
METHOD=AES-128: the video is encrypted with AES-128.
URI="key.key": the address of the key.
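A small sketch of how that tag's attributes can be pulled apart. The naive comma split is enough for a simple line like this one (attribute values containing commas would need a real parser):

```python
def parse_key_line(line):
    """Split '#EXT-X-KEY:METHOD=...,URI="..."' into an attribute dict."""
    attrs = {}
    # Drop the '#EXT-X-KEY' prefix, then split the attribute list on commas
    for part in line.split(":", 1)[1].split(","):
        name, _, value = part.partition("=")
        attrs[name] = value.strip('"')  # URI values are quoted in the playlist
    return attrs

attrs = parse_key_line('#EXT-X-KEY:METHOD=AES-128,URI="key.key"')
print(attrs)  # {'METHOD': 'AES-128', 'URI': 'key.key'}
```

In the crawler below, the key URL is instead built by the site's own splicing rule, but this shows where `METHOD` and `URI` come from.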
Summing it up, this looked daunting: we still could not watch the video. But we stick to our original goal and don't give up, and fortunately Python's module ecosystem is powerful. Installing the pycryptodome package gives us the Crypto.Cipher.AES module, which solves the decryption problem.
Once decryption is done, we need to merge all the ts segments into one MP4 file. The simplest way is to open CMD in the directory holding the clips and run:

copy /b *.ts fileName.mp4
Note that all the ts files must be arranged in the correct order. In this project we use the os module to run the merge and then delete the temporary ts files directly.
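The ordering caveat matters because plain alphabetical sorting puts clip10 before clip2. A hedged sketch of a pure-Python merge that sorts clip names numerically (file names here are hypothetical, matching the `clip<N>.ts` pattern the code below produces):

```python
import os
import re
import tempfile

def merge_ts(folder, out_path):
    """Concatenate clip<N>.ts fragments in numeric order into one file."""
    def clip_no(name):
        m = re.search(r"(\d+)", name)
        return int(m.group(1)) if m else -1
    # Sort by the embedded number, not alphabetically (clip2 before clip10)
    names = sorted((n for n in os.listdir(folder) if n.endswith(".ts")), key=clip_no)
    with open(out_path, "wb") as out:
        for name in names:
            with open(os.path.join(folder, name), "rb") as frag:
                out.write(frag.read())
    return names

# Demo with fake fragments written out of order on purpose
with tempfile.TemporaryDirectory() as d:
    for i in (10, 2, 1):
        with open(os.path.join(d, "clip%d.ts" % i), "wb") as f:
            f.write(b"part%d|" % i)
    order = merge_ts(d, os.path.join(d, "merged.ts"))
    print(order)  # ['clip1.ts', 'clip2.ts', 'clip10.ts']
```

`copy /b` relies on the shell's expansion order instead, which is why zero-padded or carefully named fragments matter there.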
Complete code:
Method 1:
```python
import re
import requests
import m3u8
import time
import os
import json
from bs4 import BeautifulSoup
from Crypto.Cipher import AES  # pip install pycryptodome


class VideoCrawler():
    def __init__(self, url):
        super(VideoCrawler, self).__init__()
        self.url = url
        self.down_path = r"F:\Media\Film\Temp"
        self.final_path = r"F:\Media\Film\Final"
        self.headers = {
            'Connection': 'Keep-Alive',
            'Accept': 'text/html,application/xhtml+xml,*/*',
            'User-Agent': 'Mozilla/5.0 (Linux; U; Android 6.0; zh-CN; MZ-m2 note Build/MRA58K) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/40.0.2214.89 '
                          'MZBrowser/6.5.506 UWS/2.10.1.22 Mobile Safari/537.36'
        }

    def get_url_from_m3u8(self, readAdr):
        print("Parsing the real download address...")
        with open('temp.m3u8', 'wb') as file:
            file.write(requests.get(readAdr).content)
        m3u8Obj = m3u8.load('temp.m3u8')
        print("Parsing complete")
        return m3u8Obj.segments

    def run(self):
        print("Start!")
        start_time = time.time()
        os.chdir(self.down_path)
        html = requests.get(self.url).text
        bsObj = BeautifulSoup(html, 'lxml')
        # The element (found by class) whose script text holds the first-layer m3u8 address
        tempStr = bsObj.find(class_="iplays").contents[3].string
        firstM3u8Adr = json.loads(tempStr.strip('var player_data='))["url"]
        tempArr = firstM3u8Adr.rpartition('/')
        # Splice by the site's rule to get the second-layer, i.e. real, m3u8 address
        realAdr = "%s/500kb/hls/%s" % (tempArr[0], tempArr[2])
        # The key address follows the same splicing rule
        key_url = "%s/500kb/hls/key.key" % tempArr[0]
        key = requests.get(key_url).content
        # The video name comes from the page title element
        fileName = bsObj.find(class_="video-title w100").contents[0].contents[0]
        # Strip spaces, commas, exclamation marks etc. that would break the merge command
        fileName = re.sub(r'[\s,!]', '', fileName)
        cryptor = AES.new(key, AES.MODE_CBC, key)  # decrypt the ts clips with AES
        urlList = self.get_url_from_m3u8(realAdr)
        urlRoot = tempArr[0]
        i = 1
        for url in urlList:
            resp = requests.get("%s/500kb/hls/%s" % (urlRoot, url.uri), headers=self.headers)
            if len(key):
                with open('clip%s.ts' % i, 'wb') as f:
                    f.write(cryptor.decrypt(resp.content))
                print("Downloading clip %d" % i)
            else:
                with open('clip%s.ts' % i, 'wb') as f:
                    f.write(resp.content)
                print("Downloading clip %d" % i)
            i += 1
        print("Download complete! Total time %d s" % (time.time() - start_time))
        print("Merging next...")
        os.system('copy /b %s\\*.ts %s\\%s.mp4' % (self.down_path, self.final_path, fileName))
        print("Deleting the clip source files...")
        files = os.listdir(self.down_path)
        for filena in files:
            del_file = self.down_path + '\\' + filena
            os.remove(del_file)
        print("Clip files deleted")


if __name__ == '__main__':
    crawler = VideoCrawler("find the address yourself")
    crawler.run()
    crawler2 = VideoCrawler("find the address yourself")
    crawler2.run()
```
Method 2:
In Method 1 we download every ts clip locally and then merge, and the order can come out wrong: sometimes the decrypted video will not play, or the merged video's timeline is broken and it cannot play through. After much searching, some of these problems could not be solved cleanly, so I took a new approach. Instead of downloading all the ts clips and merging them afterwards, create a single ts file at the very start; then, on each loop iteration, append the clip's byte stream fetched from the spliced address directly to that file. If a request errors out, skip the current iteration and continue with the next. At the end we have one complete ts file and no merge step at all. Let's see how the code implements it.
Much of this code is the same as above; we only need to understand the new parts.
```python
import re
import requests
import m3u8
import time
import os
import json
import sys
import random
import uuid
from bs4 import BeautifulSoup
from Crypto.Cipher import AES  # pip install pycryptodome


class VideoCrawler():
    def __init__(self, url):
        super(VideoCrawler, self).__init__()
        self.url = url
        self.down_path = r"F:\Media\Film\Temp"
        # A site listing free proxies; if it goes away, substitute your own
        self.agency_url = 'https://www.kuaidaili.com/free/'
        self.final_path = r"F:\Media\Film\Final"
        self.headers = {
            'Connection': 'Keep-Alive',
            'Accept': 'text/html,application/xhtml+xml,*/*',
            'User-Agent': 'Mozilla/5.0 (Linux; U; Android 6.0; zh-CN; MZ-m2 note Build/MRA58K) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/40.0.2214.89 '
                          'MZBrowser/6.5.506 UWS/2.10.1.22 Mobile Safari/537.36'
        }

    def get_url_from_m3u8(self, readAdr):
        print("Parsing the real download address...")
        with open('temp.m3u8', 'wb') as file:
            file.write(requests.get(readAdr).content)
        m3u8Obj = m3u8.load('temp.m3u8')
        print("Parsing complete")
        return m3u8Obj.segments

    def get_ip_list(self, url, headers):
        # Scrape a pool of free proxy addresses from the proxy site's table
        web_data = requests.get(url, headers=headers).text
        soup = BeautifulSoup(web_data, 'lxml')
        ips = soup.find_all('tr')
        ip_list = []
        for i in range(1, len(ips)):
            tds = ips[i].find_all('td')
            ip_list.append(tds[0].text + ':' + tds[1].text)
        return ip_list

    def get_random_ip(self, ip_list):
        # Pick a random proxy from the pool for this request
        proxy_list = ['http://' + ip for ip in ip_list]
        proxy_ip = random.choice(proxy_list)
        return {'http': proxy_ip}

    def run(self):
        print("Start!")
        start_time = time.time()
        # Give every run its own temporary directory
        self.down_path = r"%s\%s" % (self.down_path, uuid.uuid1())
        if not os.path.exists(self.down_path):
            os.mkdir(self.down_path)
        html = requests.get(self.url).text
        bsObj = BeautifulSoup(html, 'lxml')
        # The element whose script text holds the first-layer m3u8 address
        tempStr = bsObj.find(class_="iplays").contents[3].string
        firstM3u8Adr = json.loads(tempStr.strip('var player_data='))["url"]
        tempArr = firstM3u8Adr.rpartition('/')
        # The third line of the first-layer m3u8 holds the second-layer path
        all_content = requests.get(firstM3u8Adr).text.split('\n')[2]
        midStr = all_content.split('/')[0]  # site-specific; find the rule for your site
        # Splice to get the second-layer, i.e. real, m3u8 address
        realAdr = "%s/%s" % (tempArr[0], all_content)
        key_url = "%s/%s/hls/key.key" % (tempArr[0], midStr)
        status = requests.head(key_url).status_code  # probe the key address
        key = ""
        if status == 200:
            all_content = requests.get(realAdr).text
            if "#EXT-X-KEY" in all_content:
                # The "#EXT-X-KEY" tag means the video is encrypted
                key = requests.get(key_url).content
        # Parse the page to get the video name; commas, exclamation marks or
        # spaces in the name would break later commands, so strip them
        self.fileName = bsObj.find(class_="video-title w100").contents[0].contents[0]
        self.fileName = re.sub(r'[\s,!]', '', self.fileName)
        iv = b'abcdabcdabcdabcd'  # AES decryption IV
        if len(key):  # a non-empty key means the stream is encrypted
            cryptor = AES.new(key, AES.MODE_CBC, iv)
        urlList = self.get_url_from_m3u8(realAdr)
        urlRoot = tempArr[0]
        i = 1
        # Create one ts file up front; each iteration appends the fragment's
        # byte stream to it, so no merge step is needed afterwards
        outputfile = open(os.path.join(self.final_path, '%s.ts' % self.fileName), 'wb')
        ip_list = self.get_ip_list(self.agency_url, self.headers)  # free proxy pool
        for url in urlList:
            try:
                proxies = self.get_random_ip(ip_list)  # random proxy for this visit
                resp = requests.get("%s/%s/hls/%s" % (urlRoot, midStr, url.uri),
                                    headers=self.headers, proxies=proxies)
                if len(key):
                    tempText = cryptor.decrypt(resp.content)  # decrypt the fragment
                    progess = i / len(urlList) * 100  # current progress as a percentage
                    outputfile.write(tempText)
                    sys.stdout.write('\rDownloading: %s, progress: %.1f%%' % (self.fileName, progess))
                    sys.stdout.flush()  # keep the progress on a single console line
                else:
                    outputfile.write(resp.content)
            except Exception as e:
                print("\nError: %s" % (e.args,))
                continue  # on error, skip this fragment and move on to the next
            i += 1
        outputfile.close()
        print("Download complete! Total time %d s" % (time.time() - start_time))
        self.del_tempfile()  # delete the temporary files

    def del_tempfile(self):
        file_list = os.listdir(self.down_path)
        for i in file_list:
            os.remove(os.path.join(self.down_path, i))
        os.rmdir(self.down_path)
        print('Temporary files deleted')


if __name__ == '__main__':
    url = input("Enter the address:\n")
    crawler = VideoCrawler(url)
    crawler.run()
    quitClick = input("Press Enter to exit!")
```
Problems and Solutions:
At first I assumed the module already existed in the computer's global Python environment, but it turned out the module had to be added to PyCharm's per-project virtual environment.
No module named Crypto.Cipher: after a lot of reading online, the fix was to install the pycryptodome package (environment: Win10).
The file name cannot contain special characters such as exclamation marks, commas, or spaces, otherwise the merge command reports an incorrect-syntax error.
The error ('Data must be padded to 16 byte boundary in CBC mode',) can occur while writing a ts file stream during the download; when it does, we simply skip the current iteration and continue with the next download.
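That padding error fires because AES-CBC can only decrypt data whose length is a multiple of the 16-byte block size, so a truncated or partial download trips it. A minimal sketch of the check:

```python
def cbc_ready(chunk):
    """AES-CBC can only decrypt data whose length is a multiple of the
    16-byte block size; truncated downloads trip the padding error."""
    return len(chunk) % 16 == 0

print(cbc_ready(b"\x00" * 32))  # True  (two full blocks)
print(cbc_ready(b"\x00" * 20))  # False (would raise the padding error)
```

An alternative to skipping the fragment would be retrying the request, since the bad length usually means the response body was cut off.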
Sometimes "Protocol Error, Connection abort, os.error" appears, presumably because crawling too frequently gets the requests blocked; we work around it with the free proxies.
That's all for "how Python crawls movies". Thank you for reading. If you want to learn more about the field, keep following the site; the editor will keep producing practical, high-quality articles for you!