How to use a Python crawler to crawl an American TV series website


Many newcomers are not clear about how to use a Python crawler to crawl an American TV series website. To help solve this problem, the article below walks through the process in detail; anyone who needs it is welcome to follow along, and I hope you get something out of it.

Crawling an American TV series website!

[Preface]

I have always had the habit of watching American TV series, partly to practice my English listening and partly to pass the time. I used to be able to watch them online on video sites, but since the State Administration of Radio, Film and Television issued its restriction order, imported American and British dramas no longer seem to be updated in sync the way they used to be. Still, as a die-hard fan, how could I give up following my shows? So I searched around online and found an American TV series download site whose resources can be grabbed with Xunlei, and I downloaded all kinds of things from it. Recently I have been hooked on BBC's high-definition documentaries; the nature footage is so beautiful I can't get enough of it.

Even though I had found a resource site to download from, I still had to open the browser, type in the address, find the show, and click its link every time I wanted to download something. Over time the process felt tedious, and sometimes the site simply wouldn't load, which was a further annoyance. So in today's Python tutorial I want to share something practical: grab all the American TV series links on the site and save them into a text document, so that for whichever show you want, you can just copy its link into Xunlei and download it directly.

At first my plan was actually to take the URLs I found, open them with requests, scrape the download links, and crawl the whole site starting from the home page. But there were a lot of duplicate links, and the site's URLs were not as regular as I had expected; after a long time I still had not managed to write the kind of link-following crawler I wanted.
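For reference, the link-following crawler I had in mind would look roughly like the hypothetical sketch below: start from the home page, pull article links out of each page, and keep a visited set so duplicate links are fetched only once. The start URL, page limit, and archive-link pattern are my own illustrative choices, not the code I ended up with.

# Hypothetical sketch of the link-following ("divergent") crawler idea,
# kept here only to show why duplicate links need a visited set.
import re
import requests

def spread_crawl(start_url='http://cn163.net/', max_pages=50):
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=3).text
        except requests.RequestException:
            continue
        # follow only archive pages such as http://cn163.net/archives/24016/
        queue.extend(re.findall(r'http://cn163\.net/archives/\d+/', html))
    return seen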

Later I noticed that each show's links live on an article page, and the article URL ends with a number, like http://cn163.net/archives/24016/. So, drawing on crawlers I had written before, the solution was to generate the URLs automatically: the trailing number changes from page to page and each show's is unique, so I estimated roughly how many articles there are and then used the range function to generate the numbers and build the URLs directly.

Of course, many of those URLs do not exist and the request simply fails, but that is not a problem: requests exposes the status_code returned for each request, so whenever the status code is 404 we skip that URL, and for all the others we go in and scrape the links. That takes care of the URL problem.

The following is the implementation code for the above steps.

def get_urls(self):
    try:
        for i in range(2015, 25000):
            base_url = 'http://cn163.net/archives/'
            url = base_url + str(i) + '/'
            # skip ids that do not correspond to an article
            if requests.get(url).status_code == 404:
                continue
            else:
                self.save_links(url)
    except Exception as e:
        pass

The rest went smoothly. I found similar crawlers written by others online, but they only crawled a single article, so I borrowed their regular expression. My own attempt with BeautifulSoup did not work as well as the regex, so I dropped it; there is always more to learn. The result is still not ideal, though: about half of the links are not extracted correctly, so it needs further optimization.
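One possible direction for that optimization is to let an HTML parser find the anchor tags first and only filter their href values afterwards, instead of matching the whole page with one regular expression. The sketch below is not the code used in this article; it is a minimal, hypothetical helper that assumes BeautifulSoup 4 is installed and that the download links appear as ordinary <a href="ed2k://..."> anchors on the page.

# Hypothetical alternative: collect ed2k links with an HTML parser.
import re
import requests
from bs4 import BeautifulSoup

def ed2k_links(url):
    html = requests.get(url, timeout=3).text
    soup = BeautifulSoup(html, 'html.parser')
    # any anchor whose href starts with the ed2k:// scheme
    return [a['href'] for a in soup.find_all('a', href=re.compile(r'^ed2k://'))]

Whether this actually recovers the missing half of the links depends on how the pages mark them up, so treat it as a starting point rather than a fix.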

# -*- coding: utf-8 -*-
# Python 2 code (reload/setdefaultencoding and the print statements are Python 2 only).
import requests
import re
import sys
import threading
import time

reload(sys)
sys.setdefaultencoding('utf-8')


class Archives(object):

    def save_links(self, url):
        try:
            data = requests.get(url, timeout=3)
            content = data.text
            # ed2k links whose file name carries an SxxExx episode marker and a 1024X___ resolution;
            # the tail of this pattern was truncated in the published article, so the closing )" is a reconstruction
            link_pat = r'"(ed2k://\|file\|[^"]+\.(S\d+)(E\d+)[^"]+?1024X\d{3}[^"]+?)"'
            # the HTML-tag parts of the title pattern were stripped from the published article and are not recoverable
            name_pat = re.compile(r'(.*?)', re.S)
            links = set(re.findall(link_pat, content))
            name = re.findall(name_pat, content)
            links_dict = {}
            count = len(links)
        except Exception as e:
            pass
        for i in links:
            # index each link by season*100 + episode, taken from the SxxExx groups
            links_dict[int(i[1][1:3]) * 100 + int(i[2][1:3])] = i
        try:
            # slashes in the show title would break the file name (see the note below), so replace them
            with open(name[0].replace('/', ' ') + '.txt', 'w') as f:
                print name[0]
                for i in sorted(list(links_dict.keys())):
                    f.write(links_dict[i][0] + '\n')
                print "Get links...", name[0], count
        except Exception as e:
            pass

    def get_urls(self):
        try:
            for i in range(2015, 25000):
                base_url = 'http://cn163.net/archives/'
                url = base_url + str(i) + '/'
                if requests.get(url).status_code == 404:
                    continue
                else:
                    self.save_links(url)
        except Exception as e:
            pass

    def main(self):
        # a single thread that is started and joined immediately gives no real parallelism;
        # the original also called self.get_urls() here instead of passing the method
        thread1 = threading.Thread(target=self.get_urls)
        thread1.start()
        thread1.join()


if __name__ == '__main__':
    start = time.time()
    a = Archives()
    a.main()
    end = time.time()
    print end - start

The code above is the full version. It also uses multithreading, but that feels useless, because of Python's GIL. There seem to be more than 20,000 article IDs; I thought grabbing them all would take ages, but excluding the bad URLs and the pages that did not match, the whole crawl took less than 20 minutes. I had also wanted to use Redis to crawl from two Linux machines, but after a lot of fiddling it did not feel necessary, so I left it at that; I will do it later when I actually need to scrape at a larger scale.
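As an aside, the GIL mainly hurts CPU-bound work; while a thread is waiting on the network it releases the GIL, so a small thread pool fetching several archive ids at once could still shorten a crawl like this. The Python 3 sketch below is my own illustration, not part of the original code, and the pool size and id range are arbitrary choices.

# Hypothetical sketch: fetch archive pages concurrently with a thread pool.
# Network waits release the GIL, so the requests can overlap.
from concurrent.futures import ThreadPoolExecutor
import requests

BASE_URL = 'http://cn163.net/archives/'

def fetch(i):
    url = BASE_URL + str(i) + '/'
    try:
        r = requests.get(url, timeout=3)
        return url if r.status_code != 404 else None   # the real crawler would call save_links(url) here
    except requests.RequestException:
        return None

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=10) as pool:
        found = [u for u in pool.map(fetch, range(2015, 25000)) if u]
    print(len(found), 'pages found')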

Another excruciating problem I ran into along the way was saving the file names. I have to complain here: a txt file name can contain spaces, but not slashes and various other special characters. That was the whole problem, and I spent an entire morning on it. At first I thought my scraped data was wrong, and only after checking for a long time did I discover that one show's title contained a slash, which caused me no end of grief. Be careful when you try this yourself. Good luck!
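If you run into the same problem, one way around it is to sanitize the title before using it as a file name, rather than replacing only the slash. The small helper below is my own sketch; the set of characters it strips is my choice, aimed at covering both Windows and Linux.

import re

def safe_filename(title):
    # Replace characters that commonly break file names (/ \ : * ? " < > |) with a space,
    # then collapse repeated whitespace.
    cleaned = re.sub(r'[\\/:*?"<>|]', ' ', title)
    return re.sub(r'\s+', ' ', cleaned).strip()

# e.g. safe_filename('Planet Earth / Season 2') -> 'Planet Earth Season 2'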

