
How to use Python to crawl all the plots of a TV series

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Most readers are not familiar with the knowledge points covered in "How to use Python to crawl all the plots of a TV series", so the editor has summarized the following content for you. The content is detailed, the steps are clear, and it has real reference value. I hope you gain something from it; let's take a look at "How to use Python to crawl all the plots of a TV series".

[sample code]

# coding=utf-8
# @Author: Pengge
# @Date: 2019-8-7

from bs4 import BeautifulSoup
import requests
import getheader  # the author's own helper module (not shown here); returns request headers


# Get the title of each episode and the relative URL of its plot page
def get_title():
    url = "https://www.tvsou.com/storys/0d884ba0dd/"
    headers = getheader.getheaders()
    r = requests.get(url, headers=headers)
    r.encoding = "utf-8"
    soup = BeautifulSoup(r.text, "lxml")
    temps = soup.find("ul", class_="m-l14 clearfix episodes-list teleplay-lists").find_all("li")
    tempurllist = []
    titlelist = []
    for temp in temps:
        tempurl = temp.a.get("href")
        title = temp.a.get("title")
        tempurllist.append(tempurl)
        titlelist.append(title)
    return tempurllist, titlelist


# Download all plots of "The Longest Day in Chang'an" from episode x onward (defaults to episode 1)
def Changan(episode=1):
    tempurllist_b, titlelist_b = get_title()
    tempurllist = tempurllist_b[(episode - 1):]
    titlelist = titlelist_b[(episode - 1):]
    baseurl = "https://www.tvsou.com"
    for i, tempurl in enumerate(tempurllist):
        print("Downloading episode {0}".format(str(i + episode)))
        url = baseurl + tempurl
        r = requests.get(url, headers=getheader.getheaders())
        r.encoding = "utf-8"
        soup = BeautifulSoup(r.text, "lxml")
        result = soup.find("pre", class_="font-16 color-3 mt-20 pre-content").find_all("p")
        content = []
        for temp in result:
            if temp.string:
                content.append(temp.string)
        with open("test.txt", "a") as f:
            f.write(titlelist[i] + "\n")
            f.writelines(content)
            f.write("\n")


if __name__ == "__main__":
    Changan(43)
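Note that the sample code imports getheader, the author's own helper module, which is not included in the article. A minimal stand-in (an assumption about what it does, not the author's actual module) only needs to return a headers dict with a browser User-Agent:

# getheader.py -- hypothetical stand-in for the author's helper module
def getheaders():
    # Return request headers with a common browser User-Agent so the site
    # does not reject the request as coming from a script.
    return {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36"
        )
    }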

[the effect is as follows]

[knowledge points]

1. How to automatically obtain the URL address of each episode?

First inspect the crawled content of the first episode's page; the response contains an entry for every episode, as shown below:

As you can see from this response, each episode corresponds to an href, and the first episode's URL "https://www.tvsou.com/storys/0d884ba0dd/" contains a path that matches its href. Checking the second episode's URL confirms that it is likewise built from the corresponding href. This gives an automatic way to obtain every episode's URL: take each href from the episode list and join it to the site's base URL, as in the short sketch below.
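A minimal sketch of this step, reusing the page URL and class names from the sample code above (the headers dict is a simple stand-in for getheader.getheaders(), and the page structure is assumed to be as described):

from bs4 import BeautifulSoup
import requests

base = "https://www.tvsou.com"
url = base + "/storys/0d884ba0dd/"
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "lxml")

# Every <li> in the episode list carries the episode title and its relative href
episodes = soup.find("ul", class_="m-l14 clearfix episodes-list teleplay-lists")
for li in episodes.find_all("li"):
    print(li.a.get("title"), base + li.a.get("href"))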

2. How to crawl the plot content of each episode?

Taking the first episode as an example, you can see a section like the following in the response.

The plot text sits inside the tag with class_="font-16 color-3 mt-20 pre-content". However, this tag contains multiple p tags, each holding one paragraph of content, so the text has to be extracted from every p tag. And since the first p tag carries no plain text (its .string is None), a non-null check is required before appending it, as shown in the sketch below.
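To make the reason for the non-null check concrete, here is a small offline sketch (the HTML string is a made-up simplification of the real page): a p tag that contains no plain text, such as one wrapping only an image, has .string equal to None, so if temp.string: simply skips it.

from bs4 import BeautifulSoup

# Simplified, made-up HTML: the first <p> only wraps an image, the second holds text
html = "<p><img src='poster.jpg'/></p><p>Episode synopsis text...</p>"
soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all("p"):
    if p.string:          # the image-only <p> has .string == None and is skipped
        print(p.string)   # prints: Episode synopsis text...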

The above covers "How to use Python to crawl all the plots of a TV series"; I believe you now have a basic understanding of it. I hope the content shared by the editor is helpful to you. If you want to learn more, please follow the industry information channel.

