

How to Crawl Tencent Video with Python


This article explains how to crawl Tencent Video with Python. It goes into a fair amount of detail and should be a useful reference; interested readers are encouraged to read to the end.

Project address:

https://github.com/yangrq1018/vqq-douban-film

Dependencies

The following Python packages are required:

requests

bs4 (Beautiful Soup)

pandas

That's all. There is no need for a complex crawler framework; these simple, common packages are enough.
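The article does not show how to install them; one common way (lxml is added here because the parsing code below passes 'lxml' to BeautifulSoup) would be:

pip install requests beautifulsoup4 pandas lxml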

Crawl movie information

First, looking at the movie channel, we find that it is loaded asynchronously. Using the Network tab in the developer tools of Firefox (or Chrome), we can filter for likely API endpoints, and quickly discover that the interface URL has this format:

base_url = 'https://v.qq.com/x/bu/pagesheet/list?_all=1&append=1&channel=movie&listpage=2&offset={offset}&pagesize={page_size}&sort={sort}'

Here offset is the position where the requested page starts, pagesize is the number of items per page, and sort is the category; sort=21 is the "Douban praise" category we want. pagesize cannot exceed 30: if it is larger, only 30 elements are returned, and if it is smaller, the specified number of elements is returned.

# the URLs are long and get truncated by pandas; to display them in full you will need
# pd.set_option('display.max_colwidth', -1)
base_url = 'https://v.qq.com/x/bu/pagesheet/list?_all=1&append=1&channel=movie&listpage=2&offset={offset}&pagesize={page_size}&sort={sort}'

# the "Douban praise" category
DOUBAN_BEST_SORT = 21

NUM_PAGE_DOUBAN = 167

A small loop shows that the "Douban praise" category has 167 pages in total, with 30 elements per page.
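The article does not show that loop. A minimal sketch of what it might look like, assuming we simply keep requesting pages until a short or empty page comes back (the function name and stopping condition are mine, not from the project):

import requests
import bs4

def count_pages(page_size=30, sort=21):
    # uses the base_url format string defined above
    n_pages = 0
    while True:
        url = base_url.format(offset=n_pages * page_size, page_size=page_size, sort=sort)
        soup = bs4.BeautifulSoup(requests.get(url).content.decode('utf-8'), 'lxml')
        items = soup.find_all('div', class_='list_item')
        if not items:
            break  # past the last page
        n_pages += 1
        if len(items) < page_size:
            break  # short page, so this was the last one
    return n_pages  # the article reports 167 pages of 30 items each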

We use the requests library to request the page. get_soup requests the elements on page page_idx and parses response.content with BeautifulSoup into a DOM-like object, in which we can easily find the elements we need. Each movie entry is contained in a div with class list_item, so we also write a helper that extracts all such divs and returns them as a list.

def get_soup(page_idx, page_size=30, sort=DOUBAN_BEST_SORT):
    url = base_url.format(offset=page_idx * page_size, page_size=page_size, sort=sort)
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.content.decode('utf-8'), 'lxml')
    return soup

def find_list_items(soup):
    return soup.find_all('div', class_='list_item')

We iterate over every page and return one list containing the bs4-parsed entry elements of all movies.

def douban_films():
    rel = []
    for p in range(NUM_PAGE_DOUBAN):
        print('Getting page {}'.format(p))
        soup = get_soup(p)
        rel += find_list_items(soup)
    return rel

This is the HTML code for one of the movies:

(The raw HTML tags of the snippet are not preserved here; the list_item div contains the title 霸王别姬 / Farewell My Concubine, the cast Zhang Guorong, Zhang Fengyi, Gong Li and Ge You, and a play count of 46.71 million.)

It is not hard to see that the title, playback URL, cover, score, cast, VIP requirement and play count of the film Farewell My Concubine are all in this div. In an interactive environment like ipython, it is easy to work out how to extract them with bs. One trick I use: keep a spyder.py file open, write the required functions in it, turn on ipython's module autoreload, and copy the code into the file after debugging it in the console; the functions in ipython are then updated automatically. This is much more convenient than editing code inside ipython. To turn on ipython's autoreload:

%load_ext autoreload
%autoreload 2  # reload all modules every time before executing Python code
%autoreload 0  # disable automatic reloading

This parse_films function extracts information using two common methods in bs (a short illustration follows the list):

find

find_all
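The difference matters in parse_films: find returns the first matching tag (or None), while find_all returns a list of every match. A toy illustration with made-up HTML, not taken from the article:

import bs4

html = '<div class="list_item"><a class="figure_title" title="霸王别姬"></a><div class="figure_score">9.6</div></div>'
soup = bs4.BeautifulSoup(html, 'lxml')

first = soup.find('a', class_='figure_title')  # first match, or None if nothing matches
divs = soup.find_all('div')                    # list of every matching tag
print(first['title'])  # 霸王别姬
print(len(divs))       # 2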

Because Douban's API has shut down its search endpoint, and the crawler gets caught by its anti-crawling measures, the feature of attaching Douban scores was eventually dropped.

OrderedDict accepts a list of (key, value) pairs and remembers the order of the keys, which is useful later when we export to a pandas DataFrame.
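As a small aside with toy data of my own (not from the article), the DataFrame constructor keeps the column order given by the OrderedDict keys:

from collections import OrderedDict
import pandas as pd

rows = [
    OrderedDict([('title', 'A'), ('vqq_score', 9.1), ('need_vip', False)]),
    OrderedDict([('title', 'B'), ('vqq_score', 8.7), ('need_vip', True)]),
]
df = pd.DataFrame(rows)
print(list(df.columns))  # ['title', 'vqq_score', 'need_vip']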

def parse_films(films):
    '''films is a list of `bs4.element.Tag` objects'''
    rel = []
    for i, film in enumerate(films):
        title = film.find('a', class_="figure_title")['title']
        print('Parsing film %d: ' % i, title)
        link = film.find('a', class_="figure")['href']
        img_link = film.find('img', class_="figure_pic")['src']
        # test if need VIP
        need_vip = bool(film.find('img', class_="mark_v"))
        score = getattr(film.find('div', class_='figure_score'), 'text', None)
        if score:
            score = float(score)
        cast = film.find('div', class_="figure_desc")
        if cast:
            cast = cast.get('title', None)
        play_amt = film.find('div', class_="figure_count").get_text()
        # db_score, db_link = search_douban(title)
        # Store the key order
        dict_item = OrderedDict([
            ('title', title),
            ('vqq_score', score),
            # ('db_score', db_score),
            ('need_vip', need_vip),
            ('cast', cast),
            ('play_amt', play_amt),
            ('vqq_play_link', link),
            # ('db_discuss_link', db_link),
            ('img_link', img_link),
        ])
        rel.append(dict_item)
    return rel

Export

Finally, we call the written function and run it in the main program.

The parsed objects, a list of dictionaries, can be passed directly to the DataFrame constructor. We sort by score so the highest scores come first, and then convert the playback link into an HTML link tag, which looks nicer and can be opened directly.

Note that CSV files generated by pandas have long had compatibility problems with Excel: Chinese characters come out garbled. The solution is to use the utf_8_sig encoding, which lets Excel decode them correctly.

pickle is a very powerful Python serialization library that can save Python objects to a file and load them back later. We save our DataFrame as a .pkl file. Calling DataFrame's to_html method saves an HTML file; be careful to set escape to False, otherwise the hyperlinks cannot be opened directly.
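The article only writes the files out; as a minimal sketch (assuming the file name used in the main program below), loading the DataFrame back in a later session could look like this:

import pandas as pd

df = pd.read_pickle('vqq_douban_films.pkl')  # restore the DataFrame saved below
print(df.head())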

if __name__ == '__main__':
    df = DataFrame(parse_films(douban_films()))
    # Sorted by score
    df.sort_values(by="vqq_score", inplace=True, ascending=False)
    # Format links as HTML tags so they render as clickable links in the HTML export
    df['vqq_play_link'] = df['vqq_play_link'].apply(lambda x: '<a href="{}">Film link</a>'.format(x))
    df['img_link'] = df['img_link'].apply(lambda x: '<img src="{}">'.format(x))
    # Chinese characters in Excel must be encoded with utf_8_sig
    df.to_csv('vqq_douban_films.csv', index=False, encoding='utf_8_sig')
    # Pickle
    df.to_pickle('vqq_douban_films.pkl')
    # HTML, render hyperlinks
    df.to_html('vqq_douban_films.html', escape=False)

Project Management

That is all for the code. Once the code is written, it should be archived for later reference; I chose to put it on GitHub.

GitHub provides a command-line tool called hub (not git itself, but an extension of git). macOS users can install it like this:

brew install hub

hub has much more concise syntax than git; the command we mainly use here is:

hub create -d "Create repo for our proj" vqq-douban-film

Isn't it cool to create a repo straight from the command line, without opening a browser at all? You may then be prompted to register your SSH public key (for authentication) on GitHub; if you don't have one, just generate one with ssh-keygen and copy the contents of the .pub file into your GitHub settings.
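For reference, one common way to generate such a key (the flags and paths here are my assumption, not from the article):

ssh-keygen -t ed25519 -C "your_email@example.com"
# then copy the contents of ~/.ssh/id_ed25519.pub into GitHub's SSH key settings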

In the project directory there may be __pycache__ and .DS_Store files that you do not want to track. Writing a .gitignore by hand is tedious; is there a tool for this? Of course there is. Python has a package for it.

pip install git-ignore
git-ignore python  # generates a template for python
# manually add .DS_Store

Note: this is for learning and exchange purposes only.

That is all of "How to Crawl Tencent Video with Python". Thank you for reading! I hope the content shared here helps you; for more related knowledge, feel free to follow the industry information channel.
