How to Collect Popular Movie Information with a Python Crawler

2025-04-05 Update From: SLTechnology News & Howtos > Development


Shulou (Shulou.com) 06/02 Report --

This article introduces how to collect information on popular movies with a Python crawler. Many people have questions about this in day-to-day work, so the editor consulted various materials and put together a simple, easy-to-follow method. I hope it helps answer those questions. Now, please follow along and study!

I. Preface

The goddess finally asked me to go to the cinema with her, but she didn't know what to watch, so of course I had to prepare in advance.

II. Preparation in advance

1. Software used

Python 3.8, open source and free (this article uses 3.8 throughout)

PyCharm, simply the best editor for Python (no rebuttals accepted)

2. Modules needed

requests >>> data request module: pip install requests

parsel >>> data parsing module: pip install parsel

csv >>> saves table data (part of the standard library, nothing to install)
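Before moving on, you can verify that the modules import correctly; a minimal check (csv ships with Python, so only requests and parsel can be missing):

```python
import csv  # standard library: always available, nothing to install

print('csv module ready:', csv.__name__)

try:
    import requests  # pip install requests
    import parsel    # pip install parsel
    print('requests and parsel are installed')
except ImportError as exc:
    print('missing module:', exc.name, '- install it with pip')
```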

3. Module installation problems


1) How to install a third-party Python module:

First: press Win + R, type cmd and click OK, then enter the install command pip install <module name> (e.g. pip install requests) and press Enter.

Second: click Terminal at the bottom of PyCharm and enter the install command there.

2) Reasons an installation fails:

First: pip is not recognized as an internal command.

Solution: set the environment variables (add Python's Scripts directory to PATH).

Second: a flood of red error output ending in "read timed out".

Solution: the network connection timed out, so you need to switch to a mirror source.

Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple
Aliyun: http://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
Huazhong University of Science and Technology: http://pypi.hustunique.com/
Shandong University of Technology: http://pypi.sdutlinux.org/
Douban: http://pypi.douban.com/simple/

For example: pip3 install -i https://pypi.doubanio.com/simple/ <module name>

Third: cmd says the module is already installed, or that the install succeeded, but it still cannot be imported in PyCharm.

Solution: you may have installed multiple Python versions (Anaconda and a standalone Python each count as one), so uninstall one; or the Python interpreter in your PyCharm is not set up correctly.
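A quick standard-library check of which interpreter is actually running your code, which helps diagnose the multiple-versions problem above:

```python
import sys

# Path of the interpreter that is executing this script; compare it with
# the one configured in PyCharm (File > Settings > Project > Python Interpreter)
print(sys.executable)
# Version of that interpreter
print(sys.version)
```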

4. How do you configure the Python interpreter in PyCharm?

Select File > Settings > Project > Python Interpreter.

Click the gear icon and select Add.

Add your Python installation path.

5. How do you install plug-ins in PyCharm?

Select File > Settings > Plugins.

Click Marketplace and type the name of the plug-in you want to install; for example, for a translation plug-in, type "translation".

Select the appropriate plug-in and click Install.

After the installation succeeds, a prompt to restart PyCharm pops up; click OK, and the change takes effect after the restart.

III. Train of thought

A crawler obtains data by analyzing the content the server returns. No matter which website's data you crawl, you can follow these steps.

1. Identify the needs

What are we going to crawl? https://movie.douban.com/top250. Analyze the data we want: where can it be obtained from? (data source analysis)

Use the browser's developer tools to capture and analyze the packets; for a static web page, the data content is right in the page's source code.

2. Send a request

In the developer tools' Headers tab, check which URL the request is sent to, what kind of request it is, and which request-header parameters it carries.
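The step above can be sketched as follows; the URL pattern comes from the target page, and the User-Agent string is just one example copied from a browser's developer tools:

```python
# The User-Agent below is one example copied from a browser's developer tools.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/96.0.4664.45 Safari/537.36'
}

# Douban Top 250 shows 25 movies per page, so `start` steps 0, 25, ... 225.
urls = [f'https://movie.douban.com/top250?start={page}&filter='
        for page in range(0, 250, 25)]

print(urls[0])   # first page
print(len(urls)) # number of pages
# Sending would then be: requests.get(url=urls[0], headers=headers)
```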

3. Obtain data

Look at the content of the data returned by the server: what format is it in, and what kind of data do we want from it?

Get text data: response.text

Get the server's JSON dictionary data: response.json()

Get binary data: response.content

To save video / audio / pictures / files in a specific format, always obtain the binary data.
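A minimal sketch of the three accessors, using a tiny stand-in object instead of a live response so it runs without network access (a real requests.Response exposes the same `text`, `json()`, and `content`):

```python
import json

class FakeResponse:
    """Stand-in exposing the same three accessors as a requests.Response."""
    def __init__(self, body: bytes):
        self.content = body               # binary data (bytes)

    @property
    def text(self):                       # text data (str)
        return self.content.decode('utf-8')

    def json(self):                       # JSON parsed into a dict
        return json.loads(self.text)

resp = FakeResponse(b'{"title": "The Shawshank Redemption", "rating": 9.7}')
print(type(resp.content).__name__)  # binary: use for images / video / files
print(type(resp.text).__name__)     # text: use for HTML pages
print(resp.json()['title'])         # dict: use for JSON responses
```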

4. Parsing data

Extract the data content we want.

5. Save data

Save locally
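Saving locally with csv.DictWriter can be sketched like this (hypothetical sample rows and filename; the full script in the code section writes one row per movie):

```python
import csv

# Hypothetical sample rows; the real script builds one dict per movie.
rows = [
    {'movie name': 'The Shawshank Redemption', 'rating': '9.7'},
    {'movie name': 'Farewell My Concubine', 'rating': '9.6'},
]

# newline='' stops csv from writing blank lines between rows on Windows;
# the article's script uses mode='a' to append instead of overwrite.
with open('douban_data.csv', mode='w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['movie name', 'rating'])
    writer.writeheader()   # write the column-name row once
    writer.writerows(rows)

with open('douban_data.csv', encoding='utf-8') as f:
    print(f.read())
```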

IV. Code part

```python
import requests  # data request module: pip install requests
import parsel    # data parsing module: pip install parsel
import csv       # saves table data (standard library, nothing to install)

f = open('douban_data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'movie name', 'director', 'starring', 'year', 'country',
    'movie type', 'number of comments', 'rating', 'overview', 'details page',
])
csv_writer.writeheader()

for page in range(0, 250, 25):
    url = f'https://movie.douban.com/top250?start={page}&filter='
    # The request headers disguise the Python code so the server does not
    # recognize it as a crawler; User-Agent is the browser's basic identity,
    # copied straight from the developer tools.
    # A wolf in sheep's clothing: wolf >>> python, sheepskin >>> headers,
    # sheepfold >>> the server's data.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/96.0.4664.45 Safari/537.36'
    }
    # Send a GET request for the url, carrying the request headers, and
    # receive the returned data in the response variable.
    response = requests.get(url=url, headers=headers)
    # print(response.text)

    # Parse the data: re / css selectors / xpath all work, use whichever is
    # more convenient. First convert the web-page string response.text into
    # a Selector object.
    selector = parsel.Selector(response.text)
    # css selectors extract data based on tag attributes.
    # First pass: get all the li tags.
    lis = selector.css('.grid_view li')                   # css selector syntax
    # lis = selector.xpath('//*[@class="grid_view"]/li')  # xpath equivalent

    # lis behaves like a list; extract its elements one by one.
    for li in lis:
        try:
            # span:nth-child(1) selects the first span tag; ::text takes its text
            title = li.css('.hd a span:nth-child(1)::text').get()
            href = li.css('.hd a::attr(href)').get()  # details page
            # xpath equivalent: li.xpath('.//*[@class="hd"]/a/span[1]/text()').get()
            # get() returns the first match as a string; getall() returns all
            # matches as a list.
            move_info = li.css('.bd p::text').getall()
            # Take values by list index; the labels below ('导演: ', '主演: ',
            # '人评价') are the Chinese labels that appear on the Douban page.
            actor_list = move_info[0].strip().split('   ')
            date_list = move_info[1].strip().split('/')
            director = actor_list[0].replace('导演: ', '').strip()  # director
            actor = (actor_list[1].replace('主演: ', '')
                     .replace('/', '').replace('...', ''))          # starring
            date = date_list[0].strip()       # year
            country = date_list[1].strip()    # country
            move_type = date_list[2].strip()  # movie type
            comment = (li.css('.star span:nth-child(4)::text').get()
                       .replace('人评价', ''))  # number of comments
            star = li.css('.star span:nth-child(2)::text').get()  # rating
            world = li.css('.inq::text').get()  # overview
            # String methods used here: replace() substitutes text, strip()
            # removes the spaces at both ends, split() cuts a string into a list.
            dit = {
                'movie name': title,
                'director': director,
                'starring': actor,
                'year': date,
                'country': country,
                'movie type': move_type,
                'number of comments': comment,
                'rating': star,
                'overview': world,
                'details page': href,
            }
            csv_writer.writerow(dit)
            print(title, director, actor, date, country, move_type,
                  comment, star, world, href, sep=' | ')
        except:
            pass

f.close()
```

So far, the study of "How a Python crawler collects popular movie information" is over. I hope it has helped resolve your doubts.
Pairing theory with practice will help you learn better, so go and give it a try! If you want to continue learning more related knowledge, please keep following this site; the editor will keep working hard to bring you more practical articles!
