How a Python Crawler Collects Popular Movie Information
This article explains how a Python crawler can collect popular movie information. Many readers have questions about this topic, so the material below is organized into a simple, easy-to-follow walkthrough. I hope it resolves your doubts; follow along and study!
I. Preface
The girl I like finally asked me to the cinema, but she didn't know what to watch, so of course I had to prepare.
II. Preparation
1. Software used
Python 3.8: open source and free (this tutorial standardizes on 3.8).
PyCharm: YYDS, the best editor for Python; no rebuttals accepted.
2. Modules needed
requests >> data request module: pip install requests
parsel >> data parsing module: pip install parsel
csv >> saves table data (part of the Python standard library, no installation needed)
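Before writing any crawler code, a minimal sketch like this can confirm the modules are actually available (csv ships with Python, so only the first two can ever be missing):

import importlib

# Try to import each module the project needs and report anything missing.
for name in ('requests', 'parsel', 'csv'):
    try:
        importlib.import_module(name)
        print(name, 'is available')
    except ImportError:
        print(name, 'is missing, run: pip install ' + name)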
3. Module installation issues
1) How to install a third-party Python module:
Method 1: press Win + R, type cmd, and click OK; then enter the installation command pip install <module name> (for example, pip install requests) and press Enter.
Method 2: click Terminal in PyCharm and enter the same installation command.
2) Reasons an installation can fail:
Failure 1: pip is not recognized as an internal command.
Solution: set the environment variables (add Python to PATH).
Failure 2: a pile of red error messages ending in "read time out".
Solution: the network connection timed out; switch to a mirror source, for example:
Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple
Aliyun: http://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
Huazhong University of Science and Technology: http://pypi.hustunique.com/
Shandong University of Technology: http://pypi.sdutlinux.org/
Douban: http://pypi.douban.com/simple/
For example: pip3 install -i https://pypi.doubanio.com/simple/ <module name>
Failure 3: cmd reports the module is already installed (or that it installed successfully), yet it still cannot be imported in PyCharm.
Solution: you may have multiple Python versions installed (Anaconda and a standalone Python both count; uninstall one), or the Python interpreter in your PyCharm is not set up correctly. The sketch below helps diagnose which interpreter is actually in use.
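A minimal diagnostic sketch: run it once from cmd (python script.py) and once inside PyCharm; if the two paths differ, the "installed but cannot import" problem is a mismatched interpreter:

import sys

print(sys.executable)  # the Python interpreter actually running this script
print(sys.version)     # its version
# pip installs into the interpreter it belongs to, so if cmd and PyCharm
# print different paths, they are using different Python installations.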
4. How to configure the Python interpreter in PyCharm
Select File > Settings > Project > Python Interpreter.
Click the gear icon and choose Add.
Add the path of your Python installation.
5. How to install plugins in PyCharm
Select File > Settings > Plugins.
Click Marketplace and enter the name of the plugin you want to install; for example, for a translation plugin, enter Translation.
Select the appropriate plugin and click Install.
After installation succeeds, a prompt to restart PyCharm pops up; click OK, and the plugin takes effect after the restart.
III. Approach
A crawler requests data and then analyzes the content the server returns. No matter which website you crawl, you can follow these steps.
1. Identify the needs
What do we want to crawl from https://movie.douban.com/top250? Analyze the data we are after: where does it come from, and where can we get it? (data source analysis)
Use the browser's developer tools for packet-capture analysis. This is a static web page, so the data content sits directly in the page's source code.
2. Send a request
Look at Headers in the developer tools: which URL the request goes to, what kind of request it is, and which request header parameters it carries.
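Putting steps 1 and 2 together, here is a minimal sketch; the User-Agent value is just an example copied from a browser, and any current browser's string works:

import requests

url = 'https://movie.douban.com/top250?start=0&filter='
# Copy the User-Agent from the Headers tab of the developer tools.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
print(response.status_code)          # 200 means the server accepted the request
print('grid_view' in response.text)  # True: the movie list is in the static page source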
3. Obtain data
Get the content returned by the server and check what format it is in, or which form of it we actually want:
Get text data: response.text
Get the server's json dictionary data: response.json()
Get binary data: response.content
To save video / audio / picture / file content in a specific format, always fetch the binary data.
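All three forms come from the same response object. A minimal sketch, using httpbin.org purely as a throwaway test service (it is not part of this project):

import requests

html = requests.get('https://httpbin.org/html').text           # text data
info = requests.get('https://httpbin.org/json').json()         # json dictionary data
image = requests.get('https://httpbin.org/image/png').content  # binary data
# Binary content is written with mode='wb' and no encoding argument.
with open('test.png', mode='wb') as f:
    f.write(image)
print(type(html), type(info), type(image))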
4. Parse data
Extract only the data content we want from everything the server returned.
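A self-contained sketch of how parsel does this, on a miniature page whose markup imitates the structure of the Douban list (the snippet itself is made up for the example):

import parsel

html = '''
<ol class="grid_view">
  <li><div class="hd"><a href="https://example.com/1"><span>Movie A</span></a></div></li>
  <li><div class="hd"><a href="https://example.com/2"><span>Movie B</span></a></div></li>
</ol>
'''
selector = parsel.Selector(html)          # turn the page string into a selector object
for li in selector.css('.grid_view li'):  # css selectors pick tags by class / attribute
    title = li.css('.hd a span::text').get()  # ::text reads the tag's text, get() the first match
    href = li.css('.hd a::attr(href)').get()  # ::attr(href) reads a tag attribute
    print(title, href)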
5. Save data
Save it locally, here as a csv table.
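How the saving works in isolation, as a minimal sketch with only two columns (the file name demo.csv is arbitrary):

import csv

# newline='' stops csv from writing blank lines between rows on Windows
f = open('demo.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['Movie name', 'Rating'])
csv_writer.writeheader()  # write the column names once
csv_writer.writerow({'Movie name': 'Movie A', 'Rating': '9.7'})  # one dictionary = one row
f.close()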
IV. Code

import requests  # data request module: pip install requests
import parsel    # data parsing module: pip install parsel
import csv       # saves table data; part of the standard library

f = open('Douban data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'Movie name', 'Director', 'Starring', 'Year', 'Country',
    'Movie type', 'Number of comments', 'Rating', 'Overview', 'Details page',
])
csv_writer.writeheader()

for page in range(0, 250, 25):
    url = f'https://movie.douban.com/top250?start={page}&filter='
    # The headers disguise the Python code so the server does not recognise it as a crawler.
    # User-Agent is the browser's basic identity: copy and paste it from the developer tools.
    # A wolf in sheep's clothing: crawler >> wolf, headers >> sheepskin, server >> sheepfold.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
    }
    # Send the request: use the get method of the requests module for this url,
    # carry the request headers, and receive the returned data in the response variable.
    response = requests.get(url=url, headers=headers)
    # Get data
    # print(response.text)
    # Parse data: re regular expressions / css selectors / xpath, use whichever is most
    # convenient and comfortable. Convert the response.text page string into a selector object.
    selector = parsel.Selector(response.text)
    # css selectors extract data based on tag attributes.
    # First pass: get all the li tags.
    lis = selector.css('.grid_view li')  # css selector syntax
    # selector.xpath('//*[@class="grid_view"]/li')  # the same selection written in xpath
    # lis is a list; extract its elements one by one:
    for li in lis:
        try:
            # span:nth-child(1) selects the first span tag; ::text takes the tag's text
            title = li.css('.hd a span:nth-child(1)::text').get()
            href = li.css('.hd a::attr(href)').get()  # details page
            # li.xpath('//*[@class="hd"]/a/span[1]/text()').get()  # xpath equivalent
            # get() returns the first match as a string; getall() returns all matches as a list
            move_info = li.css('.bd p::text').getall()
            # The literals below ('导演: ', '主演: ', '人评价') match the Chinese text on the Douban page
            actor_list = move_info[0].strip().split('\xa0\xa0\xa0')  # non-breaking spaces separate director from starring
            date_list = move_info[1].strip().split('/')  # take values by list index position
            director = actor_list[0].replace('导演: ', '').strip()  # director
            actor = actor_list[1].replace('主演: ', '').replace('/', '').replace('...', '')  # starring
            date = date_list[0].strip()       # year
            country = date_list[1].strip()    # country
            move_type = date_list[2].strip()  # movie type
            comment = li.css('.star span:nth-child(4)::text').get().replace('人评价', '')  # number of comments
            star = li.css('.star span:nth-child(2)::text').get()  # rating
            world = li.css('.inq::text').get()  # overview
            # Handy string methods: replace() swaps text, strip() removes the spaces
            # at both ends of a string, split() cuts a string into a list.
            # print(title, actor_list, date_list)
            dit = {
                'Movie name': title,
                'Director': director,
                'Starring': actor,
                'Year': date,
                'Country': country,
                'Movie type': move_type,
                'Number of comments': comment,
                'Rating': star,
                'Overview': world,
                'Details page': href,
            }
            csv_writer.writerow(dit)
            print(title, director, actor, date, country, move_type, comment, star, world, href, sep=' | ')
        except:
            pass

That concludes this study of how a Python crawler collects popular movie information; I hope it resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it! If you want to keep learning more, stay tuned for more practical articles.