2025-04-07 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article mainly explains how a Python crawler crawls Douban movie reviews. The method introduced here is simple, fast, and practical; interested readers may wish to follow along as the editor walks you through it.
I. Install the required modules before starting
pip install requests
pip install lxml
pip install pandas
II. Overview
We will use requests + XPath to crawl Douban movie reviews. This approach is suitable for complete beginners.
III. Now let's officially begin
1. Crawling Douban movie reviews with requests + XPath
(1) Get the page content
import requests

# URL of the page to crawl
douban_url = 'https://movie.douban.com/subject/26647117/comments?status=P'
# send the request with requests
get_response = requests.get(douban_url)
# convert the returned response into text (the whole page)
get_data = get_response.text
'''
At this point we have already obtained the content of the entire page,
and the crawler is, in a sense, already complete.
'''
(2) Analyze the page and extract the content we want
Open the page we want to crawl in a browser, press F12 to open the developer tools, and locate where the data we want lives.
Analyze the XPath values we obtained:
'/html/body/div[3]/div[1]/div/div[1]/div[4]/div[1]/div[2]/h4/span[2]/a'
'/html/body/div[3]/div[1]/div/div[1]/div[4]/div[2]/div[2]/h4/span[2]/a'
'/html/body/div[3]/div[1]/div/div[1]/div[4]/div[3]/div[2]/h4/span[2]/a'
Observation shows that these XPaths differ in only one place: the positional index on the fourth-level div (div[1], div[2], div[3]). So to crawl every commentator, we just drop that index:
'/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/h4/span[2]/a'
That is, by omitting the positional index, the query automatically matches every node with the same structure.
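To see why dropping the index widens the match, here is a minimal sketch using the standard-library `xml.etree.ElementTree` (the article itself uses lxml; the tiny document below is invented purely for illustration):

```python
import xml.etree.ElementTree as ET

# a tiny invented document mimicking Douban's comment-list structure
html = """
<div id="comments">
    <div class="comment-item"><h4><a>user_one</a></h4></div>
    <div class="comment-item"><h4><a>user_two</a></h4></div>
    <div class="comment-item"><h4><a>user_three</a></h4></div>
</div>
"""
root = ET.fromstring(html)

# with a positional index, only one node matches
first = root.findall('./div[1]/h4/a')
print([a.text for a in first])      # ['user_one']

# without the index, every sibling with the same structure matches
everyone = root.findall('./div/h4/a')
print([a.text for a in everyone])   # ['user_one', 'user_two', 'user_three']
```

The same principle applies to the lxml XPaths above: an unindexed step matches all sibling nodes at that level.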
By the same analysis, the XPath for the comment text is:
'/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/p'
from lxml import etree

# (continuing the code above) parse the page and print the extracted content
a = etree.HTML(get_data)
commentator = a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/h4/span[2]/a/text()')
comment_content = a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/p/text()')
# clean the extracted content and strip the extra characters
for i in range(0, len(commentator)):
    print(commentator[i] + ' says:')
    comment_content[i] = comment_content[i].strip('\n').strip(' ')
    print(comment_content[i])
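One gotcha worth noting in the cleanup loop: `str.strip()` does not modify a string in place, it returns a new string, so the result must be assigned back. A quick illustration (sample string invented):

```python
raw = '\n  great movie \n'

# calling strip() alone does nothing visible: the return value is discarded
raw.strip()
print(repr(raw))          # unchanged: '\n  great movie \n'

# assign the result back to actually keep the cleaned text
cleaned = raw.strip('\n').strip(' ')
print(repr(cleaned))      # 'great movie'
```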
Sample output (partial data):
Oriol Paulo said, 'Wrath of silence' is quite different from the crime movies I've seen. It's a mix of genres. It's a crime movie,a mystery movie,an action movie,it's also a social realistic movie. Xin Yu Kun plays very well the mix of different genres in this film,and it has a powerful ending.
Wen Wen Zhou said: young directors above average should be encouraged without stinginess, and directors who are too old to say nothing should be mercilessly attacked.
Xiluxian said: the boss's son eats vacuum mutton and is greedily ground into the meat shredder; the butcher's son drinks polluted well water, and justice is only on the TV screen. If you poke out your left eye, even your fellow countrymen who have been stabbed can cover up; if you bite off your tongue, the lawyer who has been rescued dare not say a word. You can't build a pyramid by brute force, and you can't become a rabbit mother by falsetto. The Superman mask is like a conscience curse, which cannot be returned to the original owner; the sign of finding a son is like a soul charm, fluttering in the wind. The truth is buried in the soil, hidden in the cave, and finally no one knows.
# 85 said: Xin Yukun's second work is not a show-operated "Heart Labyrinth 2.0", and its style is not like anyone: Kubrick's single-point perspective gazing at the cave, the neurotic killer shaped like the Cohen brothers, the promenade Fight like "the Old Boy". What's different is that he doesn't just want to tell you who the killer is, it's his choice, and like a scalpel, he cuts through the social crux of upper misconduct, middle immorality, aphasia at the bottom, and human disqualification.
He ate the cupcake in one bite and said: the ending is so good. I like the soundtrack very much. I wish I could get rid of the subtitles. I guessed the end when Jiang Wu picked up the ashtray. It's just that after careful consideration, why is the well water getting more and more salty? Why do so many people have edema? The village head knows that, otherwise he would not drink mineral water. However, this stem, in the end, did not give too much explanation.
The big meat jar said: the upper layer is hypocritical and cruel, the middle layer is cold and selfish, and the lower layer is aphasic and powerless.
The little prince of martial arts said: Motorola's electricity was still much lower than that of Nokia.
He said: only 80% of the films are made, and they are already brilliant. This is how Chinese genre films should be made. Good multi-line narrative control, deep point mapping of human nature, explosive economic growth, uncontrollable social problems, silent resentment and pain of men, just like the people at the bottom who can not speak. At the end of the darkness, the child is not found, and the truth is not revealed, but this is the social truth. Sometimes the wicked do evil just to become a true alliance with those with the same interests.
Europa said: the kind of film that keeps going down and falls into the dark lashes the principal contradiction of society and is not responsible for providing the pleasure of solving puzzles, so it will be very heavy and congested after watching it. If the "labyrinth" is still the spontaneous creation of the manual age, "burst Silence" is obviously the consideration of the industrial age (Kass action special effects). The rivalry among the three, the role of the lawyer is too weak, the fighting strength of Song Yang is too strong, and Jiang Wu is stylized. The advantages and disadvantages are obvious.
The Bavarian god of wine said: the ending is so fucking awesome that I took a breath in the cinema after watching it. The innuendo is also very strong, 1984 motorcycle license plate, a loser at the bottom is set as a mute (no voice), lawyers (representing the middle class and the law) and coal bosses (representatives and evil forces) collude with each other. Therefore, even if Zhang Baomin has an explosive military strength like Mian Zhenghe in the Yellow Sea, he can only be the victim of this cruel society.
Ling Rui said: when you look at the abyss, the abyss is also looking at you.
Frozenmoon said: Chang Wannian is a meat eater, Xu Wenjie is a soup eater, and Zhang Baomin himself is "meat". They used to play their own role in a position in the food chain, but the accident destroyed everything. After losing control, everyone found that they were just "flesh". When Chang took off his wig and suit, he had to succumb to violence and luck. Xu also had to face cruelty when he got out of the protection of money and words. Zhang's price may be even higher. The dull sound of the burst of human nature.
Shameless asshole said: what moves me most is not the obvious or even obvious metaphors, but the "aphasia" of the whole film. We belong to the "aphasia generation". In the corresponding film, it is not only the surface dumb Zhang Baomin's "physiological aphasia", but also the "active aphasia" chosen by elite lawyers at the end of the film. The film's accurate display of "aphasia" not only sensitively captures the pain points of the times, but also extremely stabs the hearts of the people.
(3) Turn pages, and save the commentators and comments to a CSV file
Turning pages, part 1
Unlike the XPath analysis above, here we only need to find the differences and the pattern among the URLs of successive pages.
# the start parameter marks the starting offset
turn_page1 = 'https://movie.douban.com/subject/26647117/comments?status=P'
turn_page2 = 'https://movie.douban.com/subject/26647117/comments?start=20&limit=20&sort=new_score&status=P'
turn_page3 = 'https://movie.douban.com/subject/26647117/comments?start=40&limit=20&sort=new_score&status=P'
turn_page4 = 'https://movie.douban.com/subject/26647117/comments?start=60&limit=20&sort=new_score&status=P'
Observation shows that, apart from the first URL, the pages differ only in the value of start, which increases by 20 each time. As mentioned above, start marks the starting offset, and each page holds exactly 20 comments, which is controlled by the limit parameter. (The editor has tried changing limit by hand; it has no effect, presumably due to Douban's anti-scraping measures, but that does not affect us.) The point is that start increases in steps of 20 precisely because limit is 20.
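The pattern just described (start increasing by 20 per page) means the page URLs can be generated directly; a small sketch:

```python
# template with the only varying part, the start offset, left as a placeholder
base = 'https://movie.douban.com/subject/26647117/comments?start={}&limit=20&sort=new_score&status=P'

# each page holds 20 comments, so page i starts at offset i * 20
page_urls = [base.format(i * 20) for i in range(4)]
for url in page_urls:
    print(url)
```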
Turning pages, part 2
# get the total number of comments
comment_counts = a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[1]/ul/li[1]/span/text()')
# the counter text on the Douban page reads "看过(N)" ("seen (N)"), so strip those characters
comment_counts = int(comment_counts[0].strip("看过()"))
# calculate the total number of pages (20 comments per page)
page_counts = int(comment_counts / 20)
# request each page and store the crawled data in a CSV file
for i in range(0, page_counts):
    turn_page_url = 'https://movie.douban.com/subject/26647117/comments?start={}&limit=20&sort=new_score&status=P'.format(i * 20)
    get_respones_data(turn_page_url)
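Note that `int(comment_counts / 20)` truncates, which is why the refactored script further down loops with `range(0, page_counts + 1)` so a partially filled last page is not missed. The same idea can be expressed with a ceiling division; a sketch with an invented comment count:

```python
import math

comment_counts = 211   # example value: 211 comments in total
page_size = 20

truncated = int(comment_counts / page_size)          # 10: would drop the last 11 comments
full_pages = math.ceil(comment_counts / page_size)   # 11: covers every comment
print(truncated, full_pages)
```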
Before the loop above can work, we must refactor the code written earlier: wrap it in a function get_respones_data() that takes the URL to visit as a parameter and returns the parsed HTML.
Refactored code:
import requests
from lxml import etree
import pandas as pd

def get_respones_data(douban_url='https://movie.douban.com/subject/26647117/comments?status=P'):
    # send the request with requests
    get_response = requests.get(douban_url)
    # convert the returned response into text (the whole page)
    get_data = get_response.text
    # parse the page
    a = etree.HTML(get_data)
    return a

first_a = get_respones_data()
# turn the pages
comment_counts = first_a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[1]/ul/li[1]/span/text()')
comment_counts = int(comment_counts[0].strip("看过()"))
page_counts = int(comment_counts / 20)
# the editor has tested this: without logging in you can visit at most 10 pages, i.e. 200 comments
# a follow-up article will cover how to deal with anti-scraping measures
for i in range(0, page_counts + 1):
    turn_page_url = 'https://movie.douban.com/subject/26647117/comments?start={}&limit=20&sort=new_score&status=P'.format(
        i * 20)
    print(turn_page_url)
    a = get_respones_data(turn_page_url)
    # get the commentators and the comment content
    commentator = a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/h4/span[2]/a/text()')
    comment_content = a.xpath('/html/body/div[3]/div[1]/div/div[1]/div[4]/div/div[2]/p/text()')
    # clean the content and save it to the CSV file
    content = ['' for j in range(0, len(commentator))]
    for j in range(0, len(commentator)):  # use j here so the page index i is not overwritten
        comment_content[j] = comment_content[j].strip('\n').strip(' ')
        content[j] = [commentator[j], comment_content[j]]
    name = ['commentator', 'comment content']
    file_test = pd.DataFrame(columns=name, data=content)
    if i == 0:
        file_test.to_csv(r'H:\PyCoding\FlaskCoding\Test_all\test0609\app\comment_content.csv', encoding='utf-8', index=False)
    else:
        file_test.to_csv(r'H:\PyCoding\FlaskCoding\Test_all\test0609\app\comment_content.csv', mode='a+', encoding='utf-8', index=False)
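One side effect of the to_csv calls above is that every appended chunk writes its own header row, so the output file ends up with repeated 'commentator,comment content' lines. Here is a minimal standard-library sketch of the usual fix, writing the header only on the first chunk (the file name and rows are invented for illustration):

```python
import csv
import os
import tempfile

rows_per_page = [
    [['user_a', 'great film'], ['user_b', 'too dark']],   # page 0 (invented data)
    [['user_c', 'loved the ending']],                     # page 1
]

path = os.path.join(tempfile.gettempdir(), 'comment_content_demo.csv')
for i, rows in enumerate(rows_per_page):
    # 'w' on the first page creates the file and writes the header once;
    # 'a' on later pages appends rows only
    with open(path, 'w' if i == 0 else 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if i == 0:
            writer.writerow(['commentator', 'comment content'])
        writer.writerows(rows)

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)   # the header appears exactly once
```

With pandas, the equivalent trick is to pass header=True only on the first write and header=False on the appends.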
IV. Something more advanced (not strictly about crawling)
Newly installed modules:
pip install jieba
pip install pyecharts
pip install numpy
(re and csv ship with the Python standard library, so they do not need to be installed via pip.)
1. Parse the data
import codecs
import csv
import re

with codecs.open(r'H:\PyCoding\FlaskCoding\Test_all\test0609\app\comment_content.csv', 'r', 'utf-8') as csvfile:
    content = ''
    reader = csv.reader(csvfile)
    i = 0
    for file1 in reader:
        if i == 0 or i == 1:
            pass
        else:
            content = content + file1[1]
        i = i + 1
# remove the extra punctuation and line breaks from all the comments
content = re.sub('[,，。.\r\n]', '', content)
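To make the cleaning step concrete, here is a tiny standalone example of stripping punctuation and line breaks with re.sub (the sample string is invented):

```python
import re

raw = 'Great film,really.\r\nWorth a second watch,honestly.\r\n'
# remove commas, full stops, and line breaks in one pass
cleaned = re.sub('[,.\r\n]', '', raw)
print(cleaned)   # 'Great filmreallyWorth a second watchhonestly'
```

Inside a character class, each listed character is removed wherever it appears, which is why a single substitution suffices.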
2. Analyze the data
import jieba
import numpy
import pandas as pd

# cut the text into individual words
segment = jieba.lcut(content)
words_df = pd.DataFrame({'segment': segment})
# quoting=3 (csv.QUOTE_NONE) means nothing in stopwords.txt is treated as quoted
stopwords = pd.read_csv(r"H:\PyCoding\FlaskCoding\Test_all\test0609\app\stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
# count how many times each word appears
words_stat = words_df.groupby(by=['segment'])['segment'].agg({"count": numpy.size})
words_stat = words_stat.reset_index().sort_values(by=["count"], ascending=False)
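The groupby/agg counting above can be mimicked without jieba or pandas; for an English stand-in text, collections.Counter does the same frequency count (jieba is only needed because Chinese writes no spaces between words; the segment list below is invented):

```python
from collections import Counter

# invented stand-in for the segmented comment text
segments = ['dark', 'ending', 'dark', 'silence', 'dark', 'ending']
stopwords = {'the', 'a'}

words_stat = Counter(w for w in segments if w not in stopwords)
# most_common() sorts by count descending, like sort_values(ascending=False)
print(words_stat.most_common())  # [('dark', 3), ('ending', 2), ('silence', 1)]
```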
3. Visualize the data
from pyecharts import WordCloud  # pyecharts 0.x API; newer versions import from pyecharts.charts

test = words_stat.head(1000).values
# get all the words
words = [test[i][0] for i in range(0, len(test))]
# get the number of occurrences of each word
counts = [test[i][1] for i in range(0, len(test))]
wordcloud = WordCloud(width=1300, height=620)
# generate the word-cloud image
wordcloud.add("Wrath of Silence", words, counts, word_size_range=[20, 100])
wordcloud.render()

At this point, I believe you have a deeper understanding of how a Python crawler crawls Douban movie reviews. Why not try it out in practice? Follow us and keep learning!