This article shows how to use requests in Python to grab the highest-value content on Zhihu. The walkthrough is concise and easy to follow; I hope you pick up something useful from the detailed introduction below.
I. Preface
Use requests to crawl the highest-value content on Zhihu, writing a Python program to fetch it.
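As a quick warm-up, here is a minimal sketch of the basic call everything below is built on: fetching a question page with requests while sending a browser-like User-Agent header. The question id here is only an example.

    import requests

    # a minimal sketch; the question id is just an example
    url = "https://www.zhihu.com/question/35990613"
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
                             "(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"}
    res = requests.get(url, headers=headers)
    print(res.status_code)   # 200 means the page came back
    print(len(res.content))  # size of the raw HTML the script later searches for image URLs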
II. Practice
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import requests
import os
from urlparse import urlsplit
from os.path import basename

def getHtml(url):
    session = requests.Session()
    # simulate a browser visit
    header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
              'Accept-Encoding': 'gzip, deflate'}
    res = session.get(url, headers=header)
    if res.status_code == 200:
        content = res.content
    else:
        content = ''
    return content

def mkdir(path):
    if not os.path.exists(path):
        print 'New folder:', path
        os.makedirs(path)
        return True
    else:
        print u'Pictures are stored at:', os.getcwd() + os.sep + path
        return False

def download_pic(img_lists, dir_name):
    print "A total of {num} pictures".format(num=len(img_lists))
    for image_url in img_lists:
        response = requests.get(image_url, stream=True)
        if response.status_code == 200:
            image = response.content
        else:
            continue
        file_name = dir_name + os.sep + basename(urlsplit(image_url)[2])
        try:
            with open(file_name, "wb") as picture:
                picture.write(image)
        except IOError:
            print("IO Error\n")
            return
        print "Download of {pic_name} complete!".format(pic_name=file_name)

def getAllImg(html):
    # use a regular expression to filter the image addresses out of the page source
    reg = r'https://pic\d.zhimg.com/[a-fA-F0-9]{5,32}_\w+.jpg'
    imgre = re.compile(reg, re.S)
    tmp_list = imgre.findall(html)  # collect the addresses of all images in the page
    # drop avatars and keep only the full-size (data-original) images
    tmp_list = list(set(tmp_list))  # de-duplicate
    imglist = []
    for item in tmp_list:
        if item.endswith('r.jpg'):
            imglist.append(item)
    print 'num: %d' % (len(imglist))
    return imglist

if __name__ == '__main__':
    question_id = 35990613
    zhihu_url = "https://www.zhihu.com/question/{qid}".format(qid=question_id)
    html_content = getHtml(zhihu_url)
    path = 'zhihu_pic'
    mkdir(path)  # create a local folder
    img_list = getAllImg(html_content)  # get the list of image addresses
    download_pic(img_list, path)  # save the pictures
This code still has a shortcoming: it cannot fetch all of the pictures, because loading the remaining answers requires clicking "more" on the page.
The second version of the code below solves this problem by loading the additional answers automatically.
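Before the full script, here is a minimal sketch of the pagination request it relies on: a POST to the QuestionAnswerListV2 node whose params field is a JSON string carrying url_token, pagesize and offset. Parsing the reply with json.loads, as in this sketch, is a safer alternative to the eval call used in the script below; the question id and offset are example values.

    import json
    import requests

    # a minimal sketch of a single pagination request; values are examples only
    post_url = "https://www.zhihu.com/node/QuestionAnswerListV2"
    headers = {'User-Agent': "Mozilla/5.0"}
    postdata = {'method': 'next',
                'params': json.dumps({"url_token": 26037846, "pagesize": "10", "offset": 10})}
    page = requests.post(post_url, headers=headers, data=postdata)
    ret = json.loads(page.text)   # safer than eval() for a JSON reply
    answers = ret.get('msg', [])  # each entry is an HTML fragment for one answer
    print(len(answers))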
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import requests
import os
from urlparse import urlsplit
from os.path import basename

headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
           'Accept-Encoding': 'gzip, deflate'}

def mkdir(path):
    if not os.path.exists(path):
        print 'New folder:', path
        os.makedirs(path)
        return True
    else:
        print u'Pictures are stored at:', os.getcwd() + os.sep + path
        return False

def download_pic(img_lists, dir_name):
    print "A total of {num} pictures".format(num=len(img_lists))
    for image_url in img_lists:
        response = requests.get(image_url, stream=True)
        if response.status_code == 200:
            image = response.content
        else:
            continue
        file_name = dir_name + os.sep + basename(urlsplit(image_url)[2])
        try:
            with open(file_name, "wb") as picture:
                picture.write(image)
        except IOError:
            print("IO Error\n")
            continue
        print "Download of {pic_name} complete!".format(pic_name=file_name)

def get_image_url(qid, headers):
    # use a regular expression to filter the image addresses out of each answer
    tmp_url = "https://www.zhihu.com/node/QuestionAnswerListV2"
    size = 10
    image_urls = []
    session = requests.Session()
    # loop to load automatically what would otherwise need a click on "more";
    # each page returned is one batch of answers
    while True:
        postdata = {'method': 'next',
                    'params': '{"url_token":' + str(qid) + ',"pagesize":"10","offset":' + str(size) + '}'}
        page = session.post(tmp_url, headers=headers, data=postdata)
        ret = eval(page.text)
        answers = ret['msg']
        size += 10
        if not answers:
            print "Picture URL acquisition complete, number of pages:", (size - 10) / 10
            return image_urls
        # reg = r'https://pic\d.zhimg.com/[a-fA-F0-9]{5,32}_\w+.jpg'
        imgreg = re.compile('data-original="(.*?)"', re.S)
        for answer in answers:
            tmp_list = []
            url_items = re.findall(imgreg, answer)
            for item in url_items:  # strip the escape character '\' from each image URL
                image_url = item.replace("\\", "")
                tmp_list.append(image_url)
            # drop avatars and keep only the full-size (data-original) images
            tmp_list = list(set(tmp_list))  # de-duplicate
            for item in tmp_list:
                if item.endswith('r.jpg'):
                    print item
                    image_urls.append(item)
        print 'size: %d, num: %d' % (size, len(image_urls))

if __name__ == '__main__':
    question_id = 26037846
    zhihu_url = "https://www.zhihu.com/question/{qid}".format(qid=question_id)
    path = 'zhihu_pic'
    mkdir(path)  # create a local folder
    img_list = get_image_url(question_id, headers)  # get the list of image addresses
    download_pic(img_list, path)  # save the pictures
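One further note on the download step: both scripts request the images with stream=True but then read response.content in one go, so nothing is actually streamed. If you want to write large images to disk in chunks, a minimal sketch along these lines would work (download_pic_streamed is a hypothetical variant, not part of the original scripts):

    import os
    import requests

    def download_pic_streamed(img_lists, dir_name):
        # hypothetical variant of download_pic that writes each image in chunks
        for image_url in img_lists:
            response = requests.get(image_url, stream=True)
            if response.status_code != 200:
                continue
            # take the last path segment of the URL as the file name
            file_name = os.path.join(dir_name, image_url.rsplit('/', 1)[-1])
            with open(file_name, "wb") as picture:
                for chunk in response.iter_content(chunk_size=8192):
                    picture.write(chunk)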
The above is how to use requests in Python to obtain the highest-value content on Zhihu. Hopefully you have picked up a new skill or two from it.