
How to use requests to get the most valuable content of Zhihu in Python


This article shows how to use requests in Python to fetch the most valuable content on Zhihu. The walkthrough is concise and easy to follow, and I hope the detailed introduction leaves you with something useful.

1. Preface

Use requests to crawl the most valuable answers on Zhihu: write a Python program that fetches a question page and downloads the pictures posted in its answers.
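Before the full program, here is a minimal sketch of the core fetch step (the question URL is the one used later in this article); sending a browser-like User-Agent header keeps Zhihu from rejecting the request as an obvious script:

import requests

# Minimal fetch sketch: a browser-like User-Agent header makes the
# request look like an ordinary page visit.
url = "https://www.zhihu.com/question/35990613"  # question used below
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
if resp.status_code == 200:
    print(len(resp.text))  # length of the returned HTML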

2. Practice

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import re
from os.path import basename
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit

import requests


def getHtml(url):
    session = requests.Session()
    # simulate browser access
    header = {
        'User-Agent': "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
        'Accept-Encoding': 'gzip, deflate'}
    res = session.get(url, headers=header)
    if res.status_code == 200:
        content = res.text
    else:
        content = ''
    return content


def mkdir(path):
    if not os.path.exists(path):
        print('New folder:', path)
        os.makedirs(path)
        return True
    else:
        print('Pictures are stored at:', os.getcwd() + os.sep + path)
        return False


def download_pic(img_lists, dir_name):
    print("a total of {num} photos".format(num=len(img_lists)))
    for image_url in img_lists:
        response = requests.get(image_url, stream=True)
        if response.status_code == 200:
            image = response.content
        else:
            continue
        file_name = dir_name + os.sep + basename(urlsplit(image_url)[2])
        try:
            # the with-statement closes the file automatically
            with open(file_name, "wb") as picture:
                picture.write(image)
        except IOError:
            print("IO Error\n")
            return
        print("download {pic_name} complete!".format(pic_name=file_name))


def getAllImg(html):
    # use a regular expression to filter the image addresses out of the page source
    reg = r'https://pic\d\.zhimg\.com/[a-fA-F0-9]{5,32}_\w+\.jpg'
    imgre = re.compile(reg, re.S)
    tmp_list = imgre.findall(html)  # all image addresses found in the page
    tmp_list = list(set(tmp_list))  # deduplicate
    # drop avatars: keep only the full-size images (the *_r.jpg variants)
    imglist = []
    for item in tmp_list:
        if item.endswith('r.jpg'):
            imglist.append(item)
    print('num: %d' % len(imglist))
    return imglist


if __name__ == '__main__':
    question_id = 35990613
    zhihu_url = "https://www.zhihu.com/question/{qid}".format(qid=question_id)
    html_content = getHtml(zhihu_url)
    path = 'zhihu_pic'
    mkdir(path)  # create a local folder
    img_list = getAllImg(html_content)  # get the list of image addresses
    download_pic(img_list, path)  # save the pictures

This version has a shortcoming: it cannot fetch all the pictures, because Zhihu renders only the first batch of answers and loading the rest requires clicking "more".

The second version of the code solves that problem by loading the additional answers automatically.
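The key change is that, instead of parsing only the initially rendered HTML, the script posts to Zhihu's QuestionAnswerListV2 endpoint and pages through the answers ten at a time. A single page request looks roughly like this (the endpoint and payload format are taken from the code below; Zhihu's API has changed over the years, so treat this as a sketch of the idea rather than a guaranteed-current interface):

import requests

# One page of answers from the answer-list endpoint (sketch only).
payload = {'method': 'next',
           'params': '{"url_token":26037846,"pagesize":"10","offset":10}'}
resp = requests.post("https://www.zhihu.com/node/QuestionAnswerListV2",
                     headers={'User-Agent': 'Mozilla/5.0'}, data=payload)
answers = resp.json()['msg']  # a list of HTML fragments, one per answer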

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import re
from os.path import basename
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit

import requests

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
    'Accept-Encoding': 'gzip, deflate'}


def mkdir(path):
    if not os.path.exists(path):
        print('New folder:', path)
        os.makedirs(path)
        return True
    else:
        print('Pictures are stored at:', os.getcwd() + os.sep + path)
        return False


def download_pic(img_lists, dir_name):
    print("a total of {num} photos".format(num=len(img_lists)))
    for image_url in img_lists:
        response = requests.get(image_url, stream=True)
        if response.status_code == 200:
            image = response.content
        else:
            continue
        file_name = dir_name + os.sep + basename(urlsplit(image_url)[2])
        try:
            # the with-statement closes the file automatically
            with open(file_name, "wb") as picture:
                picture.write(image)
        except IOError:
            print("IO Error\n")
            continue
        print("download {pic_name} complete!".format(pic_name=file_name))


def get_image_url(qid, headers):
    # page through the answers and filter the image addresses out of each one
    tmp_url = "https://www.zhihu.com/node/QuestionAnswerListV2"
    size = 10
    image_urls = []
    session = requests.Session()
    imgreg = re.compile('data-original="(.*?)"', re.S)
    # loop instead of clicking "more" by hand: each POST fetches one page
    # (one batch of answers) until the server returns an empty page
    while True:
        postdata = {'method': 'next',
                    'params': '{"url_token":' + str(qid) +
                              ',"pagesize":"10","offset":' + str(size) + '}'}
        page = session.post(tmp_url, headers=headers, data=postdata)
        ret = page.json()
        answers = ret['msg']
        size += 10
        if not answers:
            print("Picture URL acquisition completed, number of pages:",
                  (size - 10) // 10)
            return image_urls
        for answer in answers:
            tmp_list = []
            url_items = re.findall(imgreg, answer)
            for item in url_items:
                # strip any escape character '\' left in the matched image URL
                image_url = item.replace("\\", "")
                tmp_list.append(image_url)
            tmp_list = list(set(tmp_list))  # deduplicate
            # drop avatars: keep only the full-size data-original images
            for item in tmp_list:
                if item.endswith('r.jpg'):
                    print(item)
                    image_urls.append(item)
        print('size: %d, num: %d' % (size, len(image_urls)))


if __name__ == '__main__':
    question_id = 26037846
    zhihu_url = "https://www.zhihu.com/question/{qid}".format(qid=question_id)
    path = 'zhihu_pic'
    mkdir(path)  # create a local folder
    img_list = get_image_url(question_id, headers)  # get the list of image addresses
    download_pic(img_list, path)  # save the pictures
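If you save the second version as a module, the helpers are easy to reuse for any other question. A hypothetical example, assuming the code above is stored in a file named zhihu_pics.py:

# Hypothetical reuse, assuming the code above is saved as zhihu_pics.py
from zhihu_pics import mkdir, get_image_url, download_pic, headers

mkdir('zhihu_pic')
urls = get_image_url(26037846, headers)  # any Zhihu question id
download_pic(urls, 'zhihu_pic')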

The above is how to use requests in Python to fetch the most valuable content on Zhihu. Hopefully you have picked up a technique or two along the way.
