How to Crawl Zhihu Questions and Answers with Python

2025-02-24 Update From: SLTechnology News & Howtos

Shulou (Shulou.com) 06/01 Report --

This article mainly shows how to crawl Zhihu questions and answers with Python. The content is simple and clear, and I hope it helps resolve your doubts as we work through it together.

Preface

Python excels at fetching data, so this time I want to use it to crawl the answers to some Zhihu questions for practice.

1. Import modules

```python
import re
from bs4 import BeautifulSoup
import requests
import time
import json
import pandas as pd
import numpy as np
```

2. Status code

```python
r = requests.get('https://github.com/explore')
r.status_code
```
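For reference, the numeric status codes map to standard phrases, and Python's standard library can decode them without any network call (a minimal sketch, not part of the original script):

```python
from http import HTTPStatus

# 200 means the request succeeded; 4xx and 5xx are client/server errors.
ok = HTTPStatus(200)
print(ok.phrase)                           # OK
print(HTTPStatus(404).phrase)              # Not Found
print(400 <= HTTPStatus.NOT_FOUND < 500)   # True: client-error range
```

A quick `r.status_code == 200` check like the one above is a cheap sanity test before parsing the response body.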

3. Headers and cookies

```python
# Browser headers and cookies. The cookie string must be your own: log in
# to zhihu.com in a browser and copy the Cookie header from the developer
# tools (the cookie value in the original article is corrupted here).
headers = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/80.0.3987.87 Safari/537.36')
}
cookies = {'cookie': '<your own Zhihu cookie string>'}

start_url = 'https://www.zhihu.com/api/v3/feed/topstory/recommend?session_token=c03069ed8f250472b687fd1ee704dd5b&desktop=true&page_number=5&limit=6&action=pull&ad_interval=-1&before_id=23'
```
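Rather than passing the whole raw cookie header as a single dict value, it can also be split into individual name/value pairs. A minimal sketch, using a made-up cookie string (not a real Zhihu session):

```python
# A made-up cookie header of the kind copied from the browser's dev tools.
raw_cookie = '_zap=3d979dbb; tst=r; _xsrf=bf1c5edf'

# Split "name=value; name=value" into a dict. Split on the first '=' only,
# since cookie values may themselves contain '=' characters.
cookies = dict(pair.split('=', 1) for pair in raw_cookie.split('; '))
print(cookies['tst'])  # r
```

`requests` accepts either form, but the per-name dict makes it easier to drop or update a single cookie later.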

4. BeautifulSoup parsing

```python
s = requests.Session()
start_url = 'https://www.zhihu.com/'
html = s.get(url=start_url, headers=headers, cookies=cookies, timeout=5)
soup = BeautifulSoup(html.content, 'html.parser')

question = []          # question titles
question_address = []  # question URLs
temp1 = soup.find_all('div', class_='Card TopstoryItem TopstoryItem-isRecommend')
for item in temp1:
    temp2 = item.find_all('div', itemprop='zhihu:question')
    if temp2 != []:  # skip feed cards that are not questions
        question_address.append(temp2[0].find('meta', itemprop='url').get('content'))
        question.append(temp2[0].find('meta', itemprop='name').get('content'))
```
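The meta-tag extraction above can be sketched with the standard library's `html.parser` on a static snippet shaped like Zhihu's markup, as a simplified stand-in for what BeautifulSoup's `find` does (the snippet and question URL below are invented):

```python
from html.parser import HTMLParser

SNIPPET = """
<div itemprop="zhihu:question">
  <meta itemprop="url" content="https://www.zhihu.com/question/12345">
  <meta itemprop="name" content="Example question title">
</div>
"""

class MetaCollector(HTMLParser):
    """Collect itemprop -> content for every <meta> tag seen."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            d = dict(attrs)
            if 'itemprop' in d:
                self.meta[d['itemprop']] = d.get('content')

parser = MetaCollector()
parser.feed(SNIPPET)
print(parser.meta['url'])   # https://www.zhihu.com/question/12345
print(parser.meta['name'])  # Example question title
```

BeautifulSoup wraps this kind of event-driven parsing in a far more convenient search API, which is why the article uses it for the real pages.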

5. Store the information

```python
question_focus_number = []   # follower counts
question_answer_number = []  # answer counts
for url in question_address:
    test = s.get(url=url, headers=headers, cookies=cookies, timeout=5)
    soup = BeautifulSoup(test.content, 'html.parser')
    info = soup.find_all('div', class_='QuestionPage')[0]
    # Note: the original code had these two itemprops swapped, so follower
    # and answer counts were stored in each other's lists.
    focus_number = info.find('meta', itemprop='zhihu:followerCount').get('content')
    answer_number = info.find('meta', itemprop='answerCount').get('content')
    question_focus_number.append(focus_number)
    question_answer_number.append(answer_number)
```

6. Organize the information and output it

```python
question_info = pd.DataFrame(
    list(zip(question, question_focus_number, question_answer_number)),
    columns=['question', 'followers', 'answers'])

# The counts come back as strings; convert them to integers before sorting.
for item in ['followers', 'answers']:
    question_info[item] = np.array(question_info[item], dtype='int')

question_info.sort_values(by='followers', ascending=False)
```
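If pandas is not available, the same organize-and-sort step can be sketched with plain lists (the question names and counts below are illustrative, not real crawl results):

```python
# Parallel lists as produced by the crawl steps above (made-up values).
question = ['Q-A', 'Q-B', 'Q-C']
question_focus_number = ['120', '45', '300']   # follower counts, as strings
question_answer_number = ['30', '10', '75']    # answer counts, as strings

# Zip into rows, convert counts to int, then sort by followers descending.
rows = [(q, int(f), int(a))
        for q, f, a in zip(question, question_focus_number,
                           question_answer_number)]
rows.sort(key=lambda r: r[1], reverse=True)
print(rows[0])  # ('Q-C', 300, 75)
```

The pandas version does the same thing but also gives labeled columns, which is handy once more fields are added.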

Output: the questions sorted by follower count in descending order (the screenshot is lost in this copy).

That covers how to crawl Zhihu questions and answers with Python. Thank you for reading! I hope sharing this content has helped; if you want to learn more, welcome to follow the Internet Technology channel.
