This article introduces the basics of how to use a Python crawler. In real projects many people run into tricky situations, so this walkthrough works through a concrete example step by step. I hope you read it carefully and get something out of it!
1. Import modules
import re
from bs4 import BeautifulSoup
import requests
import time
import json
import pandas as pd
import numpy as np
2. Status code
r = requests.get('https://github.com/explore')
r.status_code
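As a quick sanity check (a small sketch added here, not part of the original code), you can test whether the request succeeded before going any further; 200 means the page was fetched successfully:

# minimal sketch: only proceed when the server returned 200 OK
if r.status_code == 200:
    print('request OK')
else:
    print('request failed with status', r.status_code)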
3. Crawl Zhihu
# browser headers and cookies (copy the cookie string from your own logged-in browser session)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
}
cookies = {
    'cookie': '_zap=...; _xsrf=...; ...'  # paste your own Zhihu cookie string here; the original value is account-specific and omitted
}
start_url = 'https://www.zhihu.com/api/v3/feed/topstory/recommend?session_token=c03069ed8f250472b687fd1ee704dd5b&desktop=true&page_number=5&limit=6&action=pull&ad_interval=-1&before_id=23'
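Before moving on, here is a minimal sketch (not part of the original code) of how these headers and cookies are passed to requests. The start_url above points at Zhihu's recommend API, which answers with JSON; the exact response structure is not shown in the original article, so the snippet only peeks at the top-level keys:

# minimal sketch: send the request with the browser headers and cookies defined above
r = requests.get(start_url, headers=headers, cookies=cookies, timeout=5)
if r.ok:
    data = r.json()        # the recommend API returns JSON
    print(list(data)[:5])  # peek at the top-level keys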
4. BeautifulSoup parsing
s = requests.Session()
start_url = 'https://www.zhihu.com/'
html = s.get(url=start_url, headers=headers, cookies=cookies, timeout=5)
soup = BeautifulSoup(html.content, 'html.parser')
question = []          # question titles
question_address = []  # question urls
temp1 = soup.find_all('div', class_='Card TopstoryItem TopstoryItem-isRecommend')
for item in temp1:
    temp2 = item.find_all('div', itemprop='zhihu:question')
    # print(temp2)
    if temp2 != []:  # skip recommended items that are not questions
        question_address.append(temp2[0].find('meta', itemprop='url').get('content'))
        question.append(temp2[0].find('meta', itemprop='name').get('content'))
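To check what was collected, here is a minimal sketch (not in the original code) that prints each question title next to its URL:

# minimal sketch: inspect the scraped questions
for name, address in zip(question, question_address):
    print(name, '->', address)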
5. Store the information
question_focus_number = []   # number of followers per question
question_answer_number = []  # number of answers per question
for url in question_address:
    test = s.get(url=url, headers=headers, cookies=cookies, timeout=5)
    soup = BeautifulSoup(test.content, 'html.parser')
    info = soup.find_all('div', class_='QuestionPage')[0]
    # print(info)
    focus_number = info.find('meta', itemprop='zhihu:followerCount').get('content')  # follower count
    answer_number = info.find('meta', itemprop='answerCount').get('content')         # answer count
    question_focus_number.append(focus_number)
    question_answer_number.append(answer_number)
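The time module is imported in step 1 but never used. When looping over many question pages it is polite to pause between requests; a minimal sketch of the request loop with a delay added (parsing omitted for brevity, and the one-second value is an assumption, not from the original code):

# minimal sketch: the request loop with a pause between requests
for url in question_address:
    test = s.get(url=url, headers=headers, cookies=cookies, timeout=5)
    time.sleep(1)  # assumed delay; adjust to the site's rate limits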
6. Organize the information and output it
question_info = pd.DataFrame(list(zip(question, question_focus_number, question_answer_number)),
                             columns=['question name', 'number of followers', 'number of respondents'])
# the scraped counts are strings, so convert them to integers before sorting
for item in ['number of followers', 'number of respondents']:
    question_info[item] = np.array(question_info[item], dtype='int')
question_info.sort_values(by='number of followers', ascending=False)
Output: the question_info DataFrame sorted by number of followers.
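Note that sort_values returns a new, sorted DataFrame rather than sorting in place. To keep the sorted table and save it, a minimal sketch (the output file name is an assumption, not from the original article):

# minimal sketch: keep the sorted result and write it to disk
question_info = question_info.sort_values(by='number of followers', ascending=False)
question_info.to_csv('zhihu_questions.csv', index=False, encoding='utf-8-sig')
print(question_info.head())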
That's all for "how to use a Python crawler". Thank you for reading. If you want to learn more, keep following the site for more practical articles!