This article walks through how to write the code for a Python crawler that fetches a product's comments from JD.com and turns them into a word cloud. The explanation is detailed and easy to follow, the steps are simple and quick, and the code should serve as a useful reference. I believe you will get something out of it. Let's take a look.
import requests
import json
import os
import time
import random
import jieba
from wordcloud import WordCloud
from imageio import imread

comments_file_path = 'jd_comments.txt'
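# Added note (not in the original article): the third-party packages used above
# can be installed with pip before running the script, e.g.:
#   pip install requests jieba wordcloud imageio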
def get_jd_comments(page=0):
    # fetch one page of JD comments for product 1340204
    url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=1340204&score=0&sortType=5&page=%s&pageSize=10&isShadowSku=0&fold=1' % page
    headers = {
        # referer: the page the request is supposed to come from; it differs per site
        'referer': 'https://item.jd.com/1340204.html',
        # user-agent: tells the site which browser is making the request
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        # cookie: identifies the visitor (guest or logged-in member); use your own session cookie
        'cookie': '__jdu=1766075400; areaId=27; PCSYCityID=CN_610000_610100_610113; shshshfpa=a9dc241f-78b8-f3e1-edab-09485009987f-1585747224; shshshfpb=dwWV9IhxtSce3DU0STB1%20TQ%3D%3D; jwotest_product=99; unpl=V2_ZzNtbRAAFhJ3DUJTfhFcUGIAE1RKU0ZCdQoWU3kQXgcwBxJdclRCFnQUR1FnGF8UZAMZWEpcRhFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsfWwJmBRZYQ1ZzJXI4dmR9EFoAYjMTbUNnAUEpDURSeRhbSGcFFVpDUUcQdAl2VUsa; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_cfd63456491d4208954f13a63833f511|1585835385193; __jda=122270672.1766075400.1585747219.1585829967.1585835353.3; __jdc=122270672; 3AB9D23F7A4B3C9B=AXAFRBHRKYDEJAQ4SPJBVU4J4TI6OQHDFRDGI7ISQFUQGA6OZOQN52T3QSRWPSIHTFRYRN2QEG7AMEV2JG6NT2DFM; JSESSIONID=51895EFB4EBD95BA3B3ADAC8C6C73CD8.s1; shshshsID=d2435956e0c158fa7db1980c3053033d_15_1585836826172; __jdb=122270672.16.1766075400|3.1585835353'
    }
    try:
        response = requests.get(url, headers=headers)
    except:
        print('something went wrong')
    # the response is JSONP; drop the callback wrapper to keep only the JSON payload
    comments_json = response.text[20:-2]
    # parse the JSON string into a Python object
    comments_json_obj = json.loads(comments_json)
    # pull out the list of comments
    comments_all = comments_json_obj['comments']
    for comment in comments_all:
        with open(comments_file_path, 'a+', encoding='utf-8') as fin:
            fin.write(comment['content'] + '\n')
        print(comment['content'])
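# Added illustration (not part of the original article): the slice response.text[20:-2]
# above assumes the JSONP wrapper is exactly "fetchJSON_comment98(...);". A hypothetical
# helper that works for any callback name is to cut at the outermost parentheses:
def strip_jsonp(text):
    # keep only the JSON payload between the first "(" and the last ")"
    return text[text.find('(') + 1:text.rfind(')')]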
def batch_jd_comments():
    # clear the old data file before each run
    if os.path.exists(comments_file_path):
        os.remove(comments_file_path)
    # passing a fixed value of page fetches the comments on that page
    for i in range(30):
        print('crawling page ' + str(i + 1) + ' ...')
        get_jd_comments(i)
        # sleep a random interval to simulate a browsing user and avoid an IP ban
        time.sleep(random.random() * 5)
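# Added sketch (an assumption, not from the original): requests.get above has no timeout
# and the bare except hides failures. A hypothetical, more defensive fetch that retries
# a few times with a pause could look like this:
def polite_get(url, headers, retries=3, timeout=10):
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()  # raise on HTTP 4xx/5xx
            return response
        except requests.RequestException as err:
            print('request failed (%s), retrying...' % err)
            time.sleep(random.random() * 5)
    return None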
# segment the collected comments into words
def cut_comments():
    with open(comments_file_path, encoding='utf-8') as file:
        comment_text = file.read()
    wordlist = jieba.lcut_for_search(comment_text)
    new_wordlist = ' '.join(wordlist)
    return new_wordlist
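# Added illustration (not in the original): a tiny optional helper to see what
# jieba's search-mode segmentation produces for a sample sentence before building
# the word cloud. The sample text is purely hypothetical.
def demo_segmentation(sample_text='这款手机的屏幕很清晰，物流也很快'):
    # lcut_for_search returns a list of tokens, including finer-grained overlapping ones
    print(jieba.lcut_for_search(sample_text))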
# read the picture byt.jpg and use it as a mask so the word cloud takes its shape
def create_word_cloud():
    mask = imread('byt.jpg')
    wordcloud = WordCloud(font_path='msyh.ttc', mask=mask).generate(cut_comments())
    wordcloud.to_file('picture.png')
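# Optional variation (an assumption, not shown in the original article): WordCloud also
# accepts a background colour and a stopword set, which helps keep filler words out of
# the picture. The stopword list below is purely hypothetical; tune it for your data.
def create_word_cloud_filtered():
    mask = imread('byt.jpg')
    stopwords = {'京东', '非常', '而且'}  # hypothetical stopwords
    wordcloud = WordCloud(font_path='msyh.ttc', mask=mask,
                          background_color='white', stopwords=stopwords)
    wordcloud.generate(cut_comments()).to_file('picture_filtered.png')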
if __name__ == '__main__':
    # collect the comments first so jd_comments.txt exists, then build the word cloud
    batch_jd_comments()
    create_word_cloud()

That concludes this article on how to write the code for a Python crawler site. Thank you for reading! Hopefully you now have a working understanding of how to write a crawler like this one. If you want to learn more, you are welcome to follow the industry information channel.