

How to write the code of Python crawler website

2025-01-16 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article introduces how to write the code of a Python crawler website: a short script that scrapes product comments from JD.com and turns them into a word cloud. The content is detailed and easy to understand, the steps are simple and quick, and it should serve as a useful reference. Let's take a look.
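The script below unwraps the JSONP response from JD's comment endpoint by slicing a fixed number of characters off the raw text (`response.text[20:-2]`). As a standalone illustration of that step, the same unwrapping can be done more robustly with a regular expression; note that `strip_jsonp` and the sample payload here are made up for demonstration and are not part of the article's script:

```python
import json
import re

def strip_jsonp(text):
    # Remove a JSONP wrapper like fetchJSON_comment98({...}); and return the JSON inside.
    match = re.match(r'^\s*\w+\((.*)\)\s*;?\s*$', text, re.S)
    return match.group(1) if match else text

# A made-up sample of the shape the JD comment endpoint returns.
sample = 'fetchJSON_comment98({"comments": [{"content": "good"}]});'
data = json.loads(strip_jsonp(sample))
print(data['comments'][0]['content'])  # -> good
```

Unlike a fixed slice, this version keeps working even if the callback name (and hence its length) changes.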

import requests
import json
import os
import time
import random
import jieba
from wordcloud import WordCloud
from imageio import imread

comments_file_path = 'jd_comments.txt'

def get_jd_comments(page=0):
    # Fetch one page of JD product comments.
    url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=1340204&score=0&sortType=5&page=%s&pageSize=10&isShadowSku=0&fold=1' % page
    headers = {
        # 'referer': which page the request claims to come from; each website is different.
        'referer': 'https://item.jd.com/1340204.html',
        # 'user-agent': tells the website which browser is making the request.
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        # 'cookie': identifies which kind of user wants the data, a guest or a member.
        'cookie': '__jdu=1766075400; areaId=27; PCSYCityID=CN_610000_610100_610113; shshshfpa=a9dc241f-78b8-f3e1-edab-09485009987f-1585747224; shshshfpb=dwWV9IhxtSce3DU0STB1%20TQ%3D%3D; jwotest_product=99; unpl=V2_ZzNtbRAAFhJ3DUJTfhFcUGIAE1RKU0ZCdQoWU3kQXgcwBxJdclRCFnQUR1FnGF8UZAMZWEpcRhFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsfWwJmBRZYQ1ZzJXI4dmR9EFoAYjMTbUNnAUEpDURSeRhbSGcFFVpDUUcQdAl2VUsa; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_cfd63456491d4208954f13a63833f511|1585835385193; __jda=122270672.1766075400.1585747219.1585829967.1585835353.3; __jdc=122270672; 3AB9D23F7A4B3C9B=AXAFRBHRKYDEJAQ4SPJBVU4J4TI6OQHDFRDGI7ISQFUQGA6OZOQN52T3QSRWPSIHTFRYRN2QEG7AMEV2JG6NT2DFM; JSESSIONID=51895EFB4EBD95BA3B3ADAC8C6C73CD8.s1; shshshsID=d2435956e0c158fa7db1980c3053033d_15_1585836826172; __jdb=122270672.16.1766075400|3.1585835353'
    }
    try:
        response = requests.get(url, headers=headers)
    except Exception:
        print('something went wrong')
        return
    # Get the dataset in JSON format by stripping the JSONP callback wrapper.
    comments_json = response.text[20:-2]
    # Convert the acquired JSON string into a Python object.
    comments_json_obj = json.loads(comments_json)
    # Get all the contents of the comments.
    comments_all = comments_json_obj['comments']
    for comment in comments_all:
        with open(comments_file_path, 'a', encoding='utf-8') as fout:
            fout.write(comment['content'] + '\n')
        print(comment['content'])

def batch_jd_comments():
    # Clear the old data before each run.
    if os.path.exists(comments_file_path):
        os.remove(comments_file_path)
    # Passing the page index i fetches the comments on that fixed page.
    for i in range(30):
        print('crawling page ' + str(i + 1) + ' data....')
        get_jd_comments(i)
        # Sleep a random interval to simulate a user browsing and prevent the IP from being blocked for crawling too frequently.
        time.sleep(random.random() * 5)

# Segment the obtained comment text into words.
def cut_comments():
    with open(comments_file_path, encoding='utf-8') as file:
        comment_text = file.read()
    wordlist = jieba.lcut_for_search(comment_text)
    new_wordlist = ' '.join(wordlist)
    return new_wordlist
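WordCloud sizes each word by how often it occurs in the segmented text. That frequency-counting step can be sketched with the standard library alone; the token list below is a made-up example standing in for what jieba's segmentation would produce:

```python
from collections import Counter

# A made-up list of tokens such as jieba.lcut_for_search might return.
tokens = '质量 很好 很好 物流 快 质量 不错'.split()

# Count how often each word appears; word clouds use such counts to size words.
freq = Counter(tokens)
print(freq.most_common(2))  # -> [('质量', 2), ('很好', 2)]
```

Inspecting `freq` this way is a quick sanity check that the segmentation produced sensible words before rendering the cloud.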

# Use the picture byt.jpg as a mask so the word cloud takes the same shape.
def create_word_cloud():
    mask = imread('byt.jpg')
    wordcloud = WordCloud(font_path='msyh.ttc', mask=mask).generate(cut_comments())
    wordcloud.to_file('picture.png')

if __name__ == '__main__':
    batch_jd_comments()
    create_word_cloud()

This is all for the article on "how to write the code of the Python crawler website". Thank you for reading! I believe everyone now has a certain understanding of how to write the code of a Python crawler website. If you want to learn more, you are welcome to follow the industry information channel.
