
How to use Python to crawl JD.com's prices, titles and reviews

2025-03-31 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article explains in detail how to use Python to crawl JD.com's product prices, titles, reviews and other information. The editor shares it here for your reference; I hope you will have a solid understanding of the topic after reading it.

Preface

Code implementation

```python
import requests
from lxml import etree
import time
import random
import pandas as pd
import json
from sqlalchemy import create_engine
from sqlalchemy.dialects.oracle import DATE, FLOAT, NUMBER, VARCHAR2
import cx_Oracle
```

Import the packages you need first

```python
def create_table(table_name):
    conn = cx_Oracle.connect('user/password@IP:port/database')
    cursor = conn.cursor()
    # column lengths are illustrative; adjust them to your data
    create_shouji = '''CREATE TABLE {} (
        product_id VARCHAR2(50),
        price NUMBER,
        shop_name VARCHAR2(200),
        shop_type VARCHAR2(50),
        title VARCHAR2(500),
        comment_count NUMBER(19),
        good_count NUMBER(19)
    )'''.format(table_name)
    cursor.execute(create_shouji)
    cursor.close()
    conn.close()
```

Create the table

```python
def mapping_df_types(df_pro):
    # map each pandas dtype to a matching Oracle column type
    # (the source was truncated here; this is the usual form of the snippet)
    dtypedict = {}
    for i, j in zip(df_pro.columns, df_pro.dtypes):
        if "object" in str(j):
            dtypedict.update({i: VARCHAR2(500)})
        if "float" in str(j):
            dtypedict.update({i: FLOAT})
        if "int" in str(j):
            dtypedict.update({i: NUMBER(19)})
    return dtypedict
```

Define the mapping of a type
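The mapping pairs each pandas dtype with an Oracle column type. A minimal illustration of the same pattern, using type names as plain strings so it runs without an Oracle driver (the function name and lengths are just for this sketch):

```python
import pandas as pd

def sketch_dtype_map(df):
    """Map pandas dtypes to Oracle-style type names (illustrative only)."""
    dtypedict = {}
    for col, dtype in zip(df.columns, df.dtypes):
        if "object" in str(dtype):
            dtypedict[col] = "VARCHAR2(500)"
        elif "float" in str(dtype):
            dtypedict[col] = "FLOAT"
        elif "int" in str(dtype):
            dtypedict[col] = "NUMBER(19)"
    return dtypedict

df = pd.DataFrame({"title": ["phone"], "price": [1999.0], "comment_count": [3]})
print(sketch_dtype_map(df))
# {'title': 'VARCHAR2(500)', 'price': 'FLOAT', 'comment_count': 'NUMBER(19)'}
```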

```python
def sava_oracle(df_pro):
    engine = create_engine('oracle://user:password@ip:port/database')
    dtypedict = mapping_df_types(df_pro)
    df_pro.to_sql("shouji", con=engine, index=False, if_exists='append', dtype=dtypedict)
```
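With `if_exists='append'`, `to_sql` creates the table on the first write and appends on later ones. A sketch of the same call against an in-memory SQLite engine, so it runs without an Oracle instance (the Oracle-specific dtype mapping is omitted here):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # in-memory stand-in for the Oracle engine
df = pd.DataFrame({"product_id": ["100012345"], "price": [1999.0]})
df.to_sql("shouji", con=engine, index=False, if_exists="append")
df.to_sql("shouji", con=engine, index=False, if_exists="append")  # appends, no error
out = pd.read_sql("SELECT COUNT(*) AS n FROM shouji", con=engine)
print(out["n"][0])  # 2
```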

Define request headers and request methods

```python
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.37'}

def requesturl(url):
    session = requests.Session()
    rep = session.get(url, headers=headers)
    return rep
```

Parse the comments URL

```python
def commreq(url_comm):
    dd_commt = pd.DataFrame(columns=['product_id', 'comment_count', 'good_count'])
    session = requests.Session()
    rep_comm = session.get(url_comm, headers=headers)
    comment = json.loads(rep_comm.text)['CommentsCount']
    comment_list = []
    for i in comment:
        comment_list.append({'product_id': str(i['ProductId']),
                             'comment_count': i['CommentCount'],
                             'good_count': i['GoodCount']})
    # DataFrame.append is removed in pandas 2.x; use pd.concat there
    dd_commt = dd_commt.append(comment_list, ignore_index=True)
    return dd_commt
```
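The `productCommentSummaries` endpoint returns JSON whose `CommentsCount` key holds one entry per product id, which is what the function above walks through. A sketch of the same extraction against a mocked response body (the ids and counts are made up; only the key names follow the article's code):

```python
import json

# mocked response text; real values come from club.jd.com
rep_text = json.dumps({"CommentsCount": [
    {"ProductId": 100012345, "CommentCount": 5200, "GoodCount": 5100},
    {"ProductId": 100067890, "CommentCount": 880, "GoodCount": 850},
]})

comment = json.loads(rep_text)["CommentsCount"]
rows = [{"product_id": str(c["ProductId"]),
         "comment_count": c["CommentCount"],
         "good_count": c["GoodCount"]} for c in comment]
print(rows[0]["product_id"])  # '100012345'
```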

Parse the product list

```python
def parse(rep):
    df = pd.DataFrame(columns=['product_id', 'price', 'shop_name', 'shop_type', 'title'])
    html = etree.HTML(rep.text)
    all_pro = html.xpath("//ul[@class='gl-warp clearfix']/li")
    proid = ','.join(html.xpath("//li/@data-sku"))
    # 1. Comment parsing: in the comment URL, everything after referenceIds=
    #    and before &callback is a product id, so it is enough to join the
    #    ids collected from the product list
    url_comm = 'https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}'.format(proid)
    dd_commt = commreq(url_comm)
    # 2. Product list parsing
    pro_list = []
    for product in all_pro:
        proid = ''.join(product.xpath("@data-sku"))
        price = ''.join(product.xpath("div[@class='gl-i-wrap']//strong/i/text()"))
        target = ''.join(product.xpath("div[@class='gl-i-wrap']//a/em//text()")) \
            .replace('\t', '').replace('\n', '').replace('\u2122', '')
        shopname = ''.join(product.xpath("div[@class='gl-i-wrap']//span/a/@title"))
        shoptips = product.xpath("div[@class='gl-i-wrap']//i[contains(@class, 'goods-icon')]/text()")
        # JD marks its own shops with a "自营" (self-operated) icon
        if '自营' in shoptips:
            shoptips = 'self-operated'
        else:
            shoptips = 'non-self-operated'
        pro_list.append({'product_id': proid, 'price': price,
                         'shop_name': shopname, 'shop_type': shoptips,
                         'title': target})
    df = df.append(pro_list, ignore_index=True)
    # 3. Merge the comment counts into the product list
    df_pro = pd.merge(df, dd_commt, on='product_id')
    return df_pro
```
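The XPath expressions target JD's search-result markup. A minimal, self-contained check of the same expressions against a hand-written snippet of that structure (the HTML below is a simplified stand-in, not a real JD page):

```python
from lxml import etree

html_text = """
<ul class="gl-warp clearfix">
  <li data-sku="100012345">
    <div class="gl-i-wrap">
      <strong><i>1999.00</i></strong>
      <a><em>Example Phone 8GB+256GB</em></a>
      <span><a title="Example Official Store"></a></span>
      <i class="goods-icon">自营</i>
    </div>
  </li>
</ul>"""

html = etree.HTML(html_text)
items = html.xpath("//ul[@class='gl-warp clearfix']/li")
for li in items:
    sku = "".join(li.xpath("@data-sku"))
    price = "".join(li.xpath("div[@class='gl-i-wrap']//strong/i/text()"))
    title = "".join(li.xpath("div[@class='gl-i-wrap']//a/em//text()"))
    shop = "".join(li.xpath("div[@class='gl-i-wrap']//span/a/@title"))
    print(sku, price, title, shop)
```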

The main program

```python
if __name__ == "__main__":
    create_table('shouji')
    for i in range(1, 81):
        url = ('https://search.jd.com/s_new.php?'
               'keyword=Mobile&wq=Mobile&ev=3613_104528%5E&page={0}&s=30').format(i)
        rep = requesturl(url)
        df_pro = parse(rep)
        sava_oracle(df_pro)
        time.sleep(random.randrange(1, 4))
        print('done:', i)
```

That is all on how to use Python to crawl JD.com's prices, titles and reviews. I hope the above content is of some help to you. If you found the article useful, feel free to share it so more people can see it.





© 2024 shulou.com SLNews company. All rights reserved.
