How to Crawl JD.com Product Information and Comments into MySQL with Python


This article mainly introduces how to use Python to crawl JD.com product information and comments and store them in MySQL. Many people have questions about this in day-to-day work, so the editor has consulted various materials and put together a simple, practical walkthrough. Hopefully it will clear up your doubts about crawling JD.com product information and comments into MySQL. Now, please follow the editor and study it!

Build the MySQL tables

Question: with SQLAlchemy, a non-primary-key column cannot be made auto-incrementing, but I want this column to auto-increment and serve only as an index; setting autoincrement=True on it has no effect. How can I make it auto-increment?
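One common alternative, shown here only as a minimal sketch and not the approach the article settles on below (the article instead folds the auto-increment column into a composite primary key), is to keep the auto-increment id as the sole primary key and give sku_id a plain index. The table and column names in this sketch are hypothetical, for illustration only.

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class GoodsAlt(Base):
    __tablename__ = 'goods_alt'  # hypothetical table name
    # the surrogate id is the sole primary key, so MySQL auto-increments it
    id = Column(Integer, primary_key=True, autoincrement=True)
    # sku_id is an ordinary indexed column, not part of the primary key
    sku_id = Column(String(200), index=True)
    name = Column(String(200))

# usage (assuming the same connection string as the article):
# engine = create_engine("mysql+pymysql://root:root@127.0.0.1:3306/jdcrawl?charset=utf8")
# Base.metadata.create_all(engine)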

Save this module as jd_mysqldb.py; the crawler scripts below import Goods, Comments and sess_db from it.

from sqlalchemy import String, Integer, Text, Column
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm import scoped_session
from sqlalchemy.ext.declarative import declarative_base

engine = create_engine(
    "mysql+pymysql://root:root@127.0.0.1:3306/jdcrawl?charset=utf8",
    pool_size=200,
    max_overflow=300,
    echo=False
)

BASE = declarative_base()  # declarative base class

class Goods(BASE):
    __tablename__ = 'goods'
    # workaround for the question above: the auto-increment id is combined with
    # sku_id into a composite primary key, so it can still auto-increment
    id = Column(Integer(), primary_key=True, autoincrement=True)
    sku_id = Column(String(200), primary_key=True, autoincrement=False)
    # String columns need an explicit length for MySQL
    name = Column(String(200))
    price = Column(String(200))
    comments_num = Column(Integer)
    shop = Column(String(200))
    link = Column(String(200))

class Comments(BASE):
    __tablename__ = 'comments'
    id = Column(Integer(), primary_key=True, autoincrement=True, nullable=False)
    sku_id = Column(String(200), primary_key=True, autoincrement=False)
    comments = Column(Text())

BASE.metadata.create_all(engine)
Session = sessionmaker(engine)
sess_db = scoped_session(Session)

First edition:

Problem: after crawling a few pages of comments, the crawler starts getting blank pages, and adding a Referer header does not help.

Attempted fix: fetch comment pages with a single thread instead of a thread pool, and add a 1-second delay after every page of comments.
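In isolation, that throttling idea looks like the following minimal sketch. fetch_comment_pages is a hypothetical helper written for illustration; the comment URL pattern is the one used throughout this article.

import time
import requests

def fetch_comment_pages(sku_id, max_page, session=None):
    """Fetch comment pages one by one, pausing 1 second between requests."""
    sess = session or requests.Session()
    pages = []
    for i in range(max_page):
        url = (f'https://club.jd.com/comment/productPageComments.action'
               f'?productId={sku_id}&score=0&sortType=5&page={i}&pageSize=10')
        res = sess.get(url)
        pages.append(res.text)
        time.sleep(1)  # throttle: at most one comment page per second
    return pages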

# Don't crawl too fast! Otherwise you won't get the comments
from bs4 import BeautifulSoup
import requests
from urllib import parse
import csv, json, re
import threadpool
import time
from jd_mysqldb import Goods, Comments, sess_db

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    # cookie copied from a logged-in browser session; replace with your own
    'Cookie': '__jdv=76161171|baidu|-|organic|%25E4%25BA%25AC%25E4%25B8%259C|1613711947911; __jdu=16137119479101182770449; areaId=7; ipLoc-djd=7-458-466-0; PCSYCityID=CN_410000_0_0; shshshfpa=07383463-032f-3f99-9d40-639cb57c6e28-1613711950; shshshfpb=u8S9UvxK66gfIbM1mUNrIOg%3D%3D; user-key=153f6b4d-0704-4e56-82b6-8646f3f0dad4; cn=0; shshshfp=9a88944b34cb0ff3631a0a95907b75eb; __jdc=122270672; 3AB9D23F7A4B3C9B=SEELVNXBPU7OAA3UX5JTKR5LQADM5YFJRKY23Z6HDBU4OT2NWYGX525CKFFVHTRDJ7Q5DJRMRZQIQOW5GBY43XVI; __jdb=122270672.5.16137119479101182770449|4.1613748918',
    'Referer': 'https://www.jd.com/'
}

num = 0           # number of goods
comments_num = 0  # number of comments

# Get the product information and SkuId
def getIndex(url):
    session = requests.Session()
    session.headers = headers
    global num
    res = session.get(url, headers=headers)
    print(res.status_code)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text, 'lxml')
    items = soup.select('li.gl-item')
    for item in items[:3]:  # crawl three products to test
        title = item.select_one('.p-name a em').text.strip().replace(' ', '')
        price = item.select_one('.p-price strong').text.strip().replace('¥', '')
        try:
            shop = item.select_one('.p-shopnum a').text.strip()   # store selector when crawling books
        except:
            shop = item.select_one('.p-shop a').text.strip()      # store selector for other goods
        link = parse.urljoin('https://', item.select_one('.p-img a').get('href'))
        SkuId = re.search(r'\d+', link).group()  # the sku id is the digit sequence in the item link
        comments_num = getCommentsNum(SkuId, session)
        print(SkuId, title, price, shop, link, comments_num)
        print("Start saving the product to the database...")
        try:
            IntoGoods(SkuId, title, price, shop, link, comments_num)
        except Exception as e:
            print(e)
            sess_db.rollback()
        num += 1
        print("Getting comments...")
        # Get the total number of comment pages
        url1 = f'https://club.jd.com/comment/productPageComments.action?productId={SkuId}&score=0&sortType=5&page=0&pageSize=10'
        headers['Referer'] = f'https://item.jd.com/{SkuId}.html'
        headers['Connection'] = 'keep-alive'
        res2 = session.get(url1, headers=headers)
        res2.encoding = res2.apparent_encoding
        json_data = json.loads(res2.text)
        max_page = json_data['maxPage']  # at most 100 pages of comments, 10 per page (after testing)
        args = []
        for i in range(0, max_page):
            # this link returns the comments as json
            url2 = f'https://club.jd.com/comment/productPageComments.action?productId={SkuId}&score=0&sortType=5&page={i}&pageSize=10'
            # this link returns the comments wrapped in a jQuery callback and needs extraction
            # url2_2 = f'https://club.jd.com/comment/productPageComments.action?callback=jQuery9287224&productId={SkuId}&score=0&sortType=5&page={i}&pageSize=10'
            args.append(([session, SkuId, url2], None))
        pool2 = threadpool.ThreadPool(2)                       # 2 threads
        reque2 = threadpool.makeRequests(getComments, args)    # create the tasks
        for r in reque2:
            pool2.putRequest(r)   # submit the task to the thread pool
        pool2.wait()

# Get the total number of comments
def getCommentsNum(SkuId, sess):
    headers['Referer'] = f'https://item.jd.com/{SkuId}.html'
    url = f'https://club.jd.com/comment/productCommentSummaries.action?referenceIds={SkuId}'
    res = sess.get(url, headers=headers)
    try:
        res.encoding = res.apparent_encoding
        json_data = json.loads(res.text)  # convert the json into a dictionary
        num = json_data['CommentsCount'][0]['CommentCount']
        return num
    except:
        return 'Error'

# Get the comments
def getComments(sess, SkuId, url2):
    global comments_num
    print(url2)
    headers['Referer'] = f'https://item.jd.com/{SkuId}.html'
    res2 = sess.get(url2, headers=headers)
    res2.encoding = 'gbk'
    json_data = res2.text
    '''
    # if url2_2 is used instead, the json has to be extracted like this:
    start = res2.text.find('jQuery9287224(') + len('jQuery9287224(')
    end = res2.text.find(');')
    json_data = res2.text[start:end]
    '''
    dict_data = json.loads(json_data)
    try:
        comments = dict_data['comments']
        for item in comments:
            comment = item['content'].replace('\n', '')
            # print(comment)
            comments_num += 1
            try:
                IntoComments(SkuId, comment)
            except Exception as e:
                print(e)
                sess_db.rollback()
    except:
        pass

# Save the product information
def IntoGoods(SkuId, title, price, shop, link, comments_num):
    goods_data = Goods(sku_id=SkuId, name=title, price=price, comments_num=comments_num, shop=shop, link=link)
    sess_db.add(goods_data)
    sess_db.commit()

# Save a comment
def IntoComments(SkuId, comment):
    comments_data = Comments(sku_id=SkuId, comments=comment)
    sess_db.add(comments_data)
    sess_db.commit()

if __name__ == '__main__':
    start_time = time.time()
    urls = []
    KEYWORD = parse.quote(input("Please enter the keyword to search for:"))
    for i in range(1, 2):  # crawl one page to test
        url = f'https://search.jd.com/Search?keyword={KEYWORD}&wq={KEYWORD}&page={i}'
        urls.append(([url, ], None))  # the argument format threadpool requires
    pool = threadpool.ThreadPool(2)                  # 2-thread pool
    reque = threadpool.makeRequests(getIndex, urls)  # create the tasks
    for r in reque:
        pool.putRequest(r)   # submit the task to the thread pool
    pool.wait()              # wait for all tasks to finish
    print("Got {} products and {} comments in total, time taken {}".format(num, comments_num, time.time() - start_time))

Second edition:

After testing, there were indeed no more blank pages.

Further optimization: fetch the comments of more than two products at the same time.
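The third edition further below implements this. In isolation, the dispatch pattern is one worker call per product pushed through a small thread pool, roughly as in this minimal sketch; fetch_all_comments is a hypothetical stand-in for the per-product comment fetcher, and the threadpool calls mirror the ones already used in the scripts above.

import requests
import threadpool

def fetch_all_comments(sku_id, sess):
    # hypothetical worker: fetch every comment page of one product
    print("fetching comments for", sku_id)

def dispatch(skuids, sess):
    # one ([positional_args], keyword_args) tuple per task, as threadpool expects
    tasks = [([sku, sess], None) for sku in skuids]
    pool = threadpool.ThreadPool(3)  # handle up to three products concurrently
    for request in threadpool.makeRequests(fetch_all_comments, tasks):
        pool.putRequest(request)
    pool.wait()  # block until every worker has finished

# usage: dispatch(['100012345678', '100087654321'], requests.Session())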

# Don't crawl too fast! Otherwise you won't get the comments
from bs4 import BeautifulSoup
import requests
from urllib import parse
import csv, json, re
import threadpool
import time
from jd_mysqldb import Goods, Comments, sess_db

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    # cookie copied from a logged-in browser session; replace with your own
    'Cookie': '__jdv=76161171|baidu|-|organic|%25E4%25BA%25AC%25E4%25B8%259C|1613711947911; __jdu=16137119479101182770449; areaId=7; ipLoc-djd=7-458-466-0; PCSYCityID=CN_410000_0_0; shshshfpa=07383463-032f-3f99-9d40-639cb57c6e28-1613711950; shshshfpb=u8S9UvxK66gfIbM1mUNrIOg%3D%3D; user-key=153f6b4d-0704-4e56-82b6-8646f3f0dad4; cn=0; shshshfp=9a88944b34cb0ff3631a0a95907b75eb; __jdc=122270672; 3AB9D23F7A4B3C9B=SEELVNXBPU7OAA3UX5JTKR5LQADM5YFJRKY23Z6HDBU4OT2NWYGX525CKFFVHTRDJ7Q5DJRMRZQIQOW5GBY43XVI; __jdb=122270672.5.16137119479101182770449|4.1613748918',
    'Referer': 'https://www.jd.com/'
}

num = 0           # number of goods
comments_num = 0  # number of comments

# Get the product information and SkuId
def getIndex(url):
    session = requests.Session()
    session.headers = headers
    global num
    res = session.get(url, headers=headers)
    print(res.status_code)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text, 'lxml')
    items = soup.select('li.gl-item')
    for item in items[:2]:  # crawl two products to test
        title = item.select_one('.p-name a em').text.strip().replace(' ', '')
        price = item.select_one('.p-price strong').text.strip().replace('¥', '')
        try:
            shop = item.select_one('.p-shopnum a').text.strip()   # store selector when crawling books
        except:
            shop = item.select_one('.p-shop a').text.strip()      # store selector for other goods
        link = parse.urljoin('https://', item.select_one('.p-img a').get('href'))
        SkuId = re.search(r'\d+', link).group()  # the sku id is the digit sequence in the item link
        headers['Referer'] = f'https://item.jd.com/{SkuId}.html'
        headers['Connection'] = 'keep-alive'
        comments_num = getCommentsNum(SkuId, session)
        print(SkuId, title, price, shop, link, comments_num)
        print("Start saving the product to the database...")
        try:
            IntoGoods(SkuId, title, price, shop, link, comments_num)
        except Exception as e:
            print(e)
            sess_db.rollback()
        num += 1
        print("Getting comments...")
        # Get the total number of comment pages
        url1 = f'https://club.jd.com/comment/productPageComments.action?productId={SkuId}&score=0&sortType=5&page=0&pageSize=10'
        res2 = session.get(url1, headers=headers)
        res2.encoding = res2.apparent_encoding
        json_data = json.loads(res2.text)
        max_page = json_data['maxPage']  # at most 100 pages of comments, 10 per page (after testing)
        print("{} has {} pages of comments".format(SkuId, max_page))
        if max_page == 0:
            IntoComments(SkuId, '0')
        else:
            for i in range(0, max_page):
                # this link returns the comments as json
                url2 = f'https://club.jd.com/comment/productPageComments.action?productId={SkuId}&score=0&sortType=5&page={i}&pageSize=10'
                # this link returns the comments wrapped in a jQuery callback and needs extraction
                # url2_2 = f'https://club.jd.com/comment/productPageComments.action?callback=jQuery9287224&productId={SkuId}&score=0&sortType=5&page={i}&pageSize=10'
                print("Getting page {} of comments: {}".format(i + 1, url2))
                getComments(session, SkuId, url2)
                time.sleep(1)  # single thread, one comment page per second

# Get the total number of comments
def getCommentsNum(SkuId, sess):
    url = f'https://club.jd.com/comment/productCommentSummaries.action?referenceIds={SkuId}'
    res = sess.get(url)
    try:
        res.encoding = res.apparent_encoding
        json_data = json.loads(res.text)  # convert the json into a dictionary
        num = json_data['CommentsCount'][0]['CommentCount']
        return num
    except:
        return 'Error'

# Get the comments
def getComments(sess, SkuId, url2):
    global comments_num
    res2 = sess.get(url2)
    res2.encoding = res2.apparent_encoding
    json_data = res2.text
    '''
    # if url2_2 is used instead, the json has to be extracted like this:
    start = res2.text.find('jQuery9287224(') + len('jQuery9287224(')
    end = res2.text.find(');')
    json_data = res2.text[start:end]
    '''
    dict_data = json.loads(json_data)
    comments = dict_data['comments']
    for item in comments:
        comment = item['content'].replace('\n', '')
        # print(comment)
        comments_num += 1
        try:
            IntoComments(SkuId, comment)
        except Exception as e:
            print(e)
            sess_db.rollback()

# Save the product information
def IntoGoods(SkuId, title, price, shop, link, comments_num):
    goods_data = Goods(sku_id=SkuId, name=title, price=price, comments_num=comments_num, shop=shop, link=link)
    sess_db.add(goods_data)
    sess_db.commit()

# Save a comment
def IntoComments(SkuId, comment):
    comments_data = Comments(sku_id=SkuId, comments=comment)
    sess_db.add(comments_data)
    sess_db.commit()

if __name__ == '__main__':
    start_time = time.time()
    urls = []
    KEYWORD = parse.quote(input("Please enter the keyword to search for:"))
    for i in range(1, 2):  # crawl one page to test
        url = f'https://search.jd.com/Search?keyword={KEYWORD}&wq={KEYWORD}&page={i}'
        urls.append(([url, ], None))  # the argument format threadpool requires
    pool = threadpool.ThreadPool(2)                  # 2-thread pool
    reque = threadpool.makeRequests(getIndex, urls)  # create the tasks
    for r in reque:
        pool.putRequest(r)   # submit the task to the thread pool
    pool.wait()              # wait for all tasks to finish
    print("Got {} products and {} comments in total, time taken {}".format(num, comments_num, time.time() - start_time))

Third edition:

...No good, the blank pages are back again.

# Don't crawl too fast! Otherwise you won't get the comments
from bs4 import BeautifulSoup
import requests
from urllib import parse
import csv, json, re
import threadpool
import time
from jd_mysqldb import Goods, Comments, sess_db

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    # cookie copied from a logged-in browser session; replace with your own
    'Cookie': '__jdv=76161171|baidu|-|organic|%25E4%25BA%25AC%25E4%25B8%259C|1613711947911; __jdu=16137119479101182770449; areaId=7; ipLoc-djd=7-458-466-0; PCSYCityID=CN_410000_0_0; shshshfpa=07383463-032f-3f99-9d40-639cb57c6e28-1613711950; shshshfpb=u8S9UvxK66gfIbM1mUNrIOg%3D%3D; user-key=153f6b4d-0704-4e56-82b6-8646f3f0dad4; cn=0; shshshfp=9a88944b34cb0ff3631a0a95907b75eb; __jdc=122270672; 3AB9D23F7A4B3C9B=SEELVNXBPU7OAA3UX5JTKR5LQADM5YFJRKY23Z6HDBU4OT2NWYGX525CKFFVHTRDJ7Q5DJRMRZQIQOW5GBY43XVI; __jdb=122270672.5.16137119479101182770449|4.1613748918',
    'Referer': 'https://www.jd.com/'
}

num = 0           # number of goods
comments_num = 0  # number of comments

# Get the product information and SkuId
def getIndex(url):
    global num
    skuids = []
    session = requests.Session()
    session.headers = headers
    res = session.get(url, headers=headers)
    print(res.status_code)
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text, 'lxml')
    items = soup.select('li.gl-item')
    for item in items[:3]:  # crawl three products to test
        title = item.select_one('.p-name a em').text.strip().replace(' ', '')
        price = item.select_one('.p-price strong').text.strip().replace('¥', '')
        try:
            shop = item.select_one('.p-shopnum a').text.strip()   # store selector when crawling books
        except:
            shop = item.select_one('.p-shop a').text.strip()      # store selector for other goods
        link = parse.urljoin('https://', item.select_one('.p-img a').get('href'))
        SkuId = re.search(r'\d+', link).group()  # the sku id is the digit sequence in the item link
        skuids.append(([SkuId, session], None))
        headers['Referer'] = f'https://item.jd.com/{SkuId}.html'
        headers['Connection'] = 'keep-alive'
        comments_num = getCommentsNum(SkuId, session)  # number of comments
        print(SkuId, title, price, shop, link, comments_num)
        print("Start saving the product to the database...")
        try:
            IntoGoods(SkuId, title, price, shop, link, comments_num)
        except Exception as e:
            print(e)
            sess_db.rollback()
        num += 1
    print("Start getting comments and saving them to the database...")
    pool2 = threadpool.ThreadPool(3)  # fetch the comments of three products at the same time
    task = threadpool.makeRequests(getComments, skuids)
    for r in task:
        pool2.putRequest(r)
    pool2.wait()

# Get the comments of one product
def getComments(SkuId, sess):
    # get the total number of comment pages
    url1 = f'https://club.jd.com/comment/productPageComments.action?productId={SkuId}&score=0&sortType=5&page=0&pageSize=10'
    res2 = sess.get(url1, headers=headers)
    res2.encoding = res2.apparent_encoding
    json_data = json.loads(res2.text)
    max_page = json_data['maxPage']  # at most 100 pages of comments, 10 per page (after testing)
    print("{} has {} pages of comments".format(SkuId, max_page))
    if max_page == 0:
        IntoComments(SkuId, '0')
    else:
        for i in range(0, max_page):
            # this link returns the comments as json
            url2 = f'https://club.jd.com/comment/productPageComments.action?productId={SkuId}&score=0&sortType=5&page={i}&pageSize=10'
            # this link returns the comments wrapped in a jQuery callback and needs extraction
            # url2_2 = f'https://club.jd.com/comment/productPageComments.action?callback=jQuery9287224&productId={SkuId}&score=0&sortType=5&page={i}&pageSize=10'
            print("Getting page {} of comments: {}".format(i + 1, url2))
            getComments_one(sess, SkuId, url2)
            time.sleep(1)  # pages of one product are still fetched one per second

# Get the total number of comments
def getCommentsNum(SkuId, sess):
    url = f'https://club.jd.com/comment/productCommentSummaries.action?referenceIds={SkuId}'
    res = sess.get(url)
    try:
        res.encoding = res.apparent_encoding
        json_data = json.loads(res.text)  # convert the json into a dictionary
        num = json_data['CommentsCount'][0]['CommentCount']
        return num
    except:
        return 'Error'

# Get a single page of comments
def getComments_one(sess, SkuId, url2):
    global comments_num
    res2 = sess.get(url2)
    res2.encoding = res2.apparent_encoding
    json_data = res2.text
    '''
    # if url2_2 is used instead, the json has to be extracted like this:
    start = res2.text.find('jQuery9287224(') + len('jQuery9287224(')
    end = res2.text.find(');')
    json_data = res2.text[start:end]
    '''
    dict_data = json.loads(json_data)
    comments = dict_data['comments']
    for item in comments:
        comment = item['content'].replace('\n', '')
        # print(comment)
        comments_num += 1
        try:
            IntoComments(SkuId, comment)
        except Exception as e:
            print(e)
            print("Rollback!")
            sess_db.rollback()

# Save the product information
def IntoGoods(SkuId, title, price, shop, link, comments_num):
    goods_data = Goods(sku_id=SkuId, name=title, price=price, comments_num=comments_num, shop=shop, link=link)
    sess_db.add(goods_data)
    sess_db.commit()

# Save a comment
def IntoComments(SkuId, comment):
    comments_data = Comments(sku_id=SkuId, comments=comment)
    sess_db.add(comments_data)
    sess_db.commit()

if __name__ == '__main__':
    start_time = time.time()
    urls = []
    KEYWORD = parse.quote(input("Please enter the keyword to search for:"))
    for i in range(1, 2):  # crawl one page to test
        url = f'https://search.jd.com/Search?keyword={KEYWORD}&wq={KEYWORD}&page={i}'
        urls.append(([url, ], None))  # the argument format threadpool requires
    pool = threadpool.ThreadPool(2)                  # 2-thread pool
    reque = threadpool.makeRequests(getIndex, urls)  # create the tasks
    for r in reque:
        pool.putRequest(r)   # submit the task to the thread pool
    pool.wait()              # wait for all tasks to finish
    print("Got {} products and {} comments in total, time taken {}".format(num, comments_num, time.time() - start_time))

At this point, the study of how to crawl JD.com product information and comments into MySQL with Python comes to an end; hopefully it has resolved your doubts. Combining theory with practice is the best way to learn, so go and try it yourself! If you want to keep learning more related knowledge, please continue to follow the site; the editor will keep working hard to bring you more practical articles!
