

"Docker Practice" — Python's Docker crawler technology: crawling app data with a Python script (13)


Original article; reprints are welcome. When reprinting, please indicate that it is reproduced from IT Story Association. Thank you!

Original link: "Docker Practice" — Python's Docker crawler technology: crawling app data with a Python script (13)

Last time we analyzed the app's specific request connections; this time the focus is the Python side: writing the script that grabs the information out of the app. Source code: https://github.com/limingios/dockerpython.git

Analyze the app packets

View analysis

Parsed header

Nox emulator configuration

Python code: crawl the categories

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019-1-9  @Author: lm  @Url: idig8.com
# @File: spider_douguomeishi.py  @Software: PyCharm
import requests


def handle_request(url, data):
    # The header carries a lot of fields because every vendor does things
    # differently. fiddler captures more fields than are strictly needed and
    # some are optional; in practice you can only find out which by
    # commenting things out and testing.
    header = {
        "client": "4", "version": "6916.2", "device": "SM-G955N",
        "sdk": "22,5.1.1", "imei": "354730010002552", "channel": "zhuzhan",
        "mac": "00:FF:E2:A2:7B:58", "resolution": "1440,900", "dpi": "2.0",
        "android-id": "bcdaf527105cc26f", "pseudo-id": "354730010002552",
        "brand": "samsung", "scale": "2.0", "timezone": "28800",
        "language": "zh", "cns": "3", "carrier": "Android",
        # "imsi": "310260000000000",
        "user-agent": "Mozilla/5.0 (Linux; Android 5.1.1; SM-G955N Build/NRD90M) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
                      "Chrome/39.0.0.0 Mobile Safari/537.36",
        "lon": "105.566938", "lat": "29.99831", "cid": "512000",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate", "Connection": "Keep-Alive",
        # "Cookie": "duid=58349118",
        "Host": "api.douguo.net",
        # "Content-Length": "65",
    }
    response = requests.post(url=url, headers=header, data=data)
    return response


def handle_index():
    url = "http://api.douguo.net/recipe/flatcatalogs"
    # client=4&_session=1547000257341354730010002552&v=1503650468&_vs=0
    data = {
        "client": "4",
        "_session": "1547000257341354730010002552",
        "v": "1503650468",
        "_vs": "0",
    }
    response = handle_request(url, data)
    print(response.text)


handle_index()
```
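To see what the interface returns before writing the parsing loops, it helps to pretty-print the JSON rather than dumping the raw text. A small inspection sketch of mine, not part of the original script; it replaces the `print(response.text)` inside `handle_index()`:

```python
import json

# pretty-print the category JSON so the nested "result" -> "cs" levels
# are easy to spot before writing the loops that walk them
print(json.dumps(json.loads(response.text), ensure_ascii=False, indent=2))
```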

Crawl the details; the information can be looked up by category.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019-1-9  @Author: lm  @Url: idig8.com
# @File: spider_douguomeishi.py  @Software: PyCharm
import json
import requests
from multiprocessing import Queue

# create the queue that will hold one payload per category keyword
queue_list = Queue()


def handle_request(url, data):
    # Many header fields; fiddler captures more than are strictly needed,
    # and some are optional -- comment things out and test.
    header = {
        "client": "4", "version": "6916.2", "device": "SM-G955N",
        "sdk": "22,5.1.1", "imei": "354730010002552", "channel": "zhuzhan",
        "mac": "00:FF:E2:A2:7B:58", "resolution": "1440,900", "dpi": "2.0",
        "android-id": "bcdaf527105cc26f", "pseudo-id": "354730010002552",
        "brand": "samsung", "scale": "2.0", "timezone": "28800",
        "language": "zh", "cns": "3", "carrier": "Android",
        # "imsi": "310260000000000",
        "user-agent": "Mozilla/5.0 (Linux; Android 5.1.1; SM-G955N Build/NRD90M) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
                      "Chrome/39.0.0.0 Mobile Safari/537.36",
        "lon": "105.566938", "lat": "29.99831", "cid": "512000",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate", "Connection": "Keep-Alive",
        # "Cookie": "duid=58349118",
        "Host": "api.douguo.net",
        # "Content-Length": "65",
    }
    response = requests.post(url=url, headers=header, data=data)
    return response


def handle_index():
    url = "http://api.douguo.net/recipe/flatcatalogs"
    # client=4&_session=1547000257341354730010002552&v=1503650468&_vs=0
    data = {
        "client": "4",
        "_session": "1547000257341354730010002552",
        "v": "1503650468",
        "_vs": "0",
    }
    response = handle_request(url, data)
    index_response_dic = json.loads(response.text)
    # the category tree nests three levels of "cs"
    for item_index in index_response_dic["result"]["cs"]:
        for item_index_cs in item_index["cs"]:
            for item in item_index_cs["cs"]:
                data_2 = {
                    "client": "4",
                    "_session": "1547000257341354730010002552",
                    "keyword": item["name"],
                    "_vs": "400",
                }
                queue_list.put(data_2)


handle_index()
print(queue_list.qsize())
```
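Before moving on, a quick throwaway check (mine, not in the original) confirms what actually landed in the queue. Note that `get()` consumes items, so only do this while experimenting:

```python
# peek at a few of the queued keyword payloads; qsize() on a
# multiprocessing.Queue is approximate, which is fine for a smoke test
for _ in range(min(3, queue_list.qsize())):
    print(queue_list.get())
```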

Recipe details within a category

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019-1-9  @Author: lm  @Url: idig8.com
# @File: spider_douguomeishi.py  @Software: PyCharm
import json
import requests
from multiprocessing import Queue

# create queue
queue_list = Queue()


def handle_request(url, data):
    header = {
        "client": "4", "version": "6916.2", "device": "SM-G955N",
        "sdk": "22,5.1.1", "imei": "354730010002552", "channel": "zhuzhan",
        "mac": "00:FF:E2:A2:7B:58", "resolution": "1440,900", "dpi": "2.0",
        "android-id": "bcdaf527105cc26f", "pseudo-id": "354730010002552",
        "brand": "samsung", "scale": "2.0", "timezone": "28800",
        "language": "zh", "cns": "3", "carrier": "Android",
        # "imsi": "310260000000000",
        "user-agent": "Mozilla/5.0 (Linux; Android 5.1.1; SM-G955N Build/NRD90M) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
                      "Chrome/39.0.0.0 Mobile Safari/537.36",
        "lon": "105.566938", "lat": "29.99831", "cid": "512000",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate", "Connection": "Keep-Alive",
        # "Cookie": "duid=58349118",
        "Host": "api.douguo.net",
        # "Content-Length": "65",
    }
    response = requests.post(url=url, headers=header, data=data)
    return response


def handle_index():
    url = "http://api.douguo.net/recipe/flatcatalogs"
    # client=4&_session=1547000257341354730010002552&v=1503650468&_vs=0
    data = {
        "client": "4",
        "_session": "1547000257341354730010002552",
        "v": "1503650468",
        "_vs": "0",
    }
    response = handle_request(url, data)
    index_response_dic = json.loads(response.text)
    for item_index in index_response_dic["result"]["cs"]:
        for item_index_cs in item_index["cs"]:
            for item in item_index_cs["cs"]:
                data_2 = {
                    "client": "4",
                    # "_session": "1547000257341354730010002552",
                    "keyword": item["name"],
                    "_vs": "400",
                    "order": "0",
                }
                queue_list.put(data_2)


def handle_caipu_list(data):
    print("current ingredient:", data["keyword"])
    caipu_list_url = "http://api.douguo.net/recipe/s/0/20"
    caipu_response = handle_request(caipu_list_url, data)
    caipu_response_dict = json.loads(caipu_response.text)
    for caipu_item in caipu_response_dict["result"]["list"]:
        caipu_info = {}
        caipu_info["shicai"] = data["keyword"]
        if caipu_item["type"] == 13:  # only type-13 items carry recipe data
            caipu_info["user_name"] = caipu_item["r"]["an"]
            caipu_info["shicai_id"] = caipu_item["r"]["id"]
            caipu_info["describe"] = (caipu_item["r"]["cookstory"]
                                      .replace("\n", "").replace(" ", ""))
            caipu_info["caipu_name"] = caipu_item["r"]["n"]
            caipu_info["zuoliao_list"] = caipu_item["r"]["major"]
            print(caipu_info)
        else:
            continue


handle_index()
handle_caipu_list(queue_list.get())
```

Details of an individual dish

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019-1-9  @Author: lm  @Url: idig8.com
# @File: spider_douguomeishi.py  @Software: PyCharm
import json
import requests
from multiprocessing import Queue

# create queue
queue_list = Queue()


def handle_request(url, data):
    header = {
        "client": "4", "version": "6916.2", "device": "SM-G955N",
        "sdk": "22,5.1.1", "imei": "354730010002552", "channel": "zhuzhan",
        "mac": "00:FF:E2:A2:7B:58", "resolution": "1440,900", "dpi": "2.0",
        "android-id": "bcdaf527105cc26f", "pseudo-id": "354730010002552",
        "brand": "samsung", "scale": "2.0", "timezone": "28800",
        "language": "zh", "cns": "3", "carrier": "Android",
        # "imsi": "310260000000000",
        "user-agent": "Mozilla/5.0 (Linux; Android 5.1.1; SM-G955N Build/NRD90M) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
                      "Chrome/39.0.0.0 Mobile Safari/537.36",
        "lon": "105.566938", "lat": "29.99831", "cid": "512000",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate", "Connection": "Keep-Alive",
        # "Cookie": "duid=58349118",
        "Host": "api.douguo.net",
        # "Content-Length": "65",
    }
    response = requests.post(url=url, headers=header, data=data)
    return response


def handle_index():
    url = "http://api.douguo.net/recipe/flatcatalogs"
    # client=4&_session=1547000257341354730010002552&v=1503650468&_vs=0
    data = {
        "client": "4",
        "_session": "1547000257341354730010002552",
        "v": "1503650468",
        "_vs": "0",
    }
    response = handle_request(url, data)
    index_response_dic = json.loads(response.text)
    for item_index in index_response_dic["result"]["cs"]:
        for item_index_cs in item_index["cs"]:
            for item in item_index_cs["cs"]:
                data_2 = {
                    "client": "4",
                    # "_session": "1547000257341354730010002552",
                    "keyword": item["name"],
                    "_vs": "400",
                    "order": "0",
                }
                queue_list.put(data_2)


def handle_caipu_list(data):
    print("current ingredient:", data["keyword"])
    caipu_list_url = "http://api.douguo.net/recipe/s/0/20"
    caipu_response = handle_request(caipu_list_url, data)
    caipu_response_dict = json.loads(caipu_response.text)
    for caipu_item in caipu_response_dict["result"]["list"]:
        caipu_info = {}
        caipu_info["shicai"] = data["keyword"]
        if caipu_item["type"] == 13:  # only type-13 items carry recipe data
            caipu_info["user_name"] = caipu_item["r"]["an"]
            caipu_info["shicai_id"] = caipu_item["r"]["id"]
            caipu_info["describe"] = (caipu_item["r"]["cookstory"]
                                      .replace("\n", "").replace(" ", ""))
            caipu_info["caipu_name"] = caipu_item["r"]["n"]
            caipu_info["zuoliao_list"] = caipu_item["r"]["major"]
            # print(caipu_info)
            detail_url = ("http://api.douguo.net/recipe/detail/"
                          + str(caipu_info["shicai_id"]))
            detail_data = {
                "client": "4",
                "_session": "1547000257341354730010002552",
                "author_id": "0",
                "_vs": "2803",
                "ext": ('{"query": {"kw": "' + data["keyword"]
                        + '", "src": "2803", "idx": "1", "type": "13", "id": '
                        + str(caipu_info["shicai_id"]) + '}}'),
            }
            detail_reponse = handle_request(detail_url, detail_data)
            detail_reponse_dic = json.loads(detail_reponse.text)
            caipu_info["tips"] = detail_reponse_dic["result"]["recipe"]["tips"]
            caipu_info["cookstep"] = detail_reponse_dic["result"]["recipe"]["cookstep"]
            print(json.dumps(caipu_info))
        else:
            continue


handle_index()
handle_caipu_list(queue_list.get())
```
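A side note: the hand-concatenated `ext` string above is fragile. A variant using `json.dumps` (my sketch, not the original; near-equivalent, though the `id` is emitted as a quoted string here) avoids quoting mistakes inside `handle_caipu_list`:

```python
import json

# build the nested ext parameter without manual quote juggling;
# data["keyword"] and caipu_info["shicai_id"] come from the surrounding loop
detail_data["ext"] = json.dumps({
    "query": {
        "kw": data["keyword"],
        "src": "2803",
        "idx": "1",
        "type": "13",
        "id": str(caipu_info["shicai_id"]),
    }
})
```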

Save the data in MongoDB

Install a virtual machine via Vagrant.

Run `vagrant up`, then enter the virtual machine.

IP: 192.168.66.100

`su -`  # password: vagrant

Pull the MongoDB image

https://hub.docker.com/r/bitnami/mongodb

Default port: 27017

```bash
docker pull bitnami/mongodb:latest
```

Create the MongoDB container:

```bash
mkdir bitnami
cd bitnami
mkdir mongodb
docker run -d -v /path/to/mongodb-persistence:/root/bitnami -p 27017:27017 bitnami/mongodb:latest
```

Turn off the firewall:

```bash
systemctl stop firewalld
```
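Before wiring the spider to Mongo, a quick connectivity check from the host (my sketch; assumes pymongo is installed and the VM address above) confirms the container is reachable:

```python
import pymongo

# connect to the Bitnami MongoDB container inside the Vagrant VM
client = pymongo.MongoClient(host="192.168.66.100", port=27017,
                             serverSelectionTimeoutMS=3000)
# server_info() raises ServerSelectionTimeoutError if the server is unreachable
print(client.server_info()["version"])
```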

Use a third-party tool to connect to MongoDB and verify it. The script below (handle_mongo.py) wraps the connection:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019-1-11  @Author: liming  @url: idig8.com
# @File: handle_mongo.py  @Software: PyCharm
import pymongo
from pymongo.collection import Collection


class Connect_mongo(object):
    def __init__(self):
        self.client = pymongo.MongoClient(host="192.168.66.100", port=27017)
        self.db_data = self.client["dou_guo_mei_shi"]

    def insert_item(self, item):
        db_collection = Collection(self.db_data, "dou_guo_mei_shi_item")
        db_collection.insert(item)


# exposed instance, imported by the spider
mongo_info = Connect_mongo()
```
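A quick write test (mine, assuming handle_mongo.py sits next to the spider) verifies inserts end to end:

```python
from handle_mongo import mongo_info

# insert a throwaway document, then look for it in the third-party client
mongo_info.insert_item({"shicai": "smoke-test", "caipu_name": "smoke-test"})
```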

The data crawled by Python is saved, through the pymongo tooling, into MongoDB running in the Docker container on CentOS 7.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019-1-9  @Author: lm  @Url: idig8.com
# @File: spider_douguomeishi.py  @Software: PyCharm
import json
import requests
from multiprocessing import Queue
from handle_mongo import mongo_info

# create queue
queue_list = Queue()


def handle_request(url, data):
    header = {
        "client": "4", "version": "6916.2", "device": "SM-G955N",
        "sdk": "22,5.1.1", "imei": "354730010002552", "channel": "zhuzhan",
        "mac": "00:FF:E2:A2:7B:58", "resolution": "1440,900", "dpi": "2.0",
        "android-id": "bcdaf527105cc26f", "pseudo-id": "354730010002552",
        "brand": "samsung", "scale": "2.0", "timezone": "28800",
        "language": "zh", "cns": "3", "carrier": "Android",
        # "imsi": "310260000000000",
        "user-agent": "Mozilla/5.0 (Linux; Android 5.1.1; SM-G955N Build/NRD90M) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
                      "Chrome/39.0.0.0 Mobile Safari/537.36",
        "lon": "105.566938", "lat": "29.99831", "cid": "512000",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate", "Connection": "Keep-Alive",
        # "Cookie": "duid=58349118",
        "Host": "api.douguo.net",
        # "Content-Length": "65",
    }
    response = requests.post(url=url, headers=header, data=data)
    return response


def handle_index():
    url = "http://api.douguo.net/recipe/flatcatalogs"
    # client=4&_session=1547000257341354730010002552&v=1503650468&_vs=0
    data = {
        "client": "4",
        "_session": "1547000257341354730010002552",
        "v": "1503650468",
        "_vs": "0",
    }
    response = handle_request(url, data)
    index_response_dic = json.loads(response.text)
    for item_index in index_response_dic["result"]["cs"]:
        for item_index_cs in item_index["cs"]:
            for item in item_index_cs["cs"]:
                data_2 = {
                    "client": "4",
                    # "_session": "1547000257341354730010002552",
                    "keyword": item["name"],
                    "_vs": "400",
                    "order": "0",
                }
                queue_list.put(data_2)


def handle_caipu_list(data):
    print("current ingredient:", data["keyword"])
    caipu_list_url = "http://api.douguo.net/recipe/s/0/20"
    caipu_response = handle_request(caipu_list_url, data)
    caipu_response_dict = json.loads(caipu_response.text)
    for caipu_item in caipu_response_dict["result"]["list"]:
        caipu_info = {}
        caipu_info["shicai"] = data["keyword"]
        if caipu_item["type"] == 13:  # only type-13 items carry recipe data
            caipu_info["user_name"] = caipu_item["r"]["an"]
            caipu_info["shicai_id"] = caipu_item["r"]["id"]
            caipu_info["describe"] = (caipu_item["r"]["cookstory"]
                                      .replace("\n", "").replace(" ", ""))
            caipu_info["caipu_name"] = caipu_item["r"]["n"]
            caipu_info["zuoliao_list"] = caipu_item["r"]["major"]
            detail_url = ("http://api.douguo.net/recipe/detail/"
                          + str(caipu_info["shicai_id"]))
            detail_data = {
                "client": "4",
                "_session": "1547000257341354730010002552",
                "author_id": "0",
                "_vs": "2803",
                "ext": ('{"query": {"kw": "' + data["keyword"]
                        + '", "src": "2803", "idx": "1", "type": "13", "id": '
                        + str(caipu_info["shicai_id"]) + '}}'),
            }
            detail_reponse = handle_request(detail_url, detail_data)
            detail_reponse_dic = json.loads(detail_reponse.text)
            caipu_info["tips"] = detail_reponse_dic["result"]["recipe"]["tips"]
            caipu_info["cookstep"] = detail_reponse_dic["result"]["recipe"]["cookstep"]
            # print(json.dumps(caipu_info))
            mongo_info.insert_item(caipu_info)
        else:
            continue


handle_index()
handle_caipu_list(queue_list.get())
```
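After a run, a count query (my sketch, using the database and collection names from handle_mongo.py) shows how many recipes have been stored:

```python
import pymongo

client = pymongo.MongoClient(host="192.168.66.100", port=27017)
# count the documents the spider has written so far
# (count_documents needs pymongo 3.7+; older versions use count())
print(client["dou_guo_mei_shi"]["dou_guo_mei_shi_item"].count_documents({}))
```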

Crawl with Python 3 multithreading: a thread pool via `from concurrent.futures import ThreadPoolExecutor`

Using the thread pool

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019-1-9  @Author: lm  @Url: idig8.com
# @File: spider_douguomeishi.py  @Software: PyCharm
import json
import requests
from multiprocessing import Queue
from handle_mongo import mongo_info
from concurrent.futures import ThreadPoolExecutor

# create queue
queue_list = Queue()


def handle_request(url, data):
    header = {
        "client": "4", "version": "6916.2", "device": "SM-G955N",
        "sdk": "22,5.1.1", "imei": "354730010002552", "channel": "zhuzhan",
        "mac": "00:FF:E2:A2:7B:58", "resolution": "1440,900", "dpi": "2.0",
        "android-id": "bcdaf527105cc26f", "pseudo-id": "354730010002552",
        "brand": "samsung", "scale": "2.0", "timezone": "28800",
        "language": "zh", "cns": "3", "carrier": "Android",
        # "imsi": "310260000000000",
        "user-agent": "Mozilla/5.0 (Linux; Android 5.1.1; SM-G955N Build/NRD90M) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
                      "Chrome/39.0.0.0 Mobile Safari/537.36",
        "lon": "105.566938", "lat": "29.99831", "cid": "512000",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate", "Connection": "Keep-Alive",
        # "Cookie": "duid=58349118",
        "Host": "api.douguo.net",
        # "Content-Length": "65",
    }
    response = requests.post(url=url, headers=header, data=data)
    return response


def handle_index():
    url = "http://api.douguo.net/recipe/flatcatalogs"
    # client=4&_session=1547000257341354730010002552&v=1503650468&_vs=0
    data = {
        "client": "4",
        "_session": "1547000257341354730010002552",
        "v": "1503650468",
        "_vs": "0",
    }
    response = handle_request(url, data)
    index_response_dic = json.loads(response.text)
    for item_index in index_response_dic["result"]["cs"]:
        for item_index_cs in item_index["cs"]:
            for item in item_index_cs["cs"]:
                data_2 = {
                    "client": "4",
                    # "_session": "1547000257341354730010002552",
                    "keyword": item["name"],
                    "_vs": "400",
                    "order": "0",
                }
                queue_list.put(data_2)


def handle_caipu_list(data):
    print("current ingredient:", data["keyword"])
    caipu_list_url = "http://api.douguo.net/recipe/s/0/20"
    caipu_response = handle_request(caipu_list_url, data)
    caipu_response_dict = json.loads(caipu_response.text)
    for caipu_item in caipu_response_dict["result"]["list"]:
        caipu_info = {}
        caipu_info["shicai"] = data["keyword"]
        if caipu_item["type"] == 13:  # only type-13 items carry recipe data
            caipu_info["user_name"] = caipu_item["r"]["an"]
            caipu_info["shicai_id"] = caipu_item["r"]["id"]
            caipu_info["describe"] = (caipu_item["r"]["cookstory"]
                                      .replace("\n", "").replace(" ", ""))
            caipu_info["caipu_name"] = caipu_item["r"]["n"]
            caipu_info["zuoliao_list"] = caipu_item["r"]["major"]
            detail_url = ("http://api.douguo.net/recipe/detail/"
                          + str(caipu_info["shicai_id"]))
            detail_data = {
                "client": "4",
                "_session": "1547000257341354730010002552",
                "author_id": "0",
                "_vs": "2803",
                "ext": ('{"query": {"kw": "' + data["keyword"]
                        + '", "src": "2803", "idx": "1", "type": "13", "id": '
                        + str(caipu_info["shicai_id"]) + '}}'),
            }
            detail_reponse = handle_request(detail_url, detail_data)
            detail_reponse_dic = json.loads(detail_reponse.text)
            caipu_info["tips"] = detail_reponse_dic["result"]["recipe"]["tips"]
            caipu_info["cookstep"] = detail_reponse_dic["result"]["recipe"]["cookstep"]
            # print(json.dumps(caipu_info))
            mongo_info.insert_item(caipu_info)
        else:
            continue


handle_index()
pool = ThreadPoolExecutor(max_workers=20)
while queue_list.qsize() > 0:
    pool.submit(handle_caipu_list, queue_list.get())
```
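One caveat: `qsize()` on a `multiprocessing.Queue` is only approximate (and raises `NotImplementedError` on macOS), so a drain loop with a timeout is a touch more robust, and `shutdown(wait=True)` makes the script wait for in-flight fetches. My variant, not the original:

```python
from queue import Empty

while True:
    try:
        data = queue_list.get(timeout=5)  # give up once the queue stays empty
    except Empty:
        break
    pool.submit(handle_caipu_list, data)
pool.shutdown(wait=True)  # wait for in-flight recipe fetches to finish
```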

Hide the crawler by using a proxy IP

If the app's operations staff notice that we keep requesting their server, they will most likely block our IP. Use a proxy IP to hide yourself.

Register and apply at abuyun.com

It costs 1 yuan per hour. I applied for one hour; let's use it together.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019-1-11  @Author: Aries
# @File: handle_proxy.py  @Software: PyCharm
# check the outgoing IP as seen by the server;
# 60.17.177.187 -- the IP echoed back through the proxy
import requests

url = 'http://ip.do.cn/ip'
proxy = {'http': 'http://H79623F667Q3936C:84F1527F3EE09817@http-cla.abuyun.com:9030'}
response = requests.get(url=url, proxies=proxy)
print(response.text)
```
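One detail worth remembering with requests: the proxy entry is chosen by the target URL's scheme, so if you ever hit an https endpoint you need an `https` entry as well. A sketch with placeholder credentials:

```python
# requests picks the proxy entry that matches the target URL's scheme
proxy = {
    "http": "http://USER:PASS@http-cla.abuyun.com:9030",
    "https": "http://USER:PASS@http-cla.abuyun.com:9030",  # same tunnel for https targets
}
```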

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2019-1-9  @Author: lm  @Url: idig8.com
# @File: spider_douguomeishi.py  @Software: PyCharm
import json
import requests
from multiprocessing import Queue
from handle_mongo import mongo_info
from concurrent.futures import ThreadPoolExecutor

# create queue
queue_list = Queue()


def handle_request(url, data):
    header = {
        "client": "4", "version": "6916.2", "device": "SM-G955N",
        "sdk": "22,5.1.1", "imei": "354730010002552", "channel": "zhuzhan",
        "mac": "00:FF:E2:A2:7B:58", "resolution": "1440,900", "dpi": "2.0",
        "android-id": "bcdaf527105cc26f", "pseudo-id": "354730010002552",
        "brand": "samsung", "scale": "2.0", "timezone": "28800",
        "language": "zh", "cns": "3", "carrier": "Android",
        # "imsi": "310260000000000",
        "user-agent": "Mozilla/5.0 (Linux; Android 5.1.1; SM-G955N Build/NRD90M) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 "
                      "Chrome/39.0.0.0 Mobile Safari/537.36",
        "lon": "105.566938", "lat": "29.99831", "cid": "512000",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate", "Connection": "Keep-Alive",
        # "Cookie": "duid=58349118",
        "Host": "api.douguo.net",
        # "Content-Length": "65",
    }
    # route every request through the abuyun proxy
    proxy = {'http': 'http://H79623F667Q3936C:84F1527F3EE09817@http-cla.abuyun.com:9030'}
    response = requests.post(url=url, headers=header, data=data, proxies=proxy)
    return response


def handle_index():
    url = "http://api.douguo.net/recipe/flatcatalogs"
    # client=4&_session=1547000257341354730010002552&v=1503650468&_vs=0
    data = {
        "client": "4",
        "_session": "1547000257341354730010002552",
        "v": "1503650468",
        "_vs": "0",
    }
    response = handle_request(url, data)
    index_response_dic = json.loads(response.text)
    for item_index in index_response_dic["result"]["cs"]:
        for item_index_cs in item_index["cs"]:
            for item in item_index_cs["cs"]:
                data_2 = {
                    "client": "4",
                    # "_session": "1547000257341354730010002552",
                    "keyword": item["name"],
                    "_vs": "400",
                    "order": "0",
                }
                queue_list.put(data_2)


def handle_caipu_list(data):
    print("current ingredient:", data["keyword"])
    caipu_list_url = "http://api.douguo.net/recipe/s/0/20"
    caipu_response = handle_request(caipu_list_url, data)
    caipu_response_dict = json.loads(caipu_response.text)
    for caipu_item in caipu_response_dict["result"]["list"]:
        caipu_info = {}
        caipu_info["shicai"] = data["keyword"]
        if caipu_item["type"] == 13:  # only type-13 items carry recipe data
            caipu_info["user_name"] = caipu_item["r"]["an"]
            caipu_info["shicai_id"] = caipu_item["r"]["id"]
            caipu_info["describe"] = (caipu_item["r"]["cookstory"]
                                      .replace("\n", "").replace(" ", ""))
            caipu_info["caipu_name"] = caipu_item["r"]["n"]
            caipu_info["zuoliao_list"] = caipu_item["r"]["major"]
            detail_url = ("http://api.douguo.net/recipe/detail/"
                          + str(caipu_info["shicai_id"]))
            detail_data = {
                "client": "4",
                "_session": "1547000257341354730010002552",
                "author_id": "0",
                "_vs": "2803",
                "ext": ('{"query": {"kw": "' + data["keyword"]
                        + '", "src": "2803", "idx": "1", "type": "13", "id": '
                        + str(caipu_info["shicai_id"]) + '}}'),
            }
            detail_reponse = handle_request(detail_url, detail_data)
            detail_reponse_dic = json.loads(detail_reponse.text)
            caipu_info["tips"] = detail_reponse_dic["result"]["recipe"]["tips"]
            caipu_info["cookstep"] = detail_reponse_dic["result"]["recipe"]["cookstep"]
            # print(json.dumps(caipu_info))
            mongo_info.insert_item(caipu_info)
        else:
            continue


handle_index()
# fewer workers here: the cheap proxy tunnel only allows limited concurrency
pool = ThreadPoolExecutor(max_workers=2)
while queue_list.qsize() > 0:
    pool.submit(handle_caipu_list, queue_list.get())
```

PS: This is an introduction to crawling app data. First, point the emulator's proxy at the local computer (with fiddler installed) so that fiddler can capture the traffic. To analyze the data you have to find the right URL, which mostly comes down to experience; once you can pick out the URL, half the crawler is written. Next, encapsulate the request header obtained through fiddler. It contains a lot of fields, and trimming them by trial and error is part of dealing with anti-crawler measures: some fields, such as cookies, make a crawler easy to detect, yet some data cannot be crawled without them. Then set a proxy IP so that the same IP is not seen hammering one interface throughout the crawl and flagged as a crawler. Queues are introduced to make hand-off to the thread pool convenient, and the results are stored in MongoDB. That completes multithreaded crawling of app data.
