2025-04-06 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article introduces how to crawl COS (cosplay) images from Bcy (bcy.net) with Python. It is quite detailed and should be a useful reference; interested readers, read on!
Today, while browsing around, a random link dropped me onto the Bcy site, https://bcy.net/. At first glance there was nothing interesting, but professional instinct told me a site like this had to have cosplay content, so I got my crawler ready.
After opening the link above, I found it; my hunch was right. The next step is to find the entry point. You have to locate where the image links come from before anything else can happen.
Keep dragging the page down and it keeps loading more content; after a while it stops. That is the moment to pay attention.
I found the entrance (in practice I found quite a few other entry points as well, but I won't go through them one by one, so let's get moving). After clicking "view more", the page is still a pull-down-to-load layout, known technically as a waterfall flow (infinite scroll).
Bcy COS image crawling - the first step of the Python crawler
Open the developer tools, switch to the Network tab, and you will see a lot of XHR requests. Finding these means the site is easy to crawl.
Extract the links to be crawled and analyze the pattern:
https://bcy.net/circle/timeline/loadtag?since=0&grid_type=timeline&tag_id=1482&sort=hot
https://bcy.net/circle/timeline/loadtag?since=26499.779&grid_type=timeline&tag_id=1482&sort=hot
https://bcy.net/circle/timeline/loadtag?since=26497.945&grid_type=timeline&tag_id=1482&sort=hot
Only one parameter, since, changes between requests, and the change does not seem to follow any obvious rule. No matter: look at the response data and the secret reveals itself.
The site's principle is very simple: each request returns a batch of items, and the since value of the last item in the batch is used to request the next batch. So we can implement the code according to this rule. Multithreading is out, because this chained pattern leaves it no way to work.
This time I store the data in MongoDB: I can't grab everything in one go, so I may need to pick up where I left off next time.
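The since-chaining described above can be sketched as a pure function, with a stubbed page fetcher standing in for the real loadtag endpoint (the page contents and since values below are invented for illustration):

```python
def crawl_since_chain(fetch_page, start_since=0, max_pages=10):
    """Follow a 'since'-style cursor: each batch's last item
    carries the cursor for the next request."""
    items = []
    since = start_since
    for _ in range(max_pages):
        data = fetch_page(since)      # one page of results
        if not data:                  # empty page: no more data
            break
        items.extend(data)
        since = data[-1]["since"]     # cursor for the next page
    return items

# Stubbed fetcher standing in for the real endpoint (fake data).
PAGES = {
    0: [{"id": 1, "since": 26499.779}, {"id": 2, "since": 26497.945}],
    26497.945: [{"id": 3, "since": 26490.120}],
    26490.120: [],
}
print([i["id"] for i in crawl_since_chain(lambda s: PAGES.get(s, []), 0)])
```

Because the next cursor only exists once the previous response arrives, the requests cannot be parallelized, which is why multithreading does not apply here.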
```python
import time
import requests
from pymongo import MongoClient

# request headers; a generic User-Agent so the site does not reject us
headers = {"User-Agent": "Mozilla/5.0"}

if __name__ == '__main__':
    # some basic MongoDB setup
    DATABASE_IP = '127.0.0.1'
    DATABASE_PORT = 27017
    DATABASE_NAME = 'sun'
    start_url = "https://bcy.net/circle/timeline/loadtag?since={}&grid_type=timeline&tag_id=399&sort=recent"
    client = MongoClient(DATABASE_IP, DATABASE_PORT)
    db = client.sun
    db.authenticate("dba", "dba")
    collection = db.bcy  # the collection the scraped records go into
    get_data(start_url, collection)
```
With the experience from our earlier crawlers, fetching the page data is the simple part.

```python
# Bcy COS image crawling - the data acquisition function
def get_data(start_url, collection):
    since = 0
    while 1:
        try:
            with requests.Session() as s:
                response = s.get(start_url.format(str(since)),
                                 headers=headers, timeout=3)
            res_data = response.json()
            if res_data["status"] == 1:
                data = res_data["data"]  # the data array
                time.sleep(0.5)
                # data processing
                since = data[-1]["since"]  # since of the last of the 20 items
                ret = json_handle(data)    # json_handle is implemented below
                try:
                    print(ret)
                    collection.insert_many(ret)  # batch insert into the database
                    print("the above data was inserted successfully!")
                except Exception as e:
                    print("insert failed")
                    print(ret)
        except Exception as e:
            print("exception, please note!", end=" ")
            print(e, end=" ")
        else:
            print("cycle completed")
```
The JSON parsing code:

```python
# processing the JSON data
def json_handle(data):
    # extract the key fields
    list_infos = []
    for item in data:
        item = item["item_detail"]
        try:
            avatar = item["avatar"]          # user avatar
            item_id = item["item_id"]        # image detail page id
            like_count = item["like_count"]  # number of likes
            pic_num = item["pic_num"] if "pic_num" in item else 0  # total number of pictures
            reply_count = item["reply_count"]
            share_count = item["share_count"]
            uid = item["uid"]
            plain = item["plain"]
            uname = item["uname"]
            list_infos.append({"avatar": avatar,
                               "item_id": item_id,
                               "like_count": like_count,
                               "pic_num": pic_num,
                               "reply_count": reply_count,
                               "share_count": share_count,
                               "uid": uid,
                               "plain": plain,
                               "uname": uname})
        except Exception as e:
            print(e)
            continue
    return list_infos
```
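To sanity-check the field extraction, here is a standalone sketch of the same field-picking pattern run on a hand-made item (all values below are invented for illustration, not real bcy.net data):

```python
def extract_fields(item_detail):
    # pull out the fields the crawler stores; pic_num may be absent
    keys = ("avatar", "item_id", "like_count", "reply_count",
            "share_count", "uid", "plain", "uname")
    info = {k: item_detail[k] for k in keys}
    info["pic_num"] = item_detail.get("pic_num", 0)
    return info

sample = {"avatar": "http://example.com/a.jpg", "item_id": "123",
          "like_count": 7, "reply_count": 2, "share_count": 1,
          "uid": 42, "plain": "cosplay post", "uname": "alice"}
print(extract_fields(sample)["pic_num"])  # pic_num missing -> prints 0
```

Using dict.get with a default covers the case where pic_num is missing, which is what the `if "pic_num" in item else 0` in json_handle guards against.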
That's all there is to it; set the code running and the data starts coming in.
The above is the full content of "how Python crawls COS images from Bcy". Thank you for reading! I hope the content shared here helps you; for more related knowledge, follow the industry information channel!
© 2024 shulou.com SLNews company. All rights reserved.