This article mainly explains the method behind a Python distributed crawler. The content is simple and clear, and easy to learn and understand; follow along step by step to study how the crawler works.
Environment and architecture:
Development language: Python 2.7
Development environment: 64-bit Windows 8, 4 GB RAM, Intel Core i7-3612QM processor
Database: MongoDB 3.2.0, Redis 3.0.501
(Python IDE: PyCharm; MongoDB management tool: MongoBooster; Redis management tool: RedisStudio)
The crawler framework is Scrapy, with scrapy_redis and Redis providing the distributed implementation.
In this distributed system, one machine acts as the Master: it runs Redis and does nothing but schedule tasks. The remaining machines act as Slavers and simply fetch crawl tasks from the Master. The principle is this: when a Slaver runs Scrapy and generates a Request, the Request is not handed straight to the spider to crawl; instead, it is pushed to the Redis database on the Master. The Requests the spider does crawl are likewise taken from Redis. Redis therefore stores every incoming Request and hands it back out to whichever Slaver asks next, which is how the tasks are coordinated.
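As a concrete illustration, here is a minimal sketch of a scrapy_redis spider (the class name, spider name, and Redis key are illustrative assumptions, not the project's actual code). The spider blocks on a Redis list for its start URLs, and every Request it yields goes through the Redis scheduler on the Master rather than a local queue:

# Minimal scrapy_redis spider sketch (names are examples)
from scrapy_redis.spiders import RedisSpider

class WeiboSpider(RedisSpider):
    name = 'weibo'                  # example spider name
    redis_key = 'weibo:start_urls'  # Redis list the spider polls for seed URLs

    def parse(self, response):
        # Any Request yielded here is serialized into Redis on the Master,
        # so whichever Slaver is free next can pick it up.
        pass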
Instructions for use:
The Python environment needs Scrapy, pymongo and requests installed (for example, pip install Scrapy pymongo requests); json and base64 ship with the standard library.
The Master machine only needs Redis installed (it has the larger memory requirement), while each Slaver machine needs the Python environment and MongoDB to store the data. If you want all the data on one machine, you can simply change the MongoDB IP in the crawler's pipeline; better still, build a MongoDB cluster. Both Redis and MongoDB work with their default configuration once installed.
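For reference, here is a minimal sketch of such a MongoDB pipeline (the class name, database name, and collection name are illustrative assumptions, not the project's actual pipeline):

# Minimal MongoDB pipeline sketch (names are examples)
import pymongo

class MongoPipeline(object):
    # Point this at a shared machine (or a cluster) to centralize the data
    # instead of writing to each Slaver's local MongoDB.
    MONGO_HOST = 'localhost'
    MONGO_PORT = 27017

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.MONGO_HOST, self.MONGO_PORT)
        self.db = self.client['weibo']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['information'].insert_one(dict(item))
        return item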
Add the Weibo accounts and passwords you use to log in to the cookies.py file; two accounts are already in it as format references.
You can modify the Scrapy settings in settings.py, such as the crawl interval, the log level, the Redis IP, and so on.
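The scrapy_redis-related entries in settings.py typically look like the following (the values shown here are examples only, not the project's actual configuration):

# Example settings.py entries for a scrapy_redis deployment
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # queue Requests in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedupe via Redis fingerprints
SCHEDULER_PERSIST = True        # keep the queue across runs
REDIS_HOST = '192.168.1.100'    # IP of the Master's Redis (example address)
REDIS_PORT = 6379
DOWNLOAD_DELAY = 10             # interval between requests, in seconds
LOG_LEVEL = 'INFO'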
After the above configuration, you can run Begin.py. To reiterate: the Master machine does not run the crawler program; its only job is task scheduling through Redis. The Slaver machines run the crawler. To add another Slaver, just set up the Python environment and MongoDB on it, copy the code over, and run it directly.
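The article does not show the contents of Begin.py, but a launcher of this kind is usually a one-liner that starts the spider by name (the spider name 'weibo' is an assumption here):

# Begin.py - minimal launcher sketch (spider name is an example)
from scrapy import cmdline

cmdline.execute('scrapy crawl weibo'.split())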
Project source code
# encoding=utf-8
import json
import base64
import requests

"""
Enter your own Weibo accounts and passwords here; you can buy spare
accounts on Taobao, and it is recommended to buy dozens: Weibo's
anti-scraping is fierce, and logging in too often triggers 302 redirects.
Alternatively, lengthen the crawl interval a little.
"""
myWeiBo = [
    {'no': 'jiadieyuso3319@163.com', 'psw': 'a123456'},
    {'no': 'shudieful3618@163.com', 'psw': 'a123456'},
]


def getCookies(weibo):
    """ Get Cookies """
    cookies = []
    loginURL = r'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.15)'
    for elem in weibo:
        account = elem['no']
        password = elem['psw']
        # The SSO endpoint expects the username base64-encoded.
        username = base64.b64encode(account.encode('utf-8')).decode('utf-8')
        postData = {
            "entry": "sso",
            "gateway": "1",
            "from": "null",
            "savestate": "30",
            "useticket": "0",
            "pagerefer": "",
            "vsnf": "1",
            "su": username,
            "service": "sso",
            "sp": password,
            "sr": "1440,900",
            "encoding": "UTF-8",
            "cdult": "3",
            "domain": "sina.com.cn",
            "prelt": "0",
            "returntype": "TEXT",
        }
        session = requests.Session()
        r = session.post(loginURL, data=postData)
        jsonStr = r.content.decode('gbk')
        info = json.loads(jsonStr)
        if info["retcode"] == "0":
            print "Get Cookie Success! (Account:%s)" % account
            cookie = session.cookies.get_dict()
            cookies.append(cookie)
        else:
            print "Failed! (Reason:%s)" % info['reason']
    return cookies


cookies = getCookies(myWeiBo)
print "Get Cookies Finish! (Num:%d)" % len(cookies)

Thank you for reading; this has been "What is the method of the Python distributed crawler". After studying this article, you should have a deeper understanding of the topic, though specific usage still needs to be verified in practice.