2025-01-17 Update From: SLTechnology News&Howtos — Shulou (Shulou.com) 06/02 Report
What are the key points of using proxy IPs with the Scrapy framework? This article answers that question with a detailed analysis and working examples, in the hope of helping readers find a simple, practical solution.
The Scrapy framework provides a general-purpose, modular interface for data collection, along with custom extension points. It frees programmers from the tedious, repetitive plumbing of crawl workflows and offers a flexible, simple foundation to build on. For ordinary web-page collection, developers only need to focus on analyzing the site's data and its anti-crawling strategy; combined with proxy IPs, a project can get up and running quickly and efficiently.
Key features include:
1) Configurable concurrency: the number of concurrent requests is a setting, and requests are executed asynchronously.
2) XPath support, concise and efficient.
3) Support for custom middleware.
4) Support for crawling from a list of source URLs.
5) Standalone debugging via the interactive shell mode.
6) A pluggable item pipeline interface, so users can choose among multiple output targets such as text files or databases.
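As a quick illustration of feature 2, here is a minimal sketch of XPath-style extraction. Note this uses only the standard library's `xml.etree.ElementTree` (which supports a limited XPath subset) and a made-up HTML snippet; in a real spider you would call `response.xpath()` on Scrapy's own selectors instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML-like snippet for demonstration only
doc = """<html><body>
<div class="quote"><span class="text">Hello</span></div>
<div class="quote"><span class="text">World</span></div>
</body></html>"""

root = ET.fromstring(doc)
# XPath-style query: every <span> that is a direct child of a <div>
texts = [span.text for span in root.findall(".//div/span")]
print(texts)  # → ['Hello', 'World']
```

The same query in a Scrapy spider would be roughly `response.xpath('//div/span/text()').getall()`, with the selector engine handling real-world, non-well-formed HTML.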
There are several ways to use proxies in the Scrapy framework:
1. Scrapy downloader middleware
Create a middlewares.py file in the project (./project_name/middlewares.py) with the following content:
# -*- coding: utf-8 -*-
import base64
import random
import sys

PY3 = sys.version_info[0] >= 3

def base64ify(bytes_or_str):
    # On Python 3, encode str to bytes before base64-encoding
    if PY3 and isinstance(bytes_or_str, str):
        input_bytes = bytes_or_str.encode('utf8')
    else:
        input_bytes = bytes_or_str
    output_bytes = base64.urlsafe_b64encode(input_bytes)
    if PY3:
        return output_bytes.decode('ascii')
    else:
        return output_bytes

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Proxy server (product website: www.16yun.cn)
        proxyHost = "t.16yun.cn"
        proxyPort = "31111"
        # Proxy authentication credentials
        proxyUser = "username"
        proxyPass = "password"
        request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)
        # Add the authentication header
        encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
        # Set the IP-rotation (tunnel) header as needed
        tunnel = random.randint(1, 10000)
        request.headers['Proxy-Tunnel'] = str(tunnel)
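The Proxy-Authorization value that the middleware builds can be checked outside Scrapy. A minimal sketch, using the placeholder credentials from the middleware above and a Python 3-only version of the same helper:

```python
import base64

def base64ify(bytes_or_str):
    # Python 3-only mirror of the middleware helper above
    if isinstance(bytes_or_str, str):
        bytes_or_str = bytes_or_str.encode('utf8')
    return base64.urlsafe_b64encode(bytes_or_str).decode('ascii')

# Placeholder credentials; substitute your real proxy account
header_value = 'Basic ' + base64ify('username' + ':' + 'password')
print(header_value)  # → Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```

This is the standard HTTP Basic authentication scheme: the `Proxy-Authorization` header carries `Basic` followed by the base64-encoded `user:password` pair.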
Then modify the project configuration file (./project_name/settings.py) to register the middleware:
DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.ProxyMiddleware': 100,
}
2. Environment variable
Use the crawler proxy by setting an environment variable (Windows):
C:\> set http_proxy=http://username:password@ip:port
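On Linux or macOS, the equivalent is an `export` in the shell. This is a sketch with the same placeholders as above (`username`, `password`, `ip:port` are not real values); Scrapy's built-in HttpProxyMiddleware picks these variables up automatically.

```shell
# Set the proxy for tools that honor these variables (placeholders shown)
export http_proxy="http://username:password@ip:port"
export https_proxy="http://username:password@ip:port"
echo "$http_proxy"
```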
That covers the key points of using proxy IPs with the Scrapy framework. Hopefully the content above has been of some help; if you still have questions, you can follow the industry information channel to learn more.