This article introduces how to use pyspider with Python. Many people have questions about working with pyspider in day-to-day crawling, so the notes below collect a simple, practical set of steps; hopefully they help resolve those doubts.
Summary: now that the basics of crawlers are out of the way, let's write one with a framework. Using a framework makes writing a crawler much easier, so let's take a look at how the pyspider framework is used.
Preliminary preparation:
1. Install pyspider: pip3 install pyspider
2. Install PhantomJS: download it from the official website, unzip it, and copy phantomjs.exe into the Scripts folder of your Python installation.
Download address: https://phantomjs.org/download.html
Official API address: http://www.pyspider.cn/book/pyspider/self.crawl-16.html
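Once both steps are done, a quick sanity check can confirm that pyspider imports and that PhantomJS is reachable on PATH. This is only a minimal sketch under the assumptions above; the __version__ attribute is read defensively in case it is absent.

import shutil
import pyspider

# Confirm the pyspider package is importable; fall back gracefully if __version__ is missing.
print("pyspider version:", getattr(pyspider, "__version__", "unknown"))

# Confirm the phantomjs executable can be found on PATH (None means it was not found).
print("phantomjs found at:", shutil.which("phantomjs"))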
2. Usage (here is only a brief introduction. For more information, please see the official documentation):
1. Start pyspider first
Run pyspider all in a terminal window. The output reminds us that the web UI is listening on http://localhost:5000 (be careful not to close this window!).
Open http://localhost:5000 in a browser and click Create on the right to create a project: Project Name is the name of the project, Start URL(s) is the address you want to crawl. Then click Create to create the project.
2. Understand the format and process of pyspider
After creating the project, we will see the default script template:
The Handler class is the core of a pyspider project. It inherits from BaseHandler, and all of the crawling logic lives in this one Handler.
crawl_config = {} holds the global configuration for the project; for example, you can define the headers we covered earlier here.
The @every(minutes=24*60) decorator sets the interval between crawls; minutes=24*60 is one day, i.e. crawl once a day.
on_start method: the entry point of the crawl; it sends requests by calling the crawl method.
callback=self.index_page: callback specifies the callback function, i.e. the response of the request is handed to index_page for processing.
@config(age=10 * 24 * 60 * 60) sets the validity period of a task to 10 days, i.e. the same page will not be crawled again within 10 days.
index_page function: the processing function; its response parameter is the content returned by the requested page.
The next part is worth understanding: the pyquery parser (which pyspider exposes directly as response.doc) extracts all the links on the page, each of them is requested in turn, and detail_page is called to process them and pull out what we want.
@config(priority=2): priority sets the crawl priority; if it is not set, the default is 0, and a higher number is crawled first.
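Putting the pieces above together, a minimal Handler looks like the sketch below. It mirrors the default template pyspider generates; the Maoyan URL comes from the examples in this article, while the header value and the link/title selectors are only illustrative assumptions.

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # Global configuration applied to every request, e.g. custom headers (illustrative value).
    crawl_config = {
        'headers': {'User-Agent': 'Mozilla/5.0'},
    }

    @every(minutes=24 * 60)           # run the entry point once a day
    def on_start(self):
        self.crawl('http://maoyan.com/board/4', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)    # a task stays valid for 10 days
    def index_page(self, response):
        # response.doc is a pyquery object; follow every absolute link on the page.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)               # a higher number is crawled first
    def detail_page(self, response):
        # Return the fields we care about; the selectors here are assumptions.
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }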
Other parameters of crawl (a combined example follows this list):
exetime: the time at which the task should be executed, e.g. one hour from now:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, exetime=time.time() + 60*60)
retries: the number of retries; the default is 3.
auto_recrawl: when set to True, the task is executed again once it expires, i.e. when its age is up.
method: the HTTP request method; GET by default.
params: the GET query parameters, passed as a dictionary:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, params={'a': '123'})
data: the form data of a POST request, also a dictionary.
files: files to upload:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, method='POST', files={'filename': 'xxx'})
user_agent: the User-Agent of the request.
headers: the request headers, as a dictionary.
cookies: the cookies of the request, in dictionary form.
connect_timeout: the maximum wait time for the initial connection; the default is 20 seconds.
timeout: the maximum time allowed for fetching a page; the default is 120 seconds.
proxy: the proxy server to use, in the form username:password@hostname:port (only HTTP proxies are supported).
fetch_type: set to 'js' to crawl JavaScript-rendered pages; PhantomJS must be installed for this.
js_script: along the same lines, you can run your own JavaScript on the page, for example:
self.crawl('http://maoyan.com/board/4', callback=self.detail_page, js_script='''
function() {
    alert('123');
}
''')
js_run_at: used together with js_script above; sets whether the script runs at the start or the end of page loading (document-start or document-end).
load_images: whether to load images when rendering JavaScript pages; the default is False.
save: passes parameters between different methods:
def on_start(self):
    self.crawl('http://maoyan.com/board/4', callback=self.index_page, save={'a': '1'})

def index_page(self, response):
    return response.save['a']
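As noted at the top of this list, these parameters can be combined in a single self.crawl call. The sketch below simply piles several of them together for illustration; the query value, header, cookie, and proxy address are placeholders, not values taken from the official documentation.

def on_start(self):
    self.crawl(
        'http://maoyan.com/board/4',
        callback=self.index_page,
        params={'a': '123'},                    # GET query parameters (placeholder)
        headers={'User-Agent': 'Mozilla/5.0'},  # request headers (placeholder)
        cookies={'session': 'xxx'},             # cookies (placeholder)
        proxy='localhost:8899',                 # proxy as host:port (placeholder)
        retries=5,                              # retry up to 5 times instead of the default 3
        connect_timeout=20,                     # seconds to wait for the connection
        timeout=120,                            # seconds to wait for the whole fetch
        fetch_type='js',                        # render JavaScript with PhantomJS
        load_images=False,                      # skip images while rendering
        save={'a': '1'},                        # available as response.save in the callback
    )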
3. After saving, the project appears in the project list on the dashboard.
status indicates the project status:
TODO: a project that has just been created.
STOP: the project is stopped.
CHECKING: a running project that has been modified.
DEBUG/RUNNING: both mean the project is running; DEBUG can be thought of as a test mode.
PAUSE: too many errors occurred, so the project has been paused.
How do I delete a project?
Change group to delete and status to STOP; the project is deleted automatically after 24 hours.
Actions:
Run starts the crawl.
Active Tasks shows the requests.
Results shows the results.
rate/burst:
1/3 means a rate of 1 request per second with a burst of 3.
progress: shows the progress of the crawl.
At this point, this study of how to use python pyspider is over. Hopefully it has resolved your doubts; pairing theory with practice is the best way to learn, so go and try it out!