
How to use AJAX and HTTP in pyspider crawler


This article explains how to use AJAX and HTTP in the pyspider crawler. The content is simple and clear, and easy to learn and understand.

AJAX

AJAX is an abbreviation for Asynchronous JavaScript and XML. AJAX uses existing web standards to exchange data with the server without reloading the entire page. On Sina Weibo, for example, you can expand a post's comments without reloading the page or opening a new one. But those comments are not included in the page at the start (otherwise the page would be too large); they are loaded only when you click. As a result, when you crawl the page, you cannot get these comments (because you have not "expanded" them).

A common use of AJAX is to load JSON data and render it on the browser side. If you can grab the JSON data directly, it is much easier to parse than HTML.

When a website uses AJAX, the page fetched by pyspider differs from what the browser shows. When you open such a page in a browser, or click "expand", you will often see "loading" or a similar icon/animation. For example, when you try to grab http://movie.douban.com/explore

You will find that the movies are stuck at "loading..."

Find the real request

Because AJAX actually transfers its data over HTTP, we can find the real request with Chrome Developer Tools and fetch that request directly to get the data.

Open a new window

Press Ctrl+Shift+I (press Cmd+Opt+I on Mac) to open the developer tools.

Switch to the Network panel

Open http://movie.douban.com/explore in that window

During the page load, you will see all resource requests in the panel.

AJAX generally sends requests through the XMLHttpRequest object, usually abbreviated XHR. Click the funnel-shaped filter button on the Network panel to show only XHR requests. Go through the requests one by one, checking the request path and the Preview tab, until you find the one that contains the information: http://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0

In Douban's case there are not many XHR requests, so they can be checked one by one. When there are many XHR requests, however, you may need to combine the timing of the triggering action, the request path, and other clues to pick out the key requests from the crowd. This takes experience with crawling or with front-end development, which is why I keep making the point that the way to learn crawling is to learn how to build websites.

You can now open http://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0 in a new window, and you will see the raw JSON data containing the movie information. It is recommended to install the JSONView plug-in (there is a Firefox version as well) so that you get nicely formatted JSON, collapsible nodes, and other conveniences. Then, based on the JSON data, we write a script to extract the movie title and rating:

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0',
                   callback=self.json_parser)

    def json_parser(self, response):
        return [{
            "title": x['title'],
            "rate": x['rate'],
            "url": x['url'],
        } for x in response.json['subjects']]

You can use response.json to convert the result into a Python dict.

You can get the complete code at http://demo.pyspider.org/debug/tutorial_douban_explore and debug it there. There is also a version of the script that uses PhantomJS rendering, which will be covered in the next tutorial.

HTTP

HTTP is the protocol used to transmit web content. In previous tutorials, we have already submitted URLs to be crawled through the self.crawl interface; these fetches are carried over the HTTP protocol.

During fetching, you may run into cases such as 403 Forbidden, or pages that require login; then you need the right HTTP request parameters to crawl the content.

A typical HTTP request packet sent to http://example.com/ looks like this:

GET / HTTP/1.1
Host: example.com
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.45 Safari/537.36
Referer: http://en.wikipedia.org/wiki/Example.com
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8
If-None-Match: "359670651"
If-Modified-Since: Fri, 09 Aug 2013 23:54:35 GMT

The first line of the request contains the method, the path, and the HTTP protocol version

The remaining lines are called headers and take the form of key: value pairs

If it is a POST request, there may be body content at the end of the request
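
For example, a made-up POST request submitting a login form might look like this (the path, field names, and values below are hypothetical):

POST /login HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 25

username=foo&password=bar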

You can see all of this information in the Chrome Developer Tools you used earlier.

Most of the time, the right combination of method, path, headers, and body will get you the information you need.

HTTP Method

The HTTP method tells the server what action you want performed on the URL resource. For example, GET is used when simply opening a URL, while POST is generally used when submitting data.

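As a rough sketch of how the two methods look in pyspider (the URLs and form fields below are made up; self.crawl accepts method and data parameters):

def on_start(self):
    # GET (the default method): simply fetch the page
    self.crawl('http://example.com/', callback=self.index_page)

    # POST: submit data; the `data` dict is sent as the request body
    self.crawl('http://example.com/login',
               method='POST',
               data={'username': 'foo', 'password': 'bar'},
               callback=self.after_login)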

HTTP Headers

HTTP headers are a list of parameters that accompany the request; a complete reference of commonly used headers is easy to find online. Some common ones to pay attention to are:

User-Agent

The UA is a string that identifies the browser or crawler being used. pyspider's default UA is pyspider/VERSION (+http://pyspider.org/). Websites often use this string to tell apart a visitor's operating system and browser, and to decide whether the visitor is a crawler; so the UA is often disguised when fetching.

In pyspider, you can set the UA per request with self.crawl(url, headers={'User-Agent': 'xxxx'}), or script-wide with crawl_config = {'headers': {'User-Agent': 'xxxx'}}. Please check the API documentation for details.
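
For instance, a minimal sketch of a script-level UA override (the UA string and URL are just placeholders):

class Handler(BaseHandler):
    # crawl_config is applied to every request issued by this script
    crawl_config = {
        'headers': {
            'User-Agent': 'Mozilla/5.0 (compatible; MyCrawler/1.0)',
        }
    }

    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)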

Referer

The Referer tells the server which page you came from. It is often used for hotlink protection, so you may need to set it when grabbing images.
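
A minimal sketch, assuming a hypothetical image URL protected by a Referer check:

def on_start(self):
    # pretend the image only loads when referred from its gallery page
    self.crawl('http://example.com/images/photo.jpg',
               headers={'Referer': 'http://example.com/gallery'},
               callback=self.save_image)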

X-Requested-With

This header is added by XHR when it sends AJAX requests, and servers often use it to determine whether a request is an AJAX request. For example, on the BYR forum you need:

def on_start(self):
    self.crawl('http://bbs.byr.cn/board/Python',
               headers={'X-Requested-With': 'XMLHttpRequest'},
               callback=self.index_page)

The content can only be crawled with headers={'X-Requested-With': 'XMLHttpRequest'}.

HTTP Cookie

Although the Cookie is just one of the HTTP headers, it is important enough to deserve separate discussion. Cookies are used in HTTP requests to identify and track users; when you log in to a website, your login state is recorded by writing it into the Cookie.

When you encounter a website that requires login, you need to set the Cookie parameter to request the protected content. Cookies can be obtained from the request panel of the developer tools, or from the Resources panel. In pyspider, you can also use response.cookies to read the cookies returned by the server, and self.crawl(url, cookies={'key': 'value'}) to set the cookies sent with a request.
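
As a minimal sketch (the URL and cookie value are hypothetical; the session id would be copied from the browser's developer tools):

def on_start(self):
    # send the session cookie with the request to access logged-in content
    self.crawl('http://example.com/members',
               cookies={'sessionid': 'your-session-id'},
               callback=self.member_page)

def member_page(self, response):
    # response.cookies holds the cookies returned by the server
    print(response.cookies)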

Thank you for reading. That covers how to use AJAX and HTTP in the pyspider crawler; after studying this article, you should have a deeper understanding of the topic.
