Shulou (Shulou.com), SLTechnology News&Howtos, Servers. Updated 2025-02-25.
This article explains the basics of HTTP as they apply to web crawlers. It is shared as a reference; after reading it you should have a working understanding of the relevant concepts.
What is HTTP?
The following definition is quoted from Baidu Encyclopedia:
The Hypertext Transfer Protocol (HTTP) is the most widely used network protocol on the Internet. All WWW files must comply with this standard. HTTP was originally designed as a way to publish and receive HTML pages.
In 1960, the American Ted Nelson conceived a method of processing text information by computer and called it hypertext, which became the foundation of the HTTP standard architecture. Ted Nelson organized and coordinated the collaborative research of the World Wide Web Consortium and the Internet Engineering Task Force, which resulted in the release of a series of RFCs; among them, the famous RFC 2616 defines HTTP/1.1.
HTTP is a transport protocol used to transfer hypertext from a WWW server to a local browser. It makes browsers more efficient and reduces network traffic. It not only ensures that hypertext documents are transmitted correctly and quickly, but also determines which part of the document is transferred and which content is displayed first (for example, text before graphics).
HTTP follows a browser/server, request/response model: the browser always initiates the HTTP request, and the server responds.
This means that if the browser client does not initiate a request, the server cannot actively push messages to the client.
HTTP is an application-layer protocol and the most direct way for us to request information from a server. Crawlers, for example, are built on the HTTP protocol: acting as an HTTP client, a crawler downloads blog posts, pictures, videos, and other resources.
But HTTP does not work on its own; its requests ride on lower-level protocols. In the TCP/IP stack, HTTP needs TCP's three-way handshake to establish a connection before sending a request to the server. For HTTPS, a TLS/SSL security layer is also required.
A complete HTTP request process
Since the HTTP protocol needs to be built on other underlying protocols, let's take a look at what a complete HTTP request looks like.
When we click on a link or type one into the address bar, the whole HTTP request process begins, going through the following steps to retrieve the information. Here we briefly introduce the first four steps to understand HTTP.
Domain name resolution
First, the various local DNS caches are searched; if the name is not found there, a resolution request is sent to the DNS server (typically the Internet provider's) to obtain the IP address.
Establish a TCP connection
Once the IP address is obtained, a socket connection is created, which is TCP's three-way handshake; the default port number is 80.
HTTP request
Once the TCP connection succeeds, the browser/crawler can send an HTTP request message to the server, consisting of the request line, the request headers, and the request body.
Server response
The server responds, returning an HTTP response message (status code 200 if successful) along with the requested HTML code.
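The four steps above can be sketched with nothing but Python's standard socket module. This is a minimal illustration, not production crawler code; the host name passed to fetch() is whatever site you want to request:

```python
import socket

def build_request(host: str, path: str = "/") -> bytes:
    # Request line, then headers, each line ending in CRLF,
    # and a blank line terminating the header section.
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Connection: close\r\n"
        "\r\n"
    ).encode("ascii")

def fetch(host: str, port: int = 80) -> bytes:
    # Step 1: domain name resolution (the OS consults local caches first,
    # then asks the configured DNS server).
    ip = socket.gethostbyname(host)
    # Step 2: establish a TCP connection (three-way handshake), default port 80.
    with socket.create_connection((ip, port), timeout=10) as sock:
        # Step 3: send the HTTP request message.
        sock.sendall(build_request(host))
        # Step 4: read the server's HTTP response until the connection closes.
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)

# fetch("example.com") would return the raw response bytes, beginning
# with a status line such as "HTTP/1.1 200 OK".
```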
Both the request and the response contain information in a specific format, which we will continue to interpret next.
In response to the HTTP request, the server returns a status code, from which the state of the returned information can be determined. The status codes are as follows:
1xx: informational response class; the request has been received and processing continues
100: Continue - the client should continue with its request
101: Switching Protocols - the server switches the HTTP protocol version as requested by the client
2xx: success response class; the action was successfully received, understood, and accepted
200: OK - the request succeeded
201: Created - the request succeeded and the URL of the new resource is returned
202: Accepted - accepted for processing, but processing is not complete
203: Non-Authoritative Information - the returned information is uncertain or incomplete
204: No Content - the request was received, but the returned message body is empty
205: Reset Content - the server has fulfilled the request, and the user agent must reset the currently viewed document
206: Partial Content - the server has fulfilled part of the client's GET request
3xx: redirection response class; further action must be taken to complete the request
300: Multiple Choices - the requested resource is available at multiple locations
301: Moved Permanently - the requested resource has been assigned a new permanent URL
302: Found - the requested data temporarily resides at a different address
303: See Other - the client should retrieve the resource from another URL or by another method
304: Not Modified - the client has performed a GET, but the file has not changed
305: Use Proxy - the requested resource must be obtained through the proxy specified by the server
306: (Unused) - a code used in an earlier version of HTTP, no longer used in the current version
307: Temporary Redirect - the requested resource temporarily resides under a different URL
4xx: client error; the request contains a syntax error or cannot be fulfilled
400: Bad Request - malformed request, for example a syntax error
401: Unauthorized - the request requires authentication
402: Payment Required - reserved; keeps a valid ChargeTo header response
403: Forbidden - access to the resource is denied
404: Not Found - the file, query, or URL was not found
405: Method Not Allowed - the method given in the Request-Line is not allowed for this resource
406: Not Acceptable - the requested resource cannot satisfy the Accept headers that were sent
407: Proxy Authentication Required - the client must first authenticate itself with the proxy server
408: Request Timeout - the client did not complete the request within the allotted time
409: Conflict - the request could not be completed because of the current state of the resource
410: Gone - the server no longer has this resource and knows no forwarding address
411: Length Required - the server rejects the request unless a Content-Length is defined
412: Precondition Failed - one or more preconditions in the request header fields evaluated to false
413: Request Entity Too Large - the request body is larger than the server is willing to process
414: Request-URI Too Long - the request URL is longer than the server is willing to interpret
415: Unsupported Media Type - the format of the request payload is not supported
416: Requested Range Not Satisfiable - the request contains a Range header field whose values do not overlap the current extent of the resource, and the request did not include an If-Range header field
417: Expectation Failed - the server cannot meet the expectation given in the Expect request header; if this is a proxy server, the next-hop server may be unable to meet it
5xx: server error; the server failed to fulfill an apparently valid request
500: Internal Server Error
501: Not Implemented
502: Bad Gateway
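Python's standard library already knows these codes. A small sketch that maps a code to the class names above and looks up its standard reason phrase:

```python
from http import HTTPStatus

def classify(code: int) -> str:
    # Map a status code to the response classes listed above.
    if 100 <= code < 200:
        return "informational"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    raise ValueError(f"not an HTTP status code: {code}")

print(classify(200), "-", HTTPStatus(200).phrase)  # success - OK
print(classify(404), "-", HTTPStatus(404).phrase)  # client error - Not Found
```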
HTTP request message
Now that you have a general picture of the HTTP request process, let's look at the message information of an HTTP request in detail.
The content of the message includes the request line, the request header and the request body.
Let's compare the contents of an HTTP request message, captured with the browser's developer tools, against the standard format above.
We find that the captured message matches the standard format, which is exactly what we want. Next, we introduce each of the pieces of information one by one.
Request line
The request line begins with one of HTTP's request methods. The HTTP/1.1 protocol defines eight methods for interacting with the server: GET, POST, HEAD, PUT, DELETE, OPTIONS, TRACE, and CONNECT.
HEAD: identical to a GET request except that the server returns only the headers, with no response body
GET: retrieve the resource identified by the URL (the crawler's workhorse for fetching pages)
POST: submit a form (used for simulated login in crawlers)
PUT: upload a file (not supported by browsers)
DELETE: delete a resource
OPTIONS: return the HTTP request methods the server supports for a specific resource
TRACE: echo back the request received by the server, for testing or diagnostics
CONNECT: reserved for proxies that can tunnel connections
After the GET method come the request URL (here /) and the protocol version HTTP/1.1; don't forget the spaces separating them.
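A request line is just these three tokens separated by single spaces, which a toy parser (for illustration only) makes explicit:

```python
def parse_request_line(line: str) -> tuple:
    # "GET / HTTP/1.1" -> (method, request target, protocol version)
    method, target, version = line.split(" ")
    return method, target, version

assert parse_request_line("GET / HTTP/1.1") == ("GET", "/", "HTTP/1.1")
```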
Request header
HTTP header fields fall into four groups: general headers, request headers, response headers, and entity headers. Because crawlers often forge the request header information for camouflage, we will focus on the request headers here.
The request header is unique to the request message; it supplies the server with additional information. For example, through the Accept field, the client tells the server what types of data it accepts. You can think of these fields as the extra context the client provides about itself.
Let's take a look at what these fields represent.
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Meaning: tells the server which MIME types the client accepts
Accept-Encoding:gzip, deflate, br
Meaning: if this field is present, the client supports these compression encodings for the content; if it is removed, any encoding is acceptable.
Note: it is generally best not to include this field in a crawler. The author, not knowing better at first, copied all the headers verbatim and as a result stayed stuck for a long time.
Accept-Language:zh-CN,zh;q=0.9
Meaning: tells the server which languages the client accepts; if absent, any language is acceptable.
Connection:keep-alive
Meaning: tells the server that a persistent connection is desired (HTTP/1.1 uses persistent connections by default)
Host:www.baidu.com
Meaning: the client specifies the domain name / IP address and port number of the web server it wants to access
Cache-control:max-age=0
Meaning: (quoted from Baidu encyclopedia)
Cache-Control is the most important rule. This field is used to specify the instructions that all caching mechanisms must follow throughout the request / response chain. These directives specify behaviors that prevent the cache from adversely interfering with the request or response. These instructions usually override the default caching algorithm. The cache instruction is one-way, that is, the existence of an instruction in the request does not mean that the same instruction will exist in the response.
The cache of web pages is controlled by the "Cache-control" in the HTTP message header. The common values are private, no-cache, max-age, must-revalidate and so on. The default is private.
But the Cache-Control of the HTTP request and response is not exactly the same.
The Cache-Control values commonly used in requests differ from those used in responses; here we mainly introduce the common Cache-Control values used when making requests.
max-age=0
Indicates that the client will not use a cached copy without first revalidating it with the server.
no-cache
Indicates that the request will not be served from the browser cache; a request is always sent to the server, which ensures the client receives the most authoritative response.
no-store
Nothing is cached, neither in the cache nor in temporary Internet files.
Upgrade-Insecure-Requests:1
Meaning: indicates that the browser / crawler can handle the HTTPS protocol and automatically upgrade requests from HTTP to HTTPS.
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64).. Safari/537.36
Meaning: (this is the field most commonly set by crawlers) it identifies the client, indicating which browser is performing the operation. Crawlers set it to a browser's User-Agent string to disguise the request as coming from that browser.
Cookie:
Meaning: (this is also very important in crawlers and is usually used for simulated login)
Cookie is used to maintain session state with the server: the server writes it (via Set-Cookie in the response) and then reads it back from subsequent requests.
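Putting these request headers to work, here is a minimal sketch using the standard urllib library; the URL and header values are illustrative placeholders, not a real target:

```python
import urllib.request

# Header fields a crawler typically sets to look like a browser.
# All values here are illustrative placeholders.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Referer": "https://www.example.com/",
}
req = urllib.request.Request("https://www.example.com/page", headers=headers)

# urllib stores header names capitalized, hence "User-agent" here.
assert req.get_header("User-agent").startswith("Mozilla/5.0")
# resp = urllib.request.urlopen(req)  # would perform the actual fetch
```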
Those are all the header fields that appear in this example. Of course, there are other commonly used fields, which are also explained here.
Other request header field information
Referer:
Meaning: (this is also commonly used by crawlers, hotlink protection)
The client tells the server from which page (the current URL) it arrived at the requested page; this is the basis of hotlink protection. In crawlers, we generally set it to the link of the page being requested.
Accept-Charset:
Meaning: (this is also commonly used by crawlers)
Represents the character sets the browser accepts, such as utf-8, gbk, and so on.
If-Modified-Since:Thu, 10 Apr 2008 09:14:42 GMT
Meaning: once the content of the request is modified after the specified date, the object content is returned, otherwise "Not Modified" is returned.
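As a side note, Python's standard library can produce this HTTP-date format; the timestamp below is simply the Unix time of the example date above:

```python
from email.utils import formatdate

# If-Modified-Since carries an HTTP-date (RFC 1123 style, always GMT).
# formatdate(usegmt=True) renders a Unix timestamp in exactly that form;
# 1207818882 corresponds to the example date shown above.
stamp = formatdate(1207818882, usegmt=True)
print(stamp)  # Thu, 10 Apr 2008 09:14:42 GMT
```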
Pragma:
Meaning:
The Pragma header field is used to contain implementation-specific instructions, the most commonly used being Pragma:no-cache. In the HTTP/1.1 protocol, it has the same meaning as Cache-Control:no-cache.
Range:
Meaning: tells the server which part of the object to return. For example, Range: bytes=0-1023 requests the first 1024 bytes.
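A sketch of a partial request with urllib (the URL is a placeholder); a server that honors the Range header would answer with status 206 Partial Content:

```python
import urllib.request

# A hypothetical partial download: ask only for the first 1024 bytes.
req = urllib.request.Request(
    "https://www.example.com/bigfile.zip",
    headers={"Range": "bytes=0-1023"},
)
assert req.get_header("Range") == "bytes=0-1023"
# resp = urllib.request.urlopen(req)  # resp.status would be 206 on success
```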
That concludes this introduction to the HTTP basics behind crawlers. I hope the content above is helpful; if you found the article useful, feel free to share it with others.