Shulou (Shulou.com), SLTechnology News&Howtos, Servers. Updated 2025-02-25.
This article explains the basics of HTTP as they apply to web crawlers. It is shared as a reference; after reading it you should have a working understanding of the relevant concepts.
What is HTTP?
The following definition is quoted from Baidu Encyclopedia:
The Hypertext Transfer Protocol (HTTP) is the most widely used network protocol on the Internet. All WWW files must comply with this standard. HTTP was originally designed as a way to publish and receive HTML pages.
In 1960, the American Ted Nelson conceived a method of processing text information by computer and called it hypertext, which became the foundation of the HTTP standard architecture. Ted Nelson organized and coordinated the collaborative research of the World Wide Web Consortium and the Internet Engineering Task Force, which resulted in the release of a series of RFCs; among them, the famous RFC 2616 defines HTTP/1.1.
HTTP is a transport protocol used to transfer hypertext from a WWW server to a local browser. It makes browsers more efficient and reduces network traffic. It not only ensures that hypertext documents are transmitted correctly and quickly, but also determines which part of the document is transferred and which content is displayed first (for example, text before graphics).
HTTP follows a browser/server, request/response model: the browser always initiates the HTTP request, and the server responds.
This means that if the browser client does not initiate a request, the server cannot actively push messages to the client.
HTTP is an application-layer protocol and the most direct way for us to request information from a server. Crawlers, for example, are built on the HTTP protocol: acting as an HTTP client, a crawler downloads blog posts, pictures, videos, and other resources.
But HTTP does not work on its own; its requests ride on lower-level protocols. In the TCP/IP stack, HTTP needs TCP's three-way handshake to establish a connection before sending a request to the server. For HTTPS, a TLS/SSL security layer is also required.
A complete HTTP request process
Since the HTTP protocol needs to be built on other underlying protocols, let's take a look at what a complete HTTP request looks like.
When we click on a link or type one into the address bar, the whole HTTP request process begins, going through the following steps to retrieve the information. Here we briefly introduce the first four steps to understand HTTP.
Domain name resolution
First, the various local DNS caches are searched; if the name is not found there, a resolution request is sent to the DNS server (typically the Internet provider's) to obtain the IP address.
Establish a TCP connection
Once the IP address is obtained, a socket connection is created, which is TCP's three-way handshake; the default port number is 80.
HTTP request
Once the TCP connection succeeds, the browser/crawler can send an HTTP request message to the server, consisting of the request line, the request headers, and the request body.
Server response
The server responds, returning an HTTP response message (status code 200 if successful) along with the requested HTML code.
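The four steps above can be sketched with nothing but Python's standard socket module. This is a minimal illustration, not production crawler code; the host name passed to fetch() is whatever site you want to request:

```python
import socket

def build_request(host: str, path: str = "/") -> bytes:
    # Request line, then headers, each line ending in CRLF,
    # and a blank line terminating the header section.
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Connection: close\r\n"
        "\r\n"
    ).encode("ascii")

def fetch(host: str, port: int = 80) -> bytes:
    # Step 1: domain name resolution (the OS consults local caches first,
    # then asks the configured DNS server).
    ip = socket.gethostbyname(host)
    # Step 2: establish a TCP connection (three-way handshake), default port 80.
    with socket.create_connection((ip, port), timeout=10) as sock:
        # Step 3: send the HTTP request message.
        sock.sendall(build_request(host))
        # Step 4: read the server's HTTP response until the connection closes.
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)

# fetch("example.com") would return the raw response bytes, beginning
# with a status line such as "HTTP/1.1 200 OK".
```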
Both the request and the response contain information in a specific format, which we will continue to interpret next.
In response to the HTTP request, the server returns a status code, from which the state of the returned information can be determined. The status codes are as follows:
1xx: informational response class; the request has been received and processing continues
100: Continue - the client should continue with its request
101: Switching Protocols - the server switches the HTTP protocol version as requested by the client
2xx: success response class; the action was successfully received, understood, and accepted
200: OK - the request succeeded
201: Created - the request succeeded and the URL of the new resource is returned
202: Accepted - accepted for processing, but processing is not complete
203: Non-Authoritative Information - the returned information is uncertain or incomplete
204: No Content - the request was received, but the returned message body is empty
205: Reset Content - the server has fulfilled the request, and the user agent must reset the currently viewed document
206: Partial Content - the server has fulfilled part of the client's GET request
3xx: redirection response class; further action must be taken to complete the request
300: Multiple Choices - the requested resource is available at multiple locations
301: Moved Permanently - the requested resource has been assigned a new permanent URL
302: Found - the requested data temporarily resides at a different address
303: See Other - the client should retrieve the resource from another URL or by another method
304: Not Modified - the client has performed a GET, but the file has not changed
305: Use Proxy - the requested resource must be obtained through the proxy specified by the server
306: (Unused) - a code used in an earlier version of HTTP, no longer used in the current version
307: Temporary Redirect - the requested resource temporarily resides under a different URL
4xx: client error; the request contains a syntax error or cannot be fulfilled
400: Bad Request - malformed request, for example a syntax error
401: Unauthorized - the request requires authentication
402: Payment Required - reserved; keeps a valid ChargeTo header response
403: Forbidden - access to the resource is denied
404: Not Found - the file, query, or URL was not found
405: Method Not Allowed - the method given in the Request-Line is not allowed for this resource
406: Not Acceptable - the requested resource cannot satisfy the Accept headers that were sent
407: Proxy Authentication Required - the client must first authenticate itself with the proxy server
408: Request Timeout - the client did not complete the request within the allotted time
409: Conflict - the request could not be completed because of the current state of the resource
410: Gone - the server no longer has this resource and knows no forwarding address
411: Length Required - the server rejects the request unless a Content-Length is defined
412: Precondition Failed - one or more preconditions in the request header fields evaluated to false
413: Request Entity Too Large - the request body is larger than the server is willing to process
414: Request-URI Too Long - the request URL is longer than the server is willing to interpret
415: Unsupported Media Type - the format of the request payload is not supported
416: Requested Range Not Satisfiable - the request contains a Range header field whose values do not overlap the current extent of the resource, and the request did not include an If-Range header field
417: Expectation Failed - the server cannot meet the expectation given in the Expect request header; if this is a proxy server, the next-hop server may be unable to meet it
5xx: server error; the server failed to fulfill an apparently valid request
500: Internal Server Error
501: Not Implemented
502: Bad Gateway
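Python's standard library already knows these codes. A small sketch that maps a code to the class names above and looks up its standard reason phrase:

```python
from http import HTTPStatus

def classify(code: int) -> str:
    # Map a status code to the response classes listed above.
    if 100 <= code < 200:
        return "informational"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    raise ValueError(f"not an HTTP status code: {code}")

print(classify(200), "-", HTTPStatus(200).phrase)  # success - OK
print(classify(404), "-", HTTPStatus(404).phrase)  # client error - Not Found
```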
HTTP request message
Now that you have a general picture of the HTTP request process, let's look at the message information of an HTTP request in detail.
The content of the message includes the request line, the request header and the request body.
Let's compare the contents of an HTTP request message, captured with the browser's developer tools, against the standard format above.
We find that the captured message matches the standard format, which is exactly what we want. Next, we introduce each of the pieces of information one by one.
Request line
The request line begins with one of HTTP's request methods. The HTTP/1.1 protocol defines eight methods for interacting with the server: GET, POST, HEAD, PUT, DELETE, OPTIONS, TRACE, and CONNECT.
HEAD: identical to a GET request except that the server returns only the headers, with no response body
GET: retrieve the resource identified by the URL (the crawler's workhorse for fetching pages)
POST: submit a form (used for simulated login in crawlers)
PUT: upload a file (not supported by browsers)
DELETE: delete a resource
OPTIONS: return the HTTP request methods the server supports for a specific resource
TRACE: echo back the request received by the server, for testing or diagnostics
CONNECT: reserved for proxies that can tunnel connections
After the GET method come the request URL (here /) and the protocol version HTTP/1.1; don't forget the spaces separating them.
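A request line is just these three tokens separated by single spaces, which a toy parser (for illustration only) makes explicit:

```python
def parse_request_line(line: str) -> tuple:
    # "GET / HTTP/1.1" -> (method, request target, protocol version)
    method, target, version = line.split(" ")
    return method, target, version

assert parse_request_line("GET / HTTP/1.1") == ("GET", "/", "HTTP/1.1")
```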
Request header
HTTP header fields fall into four groups: general headers, request headers, response headers, and entity headers. Because crawlers often forge the request header information for camouflage, we will focus on the request headers here.
The request header is unique to the request message; it supplies the server with additional information. For example, through the Accept field, the client tells the server what types of data it accepts. You can think of these fields as the extra context the client provides about itself.
Let's take a look at what these fields represent.
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Meaning: tells the server which MIME types the client accepts
Accept-Encoding:gzip, deflate, br
Meaning: if this field is present, the client supports these compression encodings for the content; if it is removed, any encoding is acceptable.
Note: it is generally best not to include this field in a crawler. The author, not knowing better at first, copied all the headers verbatim and as a result stayed stuck for a long time.
Accept-Language:zh-CN,zh;q=0.9
Meaning: tells the server which languages the client accepts; if absent, any language is acceptable.
Connection:keep-alive
Meaning: tells the server that a persistent connection is desired (HTTP/1.1 uses persistent connections by default)
Host:www.baidu.com
Meaning: the client specifies the domain name / IP address and port number of the web server it wants to access
Cache-control:max-age=0
Meaning: (quoted from Baidu encyclopedia)
Cache-Control is the most important rule. This field is used to specify the instructions that all caching mechanisms must follow throughout the request / response chain. These directives specify behaviors that prevent the cache from adversely interfering with the request or response. These instructions usually override the default caching algorithm. The cache instruction is one-way, that is, the existence of an instruction in the request does not mean that the same instruction will exist in the response.
The cache of web pages is controlled by the "Cache-control" in the HTTP message header. The common values are private, no-cache, max-age, must-revalidate and so on. The default is private.
But the Cache-Control of the HTTP request and response is not exactly the same.
The Cache-Control values commonly used in requests differ from those used in responses; here we mainly introduce the common Cache-Control values used when making requests.
max-age=0
Indicates that the client will not use a cached copy without first revalidating it with the server.
no-cache
Indicates that the request will not be served from the browser cache; a request is always sent to the server, which ensures the client receives the most authoritative response.
no-store
Nothing is cached, neither in the cache nor in temporary Internet files.
Upgrade-Insecure-Requests:1
Meaning: indicates that the browser / crawler can handle the HTTPS protocol and automatically upgrade requests from HTTP to HTTPS.
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64).. Safari/537.36
Meaning: (this is the field most commonly set by crawlers) it identifies the client, indicating which browser is performing the operation. Crawlers set it to a browser's User-Agent string to disguise the request as coming from that browser.
Cookie:
Meaning: (this is also very important in crawlers and is usually used for simulated login)
Cookie is used to maintain session state with the server: the server writes it (via Set-Cookie in the response) and then reads it back from subsequent requests.
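Putting these request headers to work, here is a minimal sketch using the standard urllib library; the URL and header values are illustrative placeholders, not a real target:

```python
import urllib.request

# Header fields a crawler typically sets to look like a browser.
# All values here are illustrative placeholders.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Referer": "https://www.example.com/",
}
req = urllib.request.Request("https://www.example.com/page", headers=headers)

# urllib stores header names capitalized, hence "User-agent" here.
assert req.get_header("User-agent").startswith("Mozilla/5.0")
# resp = urllib.request.urlopen(req)  # would perform the actual fetch
```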
Those are all the header fields that appear in this example. Of course, there are other commonly used fields, which are also explained here.
Other request header field information
Referer:
Meaning: (this is also commonly used by crawlers, hotlink protection)
The client tells the server from which page (the current URL) it arrived at the requested page; this is the basis of hotlink protection. In crawlers, we generally set it to the link of the page being requested.
Accept-Charset:
Meaning: (this is also commonly used by crawlers)
Represents the character sets the browser accepts, such as utf-8, gbk, and so on.
If-Modified-Since:Thu, 10 Apr 2008 09:14:42 GMT
Meaning: once the content of the request is modified after the specified date, the object content is returned, otherwise "Not Modified" is returned.
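As a side note, Python's standard library can produce this HTTP-date format; the timestamp below is simply the Unix time of the example date above:

```python
from email.utils import formatdate

# If-Modified-Since carries an HTTP-date (RFC 1123 style, always GMT).
# formatdate(usegmt=True) renders a Unix timestamp in exactly that form;
# 1207818882 corresponds to the example date shown above.
stamp = formatdate(1207818882, usegmt=True)
print(stamp)  # Thu, 10 Apr 2008 09:14:42 GMT
```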
Pragma:
Meaning:
The Pragma header field is used to contain implementation-specific instructions, the most commonly used being Pragma:no-cache. In the HTTP/1.1 protocol, it has the same meaning as Cache-Control:no-cache.
Range:
Meaning: tells the server which part of the object to return. For example, Range: bytes=0-1023 requests the first 1024 bytes.
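A sketch of a partial request with urllib (the URL is a placeholder); a server that honors the Range header would answer with status 206 Partial Content:

```python
import urllib.request

# A hypothetical partial download: ask only for the first 1024 bytes.
req = urllib.request.Request(
    "https://www.example.com/bigfile.zip",
    headers={"Range": "bytes=0-1023"},
)
assert req.get_header("Range") == "bytes=0-1023"
# resp = urllib.request.urlopen(req)  # resp.status would be 206 on success
```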
That concludes this introduction to the HTTP basics behind crawlers. I hope the content above is helpful; if you found the article useful, feel free to share it with others.