What are the three data formats you need to know in the HTTP protocol? 07/15 Update SLTechnology News&Howtos

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Today I'd like to talk about the three data formats you need to know in the HTTP protocol, which many people may not understand well. To make them easier to understand, I have put together the following summary, and I hope you get something out of this article.

One of my main tasks during my internship was analyzing HTTP traffic. I also wrote regular expressions in Python to match the contents of HTTP requests and responses and extract the key fields into a dictionary (with slight changes, this can serve as a crawler tool).

I ran into many pitfalls in the HTTP protocol, so here is a summary of the common HTTP data formats I encountered.

Zlib compressed data

Zlib is no stranger to us. It is a popular, general-purpose compression library, widely used especially on Linux, and the Deflate algorithm it implements is the same one normally used inside zip and gzip files (rar and 7z archives are usually compressed with other algorithms). Applying zlib compression to a plain text file has an obvious effect: it can reduce the file size by more than 70%, depending on the contents of the file.

Zlib is also well suited to Web data transmission. For example, with the Gzip module in Apache (Gzip is a compression format discussed below), the content served by Apache can be compressed before it is sent to the client browser. Compression reduces the number of bytes sent over the network, and the most obvious benefit is that web pages load faster.

The benefits of faster page loads are self-evident: less traffic and a better browsing experience. These benefits are not limited to static content. PHP pages and other dynamically generated content can also be compressed with Apache's compression module; combined with other performance tuning and suitable server-side caching rules, this can greatly improve a site's performance. So for PHP programs deployed on Linux servers, it is recommended to turn on Gzip compression if the server supports it.

Two compression formats: Gzip and Deflate

Different compression algorithms produce different compressed data (all with the aim of reducing size). At present, two compression formats are popular on the Web: Gzip and Deflate.

Apache provides the Gzip module. Deflate is a lossless data compression algorithm that combines the LZ77 algorithm with Huffman coding. Source code for Deflate compression and decompression can be found in zlib, the free, general-purpose compression library.

7-zip implements Deflate with a higher compression ratio. AdvanceCOMP also uses this implementation and can recompress gzip, PNG, MNG, and ZIP files to sizes smaller than zlib achieves. Ken Silverman's KZIP and PNGOUT use an even more effective Deflate encoder, which requires more user input.

In the zlib C API, Deflate data is decompressed after initializing with inflateInit(), while the Deflate data inside a gzip stream is initialized with inflateInit2(), which takes one more parameter than inflateInit(): windowBits, here set to -MAX_WBITS, meaning raw deflate data. This is because the compressed data block inside gzip data does not carry the two-byte zlib header, so inflateInit2() is used to tell the zlib library not to expect one. The zlib manual gives 8..15 as the range of windowBits, but values outside that range have special meanings, such as negative numbers for raw deflate.

To sum up, Deflate is a compression algorithm that builds on Huffman coding. The decompression code for deflate and gzip is almost identical and can be merged into a single routine.
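The effect of windowBits can be tried from Python, whose zlib module exposes the same parameter as wbits. The following is a minimal sketch, not the author's code: the sample data and variable names are made up for illustration, and the gzip slice assumes the minimal 10-byte header and 8-byte trailer that gzip.compress() writes.

```python
import gzip
import zlib

data = b"hello http " * 20   # made-up sample payload

# Three containers around the same Deflate algorithm.
zlib_data = zlib.compress(data)                      # 2-byte zlib header + Adler-32 trailer
gzip_data = gzip.compress(data)                      # gzip header + CRC-32/length trailer
raw_obj = zlib.compressobj(wbits=-zlib.MAX_WBITS)    # negative wbits: raw deflate
raw_data = raw_obj.compress(data) + raw_obj.flush()  # no header, no trailer

# wbits tells zlib.decompress() which container to expect.
from_zlib = zlib.decompress(zlib_data, zlib.MAX_WBITS)        # zlib header
from_raw = zlib.decompress(raw_data, -zlib.MAX_WBITS)         # raw deflate
from_gzip = zlib.decompress(gzip_data, zlib.MAX_WBITS | 16)   # gzip header

# Skipping the gzip header and trailer by hand leaves raw deflate data,
# which has no zlib header and therefore needs the negative wbits.
raw_from_gzip = zlib.decompress(gzip_data[10:-8], -zlib.MAX_WBITS)

print(from_zlib == from_raw == from_gzip == raw_from_gzip == data)
```

All four calls recover the same bytes; only the wrapper around the Deflate stream differs.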

For more information, please see Wikipedia zlib.

How a Web server handles compression

After receiving an HTTP request from the browser, the Web server checks whether the browser supports HTTP compression (the Accept-Encoding header).

If the browser supports HTTP compression, the Web server checks the file extension of the requested file.

If the requested file is a static file such as HTML or CSS, the Web server looks in its compression cache directory to see whether a compressed copy of the requested file already exists.

If no compressed copy exists, the Web server returns the uncompressed file to the browser and stores a compressed copy in the compression cache directory.

If a compressed copy already exists, the Web server returns it directly.

If the requested file is dynamic, the Web server compresses the content on the fly and returns it to the browser; the compressed content is not stored in the compression cache directory.
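The steps above can be sketched in a few lines of Python. This is only an illustration of the decision flow, not how Apache implements it: the function name respond, the STATIC_SUFFIXES set, and the dict standing in for the cache directory are all assumptions, and the Accept-Encoding check is deliberately naive (it ignores q-values).

```python
import gzip
import os

STATIC_SUFFIXES = {".html", ".css", ".js"}   # assumed list of static file types

def respond(path, content, accept_encoding, cache):
    """Return (body, content_encoding) following the steps described above.

    `cache` is a plain dict standing in for the compressed cache directory.
    """
    if "gzip" not in accept_encoding.lower():
        return content, None                    # browser did not offer gzip

    suffix = os.path.splitext(path)[1].lower()
    if suffix in STATIC_SUFFIXES:
        if path in cache:
            return cache[path], "gzip"          # compressed copy already cached
        cache[path] = gzip.compress(content)    # store a copy for the next request
        return content, None                    # first request goes out uncompressed
    # Dynamic content: compress on the fly and never cache the result.
    return gzip.compress(content), "gzip"
```

Calling respond() twice for the same static path returns the plain content the first time and the cached gzip copy the second time, while a dynamic path is compressed on every call.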

An example

Having said all that, let me give an example. Open a packet capture tool and visit our school's official website (www.ecnu.edu.cn). The request header is as follows:

    GET /_css/tpl2/system.css HTTP/1.1
    Host: www.ecnu.edu.cn
    Connection: keep-alive
    User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36
    Accept: text/css,*/*;q=0.1
    Referer: http://www.ecnu.edu.cn/
    Accept-Encoding: gzip, deflate
    Accept-Language: zh-CN,zh;q=0.8
    Cookie: a10-default-cookie-persist-20480-sg_bluecoat_a=AFFIHIMKFAAA

In the seventh line, Accept-Encoding is gzip, deflate, which means the browser is telling the server that it supports both the gzip and deflate formats. On receiving such a request, the server may compress the response with gzip or deflate (it usually returns gzip).

Python's urllib2 can set this header:

    import urllib2  # Python 2 module

    url = 'http://www.ecnu.edu.cn/'
    request = urllib2.Request(url)
    request.add_header('Accept-Encoding', 'gzip')
    # or ask for deflate instead
    request.add_header('Accept-Encoding', 'deflate')
    # or accept both
    request.add_header('Accept-Encoding', 'gzip, deflate')
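For readers on Python 3, where urllib2 no longer exists, the same header can be set through urllib.request. This is a minimal sketch that only builds the request object and does not send it; the URL is the site used in the example above.

```python
import urllib.request

url = "http://www.ecnu.edu.cn/"   # site from the example above

# Build the request and advertise support for both formats.
request = urllib.request.Request(url)
request.add_header("Accept-Encoding", "gzip, deflate")

# urllib stores header names capitalized, i.e. as "Accept-encoding".
print(request.get_header("Accept-encoding"))
```

Note that urllib does not decompress the response for you: if the server answers with Content-Encoding: gzip, the bytes returned by read() are still compressed.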

The response from the server is generally as follows:

    HTTP/1.1 200 OK
    Date: Sat, 22 Oct 2016 11:41:19 GMT
    Content-Type: text/javascript;charset=utf-8
    Transfer-Encoding: chunked
    Connection: close
    Vary: Accept-Encoding
    tracecode: 24798560510951725578102219
    Server: Apache
    Content-Encoding: gzip

    400a
    ...... (omitted; the response body is compressed data)

In the response header, the line Content-Encoding: gzip indicates that the response body is gzip-compressed. In general there are three cases: if the field is absent, the body is uncompressed plain text; otherwise the body is compressed, with either Content-Encoding: gzip or Content-Encoding: deflate.

In practice, far more sites use Gzip than Deflate. I once wrote a simple crawler that started from the hao123 home page and crawled several thousand pages (covering basically all the commonly visited sites), specifically to analyze the compression type of the responses. The results were:

Accept-Encoding not set: the server returns an uncompressed response body (browsers are a special case, because they automatically send Accept-Encoding: gzip, deflate to speed up transfers).

Accept-Encoding: gzip: 100% of the sites returned gzip-compressed responses, although this does not guarantee that every site on the Internet supports gzip (it may not be enabled).

Accept-Encoding: deflate: fewer than 10% of the sites returned a deflate-compressed response; the rest returned an uncompressed response.

Accept-Encoding: gzip, deflate: all responses came back in gzip format, which shows that gzip takes priority and is the more popular of the two.

The Content-Encoding field of the response header is very useful. For example, let's write a regular expression to match the compression type given in the response header:

(?

    >>> import gzip
    >>> import StringIO
    >>> fio = StringIO.StringIO(gzip_data)
    >>> f = gzip.GzipFile(fileobj=fio)
    >>> f.read()
    'test'
    >>> f.close()
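As a minimal sketch of matching the compression type in a response header with a regular expression: the pattern and the helper name find_content_encoding below are assumptions written for illustration, not the author's original expression.

```python
import re

# Hypothetical pattern: capture the value of the Content-Encoding header line.
CONTENT_ENCODING_RE = re.compile(
    r"^Content-Encoding:[ \t]*([^\r\n]*?)[ \t]*\r?$",
    re.IGNORECASE | re.MULTILINE,
)

def find_content_encoding(header_text):
    """Return the lowercased encoding name, or None if the header is absent."""
    match = CONTENT_ENCODING_RE.search(header_text)
    return match.group(1).lower() if match else None

headers = "HTTP/1.1 200 OK\r\nServer: Apache\r\nContent-Encoding: gzip\r\n\r\n"
print(find_content_encoding(headers))
```

Returning None for a missing header matches the "no compression" case described earlier, so the caller can pass the body through unchanged.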

zlib can also detect the header automatically during decompression: adding 32 to windowBits enables automatic zlib/gzip header detection, for example:

    >>> zlib.decompress(gzip_data, zlib.MAX_WBITS | 32)
    'test'
    >>> zlib.decompress(zlib_data, zlib.MAX_WBITS | 32)
    'test'
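Putting the pieces together, a response body can be decoded based on its Content-Encoding value. The helper below is a sketch under assumptions: decode_body is a made-up name, and the fallback to raw deflate reflects the fact that some servers send header-less data under Content-Encoding: deflate.

```python
import gzip
import zlib

def decode_body(body, content_encoding):
    """Decompress an HTTP body according to its Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding not in ("gzip", "deflate"):
        return body                                  # no compression declared
    try:
        # +32 lets zlib detect a gzip or zlib header on its own.
        return zlib.decompress(body, zlib.MAX_WBITS | 32)
    except zlib.error:
        # Fall back to header-less (raw) deflate data.
        return zlib.decompress(body, -zlib.MAX_WBITS)

compressed = gzip.compress(b"test")
print(decode_body(compressed, "gzip"))
```

Trying header detection first and raw deflate second keeps a single code path for both formats, in line with the observation that the two decompression routines are almost the same.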

The above is taken from the Stack Overflow question "How can I decompress a gzip stream with zlib?".

When you first start working with these things you will hit strange errors every day, most of which can be solved with a Google search.

Chunked transfer encoding

Chunked transfer encoding is a data transfer mechanism in the Hypertext Transfer Protocol (HTTP) that allows the data sent by a web server to a client application (usually a web browser) to be divided into multiple parts. Chunked transfer encoding is only available in HTTP version 1.1 (HTTP/1.1).

Typically, the data in an HTTP response is sent as a whole, and the Content-Length header field gives its length. The length matters because the client needs to know where the response ends and where the next response begins. With chunked transfer encoding, however, the data is broken into a series of chunks and sent one chunk at a time, so the server can start sending without knowing the total size of the content in advance. Chunks are usually the same size, but not always.

Advantages of chunked transfer encoding

The introduction of chunked transfer encoding in HTTP/1.1 provides the following benefits:

Chunked transfer encoding allows the server to keep an HTTP persistent connection open for dynamically generated content. Normally, a persistent connection requires the server to send the Content-Length header field before it starts sending the message body, but for dynamically generated content the length is not known until the content has been created.

Chunked transfer encoding allows the server to send header fields after the message body. This matters when the value of a header field cannot be known until the content has been generated, for example when the content is signed with a hash and the hash is sent in an HTTP header field. Without chunked transfer encoding, the server has to buffer the content until it is complete, compute the values of those header fields, and send them before sending the content.

HTTP servers sometimes use compression (gzip or deflate) to reduce transfer time. Chunked transfer encoding can be used to delimit the parts of a compressed object. In this case the chunks are not compressed individually; instead, the entire payload is compressed and the compressed output is sent in chunks using the scheme described in this article. With compression, chunked encoding makes it possible to send data while it is still being compressed, rather than having to finish compressing first to learn the size of the compressed data.

Note: the above content comes from Wikipedia.
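To illustrate the last point, sending data while it is still being compressed, here is a minimal Python sketch. The generator name gzip_chunked is made up, and the framing it emits (hexadecimal size, CRLF, data, CRLF, then a zero-size chunk) is the chunk layout described later in this article.

```python
import zlib

def gzip_chunked(pieces):
    """Yield chunked-encoding frames of a gzip stream built piece by piece."""
    compressor = zlib.compressobj(wbits=zlib.MAX_WBITS | 16)   # gzip container
    for piece in pieces:
        out = compressor.compress(piece)
        if out:                                # the compressor may buffer input
            yield b"%x\r\n%s\r\n" % (len(out), out)
    out = compressor.flush()                   # remaining data plus gzip trailer
    if out:
        yield b"%x\r\n%s\r\n" % (len(out), out)
    yield b"0\r\n\r\n"                         # last chunk: size 0, no trailer

frames = list(gzip_chunked([b"hello " * 1000, b"world " * 1000]))
print(len(frames))
```

The whole payload forms one gzip stream; the chunks only cut that stream into pieces, and the total compressed size is never needed up front.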

Format of chunked transfer encoding

If the Transfer-Encoding header of an HTTP message (request or response) has the value chunked, the message body consists of an unspecified number of chunks and ends with a chunk of size 0. Each non-empty chunk begins with the number of bytes of data it contains (in hexadecimal), followed by a CRLF (carriage return and line feed), then the data itself, and finally a CRLF that closes the chunk. In some implementations, whitespace (0x20) is padded between the chunk size and the CRLF.

The last chunk is a single line consisting of the chunk size (0), optional padding whitespace, and a CRLF. The last chunk contains no data, but an optional trailer containing header fields may be sent after it.

The message finally ends with a CRLF. For example, here is a response with a body in chunked format.

    HTTP/1.1 200 OK
    Date: Wed, 06 Jul 2016 06:59:55 GMT
    Server: Apache
    Accept-Ranges: bytes
    Transfer-Encoding: chunked
    Content-Type: text/html
    Content-Encoding: gzip
    Age: 35
    X-Via: 1.1 daodianxinxiazai58:88 (Cdn Cache Server V2.0), 1.1 yzdx147:1 (Cdn Cache Server V2.0)
    Connection: keep-alive

    a.... k. | W. 166. OO.0. & ~. .].. (Franks V.A3.X.. copyright .l8.y.)., j. H. 6. S. MZ. >.. .) B.G.` "Dq.P]. F=0..Q..d.h.8....F..y.q.4 {F.. M.A. .n > .D.. o @ .`^.! @ $. P.% a\ D. K.D. {2. UnFJI C [.T.c.V."% .`U.? D....#..K..
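A body in this format can be reassembled with a short loop. The function below is a minimal sketch, assuming the complete body is already in memory as bytes; the name decode_chunked is made up, and any trailer after the last chunk is simply ignored.

```python
def decode_chunked(body):
    """Reassemble the payload of a chunked HTTP message body (bytes)."""
    payload = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # Size line: hex digits, possibly padded or followed by ";extension".
        size_text = body[pos:line_end].split(b";")[0].strip().decode("ascii")
        size = int(size_text, 16)
        pos = line_end + 2
        if size == 0:
            break                       # last chunk; an optional trailer may follow
        payload += body[pos:pos + size]
        pos += size + 2                 # skip the data and its closing CRLF
    return bytes(payload)

print(decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"))
```

When the response also carries Content-Encoding: gzip, as in the example above, the chunks are joined first and the joined payload is decompressed afterwards.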
