2025-01-16 Update From: SLTechnology News&Howtos
This article walks through the main knowledge points of URL encoding. The material is simple and clear; follow along step by step to study how URL encoding works.
We all know that parameters in the HTTP protocol are transmitted as "key=value" pairs. To pass multiple parameters, the key-value pairs are joined with the "&" symbol, as in "?name1=value1&name2=value2". When the server receives such a string, it splits out each parameter on "&", and then splits each parameter into key and value on "=".
Taking "name1=value1&name2=value2" as an example, let's walk through, conceptually, how the server parses what the client sends:
The string above is represented on the wire, in ASCII, as:

6E616D6531 3D 76616C756531 26 6E616D6532 3D 76616C756532

6E616D6531 → name1
3D → "="
76616C756531 → value1
26 → "&"
6E616D6532 → name2
3D → "="
76616C756532 → value2
After receiving the data, the server can walk the byte stream, consuming it one byte at a time. When it reaches the byte 3D, it knows that the bytes consumed so far form a key; when it then encounters 26, it knows that the bytes between that 3D and this 26 form the value of the previous key. Repeating this, all of the parameters passed by the client can be parsed.
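The byte-by-byte walk described above amounts to splitting on the two delimiters. Here is a minimal JavaScript sketch (the function name parseQuery is illustrative, not a standard API):

```javascript
// Split a query string of the form "k1=v1&k2=v2" into an object:
// first split on "&", then split each pair at its first "=".
function parseQuery(query) {
  const params = {};
  for (const pair of query.split("&")) {
    const i = pair.indexOf("=");      // position of the 3D byte
    const key = pair.slice(0, i);     // everything before "=" is the key
    const value = pair.slice(i + 1);  // everything after "=" is the value
    params[key] = value;
  }
  return params;
}

console.log(parseQuery("name1=value1&name2=value2"));
// { name1: "value1", name2: "value2" }
```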
Now a question arises: what if a parameter value itself contains a special character such as = or &? For example, take "name1=value1" where value1 is the string "va&lu=e1". On the wire this becomes "name1=va&lu=e1". Our intention was a single key-value pair, but the server parses it as two key-value pairs, which is clearly wrong.
How do we resolve this ambiguity? The answer is to URL-encode the parameters. URL encoding simply prefixes each byte of a special character with %. Encoding the problematic value above gives "name1=va%26lu%3De1". The server then treats a byte that immediately follows "%" as ordinary data, not as a delimiter between parameters or between key and value.
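JavaScript's built-in encodeURIComponent performs exactly this kind of escaping, so the example can be verified directly:

```javascript
// The raw value contains the reserved characters & and =
const value = "va&lu=e1";

// Percent-encode the value so it can sit safely in a query string:
// "&" becomes %26 and "=" becomes %3D
const encoded = encodeURIComponent(value);
console.log(encoded);            // "va%26lu%3De1"
console.log("name1=" + encoded); // "name1=va%26lu%3De1"

// Decoding reverses the transformation
console.log(decodeURIComponent(encoded)); // "va&lu=e1"
```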
Another question: why transmit in ASCII at all? Could another encoding be used? Of course. You could even invent your own encoding and parse it yourself, just as most countries have their own languages. But what happens when countries need to communicate with each other? They fall back on English, because English is the most widely used language.
Usually, something needs to be encoded because it is not suitable for transmission as-is. The reasons vary: the data may be too large, or it may contain private information. For URLs, encoding is necessary because some characters in a URL cause ambiguity.
For example, parameters are passed in a URL's query string as key=value pairs separated by the & symbol, as in /s?q=abc&ie=utf-8. If a value string contains an = or &, the server receiving the URL is bound to parse it incorrectly, so the ambiguous & and = symbols must be escaped, that is, encoded.
Also, URLs are encoded in ASCII rather than Unicode, which means a URL cannot directly contain non-ASCII characters such as Chinese. Otherwise, Chinese characters may cause problems when the character sets supported by the client browser and the server differ.
The principle of Url coding is to use safe characters (printable characters with no special purpose or special meaning) to represent those unsafe characters.
Preliminary knowledge: URI stands for Uniform Resource Identifier; what we usually call a URL is just one kind of URI. The format of a typical URI is shown below. The "URL encoding" discussed in this article really refers to URI encoding.
foo://example.com:8042/over/there?name=ferret#nose
\_/   \______________/\_________/ \_________/ \__/
 |           |             |           |        |
scheme   authority        path       query   fragment
Which characters need to be encoded
The RFC 3986 document states that only letters (a-zA-Z), digits (0-9), the four special characters - _ . ~, and the reserved characters are allowed to appear unencoded in a URL. The RFC makes detailed recommendations on encoding and decoding URLs, points out which characters must be encoded so that the URL's semantics do not change, and explains why those characters need encoding.
Non-printable characters in the US-ASCII character set: only printable characters are allowed in a URL. The bytes 00-1F and 7F in US-ASCII represent control characters, which cannot appear directly in a URL. Likewise, the bytes 80-FF (ISO-8859-1) cannot be placed in a URL because they are outside the byte range defined by US-ASCII.
Reserved characters: a URL can be divided into several components: scheme, host, path, and so on. Some characters (: / ? # [ ] @) are used to separate the different components. For example, the colon separates the scheme from the host, / separates the host from the path, and ? separates the path from the query parameters.
Other characters (! $ & ' ( ) * + , ; =) are used to delimit data within each component; for example, = marks key-value pairs in the query parameters, and the & symbol separates multiple key-value pairs in a query. When ordinary data inside a component contains these special characters, it needs to be encoded.
RFC 3986 specifies the following characters as reserved: ! * ' ( ) ; : @ & = + $ , / ? # [ ]
Unsafe characters: some other characters may cause ambiguity in parsers when placed directly in a URL. These characters are considered unsafe for a number of reasons.
Spaces: while a URL is transcribed, typeset by a user, or processed by a text processor, meaningless spaces may be introduced or meaningful ones removed.
Quotation marks and angle brackets: these are commonly used to delimit URLs inside plain text.
#: usually used to indicate a bookmark or anchor.
%: the percent sign itself is the special character used to encode unsafe characters, so it must be encoded as well.
{ } | \ ^ [ ] ` ~: some gateways or transport agents tamper with these characters.
Note that for legal characters in a URL, the encoded and unencoded forms are equivalent, but for the characters listed above, leaving them unencoded may change the URL's meaning. So only ordinary English letters and digits, the special characters $ - _ . + ! * ' ( ), and the reserved characters may appear unencoded in a URL; all other characters must be encoded before they can appear.
However, for historical reasons there are still some non-conforming encoding implementations. For example, although RFC 3986 stipulates that the tilde (~) does not need URL encoding, many old gateways and transport agents still encode it.
How to encode illegal characters in Url
URL encoding (also known as percent-encoding) gets its common name from its very simple mechanism: a percent sign followed by two hexadecimal digits (0-9, A-F) representing one byte.
The default character set for URL encoding is US-ASCII. For example, the letter a corresponds to the byte 0x61 in US-ASCII, so its URL encoding is %61. Typing http://g.cn/search?q=%61%62%63 in the address bar is effectively the same as searching for abc on Google. Similarly, the @ symbol corresponds to the byte 0x40 in ASCII, and its URL encoding is %40.
For non-ASCII characters, the characters are first converted to bytes using a superset of the ASCII character set, and each byte is then percent-encoded. For Unicode characters, the RFC recommends encoding them with UTF-8 to obtain the bytes, then percent-encoding each byte. For example, "中文" ("Chinese") encoded with UTF-8 yields the bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87, so its URL encoding is "%E4%B8%AD%E6%96%87".
If a byte corresponds to an unreserved character in the ASCII character set, it does not need the percent form. For example, the string "Url编码" encoded with UTF-8 gives the bytes 0x55 0x72 0x6C 0xE7 0xBC 0x96 0xE7 0xA0 0x81; since the first three bytes correspond to the unreserved characters "Url" in ASCII, they can be left as those characters, and the result can be simplified to "Url%E7%BC%96%E7%A0%81". The fully encoded form "%55%72%6C%E7%BC%96%E7%A0%81" is, of course, also valid.
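As a sketch of the rule just described, the following hand-rolled encoder (percentEncode is an illustrative name, not a standard API) UTF-8-encodes a string and percent-encodes every byte that is not an unreserved character:

```javascript
// Percent-encode a string: UTF-8 encode it, keep unreserved bytes
// (A-Z a-z 0-9 - . _ ~) as-is, and write every other byte as %XX.
function percentEncode(str) {
  const bytes = new TextEncoder().encode(str); // UTF-8 bytes
  let out = "";
  for (const b of bytes) {
    const ch = String.fromCharCode(b);
    if (/[A-Za-z0-9\-._~]/.test(ch)) {
      out += ch; // unreserved character: no escaping needed
    } else {
      out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(percentEncode("Url编码")); // "Url%E7%BC%96%E7%A0%81"
console.log(percentEncode("中文"));    // "%E4%B8%AD%E6%96%87"
```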
For historical reasons, there are some Url coding implementations that do not fully follow this principle, as mentioned below.
The difference between escape, encodeURI and encodeURIComponent in Javascript
JavaScript provides three pairs of functions for encoding a URL into a legal URL: escape/unescape, encodeURI/decodeURI, and encodeURIComponent/decodeURIComponent. Since decoding simply reverses encoding, only the encoding process is explained here.
All three encoding functions, escape, encodeURI, and encodeURIComponent, convert unsafe or illegal URL characters into legal URL character representations, with the following differences.
The security characters are different:
The safe characters for each function (that is, the characters the function does not encode) are listed below:

escape (69 characters): * / @ + - . _ 0-9 a-z A-Z
encodeURI (82 characters): ! # $ & ' ( ) * + , - . / : ; = ? @ ~ _ 0-9 a-z A-Z
encodeURIComponent (71 characters): ! ' ( ) * - . ~ _ 0-9 a-z A-Z
Compatibility is different:
The escape function has existed since JavaScript 1.0, while the other two functions were introduced in JavaScript 1.5. Since JavaScript 1.5 is now essentially universal, there are no real compatibility issues with encodeURI and encodeURIComponent.
Unicode characters are encoded differently:
All three functions encode ASCII characters the same way: a percent sign plus two hexadecimal digits. For Unicode characters, however, escape encodes them as %uxxxx, where xxxx is the four-digit hexadecimal representation of the character.
This approach has been deprecated by the W3C, although the syntax is still retained in the ECMA-262 standard. encodeURI and encodeURIComponent instead encode non-ASCII characters with UTF-8 and then percent-encode each byte, which is what the RFC recommends. Therefore, prefer these two functions over escape whenever possible.
Their intended uses differ: encodeURI is for encoding a complete URI, while encodeURIComponent is for encoding a single URI component. From the safe-character lists above, you can see that encodeURIComponent encodes a larger range of characters than encodeURI.
As mentioned above, reserved characters are generally used to separate URI components (a URI can be split into several components; see the preliminary knowledge section) or subcomponents (such as the delimiters between query parameters). For example, the : symbol separates the scheme from the authority, and / separates the authority from the path. Because encodeURI operates on a complete URI in which these characters have special purposes, it does not encode them; doing so would change the URI's meaning.
Each component has its own data format, but that data must not contain the reserved characters that separate components, or the decomposition of the whole URI becomes ambiguous. That is why encodeURIComponent, operating on a single component, needs to encode more characters.
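The difference is easy to demonstrate with the built-in functions (the URL below is just an example):

```javascript
const uri = "http://g.cn/search?q=中文&ie=utf-8";

// encodeURI treats its argument as a complete URI:
// delimiters such as :, /, ?, & and = are left untouched,
// while non-ASCII characters are UTF-8 percent-encoded.
console.log(encodeURI(uri));
// "http://g.cn/search?q=%E4%B8%AD%E6%96%87&ie=utf-8"

// encodeURIComponent treats its argument as a single component,
// so the delimiters themselves are encoded as well.
console.log(encodeURIComponent("中文&ie=utf-8"));
// "%E4%B8%AD%E6%96%87%26ie%3Dutf-8"

// The deprecated escape uses the %uxxxx syntax for non-ASCII characters.
console.log(escape("中文")); // "%u4E2D%u6587"
```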
Form submission
When an HTML form is submitted, each form field is URL-encoded before being sent. For historical reasons, the URL encoding used by forms does not follow the latest standard: the space, for example, is encoded as the + sign rather than %20. If the form is submitted with the POST method, the HTTP headers include a Content-Type header with the value application/x-www-form-urlencoded.
Most applications can handle this non-standard URL encoding, but client-side JavaScript has no built-in function that decodes the + sign back into a space, so you have to write your own conversion. Also, for non-ASCII characters, the character set used for encoding depends on the character set of the current document. For example, suppose we add the following to the HTML head:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

The browser will then render the document as gb2312 (note that when no such meta tag is present in the HTML document, the browser chooses a character set automatically according to the user's preferences, or the user can force a specific character set for the current site). When the form is submitted, the character set used for URL encoding is gb2312.
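A minimal conversion function for form-encoded values, as suggested above (decodeFormValue is an illustrative name, not a standard API):

```javascript
// Decode an application/x-www-form-urlencoded value, where a space
// is historically encoded as "+" rather than "%20".
// decodeURIComponent alone does not turn "+" into a space,
// so the "+" signs are replaced first.
function decodeFormValue(value) {
  return decodeURIComponent(value.replace(/\+/g, " "));
}

console.log(decodeFormValue("hello+world%21")); // "hello world!"
```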
A confusing problem I ran into while using Aptana (why Aptana specifically is mentioned will become clear below): when I used encodeURI, the encoded result was very different from what I expected. Here is my sample code:
document.write(encodeURI("中文"))
Running it outputs %E6%B6%93%EE%85%9F%E6%9E%83. This is clearly not the result of URL encoding with the UTF-8 character set (searching Google for "中文" shows %E4%B8%AD%E6%96%87 in the URL).
So at the time I wondered whether encodeURI was related to the page encoding, but I found that, normally, URL encoding with gb2312 would not produce this result either. Eventually I discovered that the problem was an inconsistency between the character set the page file was stored in and the character set specified in the meta tag.
Aptana's editor uses the UTF-8 character set by default, meaning the file was actually stored as UTF-8. But because gb2312 was specified in the meta tag, the browser parsed the document as gb2312, so the string "中文" was naturally mangled: its UTF-8 bytes are 0xE4 0xB8 0xAD 0xE6 0x96 0x87, and decoding those six bytes as gb2312 yields three different characters, "涓" plus a private-use character plus "枃" (one Chinese character occupies two bytes in GBK). Passing those three characters into encodeURI produces %E6%B6%93%EE%85%9F%E6%9E%83. So encodeURI does use UTF-8 and is not affected by the page character set.
Different browsers behave differently when handling URLs that contain Chinese characters. For example, in IE, if the advanced setting "Always send URLs as UTF-8" is checked, the Chinese part of the path in a URL is URL-encoded with UTF-8 before being sent to the server, while the Chinese part of the query parameters is URL-encoded with the system default character set. For maximum interoperability, it is recommended to explicitly URL-encode every component placed in a URL with a specified character set, rather than relying on the browser's default behavior.
In addition, many HTTP monitoring tools and browser address bars automatically decode the URL once (using UTF-8) when displaying it, which is why the address bar shows Chinese characters when you search for Chinese text on Google in Firefox. The original URL sent to the server is in fact still encoded, as you can verify by reading location.href from JavaScript in the address bar. Don't be fooled by these illusions when studying URL encoding and decoding.
Thank you for reading. That covers the knowledge points of URL encoding. After studying this article, you should have a deeper understanding of how URL encoding works; the specifics still need to be verified in practice.