How to Analyze Base64 Encoding and Decoding in JavaScript


2025-07-08 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Today, I will talk about how to analyze Base64 encoding and decoding in JavaScript. Many people may not be familiar with the details, so the editor has summarized the following content. I hope you get something out of this article.

Base64 is one of the most commonly used encodings. It is used, for example, to pass parameters during development, in modern browsers (where an <img> tag can render a picture directly from a Base64 string), and in email. Base64 is defined in RFC 2045, which describes it as follows: the Base64 Content-Transfer-Encoding is designed to represent arbitrary sequences of octets in a form that need not be humanly readable.
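As a small illustration of the browser use case mentioned above, the sketch below builds a data URL from a short ASCII string with the built-in btoa function. It assumes an environment where btoa is a global (browsers and recent Node.js versions); the helper name toDataUrl and the text/plain MIME type are illustrative choices, not something taken from the article.

```javascript
// Build a data: URL from an ASCII-only string.
// Note: btoa() only accepts characters in the range 0x00-0xFF.
function toDataUrl(text, mimeType) {
  return 'data:' + mimeType + ';base64,' + btoa(text);
}

const url = toDataUrl('Hello', 'text/plain');
console.log(url); // data:text/plain;base64,SGVsbG8=
```

An image works the same way, except that the input is the binary content of the file and the MIME type is something like image/png.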

We know that all data in a computer is stored in binary. A byte is 8 bits, and a character is stored as one or more bytes. English letters, digits and punctuation marks each take one byte and are commonly called ASCII characters. Simplified Chinese, Traditional Chinese, Japanese and Korean characters take several bytes each and are usually called multi-byte characters. Because Base64 works on the encoded bytes of a string, the same string produces different Base64 results under different character encodings, so we first need some basic knowledge of character encoding.

The basis of character coding

At first, computers supported only ASCII. Each character is represented by one byte, only the lower 7 bits are used and the highest bit is 0, so there are 128 ASCII codes in total, in the range 0 to 127 (0x00 to 0x7F). Later, to support a variety of regional languages, major organizations and IT manufacturers invented their own encoding schemes to make up for the shortcomings of ASCII, such as GB2312, GBK and Big5. But each of these encodings covers only one region or a small number of languages, and none of them can represent all languages. There is also no relationship between these encodings, so converting between them requires lookup tables.

To improve the ability of computers to process and exchange information, so that the scripts of every country can be handled, ISO began in 1984 to develop a brand-new standard: the Universal Multiple-Octet Coded Character Set, abbreviated UCS, with the standard number ISO 10646. This standard assigns a unified code to the characters (including Simplified and Traditional Chinese characters) and additional symbols of all the world's major languages.

Unicode, also translated as "unified code" or "universal code", is a character encoding system developed by a separate organization, the Unicode Consortium. In terms of content, Unicode is kept consistent with the ISO 10646 international standard.

ANSI

ANSI does not denote one specific encoding; it refers to the local (system default) encoding. For example, it means GB2312 on Simplified Chinese Windows, Big5 on Traditional Chinese Windows, and JIS on a Japanese operating system. So if you create a text file and save it with ANSI encoding, the file is in fact saved in the local encoding.

Unicode

Unicode code values map one-to-one to entries in the character table. For example, 56DE stands for the Chinese character '回' (huí), and this mapping is fixed. Put simply, a Unicode code is a coordinate in the character table, and the character '回' can be found through 56DE. The implementations of Unicode include UTF8, UTF16, UTF32 and so on.

Unicode itself defines a numeric value for each character, that is, a mapping between characters and natural numbers. UTF-8, UTF-16 and UTF-32 define how those values are laid out and delimited in a byte stream, which is a concern of the computing domain.
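The difference between a code value and its byte representation can be seen directly in JavaScript. The following is a minimal sketch, assuming an environment that provides the standard TextEncoder global (modern browsers and Node.js); the helper toHex is a hypothetical name introduced only for this example.

```javascript
// Format a number as two uppercase hex digits.
function toHex(b) {
  return b.toString(16).toUpperCase().padStart(2, '0');
}

const ch = '\u56DE'; // the character 回

// The Unicode code value is just a number.
const codePoint = ch.codePointAt(0);
console.log('U+' + codePoint.toString(16).toUpperCase()); // U+56DE

// UTF-8 is one way of writing that number as bytes.
const utf8Bytes = Array.from(new TextEncoder().encode(ch));
console.log(utf8Bytes.map(toHex).join(' ')); // E5 9B 9E

// In UTF-16 the same character is a single 16-bit unit.
console.log(ch.length); // 1
```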

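From the conversion table (a figure in the original article), we know that UTF-8 is a variable-length encoding. In the scheme this article follows, a character takes 1 to 6 bytes; the current standard, RFC 3629, restricts UTF-8 to at most 4 bytes. The number of bytes is determined by the range in which the Unicode code value falls, and every byte of a UTF8 character follows a fixed pattern. This article only discusses the UTF8 and UTF16 encodings.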

UTF16

UTF16 stores each character in a fixed 2 bytes, at least for the characters discussed here (characters above U+FFFF take two 2-byte units, called a surrogate pair). Because each unit spans more than one byte, it can be stored in two byte orders: big-endian and little-endian. UTF16 is the most direct implementation of Unicode. When we create a text file on Windows and save it as "Unicode", it is actually saved as UTF16, and on Windows UTF16 is stored in little-endian order. To test this, I created a new text file, entered only the Chinese character '回', and saved it as Unicode. Then I opened it with EditPlus and switched to hexadecimal mode to view the bytes.

We see that there are four bytes. The first two bytes, FF FE, are the byte order mark (BOM) at the start of the file, indicating that this is a little-endian UTF16 file, while DE 56 is the UTF16 encoding of '回' in hexadecimal. The JavaScript language we use every day also uses UTF16 internally, and charCodeAt returns each 16-bit unit as a number, so the value prints with the high byte first, as in big-endian order. Let's take a look at an example:

    console.group('Test Unicode:');
    console.log('回'.charCodeAt(0).toString(16).toUpperCase()); // 56DE
    console.groupEnd();

This is clearly different from what EditPlus showed: the two bytes appear in the opposite order, because the byte order is different.
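The two byte orders can be reproduced without a hex editor. The sketch below uses the standard DataView API to write the same 16-bit unit in both orders; the function name unitToBytes is a hypothetical helper introduced for illustration.

```javascript
// Serialize one UTF-16 code unit as two bytes in the requested byte order
// and return them as a hex string.
function unitToBytes(codeUnit, littleEndian) {
  const view = new DataView(new ArrayBuffer(2));
  view.setUint16(0, codeUnit, littleEndian);
  return [view.getUint8(0), view.getUint8(1)]
    .map((b) => b.toString(16).toUpperCase().padStart(2, '0'))
    .join(' ');
}

const unit = '\u56DE'.charCodeAt(0); // 0x56DE, the character 回

console.log(unitToBytes(unit, true));  // DE 56  (little-endian, as in the file)
console.log(unitToBytes(unit, false)); // 56 DE  (big-endian)
```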

UTF8

UTF8 uses a variable-length encoding of 1 to 6 bytes, but we usually only deal with the single-byte and three-byte forms, because the other cases are rare in practice. A UTF8 character is a sequence of bytes that the computer processes one byte at a time, so UTF8 has no byte-order issue. Each byte follows the regular pattern given in the conversion table, so I will not elaborate on it here.
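How the number of bytes follows from the code value can be sketched as a simple range check. Note that this example follows the current standard (RFC 3629), which stops at 4 bytes and U+10FFFF, rather than the 6-byte scheme described above; the function name utf8ByteLength is an assumption made for this illustration.

```javascript
// Return how many bytes UTF-8 needs for a given Unicode code value,
// using the ranges of the current standard (maximum 4 bytes).
function utf8ByteLength(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a valid Unicode code value: ' + codePoint);
  }
  if (codePoint <= 0x7F) return 1;   // 0xxxxxxx
  if (codePoint <= 0x7FF) return 2;  // 110xxxxx 10xxxxxx
  if (codePoint <= 0xFFFF) return 3; // 1110xxxx 10xxxxxx 10xxxxxx
  return 4;                          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}

console.log(utf8ByteLength(0x41));    // 1
console.log(utf8ByteLength(0x56DE));  // 3
console.log(utf8ByteLength(0x1F600)); // 4
```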

Mutual conversion between UTF16 and UTF8

UTF16 to UTF8

The conversion between UTF16 and UTF8 can be implemented with the conversion table. By checking which range the Unicode code falls in, we know how many bytes the character needs, and the bytes are then produced by bit shifting. Let's use the Chinese character '回' as an example.

We already know that the Unicode code of the Chinese character '回' is 0x56DE, which lies between U+00000800 and U+0000FFFF, so it is represented by three bytes.

So we need to turn the two-byte value 0x56DE into a three-byte value. The x positions in the pattern 1110xxxx 10xxxxxx 10xxxxxx are filled with the bits of 0x56DE. If you count the x's, you will find that there are exactly 16.

Conversion approach

Take the highest 4 bits of 0x56DE, place them in the low bits, and combine them with the binary prefix 1110; this is the first byte. Take the next 6 bits of 0x56DE, place them in the low bits, and combine them with the binary prefix 10; this is the second byte. The third byte is built the same way from the lowest 6 bits.
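To make the 4 + 6 + 6 split visible, the sketch below does the same steps on a binary string instead of with shifts. It is only a visualization aid under that assumption, not the article's implementation, and all variable names are illustrative.

```javascript
const value = 0x56DE; // Unicode code of the character 回

// The 16 bits that will fill the x positions.
const bits = value.toString(2).padStart(16, '0'); // "0101011011011110"

// Split into groups of 4, 6 and 6 bits.
const groups = [bits.slice(0, 4), bits.slice(4, 10), bits.slice(10, 16)];

// Prepend the fixed prefixes 1110, 10 and 10.
const byteStrings = ['1110' + groups[0], '10' + groups[1], '10' + groups[2]];
const bytes = byteStrings.map((s) => parseInt(s, 2));

console.log(byteStrings.join(' ')); // 11100101 10011011 10011110
console.log(bytes.map((b) => b.toString(16).toUpperCase()).join(' ')); // E5 9B 9E
```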

Code implementation

To make this easier to follow, the code handles only the Chinese character '回'. It is the first part of the listing in the UTF8 to UTF16 section below.

The output looks garbled because JavaScript treats each UTF8 byte as a separate UTF16 character and does not display the three bytes as one character. You might say that a conversion whose output cannot be displayed is of no use, but the purpose of the conversion is usually transport or meeting the needs of an API, not display.
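The following sketch shows what the "garbled" string actually contains, and that the bytes are still intact. It assumes the standard TextDecoder global is available (modern browsers and Node.js); the variable names are illustrative.

```javascript
// The three UTF-8 bytes of 回, packed one byte per JavaScript character.
const packed = String.fromCharCode(0xE5, 0x9B, 0x9E);

// JavaScript sees three separate characters, which is why it looks garbled.
console.log(packed.length); // 3

// Each "character" is really a byte value.
const byteValues = Array.from(packed, (c) => c.charCodeAt(0));

// Interpreting those bytes as UTF-8 recovers the original character.
const decoded = new TextDecoder('utf-8').decode(Uint8Array.from(byteValues));
console.log(decoded); // 回
```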

UTF8 to UTF16

This is the inverse of the UTF16 to UTF8 conversion, and it is also implemented against the conversion table. In the last example we obtained the UTF8 encoding of the Chinese character '回', which is three bytes. We just need to convert it back to two bytes according to the conversion table, keeping all the x bits.

The code is as follows:

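    /**
     * Conversion table
     * U+00000000 - U+0000007F  0xxxxxxx
     * U+00000080 - U+000007FF  110xxxxx 10xxxxxx
     * U+00000800 - U+0000FFFF  1110xxxx 10xxxxxx 10xxxxxx
     * U+00010000 - U+001FFFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
     * U+00200000 - U+03FFFFFF  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
     * U+04000000 - U+7FFFFFFF  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
     */
    /*
     * The Unicode code of '回' is 0x56DE. It lies between U+00000800 and
     * U+0000FFFF, so it takes three bytes: 1110xxxx 10xxxxxx 10xxxxxx
     */
    var ucode = 0x56DE;
    // 1110xxxx
    var byte1 = 0xE0 | ((ucode >> 12) & 0x0F);
    // 10xxxxxx
    var byte2 = 0x80 | ((ucode >> 6) & 0x3F);
    // 10xxxxxx
    var byte3 = 0x80 | (ucode & 0x3F);
    var utf8 = String.fromCharCode(byte1) + String.fromCharCode(byte2) + String.fromCharCode(byte3);
    console.group('Test UTF16ToUTF8:');
    console.log(utf8);
    console.groupEnd();

    /** ------------------------------------------------ */
    // The character consists of three bytes, so take them out.
    var c1 = utf8.charCodeAt(0);
    var c2 = utf8.charCodeAt(1);
    var c3 = utf8.charCodeAt(2);
    /*
     * Normally the first byte must be inspected to find the length, but here
     * we know it is three bytes, so skip the check and collect all the x bits
     * into 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
     */
    // Drop the top four bits of the first byte and join its low four bits
    // with the top four x bits of the second byte to form one byte.
    var b1 = (c1 << 4) | ((c2 >> 2) & 0x0F);
    // Similarly, combine the second byte and the third byte.
    var b2 = ((c2 & 0x03) << 6) | (c3 & 0x3F);
    // Reconstructed: the rest of this example is missing from the source text.
    var ucode2 = ((b1 & 0x00FF) << 8) | b2;
    console.log(ucode2.toString(16).toUpperCase()); // 56DE

    // NOTE: the beginning of the following listing and the end of decode()
    // are missing from the source text. The object name Base64, the method
    // name UTF8ToUTF16, the table (the standard Base64 alphabet) and every
    // part marked "reconstructed" are assumptions filled in from context.
    var Base64 = {
        // Conversion table (reconstructed)
        table: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',
        UTF16ToUTF8: function (str) {
            // --- reconstructed from here ---
            var res = [], len = str.length;
            for (var i = 0; i < len; i++) {
                var code = str.charCodeAt(i);
                if (code <= 0x007F) {
                    // single byte
                    // 0xxxxxxx
                    res.push(str.charAt(i));
                } else if (code >= 0x0080 && code <= 0x07FF) {
                    // double byte
                    // 110xxxxx
                    // --- end of reconstructed part ---
                    var byte1 = 0xC0 | ((code >> 6) & 0x1F);
                    // 10xxxxxx
                    var byte2 = 0x80 | (code & 0x3F);
                    res.push(String.fromCharCode(byte1), String.fromCharCode(byte2));
                } else if (code >= 0x0800 && code <= 0xFFFF) {
                    // three bytes
                    // 1110xxxx
                    var byte1 = 0xE0 | ((code >> 12) & 0x0F);
                    // 10xxxxxx
                    var byte2 = 0x80 | ((code >> 6) & 0x3F);
                    // 10xxxxxx
                    var byte3 = 0x80 | (code & 0x3F);
                    res.push(String.fromCharCode(byte1), String.fromCharCode(byte2), String.fromCharCode(byte3));
                } else if (code >= 0x00010000 && code <= 0x001FFFFF) {
                    // four bytes (not implemented)
                } else if (code >= 0x00200000 && code <= 0x03FFFFFF) {
                    // five bytes (not implemented)
                } else /** if (code >= 0x04000000 && code <= 0x7FFFFFFF) */ {
                    // six bytes (not implemented)
                }
            }
            return res.join('');
        },
        UTF8ToUTF16: function (str) {
            // --- reconstructed from here ---
            var res = [], len = str.length;
            for (var i = 0; i < len; i++) {
                var code = str.charCodeAt(i);
                // Inspect the first byte to find the length of the sequence.
                // --- end of reconstructed part ---
                if (((code >> 7) & 0xFF) === 0x0) {
                    // single byte
                    // 0xxxxxxx
                    res.push(str.charAt(i));
                } else if (((code >> 5) & 0xFF) === 0x6) {
                    // double byte
                    // 110xxxxx 10xxxxxx
                    var code2 = str.charCodeAt(++i);
                    var byte1 = (code & 0x1F) << 6;
                    var byte2 = code2 & 0x3F;
                    var utf16 = byte1 | byte2;
                    res.push(String.fromCharCode(utf16));
                } else if (((code >> 4) & 0xFF) === 0xE) {
                    // three bytes
                    // 1110xxxx 10xxxxxx 10xxxxxx
                    var code2 = str.charCodeAt(++i);
                    var code3 = str.charCodeAt(++i);
                    var byte1 = (code << 4) | ((code2 >> 2) & 0x0F);
                    var byte2 = ((code2 & 0x03) << 6) | (code3 & 0x3F);
                    var utf16 = ((byte1 & 0x00FF) << 8) | byte2;
                    res.push(String.fromCharCode(utf16));
                } else if (((code >> 3) & 0xFF) === 0x1E) {
                    // four bytes (not implemented)
                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                } else if (((code >> 2) & 0xFF) === 0x3E) {
                    // five bytes (not implemented)
                    // 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
                } else /** if (((code >> 1) & 0xFF) === 0x7E) */ {
                    // six bytes (not implemented)
                    // 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
                }
            }
            return res.join('');
        },
        encode: function (str) {
            if (!str) {
                return '';
            }
            var utf8 = this.UTF16ToUTF8(str); // convert to UTF8
            var i = 0; // traversal index
            var len = utf8.length;
            var res = [];
            while (i < len) {
                var c1 = utf8.charCodeAt(i++) & 0xFF;
                res.push(this.table[c1 >> 2]);
                // Only one byte left: pad with two '='
                if (i === len) {
                    res.push(this.table[(c1 & 0x3) << 4]);
                    res.push('==');
                    break;
                }
                var c2 = utf8.charCodeAt(i++);
                // Only two bytes left: pad with one '='
                if (i === len) {
                    res.push(this.table[((c1 & 0x3) << 4) | ((c2 >> 4) & 0x0F)]);
                    res.push(this.table[(c2 & 0x0F) << 2]);
                    res.push('=');
                    break;
                }
                var c3 = utf8.charCodeAt(i++);
                res.push(this.table[((c1 & 0x3) << 4) | ((c2 >> 4) & 0x0F)]);
                res.push(this.table[((c2 & 0x0F) << 2) | ((c3 & 0xC0) >> 6)]);
                res.push(this.table[c3 & 0x3F]);
            }
            return res.join('');
        },
        decode: function (str) {
            if (!str) {
                return '';
            }
            var len = str.length;
            var i = 0;
            var res = [];
            while (i < len) {
                var code1 = this.table.indexOf(str.charAt(i++));
                var code2 = this.table.indexOf(str.charAt(i++));
                var code3 = this.table.indexOf(str.charAt(i++));
                var code4 = this.table.indexOf(str.charAt(i++));
                var c1 = (code1 << 2) | (code2 >> 4);
                var c2 = ((code2 & 0xF) << 4) | (code3 >> 2);
                var c3 = ((code3 & 0x3) << 6) | code4;
                // --- reconstructed from here: the source text ends above ---
                res.push(String.fromCharCode(c1));
                // '=' is not in the table, so indexOf returns -1 for padding.
                if (code3 !== -1) {
                    res.push(String.fromCharCode(c2));
                }
                if (code4 !== -1) {
                    res.push(String.fromCharCode(c3));
                }
            }
            return this.UTF8ToUTF16(res.join(''));
        }
    };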
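For comparison, and as a way to cross-check a hand-written implementation like the one above, the same result can be obtained from built-in APIs. This is a different technique from the article's code: it relies on TextEncoder, TextDecoder, btoa and atob being available as globals (true in modern browsers and recent Node.js versions), and the function names encodeBase64Utf8 and decodeBase64Utf8 are hypothetical names chosen for this sketch.

```javascript
// Encode a JavaScript string as Base64 over its UTF-8 bytes.
function encodeBase64Utf8(str) {
  const bytes = new TextEncoder().encode(str);
  let binary = '';
  for (const b of bytes) {
    binary += String.fromCharCode(b); // one character per byte
  }
  return btoa(binary);
}

// Decode Base64 back to a JavaScript string, reading the bytes as UTF-8.
function decodeBase64Utf8(b64) {
  const binary = atob(b64);
  const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
  return new TextDecoder('utf-8').decode(bytes);
}

console.log(encodeBase64Utf8('\u56DE')); // 5Zue
console.log(decodeBase64Utf8('5Zue') === '\u56DE'); // true
```

Unlike the listing above, this version also handles characters outside the three-byte range, because TextEncoder implements the full UTF-8 encoding.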
