2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article introduces the basic knowledge points of character encoding in Java: ASCII, Unicode, the UTF encodings, the GB family of Chinese encodings, and the places in a Java web application where text is encoded and decoded.
1. ASCII coding
In the 1960s, the United States formulated a character encoding that standardized the relationship between English characters and binary bits. It is called ASCII and is still in use today. ASCII specifies a total of 128 characters; for example, the space character (SPACE) is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 control symbols that cannot be printed) occupy only the last 7 bits of a byte, and the first bit is always 0. Codes 0 to 31, plus 127 (delete), are control characters such as line feed and carriage return, and codes 32 to 126 are printable characters that can be entered through the keyboard and displayed.
128 symbols are enough to encode English but not other languages. For example, French letters that carry an accent mark cannot be represented in ASCII. As a result, some European countries decided to use the unused highest bit of the byte to add new symbols. For example, the French letter é was given the code 130 (binary 10000010). In this way, the encodings used by these European countries can represent up to 256 symbols.
However, a new problem emerged. Different countries have different letters, so even though they all use 256-symbol encodings, the same code represents different letters. For example, 130 stands for é in the French encoding, for the letter Gimel (ג) in the Hebrew encoding, and for yet another symbol in the Russian encoding. In all these encodings, however, the symbols represented by 0 to 127 are the same; only the range 128 to 255 differs.
The scripts of Asian countries use far more symbols; there are about 100,000 Chinese characters alone. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used for one symbol. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent up to 65,536 (256 x 256) symbols.
2. Unicode coding
Imagine an encoding that includes every symbol in the world and gives each one a unique code; the problems above would then disappear. Unicode is such an encoding.
Unicode is a very large character set that can now hold more than 1 million symbols, each with its own code. For example, U+0639 stands for the Arabic letter Ain, U+0041 stands for the English capital letter A, and U+4E25 stands for the Chinese character 严 (yan).
It is important to note that Unicode is only a symbol set: it specifies the binary code of each symbol, but not how that code should be stored. This creates two problems:
The first problem is how to tell Unicode from ASCII. How does the computer know that three bytes represent one symbol rather than three separate symbols?
The second problem is storage. One byte is enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, each English letter would be preceded by two or three zero bytes, which is a great waste of storage: text files would become two or three times larger, which is unacceptable.
Remember, Unicode is just a standard for mapping characters to numbers. It does not tie the number of supported characters to a byte width, nor does it require characters to occupy two, three, or any other number of bytes. How Unicode characters are encoded into bytes in memory is another topic, which is defined by the UTF (Unicode Transformation Format) encodings.
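The separation between a code point and its stored bytes can be observed directly in Java. The following is a minimal sketch; the class name CodePointDemo and the helper byteLength are made up for this illustration. It prints the code point of 严 and the number of bytes the same character occupies in two different encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    /** Returns how many bytes the text occupies once encoded with the given charset. */
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String yan = "\u4E25"; // the Chinese character discussed in the text
        // The code point is a number; it says nothing about storage.
        System.out.printf("U+%04X%n", yan.codePointAt(0));                // U+4E25
        // The same code point takes a different number of bytes per encoding.
        System.out.println(byteLength(yan, StandardCharsets.UTF_8));      // 3
        System.out.println(byteLength(yan, StandardCharsets.UTF_16BE));   // 2
        System.out.println(byteLength("A", StandardCharsets.UTF_8));      // 1
        System.out.println(byteLength("A", StandardCharsets.UTF_16BE));   // 2
    }
}
```

The letter A costs one byte in UTF-8 but two in UTF-16BE, which is the storage concern raised in the second problem above.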
3. UTF-8 coding
With the popularity of the Internet, there is strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, which are rarely used on the Internet. To repeat: UTF-8 is one of the ways Unicode is implemented.
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode (also known as the universal code). It was created by Ken Thompson in 1992 and is now standardized as RFC 3629. UTF-8 encodes Unicode characters in 1 to 4 bytes. It can be used on a web page to display simplified Chinese and other languages (such as English, Japanese, and Korean) together.
The most notable feature of UTF-8 is that it is variable-length: it uses 1 to 4 bytes per symbol, and the byte length varies with the symbol. (A four-byte sequence carries 21 payload bits, so the scheme can address 2^21 values, a little over 2 million.)
UTF-8's encoding rules are simple; there are only two:
For a single-byte symbol, the first bit of the byte is set to 0 and the remaining seven bits are the Unicode code of the symbol. So for English letters, the UTF-8 encoding and the ASCII code are identical.
For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of each following byte are set to 10. The remaining bits, not mentioned above, hold the Unicode code of the symbol.
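These two rules are what let a reader tell where each symbol starts, which answers the earlier question of how three bytes are recognized as one symbol. The sketch below is illustrative only: the class Utf8LeadByte and its methods are hypothetical names, and it inspects only the first byte of each sequence without validating the continuation bytes.

```java
class Utf8LeadByte {
    /** Returns the length of the UTF-8 sequence announced by its first byte, or -1. */
    static int sequenceLength(byte first) {
        int b = first & 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 10xxxxxx (continuation byte) or invalid
    }

    /** Counts symbols by jumping from one lead byte to the next. */
    static int countSymbols(byte[] utf8) {
        int count = 0;
        int i = 0;
        while (i < utf8.length) {
            int length = sequenceLength(utf8[i]);
            if (length < 0) {
                throw new IllegalArgumentException("not a lead byte at index " + i);
            }
            i += length;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The letter A followed by the three-byte sequence E4 B8 A5.
        byte[] data = {0x41, (byte) 0xE4, (byte) 0xB8, (byte) 0xA5};
        System.out.println(countSymbols(data)); // 2
    }
}
```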
The following table summarizes the coding rules, with the letter x representing the bits that can be encoded.
Number of bytes | Unicode symbol range (hexadecimal) | UTF-8 encoding (binary)
----------------+------------------------------------+------------------------------------
One byte        | 0000 0000-0000 007F                | 0xxxxxxx
Two bytes       | 0000 0080-0000 07FF                | 110xxxxx 10xxxxxx
Three bytes     | 0000 0800-0000 FFFF                | 1110xxxx 10xxxxxx 10xxxxxx
Four bytes      | 0001 0000-0010 FFFF                | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Next, take the Chinese character 严 (yan) as an example to demonstrate UTF-8 encoding.
The Unicode code of 严 is 4E25 (binary 100111000100101). According to the table above, 4E25 falls within the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary bit of 严, fill in the x positions from back to front, and pad the remaining positions with 0. The UTF-8 encoding of 严 is therefore "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
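The bit-filling procedure can be written out by hand and checked against the JDK's own encoder. This is a sketch for illustration, not how the JDK implements it; the class Utf8Encoder and its methods are hypothetical names, and the sketch does not reject the surrogate range D800-DFFF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Encoder {
    /** Encodes one code point by filling the x positions of the matching table row. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("outside the Unicode range: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};                      // 0xxxxxxx
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                      // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))};                   // 10xxxxxx
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                     // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),             // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))};                   // 10xxxxxx
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),                         // 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)),                // 10xxxxxx
            (byte) (0x80 | ((cp >> 6) & 0x3F)),                 // 10xxxxxx
            (byte) (0x80 | (cp & 0x3F))};                       // 10xxxxxx
    }

    /** Formats bytes as uppercase hexadecimal without separators. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4E25);
        byte[] library = "\u4E25".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));                    // E4B8A5
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```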
4. The difference between UTF-8, UTF-16 and UTF-32
First, be clear that Unicode is a character set that assigns a unique code to every character in the world. It only specifies the binary code of each symbol and gives no storage rules. UTF-8, UTF-16 and UTF-32 are storage formats for Unicode. (By analogy with communications: a signal, standing in for the Unicode code, is turned into different sequences of high and low levels by different coding methods.)
4.1 UCS-2 and UCS-4
Unicode was created to cover all the languages in the world. Every character corresponds to a value in Unicode, called a code point, which is usually written in the form U+ABCD. The mapping between characters and code points is UCS-2 (Universal Character Set coded in 2 octets). As the name implies, UCS-2 represents a code point in two bytes, with values ranging from U+0000 to U+FFFF.
To represent more characters, UCS-4 was proposed, which uses four bytes per code point. Its range is U+00000000 to U+7FFFFFFF, of which U+00000000 to U+0000FFFF is identical to UCS-2.
Note that UCS-2 and UCS-4 only specify the correspondence between code points and text, not how code points are stored in the computer. The storage mode is called UTF (Unicode Transformation Format), and UTF-16 and UTF-8 are the most widely used ones.
4.2 UTF-16
UTF-16 is specified by RFC 2781. It uses two bytes for each code point in the UCS-2 range (U+0000 to U+FFFF) and a pair of two-byte units, four bytes in total, for code points above U+FFFF. For the UCS-2 range, UTF-16 corresponds exactly to UCS-2: the code point is stored directly in Big Endian or Little Endian order. UTF-16 comes in three forms: UTF-16, UTF-16BE (Big Endian) and UTF-16LE (Little Endian). UTF-16BE and UTF-16LE are self-explanatory, while plain UTF-16 indicates whether the file is Big Endian or Little Endian by starting with a character called the BOM (Byte Order Mark), which is the character U+FEFF. The BOM is a clever idea: because U+FFFE is not a valid character, the byte sequence FE FF at the start of a file can only be U+FEFF in Big Endian order, and FF FE can only be U+FEFF in Little Endian order.
The BOM (Byte Order Mark) sits at the beginning of a document and tells the reader the document's byte order. UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to mark the encoding itself. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" (U+FEFF) is EF BB BF, so if the receiver gets a byte stream that starts with EF BB BF, it knows the stream is UTF-8. UTF-16 does need a BOM: its code units are two bytes wide, so the reader must be told whether they are stored in big or little byte order.
Low byte order (Little Endian) and high byte order (Big Endian)
Low byte order and high byte order are simply conventions for storing and reading multi-byte values (called words) in memory. When you ask your computer to store the letter A in UTF-16 (two bytes), the byte order decides whether the first byte is placed before or after the second. An example makes this clearer: when you save the text "hello" in UTF-16, the bytes may look like this on different systems:
00 68 00 65 00 6C 00 6C 00 6F (high byte order: the high-order byte is stored first)
68 00 65 00 6C 00 6C 00 6F 00 (low byte order: the low-order byte is stored first)
The byte order scheme is just a matter of preference for microprocessor architecture designers. For example, Intel uses low byte order and Motorola uses high byte order.
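The two byte sequences above, which spell "hello", can be reproduced with the three UTF-16 charsets that every Java platform is required to provide. The class name ByteOrderDemo and the helper hex are made up for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    /** Encodes the text and returns the bytes as space-separated hexadecimal. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // High byte order: 00 68 00 65 00 6C 00 6C 00 6F
        System.out.println(hex("hello", StandardCharsets.UTF_16BE));
        // Low byte order: 68 00 65 00 6C 00 6C 00 6F 00
        System.out.println(hex("hello", StandardCharsets.UTF_16LE));
        // Plain UTF-16: Java's encoder writes a big-endian BOM (FE FF) first.
        System.out.println(hex("hello", StandardCharsets.UTF_16));
    }
}
```

When decoding, Java's plain UTF-16 charset reads the BOM to pick the byte order and assumes big endian if none is present.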
For example, the three characters "ABC" encoded in various ways give the following results:
Encoding               | Encoded bytes
-----------------------+------------------------
UTF-16BE               | 00 41 00 42 00 43
UTF-16LE               | 41 00 42 00 43 00
UTF-16 (Big Endian)    | FE FF 00 41 00 42 00 43
UTF-16 (Little Endian) | FF FE 41 00 42 00 43 00
UTF-16 (without BOM)   | 00 41 00 42 00 43
4.3 UTF-32
UTF-32 represents every code point in four bytes, so all UCS-4 code points can be represented directly, without the more complex algorithm that UTF-16 needs. Like UTF-16, UTF-32 comes in three forms, UTF-32, UTF-32BE and UTF-32LE, and plain UTF-32 also requires a BOM.
4.4 How does a text editor know the encoding of a text
When software opens a text file, the first thing it does is determine which character set and encoding the text was saved in. Software generally uses three ways to determine the character set and encoding of a text.
Detect the header identifier (BOM):
EF BB BF: UTF-8
FE FF: UTF-16/UCS-2, big endian
FF FE: UTF-16/UCS-2, little endian
FF FE 00 00: UTF-32/UCS-4, little endian
00 00 FE FF: UTF-32/UCS-4, big endian
The software guesses the encoding of the current file according to the encoding rules.
Prompt the user to enter the encoding of the current file.
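The first of the three approaches, BOM detection, can be sketched as follows. The class BomSniffer and its methods are hypothetical names chosen for this example. One design choice is worth noting: the four-byte UTF-32 marks are tested before the two-byte UTF-16 marks, because FF FE 00 00 also begins with the UTF-16 little-endian mark FF FE.

```java
class BomSniffer {
    /** Returns the encoding announced by a leading BOM, or null if there is none. */
    static String detect(byte[] data) {
        // Longest marks first: FF FE 00 00 would otherwise match FF FE.
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: fall back to guessing or asking the user
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        byte[] plain = {0x41, 0x42, 0x43};
        System.out.println(detect(utf8));    // UTF-8
        System.out.println(detect(utf16le)); // UTF-16LE
        System.out.println(detect(plain));   // null
    }
}
```

A UTF-16 little-endian file whose first character is U+0000 would be misread as UTF-32 by this ordering; the sketch accepts that rare ambiguity.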
5. The difference between GBK, GB2312 and GB18030
GB2312 is an extension of ASCII in which a Chinese character takes two bytes. A byte with a value of 127 or less keeps its original ASCII meaning, but two consecutive bytes above 127 together represent one Chinese character: the first byte (called the high byte) ranges from 0xA1 to 0xF7 and the second byte (the low byte) from 0xA1 to 0xFE. This yields about 7,000 simplified Chinese characters. The same code space also includes mathematical symbols, Roman and Greek letters, and Japanese katakana; even the digits, punctuation and letters that already exist in ASCII were encoded again as two-byte codes. These are the so-called "full-width" characters, while the original ones at 127 and below are called "half-width" characters.
GB2312 still cannot represent enough characters, so GBK appeared. GBK is an extension of GB2312 and also uses two bytes. GBK no longer requires the low byte to be above 127: as long as the first byte is greater than 127, it marks the start of a Chinese character, whatever the value of the byte that follows. The resulting extended scheme is called the GBK standard. GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.
GB18030 uses variable-length encoding: a character takes 1, 2 or 4 bytes. It extends GB2312 and GBK and is fully compatible with both.
After the above introduction, we can see that Unicode is a world standard, which makes coding tables for all language symbols in the world, while GBK and GB2312 mainly encode Chinese characters.
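To see the difference in practice, the same character can be encoded with GBK and with UTF-8. This sketch assumes the running JDK ships the GBK charset, which is common but not guaranteed by the Java specification, so the code checks Charset.isSupported first. The class name GbkVersusUtf8 is made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GbkVersusUtf8 {
    /** Encodes the text and returns the bytes as space-separated hexadecimal. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String yan = "\u4E25"; // the Chinese character used in the UTF-8 example
        System.out.println("UTF-8: " + hex(yan, StandardCharsets.UTF_8)); // E4 B8 A5
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            System.out.println("GBK:   " + hex(yan, gbk)); // D1 CF
            System.out.println("GBK A: " + hex("A", gbk)); // 41, same as ASCII
        } else {
            System.out.println("GBK is not available in this runtime");
        }
    }
}
```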
6. Coding problems in Java
Encoding comes into play whenever characters are converted to bytes or bytes to characters. The scenarios that need this conversion are mainly I/O, which includes disk I/O and network I/O. Most garbled text caused by I/O comes from network I/O.
The user initiates an HTTP request from the browser; the places where encoding is needed are the URL, cookies, and parameters. After receiving the HTTP request, the server parses the HTTP protocol, and the URI, cookies and POST form parameters need to be decoded. The server may also need to read data from a database or from text files, locally or elsewhere on the network, and this data may have encoding problems too. When the Servlet has processed all the requested data, it encodes the result and sends it through a socket to the user's browser, which decodes it into text. These processes are shown in the following figure:
As the figure above shows, an HTTP request involves encoding and decoding in many places. What rules do they follow? The following sections look at each in turn:
Coding and Decoding of URL
The user submits a URL that may contain Chinese characters, so it needs to be encoded. How is the URL encoded, by what rules, and how is it decoded? The following figure shows a URL:
The port is configured in Tomcat, the Context Path is configured in Tomcat's <Context> configuration, and the Servlet Path is configured in the web.xml of the web application, which here maps JunshanExample to the pattern /servlets/servlet/*.
PathInfo is the specific Servlet we request, and QueryString carries the parameters to be passed. Note that the URL is entered directly in the browser, so the request is made with the GET method. If the request is made with the POST method, the parameters are submitted to the server through a form, which is described later.
PathInfo and QueryString both contain Chinese characters in the figure above. When we type this URL directly into the browser, how is it encoded by the browser and parsed by the server? To verify how the browser encodes the URL, we use the Firefox browser and observe the actual content of the requested URL through the HTTPFox plug-in. Here are the test results for the URL http://localhost:8080/examples/servlets/servlet/君山?author=君山 in the Chinese edition of Firefox 3.6.12:
The two encodings of 君山 are e5 90 9b e5 b1 b1 and be fd c9 bd. Comparing these with the encodings described earlier, we can see that PathInfo is encoded in UTF-8 and QueryString in GBK. As for why there is a "%": according to the URL specification RFC 3986, the browser converts non-ASCII characters to bytes in some encoding, writes each byte as two hexadecimal digits, and prefixes each byte with "%", which gives the final URL the form shown in the figure above.
These test results show that the browser encodes PathInfo and QueryString differently, and different browsers may encode PathInfo differently, which makes decoding on the server difficult. Let's take Tomcat as an example and see how it decodes the URL.
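The two byte sequences observed in the test can be reproduced with java.net.URLEncoder. This is only a sketch of the percent-encoding step, not of what the browser does internally. It assumes Java 10 or later for the encode(String, Charset) overload and a JDK that includes the GBK charset; the class name PercentEncodingDemo is made up. URLEncoder implements form encoding (for example, a space becomes "+"), which coincides with plain percent-encoding for these characters.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentEncodingDemo {
    /** Percent-encodes the text after converting it to bytes in the given charset. */
    static String encode(String text, Charset charset) {
        return URLEncoder.encode(text, charset);
    }

    public static void main(String[] args) {
        String junshan = "\u541B\u5C71"; // the two characters used in the test URL
        // PathInfo in the test: UTF-8 bytes, each prefixed with '%'.
        System.out.println(encode(junshan, StandardCharsets.UTF_8)); // %E5%90%9B%E5%B1%B1
        // QueryString in the test: GBK bytes, each prefixed with '%'.
        if (Charset.isSupported("GBK")) {
            System.out.println(encode(junshan, Charset.forName("GBK"))); // %BE%FD%C9%BD
        }
    }
}
```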
    protected void convertURI(MessageBytes uri, Request request) throws Exception {
        ByteChunk bc = uri.getByteChunk();
        int length = bc.getLength();
        CharChunk cc = uri.getCharChunk();
        cc.allocate(length, -1);
        // Character set configured on the connector (URIEncoding); may be null
        String enc = connector.getURIEncoding();
        if (enc != null) {
            B2CConverter conv = request.getURIConverter();
            try {
                if (conv == null) {
                    conv = new B2CConverter(enc);
                    request.setURIConverter(conv);
                }
            } catch (IOException e) {
                ...
            }
            if (conv != null) {
                try {
                    conv.convert(bc, cc, cc.getBuffer().length - cc.getEnd());
                    uri.setChars(cc.getBuffer(), cc.getStart(), cc.getLength());
                    return;
                } catch (IOException e) {
                    ...
                }
            }
        }
        // Default encoding: fast conversion, one byte becomes one char (ISO-8859-1)
        byte[] bbuf = bc.getBuffer();
        char[] cbuf = cc.getBuffer();
        int start = bc.getStart();
        for (int i = 0; i < length; i++) {
            cbuf[i] = (char) (bbuf[i + start] & 0xff);
        }
        uri.setChars(cbuf, 0, length);
    }
The code above shows that the character set used to decode the URI part of the URL is defined on the connector; if it is not defined, the URI is parsed with the default encoding, ISO-8859-1. Therefore, if URLs contain Chinese characters, it is best to set URIEncoding to UTF-8.
How is the QueryString parsed? The QueryString of a GET request and the form parameters of a POST request are both stored as Parameters, and their values are obtained through request.getParameter. They are decoded the first time request.getParameter is called, which invokes the parseParameters method of org.apache.catalina.connector.Request. This method decodes the parameters passed by both GET and POST, but the character sets used may differ. The decoding of the POST form is described later. Where is the character set for decoding the QueryString defined? The QueryString travels to the server in the HTTP header as part of the URL, so is it decoded with the same character set as the URI? Since browsers encode PathInfo and QueryString differently, you might guess that the decoding character sets differ as well, and they do: the QueryString is decoded either with the charset defined in the Content-Type header or with the default ISO-8859-1. To use the charset defined in Content-Type, you must set useBodyEncodingForURI to true on the connector. The name of this option is a little confusing: it does not apply the body encoding to the whole URI, only to the QueryString.
As the process above shows, URL encoding and decoding is complex and not fully under the application's control, so applications should avoid non-ASCII characters in URLs where possible; otherwise garbled text is likely. It is also best to set both URIEncoding and useBodyEncodingForURI on the server.
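The effect of the ISO-8859-1 fallback can be reproduced outside Tomcat. The sketch below is not Tomcat code; the class Latin1FallbackDemo and its methods are hypothetical names. It decodes UTF-8 bytes one byte per char, as the fast conversion path does, and then shows that the original text can still be recovered, because ISO-8859-1 maps every byte value to a char and back without loss.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1FallbackDemo {
    /** Mimics the default path: every byte becomes exactly one char. */
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Recovers text that was wrongly decoded as ISO-8859-1. */
    static String repair(String garbled, Charset actual) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        String original = "\u541B\u5C71"; // the two characters from the test URL
        byte[] wire = original.getBytes(StandardCharsets.UTF_8); // 6 bytes on the wire
        String garbled = decodeAsLatin1(wire);
        System.out.println(garbled.length());                  // 6: one char per byte
        System.out.println(garbled.equals(original));          // false: the text is garbled
        String repaired = repair(garbled, StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original));         // true
    }
}
```

Re-decoding like this is a workaround for data that has already been mis-decoded; configuring the correct character set on the server avoids the problem at the source.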
Coding and Decoding of HTTP Header
When the client initiates an HTTP request, besides the URL it may also pass other values in the header, such as Cookie and redirectPath. These user-set values may also have encoding problems. How does Tomcat decode them?
Header items are decoded when request.getHeader is called. If the requested header item has not been decoded yet, the toString method of MessageBytes is called, which converts bytes to chars with the default encoding, again ISO-8859-1. No other decoding format can be configured for headers, so a header containing non-ASCII characters will come out garbled.
The same applies when we add headers: do not pass non-ASCII characters in a header. If you must, first encode the characters with org.apache.catalina.util.URLEncoder and then add them to the header, so that no information is lost in transit between browser and server; when the value is read, decode it with the corresponding character set.
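The encode-before-sending, decode-after-reading round trip can be sketched with the JDK's own java.net.URLEncoder and URLDecoder, used here in place of the Tomcat class named above so that the example has no dependencies. It assumes Java 10 or later for the Charset overloads; the class HeaderValueCodec and its methods are hypothetical names.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class HeaderValueCodec {
    /** Turns arbitrary text into an ASCII-only value that is safe to put in a header. */
    static String toHeaderValue(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    /** Restores the original text; both sides must agree on the character set. */
    static String fromHeaderValue(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    /** Reports whether every char is in the 7-bit ASCII range. */
    static boolean isAscii(String value) {
        return value.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        String original = "\u541B\u5C71";
        String headerValue = toHeaderValue(original);
        System.out.println(headerValue);                                   // %E5%90%9B%E5%B1%B1
        System.out.println(isAscii(headerValue));                          // true
        System.out.println(fromHeaderValue(headerValue).equals(original)); // true
    }
}
```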
Coding and Decoding of POST form
As mentioned earlier, parameters submitted by a POST form are decoded on the first call to request.getParameter. Unlike the QueryString, POST form parameters are passed to the server in the HTTP body. When we click the submit button on the page, the browser first encodes the form values with the charset of the Content-Type and then submits them to the server, which also decodes them with the charset in the Content-Type. So parameters submitted through a POST form generally cause no problems, and the character set is under our control: it can be set through request.setCharacterEncoding(charset).
In addition, parameters of type multipart/form-data, that is, uploaded files, also use the character set defined by the Content-Type. Note that the uploaded file is transferred to a local temporary directory on the server as a byte stream, which involves no character encoding; encoding only happens when the file content is added to the parameters. If the content cannot be handled with that character set, the default ISO-8859-1 is used.
Coding and Decoding of HTTP BODY
When the requested resource has been obtained, the content is returned to the client browser through the Response: it is encoded on the server and decoded by the browser. The character set for this step can be set through response.setCharacterEncoding, which overrides the value of request.getCharacterEncoding and is returned to the client in the Content-Type header. When the browser receives the returned socket stream, it decodes it using the charset from Content-Type. If no charset is set in the returned HTTP header, the browser falls back to the charset declared in the HTML; if that is not defined either, the browser decodes with its default encoding.
Other areas that need to pay attention to coding
Besides URL and parameter encoding, encoding problems can arise in many other places on the server side, such as reading XML, the Velocity template engine, JSP, or reading data from the database.
An XML file declares its encoding in its header.
A Velocity template sets its encoding format:
services.VelocityService.input.encoding=UTF-8
JSP sets the encoding format:
The database is accessed through the client's JDBC driver. The encoding used by JDBC should match the encoding used inside the database, which can be specified in the JDBC URL. For MySQL, for example:
url="jdbc:mysql://localhost:3306/DB?useUnicode=true&characterEncoding=GBK"
8. Analysis of garbled code problem
When garbled text does appear, how should we deal with it? The only cause of garbled text is a mismatch between the character set used to encode and the one used to decode in a char-to-byte or byte-to-char conversion. Since one operation often involves several rounds of encoding and decoding, it is hard to tell which step went wrong. In my experience, tracing the data step by step from the source is usually the fastest way to find the cause.
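One practical aid when checking each step is a strict decoder that reports malformed input instead of silently substituting replacement characters. The following sketch uses the standard java.nio.charset API; the class StrictDecodeCheck and its method are hypothetical names. It tests whether the bytes captured at some step are valid in a candidate character set.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeCheck {
    /** Returns true if the bytes decode in the charset without any error. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The two byte sequences observed for the same text in the URL test.
        byte[] fromPathInfo = {(byte) 0xE5, (byte) 0x90, (byte) 0x9B,
                               (byte) 0xE5, (byte) 0xB1, (byte) 0xB1};
        byte[] fromQueryString = {(byte) 0xBE, (byte) 0xFD, (byte) 0xC9, (byte) 0xBD};
        System.out.println(decodesCleanly(fromPathInfo, StandardCharsets.UTF_8));    // true
        System.out.println(decodesCleanly(fromQueryString, StandardCharsets.UTF_8)); // false
        // ISO-8859-1 accepts every byte sequence, so passing this check proves nothing.
        System.out.println(decodesCleanly(fromQueryString, StandardCharsets.ISO_8859_1)); // true
    }
}
```

A clean decode is evidence, not proof: short byte sequences can be valid in several character sets at once, so the result is best combined with knowledge of where the bytes came from.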