The method of converting UTF-8, UTF-16 and UTF-32 codes to each other

2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article describes how to convert text between UTF-8, UTF-16 and UTF-32, and presents a simple, easy-to-use implementation of the conversions.

I have recently been considering writing a general-purpose, cross-platform string class. The first problem to solve is transcoding.

By default, Visual Studio saves source files either in the local code page (GBK for Chinese, Shift-JIS for Japanese) or as UTF-8 with a BOM.

GCC expects UTF-8, with or without a BOM (the character set of the source code can be specified with the -finput-charset option).

So source files can be saved as UTF-8 with a BOM. At run time, however, Unicode strings on Windows are UTF-16, while Linux uses UTF-8 or UTF-32. On either system, a program therefore has to handle conversion between the UTF encodings when it processes strings.
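One reason the conversion cannot be skipped is that the same code point occupies a different number of code units in each encoding. The following is a minimal sketch of those counts; the function names are hypothetical and it assumes the input is a valid Unicode scalar value (at most U+10FFFF, not a surrogate).

```cpp
#include <cstddef>

// Number of code units needed to store one code point in each UTF form.
// Assumes cp is a valid Unicode scalar value (<= 0x10FFFF, not a surrogate).
inline std::size_t utf8_units(char32_t cp) {
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;   // rest of the Basic Multilingual Plane
    return 4;                     // supplementary planes
}

inline std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;  // 2 means a surrogate pair
}

inline std::size_t utf32_units(char32_t) {
    return 1;                     // always one 32-bit unit
}
```

For example, U+4E2D takes three bytes in UTF-8 but a single unit in UTF-16 and UTF-32, so buffers cannot simply be reinterpreted between encodings.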

The code is posted directly below. For the algorithm itself I borrowed from the UnicodeConverter by http://blog.csdn.net/jhqin, and wrapped it in some templates to make it easier to use.

Core algorithm (from UnicodeConverter):

    namespace transform
    {
        /* UTF-32 to UTF-8 */
        inline static size_t utf(uint32 src, uint8* des)
        {
            if (src == 0) return 0;

            static const byte PREFIX[] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
            static const uint32 CODE_UP[] =
            {
                0x80,        // U+00000000 - U+0000007F
                0x800,       // U+00000080 - U+000007FF
                0x10000,     // U+00000800 - U+0000FFFF
                0x200000,    // U+00010000 - U+001FFFFF
                0x4000000,   // U+00200000 - U+03FFFFFF
                0x80000000   // U+04000000 - U+7FFFFFFF
            };

            size_t i, len = sizeof(CODE_UP) / sizeof(uint32);
            for (i = 0; i < len; ++i)
                if (src < CODE_UP[i]) break;
            if (i == len) return 0; // the src is invalid

            len = i + 1;
            if (des)
            {
                // Fill the continuation bytes from the last one backwards.
                for (; i > 0; --i)
                {
                    des[i] = static_cast<uint8>((src & 0x3F) | 0x80);
                    src >>= 6;
                }
                des[0] = static_cast<uint8>(src | PREFIX[len - 1]);
            }
            return len;
        }

        /* UTF-8 to UTF-32 */
        inline static size_t utf(const uint8* src, uint32& des)
        {
            if (!src || (*src) == 0) return 0;

            uint8 b = *(src++);
            if (b < 0x80) { des = b; return 1; }
            if (b < 0xC0 || b > 0xFD) return 0; // the src is invalid

            size_t len;
            if      (b < 0xE0) { des = b & 0x1F; len = 2; }
            else if (b < 0xF0) { des = b & 0x0F; len = 3; }
            else if (b < 0xF8) { des = b & 0x07; len = 4; }
            else if (b < 0xFC) { des = b & 0x03; len = 5; }
            else               { des = b & 0x01; len = 6; }

            size_t i = 1;
            for (; i < len; ++i)
            {
                b = *(src++);
                if (b < 0x80 || b > 0xBF) return 0; // the src is invalid
                des = (des << 6) | (b & 0x3F);
            }
            return len;
        }

        // ... the UTF-32 to UTF-16 and UTF-16 to UTF-32 conversions follow here.
        // That part of the listing is missing from this copy; only the surrogate
        // range checks against 0xD800 and 0xDC00 survive.
    }

For conversions from a larger unit to a smaller one (32 to 16, 32 to 8, 16 to 8), the external interface can be abstracted into these two forms:

    type_t utf(T src, U* des)
    type_t utf(const T* src, U* des)

Conversions from a smaller unit to a larger one take the following two forms:

    type_t utf(const T* src, U& des)
    type_t utf(const T* src, U* des)

In addition, the second (pointer) parameter can be given a default value (a null pointer), so a suitable generic class can be written like this:

    // Note: the template argument lists of this listing were lost in this copy.
    // They are reconstructed here on the assumption that the enable_if
    // conditions use the check<X, T> helper shown later in the article.
    template <size_t X, size_t Y, bool = (X > Y), bool = (X != Y)>
    struct detail;

    /* UTF-X (32/16) to UTF-Y (16/8) */
    template <size_t X, size_t Y>
    struct detail<X, Y, true, true>
    {
        typedef typename utf_type<X>::type_t src_t;
        typedef typename utf_type<Y>::type_t des_t;

        template <typename T, typename U>
        static typename enable_if<check<X, T>::value && check<Y, U>::value, size_t>::type_t
        utf(T src, U* des)
        { return transform::utf((src_t)(src), (des_t*)(des)); }

        template <typename T>
        static typename enable_if<check<X, T>::value, size_t>::type_t
        utf(T src)
        { return transform::utf((src_t)(src), (des_t*)(0)); }

        template <typename T, typename U>
        static typename enable_if<check<X, T>::value && check<Y, U>::value, size_t>::type_t
        utf(const T* src, U* des)
        { return transform::utf((const src_t*)(src), (des_t*)(des)); }

        template <typename T>
        static typename enable_if<check<X, T>::value, size_t>::type_t
        utf(const T* src)
        { return transform::utf((const src_t*)(src), (des_t*)(0)); }
    };

    /* UTF-X (16/8) to UTF-Y (32/16) */
    template <size_t X, size_t Y>
    struct detail<X, Y, false, true>
    {
        typedef typename utf_type<X>::type_t src_t;
        typedef typename utf_type<Y>::type_t des_t;

        template <typename T, typename U>
        static typename enable_if<check<X, T>::value && check<Y, U>::value, size_t>::type_t
        utf(const T* src, U& des)
        {
            des_t tmp; // avoids the strict-aliasing warning from gcc 4.4
            size_t ret = transform::utf((const src_t*)(src), tmp);
            des = tmp;
            return ret;
        }

        template <typename T, typename U>
        static typename enable_if<check<X, T>::value && check<Y, U>::value, size_t>::type_t
        utf(const T* src, U* des)
        { return transform::utf((const src_t*)(src), (des_t*)(des)); }

        template <typename T>
        static typename enable_if<check<X, T>::value, size_t>::type_t
        utf(const T* src)
        { return transform::utf((const src_t*)(src), (des_t*)(0)); }
    };
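Passing a null destination is what makes a two-pass pattern possible: measure first, then write into a buffer of exactly the right size. The following is a minimal sketch of that pattern. `encode_utf8` and `to_utf8` are hypothetical stand-ins, not the article's `transform::utf`; the sketch uses standard fixed-width types and, unlike the listing above, accepts only code points up to U+10FFFF and rejects surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a UTF-32 to UTF-8 encoder.
// Returns the number of bytes needed (0 if cp is invalid) and writes them
// only when des is non-null.
inline std::size_t encode_utf8(std::uint32_t cp, std::uint8_t* des = nullptr) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    static const std::uint8_t prefix[] = {0x00, 0xC0, 0xE0, 0xF0};
    std::size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (des) {
        // Fill continuation bytes from the last one backwards.
        for (std::size_t i = len - 1; i > 0; --i) {
            des[i] = static_cast<std::uint8_t>((cp & 0x3F) | 0x80);
            cp >>= 6;
        }
        des[0] = static_cast<std::uint8_t>(cp | prefix[len - 1]);
    }
    return len;
}

// Two-pass conversion: pass 1 only measures, pass 2 writes the bytes.
// Invalid code points contribute zero bytes in both passes.
inline std::string to_utf8(const std::u32string& in) {
    std::size_t total = 0;
    for (char32_t cp : in) total += encode_utf8(cp);
    std::string out(total, '\0');
    std::uint8_t* p = reinterpret_cast<std::uint8_t*>(&out[0]);
    for (char32_t cp : in) p += encode_utf8(cp, p);
    return out;
}
```

With this shape, `to_utf8(U"\u4E2D")` yields the three bytes E4 B8 AD without any intermediate over-allocation.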

The final outward-facing class can then be quite simple:

    template <size_t X, size_t Y>
    struct converter : detail<X, Y> {};

Using detail as above, it is also easy to write an outward-facing template that selects the conversion algorithm from a pair of numbers such as 8 and 16.

With converter, the same kind of task (UTF-8 to wchar_t) becomes much easier and more pleasant:

    const char* c = "World";
    wstring s;
    size_t n; wchar_t w;
    // The template arguments of converter were lost in this copy;
    // <8, 32> is an assumption that fits a 32-bit wchar_t.
    while (!!(n = converter<8, 32>::utf(c, w))) // the !! silences a gcc warning
    {
        s.push_back(w);
        c += n;
    }
    FILE* fp = fopen("test_converter.txt", "wb");
    fwrite(s.c_str(), sizeof(wchar_t), s.length(), fp);
    fclose(fp);

The short snippet above converts a piece of UTF-8 text to wchar_t character by character, appends each result to a wstring with push_back, and finally writes the converted string to test_converter.txt.
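Converting to wchar_t one character at a time works directly when wchar_t is 32 bits wide. Where it is 16 bits, code points above U+FFFF have to be split into a surrogate pair. The following is a minimal sketch of that UTF-32/UTF-16 step under standard fixed-width types; `encode_utf16` and `decode_utf16` are hypothetical names, not functions from the UnicodeConverter code.

```cpp
#include <cstddef>
#include <cstdint>

// UTF-32 to UTF-16: returns the number of 16-bit units (0 if cp is invalid).
// Writes to des only when it is non-null.
inline std::size_t encode_utf16(std::uint32_t cp, std::uint16_t* des = nullptr) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x10000) {
        if (des) des[0] = static_cast<std::uint16_t>(cp);
        return 1;
    }
    cp -= 0x10000;  // 20 bits remain, split 10/10 across the pair
    if (des) {
        des[0] = static_cast<std::uint16_t>(0xD800 | (cp >> 10));    // high surrogate
        des[1] = static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    }
    return 2;
}

// UTF-16 to UTF-32: returns the number of 16-bit units consumed (0 on error
// or at the terminating zero). Expects a zero-terminated input.
inline std::size_t decode_utf16(const std::uint16_t* src, std::uint32_t& des) {
    if (!src || *src == 0) return 0;
    std::uint16_t w1 = src[0];
    if (w1 < 0xD800 || w1 > 0xDFFF) { des = w1; return 1; }
    if (w1 > 0xDBFF) return 0;                 // a low surrogate cannot come first
    std::uint16_t w2 = src[1];
    if (w2 < 0xDC00 || w2 > 0xDFFF) return 0;  // missing low surrogate
    des = 0x10000 + ((static_cast<std::uint32_t>(w1 & 0x3FF) << 10) | (w2 & 0x3FF));
    return 2;
}
```

The return values follow the same convention as the loop above: the caller advances the source pointer by the count returned and stops when it is zero.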

In fact, the generics above are still cumbersome. Why not use generic parameters directly on transform::utf?

At first I only thought of the approach above, mostly out of the habit of wanting to specify the conversion by hand. My original idea was a template call of the form utf<X, Y>(s1, s2), where the two numbers X and Y determine the input and output formats.

Later, I found that it is more direct to let the string or character type itself decide.

Looking back, the unit width needed for the conversion (8, 16 or 32 bits) is already given by the parameter type: an 8-bit char or byte type is certainly not being used to store UTF-32.

So it is enough to make the parameters of the core algorithm generic. The code then looks like this:

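    // Note: the template argument lists of this listing were lost in this copy
    // and are reconstructed from the surrounding text.
    namespace transform
    {
        namespace private_
        {
            template <size_t X> struct utf_type;
            template <> struct utf_type<8>  { typedef uint8  type_t; };
            template <> struct utf_type<16> { typedef uint16 type_t; };
            template <> struct utf_type<32> { typedef uint32 type_t; };

            template <size_t X, typename T>
            struct check
            {
                static const bool value =
                    ((sizeof(T) == sizeof(typename utf_type<X>::type_t)) && !is_pointer<T>::value);
            };
        }

        using namespace transform::private_;

        /* UTF-32 to UTF-8 */
        template <typename T, typename U>
        typename enable_if<check<32, T>::value && check<8, U>::value, size_t>::type_t
        utf(T src, U* des)
        {
            if (src == 0) return 0;

            static const byte PREFIX[] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
            static const uint32 CODE_UP[] =
            {
                0x80,        // U+00000000 - U+0000007F
                0x800,       // U+00000080 - U+000007FF
                0x10000,     // U+00000800 - U+0000FFFF
                0x200000,    // U+00010000 - U+001FFFFF
                0x4000000,   // U+00200000 - U+03FFFFFF
                0x80000000   // U+04000000 - U+7FFFFFFF
            };

            size_t i, len = sizeof(CODE_UP) / sizeof(uint32);
            for (i = 0; i < len; ++i)
                if (src < CODE_UP[i]) break;
            if (i == len) return 0; // the src is invalid

            len = i + 1;
            if (des)
            {
                for (; i > 0; --i)
                {
                    des[i] = static_cast<U>((src & 0x3F) | 0x80);
                    src >>= 6;
                }
                des[0] = static_cast<U>(src | PREFIX[len - 1]);
            }
            return len;
        }

        /* UTF-8 to UTF-32 */
        template <typename T, typename U>
        typename enable_if<check<8, T>::value && check<32, U>::value, size_t>::type_t
        utf(const T* src, U& des)
        {
            if (!src || (*src) == 0) return 0;

            uint8 b = *(src++);
            if (b < 0x80) { des = b; return 1; }
            if (b < 0xC0 || b > 0xFD) return 0; // the src is invalid

            size_t len;
            if      (b < 0xE0) { des = b & 0x1F; len = 2; }
            else if (b < 0xF0) { des = b & 0x0F; len = 3; }
            else if (b < 0xF8) { des = b & 0x07; len = 4; }
            else if (b < 0xFC) { des = b & 0x03; len = 5; }
            else               { des = b & 0x01; len = 6; }

            size_t i = 1;
            for (; i < len; ++i)
            {
                b = *(src++);
                if (b < 0x80 || b > 0xBF) return 0; // the src is invalid
                des = (des << 6) | (b & 0x3F);
            }
            return len;
        }

        // ... the listing is cut off here in this copy, inside the UTF-16
        // surrogate range checks (0xD800, 0xDC00).
    }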
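The idea of letting the parameter type choose the conversion can be shown in isolation. The following is a minimal sketch that uses the standard `std::enable_if` instead of the article's own `enable_if` (which exposes `type_t`). The name `pick` is hypothetical and each overload only reports the width it was selected for, standing in for a real conversion routine. It assumes 8-bit bytes.

```cpp
#include <type_traits>

// Overloads selected purely by the size of the element type.
// Exactly one of them is viable for any given T, so the call is unambiguous.
template <typename T>
typename std::enable_if<sizeof(T) == 1, int>::type
pick(const T*) { return 8; }    // treated as UTF-8 code units

template <typename T>
typename std::enable_if<sizeof(T) == 2, int>::type
pick(const T*) { return 16; }   // treated as UTF-16 code units

template <typename T>
typename std::enable_if<sizeof(T) == 4, int>::type
pick(const T*) { return 32; }   // treated as UTF-32 code units
```

Calling `pick("abc")`, `pick(u"abc")` and `pick(U"abc")` selects the 8, 16 and 32 bit overloads respectively, with no width written at the call site. A wchar_t argument lands on whichever width the platform gives it.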
