QString nonsense (2) & the problem of garbled Chinese text in Qt 5

2025-07-12 Update. From: SLTechnology News&Howtos

QString nonsense (2)

Many people have long known that once Chinese is used directly in C++ source code, it becomes very hard to make that source code cross-platform (i18n).

Now that MSVC2010 has become mainstream under Windows and GCC has been upgraded to 4.6 under Linux, the Chinese problem in C++ finally has a fairly elegant, cross-platform workaround.

(This article covers the compilers GCC 4.6+ and MSVC2010 SP1+. It belongs to the QString series but does not touch on QString for the time being.)

The Chinese problem in C++

To use Chinese correctly in C++, you must understand the following two concepts:

Source character set (the source character set)

The encoding in which the source file is saved.

Execution character set (the execution character set)

The encoding of the strings stored in the executable program (that is, the encoding of strings in memory while the program runs).

The problem with C++98: neither the source character set nor the execution character set is specified.

How should this be understood? Take a look at an example.

Example

Is this asking too much?

A simple C++ program that should simply give the same result under Simplified Chinese Windows, Traditional Chinese Windows, English Windows, Linux, Mac OS and so on.

    // main.cpp
    int main()
    {
        char mystr[] = "honest knowledge can't be careless";
        return sizeof mystr;
    }

With a Chinese string literal like this one, try asking yourself two questions.

In what encoding is this source file saved? Is there a definite answer?

What bytes are stored in mystr? Is there a definite answer?

For C++, neither is certain.
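To make the second question concrete, here is a minimal sketch that is not part of the original example. It spells out the same two Chinese characters (U+4E2D U+6587, chosen only for illustration) byte by byte in two encodings. The names as_utf8, as_gbk and to_hex are made up for this sketch.

```cpp
#include <cstddef>
#include <string>

// U+4E2D U+6587 written as explicit bytes, so the arrays do not depend on
// how this file is saved or on any compiler option.
const char as_utf8[] = "\xE4\xB8\xAD\xE6\x96\x87"; // UTF-8: 3 bytes per character
const char as_gbk[]  = "\xD6\xD0\xCE\xC4";         // GBK: 2 bytes per character

// Render the first n bytes of a buffer as upper-case hex, e.g. "E4 B8 AD".
std::string to_hex(const char* data, std::size_t n)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned char b = static_cast<unsigned char>(data[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Here sizeof as_utf8 is 7 and sizeof as_gbk is 5 (both counts include the terminating null character), so a program that returns sizeof mystr can give different results depending on which encoding the compiler ends up using.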

If you stay on one fixed platform, you can still put up with this.

If you want to go cross-platform, this kind of thing...

GCC

Under GCC, both character sets can be set to whatever encoding you prefer (if not specified, the default is UTF-8):

    -finput-charset=charset
    -fexec-charset=charset

Besides these two options, there is one more:

    -fwide-exec-charset=charset

wide? You might as well guess what it does.
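One way to see what -fexec-charset actually changes is to let the compiler encode a known character and then inspect the bytes. The following is only a sketch: the names probe, exec_charset_is_utf8 and exec_charset_is_gbk are invented here, and it assumes that GBK is the alternative charset of interest.

```cpp
#include <cstring>

// A universal-character-name in a narrow literal is converted to the
// execution character set at compile time, so its bytes reveal that charset.
const char probe[] = "\u4e2d"; // U+4E2D

// True when the compiler encoded the probe as UTF-8 (E4 B8 AD).
bool exec_charset_is_utf8()
{
    return sizeof(probe) == 4 && std::memcmp(probe, "\xE4\xB8\xAD", 3) == 0;
}

// True when the compiler encoded the probe as GBK (D6 D0).
bool exec_charset_is_gbk()
{
    return sizeof(probe) == 3 && std::memcmp(probe, "\xD6\xD0", 2) == 0;
}
```

With g++ and default options, exec_charset_is_utf8() should return true. Compiling the same file with -fexec-charset=GBK should flip the result, provided the local GCC build can convert to that charset, which is an assumption about the toolchain.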

MSVC

MSVC, in the versions discussed here, has no options comparable to the ones above.

How is the source character set determined?

If the file has a BOM, it is interpreted according to that BOM; if not, the local locale character set is used (which varies with system settings).

How is the execution character set determined?

The local locale character set is used (which varies with system settings).

That is pretty heavy-handed. (Of course, you can use #pragma setlocale("...") in the source code, but its functionality is limited; for example, Windows has no UTF-8 locale, so...)
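The rule for the source character set (a BOM decides, otherwise the local locale is used) can be sketched as a small function. This is an illustration of the rule as described above, not MSVC's actual implementation; SourceEncoding and guess_source_encoding are hypothetical names, and UTF-32 BOMs are deliberately ignored.

```cpp
#include <string>

enum class SourceEncoding { Utf8, Utf16LittleEndian, Utf16BigEndian, LocalLocale };

// Guess the encoding of a source file from its first bytes: a BOM decides,
// otherwise fall back to the local locale character set.
SourceEncoding guess_source_encoding(const std::string& bytes)
{
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return SourceEncoding::Utf8;               // UTF-8 BOM
    if (bytes.compare(0, 2, "\xFF\xFE") == 0)
        return SourceEncoding::Utf16LittleEndian;  // UTF-16 LE BOM
    if (bytes.compare(0, 2, "\xFE\xFF") == 0)
        return SourceEncoding::Utf16BigEndian;     // UTF-16 BE BOM
    return SourceEncoding::LocalLocale;            // no BOM: depends on system settings
}
```

The last branch is the troublesome one: the same BOM-less file is read differently on machines with different locale settings.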

By the way, what about the counterpart of GCC's -fwide-exec-charset?

How is the wide execution character set determined?

You might as well think about it first.

What to do?

That is only two compilers, and it already looks this complicated. And there are far more than two C++ compilers out there.

If you want to be cross-platform, you must make sure that both character sets are "determined", and it seems the only character set up to the task is UTF-8.

The UTF-8 solution

If we save the source code as UTF-8 and the execution character set is also UTF-8, the world is at peace. Source files that use non-ASCII characters can then circulate among users in different countries without barriers.

Saving the source code as UTF-8 is not difficult, but getting the execution character set to be UTF-8 is not that simple.

For GCC, the problem is simple (the default encoding options are sufficient):

Just save the source file as UTF-8 (with or without BOM).

Early GCC versions did not accept UTF-8 source files with a BOM; at least as of GCC 4.6, this restriction no longer exists.

For MSVC, the problem is extremely complex:

For MSVC2003, just save the source code as UTF-8 without BOM.

For MSVC2005 and MSVC2008 without the hotfix on top of SP1, there is no solution.

Only with MSVC2010 SP1 was a solution provided: save the source code as UTF-8, UTF-16, ... with BOM, and then add

    #pragma execution_character_set("utf-8")

To span GCC 4.6+ and MSVC2010 SP1+, we need to take the intersection of the two, that is:

Save the source code as UTF-8 with BOM

Add the #pragma for MSVC

    // main.cpp
    #if defined(_MSC_VER) && _MSC_VER >= 1600
    #pragma execution_character_set("utf-8")
    #endif

    int main()
    {
        char mystr[] = "honest knowledge can't be careless";
        return sizeof mystr;
    }

C++11

Once MSVC supports C++11 string literals, we will no longer need that ugly pragma and can write directly:

    char mystr[] = u8"honest knowledge can't be careless";

(Although this already works under GCC, you will have to wait for Visual C++ 12 or later before it is cross-platform.)

Any questions?

Doesn't C++98 have wchar_t? Isn't it meant to represent Unicode characters?

Section 5.2 of the Unicode 4.0 standard says:

The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.

Looking back at GCC's option

    -fwide-exec-charset=charset

although the default encoding provided by GCC is UTF-16 or UTF-32 (depending on the width of wchar_t), the encoding can be chosen.

Although wchar_t is not guaranteed to be cross-platform and is no fun to use, under Windows it represents UTF-16 characters and corresponds directly to the system API, so it remains important until char16_t becomes widespread.
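The platform dependence of wchar_t shows up as soon as a character outside the Basic Multilingual Plane is involved. The sketch below is only illustrative; U+1F600 is an arbitrary choice, and wide_units and utf16_units are names invented here.

```cpp
#include <cstddef>

// One code point outside the Basic Multilingual Plane (U+1F600).
const wchar_t wide_text[] = L"\U0001F600";

// Number of wchar_t units used, not counting the terminating L'\0'.
constexpr std::size_t wide_units()
{
    return sizeof(wide_text) / sizeof(wchar_t) - 1;
}

// The same code point as a char16_t literal always needs a surrogate pair.
const char16_t utf16_text[] = u"\U0001F600";

// Number of char16_t units used, not counting the terminator.
constexpr std::size_t utf16_units()
{
    return sizeof(utf16_text) / sizeof(char16_t) - 1;
}
```

With a 4-byte wchar_t, as is typical for GCC on Linux, wide_units() is 1; with the 2-byte wchar_t used on Windows it is 2. utf16_units() is 2 everywhere, which is the kind of guarantee wchar_t cannot give.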

C++11 execution character sets

The u8 mentioned earlier is one of the efforts C++11 has made for the execution character set.

Three new execution character sets are clearly specified: UTF-8, UTF-16 and UTF-32.

    char*        u8"Chinese"
    char16_t*    u"Chinese"
    char32_t*    U"Chinese"
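A quick way to check the three prefixes is to compare the sizes of the same text under each of them. This sketch uses the two characters U+4E2D U+6587 written as universal-character-names, so the result does not depend on how the file is saved. It assumes the usual 8-bit bytes with 2-byte char16_t and 4-byte char32_t; the function names are made up.

```cpp
// The same two characters (U+4E2D U+6587) under each C++11 prefix.
// sizeof counts bytes and includes the terminating null character.
static_assert(sizeof(u8"\u4e2d\u6587") == 7,  "UTF-8: 2 * 3 bytes + 1");
static_assert(sizeof(u"\u4e2d\u6587")  == 6,  "UTF-16: (2 + 1) * 2 bytes");
static_assert(sizeof(U"\u4e2d\u6587")  == 12, "UTF-32: (2 + 1) * 4 bytes");

// First code unit of the UTF-16 and UTF-32 literals, for inspection.
constexpr char16_t first_utf16_unit() { return u"\u4e2d\u6587"[0]; }
constexpr char32_t first_utf32_unit() { return U"\u4e2d\u6587"[0]; }
```

The static_asserts hold whatever -fexec-charset or system locale is in effect, because these prefixes fix the encoding of the literal itself.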

However, C++11 does not specify the source character set.

    const char* mystr = u8"Chinese";

The C++ standard tells the compiler: "I don't care what the actual encoding of this file is, but you must generate the byte stream corresponding to the UTF-8 encoding for me."

The compiler seems a little dumbfounded: if I don't know the encoding of the source file, how am I supposed to convert it?

So:

MSVC says: the source file must have a BOM, otherwise I will assume it is in the local locale encoding.

GCC says: I will assume it is UTF-8, unless I am told about another encoding on the command line.

Under the C++11 standard, the simple way to handle source files is to save them as UTF-8 with BOM.

The second article

Today, now that "Change QString's default codec to be UTF-8" has entered the master branch of Qt5, we can finally re-examine Qt's Chinese support.

20120516 update: it is recommended to read the article "Source code must be UTF-8 and QString wants it" by Thiago Macieira, the maintainer of the QtCore module.

Qt5

Qt5, which has no setCodecXXX, assumes that the execution character set is UTF-8 and no longer allows users to change it on their own. As a result, the various side effects of setCodecXXX in Qt4 no longer exist, and the Chinese problem becomes simpler.

    QString s1 = "Chinese";
    QString s2("Chinese");
    QString s3 = tr("Chinese");
    QString s4 = QStringLiteral("Chinese"); // as long as the string does not need translation, please use this
    QString s5 = QString::fromWCharArray(L"Chinese");
    QString s6 = u8"Chinese"; // C++11
    QString s7 = tr(u8"Chinese");
    ...

All of these work by default in Qt5. The only requirement is to make sure that the execution character set of your C++ is UTF-8.

A showdown of the different spellings

Simple is not necessarily good

The simplest and most direct usage is:

    QString s1 = "Chinese";
    QString s2("Chinese characters");
    QString s6 = u8"Chinese"; // C++11

What is wrong with that?

Once the macro QT_NO_CAST_FROM_ASCII is defined, the code above no longer compiles. (By the way, it seems the macro ought to be renamed; QT_NO_CAST_FROM_CSTRING would be a more fitting name.)

The most misused

In Qt4, QObject::tr() is one of the most misused functions:

    QString s3 = tr("Chinese");

Reason:

In Qt4, many users were swayed by the overwhelming prevalence of setCodecForTr() and came to rely on tr() to solve the Chinese problem.

tr() is meant for translation (I18N and L10N); if you have no such need, you really do not need it. (In Qt4, I only noticed two users from mainland China and one from Japan who had this need and actually tried it, so the rest were presumably misusing it?)

The puzzling wchar_t

When I first came into contact with Qt and QString, I wondered many times why it did not use wchar_t.

    QString s5 = QString::fromWCharArray(L"Chinese");

This is really useful under Windows: first, it is the string type used by the Windows system API; second, it has the same internal representation as QString. However, for various reasons MSVC encourages us to use TEXT/_T, which has made it a stranger to us.

But as far as the C++ standard is concerned, wchar_t is not char16_t after all, so it is not good for cross-platform use. Under Linux, this line of code requires a conversion from UTF-32 to UTF-16.
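The UTF-32 to UTF-16 conversion mentioned above is simple but not free. A minimal sketch in standard C++ follows. It is not Qt's implementation; utf32_to_utf16 is a hypothetical helper, and replacing invalid input with U+FFFD is a choice made for this sketch.

```cpp
#include <string>

// Convert UTF-32 code points to UTF-16 code units.
// Code points above U+FFFF become a surrogate pair; invalid input
// (surrogates, values above U+10FFFF) is replaced with U+FFFD.
std::u16string utf32_to_utf16(const std::u32string& in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            cp = 0xFFFD;
        if (cp <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
        }
    }
    return out;
}
```

For text in the Basic Multilingual Plane, which covers most Chinese characters, each 4-byte unit simply shrinks to one 2-byte unit, but every character still has to be visited and copied.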

QStringLiteral

This is a macro, and a rather complex one:

    QString s4 = QStringLiteral("Chinese");

Before introducing this macro, let's look at the drawbacks of the following spellings:

    QString s1 = "Chinese";
    QString s2("Chinese");
    QString s3 = tr("Chinese");
    QString s6 = u8"Chinese"; // C++11

First, the string of two Chinese characters is placed in the constant data area by the compiler, encoded as UTF-8 (three bytes per character plus the terminator, so at least 7 bytes).

Then, when the program runs, the QString instance is constructed, which has to allocate space on the heap and store the corresponding string in UTF-16.

Isn't that wasteful?
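The run-time work described above can be sketched in standard C++ to show what it involves. This is not Qt's code: utf8_to_utf16 is a hypothetical helper, std::u16string stands in for QString's internal buffer, and the decoder is deliberately minimal (overlong encodings are not rejected).

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-16, as a UTF-16 string class must do when it is
// built from a UTF-8 literal at run time. Malformed input yields U+FFFD.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    out.reserve(in.size()); // heap allocation; never more units than input bytes
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0xFFFD;  // stays U+FFFD for an invalid lead byte
        std::size_t extra = 0; // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false; // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;
        if (cp <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

Every construction from a narrow literal pays for this loop and for the allocation, on top of the UTF-8 copy that already sits in the constant data area.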

Solution

Internally, QString is UTF-16. If the C++ compiler provides the UTF-16 string directly at compile time, we can simply store it inside the QString as is. This way:

We save the cost of having two different copies (that is, the corresponding conversion and malloc).

For Chinese characters, UTF-16 itself takes less space than UTF-8.

At present, however, there is no reliable way to use a UTF-16 execution character set (the execution character set) in C++.

Although L"..." (wchar_t*) is UTF-16 under Windows, it is not cross-platform.

C++11 can guarantee this with u"..." (char16_t*), but mainstream compilers do not yet support it fully.

These two points lead to the complexity of QStringLiteral.

For source code, please see qtbase/src/corelib/tools/qstring.h.

(The code uses macros, templates and lambda expressions, so it is still quite complicated. Only snippets are shown here.)

If the compiler supports char16_t, use it directly:

    #define QT_UNICODE_LITERAL_II(str) u"" str
    typedef char16_t qunicodechar;
    ...

Otherwise, under Windows, or in other environments where wchar_t is 2 bytes wide, use wchar_t:

    #if defined(Q_CC_MSVC)
    # define QT_UNICODE_LITERAL_II(str) L##str
    #else
    # define QT_UNICODE_LITERAL_II(str) L"" str
    #endif
    typedef wchar_t qunicodechar;
    ...

Otherwise, the compiler offers no support, and Qt, being only a library, can do nothing about it:

    #define QStringLiteral(str) QString::fromUtf8(str, sizeof(str) - 1)
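The two tricks in these snippets, prefixing through literal concatenation and taking the length from sizeof, can be tried in isolation. The macros below are hypothetical stand-ins written for this sketch; they are not Qt's macros, and utf16_length is likewise an invented helper.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the idea behind QT_UNICODE_LITERAL_II and the
// sizeof-based length in the fallback; these are not Qt's macros.
#define SKETCH_UTF16_LITERAL(str) u"" str
#define SKETCH_LITERAL_LENGTH(str) (sizeof(str) - 1)

// Number of char16_t units in a UTF-16 array, excluding the terminator.
template <std::size_t N>
constexpr std::size_t utf16_length(const char16_t (&)[N])
{
    return N - 1;
}

// Adjacent literals are concatenated, so the u"" prefix turns the whole
// literal into UTF-16 at compile time: no run-time conversion is needed.
constexpr std::size_t hello_units = utf16_length(SKETCH_UTF16_LITERAL("Hello"));

// The fallback path only knows the byte count of the narrow literal.
constexpr std::size_t hello_bytes = SKETCH_LITERAL_LENGTH("Hello");
```

For ASCII text both counts are 5. For Chinese text the UTF-16 unit count and the UTF-8 byte count differ, which is why the fallback still needs a conversion at run time.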
