Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Thiago Macieira <thiago_at_[hidden]>
Date: Wed, 28 Jul 2021 08:38:52 -0700
On Tuesday, 27 July 2021 23:30:41 PDT Tom Honermann via SG16 wrote:
> * On POSIX systems, what would it mean to run a program built to
> target a UTF-8 environment in an environment with LC_ALL set, e.g.,
> zh_HK.big5hkscs? Should that be UB? Should the .big5hkscs property
> be ignored? Should we specify that the implementation implicitly
> transcode?

Hello Tom

And thanks to Corentin for the effort so far.

I think the answer to the question above will depend on what "program built to
target UTF-8" effectively means to the binary. If it's just a compiler setting
with no effect on the output object files and executable binary, then running
in a non-UTF-8 environment is simply UB.

But on the other hand, if this works like the Windows manifest and adds a flag
to the executable indicating it's targeting UTF-8, then the *C* library runtime
can be updated to refuse to run in that environment or to "fix" it. We added
the fixing code to Qt 6: for the locale you gave as an example, Qt 6's
QCoreApplication will print a (US-ASCII) warning and then switch the locale to
"zh_HK.UTF-8". That way, all the C and C++ libraries' locale functions work "as
expected". For the internal needs of the application, this suffices.
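
To illustrate the renaming step only: a minimal sketch of how a locale name
such as "zh_HK.big5hkscs" could be mapped to its UTF-8 counterpart. The helper
name with_utf8_codeset is hypothetical, and this is not Qt's actual code; it
assumes the usual POSIX "language[_territory][.codeset][@modifier]" layout.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not taken from Qt): replace the codeset part of a
// POSIX-style locale name with "UTF-8", keeping any "@modifier" suffix.
std::string with_utf8_codeset(std::string_view locale)
{
    // Split off an optional "@modifier" so it can be re-appended later.
    std::string_view modifier;
    if (auto at = locale.find('@'); at != std::string_view::npos) {
        modifier = locale.substr(at);
        locale = locale.substr(0, at);
    }

    // Drop the existing ".codeset" part, if there is one.
    if (auto dot = locale.find('.'); dot != std::string_view::npos)
        locale = locale.substr(0, dot);

    std::string result(locale);
    result += ".UTF-8";
    result += modifier;
    return result;
}
```

The result would then be passed to setlocale(LC_ALL, ...). That call returns a
null pointer if the UTF-8 variant is not installed, so a caller still needs a
fallback for that case.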

But externally that's a different story.

The program will likely output mojibake to the terminal. There's an old escape
sequence to switch it to UTF-8, but that thing is a state of the terminal, not
the application, so the application would at minimum need to know to switch it
off before exiting. But that won't handle the cases of an unclean application
exit or of a non-UTF-8 child process writing to the terminal. So I don't
think this is a solvable problem, at all.

File names are what they are. Qt has treated file names that fail to be
decoded by the locale codec as filesystem corruption for a long time. For us,
it's impossible to open such a file and our directory-listing classes skip
over them, as if they weren't there. Every time such a bug is filed, I
recommend people run a "file system corruption recovery tool" to fix the
encoding. I don't think the C++ standard can mandate this and I don't think
the C++ standard library implementations would want to implement it that way
either. At least, std::filesystem::path is able to represent those undecodable
file names.
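
As an illustration, here is a sketch of how a program could tell decodable
from undecodable names while still holding on to both. It assumes a POSIX
system (where std::filesystem::path::native() is a std::string of raw bytes)
and a UTF-8 locale; is_valid_utf8 and undecodable_names are hypothetical
helpers, not part of Qt or the standard library.

```cpp
#include <cstddef>
#include <filesystem>
#include <string_view>
#include <vector>

// Returns true if `s` is well-formed UTF-8 (no overlong forms, no surrogates,
// nothing above U+10FFFF, no truncated sequences).
bool is_valid_utf8(std::string_view s)
{
    static constexpr char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;              // stray continuation or invalid lead

        if (s.size() - i < len)
            return false;               // sequence cut off at the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;           // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;               // overlong, out of range or surrogate
        i += len;
    }
    return true;
}

// Lists the entries of `dir` whose names are not valid UTF-8. The names are
// kept as std::filesystem::path, which stores the bytes unchanged.
std::vector<std::filesystem::path> undecodable_names(const std::filesystem::path &dir)
{
    std::vector<std::filesystem::path> result;
    for (const auto &entry : std::filesystem::directory_iterator(dir)) {
        const std::filesystem::path name = entry.path().filename();
        if (!is_valid_utf8(name.native()))
            result.push_back(name);
    }
    return result;
}
```

Unlike the Qt behaviour described above, this reports the offending entries
instead of skipping them; which policy is appropriate is the open question.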

> * On POSIX systems, localedef can be used to define a locale with its
> own character set and character classifications. Can
> implementations reasonably reason about the encoding of such locales?

You can ask the C library what the encoding is. That's what I introduced in
QCoreApplication:

    const char *locale = setlocale(LC_ALL, "");
    const char *codec = nl_langinfo(CODESET);
    if (Q_UNLIKELY(strcmp(codec, "UTF-8") != 0 && strcmp(codec, "utf8") != 0)) {
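
For readers without Qt at hand, a self-contained sketch of the same query,
assuming a POSIX system with <langinfo.h>. The function names are invented for
this example, and what happens once a non-UTF-8 codeset is detected is left to
the caller.

```cpp
#include <clocale>
#include <cstring>
#include <langinfo.h>

// Same comparison as in the snippet above: accept both spellings that C
// libraries are known to report for UTF-8.
bool codeset_is_utf8(const char *codeset)
{
    return std::strcmp(codeset, "UTF-8") == 0
        || std::strcmp(codeset, "utf8") == 0;
}

// Adopts the locale from the environment (LC_ALL, LC_CTYPE, LANG, ...) and
// asks the C library which codeset that locale uses.
bool environment_locale_is_utf8()
{
    std::setlocale(LC_ALL, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

Because the answer comes from the C library rather than from parsing the
locale name, this also works for locales created with localedef.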

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering

Received on 2021-07-28 10:38:57