On Wed, Oct 23, 2024 at 9:33 PM Victor Zverovich via SG16 <sg16@lists.isocpp.org> wrote:

> Could you please complete the picture here?
> You said your example uses UTF-8 literal encoding.

Yes, but a similar problem also exists for the legacy encoding case, only the output is corrupted differently.

I presented the results for UTF-8 because this is what we care most about in the long term.

> What's the C locale encoding for the execution character set
> under the Windows Belarusian localization?
> What's the expected input encoding for "mbstowcs" in that localization?
> What's the generated output encoding for "mbstowcs" in that localization?

The C locale is initialized to “C” by default, which is not specific to the Belarusian localization in any way. I haven’t checked which encoding mbstowcs uses in this case; I can do that as a follow-up if there is interest. For now I am just reporting the user-observable behavior, which is obviously unsatisfactory.

> (We already know that a literal encoding that is incompatible
> with the locale's encoding is hard to program for.)

Exactly, and this is why the literal encoding is a good choice: it is detectable statically, and the locale encoding, if set, should normally be compatible with it. Large parts of the design of std::format and std::print, approved by SG16 for C++20 and C++23, are based on this.
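
To illustrate what "detectable statically" can look like in practice, here is a minimal sketch of a compile-time check. The function names literal_encoding_is_utf8 and literal_encoding_name are made up for this example, and the check assumes the literal encoding can represent U+00B5 at all; it is one possible technique, not the mechanism any standard wording prescribes.

```cpp
// Returns true if ordinary string literals are encoded as UTF-8.
// U+00B5 MICRO SIGN is the two bytes C2 B5 in UTF-8, while legacy code
// pages that contain it typically use a single byte.
constexpr bool literal_encoding_is_utf8() {
    constexpr unsigned char micro[] = "\u00B5";
    return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}

// Usage sketch (hypothetical helper): the result is a constant expression,
// so it can select a code path or feed a static_assert.
constexpr const char* literal_encoding_name() {
    return literal_encoding_is_utf8() ? "UTF-8" : "other";
}
```

Because the result is known at compile time, a library can decide up front whether transcoding is needed, without consulting the locale at run time.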

> Note that pthread_setname_np is not in POSIX. What you quoted is the way
> this function operates on Solaris.

Sure, this particular API is not part of the POSIX standard; it is an extension provided by particular pthreads implementations. Even on Linux, nothing says that the string is in the C locale encoding, and looking at the implementation, it is basically passed to a syscall “as is”. The main point is that, according to the current wording, the standard library has to do potentially lossy transcoding on some platforms, including one or more POSIX platforms, for no good reason.
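
The "as is" behavior on Linux can be observed with a small round trip. This is a Linux-only sketch that assumes the glibc extensions pthread_setname_np and pthread_getname_np; the helper names roundtrip_thread_name and check_roundtrip are made up for this example, and on other platforms the helper simply returns an empty string.

```cpp
#include <cassert>
#include <string>
#if defined(__linux__)
#include <pthread.h>  // pthread_setname_np / pthread_getname_np (GNU extensions)
#endif

// Sets the calling thread's name and reads it back.
// Returns the bytes reported by the system, or an empty string on failure
// or on platforms other than Linux.
inline std::string roundtrip_thread_name(const std::string& name) {
#if defined(__linux__)
    // Linux limits the name to 16 bytes including the terminating null.
    if (name.size() > 15) return {};
    if (pthread_setname_np(pthread_self(), name.c_str()) != 0) return {};
    char buf[16] = {};
    if (pthread_getname_np(pthread_self(), buf, sizeof buf) != 0) return {};
    return buf;
#else
    (void)name;
    return {};
#endif
}

// Usage sketch: UTF-8 bytes are stored and returned unchanged,
// regardless of the current C locale.
inline void check_roundtrip() {
#if defined(__linux__)
    const std::string name = "\xD0\xB6\xD1\x8B\xD1\x86\xD1\x8C";  // 8 UTF-8 bytes
    assert(roundtrip_thread_name(name) == name);
#endif
}
```

Note that this renames the calling thread, so it is only meant as an experiment, not as something to leave in production code.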

> What is "ACP version"?

Active Code Page (ACP): most Windows APIs have Unicode (<FunctionName>W) and non-Unicode (<FunctionName>A) versions, with the latter using the ACP, which is unrelated to the C locale encoding.

- Victor

On Wed, Oct 23, 2024 at 11:43 AM Jens Maurer <jens.maurer@gmx.net> wrote:

On 23/10/2024 20.05, Victor Zverovich via SG16 wrote:
> This gives mojibake on Windows with Belarusian localization.

Could you please complete the picture here?
You said your example uses UTF-8 literal encoding.
What's the C locale encoding for the execution character set
under the Windows Belarusian localization?
What's the expected input encoding for "mbstowcs" in that localization?
What's the generated output encoding for "mbstowcs" in that localization?

(We already know that a literal encoding that is incompatible
with the locale's encoding is hard to program for.)

> Neither pthread_setname_np nor pthread_getname_np assume the C locale encoding. For example, quoting https://docs.oracle.com/cd/E88353_01/html/E37843/pthread-setname-np-3c.html:
>
> The thread name is a string of length 31 bytes or less, UTF-8 encoded.

Note that pthread_setname_np is not in POSIX. What you quoted is the way
this function operates on Solaris.

In contrast, my Linux man page says:

The thread name is a meaningful C language string,
whose length is restricted to 16 characters, including the
terminating null byte ('\0').

(No, I don't know what "meaningful" means.)

> This means that the POSIX implementation in P2019R7 is actually incorrect and doesn’t match the wording.

The paper says "on most POSIX implementation". Apparently, Solaris is different here.
Are there any non-UTF environments on Solaris these days?
Would Solaris transcode from UTF-8 to the encoding of that other environment?
I doubt it. Thread names are just a few bytes in a special memory area where
the usual tools can find them; I can't imagine any Unix doing any sort of
encoding recognition/translation when setting a name.

Do you have a more complete survey of Unix-like operating systems?

> SetThreadDescription always uses UTF-16 on Windows and there is no C locale or ACP version.

What is "ACP version"?

We already know we have to transcode for Windows (because wchar_t),
but all the standard tools such as mbstowcs use the encoding specified by
the C locale (input and output), so they're unsuitable to reliably output
UTF-16.

Maybe we want our set_thread_name to simply take char8_t (in UTF-8 encoding)
and be done with it. And no, I'm not worried about copying / transcoding
16-32 bytes even on Unix.
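
For illustration, a locale-independent UTF-8 to UTF-16 conversion of the kind a UTF-8-only set_thread_name would need on Windows could look roughly like this. It is a sketch under assumptions: it targets C++17 and therefore takes plain char bytes rather than char8_t, the names utf8_to_utf16 and append_utf16 are made up for this example, and replacing each offending byte with U+FFFD is just one possible error policy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Appends one Unicode scalar value to `out` as UTF-16 code units.
inline void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // encode as a surrogate pair
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Transcodes UTF-8 bytes to UTF-16 without consulting any locale.
// Every byte that does not begin a well-formed sequence becomes U+FFFD.
inline std::u16string utf8_to_utf16(std::string_view in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;  // continuation bytes expected after `lead`
        char32_t cp = 0;        // code point being assembled
        char32_t lowest = 0;    // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; lowest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; lowest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; lowest = 0x10000; }
        else { append_utf16(out, 0xFFFD); ++i; continue; }

        bool ok = i + extra < in.size();  // enough bytes left?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            ok = (c & 0xC0) == 0x80;
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        ok = ok && cp >= lowest && cp <= 0x10FFFF &&
             !(cp >= 0xD800 && cp <= 0xDFFF);
        if (ok) { append_utf16(out, cp); i += extra + 1; }
        else    { append_utf16(out, 0xFFFD); ++i; }
    }
    return out;
}
```

The result depends only on the input bytes, which is the property mbstowcs cannot offer; for names of 16 to 32 bytes the cost of the extra pass is negligible.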

Jens

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16