Date: Wed, 23 Oct 2024 12:33:24 -0700
> Could you please complete the picture here?
> You said your example uses UTF-8 literal encoding.
Yes, but a similar problem also exists for the legacy encoding case, only
the output is corrupted differently:
[image: image.png]
I presented the results for UTF-8 because this is what we care most about
in the long term.
> What's the C locale encoding for the execution character set
> under the Windows Belarusian localization?
> What's the expected input encoding for "mbstowcs" in that localization?
> What's the generated output encoding for "mbstowcs" in that localization?
The C locale is initialized to “C” by default which is not specific to
Belarusian localization in any way. I haven’t checked which encoding
mbstowcs is using in this case - can do it as a follow-up if there is
interest, just reporting user-observable behavior which is obviously
unsatisfactory.
> (We already know that a literal encoding that is incompatible
> with the locale's encoding is hard to program for.)
Exactly and this is why the literal encoding is a good choice - it is
detectable statically and the locale encoding, if set, should normally be
compatible with it. Large parts of the design of std::format and std::print,
approved by SG16 in C++20 and C++23, are based on this.
> Note that pthread_setname_np is not in POSIX. What you quoted is the way
> this function operates on Solaris.
Sure, this particular API is not part of the POSIX standard but an
implementation of pthreads. Even on Linux nothing says that the string is
in the C locale encoding and looking at the implementation it is basically
passed via a syscall “as is”. The main point is that according to the
current wording the standard library has to do potentially lossy
transcoding on some platforms, including one or more POSIX platforms, for
no good reason.
> What is "ACP version"?
Active Code Page (ACP): most Windows APIs have Unicode (<FunctionName>W)
and non-Unicode (<FunctionName>A) versions with the latter using ACP which
is unrelated to the C locale encoding.
- Victor
On Wed, Oct 23, 2024 at 11:43 AM Jens Maurer <jens.maurer_at_[hidden]> wrote:
>
>
> On 23/10/2024 20.05, Victor Zverovich via SG16 wrote:
> > This gives mojibake on Windows with Belarusian localization.
>
> Could you please complete the picture here?
> You said your example uses UTF-8 literal encoding.
> What's the C locale encoding for the execution character set
> under the Windows Belarusian localization?
> What's the expected input encoding for "mbstowcs" in that localization?
> What's the generated output encoding for "mbstowcs" in that localization?
>
> (We already know that a literal encoding that is incompatible
> with the locale's encoding is hard to program for.)
>
> > Neither pthread_setname_np nor pthread_getname_np assume the C locale
> encoding. For example, quoting
> https://docs.oracle.com/cd/E88353_01/html/E37843/pthread-setname-np-3c.html
> <
> https://docs.oracle.com/cd/E88353_01/html/E37843/pthread-setname-np-3c.html
> >:
> >
> > The thread name is a string of length 31 bytes or less, UTF-8
> encoded.
>
> Note that pthread_setname_np is not in POSIX. What you quoted is the way
> this function operates on Solaris.
>
> In contrast, my Linux man page says:
>
> The thread name is a meaningful C language string,
> whose length is restricted to 16 characters, including the
> terminating null byte ('\0').
>
> (No, I don't know what "meaningful" means.)
>
> > This means that the POSIX implementation in P2019R7 is actually
> incorrect and doesn’t match the wording.
>
> The paper says "on most POSIX implementation". Apparently, Solaris is
> different here.
> Are there any non-UTF environments on Solaris these days?
> Would Solaris transcode from UTF-8 to the encoding of that other
> environment?
> I doubt it. Thread names are just few bytes in a special memory area where
> the usual tools can find them; I can't imagine any Unix doing any sort of
> encoding recognition/translation when setting a name.
>
> Do you have a more complete survey of Unix-like operating systems?
>
> > SetThreadDescription always uses UTF-16 on Windows and there is no C
> locale or ACP version.
>
> What is "ACP version"?
>
> We already know we have to transcode for Windows (because wchar_t),
> but all the standard tools such as mbstowcs use the encoding specified by
> the C locale (input and output), so they're unsuitable to reliably output
> UTF-16.
>
> Maybe we want our set_thread_name to simply take char8_t (in UTF-8
> encoding)
> and be done with it. And no, I'm not worried about copying / transcoding
> 16-32 bytes even on Unix.
>
> Jens
>
>
> You said your example uses UTF-8 literal encoding.
Yes, but a similar problem also exists for the legacy encoding case, only
the output is corrupted differently:
[image: image.png]
I presented the results for UTF-8 because this is what we care most about
in the long term.
> What's the C locale encoding for the execution character set
> under the Windows Belarusian localization?
> What's the expected input encoding for "mbstowcs" in that localization?
> What's the generated output encoding for "mbstowcs" in that localization?
The C locale is initialized to “C” by default which is not specific to
Belarusian localization in any way. I haven’t checked which encoding
mbstowcs is using in this case - can do it as a follow-up if there is
interest, just reporting user-observable behavior which is obviously
unsatisfactory.
> (We already know that a literal encoding that is incompatible
> with the locale's encoding is hard to program for.)
Exactly and this is why the literal encoding is a good choice - it is
detectable statically and the locale encoding, if set, should normally be
compatible with it. Large parts of the design of std::format and std::print,
approved by SG16 in C++20 and C++23, are based on this.
> Note that pthread_setname_np is not in POSIX. What you quoted is the way
> this function operates on Solaris.
Sure, this particular API is not part of the POSIX standard but an
implementation of pthreads. Even on Linux nothing says that the string is
in the C locale encoding and looking at the implementation it is basically
passed via a syscall “as is”. The main point is that according to the
current wording the standard library has to do potentially lossy
transcoding on some platforms, including one or more POSIX platforms, for
no good reason.
> What is "ACP version"?
Active Code Page (ACP): most Windows APIs have Unicode (<FunctionName>W)
and non-Unicode (<FunctionName>A) versions with the latter using ACP which
is unrelated to the C locale encoding.
- Victor
On Wed, Oct 23, 2024 at 11:43 AM Jens Maurer <jens.maurer_at_[hidden]> wrote:
>
>
> On 23/10/2024 20.05, Victor Zverovich via SG16 wrote:
> > This gives mojibake on Windows with Belarusian localization.
>
> Could you please complete the picture here?
> You said your example uses UTF-8 literal encoding.
> What's the C locale encoding for the execution character set
> under the Windows Belarusian localization?
> What's the expected input encoding for "mbstowcs" in that localization?
> What's the generated output encoding for "mbstowcs" in that localization?
>
> (We already know that a literal encoding that is incompatible
> with the locale's encoding is hard to program for.)
>
> > Neither pthread_setname_np nor pthread_getname_np assume the C locale
> encoding. For example, quoting
> https://docs.oracle.com/cd/E88353_01/html/E37843/pthread-setname-np-3c.html
> <
> https://docs.oracle.com/cd/E88353_01/html/E37843/pthread-setname-np-3c.html
> >:
> >
> > The thread name is a string of length 31 bytes or less, UTF-8
> encoded.
>
> Note that pthread_setname_np is not in POSIX. What you quoted is the way
> this function operates on Solaris.
>
> In contrast, my Linux man page says:
>
> The thread name is a meaningful C language string,
> whose length is restricted to 16 characters, including the
> terminating null byte ('\0').
>
> (No, I don't know what "meaningful" means.)
>
> > This means that the POSIX implementation in P2019R7 is actually
> incorrect and doesn’t match the wording.
>
> The paper says "on most POSIX implementation". Apparently, Solaris is
> different here.
> Are there any non-UTF environments on Solaris these days?
> Would Solaris transcode from UTF-8 to the encoding of that other
> environment?
> I doubt it. Thread names are just few bytes in a special memory area where
> the usual tools can find them; I can't imagine any Unix doing any sort of
> encoding recognition/translation when setting a name.
>
> Do you have a more complete survey of Unix-like operating systems?
>
> > SetThreadDescription always uses UTF-16 on Windows and there is no C
> locale or ACP version.
>
> What is "ACP version"?
>
> We already know we have to transcode for Windows (because wchar_t),
> but all the standard tools such as mbstowcs use the encoding specified by
> the C locale (input and output), so they're unsuitable to reliably output
> UTF-16.
>
> Maybe we want our set_thread_name to simply take char8_t (in UTF-8
> encoding)
> and be done with it. And no, I'm not worried about copying / transcoding
> 16-32 bytes even on Unix.
>
> Jens
>
>
Received on 2024-10-23 19:33:40