Date: Wed, 23 Oct 2024 20:43:13 +0200
On 23/10/2024 20.05, Victor Zverovich via SG16 wrote:
> This gives mojibake on Windows with Belarusian localization.
Could you please complete the picture here?
You said your example uses UTF-8 literal encoding.
What's the C locale encoding for the execution character set
under the Windows Belarusian localization?
What's the expected input encoding for "mbstowcs" in that localization?
What's the generated output encoding for "mbstowcs" in that localization?
(We already know that a literal encoding that is incompatible
with the locale's encoding is hard to program for.)
> Neither pthread_setname_np nor pthread_getname_np assume the C locale encoding. For example, quoting https://docs.oracle.com/cd/E88353_01/html/E37843/pthread-setname-np-3c.html:
>
> The thread name is a string of length 31 bytes or less, UTF-8 encoded.
Note that pthread_setname_np is not in POSIX. What you quoted is the way
this function operates on Solaris.
In contrast, my Linux man page says:
    The thread name is a meaningful C language string,
    whose length is restricted to 16 characters, including the
    terminating null byte ('\0').
(No, I don't know what "meaningful" means.)
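
To make the size limit concrete, here is a minimal sketch of what a caller
with a UTF-8 name might have to do before handing it to the Linux function.
It reads "16 characters" as 16 bytes including the terminator, so 15 bytes of
payload. The helper name truncate_utf8 is hypothetical, the input is assumed
to be valid UTF-8, and truncating at all (rather than reporting an error) is
only one possible design.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: returns the longest prefix of `name` that is at most
// `max_bytes` long and does not end in the middle of a UTF-8 sequence.
// Assumes `name` is valid UTF-8.
std::string truncate_utf8(std::string_view name, std::size_t max_bytes) {
    if (name.size() <= max_bytes) {
        return std::string(name);
    }
    std::size_t end = max_bytes;
    // If the first byte we would drop is a continuation byte (10xxxxxx),
    // the cut would split a sequence, so back up to that sequence's lead byte.
    while (end > 0 && (static_cast<unsigned char>(name[end]) & 0xC0) == 0x80) {
        --end;
    }
    return std::string(name.substr(0, end));
}
```

On Linux, pthread_setname_np fails with ERANGE when the name is too long, so
something like truncate_utf8(name, 15) would have to run before the call if
silent truncation is what we want.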
> This means that the POSIX implementation in P2019R7 is actually incorrect and doesn’t match the wording.
The paper says "on most POSIX implementation". Apparently, Solaris is different here.
Are there any non-UTF environments on Solaris these days?
Would Solaris transcode from UTF-8 to the encoding of that other environment?
I doubt it. Thread names are just a few bytes in a special memory area where
the usual tools can find them; I can't imagine any Unix doing any sort of
encoding recognition/translation when setting a name.
Do you have a more complete survey of Unix-like operating systems?
> SetThreadDescription always uses UTF-16 on Windows and there is no C locale or ACP version.
What is "ACP version"?
We already know we have to transcode for Windows (because wchar_t),
but all the standard tools such as mbstowcs use the encoding specified by
the C locale (input and output), so they're unsuitable to reliably output
UTF-16.
Maybe we want our set_thread_name to simply take char8_t (in UTF-8 encoding)
and be done with it. And no, I'm not worried about copying / transcoding
16-32 bytes even on Unix.
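
As a rough illustration of how small that transcoding step is, here is a
sketch of a UTF-8 to UTF-16 conversion that never consults the C locale.
The function name utf8_to_utf16 is hypothetical, and replacing ill-formed
input with U+FFFD one byte at a time is an assumption, not something the
paper specifies. The sketch takes raw bytes in a std::string_view so that it
also compiles before C++20; with char8_t the parameter would be
std::u8string_view instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: converts UTF-8 bytes to UTF-16 code units.
// Each byte that does not start a well-formed sequence becomes U+FFFD.
std::u16string utf8_to_utf16(std::string_view in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const std::uint32_t lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;    // bytes in this sequence, 0 = invalid lead byte
        std::uint32_t cp = 0;   // decoded code point
        std::uint32_t min = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const std::uint32_t b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;      // must be a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

        if (!ok) {
            out.push_back(static_cast<char16_t>(0xFFFD));
            ++i;
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

On Windows the result would be what gets passed on to SetThreadDescription;
on Unix-like systems the UTF-8 bytes could presumably be handed over as they
are.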
Jens
Received on 2024-10-23 18:43:19