Date: Mon, 18 Nov 2024 09:32:33 -0500
On 11/18/24 8:58 AM, Corentin Jabot via SG16 wrote:
> Can we please take a decision?
I intend to schedule P2019 (Thread attributes) <https://wg21.link/p2019>
for further discussion when we resume meeting after the new year.
Tom.
> As the author I fundamentally do not care about whether we pick the
> literal or execution literal encoding, and I do not expect it will
> impact, in any way shape or form
> implementations (that will just assume the literal encoding is a
> subset of the execution encoding)
>
> However, we should not hold that paper on just that, we have time to
> revisit that question later.
> Thanks
>
> On Wed, Oct 23, 2024 at 9:33 PM Victor Zverovich via SG16
> <sg16_at_[hidden]> wrote:
>
> > Could you please complete the picture here?
> > You said your example uses UTF-8 literal encoding.
>
> Yes, but a similar problem also exists for the legacy encoding
> case, only the output is corrupted differently:
>
> image.png
>
> I presented the results for UTF-8 because this is what we care
> most about in the long term.
>
> > What's the C locale encoding for the execution character set
> > under the Windows Belarusian localization?
> > What's the expected input encoding for "mbstowcs" in that
> localization?
> > What's the generated output encoding for "mbstowcs" in that
> localization?
>
> The C locale is initialized to “C” by default which is not
> specific to Belarusian localization in any way. I haven’t checked
> which encoding mbstowcs is using in this case - can do it as a
> follow-up if there is interest, just reporting user-observable
> behavior which is obviously unsatisfactory.
>
> > (We already know that a literal encoding that is incompatible
> > with the locale's encoding is hard to program for.)
>
> Exactly and this is why the literal encoding is a good choice - it
> is detectable statically and the locale encoding, if set, should
> normally be compatible with it. Large parts of the design of
> std::format and std::print, approved by SG16 in C++20 and C++23,
> are based on this.
>
> > Note that pthread_setname_np is not in POSIX. What you quoted
> is the way
> > this function operates on Solaris.
>
> Sure, this particular API is not part of the POSIX standard but an
> implementation of pthreads. Even on Linux nothing says that the
> string is in the C locale encoding and looking at the
> implementation it is basically passed via a syscall “as is”. The
> main point is that according to the current wording the standard
> library has to do potentially lossy transcoding on some platforms,
> including one or more POSIX platforms, for no good reason.
>
> > What is "ACP version"?
>
> Active Code Page (ACP): most Windows APIs have Unicode
> (<FunctionName>W) and non-Unicode (<FunctionName>A) versions with
> the latter using ACP which is unrelated to the C locale encoding.
>
> - Victor
>
> On Wed, Oct 23, 2024 at 11:43 AM Jens Maurer <jens.maurer_at_[hidden]>
> wrote:
>
>
>
> On 23/10/2024 20.05, Victor Zverovich via SG16 wrote:
> > This gives mojibake on Windows with Belarusian localization.
>
> Could you please complete the picture here?
> You said your example uses UTF-8 literal encoding.
> What's the C locale encoding for the execution character set
> under the Windows Belarusian localization?
> What's the expected input encoding for "mbstowcs" in that
> localization?
> What's the generated output encoding for "mbstowcs" in that
> localization?
>
> (We already know that a literal encoding that is incompatible
> with the locale's encoding is hard to program for.)
>
> > Neither pthread_setname_np nor pthread_getname_np assume the
> C locale encoding. For example, quoting
> https://docs.oracle.com/cd/E88353_01/html/E37843/pthread-setname-np-3c.html
> <https://docs.oracle.com/cd/E88353_01/html/E37843/pthread-setname-np-3c.html>:
> >
> > The thread name is a string of length 31 bytes or less,
> UTF-8 encoded.
>
> Note that pthread_setname_np is not in POSIX. What you quoted
> is the way
> this function operates on Solaris.
>
> In contrast, my Linux man page says:
>
> The thread name is a meaningful C language string,
> whose length is restricted to 16 characters, including the
> terminating null byte ('\0').
>
> (No, I don't know what "meaningful" means.)
>
> > This means that the POSIX implementation in P2019R7 is
> actually incorrect and doesn’t match the wording.
>
> The paper says "on most POSIX implementation". Apparently,
> Solaris is different here.
> Are there any non-UTF environments on Solaris these days?
> Would Solaris transcode from UTF-8 to the encoding of that
> other environment?
> I doubt it. Thread names are just few bytes in a special
> memory area where
> the usual tools can find them; I can't imagine any Unix doing
> any sort of
> encoding recognition/translation when setting a name.
>
> Do you have a more complete survey of Unix-like operating systems?
>
> > SetThreadDescription always uses UTF-16 on Windows and there
> is no C locale or ACP version.
>
> What is "ACP version"?
>
> We already know we have to transcode for Windows (because
> wchar_t),
> but all the standard tools such as mbstowcs use the encoding
> specified by
> the C locale (input and output), so they're unsuitable to
> reliably output
> UTF-16.
>
> Maybe we want our set_thread_name to simply take char8_t (in
> UTF-8 encoding)
> and be done with it. And no, I'm not worried about copying /
> transcoding
> 16-32 bytes even on Unix.
>
> Jens
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>
> Can we please take a decision?
I intend to schedule P2019 (Thread attributes) <https://wg21.link/p2019>
for further discussion when we resume meeting after the new year.
Tom.
> As the author I fundamentally do not care about whether we pick the
> literal or execution literal encoding, and I do not expect it will
> impact, in any way shape or form
> implementations (that will just assume the literal encoding is a
> subset of the execution encoding)
>
> However, we should not hold that paper on just that, we have time to
> revisit that question later.
> Thanks
>
> On Wed, Oct 23, 2024 at 9:33 PM Victor Zverovich via SG16
> <sg16_at_[hidden]> wrote:
>
> > Could you please complete the picture here?
> > You said your example uses UTF-8 literal encoding.
>
> Yes, but a similar problem also exists for the legacy encoding
> case, only the output is corrupted differently:
>
> image.png
>
> I presented the results for UTF-8 because this is what we care
> most about in the long term.
>
> > What's the C locale encoding for the execution character set
> > under the Windows Belarusian localization?
> > What's the expected input encoding for "mbstowcs" in that
> localization?
> > What's the generated output encoding for "mbstowcs" in that
> localization?
>
> The C locale is initialized to “C” by default which is not
> specific to Belarusian localization in any way. I haven’t checked
> which encoding mbstowcs is using in this case - can do it as a
> follow-up if there is interest, just reporting user-observable
> behavior which is obviously unsatisfactory.
>
> > (We already know that a literal encoding that is incompatible
> > with the locale's encoding is hard to program for.)
>
> Exactly and this is why the literal encoding is a good choice - it
> is detectable statically and the locale encoding, if set, should
> normally be compatible with it. Large parts of the design of
> std::format and std::print, approved by SG16 in C++20 and C++23,
> are based on this.
>
> > Note that pthread_setname_np is not in POSIX. What you quoted
> is the way
> > this function operates on Solaris.
>
> Sure, this particular API is not part of the POSIX standard but an
> implementation of pthreads. Even on Linux nothing says that the
> string is in the C locale encoding and looking at the
> implementation it is basically passed via a syscall “as is”. The
> main point is that according to the current wording the standard
> library has to do potentially lossy transcoding on some platforms,
> including one or more POSIX platforms, for no good reason.
>
> > What is "ACP version"?
>
> Active Code Page (ACP): most Windows APIs have Unicode
> (<FunctionName>W) and non-Unicode (<FunctionName>A) versions with
> the latter using ACP which is unrelated to the C locale encoding.
>
> - Victor
>
> On Wed, Oct 23, 2024 at 11:43 AM Jens Maurer <jens.maurer_at_[hidden]>
> wrote:
>
>
>
> On 23/10/2024 20.05, Victor Zverovich via SG16 wrote:
> > This gives mojibake on Windows with Belarusian localization.
>
> Could you please complete the picture here?
> You said your example uses UTF-8 literal encoding.
> What's the C locale encoding for the execution character set
> under the Windows Belarusian localization?
> What's the expected input encoding for "mbstowcs" in that
> localization?
> What's the generated output encoding for "mbstowcs" in that
> localization?
>
> (We already know that a literal encoding that is incompatible
> with the locale's encoding is hard to program for.)
>
> > Neither pthread_setname_np nor pthread_getname_np assume the
> C locale encoding. For example, quoting
> https://docs.oracle.com/cd/E88353_01/html/E37843/pthread-setname-np-3c.html
> <https://docs.oracle.com/cd/E88353_01/html/E37843/pthread-setname-np-3c.html>:
> >
> > The thread name is a string of length 31 bytes or less,
> UTF-8 encoded.
>
> Note that pthread_setname_np is not in POSIX. What you quoted
> is the way
> this function operates on Solaris.
>
> In contrast, my Linux man page says:
>
> The thread name is a meaningful C language string,
> whose length is restricted to 16 characters, including the
> terminating null byte ('\0').
>
> (No, I don't know what "meaningful" means.)
>
> > This means that the POSIX implementation in P2019R7 is
> actually incorrect and doesn’t match the wording.
>
> The paper says "on most POSIX implementation". Apparently,
> Solaris is different here.
> Are there any non-UTF environments on Solaris these days?
> Would Solaris transcode from UTF-8 to the encoding of that
> other environment?
> I doubt it. Thread names are just few bytes in a special
> memory area where
> the usual tools can find them; I can't imagine any Unix doing
> any sort of
> encoding recognition/translation when setting a name.
>
> Do you have a more complete survey of Unix-like operating systems?
>
> > SetThreadDescription always uses UTF-16 on Windows and there
> is no C locale or ACP version.
>
> What is "ACP version"?
>
> We already know we have to transcode for Windows (because
> wchar_t),
> but all the standard tools such as mbstowcs use the encoding
> specified by
> the C locale (input and output), so they're unsuitable to
> reliably output
> UTF-16.
>
> Maybe we want our set_thread_name to simply take char8_t (in
> UTF-8 encoding)
> and be done with it. And no, I'm not worried about copying /
> transcoding
> 16-32 bytes even on Unix.
>
> Jens
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>
Received on 2024-11-18 14:32:35