On Sat, Jan 30, 2021 at 7:53 PM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Sat, Jan 30, 2021 at 12:25 PM Corentin <corentin.jabot@gmail.com> wrote:

On Sat, Jan 30, 2021 at 5:54 PM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Sat, Jan 30, 2021 at 6:18 AM Corentin <corentin.jabot@gmail.com> wrote:

On Sat, Jan 30, 2021 at 5:39 AM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Wed, Jan 27, 2021 at 3:57 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:

Has consensus been found that UTF-16 is a valid wide execution character set (encoding)? Are there general library facilities to handle conversion of strings from wide execution character set (encoding)s with characters that are encoded in more than one wchar_t code unit to other encodings?

I don't think so but the status quo does not match existing practice.
I think we could easily fix the core language, but we may have to modify the library wording because I think some functions can't deal with wide multi-byte? Not sure; see https://github.com/sg16-unicode/sg16/issues/9
Paper needed :)
The general case would involve something like JeanHeyd's paper with those C library functions.
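
To make the "more than one wchar_t code unit" concern concrete, here is a minimal sketch in C++17. It uses char16_t rather than wchar_t so that the result does not depend on the platform's wchar_t width, and count_code_points is a hypothetical helper written for this illustration, not a standard library facility. It assumes well-formed UTF-16 input.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not a standard library function): counts the code
// points in a UTF-16 code unit sequence, assuming the input is well formed.
constexpr std::size_t count_code_points(std::u16string_view s) {
    std::size_t n = 0;
    for (char16_t cu : s) {
        // Low (trailing) surrogates 0xDC00..0xDFFF continue the character
        // started by the preceding high surrogate, so they are not counted.
        if (cu < 0xDC00 || cu > 0xDFFF) {
            ++n;
        }
    }
    return n;
}

// U+1F600 occupies two UTF-16 code units but is a single character.
static_assert(std::u16string_view(u"\U0001F600").size() == 2);
static_assert(count_code_points(u"\U0001F600") == 1);
static_assert(count_code_points(u"a\U0001F600b") == 3);
```

An interface that processes one code unit at a time and expects each one to be a complete character cannot represent U+1F600 in such an encoding, which is the kind of library limitation the linked issue is about.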
  • Each member of the basic character set is uniquely represented by a single byte whose value, as read via a glvalue of type `char`, is positive
I think "basic character set" above isn't just the basic source character set. I think "basic execution character set" as a term happens to be the right name for what we need (just that the current definition is not what we want; we don't want a coded character set, and there shouldn't be a "narrow" and wide version). Also: The second half should read "... single code unit whose value is positive" now that you've defined the code units appropriately for that to work.

I meant source here
What we want is for some characters (a subset of U+0000-U+007F) to always be 1/ representable and 2/ representable in one code unit. The indirection doesn't serve much purpose (unless I am missing something).
Except to give a convenient name for the addition of BELL, BACKSPACE, and CARRIAGE RETURN (CR).
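
As an illustration of the "single code unit whose value is positive" requirement, the following C++17 sketch checks it at compile time for a sample of characters. all_positive and basic_sample are hypothetical names introduced here, and the sample is only a subset chosen for illustration, not the full set under discussion.

```cpp
#include <string_view>

// Hypothetical helper: true if every element of `chars`, read as `char`,
// has a positive value.
constexpr bool all_positive(std::string_view chars) {
    for (char c : chars) {
        if (c <= 0) {
            return false;
        }
    }
    return true;
}

// A sample of basic source characters, chosen for illustration only.
constexpr std::string_view basic_sample =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";

// Each sampled character occupies one char and reads back as positive.
static_assert(all_positive(basic_sample));

// BELL, BACKSPACE and CARRIAGE RETURN, the additions mentioned above.
static_assert(all_positive("\a\b\r"));
```

The null character is deliberately left out of the sample: its value is zero, so it is handled separately from the "positive" requirement.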
Thanks for pointing this out explicitly. I think we have to leave the "locale-specific" around somewhere.

The additional things that the current wording is probably trying to say are:
In the execution environment, the library operates using locale-specific encodings for wide strings and byte strings.
The characters in the basic execution character set shall be represented in each locale-specific encoding.

I think we want to say (to match existing practice) that the execution environment has an encoding / character set that is either the same as or a superset of the execution character set (same values, but it may have extra members).
It is unclear that "locale-specific" currently says that.
I don't think the encoding interpretation of the above (which I think was the intended interpretation) actually matches existing practice (except perhaps for the "C" locale). That different locales present in runtime environments may encode characters within the basic execution character set differently is a practical reality (web search for "PPCS variant characters").

Unfortunately, when that's the case (and I agree that's the case more often than we'd like; another good example is shift-jis/win-1251), string literals cannot be interpreted properly by "locale-specific" runtime functions.
Such a runtime function expects an encoding that is not the same as that of the string literal, so it cannot interpret the literal correctly, which can lead to mojibake, etc.
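
A minimal sketch of that mismatch, in C++17. The pairing of ISO/IEC 8859-1 as the literal encoding with Windows-1251 as the runtime encoding is an assumption made for illustration, and decode_latin1 and decode_cp1251 are hypothetical, deliberately partial decoders rather than library functions.

```cpp
// ISO/IEC 8859-1: each byte value is the code point itself.
constexpr char32_t decode_latin1(unsigned char b) {
    return b;
}

// Windows-1251: bytes 0xC0..0xFF map to U+0410..U+044F, and bytes below
// 0x80 match ASCII. Bytes 0x80..0xBF are NOT handled by this sketch; they
// are passed through unchanged, which is not their real mapping.
constexpr char32_t decode_cp1251(unsigned char b) {
    return b >= 0xC0 ? static_cast<char32_t>(0x0410 + (b - 0xC0))
                     : static_cast<char32_t>(b);
}

// The byte a compiler emits for U+00E9 when the literal encoding is
// ISO/IEC 8859-1.
constexpr unsigned char literal_byte = 0xE9;

// Read with the encoding the literal was written in: U+00E9.
static_assert(decode_latin1(literal_byte) == U'\u00E9');

// Read by a runtime that expects Windows-1251: U+0439, a different character.
static_assert(decode_cp1251(literal_byte) == U'\u0439');

// Bytes below 0x80 agree in both encodings, so they survive the mismatch.
static_assert(decode_latin1('c') == decode_cp1251('c'));
```

Only the characters that both encodings represent identically survive, which is why the guarantee for the basic set matters.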

I think this issue should be described somewhere (in the library?) and be specifically UB.

What do you think?
My current impression is that there may be a narrow-enough scope here that a separate paper could come out of this thread without pulling in the world.

We may want to leave the locale-specific part out of this paper to contain the scope.
If we are removing the existing words that talk about "locale-specific", then we aren't really leaving the locale-specific part out of the paper. I am not sure the existing words for "locale-specific" are all that salvageable given the surrounding changes that we want.
We may have to resolve the wchar_t part first, though.
AFAICT, we can implement the wording improvement for the status quo of wchar_t without making it more difficult to handle the larger question of UTF-16, etc.