Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sat, 16 Oct 2021 00:05:44 +0200
On Sat, Oct 16, 2021, 00:00 Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 15/10/2021 23.25, Tom Honermann wrote:
> > On 10/15/21 4:29 PM, Jens Maurer via Lib-Ext wrote:
>
> > Yes, but unless I'm mistaken, IANA does not currently specify any wide
> > EBCDIC encodings.
>
> Right. I can find several EBCDIC variants for European languages,
> but nothing that looks Asian. Maybe IBM didn't bother registering
> the wide-EBCDIC variants with IANA.
>
> > We can (and probably should) add some guidance in the
> > prose of the paper based on IBM documentation, but I'm otherwise unaware
> > of how normative guidance could be provided.
>
> If we agree that a (hypothetical) Asian EBCDIC big-endian platform
> should return the same value from wide_literal() as a similar
> little-endian platform (and this implies native endianness),
> then I think we're essentially saying that the return value
> of wide_literal() is interpreted as an encoding form, never
> as an encoding scheme. (Which means the IANA list is not really
> applicable.)
>

EBCDIC wide encodings being neither portable nor registered, I have no
desire to force implementers in a particular direction: they can do what
they want.



> >> (As a side note, it seems odd that we're keen on using IANA (i.e. a
> >> list of encoding schemes) for wide_literal(), but then we make every
> >> effort to read this as an encoding form.)
> >
> > Indeed, a consequence of trying to balance available specifications,
> > utility, and programmer expectations.
>
> Maybe we should bite the bullet and simply list the small number
> of permitted return values for wide_literal() (i.e. UTF-16 and UTF-32
> and "other"; possibly UCS-2 and UCS-4), given that IANA is only really
> helpful for the non-wide case.
>
> (Since there are no wide EBCDIC encodings in IANA, IBM needs to return
> "other" on those platforms anyway.)
>
> >>> One of the intended guarantees is that, when sizeof(wchar_t) != 1,
> >>> the underlying byte representation of a wide string literal matches
> >>> an encoding scheme associated with the encoding form indicated by
> >>> wide_literal(). For example:
> >>>
> >>> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok;
> >>>   reinterpret_cast<const char*>(L"text") yields a sequence of bytes
> >>>   that constitutes valid UTF-16BE or UTF-16LE.
> >>> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16. Invalid;
> >>>   reinterpret_cast<const char*>(L"text") yields a sequence of bytes
> >>>   that is not valid UTF-16BE or UTF-16LE (due to each code point being
> >>>   stored across 4 bytes instead of 2).
> >>>
> >>> It may be that the paper would benefit from some updates to make this
> >>> more clear, but I don't have any specific suggestions at this time.
> >> I think the wording currently has no guidance for the
> >> sizeof(wchar_t) == 1, CHAR_BIT == 16 case, and whether it is supposed
> >> to be treated differently from the sizeof(wchar_t) == 2, CHAR_BIT == 16
> >> case.
> >
> > Is this strictly a wording concern? Or do you find the design intent to
> > be unclear? (I think you may have intended CHAR_BIT == 8 for the second
> > case, though it doesn't really matter).
>
> I did not intend CHAR_BIT == 8 for the second case. I want a wchar_t that
> has some excess bits that are unused (i.e. always 0) when used to store
> octets. (The answer is probably more obvious for CHAR_BIT == 12.)
> This is all intended to probe the "object representation" model.
>
> Jens
>
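
For illustration, the guarantee quoted above (that the bytes of a wide
string literal should match an encoding scheme of the form reported by
wide_literal()) can be sketched as a byte-level check. This is only a
sketch under stated assumptions, not wording or an API from P1885: the
helper names (serialize, decode_utf16, round_trips_as_utf16,
wide_literal_round_trips_as_utf16) are hypothetical, CHAR_BIT == 8 is
assumed, and "matches" is approximated as "decodes back to the same
text in either byte order".

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <initializer_list>
#include <string>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Lay out each code unit in `unit_size` bytes, in the requested byte order.
// This simulates the object representation of a wide string on a platform
// with sizeof(wchar_t) == unit_size, assuming CHAR_BIT == 8.
Bytes serialize(const std::u32string& units, std::size_t unit_size, bool big_endian) {
    assert(unit_size >= 1 && unit_size <= 4);
    Bytes out;
    for (char32_t u : units) {
        for (std::size_t i = 0; i < unit_size; ++i) {
            const std::size_t shift = 8 * (big_endian ? unit_size - 1 - i : i);
            out.push_back(static_cast<unsigned char>((u >> shift) & 0xFF));
        }
    }
    return out;
}

// Decode bytes as UTF-16BE or UTF-16LE. Returns false for an odd byte count
// or an unpaired surrogate; otherwise stores the code points in `out`.
bool decode_utf16(const Bytes& b, bool big_endian, std::u32string& out) {
    out.clear();
    if (b.size() % 2 != 0) return false;
    auto unit_at = [&](std::size_t i) -> char32_t {
        return big_endian ? (b[i] << 8 | b[i + 1]) : (b[i + 1] << 8 | b[i]);
    };
    for (std::size_t i = 0; i < b.size(); i += 2) {
        char32_t u = unit_at(i);
        if (u >= 0xDC00 && u <= 0xDFFF) return false;  // lone low surrogate
        if (u >= 0xD800 && u <= 0xDBFF) {              // high surrogate
            if (i + 3 >= b.size()) return false;       // nothing to pair with
            const char32_t lo = unit_at(i + 2);
            if (lo < 0xDC00 || lo > 0xDFFF) return false;
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;                                    // consumed the low half
        }
        out.push_back(u);
    }
    return true;
}

// True if `bytes` decode, in either byte order, to exactly `expected`.
bool round_trips_as_utf16(const Bytes& bytes, const std::u32string& expected) {
    std::u32string decoded;
    for (bool big_endian : {false, true}) {
        if (decode_utf16(bytes, big_endian, decoded) && decoded == expected) {
            return true;
        }
    }
    return false;
}

// The same check applied to a real wide literal on the current platform.
bool wide_literal_round_trips_as_utf16() {
    const wchar_t text[] = L"text";
    Bytes bytes(sizeof(text) - sizeof(wchar_t));  // drop the terminator
    std::memcpy(bytes.data(), text, bytes.size());
    return round_trips_as_utf16(bytes, U"text");
}
```

With 2-byte units the simulated bytes decode back to "text" in one of
the two byte orders; with 4-byte units they decode to a different
sequence (padding bytes show up as extra U+0000 code units), which is
the mismatch the second bullet describes. The sketch deliberately does
not address the sizeof(wchar_t) == 1, CHAR_BIT == 16 case raised later
in the thread.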

Received on 2021-10-15 17:05:57