sg16: Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Corentin <corentin.jabot_at_[hidden]>
Date: Tue, 19 Oct 2021 23:09:10 +0200

On Tue, Oct 19, 2021 at 10:38 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

>
> Essentially agreed with your points;
> I notice that talking about "encoding form"
> (not encoding scheme and object representation)
> answers a lot of the questions "naturally" and
> seems to do what we want
>

This contradicts previous polls and the conclusions that IANA describes
encoding schemes and that users only care about encoding schemes;
cf. the SG16 minutes of the previous meeting.
"Encoding form" only applies to the UTF encodings anyway.

I will not change this wording, given past polls.

And again, we have no experience with text handling on platforms that have
CHAR_BIT != 8,
so we can either have it return unknown, or leave it to implementers to
decide whether their string
encodings match those of a registered encoding (by saying nothing, which is
the status quo),
rather than trying to force a definition that will not match standard
practice (which would then force implementers to return unknown anyway).
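
To make the two wchar_t examples quoted below concrete (sizeof(wchar_t) == 2
versus 4), here is a minimal sketch of the check they imply: compare the
object representation of a wide string with the UTF-16BE and UTF-16LE octet
sequences. The helper names (utf16_octets, object_bytes,
matches_utf16_scheme) are hypothetical and not part of P1885, and the sketch
simply gives up when CHAR_BIT != 8, since that case is exactly what is
unsettled here.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not from P1885): serialize UTF-16 code units into
// octets in the requested byte order, i.e. UTF-16BE or UTF-16LE.
std::vector<unsigned char> utf16_octets(const std::u16string& s, bool big_endian) {
    std::vector<unsigned char> out;
    for (char16_t cu : s) {
        const unsigned char hi = static_cast<unsigned char>((cu >> 8) & 0xFF);
        const unsigned char lo = static_cast<unsigned char>(cu & 0xFF);
        out.push_back(big_endian ? hi : lo);
        out.push_back(big_endian ? lo : hi);
    }
    return out;
}

// Hypothetical helper: copy the object representation of the first n
// elements of a wide string (terminator excluded).
std::vector<unsigned char> object_bytes(const wchar_t* s, std::size_t n) {
    std::vector<unsigned char> out(n * sizeof(wchar_t));
    if (!out.empty()) {
        std::memcpy(out.data(), s, out.size());
    }
    return out;
}

// True if the bytes of the wide string are exactly the UTF-16BE or UTF-16LE
// encoding scheme octets for `expected`.
bool matches_utf16_scheme(const wchar_t* s, std::size_t n,
                          const std::u16string& expected) {
    if (CHAR_BIT != 8) {
        return false;  // assumption: bytes are not octets, so give up
    }
    const std::vector<unsigned char> bytes = object_bytes(s, n);
    return bytes == utf16_octets(expected, true) ||
           bytes == utf16_octets(expected, false);
}
```

With sizeof(wchar_t) == 4, matches_utf16_scheme(L"text", 4, u"text") is
false because each element occupies 4 bytes rather than 2. With
sizeof(wchar_t) == 2 the result depends on the implementation's wide
literal encoding.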

> Jens
>
>
> On 19/10/2021 22.10, Tom Honermann wrote:
> > On 10/19/21 3:07 PM, Jens Maurer via Lib-Ext wrote:
> >>>>>>> One of the intended guarantees is that, when sizeof(wchar_t) != 1,
> that the underlying byte representation of a wide string literal match an
> encoding scheme associated with the encoding form as indicated by
> wide_literal(). For example:
> >>>>>>>
> >>>>>>> * sizeof(wchar_t) == 2, wide_literal() returns UTF-16. Ok;
> reinterpret_cast<const char*>(L"text") yields a sequence of bytes that
> constitutes valid UTF-16BE or UTF-16LE.
> >>>>>>> * sizeof(wchar_t) == 4, wide_literal() returns UTF-16.
> Invalid; reinterpret_cast<const char*>(L"text") yields a sequence of bytes
> that is not valid UTF-16BE or UTF-16LE (due to each code point being stored
> across 4 bytes instead of 2).
> >>>>>>>
> >>>>>>> It may be that the paper would benefit from some updates to make
> this more clear, but I don't have any specific suggestions at this time.
> >>>>>> I think the wording currently has no guidance for the
> sizeof(wchar_t) == 1, CHAR_BIT == 16 case,
> >>>>>> and whether it is supposed to be treated differently from the
> sizeof(wchar_t) == 2, CHAR_BIT == 16
> >>>>>> case.
> >>>>> Is this strictly a wording concern? Or do you find the design intent
> to
> >>>>> be unclear? (I think you may have intended CHAR_BIT == 8 for the
> second
> >>>>> case, though it doesn't really matter).
> >>>> I did not intend CHAR_BIT == 8 for the second case. I want a wchar_t
> that
> >>>> has some excess bits that are unused (i.e. always 0) when used to
> store
> >>>> octets. (The answer is probably more obvious for CHAR_BIT == 12.)
> >>>> This is all intended to probe the "object representation" model.
> >>> Do you feel that the wording sufficiently covers CHAR_BIT being 16 for
> >>> ordinary strings (where excess bits would presumably also need to be
> 0)?
> >> IANA talks about encoding schemes, which refers to octets, not bytes.
> >> But C++ deals in bytes, not octets.
> >> With CHAR_BIT = 16 and UTF-16, I can imagine two possible layouts:
> >> One that puts a UTF-16 code unit into each char, and one that puts
> >> the octets of (e.g.) UTF-16BE into consecutive chars, so that two
> >> chars form a UTF-16 code unit and each char stores a value <= 255.
> >
> > I think we previously determined that the latter runs afoul of
> [lex.charset]p3 <http://eel.is/c++draft/lex.charset#3> because there
> would be 0 valued elements that do not correspond to the null character
> (this effectively corresponds to a multibyte encoding in which trailing
> code units may have 0 values).
> >
> > If we take the perspective that what is returned indicates an encoding
> form, then an implementation that does the second thing would have to
> return other or unknown.
> >
> >> Files are essentially sequences of bytes (chars), so when reading
> >> an external UTF-16 file, I actually expect the second layout to
> >> appear, even though that means every second octet in memory
> >> is 0 (although you can't really observe that octet in isolation).
> >
> > I don't have sufficient experience with CHAR_BIT = 16 implementations to
> know what they actually do, but my intuition is that bytes in files would
> be mapped to bytes in memory as you indicate. I don't know whether to
> expect UTF-16 files on such an implementation to map code units to bytes or
> octets to bytes.
> >
> >> From another angle, how would you expect UTF-8 to be represented
> >> in chars on a CHAR_BIT = 16 platform? Put two UTF-8 code units
> >> into a single char in some (which?) byte order (because we have the
> >> space for it)?
> > I think I can side-step the question by stating that an implementation
> that does the latter would have to return other or unknown because
> accessing the string would not yield values corresponding to the encoding
> form.
> >>> My intent would be for the excess bits to always be 0 as an (implied)
> >>> artifact of requiring the underlying representation to adhere to a
> valid
> >>> IANA registered encoding scheme for the returned encoding form (with
> our
> >>> twist on UTF-16 implying native endianness as opposed to use of a BOM).
> >> For the case of CHAR_BIT == 16 and UTF-16 code units split across two
> chars,
> >> there is no "native" endianness for splitting a UTF-16 code unit into
> >> consecutive chars.
> >
> > With the perspective that the IDs correspond to encoding forms, then
> there is no suitable IANA mapping for that case.
> >
> > Tom.
> >
> >> Jens
> >
> >
>
>
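
The two CHAR_BIT == 16 layouts Jens describes in the quoted text can be
simulated on an ordinary platform. This is only a sketch: char16bit is a
hypothetical stand-in for char on such a platform (here std::uint16_t), and
the function names are made up for illustration. It shows why the
octet-per-char layout produces 0-valued elements that are not the null
character, which is the [lex.charset]p3 concern Tom raises.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: stand-in for `char` on a hypothetical CHAR_BIT == 16 platform.
using char16bit = std::uint16_t;

// First layout: one UTF-16 code unit per char.
std::vector<char16bit> layout_code_unit_per_char(const std::u16string& s) {
    return std::vector<char16bit>(s.begin(), s.end());
}

// Second layout: the octets of UTF-16BE in consecutive chars, so two chars
// form one code unit and every char holds a value <= 255.
std::vector<char16bit> layout_octet_per_char(const std::u16string& s) {
    std::vector<char16bit> out;
    for (char16_t cu : s) {
        out.push_back(static_cast<char16bit>((cu >> 8) & 0xFF));
        out.push_back(static_cast<char16bit>(cu & 0xFF));
    }
    return out;
}

// True if any element is 0, i.e. would be read as a terminator even though
// no null character was intended.
bool has_embedded_zero(const std::vector<char16bit>& v) {
    return std::find(v.begin(), v.end(), char16bit{0}) != v.end();
}
```

For u"text", the first layout yields 4 nonzero elements, while the second
yields 8 elements in which every other one is 0.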

Received on 2021-10-19 16:09:23