sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 3 Oct 2021 11:03:26 +0200

On Sat, Oct 2, 2021 at 12:55 AM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> On 02/10/2021 00.14, Tom Honermann wrote:
> > On 10/1/21 4:17 PM, Jens Maurer wrote:
> >> On 01/10/2021 19.40, Tom Honermann via SG16 wrote:
> >>> * How is the IANA registry intended to be applied? Which IANA
> encoding would be considered a match for each of the following cases?
> >> My guess is we're specifically discussing the return value of the
> wide_literal()
> >> function in the proposal.
> > Yes.
> >> None of the three cases below is describing a conforming implementation
> of (core language) C++
> >> to start with, so these questions leave me confused as to their
> applicability to standardizing
> >> something like P1885.
> > For the moment, let's assume that we adopt a resolution for D2460R0 that
> allows the use of a variable length encoding for the wide literal encoding.
> >> Assuming the core language restrictions are lifted (and the
> specification
> >> interactions with C and the wide-character functions from C analyzed):
> >>
> >>> o Wide literal encoding is UTF-16, sizeof(wchar_t) is 2,
> CHAR_BIT is >= 8, little endian architecture.
> >> UTF16
> > Why not UTF16LE? (I know why, but I'd like to hear what is offered for
> rationale).
>
> - This is most consistent with (the absence of) differentiation for e.g.
> UCS-2 and UCS-4
> and other similar wide encodings.
> - There are already standard ways to determine the endianness of the
> platform,
> which is (arguably) orthogonal to the question of encoding form.
>
> >>> o Wide literal encoding is UTF-16, sizeof(wchar_t) is 1,
> CHAR_BIT is >= 16, architecture endianness is irrelevant since code units
> are a single byte.
> >> UTF16
> >
> > This is, of course, the right answer. But I've seen claims in some of
> the email threads that the IANA registered encodings correspond to encoding
> schemes in which case, each wchar_t element would correspond to a
> byte/octet of either the UTF16-BE or UTF-16LE encoding schemes. On the
> other hand, the paper states:
>
> >> "A registered character encoding is a character encoding form in the
> IANA Character Sets registry."
>
> The IANA registry is certainly confused in that it offers all of UTF16 and
> UTF16LE and UTF16BE
> as alternatives. This smells like a category error.
>
> In particular since other wide encodings shown (e.g. UCS-2 and UCS-4)
> don't show
> such differentiation, although the endianness diversity obviously applies
> to them,
> too. (Assuming the resulting byte sequence is the interesting property.)
>
> >>> o Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1,
> CHAR_BIT is >= 8, architecture endianness is irrelevant since code units
> are a single byte.
> >> That was a bit terse. Ok, you mean an implementation that uses wchar_t
> same size as char
> >> and puts wide literals in a sequence of byte-sized wchar_t items with
> UTF-16LE encoding.
> > Yes.
> >> Note that code units are NOT a single byte (it's UTF-16, so code units
> are 16 bits,
> >> but a byte can be 8 bits in this scenario).
> > Yes, my bad, a cut and paste bug.
>
> ... and what is the question you actually wanted to ask?
> I still don't get it.
>
> >> It feels this is a particularly non-conforming implementation, because
> wchar_t can't
> >> even hold a UTF-16 code unit (which needs 16-bit for storage). I think
> the given
> >> scenario is just out-of-scope for C++.
> >
> > My intent was that wchar_t values correspond to bytes/octets as encoded
> with UTF-16LE here. If the current wchar_t restriction is lifted as
> suggested above, I believe this would be conforming and I would expect
> wide_literal() to return UTF16LE.
>
> Even after lifting the restriction on wchar_t, I continue to believe that
> a single
> wchar_t object should be able to hold a single code unit (not: code point)
> of the
> encoding. The code units of UTF-16LE are still 16-bit quantities, so an
> 8-bit
> wchar_t would not be conforming.
>
> Two follow-on thoughts:
>
> - It would seem odd to have a platform that uses one endianness for UTF-16
> code units
> and another one for the rest of the integers. If we do not admit such a
> possibility,
> we don't ever need UTF16BE or UTF16LE (because the endianness is implied by
> the
> platform endianness).
>

+1

>
> - The preceding bullet applies to wide_literal() and friends, which exist
> on a
> given platform. When considering files (streams of octets), there is no
> implied
> platform endianness, and the differentiation UTF16LE vs. UTF16BE does make
> sense.
>
> > A similar concern can be illustrated with char:
> >
> > * Ordinary literal encoding is UTF-16, CHAR_BIT is >= 16, each char
> element is a code unit of the encoding form.
> > * Ordinary literal encoding is UTF-16LE, CHAR_BIT is >= 8, each char
> element is a byte of the encoding scheme.
> >
> > If we identify these as UTF16 and UTF16LE (as we should),
>
> Again, I disagree. UTF-16LE has 16-bit code units, which don't fit into an
> 8-bit char,
> so this is non-conforming. If you wish to define your own encoding that
> has 8-bit
> code units created by a UTF16LE sequence, feel free to do so and label it
> Tom16 or so.
>

Exactly
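
To make the "a code unit has to fit in one element" point concrete, here is a
minimal sketch. The helper name can_hold_code_unit is hypothetical (it is not
part of P1885 or the standard library), and it only counts storage bits of the
element type.

```cpp
#include <climits>
#include <cstddef>

// Hypothetical helper, not part of P1885 or the standard library:
// true if one object of CharT has enough storage bits for one code unit.
template <class CharT>
constexpr bool can_hold_code_unit(std::size_t code_unit_bits) {
    return sizeof(CharT) * CHAR_BIT >= code_unit_bits;
}

// char16_t is at least 16 bits wide, so it can hold a UTF-16 code unit.
static_assert(can_hold_code_unit<char16_t>(16));
// char only qualifies on platforms where CHAR_BIT >= 16.
static_assert(can_hold_code_unit<char>(16) == (CHAR_BIT >= 16));
```

On a typical platform with CHAR_BIT == 8, can_hold_code_unit<char>(16) is
false, which is the 8-bit char case described above.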

>
> > then we aren't being consistent with regard to use of the IANA
> registered encodings as encoding schemes or encoding forms. How do we
> specify which encodings denote encoding schemes and which ones denote
> encoding forms?
>
> And which ones should wide_literal() return?
>

If you construct a text_encoding object by hand,
like text_encoding("utf16"), it denotes an encoding form.
The same is true for the literal functions. The fact that we ALSO know the
endianness of the platform makes it an encoding scheme, but that invariant
is not maintained or implied
by the text_encoding object itself.
Now, utf16le/be are always encoding schemes, and a conforming
implementation can return those if it wants to. Is it useful for users?

>
> > Neither the IANA registry nor the referenced RFCs are clear here,
> particularly for UTF16. Jens' answers above are the ones that we want, but
> I don't think the paper specifies that, nor provides rationale.
> >
> > To be clear, I believe what we want is:
> >
> > * For UTF16, each char or wchar_t element corresponds to a code unit.
> > * For UTF16LE and UTF16BE, each char or wchar_t element corresponds to
> a byte/octet.
>
> An intermediate stage of discussion with Hubert was that the implementation
> is supposed to (always) return encoding names that fully specify the width
> and
> endianness, so UTF16 would never be returned, but just UTF16BE and UTF16LE.
> For UCS-4, we'd need to invent UCS4LE and UCS4BE and UCS4VAX.
>
> This would more directly map to the expected use-case calling iconv,
> which always takes a sequence of bytes.
> > The paper attempts to avoid these questions by stating this is all
> implementation-defined and that is probably fine; I'm asking these
> questions more to ensure the paper is clear in intent and wording and to
> ensure we're consistent with regard to programmers expectations.
>
> I understand we can require very little in this area normatively
> (except probably the handling of Unicode), but we should nonetheless
> agree on and give clear guidance what implementations should do.
> Otherwise, we'll just get different return values from different
> compilers on the same platform, which helps nobody.
>

Agreed (as long as we keep that manageable; there is no bottom to the
abyss and we are dangerously close to falling into it)

>
> Jens

Received on 2021-10-03 04:03:40