Re: [SG16] [isocpp-lib-ext] Sending P1885R8 Naming Text Encodings to Demystify Them directly to electronic polling for C++23

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Fri, 15 Oct 2021 15:42:03 +0200
On 15/10/2021 15.20, Corentin wrote:
> On Fri, Oct 15, 2021 at 1:14 PM Ville Voutilainen <ville.voutilainen_at_[hidden]> wrote:
> On Fri, 15 Oct 2021 at 13:55, Jens Maurer via Lib-Ext
> <lib-ext_at_[hidden]> wrote:
> >
> > On 15/10/2021 12.41, Bryce Adelstein Lelbach aka wash wrote:
> > > Jens, this does not sound like a library design matter.
> > >
> > > Can we please stop holding this paper up in LEWG unless there are library design questions?
> > > If there are questions about the specifics of wording or text/Unicode details, there are groups that can deal with that (LWG and SG16).
> > > Just because LEWG says we approve this paper does not mean it automatically goes into the standard, it just means we are happy with the library design.
> >
> > I am raising concerns I have about the current state of the paper.
> >
> > If the chair of LEWG deems those concerns not to be relevant at the
> > level of LEWG, I'm fine with that, and I'll raise them again in LWG
> > and/or plenary, as need be.
> The following bit has a design question in it:
> > > The paper is missing a normative definition of "encoding scheme"
> > > with particular attention to the fact that an octet is not a
> > > C++ byte. From such a definition, I would hope to gain clarity
> > > how UTF-16 should be handled on a platform with CHAR_BIT == 16.
> I was not expecting this to be up for polling this week, and I have limited time,
> but the intent of SG16 was made clear. I asked SG16 last time if they had further concerns with the design and they did not.
> The intent/model chosen is "can be reinterpreted to char*, fed to iconv, and iconv will do something sensible".
> Unfortunately, that does not help with the CHAR_BIT == 16 case, as we have no existing practice with text libraries on systems where CHAR_BIT != 8.
> The definition of encoding scheme used is independent of CHAR_BIT, as long as the bit pattern of the object representation is consistent with the specification of an encoding.
> Jens, did you raise that question at the last SG16 meeting?

The last SG16 meeting did not leave lots of room for discussion after
the two presentations. I focused on getting the "object representation"
model understood. (Maybe I'm stupid, but given the absence of any mention
of "object representation" in the previous incarnations of the paper, that
model was news to me.)

The CHAR_BIT == 16 question was brought up by Tom Honermann in
"[SG16] Agenda for the 2021-10-06 SG16 telecon" on 2021-10-01.
In particular, the question is whether a platform where sizeof(wchar_t) == 1
and CHAR_BIT == 16 should return UTF16 for wide_literal()
(if that's the encoding they use).
Similarly, if sizeof(wchar_t) == 2 on such a CHAR_BIT == 16 platform,
is UTF16 the expected return value if the UTF16 code units are dispersed
across two chars? (First 8 bits in the first 16-bit char, second 8 bits
in the second 16-bit char.)
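
To make the two cases concrete, the object representation of a single UTF-16 code unit can be inspected roughly as below. This is only a sketch: the helper names object_bytes, from_object_bytes and extra_bits_per_byte are hypothetical and do not come from P1885, and char16_t stands in for whatever type actually holds the code unit. With CHAR_BIT == 8 the code unit spans two C++ bytes that are also octets; with CHAR_BIT == 16 it may fit in one C++ byte, which is where the octet-based notion of an encoding scheme stops giving an obvious answer.

```cpp
#include <cassert>
#include <climits>   // CHAR_BIT
#include <cstring>   // std::memcpy
#include <vector>

// Hypothetical helper (not from P1885): copy out the object representation
// of one UTF-16 code unit as a sequence of C++ bytes (not octets).
std::vector<unsigned char> object_bytes(char16_t unit) {
    std::vector<unsigned char> bytes(sizeof unit);
    std::memcpy(bytes.data(), &unit, sizeof unit);
    return bytes;
}

// Hypothetical helper: rebuild the code unit from its object representation.
char16_t from_object_bytes(const std::vector<unsigned char>& bytes) {
    assert(bytes.size() == sizeof(char16_t));
    char16_t unit;
    std::memcpy(&unit, bytes.data(), sizeof unit);
    return unit;
}

// Hypothetical helper: how many bits a C++ byte has beyond an octet.
// 0 on common platforms; 8 on a platform with CHAR_BIT == 16.
constexpr int extra_bits_per_byte() { return CHAR_BIT - 8; }
```

Calling object_bytes(u'A') yields sizeof(char16_t) C++ bytes in the platform's byte order. Whether that sequence counts as "UTF-16" when extra_bits_per_byte() is nonzero is exactly the open question.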

I don't think I stayed to the end of the last SG16 meeting (late hour here);
sorry if those questions were discussed. Please point to the wording that
addresses these questions.


Received on 2021-10-15 08:42:09