Re: [SG16] Agenda for the 2021-12-01 SG16 telecon

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Fri, 3 Dec 2021 18:47:53 -0500
On Fri, Dec 3, 2021 at 5:48 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 03/12/2021 22.58, Hubert Tong wrote:
> > On Fri, Dec 3, 2021 at 4:55 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> >
> > On 03/12/2021 22.03, Tom Honermann wrote:
> > > On 12/1/21 2:28 PM, Corentin Jabot wrote:
> >
> > >> I think Jens is right. MSVC does handle Shift-JIS specifically,
> > >> but I'm not sure we can/should mandate something that works
> > >> universally; the burden on implementations could be high.
> > >
> > > Are you suggesting that we should revisit the consensus for the
> > > proposed resolution for LWG3576
> > > <https://cplusplus.github.io/LWG/issue3576> from our 2021-08-25 telecon
> > > <https://github.com/sg16-unicode/sg16-meetings#august-25th-2021>?
> >
> > Reading https://cplusplus.github.io/LWG/issue3576
> > right now (I wasn't present in August, it seems),
> > this says
> >
> > "any codepoint of the literal encoding other than { or }"
> >
> > This seems to be a category error: A literal encoding produces
> > code units (see [lex.string]), not code points.
> >
> > One could certainly endeavor to reconstruct code points from
> > code units, but it appears some encodings don't really have
> > a code point space to start with. For example, wide-EBCDIC
> > paired with some narrow EBCDIC shifts between the two, but
> > it seems there is no single "code point" space that would
> > contain values for characters from both sets.
> >
> >
> > I believe the numeric value of a wchar_t would serve as the "code point"
> > space in such a case.
>
> Is there actually a wchar_t encoding corresponding to each (char-based)
> shift-state encoding?
>

Various 2-byte wchar_t encodings have an issue with representing all of the
multibyte characters, yes. However, I thought your question was about an
abstract code point space. The EBCDIC multibyte character sets work fine
with 2-byte wchar_t types.


>
> A Unicode counterexample:
> If we use UTF-8 for char and UTF-16 for wchar_t, neither encoding directly
> provides code point values.
>

Yes. I think the wording should say either code unit or multibyte character.
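
To make the UTF-8/UTF-16 counterexample concrete, here is a minimal sketch that is not taken from the thread. It assumes U+1F600 as the sample character, spells out its code units by hand so that the result does not depend on any compiler's literal encoding, and uses hypothetical helper names (decode_utf8_4, decode_utf16_pair, check_example) that do no validation. The point is only that no single code unit in either encoding equals the code point; the code point has to be reconstructed from a sequence of code units.

```cpp
#include <cassert>

// Assumption for illustration: U+1F600 as the sample character.
// Its UTF-8 form is four code units; its UTF-16 form is a surrogate pair.
constexpr unsigned char kUtf8[] = {0xF0, 0x9F, 0x98, 0x80};
constexpr char16_t kUtf16[] = {0xD83D, 0xDE00};

// Hypothetical helper: decode exactly one 4-byte UTF-8 sequence.
// No validation is performed; this is a sketch, not a general decoder.
constexpr char32_t decode_utf8_4(const unsigned char* p) {
    return (char32_t(p[0] & 0x07) << 18) | (char32_t(p[1] & 0x3F) << 12) |
           (char32_t(p[2] & 0x3F) << 6) | char32_t(p[3] & 0x3F);
}

// Hypothetical helper: combine a UTF-16 high/low surrogate pair.
// No validation is performed.
constexpr char32_t decode_utf16_pair(char16_t hi, char16_t lo) {
    return 0x10000 + ((char32_t(hi) - 0xD800) << 10) + (char32_t(lo) - 0xDC00);
}

// Both code unit sequences denote the same code point...
static_assert(decode_utf8_4(kUtf8) == 0x1F600, "UTF-8 decodes to U+1F600");
static_assert(decode_utf16_pair(kUtf16[0], kUtf16[1]) == 0x1F600,
              "UTF-16 decodes to U+1F600");

// ...but no individual code unit has that value.
inline void check_example() {
    for (unsigned char cu : kUtf8) assert(char32_t(cu) != 0x1F600);
    for (char16_t cu : kUtf16) assert(char32_t(cu) != 0x1F600);
}
```

Under these assumptions, "code point of the literal encoding" has no direct referent in the code units themselves, which is why "code unit" or "multibyte character" reads as the more accurate wording.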


>
> Jens
>
>

Received on 2021-12-03 17:48:24