sg16: Re: [SG16] Agenda for the 2021-07-14 SG16 telecon

From: Peter Brett <pbrett_at_[hidden]>
Date: Mon, 12 Jul 2021 10:18:51 +0000

Hi Jens,

Please could you suggest how to phrase the following in normative wording?

"If we've decided to treat the file as UTF-8, then it has to validate as UTF-8, and the series of scalar values encoded therein is passed directly to phase 2 *exactly as decoded*, without any substitutions, additions or omissions."

It is extremely important that there is absolutely no opportunity for implementation "character mapping" shenanigans in the UTF-8 case. This is what this wording is trying to rule out. It is based on the assertion that there isn't a "mapping" between the translation character set and UTF-8, because the UTF-8 source file is a literal serialization of P2314 translation characters.
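
For illustration only, the behaviour I have in mind can be sketched roughly as below. This is a minimal sketch, not proposed wording and not taken from P2314 or any existing implementation: the name `decode_utf8_strict` and its interface are hypothetical. It rejects anything that is not well-formed UTF-8 and otherwise hands back exactly the scalar values encoded in the input, with no BOM stripping, normalization, or replacement characters.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (an assumption for illustration, not from P2314):
// strictly decode a UTF-8 byte sequence into Unicode scalar values.
// Returns std::nullopt if the input is not well-formed UTF-8.
// It deliberately performs no substitutions, additions or omissions:
// no BOM stripping, no normalization, no U+FFFD replacement.
std::optional<std::vector<char32_t>> decode_utf8_strict(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;   // scalar value being assembled
        std::size_t len = 0;  // total bytes in this encoded sequence
        char32_t min = 0;  // smallest value allowed for this length

        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;  // stray continuation byte or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return std::nullopt;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;  // surrogate, not a scalar value
        if (cp > 0x10FFFF) return std::nullopt;                 // beyond the Unicode code space

        out.push_back(cp);  // passed on exactly as decoded
        i += len;
    }
    return out;
}
```

The point of the sketch is that the only two outcomes are "ill-formed" and "exactly the encoded scalar values"; there is no step at which an implementation could apply a character mapping of its own.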

I have clearly failed to provide suitable wording to make that 100% clear. Introducing the phrase "is mapped" here does not help without normative wording that the mapping is onto.

Thanks,

Peter

> -----Original Message-----
> From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via SG16
> Sent: 12 July 2021 11:05
> To: sg16_at_[hidden]; Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> Cc: Jens Maurer <Jens.Maurer_at_[hidden]>; Tom Honermann <tom_at_[hidden]>
> Subject: Re: [SG16] Agenda for the 2021-07-14 SG16 telecon
>
> On 12/07/2021 10.13, Corentin Jabot via SG16 wrote:
> >
> >
> > On Sun, Jul 11, 2021 at 9:09 PM Hubert Tong
> <hubert.reinterpretcast_at_[hidden] <mailto:hubert.reinterpretcast_at_[hidden]>>
> wrote:
> >
> > On Sun, Jul 11, 2021 at 12:56 PM Corentin Jabot
> <corentinjabot_at_[hidden] <mailto:corentinjabot_at_[hidden]>> wrote:
> >
> > In the third paragraph of phase 1:
> > [ ... ], then the physical source file shall be a well-formed
> UTF-8 sequence.
> > Each UCS scalar value encoded in the UTF-8 sequence is mapped
> to the corresponding element of the translation character set.
> >
> >
> > Just to clarify: I am suggesting the above for the wording (it was not
> merely a quote providing context for the later comment). This version
> separates the diagnostic requirement from the description of the processing.
> >
> >
> > > I purposefully avoided the term mapping here: because the set of source
> characters and the set of translation set characters are the same, there is
> no need to specify a mapping.
>
> The current text appears to equate sequences of UTF-8 code units with
> elements of a character set. That's not correct; we first need
> to parse UTF-8 to form a code point (eh, scalar value),
> which is then something we can relate to elements of the translation
> character set.
>
> I think "is mapped" is fine (a 1:1 mapping is still a mapping), in
> particular
> since we also "map" (although in an implementation-defined manner) in the
> non-UTF-8 cases.
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2021-07-12 05:19:04