Subject: Re: Agenda for the 2021-07-14 SG16 telecon
From: Peter Brett (pbrett_at_[hidden])
Date: 2021-07-12 05:18:51
Please could you suggest how to phrase the following in normative wording?
"If we've decided to treat the file as UTF-8, then it has to validate as UTF-8, and the series of scalar values encoded therein is passed directly to phase 2 *exactly as decoded*, without any substitutions, additions or omissions."
It is extremely important that there is absolutely no opportunity for implementation "character mapping" shenanigans in the UTF-8 case. This is what this wording is trying to rule out. It is based on the assertion that there isn't a "mapping" between the translation character set and UTF-8, because the UTF-8 source file is a literal serialization of P2314 translation characters.
I have clearly failed to provide suitable wording to make that 100% clear. Introducing the phrase "is mapped" here does not help without normative wording that the mapping is onto.
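To illustrate the behaviour I am after (this is not proposed wording), here is a minimal sketch in Python. The function name phase1_utf8 is hypothetical, and Python's strict UTF-8 decoder stands in for whatever an implementation would actually use. The point is only that the input either validates or is rejected, and that the decoded scalar values come out one for one, in order.

```python
def phase1_utf8(source_bytes: bytes) -> list:
    """Decode a UTF-8 source file into scalar values, or fail.

    Hypothetical helper, for illustration only.
    """
    # errors="strict" raises UnicodeDecodeError on any ill-formed sequence
    # (overlong forms, encoded surrogates, truncated sequences), so nothing
    # is replaced with U+FFFD or silently dropped.
    text = source_bytes.decode("utf-8", errors="strict")
    # One integer per decoded scalar value, in order, with no normalization.
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    # Precomposed U+00E9, then "e" + combining acute U+0301: both spellings
    # are passed on as written rather than being unified.
    print([hex(v) for v in phase1_utf8("\u00e9e\u0301".encode("utf-8"))])

    try:
        phase1_utf8(b"\xc0\xaf")  # overlong encoding of "/", ill-formed
    except UnicodeDecodeError as err:
        print("ill-formed:", err.reason)
```

Nothing in the sketch consults a table or makes a choice once the bytes have validated, which is the sense in which I would say there is no "mapping" to specify.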
> -----Original Message-----
> From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via SG16
> Sent: 12 July 2021 11:05
> To: sg16_at_[hidden]; Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> Cc: Jens Maurer <Jens.Maurer_at_[hidden]>; Tom Honermann <tom_at_[hidden]>
> Subject: Re: [SG16] Agenda for the 2021-07-14 SG16 telecon
> On 12/07/2021 10.13, Corentin Jabot via SG16 wrote:
> > On Sun, Jul 11, 2021 at 9:09 PM Hubert Tong <hubert.reinterpretcast_at_[hidden]> wrote:
> > On Sun, Jul 11, 2021 at 12:56 PM Corentin Jabot <corentinjabot_at_[hidden]> wrote:
> > In the third paragraph of phase 1:
> > [ ... ], then the physical source file shall be a well-formed
> UTF-8 sequence.
> > Each UCS scalar value encoded in the UTF-8 sequence is mapped
> to the corresponding element of the translation character set.
> > Just to clarify: I am suggesting the above for the wording (it was not
> merely a quote providing context for the later comment). This version
> separates the diagnostic requirement from the description of the processing.
> > I purposefully avoided the term "mapping" here: because the set of source
> characters and the set of translation set characters are the same, there is
> no need to specify a mapping.
> The current text appears to equate sequences of UTF-8 code units with
> elements of a character set. That's not correct; we first need
> to parse UTF-8 to form a code point (eh, scalar value),
> which is then something we can relate to elements of the translation
> character set.
> I think "is mapped" is fine (a 1:1 mapping is still a mapping), in
> particular since we also "map" (although in an implementation-defined
> manner) in the non-UTF-8 cases.