sg16: Re: [SG16] Wording for P2295 based on P2314

From: Peter Brett <pbrett_at_[hidden]>
Date: Mon, 14 Jun 2021 10:11:22 +0000

Hi Jens,

I'm sorry about the lack of clarity in communication.

Corentin contacted me privately to tell me that he would be happy to put this wording into a new revision of the paper.

Good luck with the project editing work!

Best regards,

Peter

> -----Original Message-----
> From: Jens Maurer <Jens.Maurer_at_[hidden]mx.net>
> Sent: 14 June 2021 10:48
> To: Peter Brett <pbrett_at_cadence.com>; Hubert Tong
> <hubert.reinterpretcast_at_gmail.com>
> Cc: sg16_at_[hidden]; corentin.jabot_at_gmail.com
> Subject: Re: Wording for P2295 based on P2314
>
> On 14/06/2021 10.28, Peter Brett wrote:
> > Hi Jens and Hubert,
> >
> > Given that the mailing deadline is tomorrow, please could you review the
> > wording proposal below and provide your feedback?
>
> Sorry, I'm busy with "project co-editor" tasks.
>
> That said, your e-mail was directed to Corentin, but I didn't see
> a response from him that would indicate he'd even consider using
> these words. Regardless, putting the update into the mailing would
> certainly be progress by establishing a new base for the next round
> of review.
>
> Jens
>
>
>
> > Many thanks,
> >
> > Peter
> >
> >> -----Original Message-----
> >> From: Peter Brett <pbrett_at_[hidden]>
> >> Sent: 08 June 2021 17:50
> >> To: corentin.jabot_at_gmail.com
> >> Cc: sg16_at_[hidden]
> >> Subject: Wording for P2295 based on P2314
> >>
> >> Hi Corentin,
> >>
> >> In our most recent meeting on 2021-05-26, you were asked to reword
> >> your unpublished D2295R4 "Support for UTF-8 as a portable source file
> >> encoding" based on the most recent revision of P2314 "Character sets
> >> and encodings" (currently R2).
> >>
> >> [lex.phases] as modified by P2314:
> >>
> >>> 1. Physical source file characters are mapped, in an
> >>> implementation-defined manner, to the translation character set
> >>> (introducing new-line characters for end-of-line indicators). The
> >>> set of physical source file characters accepted is
> >>> implementation-defined.
> >>
> >> [lex.charset] as modified by P2314:
> >>
> >>> 1. The translation character set consists of the following elements:
> >>>
> >>> - each character named by ISO/IEC 10646, as identified by its
> >>> unique UCS scalar value, and
> >>> - a distinct character for each UCS scalar value where no named
> >>> character is assigned
> >>
> >> As I understand it, the design intent for P2295 is as follows:
> >>
> >> - UTF-8 source files shall be supported
> >>
> >> - Users shall be able to specify that source files are to be assumed to
> >> be UTF-8 encoded.
> >>
> >> - Files that were assumed to be UTF-8 encoded but contained some
> >> non-UTF-8 content shall be ill-formed.
> >>
> >> - The contents of UTF-8 source files shall be transmitted to phase 2 of
> >> translation verbatim. There's no implementation freedom to mess with
> >> it.
> >>
> >> My suggested approach for [lex.phases] is as follows. Let's take
> >> advantage of the fact that P2314 defines the translation character set
> >> as *exactly* the set of UCS scalar values to completely elide the
> >> mapping step from phase 1 of translation when processing UTF-8 source
> >> files.
> >>
> >> 1. The encoding scheme of a physical source file is determined in an
> >> implementation-defined manner. An implementation shall support
> >> the UTF-8 encoding scheme. An implementation shall define a
> >> mechanism for specifying that UTF-8 is the encoding scheme for a
> >> physical source file.
> >>
> >> If the encoding scheme of a physical source file is UTF-8, then
> >> it shall be a well-formed sequence of translation character set
> >> elements encoded as UTF-8 code units.
> >>
> >> If the encoding scheme of a physical source file is not UTF-8,
> >> then physical source file characters are mapped, in an
> >> implementation-defined manner, to the translation character set
> >> (introducing new-line characters for end-of-line indicators).
> >> The set of physical source file characters accepted is
> >> implementation-defined.
> >>
> >> 2. If the first character is U+FEFF BYTE ORDER MARK, it is
> >> deleted. ...
> >>
> >> What do you think?
> >>
> >> Best regards,
> >>
> >> Peter
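
For readers following the wording suggested in the quoted message, the UTF-8 path of phase 1 can be sketched roughly as below. This is only an illustration under stated assumptions, not text from P2295 or P2314 and not how any particular compiler works: the function names `decode_utf8_source` and `phase1_utf8` are hypothetical, an ill-formed file is modelled as an empty `std::optional`, and the handling of end-of-line indicators (elided in the quoted wording) is left out.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not from the thread): decode bytes that are assumed to
// be UTF-8 into UCS scalar values, i.e. translation character set elements.
// Returns std::nullopt when the input is not well-formed UTF-8, which stands
// in for "the program is ill-formed".
std::optional<std::vector<char32_t>> decode_utf8_source(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;      // scalar value being assembled
        std::size_t len = 0;  // total length of this encoded sequence
        char32_t min = 0;     // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;  // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                     // overlong form
        if (cp > 0x10FFFF) return std::nullopt;                // beyond UCS range
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate

        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical model of the suggested items 1 and 2 for a UTF-8 source file:
// no implementation-defined mapping is applied, and a leading U+FEFF is
// deleted.
std::optional<std::vector<char32_t>> phase1_utf8(std::string_view bytes) {
    auto chars = decode_utf8_source(bytes);
    if (chars && !chars->empty() && chars->front() == U'\uFEFF') {
        chars->erase(chars->begin());
    }
    return chars;
}
```

In this sketch, `phase1_utf8("\xEF\xBB\xBF" "a")` yields the single element U+0061, while an overlong sequence such as `"\xC0\xAF"` yields an empty optional. The decoded scalar values are passed on unchanged, which reflects the stated intent that UTF-8 content reaches phase 2 verbatim.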

Received on 2021-06-14 05:11:33