sg16: Re: [SG16] Wording for P2295 based on P2314

From: Peter Brett <pbrett_at_[hidden]>
Date: Mon, 14 Jun 2021 08:28:03 +0000

Hi Jens and Hubert,

Given that the mailing deadline is tomorrow, please could you review the wording proposal below and provide your feedback?

Many thanks,

Peter

> -----Original Message-----
> From: Peter Brett <pbrett_at_[hidden]>
> Sent: 08 June 2021 17:50
> To: corentin.jabot_at_[hidden]
> Cc: sg16_at_[hidden]
> Subject: Wording for P2295 based on P2314
>
> Hi Corentin,
>
> In our most recent meeting on 2021-05-26, you were asked to reword
> your unpublished D2295R4 "Support for UTF-8 as a portable source file
> encoding" based on the most recent revision of P2314 "Character sets and
> encodings" (currently R2).
>
> [lex.phases] as modified by P2314:
>
> > 1. Physical source file characters are mapped, in an
> > implementation-defined manner, to the translation character set
> > (introducing new-line characters for end-of-line indicators). The
> > set of physical source file characters accepted is
> > implementation-defined.
>
> [lex.charset] as modified by P2314:
>
> > 1. The translation character set consists of the following elements:
> >
> > - each character named by ISO/IEC 10646, as identified by its unique
> > UCS scalar value, and
> > - a distinct character for each UCS scalar value where no named
> > character is assigned
>
> As I understand it, the design intent for P2295 is as follows:
>
> - UTF-8 source files shall be supported
>
> - Users shall be able to specify that source files are to be assumed to
> be UTF-8 encoded.
>
> - Files that are assumed to be UTF-8 encoded but contain non-UTF-8
> content shall be ill-formed.
>
> - The contents of UTF-8 source files shall be transmitted to phase 2 of
> translation verbatim. There's no implementation freedom to mess with
> it.
>
> My suggested approach for [lex.phases] is as follows. Let's take
> advantage of the fact that P2314 defines the translation character set
> as *exactly* the set of UCS scalar values to completely elide the
> mapping step from phase 1 of translation when processing UTF-8 source
> files.
>
> 1. The encoding scheme of a physical source file is determined in an
> implementation-defined manner. An implementation shall support
> the UTF-8 encoding scheme. An implementation shall define a
> mechanism for specifying that UTF-8 is the encoding scheme for a
> physical source file.
>
> If the encoding scheme of a physical source file is UTF-8, then
> it shall be a well-formed sequence of translation character set
> elements encoded as UTF-8 code units.
>
> If the encoding scheme of a physical source file is not UTF-8,
> then physical source file characters are mapped, in an
> implementation-defined manner, to the translation character set
> (introducing new-line characters for end-of-line indicators).
> The set of physical source file characters accepted is
> implementation-defined.
>
> 2. If the first character is U+FEFF BYTE ORDER MARK, it is
> deleted. ...
>
> What do you think?
>
> Best regards,
>
> Peter
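
For illustration only, here is a minimal C++17 sketch of what phase 1 might look like for a source file whose encoding scheme is UTF-8 under the suggested wording: check that the bytes are well-formed UTF-8, turn them into UCS scalar values with no further mapping, and drop a leading U+FEFF. The function name decode_utf8_source and its interface are hypothetical; they are not part of P2295, P2314, or any implementation. End-of-line handling is deliberately left out.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper (not wording from P2295 or P2314).
// Decodes bytes assumed to be UTF-8 into UCS scalar values.
// Returns std::nullopt when the input is not well-formed UTF-8,
// which is the case the suggested wording would make ill-formed.
std::optional<std::vector<char32_t>> decode_utf8_source(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;   // total length of this encoded sequence
        char32_t cp = 0;       // scalar value being assembled
        char32_t min = 0;      // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return std::nullopt;     // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) return std::nullopt;                     // overlong encoding
        if (cp > 0x10FFFF) return std::nullopt;                // beyond the UCS range
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate, not a scalar value

        out.push_back(cp);
        i += len;
    }
    // Mirrors the suggested item 2: a leading U+FEFF is deleted.
    if (!out.empty() && out.front() == 0xFEFF) out.erase(out.begin());
    return out;
}
```

Because the translation character set is exactly the set of UCS scalar values, the decoded values can be handed to phase 2 as they are; the only ways for this sketch to fail are the well-formedness checks.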

Received on 2021-06-14 03:28:13