SG16

Subject: Re: Wording for P2295 based on P2314
From: Jens Maurer (Jens.Maurer_at_[hidden])
Date: 2021-06-14 04:47:47


On 14/06/2021 10.28, Peter Brett wrote:
> Hi Jens and Hubert,
>
> Given that the mailing deadline is tomorrow, please could you review the wording proposal below and provide your feedback?

Sorry, I'm busy with "project co-editor" tasks.

That said, your e-mail was directed to Corentin, but I didn't see
a response from him that would indicate he'd even consider using
these words. Regardless, putting the update into the mailing would
certainly be progress by establishing a new base for the next round
of review.

Jens

> Many thanks,
>
> Peter
>
>> -----Original Message-----
>> From: Peter Brett <pbrett_at_[hidden]>
>> Sent: 08 June 2021 17:50
>> To: corentin.jabot_at_[hidden]
>> Cc: sg16_at_[hidden]
>> Subject: Wording for P2295 based on P2314
>>
>> Hi Corentin,
>>
>> In our most recent meeting on 2021-05-26, you were asked to reword
>> your unpublished D2295R4 "Support for UTF-8 as a portable source file
>> encoding" based on the most recent revision of P2314 "Character sets and
>> encodings" (currently R2).
>>
>> [lex.phases] as modified by P2314:
>>
>>> 1. Physical source file characters are mapped, in an
>>> implementation-defined manner, to the translation character set
>>> (introducing new-line characters for end-of-line indicators). The
>>> set of physical source file characters accepted is
>>> implementation-defined.
>>
>> [lex.charset] as modified by P2314:
>>
>>> 1. The translation character set consists of the following elements:
>>>
>>> - each character named by ISO/IEC 10646, as identified by its unique
>>> UCS scalar value, and
>>> - a distinct character for each UCS scalar value where no named
>>> character is assigned
>>
>> As I understand it, the design intent for P2295 is as follows:
>>
>> - UTF-8 source files shall be supported
>>
>> - Users shall be able to specify that source files are to be assumed to
>> be UTF-8 encoded.
>>
>> - Files that were assumed to be UTF-8 encoded but contained some non-UTF-8
>> content shall be ill-formed.
>>
>> - The contents of UTF-8 source files shall be transmitted to phase 2 of
>> translation verbatim. There's no implementation freedom to mess with
>> it.
>>
>> My suggested approach for [lex.phases] is as follows. Let's take
>> advantage of the fact that P2314 defines the translation character set
>> as *exactly* the set of UCS scalar values to completely elide the
>> mapping step from phase 1 of translation when processing UTF-8 source
>> files.
>>
>> 1. The encoding scheme of a physical source file is determined in an
>> implementation-defined manner. An implementation shall support
>> the UTF-8 encoding scheme. An implementation shall define a
>> mechanism for specifying that UTF-8 is the encoding scheme for a
>> physical source file.
>>
>> If the encoding scheme of a physical source file is UTF-8, then
>> it shall be a well-formed sequence of translation character set
>> elements encoded as UTF-8 code units.
>>
>> If the encoding scheme of a physical source file is not UTF-8,
>> then physical source file characters are mapped, in an
>> implementation-defined manner, to the translation character set
>> (introducing new-line characters for end-of-line indicators).
>> The set of physical source file characters accepted is
>> implementation-defined.
>>
>> 2. If the first character is U+FEFF BYTE ORDER MARK, it is
>> deleted. ...
>>
>> What do you think?
>>
>> Best regards,
>>
>> Peter
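
For illustration only, the suggested phase 1 and phase 2 behaviour for a file whose encoding scheme is UTF-8 could be sketched roughly as below. This is a minimal sketch and not proposed wording or any implementation's actual code: the names `is_scalar_value` and `decode_utf8_source` are hypothetical, and the sketch assumes that "well-formed" means no overlong forms, no encoded surrogates, and no values above U+10FFFF. It does not deal with end-of-line indicators.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: true if cp is a UCS scalar value, i.e. a code point
// in [0, 0x10FFFF] that is not a surrogate (0xD800..0xDFFF).
bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical sketch of phase 1 for a UTF-8 source file: decode the bytes
// to translation character set elements without any remapping.
// Returns std::nullopt if the byte sequence is not well-formed UTF-8,
// which stands in for "the program is ill-formed".
std::optional<std::vector<char32_t>> decode_utf8_source(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;   // number of code units in this sequence
        char32_t cp = 0;       // value decoded so far
        char32_t min = 0;      // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return std::nullopt;  // stray continuation or invalid lead byte

        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates, and values above 0x10FFFF.
        if (cp < min || !is_scalar_value(cp)) return std::nullopt;

        out.push_back(cp);
        i += len;
    }

    // Step 2 of the suggested wording: a leading U+FEFF is deleted.
    if (!out.empty() && out.front() == 0xFEFF) out.erase(out.begin());
    return out;
}
```

The point of the sketch is that the UTF-8 branch has no implementation-defined mapping step: each well-formed sequence yields exactly one UCS scalar value, and anything else is rejected rather than repaired.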

