sg16: Re: [SG16] What do we want from source to internal conversion?

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 21 Jun 2020 00:23:30 -0400

On 6/20/20 6:54 PM, Hubert Tong wrote:
> On Mon, Jun 15, 2020 at 11:38 AM Corentin via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
>
> On Mon, 15 Jun 2020 at 17:17, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 6/15/20 7:14 AM, Corentin via SG16 wrote:
>> But (I think) we should
>>
>> - Tighten the specification to describe a semantic rather
>> than implementation defined mapping, while also making sure
>> that mappings prescribed by vendors, or Unicode, such as
>> UTF-EBCDIC are not accidentally forbidden.
>
> That sounds good to me. However, as we've recently learned,
> there are certain implementation shenanigans that need to be
> accounted for:
>
> * gcc stripping white space following a line continuation
> character (I think you intend to adopt this behavior in
> general though).
> * trigraphs
>
> Whitespace stripping will be an EWG proposal in that mailing.
> Hopefully trigraphs can either happen in an undocumented phase 0,
> or be understood as part of "semantic mapping"
>
> Sorry for the late post. It's been a busy week. I believe the
> discussion has progressed such that reviving this thread might not be
> the most productive, but the point about trigraphs appears to have
> last come up only on this thread.
>
> The behaviour of trigraphs is still subject to "magic" reversal in raw
> strings, so a "phase 0" approach leaves the reversal as "magic".

I think the model of extended characters discussed in the SG16 telecon
this week can handle this. If we think of a trigraph as an extended
character distinct from any other spelling of an extended character that
denotes the same abstract character, then this situation is analogous to
the Shift-JIS case of the same abstract character having multiple code
point assignments in the source input character set. If the set of
extended characters is allowed to be implementation-defined (e.g., not
simply the Unicode character set), then extended characters can carry
their original spelling through the translation phases until a
conversion to another character set is required (e.g., until an
identifier requires conformance to Unicode NFC in phase 3, or until
string literal encoding in phase 5).
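
As a rough illustration of that model (a sketch only, not wording and
not how any implementation does it), each extended character can be
modeled as an abstract character plus the spelling it had in the source.
The names ExtendedChar, to_extended, original_spelling, and
abstract_characters below are hypothetical, and the input is assumed to
be ASCII. The trigraph spellings are written with the \? escape so that
the example itself is not affected by trigraph replacement.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical model of an extended character: the abstract character
// it denotes, plus the spelling it had in the source file.
struct ExtendedChar {
    char32_t abstract;
    std::string spelling;
};

struct Trigraph {
    std::string_view spelling;
    char32_t abstract;
};

// The nine trigraph sequences, spelled with \? escapes.
constexpr std::array<Trigraph, 9> kTrigraphs{{
    {"\?\?=", U'#'}, {"\?\?/", U'\\'}, {"\?\?'", U'^'},
    {"\?\?(", U'['}, {"\?\?)", U']'},  {"\?\?!", U'|'},
    {"\?\?<", U'{'}, {"\?\?>", U'}'},  {"\?\?-", U'~'},
}};

// Phase 1 sketch: map source text to extended characters, keeping the
// original spelling of each one. Assumes ASCII-only input.
std::vector<ExtendedChar> to_extended(std::string_view src) {
    std::vector<ExtendedChar> out;
    std::size_t i = 0;
    while (i < src.size()) {
        bool matched = false;
        for (const Trigraph& t : kTrigraphs) {
            if (src.substr(i, 3) == t.spelling) {
                out.push_back({t.abstract, std::string(t.spelling)});
                i += 3;
                matched = true;
                break;
            }
        }
        if (!matched) {
            const auto c = static_cast<unsigned char>(src[i]);
            out.push_back({static_cast<char32_t>(c),
                           std::string(1, src[i])});
            ++i;
        }
    }
    return out;
}

// Raw string literal case: recover the source spelling with no
// special "reversal" step, because the spelling was never discarded.
std::string original_spelling(const std::vector<ExtendedChar>& chars) {
    std::string out;
    for (const ExtendedChar& c : chars) out += c.spelling;
    return out;
}

// Conversion case (e.g., identifiers or ordinary string literals):
// only here is the spelling dropped in favor of the abstract character.
std::u32string abstract_characters(const std::vector<ExtendedChar>& chars) {
    std::u32string out;
    for (const ExtendedChar& c : chars) out += c.abstract;
    return out;
}
```

With this shape, the raw string case reads the spelling member and every
other consumer reads the abstract member, so the trigraph "reversal" is
just a choice of which member to use rather than a separate undo step.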

Tom.

Received on 2020-06-20 23:26:45