Re: [SG16] What do we want from source to internal conversion?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sun, 21 Jun 2020 14:03:13 +0200
On Sun, 21 Jun 2020 at 06:23, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/20/20 6:54 PM, Hubert Tong wrote:
>
> On Mon, Jun 15, 2020 at 11:38 AM Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
>>
>> On Mon, 15 Jun 2020 at 17:17, Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 6/15/20 7:14 AM, Corentin via SG16 wrote:
>>>
>>> But (I think) we should
>>>
>>> - Tighten the specification to describe a semantic mapping rather than
>>> an implementation-defined one, while also making sure that mappings
>>> prescribed by vendors, or by Unicode, such as UTF-EBCDIC, are
>>> not accidentally forbidden.
>>>
>>> That sounds good to me. However, as we've recently learned, there are
>>> certain implementation shenanigans that need to be accounted for:
>>>
>>> - gcc stripping white space following a line continuation character
>>> (I think you intend to adopt this behavior in general though).
>>> - trigraphs
>>>
>> Whitespace stripping will be an EWG proposal in that mailing.
>> Hopefully trigraphs can either happen in an undocumented phase 0, or be
>> understood as part of the "semantic mapping".
>>
> Sorry for the late post. It's been a busy week. I believe the discussion
> has progressed such that reviving this thread might not be the most
> productive, but the point about trigraphs appears to have come up only on
> this thread.
>
> The behaviour of trigraphs is still subject to "magic" reversal in raw
> strings, so a "phase 0" approach leaves the reversal as "magic".
>
> I think the model of extended characters discussed in the SG16 telecon
> this week can handle this. If we think of a trigraph as an extended
> character distinct from any other spelling of an extended character that
> denotes the same abstract character, then this situation is analogous to
> the Shift-JIS case of the same abstract character having multiple code
> point assignments in the source input character set. If the set
> of extended characters is allowed to be implementation-defined (e.g., not
> simply the Unicode character set), then extended characters carry the
> original spelling through translation phases until a conversion to another
> character set is required (e.g., until an identifier requires conformance
> to Unicode NFC in phase 3, or string literal encoding in phase 5).
>

The grammar talks about specific characters.
So at some phase trigraphs have to be translated to their basic Latin
equivalents; otherwise they can't form grammar tokens.

But I think phase 1 is handwavy enough that trigraphs can be magically
replaced if they do not appear in raw literals :)
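
To make the two views concrete, here is a minimal sketch of the "carry the
original spelling" model discussed above. It is an illustration under stated
assumptions, not how any compiler implements phase 1: SourceChar, phase1,
grammar_view and original_spelling are hypothetical names, and only trigraphs
are handled (no line splicing, no universal-character-names).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical element of the phase 1 output: `value` is the basic
// character the grammar sees, `spelling` is what was written in the source.
struct SourceChar {
    char value;
    std::string spelling;
};

// Maps the third character of a "??x" sequence to its replacement,
// or returns '\0' when the sequence is not one of the nine trigraphs.
inline char trigraph_value(char third) {
    switch (third) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return '\0';
    }
}

// What a raw string literal would want back: the spelling as written.
inline std::string original_spelling(const std::vector<SourceChar>& chars) {
    std::string s;
    for (const SourceChar& c : chars) s += c.spelling;
    return s;
}

// What the grammar sees: trigraphs already replaced by basic characters.
inline std::string grammar_view(const std::vector<SourceChar>& chars) {
    std::string s;
    for (const SourceChar& c : chars) s += c.value;
    return s;
}

// Sketch of phase 1 (assumption: trigraph replacement only), scanning
// left to right and keeping the original spelling of every element.
inline std::vector<SourceChar> phase1(const std::string& src) {
    std::vector<SourceChar> out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (i + 2 < src.size() && src[i] == '?' && src[i + 1] == '?') {
            const char v = trigraph_value(src[i + 2]);
            if (v != '\0') {
                out.push_back({v, src.substr(i, 3)});
                i += 3;
                continue;
            }
        }
        out.push_back({src[i], std::string(1, src[i])});
        ++i;
    }
    // The mapping is lossless: the spellings concatenate back to the input.
    assert(original_spelling(out) == src);
    return out;
}
```

With this representation the grammar can match on `value`, while a raw
string literal reads `spelling`, so no separate "reversal" step is needed.
Whether that is preferable to a hand-wavy phase 1 is the open question in
this thread.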


> Tom.
>
>
>

Received on 2020-06-21 07:06:35