Re: [SG16] What do we want from source to internal conversion?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sun, 21 Jun 2020 14:03:13 +0200
On Sun, 21 Jun 2020 at 06:23, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/20/20 6:54 PM, Hubert Tong wrote:
> On Mon, Jun 15, 2020 at 11:38 AM Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>> On Mon, 15 Jun 2020 at 17:17, Tom Honermann <tom_at_[hidden]> wrote:
>>> On 6/15/20 7:14 AM, Corentin via SG16 wrote:
>>> But (I think) we should
>>> - Tighten the specification to describe a semantic rather than
>>> implementation defined mapping, while also making sure that mappings
>>> prescribed by vendors, or Unicode, such as UTF-EBCDIC are
>>> not accidentally forbidden.
>>> That sounds good to me. However, as we've recently learned, there are
>>> certain implementation shenanigans that need to be accounted for:
>>> - gcc stripping white space following a line continuation character
>>> (I think you intend to adopt this behavior in general though).
>>> - trigraphs
>>> Whitespace stripping will be a ewg proposal in that mailing.
>> Hopefully trigraphs can either be handled in an undocumented phase 0, or be
>> understood as part of the "semantic mapping".
> Sorry for the late post. It's been a busy week. I believe the discussion
> has progressed such that reviving this thread might not be the most
> productive, but the point about trigraphs appears to have last come up only
> on this thread.
> The behaviour of trigraphs is still subject to "magic" reversal in raw
> strings, so a "phase 0" approach leaves the reversal as "magic".
> I think the model of extended characters discussed in the SG16 telecon
> this week can handle this. If we think of a trigraph as an extended
> character distinct from any other spelling of an extended character that
> denotes the same abstract character, then this situation is analogous to
> the Shift-JIS case of the same abstract character having multiple code
> point assignments in the source input character set. If the set
> of extended characters is allowed to be implementation-defined (e.g., not simply the
> Unicode character set), then extended characters carry the original
> spelling through translation phases until a conversion to another character
> set is required (e.g., until an identifier requires conformance to
> Unicode NFC in phase 3, or string literal encoding in phase 5).

The grammar talks about specific characters.
So at some phase trigraphs have to be translated to their equivalent basic
Latin characters; otherwise they can't represent grammar tokens.

But I think phase 1 is handwavy enough that trigraphs can be magically
replaced if they do not appear in raw literals :)
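
To make "replaced unless they appear in raw literals" concrete, here is a
rough C++17 sketch. It is only an illustration, not how any compiler
implements this: the standard wording describes a reversal of the
replacement inside raw literals, whereas this sketch simply skips them.
The names replace_trigraphs and skip_raw_literal are made up for the
example. It assumes only unprefixed R"delim(...)delim" literals, and it
does not lex ordinary string literals, character literals or comments, so
an R" sequence inside one of those would be misdetected.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Map the third character of a "??x" sequence to its replacement,
// or return '\0' if the sequence is not a trigraph.
inline char trigraph_replacement(char c) {
    switch (c) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return '\0';
    }
}

inline bool is_identifier_char(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// Hypothetical helper: if an unprefixed raw string literal starts at
// src[pos], return the index one past its closing quote; otherwise pos.
inline std::size_t skip_raw_literal(std::string_view src, std::size_t pos) {
    if (src.substr(pos, 2) != "R\"") return pos;
    if (pos > 0 && is_identifier_char(src[pos - 1])) return pos;
    const std::size_t open = src.find('(', pos + 2);
    if (open == std::string_view::npos) return pos;
    const std::string_view delim = src.substr(pos + 2, open - (pos + 2));
    // A quote in the would-be delimiter means this was not a raw literal.
    if (delim.find('"') != std::string_view::npos) return pos;
    const std::string closer = ")" + std::string(delim) + "\"";
    const std::size_t close = src.find(closer, open + 1);
    if (close == std::string_view::npos) return pos;
    return close + closer.size();
}

// Replace trigraphs everywhere except inside raw string literals,
// which are copied through with their original spelling.
inline std::string replace_trigraphs(std::string_view src) {
    std::string out;
    out.reserve(src.size());
    std::size_t i = 0;
    while (i < src.size()) {
        const std::size_t end = skip_raw_literal(src, i);
        if (end != i) {
            out.append(src.substr(i, end - i));
            i = end;
            continue;
        }
        if (i + 2 < src.size() && src[i] == '?' && src[i + 1] == '?') {
            const char r = trigraph_replacement(src[i + 2]);
            if (r != '\0') {
                out.push_back(r);
                i += 3;
                continue;
            }
        }
        out.push_back(src[i]);
        ++i;
    }
    return out;
}
```

With this sketch, a trigraph outside a raw literal is replaced by its basic
Latin equivalent, while the same three characters inside R"(...)" keep
their original spelling.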

> Tom.

Received on 2020-06-21 07:06:35