On 6/20/20 6:54 PM, Hubert Tong wrote:
On Mon, Jun 15, 2020 at 11:38 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:

On Mon, 15 Jun 2020 at 17:17, Tom Honermann <tom@honermann.net> wrote:
On 6/15/20 7:14 AM, Corentin via SG16 wrote:
But (I think) we should

- Tighten the specification to describe a semantic mapping rather than an implementation-defined one, while also making sure that mappings prescribed by vendors or by Unicode, such as UTF-EBCDIC, are not accidentally forbidden.

That sounds good to me.  However, as we've recently learned, there are certain implementation shenanigans that need to be accounted for:

  • gcc stripping white space following a line continuation character (I think you intend to adopt this behavior in general though).
  • trigraphs

Whitespace stripping will be an EWG proposal in that mailing.
Hopefully trigraphs can either be handled in an undocumented phase 0, or be understood as part of the "semantic mapping".

Sorry for the late post. It's been a busy week. I believe the discussion has progressed such that reviving this thread might not be the most productive, but the point about trigraphs appears to have last come up only on this thread.

The behaviour of trigraphs is still subject to "magic" reversal in raw strings, so a "phase 0" approach leaves the reversal as "magic".

I think the model of extended characters discussed in the SG16 telecon this week can handle this.  If we think of a trigraph as an extended character distinct from any other spelling of an extended character that denotes the same abstract character, then this situation is analogous to the Shift-JIS case in which the same abstract character has multiple code point assignments in the source input character set.  If the set of extended characters is allowed to be implementation-defined (e.g., not simply the Unicode character set), then extended characters can carry their original spelling through the translation phases until a conversion to another character set is required (e.g., until an identifier requires conformance to Unicode NFC in phase 3, or until string literal encoding in phase 5).
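
To make that concrete, here is a rough sketch of the idea in C++. It is only a toy model under my own assumptions, not how any compiler implements phase 1 and not proposed wording: the names ExtChar, phase1, as_abstract and as_spelled are made up for illustration, and the abstract character is represented as a plain char for simplicity. The trigraph sequences in the usage note are written with \? escapes so that the example itself is unaffected by trigraph replacement.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model type: one "extended character" produced by phase 1.
struct ExtChar {
    char abstract;         // the abstract character it denotes
    std::string spelling;  // how it was spelled in the source file
};

// Maps the third character of a trigraph to the character the trigraph
// denotes; returns 0 if the character does not complete a trigraph.
char trigraph_replacement(char c) {
    switch (c) {
        case '=':  return '#';
        case '/':  return '\\';
        case '\'': return '^';
        case '(':  return '[';
        case ')':  return ']';
        case '!':  return '|';
        case '<':  return '{';
        case '>':  return '}';
        case '-':  return '~';
        default:   return 0;
    }
}

// Toy phase 1: every source character becomes an ExtChar. A trigraph becomes
// a single ExtChar that remembers its three-character spelling.
std::vector<ExtChar> phase1(const std::string& src) {
    std::vector<ExtChar> out;
    for (std::size_t i = 0; i < src.size();) {
        if (i + 2 < src.size() && src[i] == '?' && src[i + 1] == '?') {
            const char r = trigraph_replacement(src[i + 2]);
            if (r != 0) {
                out.push_back({r, src.substr(i, 3)});
                i += 3;
                continue;
            }
        }
        out.push_back({src[i], std::string(1, src[i])});
        ++i;
    }
    return out;
}

// View used wherever the abstract character matters (ordinary tokens,
// ordinary string literals).
std::string as_abstract(const std::vector<ExtChar>& cs) {
    std::string s;
    for (const ExtChar& c : cs) s += c.abstract;
    return s;
}

// View used for the contents of a raw string literal: the original spelling
// is read back directly, so no separate "reversal" step is needed.
std::string as_spelled(const std::vector<ExtChar>& cs) {
    std::string s;
    for (const ExtChar& c : cs) s += c.spelling;
    return s;
}
```

With this model, phase1("\?\?=define") yields a first element whose abstract character is '#' and whose spelling is the original three characters. as_abstract gives "#define", while as_spelled gives back the source text unchanged, which is the raw string behaviour without any "magic".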

Tom.