Hey, Zach.  The following are some items I had intended to mention in our last telecon when we were discussing P1879R0, but we ran out of time.  These are offered in the spirit of making the paper the best it can be.

  1. I suggest referring to the MSVC /source-charset:utf-8 option instead of the /utf-8 option since the issue presented primarily concerns the assumed source file encoding.  The distinction does matter, however, for the later comments regarding omission of the u8 prefix to retain the exact code units from the source file; that behavior depends on the source file encoding exactly matching the execution encoding (which /utf-8 ensures by setting both, but /source-charset:utf-8 alone does not).  If the source file encoding and the execution encoding don't match, then ordinary string literal contents will be transcoded similarly to UTF literals.
  2. I think it would be useful to expand on the MSVC behavior.  In particular, state that, by default, MSVC assumes the Active Code Page for both the encoding of source files and the execution character set, and that the particular values that you witnessed at run-time were the result of the source files being decoded as Windows-1252 and then transcoded to UTF-8.  Specifically, the 0xCF code unit was interpreted as U+00CF {LATIN CAPITAL LETTER I WITH DIAERESIS} and encoded as 0xC3 0x8F and the 0x82 code unit was interpreted as U+201A {SINGLE LOW-9 QUOTATION MARK} and encoded as 0xE2 0x80 0x9A.
  3. Per the meeting summary from the telecon, it was suggested that, rather than prohibiting use of UTF literals completely in non-UTF encoded source files, the paper could restrict the set of characters that may be directly transcoded from the source file.  This would allow, for example, encoding U+03C2 {GREEK SMALL LETTER FINAL SIGMA} using u8"\u03C2", but not u8"ς" (in non-UTF encoded source files).  Unfortunately, it is not obvious where to draw the line between the source encoded characters that are allowed and those that are not.  Some possibilities follow (these pretty much match what was discussed in the telecon):
    1. Restrict to source file characters from the basic source character set.  This solves the portability issue well, but is pretty restrictive.  '$' and '@' are not members of the basic source character set so this approach would require writing email addresses with an escape sequence for the '@' sign: e.g., u8"tom\u0040honermann.net".  Yuck (feel free to propose adding '@' to the basic source character set!)
    2. Restrict to characters that transcode to ASCII characters.  This solves the portability issue well for the MSVC compiler since all of its supported source encodings are ASCII derivatives (the compiler can diagnose any source file code units with a value above 0x7F).  It doesn't solve the issue well for EBCDIC code pages since they don't all share a common set of code points that map to ASCII characters (for example, in IBM-1047, 0x5F maps to U+005E {CIRCUMFLEX ACCENT} whereas in IBM-037, 0x5F maps to U+00AC {NOT SIGN}).  Unfortunately, I don't think EBCDIC code pages have a common subset equivalent to what ASCII is for Windows code pages; that makes designing a solution that addresses EBCDIC strictly more challenging (I think it is reasonable to not try to solve this issue for EBCDIC).
  4. The proposed and alternatively discussed changes all break backward compatibility.  I think the paper should call this out explicitly, ideally with some analysis of the anticipated impact.  In particular, it may be worth noting that UTF literals are used on z/OS to obtain ASCII/Unicode strings needed for interaction on the web.
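
To make the byte values in item 2 concrete, here is a rough sketch of the mis-transcoding, assuming the source file holds U+03C2 as the UTF-8 code units 0xCF 0x82 and the compiler decodes them as Windows-1252.  The helper names (encode_utf8, decode_windows1252, mis_transcode) are made up for illustration, and the Windows-1252 table is deliberately incomplete: only the 0x82 entry from the 0x80-0x9F range is filled in.  This is not how MSVC implements anything; it just reproduces the arithmetic.

```cpp
#include <string>

// Encode one Unicode scalar value as UTF-8 (no validation; sketch only).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical, partial Windows-1252 decoder.  0x00-0x7F and 0xA0-0xFF map
// to the same-valued code points; of the 0x80-0x9F range only 0x82 is
// handled here, everything else falls back to U+FFFD as a placeholder.
char32_t decode_windows1252(unsigned char cu) {
    if (cu < 0x80 || cu >= 0xA0) {
        return cu;
    }
    if (cu == 0x82) {
        return 0x201A;  // SINGLE LOW-9 QUOTATION MARK
    }
    return 0xFFFD;      // not modeled in this sketch
}

// Treat each source byte as a Windows-1252 code unit and re-encode the
// resulting code points as UTF-8, as happens to a UTF literal when the
// assumed source encoding is wrong.
std::string mis_transcode(const std::string& source_bytes) {
    std::string out;
    for (unsigned char cu : source_bytes) {
        out += encode_utf8(decode_windows1252(cu));
    }
    return out;
}
```

Feeding the two bytes 0xCF 0x82 through mis_transcode yields the five bytes 0xC3 0x8F 0xE2 0x80 0x9A described in item 2, whereas encode_utf8(0x03C2) yields the intended 0xCF 0x82.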

Was this paper submitted for the Belfast pre-meeting mailing?

Tom.