sg16: [SG16-Unicode] Additional feedback on P1879R0 - The u8 string literal prefix does not do what you think it does

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 14 Oct 2019 01:00:47 -0400

Hey, Zach. The following are some items I had intended to mention in
our last telecon when we were discussing P1879R0
<https://github.com/tzlaine/small_wg1_papers/blob/master/P1879_please_dont_rewrite_my_string_literals.md>,
but we ran out of time. These are offered in the spirit of making the
paper the best it can be.

1. I suggest referring to the MSVC /source-charset:utf-8 option instead
    of the /utf-8 option (which sets both the source and execution
    character sets to UTF-8) since the issue presented primarily
    concerns the assumed source file encoding. The distinction matters,
    however, for the later comments regarding omission of the u8 prefix
    in order to retain the exact code units from the source file; that
    behavior depends on the source file encoding exactly matching the
    execution encoding. If the two don't match, then ordinary string
    literal contents will be transcoded similarly to UTF literals.
2. I think it would be useful to expand on the MSVC behavior. In
    particular, state that, by default, MSVC assumes the Active Code
    Page for both the encoding of source files and the execution
    character set, and that the particular values that you witnessed at
    run-time were the result of the source files being decoded as
    Windows-1252 and then transcoded to UTF-8. Specifically, the 0xCF
    code unit was interpreted as U+00CF {LATIN CAPITAL LETTER I WITH
    DIAERESIS} and encoded as 0xC3 0x8F and the 0x82 code unit was
    interpreted as U+201A {SINGLE LOW-9 QUOTATION MARK} and encoded as
    0xE2 0x80 0x9A.
3. Per the meeting summary from the telecon
    <https://github.com/sg16-unicode/sg16-meetings#october-9th-2019>,
    there were suggestions to restrict the set of characters that may
    be directly transcoded from the source file, rather than
    prohibiting use of UTF literals completely in non-UTF encoded
    source files. This would allow, for example, encoding U+03C2
    {GREEK SMALL LETTER FINAL SIGMA} using u8"\u03C2", but not u8"ς" (in
    non-UTF encoded source files). Unfortunately, it is not obvious
    where to draw the line between which source encoded characters are
    and are not allowed. Some possibilities follow (these pretty much
    match what was discussed in the telecon):
     1. Restrict to source file characters from the basic source
        character set. This solves the portability issue well, but is
        pretty restrictive. '$' and '@' are not members of the basic
        source character set so this approach would require writing
        email addresses with an escape sequence for the '@' sign: e.g.,
        u8"tom\u0040honermann.net". Yuck (feel free to propose adding
        '@' to the basic source character set!)
     2. Restrict to characters that transcode to ASCII characters. This
        solves the portability issue well for the MSVC compiler since
        all of its supported source encodings are ASCII derivatives (the
        compiler can diagnose any source file code units with a value
        above 0x7F). It doesn't solve the issue well for EBCDIC code
        pages since they don't all share a common set of code points
        that map to ASCII characters (for example, in IBM-1047, 0x5F
        maps to U+005E {CIRCUMFLEX ACCENT} whereas in IBM-037, 0x5F
        maps to U+00AC {NOT SIGN}). Unfortunately, I don't think EBCDIC
        code pages have a common subset comparable to what ASCII
        provides for Windows code pages; that makes designing a
        solution that addresses EBCDIC strictly more challenging (I
        think it is reasonable to not try to solve this issue for
        EBCDIC).
4. The proposed and alternatively discussed changes all break backward
    compatibility. I think the paper should call this out explicitly,
    ideally with some analysis of the anticipated impact. In
    particular, it may be worth noting that UTF literals are used on
    z/OS to obtain ASCII/Unicode strings needed for interaction on the web.
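
To make the byte values in item 2 and the check described in item 3.2
concrete, here is a minimal sketch. It is not how MSVC is implemented;
the function names (decode_windows_1252, encode_utf8,
transcode_1252_to_utf8, first_non_ascii) are made up for illustration,
and mapping the five unassigned Windows-1252 code units to the C1
control with the same value is an assumption on my part.

```cpp
#include <cstddef>
#include <string>

// Decode one Windows-1252 code unit to a Unicode code point.
// Assumption: unassigned units (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to the
// C1 control with the same value.
char32_t decode_windows_1252(unsigned char unit) {
    static const char32_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
    if (unit >= 0x80 && unit <= 0x9F) return high[unit - 0x80];
    return unit;  // 0x00-0x7F and 0xA0-0xFF have the same value in Unicode
}

// Encode one Unicode code point as UTF-8 (no validation of surrogates).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Treat each input byte as Windows-1252 and re-encode the result as UTF-8.
std::string transcode_1252_to_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char unit : bytes) out += encode_utf8(decode_windows_1252(unit));
    return out;
}

// Option 3.2: index of the first code unit above 0x7F, or npos if none.
std::size_t first_non_ascii(const std::string& bytes) {
    for (std::size_t i = 0; i < bytes.size(); ++i)
        if (static_cast<unsigned char>(bytes[i]) > 0x7F) return i;
    return std::string::npos;
}

// The escaped form yields the UTF-8 code units CF 82 regardless of how the
// source file is encoded.
static_assert(sizeof(u8"\u03C2") == 3, "two code units plus the terminator");
static_assert(static_cast<unsigned char>(u8"\u03C2"[0]) == 0xCF, "lead byte");
static_assert(static_cast<unsigned char>(u8"\u03C2"[1]) == 0x82, "trail byte");
```

Feeding the UTF-8 encoding of U+03C2 (0xCF 0x82) to
transcode_1252_to_utf8 produces 0xC3 0x8F 0xE2 0x80 0x9A, matching the
values in item 2, and first_non_ascii reports index 0 for that input,
which is the code unit a compiler following option 3.2 would diagnose.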

Was this paper submitted for the Belfast pre-meeting mailing?

Tom.

Received on 2019-10-14 07:00:52