Date: Mon, 14 Oct 2019 01:00:47 -0400
Hey, Zach. The following are some items I had intended to mention in
our last telecon when we were discussing P1879R0
<https://github.com/tzlaine/small_wg1_papers/blob/master/P1879_please_dont_rewrite_my_string_literals.md>,
but we ran out of time. These are offered in the spirit of making the
paper the best it can be.
1. I suggest referring to the MSVC /source-charset:utf-8 option instead
of the /utf-8 option since the issue presented primarily concerns
the assumed source file encoding. However, the distinction is
relevant for the later comments regarding omission of the u8 prefix
in order to retain the exact code units from the source file; that
behavior depends on source file encoding exactly matching execution
encoding. If source file encoding and execution encoding don't
match, then ordinary string literal contents will be transcoded
similarly to UTF literals.
2. I think it would be useful to expand on the MSVC behavior. In
particular, state that, by default, MSVC assumes the Active Code
Page for both the encoding of source files and the execution
character set, and that the particular values that you witnessed at
run-time were the result of the source files being decoded as
Windows-1252 and then transcoded to UTF-8. Specifically, the 0xCF
code unit was interpreted as U+00CF {LATIN CAPITAL LETTER I WITH
DIAERESIS} and encoded as 0xC3 0x8F and the 0x82 code unit was
interpreted as U+201A {SINGLE LOW-9 QUOTATION MARK} and encoded as
0xE2 0x80 0x9A.
3. Per the meeting summary from the telecon
<https://github.com/sg16-unicode/sg16-meetings#october-9th-2019>,
there were suggestions to restrict the set of characters that may
be directly transcoded from the source file, rather than
prohibiting use of UTF literals completely in non-UTF encoded
source files. This would allow, for example, encoding U+03C2
{GREEK SMALL LETTER FINAL SIGMA} using u8"\u03C2", but not u8"ς" (in
non-UTF encoded source files). Unfortunately, it is not obvious
where to draw the line between which source encoded characters are
and are not allowed. Some possibilities follow (these pretty much
match what was discussed in the telecon):
1. Restrict to source file characters from the basic source
character set. This solves the portability issue well, but is
pretty restrictive. '$' and '@' are not members of the basic
source character set so this approach would require writing
email addresses with an escape sequence for the '@' sign: e.g.,
u8"tom\u0040honermann.net". Yuck (feel free to propose adding
'@' to the basic source character set!)
2. Restrict to characters that transcode to ASCII characters. This
solves the portability issue well for the MSVC compiler since
all of its supported source encodings are ASCII derivatives (the
compiler can diagnose any source file code units with a value
above 0x7F). It doesn't solve the issue well for EBCDIC code
pages since they don't all share a common set of code points
that map to ASCII characters (for example, in IBM-1047, 0x5F
maps to U+005E {CIRCUMFLEX ACCENT} whereas in IBM-037, 0x5F
maps to U+00AC {NOT SIGN}). Unfortunately, I don't think EBCDIC
code pages have a common subset equivalent to what ASCII is for
Windows code pages; that makes designing a solution that addresses
EBCDIC strictly more challenging (I think it is reasonable to
not try to solve this issue for EBCDIC).
4. The proposed and alternatively discussed changes all break backward
compatibility. I think the paper should call this out explicitly,
ideally with some analysis of the anticipated impact. In
particular, it may be worth noting that UTF literals are used on
z/OS to obtain ASCII/Unicode strings needed for interaction on the web.
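
To make the arithmetic in item 2 concrete, here is a minimal C++ sketch of
the decode-then-encode steps described there. It is only an illustration of
the transcoding, not a model of how MSVC is implemented. The function names
(decode_windows_1252, encode_utf8, transcode_1252_to_utf8, self_check) are
hypothetical, and the Windows-1252 decoder carries only the single
0x80-0x9F mapping needed for this example; a real decoder would need the
full table.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode one Windows-1252 code unit to a Unicode code point.
// Assumption: only the 0x82 entry of the 0x80-0x9F block is listed here;
// other code units in that block are reported as U+FFFD in this sketch.
// 0x00-0x7F and 0xA0-0xFF map to the code point with the same value.
char32_t decode_windows_1252(std::uint8_t unit) {
    if (unit == 0x82) return 0x201A;  // SINGLE LOW-9 QUOTATION MARK
    if (unit >= 0x80 && unit <= 0x9F) return 0xFFFD;  // not in this sketch
    return static_cast<char32_t>(unit);
}

// Encode one code point as UTF-8 code units.
// No validation of surrogates or out-of-range values is attempted.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Treat every byte of the input as a Windows-1252 code unit and re-encode
// the resulting code points as UTF-8.
std::string transcode_1252_to_utf8(const std::string& source_bytes) {
    std::string out;
    for (unsigned char unit : source_bytes) {
        out += encode_utf8(decode_windows_1252(unit));
    }
    return out;
}

// Checks the values quoted in item 2.
void self_check() {
    // 0xCF -> U+00CF -> 0xC3 0x8F, and 0x82 -> U+201A -> 0xE2 0x80 0x9A.
    assert(transcode_1252_to_utf8("\xCF\x82") == "\xC3\x8F\xE2\x80\x9A");
    // U+03C2 encodes in UTF-8 as 0xCF 0x82.
    assert(encode_utf8(0x03C2) == "\xCF\x82");
}
```

Note that U+03C2 {GREEK SMALL LETTER FINAL SIGMA} encodes in UTF-8 as
0xCF 0x82, which are the same two code units discussed in item 2. Reading
those two bytes as Windows-1252 and re-encoding them yields the five code
units 0xC3 0x8F 0xE2 0x80 0x9A, whereas u8"\u03C2" from item 3 does not
depend on the assumed source file encoding.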
Was this paper submitted for the Belfast pre-meeting mailing?
Tom.
Received on 2019-10-14 07:00:52