On 7/2/20 3:02 AM, Jens Maurer via Core wrote:
Got it.
On 02/07/2020 07.39, Tom Honermann wrote:
On 7/1/20 3:28 AM, Jens Maurer wrote:
Suggestion:
(quote)
A /multi-character literal/ is a /character-literal/ whose /c-char-sequence/ consists of more than one /c-char/. A /non-encodable character literal/ is a character literal whose /c-char-sequence/ consists of a single /c-char/ that is not a /numeric-escape-sequence/ and that specifies a character that either lacks representation in the applicable associated character encoding or that cannot be encoded in a single code unit. The /encoding-prefix/ of a multi-character literal or a non-encodable character literal shall be absent or 'L'. Such /character-literal/s are conditionally-supported. The kind of a character-literal, its type, and its associated character encoding is determined by its /encoding-prefix/ and by its /c-char-sequence/ as specified in table Y, where latter cases exclude former ones.
(end quote)
This makes "multi-character" and "non-encodable" attributes of both non-prefix and L character literals, and we don't have to define the combinatorial explosion of terms here.
In table Y, remove italics for "none" and "L" second and third rows and use "multi-character literal" and "non-encodable character literal" in both situations. Reorder in the order "multi-character", "non-encodable", ordinary.
Ok, this seems like good direction. The only bit I'm questioning is the "where latter cases exclude former ones" and reordering within the table. The suggested ordering would seem to prioritize the special cases (which makes sense), but then the "latter cases exclude former ones" seems to reverse that such that the former cases aren't reached (because no restrictions regarding number of c-chars or encodability are placed on the ordinary cases). Perhaps the intent was that latter cases are excluded by former ones?
Yeah, whatever makes sense. The point is that the sub-rows of the table are not independent, but have an implied "otherwise".
We need to say something somewhere to make that happen.
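To make the implied "otherwise" concrete, here is a minimal sketch of how the reordered sub-rows could be read as a first-match-wins chain. The enum, the function name, and all three parameters are hypothetical stand-ins for what a lexer would compute; nothing here is proposed wording.

```cpp
#include <cstddef>

// Hypothetical names, for illustration only.
enum class LiteralKind { MultiCharacter, NonEncodable, Ordinary };

// Rows are tested in table order; each later row applies only if every
// earlier row did not match (the implied "otherwise").
constexpr LiteralKind classify(std::size_t c_char_count,
                               bool is_numeric_escape,
                               bool fits_one_code_unit) {
    return c_char_count > 1
               ? LiteralKind::MultiCharacter
           : (!is_numeric_escape && !fits_one_code_unit)
               ? LiteralKind::NonEncodable
               : LiteralKind::Ordinary;
}
```

Because the ordinary row places no restriction of its own, it is only reached when the two special rows have already been ruled out, which is the "excluded by former ones" reading.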
The update does not address the concern that phase 5 encodes but phase 6 concatenates string literals, which might change the encoding. Example: "a" and u8"b" are concatenated as u8"ab". Suppose my ordinary literal encoding is EBCDIC. Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to UTF-8. And then, this is normatively equivalent to u8"ab". That doesn't add up.
If we concatenate first and then encode, we have the issue that numeric-escape-sequences might alter their meaning by the concatenation. Example: "\33" "3" under the status quo must be encoded as \33 followed by "3" (two code units). When we concatenate first, this becomes "\333" (presumably a single code unit). This is particularly serious because string literal concatenation is the only safe way I am aware of to reliably terminate a hexadecimal-escape-sequence. It seems we need a nested mini-lexer here so that we first recognize escape-sequences, then we concatenate (to get the right type/encoding), then we encode. Ugh.
Yes, I have intentionally chosen not to address this concern in this paper; in part because this paper is not intended to change behavior for implementations (other than to fix what seem to be unintended bugs in some implementations). But for this, there is implementation divergence that I think is not due to unintended behavior; Visual C++ does appear to implement the encode first approach prescribed by the standard. See https://github.com/sg16-unicode/sg16/issues/47 and https://msvc.godbolt.org/z/4buyxk.
I don't understand the MSVC output for the case
  const char8_t* u8_2 = u8"" "\u0102";
  /execution-charset:utf-8 /std:c++latest
at all, assuming it is this line:
  $SG2797 DB 0c3H, 084H, 0e2H, 080H, 09aH, 00H
Unchecking "Unused labels" in the "Filter..." drop down list makes it easier to correlate the lines.
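As an aside on the numeric-escape concern raised above, the status quo behavior can be checked at compile time. This is only a small sketch; the array names are made up for illustration.

```cpp
#include <cstddef>

// Adjacent literals: the escape ends at the literal boundary, so the
// result is the code unit 033, then '3', then the terminating '\0'.
constexpr char split[] = "\33" "3";

// Written as one literal, the octal escape consumes all three digits.
constexpr char joined[] = "\333";

static_assert(sizeof(split) == 3, "two code units plus terminator");
static_assert(sizeof(joined) == 2, "one code unit plus terminator");
static_assert(split[0] == '\33' && split[1] == '3', "escape, then '3'");

// A hexadecimal escape has no length limit, so "\x1bc" would be read as
// the single (out of range for char) escape \x1bc. Splitting the literal
// terminates the escape after "1b".
constexpr char esc_c[] = "\x1b" "c";
static_assert(sizeof(esc_c) == 3, "ESC, 'c', terminator");
static_assert(esc_c[0] == '\x1b' && esc_c[1] == 'c', "ESC, then 'c'");
```

A "concatenate first, then recognize escapes" model would have to preserve exactly these results, which is why the escape sequences need to be recognized before concatenation.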
I think this case does actually reflect unintended behavior in the compiler. What appears to be happening is that U+0102 is encoded as UTF-8 (0xC4 0x82) and then those individual code units are treated as Windows-1252 and again re-encoded as UTF-8. In Windows-1252, 0xC4 is U+00C4, 0x82 is U+201A, and encoding those as UTF-8 produces the sequence { 0xC3 0x84 } { 0xE2 0x80 0x9A }.
That is what I would expect as well.
I thought /execution-charset:utf-8 would select UTF-8 for ordinary string literals, so even in an "encode first, then concatenate" world, there should be no difference vs. u8"" u8"\u0102".
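The double-encoding hypothesis for the MSVC output can be reproduced by hand. The following is a minimal sketch, not a description of what the compiler does internally; the function names are hypothetical, and the Windows-1252 decoder is deliberately partial (it only knows the one 0x80-0x9F mapping needed here).

```cpp
#include <string>

// Encode one code point as UTF-8. Assumes cp <= U+FFFF and not a surrogate.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Partial Windows-1252 decoder (assumption: only 0x82 is looked up from the
// 0x80-0x9F block). Bytes 0x00-0x7F and 0xA0-0xFF map to the same code point.
inline char32_t decode_cp1252(unsigned char byte) {
    if (byte == 0x82) return 0x201A;  // SINGLE LOW-9 QUOTATION MARK
    return byte;
}

// Encode as UTF-8, misread each code unit as Windows-1252, encode again.
inline std::string double_encode(char32_t cp) {
    std::string once;
    append_utf8(once, cp);
    std::string twice;
    for (unsigned char b : once) {
        append_utf8(twice, decode_cp1252(b));
    }
    return twice;
}
```

Calling double_encode(0x0102) yields the bytes C3 84 E2 80 9A, which matches the DB line quoted above apart from the terminating 00.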
I'll send a separate email requesting this so that it gets more visibility.
A core issue was requested for this in , but I don't think it was ever added to the active issues list.
Mike, what's the number of the core issue for this?
Good idea, will do.
Tom, please add half a paragraph to the front matter of your paper, referring to the core issue and saying that your paper doesn't (attempt to) address it. In the interim (so that we don't forget), add the link to the e-mail you gave above.
Ok. I was intentional in not prescribing either a "sequence as a whole" or a "one at a time" approach. Explicitly acknowledging both approaches makes sense; I like your suggestion.
We should be clear in the text whether an implementation is allowed to encode a sequence of non-numeric-escape-sequence s-chars as a whole, or whether each character is encoded separately. There was concern that "separately" doesn't address stateful encodings, where the encoding of string character i+1 may depend on what string character i was.
I added notes about that, but it sounds like you want something that explicitly grants such allowances normatively, is that correct?
I'm seeing notes about stateful encodings; I'm not seeing a note about the "sequence as a whole" approach in general. Maybe in lex.string pZ.1 augment the note "The encoding of a string may differ from the sequence of code units obtained by encoding each character in the string individually."
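To illustrate why "as a whole" and "one at a time" can differ, here is a toy stateful encoding. It is entirely hypothetical and modeled only loosely on shift-based encodings: uppercase letters live in a shifted state that is entered with SO (0x0E) and left with SI (0x0F), and every encoded string starts and ends in the initial state. The sketch also assumes an ASCII-compatible source character set.

```cpp
#include <string>

// Encode the whole string, tracking the shift state across characters.
inline std::string encode_whole(const std::string& text) {
    std::string out;
    bool shifted = false;
    for (char c : text) {
        const bool needs_shift = (c >= 'A' && c <= 'Z');
        if (needs_shift != shifted) {
            out += needs_shift ? '\x0E' : '\x0F';  // SO enters, SI leaves
            shifted = needs_shift;
        }
        // In the shifted state a letter is emitted as its lowercase form.
        out += needs_shift ? static_cast<char>(c - 'A' + 'a') : c;
    }
    if (shifted) out += '\x0F';  // return to the initial state at the end
    return out;
}

// Encode each character in isolation and concatenate the results.
inline std::string encode_each(const std::string& text) {
    std::string out;
    for (char c : text) {
        out += encode_whole(std::string(1, c));
    }
    return out;
}
```

For "AB", encode_whole produces SO a b SI (four code units) while encode_each produces SO a SI SO b SI (six code units). Both decode to the same text, which is the situation the suggested note is meant to permit.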
I meant only that we could discuss the terminology SG16 desires for the future in conjunction with our current terminology discussions; that need not have any impact on this paper at this time.
Maybe replace "associated character encoding" -> "associated literal encoding" globally to avoid the mention of "character" here.
Despite the use of the "C" word, "character encoding" is more consistent with Unicode terminology. Though if we really want to be consistent, we should use "character encoding form" (which ISO/IEC 10646 then calls simply "encoding form"). This is something we could discuss at the SG16 meeting next week.
The paper is in CWG's court; involving SG16 is not helpful at this stage absent more severe concerns that would involve sending back the paper as a CWG action. That said, everybody (including members of SG16) is invited to CWG telecons to offer their opinion.
I'll address this more in my response to Corentin's most recent reply, but I believe the term "character encoding" is correct here. Wikipedia's definition is useful. Note that Shift-JIS is a character encoding despite the fact that it encodes non-characters (e.g., escape sequences).
off-topic remarks: Since we'd be using a new term such as "literal encoding" here, I don't think Unicode will get in our way. I'd like to point out that "character encoding" (also in the Unicode meaning) sounds like a character-at-a-time encoding, which we expressly don't want to require. So, choosing a different term than one that has Unicode semantic connotations seems wise.
_______________________________________________
"These sequences should have no effect on encoding state for stateful character encodings." -> "These sequences are assumed not to affect encoding state for stateful character encodings." In general, we can't use "should" (normative encouragement) in notes.
Ah, yes. I should know by now that I shouldn't use should.
There are a few more "should"s that need fixing, beyond the one instance I highlighted specifically.
Thanks, I'll do global search and replace.
Tom.
Core mailing list
Core@lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
Link to this post: http://lists.isocpp.org/core/2020/07/9530.php