On Thu, Jul 2, 2020, 18:45 Tom Honermann via Core <core@lists.isocpp.org> wrote:
On 7/2/20 3:02 AM, Jens Maurer via Core wrote:
On 02/07/2020 07.39, Tom Honermann wrote:
On 7/1/20 3:28 AM, Jens Maurer wrote:

      
Suggestion:

(quote)
A /multi-character literal/ is a /character-literal/ whose /c-char-sequence/ consists of
more than one /c-char/.  A /non-encodable character literal/ is a character literal
whose /c-char-sequence/ consists of a single /c-char/ that is not a
/numeric-escape-sequence/ and that specifies a character that either lacks representation
in the applicable associated character encoding or that cannot be encoded in a single code unit.
The /encoding-prefix/ of a multi-character literal or a non-encodable character literal
shall be absent or 'L'.  Such /character-literal/s are conditionally-supported.

The kind of a character-literal, its type, and its associated character encoding are determined
by its /encoding-prefix/ and by its /c-char-sequence/ as specified in table Y, where latter
cases exclude former ones.
(end quote)

This makes "multi-character" and "non-encodable" attributes of both non-prefix and L
character literals, and we don't have to define the combinatorial explosion of terms
here.  In table Y, remove italics for "none" and "L" second and third rows and use
"multi-character literal" and "non-encodable character literal" in both situations.
Reorder in the order "multi-character", "non-encodable", ordinary.
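
For illustration only (these examples are mine, not part of the suggested wording),
assuming an ordinary literal encoding such as Latin-1 in which U+0102 has no
representation:

  int  mc = 'ab';      // multi-character literal; conditionally-supported, type int
  int  ne = '\u0102';  // non-encodable character literal; conditionally-supported, type int
  char c  = 'a';       // ordinary character literal, type char
  // u8'ab' and u8'\u0102' remain ill-formed, since the encoding-prefix of such
  // literals shall be absent or L.
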
Ok, this seems like a good direction.  The only bit I'm questioning is the
"where latter cases exclude former ones" wording and the reordering within the
table.  The suggested ordering would seem to prioritize the special
cases (which makes sense), but then "latter cases exclude former
ones" seems to reverse that, such that the former cases aren't reached
(because no restrictions regarding the number of c-chars or encodability are
placed on the ordinary cases).  Perhaps the intent was that latter cases
are excluded by former ones?
Yeah, whatever makes sense.  The point is that the sub-rows of the table
are not independent, but have an implied "otherwise".  We need to say
something somewhere to make that happen.
Got it.

      
The update does not address the concern that phase 5 encodes but phase 6
concatenates string literals, which might change the encoding.

Example: "a" and u8"b" is concatenated as u8"ab".
Suppose my ordinary literal encoding is EBCDIC.
Encoding first means "a" is encoded to EBCDIC and u8"b" is encoded to UTF-8.
And then, this is normatively equivalent to u8"ab".  That doesn't add up.
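
To make the conflict concrete (the code unit values below are illustrative:
'a' is 0x81 in EBCDIC and 0x61 in UTF-8):

  const char8_t* s = "a" u8"b";
  // encode first, then concatenate:   "a" -> 0x81 (EBCDIC), u8"b" -> 0x62 (UTF-8)
  // normatively equivalent to u8"ab": 0x61 0x62 (UTF-8)
  // The two results cannot both hold.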

If we concatenate first and then encode, we have the issue that
numeric-escape-sequences might alter their meaning by the concatenation.
Example:  "\33" "3"  under the status quo must be encoded as \33 followed by "3"
(two code units).
When we concatenate first, this becomes "\333" (presumably a single code unit).
This is particularly serious because string literal concatenation is the
only safe way I am aware of to reliably terminate a hexadecimal-escape-sequence.
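
For instance (my example, not from the paper):

  const char* p = "\x41" "BC";  // three code units: 0x41, 'B', 'C'
  const char* q = "\x41BC";     // a single hexadecimal-escape-sequence with value 0x41BC
  // A hexadecimal-escape-sequence consumes every following hexadecimal digit,
  // so only concatenation reliably ends it before a digit-like character.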

It seems we need a nested mini-lexer here so that we first recognize escape-sequences,
then we concatenate (to get the right type/encoding), then we encode.  Ugh.
Yes, I have intentionally chosen not to address this concern in this
paper; in part because this paper is not intended to change behavior for
implementations (other than to fix what seem to be unintended bugs in
some implementations).  But for this issue, there is implementation divergence
that I think is not due to unintended behavior; Visual C++ does appear
to implement the encode-first approach prescribed by the standard.  See
https://github.com/sg16-unicode/sg16/issues/47 and
https://msvc.godbolt.org/z/4buyxk.
I don't understand the MSVC output for the case

const char8_t* u8_2 = u8"" "\u0102";
/execution-charset:utf-8 /std:c++latest

at all, assuming it is this line:

$SG2797 DB        0c3H, 084H, 0e2H, 080H, 09aH, 00H

Unchecking "Unused labels" in the "Filter..." drop down list makes it easier to correlate the lines.

I think this case does actually reflect unintended behavior in the compiler.  What appears to be happening is that U+0102 is encoded as UTF-8 (0xC4 0x82) and then those individual code units are treated as Windows-1252 and again re-encoded as UTF-8.  In Windows-1252, 0xC4 is U+00C4, 0x82 is U+201A, and encoding those as UTF-8 produces the sequence { 0xC3 0x84 } { 0xE2 0x80 0x9A }.
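
Lining up the bytes from the Compiler Explorer output quoted above:

  U+0102 encoded as UTF-8:                  C4 82
  C4 and 82 reinterpreted as Windows-1252:  U+00C4, U+201A
  those re-encoded as UTF-8:                C3 84, E2 80 9A
  observed output:                          0c3H, 084H, 0e2H, 080H, 09aH, 00H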

I thought /execution-charset:utf-8 would select UTF-8 for
ordinary string literals, so even in an "encode first, then
concatenate" world, there should be no difference vs.
u8"" u8"\u0102".
That is what I would expect as well.

      
A core issue was requested for this in
, but I don't think it was
ever added to the active issues list.
Mike, what's the number of the core issue for this?
I'll send a separate email requesting this so that it gets more visibility.
Tom, please add half a paragraph to the front matter of your paper,
referring to the core issue and saying that your paper doesn't
(attempt to) address it.  In the interim (so that we don't forget),
add the link to the e-mail you gave above.
Good idea, will do.

      
We should be clear in the text whether an implementation is allowed to encode
a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
each character is encoded separately.  There was concern that "separately"
doesn't address stateful encodings, where the encoding of string character
i+1 may depend on what string character i was.

      
I added notes about that, but it sounds like you want something that
explicitly grants such allowances normatively; is that correct?
I'm seeing notes about stateful encodings; I'm not seeing a note about
the "sequence as a whole" approach in general.

Maybe in lex.string pZ.1 augment the note

"The encoding of a string may differ from the sequence of code units
obtained by encoding each character in the string individually."
Ok.  I intentionally did not prescribe either a "sequence as a whole" or a "one at a time" approach.  Explicitly acknowledging both approaches makes sense; I like your suggestion.
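
As a sketch of why the two approaches can differ (a hypothetical ISO-2022-JP-style
ordinary literal encoding; byte values are illustrative, with ESC $ B and ESC ( B
switching between JIS X 0208 and ASCII):

  const char* s = "あい";
  // each character individually: 1B 24 42 24 22 1B 28 42 1B 24 42 24 24 1B 28 42 00
  // the sequence as a whole:     1B 24 42 24 22 24 24 1B 28 42 00
  // Both are permissible; the augmented note acknowledges that the results may differ.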

I am a bit tired of explaining that, but now an implementation can renormalize strings in phase 5....

      
Maybe replace "associated character encoding" -> "associated literal encoding"
globally to avoid the mention of "character" here.
Despite the use of the "C" word, "character encoding" is more consistent
with Unicode terminology.  Though if we really want to be consistent, we
should use "character encoding form" (which ISO/IEC 10646 then calls
simply "encoding form").  This is something we could discuss at the SG16
meeting next week.
The paper is in CWG's court; involving SG16 is not helpful at this stage
absent more severe concerns that would involve sending the paper back
as a CWG action.  That said, everybody (including members of SG16) is
invited to CWG telecons to offer their opinion.
I meant only that we could discuss the terminology SG16 desires for the future in conjunction with our current terminology discussions; that need not have any impact on this paper at this time.
Off-topic remarks: since we'd be using a new term such as
"literal encoding" here, I don't think Unicode will get in our
way.  I'd like to point out that "character encoding" (also in the
Unicode meaning) sounds like a character-at-a-time encoding, which
we expressly don't want to require.  So, choosing a different term
than one that has Unicode semantic connotations seems wise.
I'll address this more in my response to Corentin's most recent reply, but I believe the term "character encoding" is correct here.  Wikipedia's definition is useful.  Note that Shift-JIS is a character encoding despite the fact that it encodes non-characters (e.g., escape sequences).

Yes, that term seems correct (FYI, non-graphical characters are still considered characters).

      
"These sequences should have no effect on encoding state for stateful character encodings."
-> "These sequences are assumed not to affect encoding state for stateful character encodings."

In general, we can't use "should" (normative encouragement) in notes.
Ah, yes.  I should know by now that I shouldn't use "should".
There are a few more "should"s that need fixing, beyond the one instance
I highlighted specifically.

Thanks, I'll do global search and replace.

Tom.


_______________________________________________
Core mailing list
Core@lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
Link to this post: http://lists.isocpp.org/core/2020/07/9530.php