On 12/5/18 10:33 PM, Steve Downey wrote:

All of the u8 strings I saw contained no escape sequences.
Not that \u escapes would change the argument. They work identically in source and explicit encoding.

Right now, u8"" means transcode from source encoding to UTF-8 rather than to execution encoding.
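A minimal sketch of what that means in practice, assuming a conforming compiler. The helper name hex_bytes is made up for illustration. Because the literal uses a \u escape, the bytes do not depend on the source encoding at all. The template accepts both const char[] (C++11 through C++17) and const char8_t[] (C++20) u8 literals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: renders the code units of a string literal as hex,
// excluding the trailing NUL. Works for char and char8_t arrays alike.
template <typename Ch, std::size_t N>
std::string hex_bytes(const Ch (&lit)[N]) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        unsigned char b = static_cast<unsigned char>(lit[i]);
        out += digits[b >> 4];
        out += digits[b & 0xF];
        if (i + 2 < N) out += ' ';  // separator between bytes, none after the last
    }
    return out;
}
```

For example, hex_bytes(u8"\u00e9") yields the two UTF-8 code units for U+00E9 whatever execution encoding is in effect, whereas hex_bytes("\u00e9") depends on the execution encoding.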

I suspect that there are often errors where, if the source encoding was not UTF-8, the result string would not be the intended one.

Would not be the intended one because the actual source encoding doesn't match the encoding the compiler uses to read the source? I'm not sure how to interpret "if the source encoding was not UTF-8".

I think you're describing a situation something like this: the actual source file encoding is UTF-8, but the compiler reads the source as "8-bit ASCII" and simply passes non-ASCII code unit values through (transcoding ASCII to UTF-8 is a no-op if the compiler doesn't check for non-ASCII values). The u8 literals then happen to have the UTF-8 contents the programmer expects despite the source encoding mismatch. In this particular case, though, correcting the encoding mismatch would produce the same results (for u8 literals, and also for ordinary literals iff the presumed execution encoding is also UTF-8).
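To make the pass-through case concrete, here is a rough simulation rather than a description of how any particular compiler is implemented. The function names are hypothetical, and ISO-8859-1 is only an assumed stand-in for a mismatched single-byte source encoding. It contrasts passing the source bytes through untouched with actually transcoding them to UTF-8.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of the pass-through case: bytes of the literal are
// copied as-is, so UTF-8 source bytes stay valid UTF-8.
std::string pass_through(const std::string& source_bytes) {
    return source_bytes;
}

// Hypothetical model of a real transcode from an assumed ISO-8859-1 source
// encoding to UTF-8: every byte >= 0x80 becomes a two-byte sequence.
std::string latin1_to_utf8(const std::string& source_bytes) {
    std::string out;
    for (unsigned char c : source_bytes) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With the UTF-8 source bytes C3 A9 (U+00E9), pass_through keeps C3 A9, while latin1_to_utf8 produces C3 83 C2 A9, which is the kind of unintended result a genuine mismatch would cause.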

Tom.

On Wed, Dec 5, 2018, 22:19 Tom Honermann <tom@honermann.net> wrote:

On 12/5/18 8:31 PM, Markus Scherer wrote:

On Wed, Dec 5, 2018 at 3:34 PM Steve Downey <sdowney@gmail.com> wrote:

How many contain text that is not already UTF-8?

I am not sure what you are asking. Most of the u8"literals" I am seeing contain non-ASCII characters. Many as literal characters, a bunch of \uhhhh, and a few \U00hhhhhh.

I was likewise uncertain about this question.

Steve, I'm guessing the question you're trying to get at is: would there be any behavioral difference if the u8 prefix were simply dropped? I think that is equivalent to asking whether the source files for these examples are encoded as UTF-8 and whether the compiler is invoked such that the source encoding and presumed execution encoding are both UTF-8 (always the case for Clang, the default for gcc unless -finput-charset or -fexec-charset is used, and not the case for MSVC unless /utf-8 is used).
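One way to check this for a given literal and build configuration is to compare the code units with and without the prefix. This is only a sketch, and same_bytes is a name invented here. It is a template so that it accepts const char[] as well as const char8_t[] (the type of u8 literals in C++20).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true if two string literals have identical code
// units (including the trailing NUL), regardless of character type.
template <typename A, typename B, std::size_t N, std::size_t M>
bool same_bytes(const A (&a)[N], const B (&b)[M]) {
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i) {
        if (static_cast<unsigned char>(a[i]) !=
            static_cast<unsigned char>(b[i])) {
            return false;
        }
    }
    return true;
}
```

Putting the same non-ASCII characters in both arguments, once without and once with the u8 prefix, gives a result that depends on the source and execution encodings in effect, which is exactly the question. The hex-escaped form same_bytes("\xc3\xa9", u8"\u00e9") is true independent of those settings, since the ordinary literal spells out the UTF-8 bytes directly.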

Tom.