sg16

Re: [SG16-Unicode] Draft: char8_t backward compatibility remediation paper

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 5 Dec 2018 23:33:43 -0500
On 12/5/18 10:33 PM, Steve Downey wrote:
> All of the u8 strings I saw contained no escape sequences.
> Not that \u escapes would change the argument. They work identically
> in source and explicit encoding.
> Right now, u8"" means transcode from source encoding to UTF-8 rather
> than to execution encoding.
> I suspect that there are often errors where if the source encoding was
> not UTF-8, the result string would not be the intended one.

Would not be the intended one because the actual source encoding doesn't
match the encoding the compiler uses to read the source? I'm not sure
how to interpret "if the source encoding was not UTF-8".

I think you're describing a situation something like this: Actual source
file encoding is UTF-8. Compiler reads the source as "8-bit ASCII" and
non-ASCII code unit values are just passed through (since transcoding
ASCII to UTF-8 is a no-op if not checking for non-ASCII values),
resulting in u8 literals happening to have the UTF-8 contents the
programmer expects despite the source encoding mismatch. In this
particular case though, correcting the encoding mismatch would produce
the same results (for u8 literals and also for ordinary literals iff the
presumed execution encoding was also UTF-8).
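
The pass-through case described above can be sketched as a small model. This is only an illustration under stated assumptions, not how any real compiler is implemented: the function names (transcode_passthrough, transcode_utf8_source, decode_utf8, encode_utf8) are hypothetical, and the decoder assumes well-formed UTF-8 input and does no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of a compiler that reads the source as "8-bit ASCII":
// each byte is copied into the u8 literal unchanged, with no check for
// non-ASCII values, so the "transcoding" to UTF-8 is a no-op.
std::string transcode_passthrough(const std::string& source_bytes) {
    return source_bytes;
}

// Decode UTF-8 into code points. Assumes well-formed input (sketch only).
std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i++]);
        std::uint32_t cp;
        int trailing;
        if (lead < 0x80)      { cp = lead;        trailing = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; trailing = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; trailing = 2; }
        else                  { cp = lead & 0x07; trailing = 3; }
        for (int k = 0; k < trailing && i < s.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Encode code points as UTF-8 (no surrogate or range checks; sketch only).
std::string encode_utf8(const std::vector<std::uint32_t>& cps) {
    std::string out;
    for (std::uint32_t cp : cps) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical model of the corrected configuration: the compiler knows the
// source is UTF-8, decodes it to code points, and encodes the u8 literal.
std::string transcode_utf8_source(const std::string& source_bytes) {
    return encode_utf8(decode_utf8(source_bytes));
}
```

For a source file whose bytes really are UTF-8 (for example the two bytes 0xC3 0xA9 for U+00E9), both functions return the same bytes, which is the sense in which correcting the encoding mismatch would not change the u8 literal.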

Tom.

>
> On Wed, Dec 5, 2018, 22:19 Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 12/5/18 8:31 PM, Markus Scherer wrote:
>> On Wed, Dec 5, 2018 at 3:34 PM Steve Downey <sdowney_at_[hidden]
>> <mailto:sdowney_at_[hidden]>> wrote:
>>
>> How many contain text that is not already UTF-8?
>>
>>
>> I am not sure what you are asking. Most of the u8"literals" I am
>> seeing contain non-ASCII characters. Many as literal characters,
>> a bunch of \uhhhh, and a few \U00hhhhhh.
>
> I was likewise uncertain about this question.
>
> Steve, I'm guessing the question you're trying to get at is, would
> there be any behavioral difference if the u8 prefix were simply
> dropped? I think this is equivalent to asking the question, are
> the source files for these examples encoded as UTF-8 and is the
> compiler invoked such that the source encoding and presumed
> execution encoding are both UTF-8 (always the case for Clang, the
> default for gcc unless -finput-charset or -fexec-charset is used,
> and not the case for MSVC unless /utf-8 is used)?
>
> Tom.
>


Received on 2018-12-06 05:41:46