sg16: Re: [SG16] [isocpp-core] Updated draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Richard Smith <richardsmith_at_[hidden]>
Date: Tue, 21 Jul 2020 16:45:22 -0700

On Sun, Jul 19, 2020 at 8:52 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 7/18/20 2:57 AM, Jens Maurer wrote:
>
> On 18/07/2020 08.48, Tom Honermann via SG16 wrote:
>
> On 7/15/20 3:21 AM, Richard Smith wrote:
>
> On Tue, 14 Jul 2020, 22:56 Tom Honermann, <tom_at_[hidden]> wrote:
>
> On 7/14/20 3:23 AM, Richard Smith wrote:
>
> 5.13.3/Z.2.2:
> """
> — Otherwise, if the character-literal's encoding-prefix is absent or L, then the value is implementation-defined.
> """
>
> I appreciate that your wording reflects the behavior of the prior wording, but while we're here: do we really want '\xff' to have an implementation-defined value rather than being required to be (char)0xff (assuming 'char' is signed and 8-bit)? Now that we guarantee two's complement, perhaps we should just say you always get the result of converting the given value to char / wchar_t? (Similarly in 5.13.5/Z.2.)
>
> That seems reasonable, and I believe it matches existing practice, but I'm not sure how to word it. Would we address cases like '\xFFFF' (with the same sign/size assumptions) explicitly? I don't think we can defer to the integral conversion rules since the source value doesn't have a specific type (the wording states "an integer value v"). Perhaps we could steal the "type that is congruent to the source integer modulo 2^N" wording?
>
> """
> — Otherwise, if the character-literal's encoding-prefix is absent or L, then the value is the unique value of the /character-literal/'s type t that is congruent to v modulo 2^N, where N is the width of t.
> """
>
> Yes, that or something like it seems quite reasonable to me.
>
> I looked into this and found that gcc 10.1, clang 10, Visual C++ 19.24, and icc 19 all accept '\xff' and produce a value of -1 as expected, but for '\x100', gcc and icc emit a warning, while clang and Visual C++ reject the literal (https://www.godbolt.org/z/6qa1b7). That leads me to believe this should be considered more of an evolutionary change and addressed in a different paper.
>
> A hex number is conceptually unsigned. We could say we take the character-literal's
> type (or its underlying type, if any), take the unsigned type corresponding to that
> (if it's not already unsigned), and you only get the "modulo 2^N" behavior if
> the hex value is in the range of representable values for that unsigned type.
>
> Ok, that seems pretty workable. I updated the D2029R3 draft
> <https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r3.html> and posted
> it to the wiki
> <https://wiki.edg.com/bin/view/Wg21summer2020/CoreWorkingGroup> for the
> core issues processing telecon on Monday. A blurb has been added to the
> introduction and PR overview. The wording update states:
>
> > [lex.ccon]pZ.3: Otherwise, if the character-literal's encoding-prefix is
> > absent or L, and V does not exceed the range of representable values of the
> > corresponding unsigned type for the underlying type of the
> > character-literal's type, then the value is the unique value of the
> > character-literal's type T that is congruent to V modulo 2^N, where N is
> > the width of T.
>
> > [lex.string]pZ.2: Otherwise, if the string-literal's encoding-prefix is
> > absent or L, and V does not exceed the range of representable values of the
> > corresponding unsigned type for the underlying type of the string-literal's
> > code unit type, then the value is the unique value of the string-literal's
> > code unit type T that is congruent to V modulo 2^N, where N is the width
> > of T.
>
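For readers following along, the rule in the two quoted bullets can be sketched in C++ roughly as below. This is only an illustration under stated assumptions: the name numeric_escape_value is hypothetical and appears in neither the paper nor any library, the escape's value V is modelled as an unsigned long long, and the empty result simply means "not covered by this bullet" (what happens then is left to the rest of the wording).

```cpp
#include <cassert>
#include <limits>
#include <optional>
#include <type_traits>

// Hypothetical helper, named here for illustration only.
// T is the literal's code unit type (char or wchar_t); v is the escape's value V.
template <typename T>
constexpr std::optional<T> numeric_escape_value(unsigned long long v) {
    // The "corresponding unsigned type" for the underlying type of T.
    using U = std::make_unsigned_t<T>;
    // The bullet applies only when v is representable in that unsigned type.
    if (v > std::numeric_limits<U>::max()) {
        return std::nullopt;  // outside this bullet; see the remaining wording
    }
    // With two's complement guaranteed (C++20), this conversion yields the
    // unique value of T congruent to v modulo 2^N, where N is the width of T.
    return static_cast<T>(static_cast<U>(v));
}

// A few sanity checks that avoid assuming a particular width or signedness.
inline void check_numeric_escape_examples() {
    constexpr unsigned long long uchar_max =
        std::numeric_limits<unsigned char>::max();
    assert(numeric_escape_value<char>(0x7F) == static_cast<char>(0x7F));
    assert(numeric_escape_value<char>(uchar_max) == static_cast<char>(uchar_max));
    assert(!numeric_escape_value<char>(uchar_max + 1).has_value());
    assert(numeric_escape_value<wchar_t>(0x41) == static_cast<wchar_t>(0x41));
}
```

On an implementation with a signed 8-bit char, numeric_escape_value<char>(0xFF) holds -1, which matches the compiler results reported above for '\xff'.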
Thanks! I happened to be browsing the relevant part of the C11 standard
today when I encountered:

"If an integer character constant contains a single character or escape
sequence, its value is the one that results when an object with type char
whose value is that of the single character or escape sequence is converted
to type int.
[...]
EXAMPLE 2 Consider implementations that use two’s complement representation
for integers and eight bits for objects that have type char. In an
implementation in which type char has the same range of values as signed
char, the integer character constant '\xFF' has the value −1; if type char
has the same range of values as unsigned char, the character constant
'\xFF' has the value +255."

So this change appears to increase the compatibility between C and C++ as
well :) (Don't look too hard at the non-example portion of the C wording,
though, as it does not appear to actually justify the results given by the
example.)
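
A quick way to compare a C++ implementation against the values in that C example is a check along the following lines. It is a sketch only: it assumes an 8-bit char (as the example does), the helper names are made up for illustration, and it reflects the behavior reported for the compilers tested earlier in the thread rather than anything the current wording requires.

```cpp
#include <cassert>
#include <limits>

static_assert(std::numeric_limits<unsigned char>::digits == 8,
              "this sketch assumes 8-bit char, as in the C11 example");

// The value of the literal '\xFF', widened to int (hypothetical helper name).
constexpr int xff_value() { return '\xFF'; }

// The value the C11 example gives, depending on whether plain char is signed.
constexpr int expected_xff_value() {
    return std::numeric_limits<char>::is_signed ? -1 : 255;
}

inline void check_xff() {
    // Holds on the compilers mentioned in this thread.
    assert(xff_value() == expected_xff_value());
    // Equivalent formulation: the literal equals the converted value.
    assert('\xFF' == static_cast<char>(0xFF));
}
```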

> Tom.
>
> Jens

Received on 2020-07-21 18:48:53