Subject: Re: [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-30 11:19:39


On 6/30/20 10:49 AM, Corentin Jabot via Core wrote:
>
>
> On Tue, Jun 30, 2020, 16:32 Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 6/30/20 1:31 AM, Corentin Jabot wrote:
>>
>>
>> On Tue, 30 Jun 2020 at 06:49, Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>>
>> On 6/30/20 12:15 AM, Corentin Jabot wrote:
>>>
>>>
>>> On Tue, Jun 30, 2020, 05:52 Tom Honermann <tom_at_[hidden]
>>> <mailto:tom_at_[hidden]>> wrote:
>>>
>>> On 6/28/20 2:03 AM, Corentin Jabot wrote:
>>>>
>>>>
>>>> On Sun, 28 Jun 2020 at 07:37, Corentin Jabot
>>>> <corentinjabot_at_[hidden]
>>>> <mailto:corentinjabot_at_[hidden]>> wrote:
>>>>
>>>>
>>>>
>>>> On Sun, Jun 28, 2020, 06:50 Tom Honermann via SG16
>>>> <sg16_at_[hidden]
>>>> <mailto:sg16_at_[hidden]>> wrote:
>>>>
>>>> A new draft revision of P2029 (Proposed
>>>> resolution for core issues 411, 1656, and 2333;
>>>> numeric and universal character escapes in
>>>> character and string literals) is now available
>>>> at
>>>> https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html.
>>>> This addresses the CWG feedback provided during
>>>> the March 23rd, 2020 core issues processing
>>>> teleconference
>>>> <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>>>>
>>>> Wording review feedback prior to the next Core
>>>> issues processing teleconference would be much
>>>> appreciated!
>>>>
>>>> I really like the overall direction, a few comments:
>>>> - Can we not make conditionally supported escape
>>>> sequences part of the grammar?
>>>>
>>> This was requested by Core in the 2020-01-16 issues
>>> processing telecon
>>> <https://wiki.edg.com/bin/view/Wg21prague/IssuesProcessingTeleconference2020-01-16>.
>>>>
>>>> What I would do:
>>>> simple-escape-sequence:
>>>>   any member of the basic source character set other
>>>> than u, U, x, and the members of octal-digit
>>>>
>>>> And in 5.13, keep
>>>> Escape sequences not listed in Table 9 are
>>>> conditionally supported, with implementation-defined
>>>> semantics
>>> What problem would that solve?
>>>
>>>
>>> Not having a separate grammar for non-standard features; a
>>> simpler grammar.
>> I prefer the current approach in the paper, but I have no
>> objection to doing what you suggest if the CWG expresses such
>> a preference.
>>>
>>>
>>>> - Can we not add notes for stateful encodings? It
>>>> doesn't add anything.
>>>>
>>> Stateful encodings were discussed in the 2020-03-23
>>> issues processing telecon
>>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>>>
>>>
>>>
>>> Sure, it is still a level of detail that doesn't add
>>> anything. I would like to avoid people in 30 years wondering
>>> why these sentences are here.
>> Stateful encodings are still a thing.  They may still be a
>> thing in 30 years.
>>
>> I am not saying they aren't and wouldn't be, I am saying that the
>> current wording was enough for that to be implemented correctly
>> while the new wording does not.
> I'm not following.  What do you believe the new wording changes? 
> Discussion of stateful encodings is limited to non-normative notes.
>>
>>>> -- Wide multi character literals were not a thing,
>>>> let's not make them one now. same for conditional
>>>> character literals and conditional wide character
>>>> literals.
>>>>
>>>> Instead, please add text in (Z) to describe them?
>>>> ie:
>>>>
>>>> - ordinary and wide character literals consisting of
>>>> a single basic-c-char, simple-escape-sequence, or
>>>> universal-character-name that specifies a character
>>>> that either lacks representation in the associated
>>>> character encoding or that cannot be encoded as a
>>>> single code unit
>>>> are conditionally supported and have an
>>>> implementation-defined value
>>>> - A wide character literal consisting of multiple
>>>> c-chars is conditionally-supported and has an
>>>> implementation-defined value.
>>>>
>>> Giving these odd literals a name was suggested by Core. 
>>> I agree with their suggested direction; giving them a
>>> name makes it easier to discuss and define them.
>>>
>>>
>>>
>>> No, especially wide multi characters that are simply not a
>>> thing, let's not make them one. The reason multi character
>>> literals exist and have a name is that their type is
>>> different from character literals.
>> They are a thing in C (see WG14 N2176 (the final draft WP
>> before C18) 6.4.4.4, "Character constants", p11).  I believe
>> their omission in C++ is just an oversight.  Compilers
>> support them.  I think they are a thing and giving them a
>> name is useful.
>>
>> They don't have a name in C either
> I don't see how giving them a name is in any way detrimental.
>>
>>> Should I send a mail to core? Because I really do not like
>>> that direction. (Especially as what you call wide multi
>>> character literal doesn't behave at all as multi character
>>> literals). We should also look at making them ill formed
>>> rather than giving them a name
>>
>> Arguably, you have already sent that mail to Core :)
>>
>> Haha indeed, nice :)
>>
>> I don't know what behavioral difference you are concerned
>> about.  The primary reason for differentiating them is to
>> allow the multicharacter case to be ill-formed
>> (conditionally-supported) and/or to have an encoding that
>> differs from single c-char literals.
>>
>> I think the standard should reflect existing practice.  These
>> odd literals are supported in common compilers.  If you would
>> like to make them ill-formed, you are certainly free to write
>> a paper, but implementations are already free to make them
>> ill-formed and I suspect the ones that don't would retain
>> support for them as an extension anyway.
>>
>> I am very concerned about giving names to anti-features that
>> didn't have a name for the past 30 years, especially those that
>> are not used and were previously not a thing in C++ (I guess we
>> disagree on our reading of the C standard). I am not concerned
>> about behavior changes.
>> Describing them in a bullet point, rather than in this table, keeps
>> the table readable and meaningful and leaves us with the
>> following features:
>>
>> * ordinary/wide/utf character literal
>> * multi character literal
>>
>> Which is then mostly symmetric with the table for strings.
>> The bullet point then describes these odd behaviors, which
>> again is not a behavior change. We are just talking about naming
>> and presentation.
>>
>> (multi character literal does need a name, both because it always
>> had one and because it has a different type; it is also
>> somewhat used)
> Whether you consider these literal kinds anti-features is not
> relevant.  Again, I don't see how giving them names is a bad
> thing; it doesn't legitimize them any more than describing them in
> a bullet list does.  A name can acquire both positive and negative
> connotations. The current direction in the paper reflects CWG
> guidance.  If you would like to participate in the next CWG review
> of this paper and argue your position, please feel free.  I don't
> intend to argue it further in this email thread.
>>
>>>>
>>>>
>>>> Please change
>>>> The sequence of characters denoted by each
>>>> contiguous sequence of basic-s-chars, r-chars,
>>>> simple-escape-sequences ([lex.ccon]), and
>>>> universal-character-names ([lex.charset]) is
>>>> encoded to a code unit sequence
>>>> To
>>>> Each basic-s-chars, r-chars,
>>>> simple-escape-sequences ([lex.ccon]), and
>>>> universal-character-names ([lex.charset]) is
>>>> encoded to a code unit sequence
>>>>
>>> The intent is to make it clear that these sequences are
>>> encoded as a group.  This is necessary for stateful
>>> encodings with SI/SO characters since such characters
>>> don't necessarily contribute a code unit sequence on
>>> their own.  This was also requested during the
>>> 2020-03-23 issues processing telecon
>>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>>>
>>>
>>> The effect is that I can encode things like e,U+0301 as a
>>> single code unit, which at the very least should not be
>>> allowed in a wording change.
>> Please read the wording again.  I don't think it states
>> that.  If you still think it does, please elaborate in detail.
>>
>>
>>  You use the term character (which in this context is a synonym for
>> abstract character)
>>  The sequence of *characters* denoted by *each contiguous
>> sequence *of basic-s-chars, r-chars, simple-escape-sequences
>> ([lex.ccon]), and universal-character-names ([lex.charset]) is
>> encoded to a code unit sequence using the string-literal's
>> associated character encoding. If a *character* lacks [...]
>>
>> Maybe : Each codepoint denoted by a single basic-s-chars,
>> r-chars, simple-escape-sequences ([lex.ccon]), and
>> universal-character-names is encoded to a code unit sequence
>> using the string-literal's associated character encoding. If that
>> codepoint lacks representation in the associated character encoding,
>>
>> Note that codepoint isn't particularly meaningful in this context;
>> it could be "element", for example. The point is that the sequence
>> is not converted as a whole.
>> Changing that is a design change (I don't have a terribly strong
>> opinion either way, but it needs to be discussed outside of core,
>> notably because it would allow implementations to handle combining
>> characters differently).
> Again, that wording reflects prior CWG guidance.  Since the (wide)
> execution encoding is implementation-defined, I don't agree that
> this reflects a design change, but I defer to our benevolent CWG
> chair to make that determination as he sees fit.
>
>
> Please understand that none of my comments are directed at you, but
> rather at the proposed wording, which I understand is following CWG
> guidance. I am disagreeing with that guidance.
>
>
>>
>>> It's also a terrible reason, as c-chars and UCNs are Unicode
>>> characters at this point and cannot correspond to a
>>> stateful character as the source of the conversion. Since the
>>> thing they are converted to is an implementation-defined
>>> sequence of code units, the possibility of a state shift is
>>> implied.
>>
>> What are you referring to as a "terrible reason"?
>>
>> That
>> > The intent is to make it clear that these sequences are encoded
>> as a group.  This is necessary for stateful encodings with SI/SO
>> characters since such characters don't necessarily contribute a
>> code unit sequence on their own
>>
>> Either:
>>  - These characters appear as UCNs and they should in fact
>> contribute to a code unit sequence
>>  - They are used as part of a stateful source encoding and would
>> not have been preserved past phase 1.
> In discussion elsewhere, we've discussed the example of a
> decomposed 'é' and noted that at least some compilers will encode
> the 'e' and combining acute separately, perhaps substituting a
> character for the combining acute if the execution character set
> doesn't support combining characters as distinct characters. 
> Unless someone has proven otherwise, I believe it would be
> conforming for an implementation to convert the decomposed code
> point sequence to a composed character that is representable in
> the execution encoding (e.g., ISO-8859-1; noting that such
> normalization is not desirable for Unicode encodings).  I see no
> reason to be more restrictive here.
>
>
>
> We have established elsewhere that implementations can't currently do
> that in phase 5 and that they should not be allowed to (we didn't
> poll that, but either way this paper changes the status quo)

I don't agree that has been established.  If you feel otherwise, please
provide a reference.  I don't think this changes the status quo since
both translation phases 1 and 5 are implementation-defined (and I
believe it has been established that implementations are granted
considerable freedoms here; e.g., trigraphs and implementations that map
"private" to "public").

Tom.
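
To make the decomposed 'é' case discussed above concrete, here is a
minimal sketch. It uses u8 literals only because their encoding (UTF-8)
is fixed by the standard; for ordinary and wide literals the result
depends on the implementation-defined execution encoding, which is the
point under discussion. The helper names code_unit_count and
code_unit_at are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// Value of the i-th code unit, converted to an unsigned value for comparison.
template <class CharT, std::size_t N>
constexpr unsigned code_unit_at(const CharT (&s)[N], std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
// In UTF-8 each element is encoded separately: 0x65, then 0xCC 0x81.
static_assert(code_unit_count(u8"e\u0301") == 3,
              "decomposed form is not composed in a u8 literal");

// Precomposed form: U+00E9, encoded in UTF-8 as 0xC3 0xA9.
static_assert(code_unit_count(u8"\u00E9") == 2,
              "precomposed form is two UTF-8 code units");
```

Whether an ordinary literal such as "e\u0301" may instead be encoded as
the single ISO-8859-1 code unit 0xE9 is exactly the question this
thread leaves open; the sketch does not answer it.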

>> SI/SO characters exist in Unicode and can therefore be
>> represented as UCNs.  In translation phase 5, an
>> implementation can treat them as part of a shift sequence
>> when converting to the execution encoding.
>>
>> Again that is a design change
> Per prior comments, I disagree and trust that the CWG chair will
> make an appropriate determination about this.
>
>
> Sure. I'd be happy to participate in the review, let me know!
>
>>>>
>>>>
>>>>
>>>> - please replace applicable character encoding
>>>> by character encoding
>>>>
>>> That doesn't seem correct to me; the wording needs to
>>> indicate which character encoding.  Note that there are
>>> three occurrences of "applicable associated character
>>> encoding"; I'm not sure which use you were referring to.
>>>
>>>
>>> Missed a word. Sorry. Meant associated character encoding.
>>> "Applicable associated" doesn't add anything. Maybe "the
>>> literal's associated encoding"
>>
>> That says the same thing to me.  If CWG expresses a
>> preference, I'll change it.
>>
>> Yes it does, just trying to be consistent in terminology.
>> associated literal encoding is consistent and what sg16 has been
>> using (including you, maybe you came up with that term :p)
>
> There are multiple "associated character encodings" to choose
> from; "applicable" indicates which one to choose.  Again, if the
> CWG requests a change, I'll change it.
>
> Tom.
>
>> Tom.
>>
>
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/06/9469.php



SG16 list run by sg16-owner@lists.isocpp.org