Date: Tue, 9 Jun 2020 17:05:49 -0400
On Tue, Jun 9, 2020 at 4:59 PM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:
>
>
> On Tue, 9 Jun 2020 at 22:17, Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> wrote:
>
>> On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>>
>>>
>>> On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney_at_[hidden]> wrote:
>>>
>>>> One thing I have realized while working on identifiers is that after
>>>> conversion from whatever the sources are, lexing and parsing are symbolic.
>>>> That is, 'a' doesn't have a value until it's rendered into a literal. That
>>>> is, "The values of the members of the execution character sets and the
>>>> sets of additional members are locale-specific."
>>>> <http://eel.is/c++draft/lex.charset#3.sentence-5> really only comes into
>>>> play when rendering the "execution character set" into characters or
>>>> strings. The execution character set and the source character set exist in
>>>> the same logical space right now, and the "source character set" isn't what
>>>> is in source files today.
>>>>
>>>
>>> Yep, and they don't have to have a value either. Identifiers are not
>>> sorted, etc.
>>> Everything in lex is symbolic anyway; the phases don't exist in practice.
>>> However, the internal representation being isomorphic to Unicode,
>>> it would be possible to describe it in terms of Unicode with no observable
>>> behavior change.
>>>
>> I would like to allow characters not present in Unicode within character
>> literals, string literals, comments, and header names. More abstractly, I
>> would like to allow source -> encoding-used-for-output conversion.
>>
>
> Do you have an example of a use case you want to support?
>
I am still evaluating the round-trip mapping for EBCDIC.
> There are 3 scenarios:
>
> - The character exists in no digital encoding yet - that is the paper
> implementation case - nothing that we can do. You can't have Klingon in
> your C++.
> - The character exists in a digital encoding but not in Unicode. This
> represents a small number of the Big5 encodings' characters, almost all
> pertaining to place and people names. Unicode documents a mapping for
> Windows's Big5 code page.
> - The character has a non-unique mapping to Unicode, such that a
> conversion source -> Unicode -> execution might be different from a
> conversion source -> execution. In this case an implementation can convert
> source -> execution directly (taking care of UCNs and other escape
> sequences) - as it is otherwise not observable. This use case is actually
> common and important, notably for Shift-JIS and ambiguities introduced by
> Han Unification.
>
This last case is more broken by applying normalization. Do you have an
example where the mapping does not work even if normalization is not
applied?
>
>
>
>
>>
>>
>>> aka 'a' doesn't have a value but it is still the 'a' abstract character,
>>> which is represented by U+0061 in Unicode.
>>>
>>> I believe that sentence to be, however, very misleading.
>>> The execution encodings are implementation-defined rather than
>>> locale-specific.
>>> It becomes locale-specific at runtime, and the wording doesn't distinguish
>>> at all between compile time and runtime.
>>>
>>> And yeah, source character set is... the minimal subset of the internal
>>> representation character set
>>>
>>>
>>>>
>>>> On Tue, Jun 9, 2020 at 12:03 PM Corentin Jabot <corentinjabot_at_[hidden]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, 9 Jun 2020 at 16:48, Tom Honermann via SG16 <
>>>>> sg16_at_[hidden]> wrote:
>>>>>
>>>>>> This is your friendly reminder that an SG16 telecon will be held
>>>>>> tomorrow, Wednesday June 10th, at 19:30 UTC (timezone conversion
>>>>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20200610T193000&p1=1440>).
>>>>>> To attend, visit https://bluejeans.com/140274541 at the start of the
>>>>>> meeting.
>>>>>>
>>>>>> The agenda for the meeting is:
>>>>>>
>>>>>> - Discuss terminology updates to strive for in C++23
>>>>>> - P1859R0: Standard terminology character sets and encodings
>>>>>> <https://wg21.link/p1859>
>>>>>> - Establish priorities for terms to address.
>>>>>> - Establish a methodology for drafting wording updates.
>>>>>>
>>>>>> Anticipated decisions to be made at this meeting include:
>>>>>>
>>>>>> - Prioritization of terminology updates to pursue.
>>>>>>
>>>>>> Prior to tomorrow's meeting, please:
>>>>>>
>>>>>> - review P1859R0, particularly the proposed terminology.
>>>>>> - think of other terminology changes to be considered.
>>>>>> - think of how we can divide up the work for making terminology
>>>>>> updates.
>>>>>>
>>>>> Hey!
>>>>> Some feedback on P1859 after a first attempt at rewording the standard.
>>>>>
>>>>> I will start by saying that it seems entirely reasonable and useful to
>>>>> rewrite [lex] in terms of this new
>>>>> terminology, and I think that trying to split that work would end up
>>>>> being counterproductive (however, the library wording, which has its own
>>>>> definitions, could be reworded
>>>>> independently). It is not that much work and I'm willing to do that
>>>>> work.
>>>>>
>>>>> I found that I needed to use the following terms as defined by the
>>>>> Unicode Standard
>>>>>
>>>>> * abstract character
>>>>> * character set
>>>>> * character encoding
>>>>> * code units, codepoint
>>>>>
>>>>> (we can bikeshed codepoint vs scalar values in the grammar as UCNs are
>>>>> technically scalar values)
>>>>>
>>>>> The notion of character repertoire was not useful; that of character
>>>>> set is sufficient.
>>>>>
>>>>> The notion of basic source character set could be removed, instead
>>>>> describing lexing after phase 1 entirely in terms of Unicode - a couple of
>>>>> library functions would have to be reworded, as well as a note in the
>>>>> description of user defined literals as they use "basic source character
>>>>> set" as a proxy to describe something else.
>>>>>
>>>>> In particular, it is useful to separate entirely the notions of source
>>>>> encoding (which only exists in phase 1), internal representation, and
>>>>> literal encodings. These are 3 distinct and unrelated categories of
>>>>> character sets and encodings, which should have no relation to each other
>>>>> beyond the existence of a unidirectional mapping from source to internal
>>>>> and internal to literal, so I think it would be valuable not to describe
>>>>> them in terms of each other.
>>>>>
>>>>> It is useful to be able to talk about the Unicode character set rather
>>>>> than "the character set described in ISO/IEC 10646".
>>>>> The U+xxxx notation (+ Unicode character names) is also useful to
>>>>> describe specific codepoints in the grammar.
>>>>>
>>>>> Similarly, the basic execution character set is not a very useful
>>>>> notion (it is only used as a mechanism to describe which
>>>>> characters are in the execution and execution wide character sets).
>>>>> While I didn't try to do it, I think it makes sense to rename execution
>>>>> character set to something like narrow/wide literal character sets, in the
>>>>> vein of what P1859 proposes.
>>>>>
>>>>> It is useful to be able to talk about both literal encoding and
>>>>> literal character sets for each type of literal (a given encoding
>>>>> implicitly represents a character set).
>>>>>
>>>>> The notion of dynamic encoding proposed by P1859 and its relation to
>>>>> the literal encoding are not needed in lex and might be better described in
>>>>> library, although a note in lex might not hurt.
>>>>>
>>>>> While I have not done that work yet, it seems useful to describe in
>>>>> the grammar, in terms of Unicode codepoints, what constitutes whitespace as
>>>>> well as a new line.
>>>>>
>>>>> With the exception of "character literal" (and "abstract character"),
>>>>> it seems valuable to systematically replace the use of the vacuous term
>>>>> "character" in the core wording.
>>>>> That might be slightly more involved in library, as "character" is used
>>>>> all over the place, usually to mean "code unit".
>>>>>
>>>>> The pdf attached is meant to be illustrative of the scope of changes
>>>>> in the core wording, and also contains a number of design changes that are
>>>>> mostly out of scope of the terminology discussion (it is also full of
>>>>> bugs). These design changes will appear in a paper in more detail soon™
>>>>>
>>>>> It notably incorporates changes from P2029 which go a long way in
>>>>> improving the way character literals are described.
>>>>>
>>>>> Hope that helps,
>>>>> Corentin
>>>>>
>>>>>
>>>>>> Tom.
>>>>>> --
>>>>>> SG16 mailing list
>>>>>> SG16_at_[hidden]
>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>
>>>
>>
Received on 2020-06-09 16:09:15