Date: Tue, 9 Jun 2020 13:08:29 -0400
In early implementations of modernish C, once locales had been invented,
the C compiler's locale, distinct from the "C" locale, was what was used to
determine the values of characters.
On Tue, Jun 9, 2020 at 1:01 PM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:
>
>
> On Tue, 9 Jun 2020 at 18:45, Steve Downey <sdowney_at_[hidden]> wrote:
>
>> One thing I have realized while working on identifiers is that after
>> conversion from whatever the sources are, lexing and parsing are symbolic.
>> That is, 'a' doesn't have a value until it's rendered into a literal. That
>> is, "The values of the members of the execution character sets and the
>> sets of additional members are locale-specific."
>> <http://eel.is/c++draft/lex.charset#3.sentence-5> really only comes into
>> play when rendering the "execution character set" into characters or
>> strings. The execution character set and the source character set exist in
>> the same logical space right now, and the "source character set" isn't what
>> is in source files today.
>>
>
> Yep, and they don't have to have a value either: identifiers are not
> sorted, etc.
> Everything in lex is symbolic anyway; the phases don't exist in practice.
> However, the internal representation being isomorphic to Unicode, it
> would be possible to describe it in terms of Unicode with no observable
> behavior change.
> That is, 'a' doesn't have a value, but it is still the 'a' abstract
> character, which is represented by U+0061 in Unicode.
>
> I believe that sentence to be, however, very misleading.
> The execution encodings are implementation-defined rather than
> locale-specific.
> It becomes locale-specific at runtime, and the wording doesn't distinguish
> at all between compile time and runtime.
>
> And yeah, source character set is... the minimal subset of the internal
> representation character set
>
>
>>
>> On Tue, Jun 9, 2020 at 12:03 PM Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Tue, 9 Jun 2020 at 16:48, Tom Honermann via SG16 <
>>> sg16_at_[hidden]> wrote:
>>>
>>>> This is your friendly reminder that an SG16 telecon will be held
>>>> tomorrow, Wednesday June 10th, at 19:30 UTC (timezone conversion
>>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20200610T193000&p1=1440>).
>>>> To attend, visit https://bluejeans.com/140274541 at the start of the
>>>> meeting.
>>>>
>>>> The agenda for the meeting is:
>>>>
>>>> - Discuss terminology updates to strive for in C++23
>>>> - P1859R0: Standard terminology character sets and encodings
>>>> <https://wg21.link/p1859>
>>>> - Establish priorities for terms to address.
>>>> - Establish a methodology for drafting wording updates.
>>>>
>>>> Anticipated decisions to be made at this meeting include:
>>>>
>>>> - Prioritization of terminology updates to pursue.
>>>>
>>>> Prior to tomorrow's meeting, please:
>>>>
>>>> - review P1859R0, particularly the proposed terminology.
>>>> - think of other terminology changes to be considered.
>>>> - think of how we can divide up the work for making terminology
>>>> updates.
>>>>
>>> Hey!
>>> Some feedback on P1859 after a first attempt at rewording the standard.
>>>
>>> I will start by saying that it seems entirely reasonable and useful to
>>> rewrite [lex] in terms of this new terminology, and I think that trying
>>> to split that work would end up being counterproductive (however, the
>>> library wording, which has its own definitions, could be reworded
>>> independently). It is not that much work and I'm willing to do that work.
>>>
>>> I found that I needed to use the following terms as defined by the
>>> Unicode Standard:
>>>
>>> * abstract character
>>> * character set
>>> * character encoding
>>> * code units, codepoint
>>>
>>> (we can bikeshed codepoint vs scalar values in the grammar as UCNs are
>>> technically scalar values)
>>>
>>> The notion of character repertoire was not useful; that of character set
>>> is sufficient.
>>>
>>> The notion of basic source character set could be removed, instead
>>> describing lexing after phase 1 entirely in terms of Unicode. A couple of
>>> library functions would have to be reworded, as well as a note in the
>>> description of user-defined literals, as they use "basic source character
>>> set" as a proxy to describe something else.
>>>
>>> In particular, it is useful to separate entirely the notions of source
>>> encoding (which only exists in phase 1), internal representation, and
>>> literal encodings. These are 3 distinct and unrelated categories of
>>> character sets and encodings, which should have no relation to each other
>>> beyond the existence of a unidirectional mapping from source to internal
>>> and from internal to literal, so I think it would be valuable not to
>>> describe them in terms of each other.
>>>
>>> It is useful to be able to talk about the Unicode character set rather
>>> than "the character set described in ISO/IEC 10646".
>>> The U+xxxx notation (plus Unicode character names) is also useful to
>>> describe specific codepoints in the grammar.
>>>
>>> Similarly, the basic execution character set is not a very useful notion,
>>> as it is only used as a mechanism to describe which characters are in the
>>> execution and execution wide character sets.
>>> While I didn't try to do it, I think it makes sense to rename execution
>>> character set to something like narrow/wide literal character sets, in the
>>> vein of what P1859 proposes.
>>>
>>> It is useful to be able to talk about both literal encoding and literal
>>> character sets for each type of literal (a given encoding
>>> implicitly represents a character set).
>>>
>>> The notion of dynamic encoding proposed by P1859 and its relation to the
>>> literal encoding are not needed in lex and might be better described in
>>> the library wording, although a note in lex might not hurt.
>>>
>>> While I have not done that work yet, it seems useful to describe in the
>>> grammar, in terms of Unicode codepoints, what constitutes whitespace as
>>> well as a new line.
>>>
>>> With the exception of "character literal" (and "abstract character"), it
>>> seems valuable to systematically replace the use of the vacuous term
>>> "character" in the core wording.
>>> That might be slightly more involved in the library wording, as
>>> "character" is used all over the place, usually to mean "code unit".
>>>
>>> The PDF attached is meant to be illustrative of the scope of changes in
>>> the core wording, and also contains a number of design changes that are
>>> mostly out of scope of the terminology discussion (it is also full of
>>> bugs). These design changes will appear in a paper in more detail soon™.
>>>
>>> It notably incorporates changes from P2029, which go a long way toward
>>> improving the way character literals are described.
>>>
>>> Hope that helps,
>>> Corentin
>>>
>>>
>>>> Tom.
>>>> --
>>>> SG16 mailing list
>>>> SG16_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>
>>>
Received on 2020-06-09 12:11:50