Date: Mon, 24 Jun 2024 11:56:55 -0500
On 6/13/24 12:29 PM, Alisdair Meredith via SG16 wrote:
> Several of the implementation quantities specified in Annex B
> talk about the number of characters in a line, or an identifier.
>
> Now that we have a clearer notion of supporting UTF-8 source
> files and Unicode in identifiers, do we have a clear understanding
> of what we mean by “character”?
No, we don't, and yes, it would be great to fix this!
In various contexts, "character" might be used to refer to:
  * An abstract character (e.g., the elements of the basic character set).
  * A code unit, which might be any of:
      o A character encodable as a single code unit.
      o An integer value that could indicate a single code unit of a
        valid code unit sequence.
      o An integer value that is not a valid code unit (e.g., u8'\xff').
      o An element of a shift sequence.
      o An element of a path or file name with no definite encoding.
  * A code point, which might be any of:
      o A character encoded in a literal or execution encoding.
      o A multibyte character in a locale-dependent encoding.
      o A character denoted by an escape sequence or
        /universal-character-name/.
  * A glyph (a user-perceived character; I don't think the standard is
    currently affected by this).
I wouldn't be surprised to learn that there are others.
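To make the code unit versus code point distinction concrete, here is a minimal sketch. It is only an illustration: the function names are hypothetical, nothing in it comes from the standard's wording, and the code point count assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of UTF-8 code units, i.e. one per byte.
constexpr std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Hypothetical helper: number of code points, ASSUMING well-formed UTF-8.
// Every byte that is not a continuation byte (bit pattern 10xxxxxx)
// starts a new code point.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// "caf" followed by U+00E9, which UTF-8 encodes as the two bytes C3 A9.
static_assert(count_code_units("caf\xC3\xA9") == 5);
static_assert(count_code_points("caf\xC3\xA9") == 4);

// "e" followed by U+0301 (combining acute accent, bytes CC 81):
// three code units and two code points, yet one user-perceived character.
static_assert(count_code_units("e\xCC\x81") == 3);
static_assert(count_code_points("e\xCC\x81") == 2);
```

Under these assumptions, a limit stated as "N characters" permits a different maximum identifier or line length depending on which of the two counts is intended.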
>
> For the implementation quantities, I expect we mean code units
> in the source character set, but we might also interpret them as
> Unicode code points, which might comprise multiple code units
> in UTF-8.
>
> Should we bring some clearer language to bear in Annex B, and
> should we clarify our assumed understanding in each case?
Ideally, yes.
I think the best way forward is to file LWG issues for any unclear uses.
Those can then be assigned to SG16 to offer an interpretation or a
recommendation to clarify the wording for LWG.
Tom.
>
> AlisdairM
> (On vacation in Thailand but cannot help myself)
Received on 2024-06-24 16:56:56