On 6/13/24 12:29 PM, Alisdair Meredith via SG16 wrote:

Several of the implementation quantities specified in Annex B
talk about the number of characters in a line, or an identifier.

Now that we have a clearer notion of supporting UTF-8 source
files and Unicode in identifiers, do we have a clear understanding
of what we mean by “character”?

No, we don't, and yes, it would be great to fix this!

In various contexts, "character" might be used to refer to:

- An abstract character (e.g., the elements of the basic character set).
- A code unit, which might be any of:
  - A character encodable as a single code unit.
  - An integer value that could indicate a single code unit of a valid code unit sequence.
  - An integer value that is not a valid code unit (e.g., u8'\xff').
  - An element of a shift sequence.
  - An element of a path or file name with no definite encoding.
- A code point, which might be any of:
  - A character encoded in a literal or execution encoding.
  - A multibyte character in a locale dependent encoding.
  - A character denoted by an escape sequence or universal-character-name.
- A glyph (a user-perceived character; I don't think the standard is currently affected by this).

I wouldn't be surprised to learn that there are others.


For the implementation quantities, I expect we mean code units
in the source character set, but we might also interpret them as
Unicode code points, which might comprise multiple code units
in UTF-8.
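
To make the gap between the two readings concrete, here is a minimal sketch that counts an identifier both ways. It assumes the input is well-formed UTF-8, and the names count_code_units and count_code_points are hypothetical helpers for illustration only, not anything from the standard or from Annex B.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: the number of UTF-8 code units is just the byte length.
constexpr std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Hypothetical helper: counts code points, ASSUMING well-formed UTF-8.
// Every code point contributes exactly one byte that is not a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// "naïve" with U+00EF spelled as its two UTF-8 code units (0xC3 0xAF),
// so the example does not depend on the encoding of this source file.
constexpr std::string_view ident = "na\xC3\xAFve";

static_assert(count_code_units(ident) == 6);   // n a C3 AF v e
static_assert(count_code_points(ident) == 5);  // n a ï v e
```

An identifier limit stated in "characters" would therefore allow different identifiers depending on which count is meant, and the difference grows for scripts whose code points need three or four UTF-8 code units.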

Should we bring some clearer language to bear in Annex B, and
should we clarify our assumed understanding in each case?

Ideally, yes.

I think the best way forward is to file LWG issues for any unclear uses. Those can then be assigned to SG16 to offer an interpretation or a recommendation to clarify the wording for LWG.

Tom.


AlisdairM
(On vacation in Thailand but cannot help myself)