On 6/13/24 12:29 PM, Alisdair Meredith via SG16 wrote:
Several of the implementation quantities specified in Annex B
talk about the number of characters in a line, or an identifier.

Now that we have a clearer notion of supporting UTF-8 source
files and Unicode in identifiers, do we have a clear understanding
of what we mean by “character”?

No, we don't, and yes, it would be great to fix this!

In various contexts, "character" might be used to refer to:

I wouldn't be surprised to learn that there are others.


For the implementation quantities, I expect we mean code units
in the source character set, but we might also interpret them as
Unicode code points, which might comprise multiple code units
in UTF-8.
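
To make the difference between the two readings concrete, here is a minimal C++17 sketch that counts the same identifier both ways. The function names are hypothetical and the code assumes well-formed UTF-8 input; it is only an illustration of the two interpretations, not wording or behavior taken from the standard.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: the number of UTF-8 code units is the byte length.
constexpr std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Hypothetical helper: the number of code points, assuming well-formed
// UTF-8. Every byte that is not a continuation byte (10xxxxxx) starts
// a new code point.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// "naïve" spelled with explicit UTF-8 bytes: U+00EF is encoded as C3 AF.
static_assert(count_code_units("na" "\xC3\xAF" "ve") == 6);
static_assert(count_code_points("na" "\xC3\xAF" "ve") == 5);
```

Under the code unit reading this identifier counts as 6 characters against a limit; under the code point reading it counts as 5. Neither count says anything about user-perceived characters, which would require grapheme cluster segmentation.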

Should we bring some clearer language to bear in Annex B, and
should we clarify our assumed understanding in each case?

Ideally, yes.

I think the best way forward is to file LWG issues for any unclear uses. Those can then be assigned to SG16 to offer an interpretation or a recommendation to clarify the wording for LWG.

Tom.


AlisdairM
(On vacation in Thailand but cannot help myself)