Several of the implementation quantities specified in Annex B talk about the number of characters in a line or in an identifier. Now that we have a clearer notion of supporting UTF-8 source files and Unicode in identifiers, do we have a clear understanding of what we mean by “character”?
No, we don't, and yes, it would be great to fix this!
In various contexts, "character" might be used to refer to:
I wouldn't be surprised to learn that there are others.
For the implementation quantities, I expect we mean code units in the source character set, but we might also interpret them as Unicode code points, which might comprise multiple code units in UTF-8. Should we bring some clearer language to bear in Annex B, and should we clarify our assumed understanding in each case?
Ideally, yes.
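To make the code unit versus code point distinction concrete, here is a minimal sketch in C++17. The function names code_units and code_points are hypothetical and are not taken from the standard or from any proposal. The sketch assumes its input is well-formed UTF-8 and does not validate it.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: the number of UTF-8 code units is simply the byte count.
constexpr std::size_t code_units(std::string_view s) {
    return s.size();
}

// Hypothetical helper: the number of Unicode code points, ASSUMING the input
// is well-formed UTF-8. Every code point contributes exactly one byte that is
// not a continuation byte (continuation bytes have the bit pattern 10xxxxxx),
// so counting the non-continuation bytes counts the code points.
constexpr std::size_t code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For an identifier such as "caf\xC3\xA9" (café with U+00E9 encoded as two bytes), code_units gives 5 while code_points gives 4, so a limit stated in "characters" lands differently depending on which reading is intended. Neither count corresponds to user-perceived characters, which would need grapheme cluster segmentation.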
I think the best way forward is to file LWG issues for any
unclear uses. Those can then be assigned to SG16 to offer an
interpretation or a recommendation to clarify the wording for LWG.
Tom.
AlisdairM (On vacation in Thailand but cannot help myself)