Re: [isocpp-sg16] What does Annex B mean by "character"

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 24 Jun 2024 11:56:55 -0500

On 6/13/24 12:29 PM, Alisdair Meredith via SG16 wrote:
> Several of the implementation quantities specified in Annex B
> talk about the number of characters in a line, or an identifier.
>
> Now that we have a clearer notion of supporting UTF-8 source
> files and Unicode in identifiers, do we have a clear understanding
> of what we mean by “character”?

No, we don't, and yes, it would be great to fix this!

In various contexts, "character" might be used to refer to:

  * An abstract character (e.g., the elements of the basic character set).
  * A code unit, which might be any of:
      o A character encodable as a single code unit.
      o An integer value that could indicate a single code unit of a
        valid code unit sequence.
      o An integer value that is not a valid code unit (e.g., u8'\xff').
      o An element of a shift sequence.
      o An element of a path or file name with no definite encoding.
  * A code point, which might be any of:
      o A character encoded in a literal or execution encoding.
      o A multibyte character in a locale dependent encoding.
      o A character denoted by an escape sequence or
        /universal-character-name/.
  * A glyph (a user-perceived character; I don't think the standard is
    currently affected by this).

I wouldn't be surprised to learn that there are others.
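
To make the difference between some of these readings concrete, here is a
minimal sketch that counts the same string two ways: as UTF-8 code units and
as code points. It assumes the input is well-formed UTF-8, and the function
names count_code_units and count_code_points are illustrative only; they are
not taken from the standard or from any library.

```cpp
#include <cstddef>
#include <string_view>

// Number of UTF-8 code units (bytes) in s.
constexpr std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Number of code points in s, ASSUMING s is well-formed UTF-8:
// every byte that is not a continuation byte (bit pattern 10xxxxxx)
// starts a new code point.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// "café" with a precomposed U+00E9: 5 code units, 4 code points.
static_assert(count_code_units("caf\xC3\xA9") == 5);
static_assert(count_code_points("caf\xC3\xA9") == 4);

// "e" followed by U+0301 COMBINING ACUTE ACCENT: 3 code units and
// 2 code points, although a reader would perceive one character.
static_assert(count_code_units("e\xCC\x81") == 3);
static_assert(count_code_points("e\xCC\x81") == 2);
```

A limit such as "characters in an identifier" would give a different answer
for each of these counts, and a different answer again if user-perceived
characters were meant.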

>
> For the implementation quantities, I expect we mean code units
> in the source character set, but we might also interpret them as
> Unicode code points, which might comprise multiple code units
> in UTF-8.
>
> Should we bring some clearer language to bear in Annex B, and
> should we clarify our assumed understanding in each case?

Ideally, yes.

I think the best way forward is to file LWG issues for any unclear uses.
Those can then be assigned to SG16 to offer an interpretation or a
recommendation to clarify the wording for LWG.

Tom.

>
> AlisdairM
> (On vacation in Thailand but cannot help myself)

Received on 2024-06-24 16:56:56