liaison: Re: [wg14/wg21 liaison] [SG16] WG14 N2701: @ and $ in source and execution character set

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 30 Mar 2021 00:01:11 +0200

+ liaison (which I hope will forgive my use of C++ terminology)

On Mon, Mar 29, 2021 at 5:15 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> FYI, WG14 will be considering N2701
> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2701.htm>, a paper
> proposing the addition of '@', '$', and '`' to the basic source and
> execution character sets.
>
Because I am fun at parties, some rambling on basic character sets.

(Note that the following is mostly a reflection on the current
specification of C (and to some extent C++), rather than the proposal
itself).

What does this proposal solve?

Both languages allow (but do not require) these characters or any
characters in source files.
In C++, the basic character set is unrelated to source files.
In C, all lexing is done in source encoding.

But neither language prescribes an encoding. So extending the set of
abstract characters that source encodings are supposed to encode does not
increase, in any way, the portability of source files. The only way to make
source files portable is to mandate the recognition of specific encodings,
hence C++'s P2295.
If an encoding were to be prescribed, it would imply a character set which
wouldn't need further description (i.e., no need to say UTF-8 must encode $,
it just does).

Beyond phase 1 the set of codepoints that must be supported is described by
the grammar. If a character set can't represent an open parenthesis,
it might be difficult to represent a C program and as such requirements on
source encodings don't need to be further described.

Then there are evaluated string/character literals.

Please realize that whether something can appear in a source file and
whether it can appear in an evaluated string/character literal (and by
extension anything interpreted in a locale-specific encoding*) are different
concerns and should not be tied together. Yet both languages describe the
grammar in terms of the basic character set, effectively preventing these
concerns from being handled separately (I have good hope that in C++ that
coupling can be removed without too much work; that does seem somewhat more
challenging in C).

So I am left wondering what the goal is for character literals / execution
character set?
In both C and C++ putting a character in the basic execution set guarantees
that

   - The execution encoding can encode these characters. Do we want
   programs that never use these characters to suddenly be ill-formed or UB
   because they use an encoding that does not encode $ or @, even if these
   characters are never used? There are plenty of such encodings
   (https://en.wikipedia.org/wiki/ISO/IEC_646#Variant_comparison_chart);
   well, *were plenty*, not sure these get much use these days.
   - As a consequence, these characters will be encoded correctly. It is
   unfortunate that the handling of non-encodable characters is
   implementation-defined, and some implementations choose to replace such
   characters with a question mark.
   - Both languages like to put constraints on how many code units a given
   member of the execution character set is represented with. But because both
   languages do support stateful encodings, in the general case unless you
   know exactly what the execution encoding _is_, it is not wise to randomly
   access elements of a string, and so the single byte guarantee buys you very
   little.
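
To make the last point concrete, here is a minimal sketch (mine, not taken
from N2701) of why byte-wise access is risky when the encoding is unknown.
It uses Shift-JIS, which is multibyte rather than stateful, but the hazard is
the same. Assumptions: the byte 0x5C is treated as a backslash, as it would
be by ASCII-minded code, and the two function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: a naive, encoding-unaware search for a backslash.
// Returns the byte offset of the first 0x5C byte, or std::string::npos.
std::size_t find_backslash_bytewise(const std::string& bytes) {
    return bytes.find('\x5C');
}

// Shift-JIS encoding of the single character U+30BD (katakana "so"):
// lead byte 0x83 followed by trail byte 0x5C.
std::string sjis_katakana_so() {
    return std::string("\x83\x5C", 2);
}

// Demonstrates the false positive; call it from main() to run the checks.
void demo_false_positive() {
    const std::string s = sjis_katakana_so();
    assert(s.size() == 2);                    // one character, two bytes
    assert(find_backslash_bytewise(s) == 1);  // "found" inside that character
}
```

The string holds one character and no backslash, yet the search reports a
match at offset 1, in the middle of the character.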

And so the portability argument, while slightly stronger than for source,
still implies that the code does not make specific assumptions about the
execution encoding (most of my code would behave terribly if targeting a
Shift-JIS environment).
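
One way for code to state such assumptions instead of silently relying on
them is to check them at compile time. The following is only a sketch under
the assumption that the program expects ASCII values for these characters;
the helper name is hypothetical, and nothing here is proposed by the paper.

```cpp
// Assumption made explicit: the ordinary literal encoding assigns the
// ASCII values to these three characters. Where that does not hold,
// compilation stops here instead of the program misbehaving at run time.
static_assert('$' == 0x24, "'$' is not encoded as 0x24");
static_assert('@' == 0x40, "'@' is not encoded as 0x40");
static_assert('`' == 0x60, "'`' is not encoded as 0x60");

// Each literal occupies exactly one code unit: the array holds that unit
// plus the terminating null character.
static_assert(sizeof("$") == 2 && sizeof("@") == 2 && sizeof("`") == 2,
              "expected single code unit encodings");

// Hypothetical helper that relies on the assumptions checked above.
constexpr bool is_at_or_dollar(char c) {
    return c == '$' || c == '@';
}
```

With checks like these the program itself refuses to build for an encoding
it cannot work with, whatever the standard requires of that encoding.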

And what does that give us? What is the motivation for "requiring" specific
characters to be encodable even when neither the program nor the library
makes use of them?
It should be on the users to choose an encoding appropriate for the program
(and on compilers to refuse to compile when the encodings can't encode some
literals).

Lastly, and this is my main concern (my only concern, really) with this
paper's motivation: there is, fortunately, nothing in the wording of either
language that suggests that extending the basic source character set
impacts identifiers, which define their own grammar.

But in no case should the paper suggest that implementers suddenly have the
freedom to support @ in identifiers.
Symbols are too rare for such wastage, and the fact that implementations all
support $ is problematic enough for the future evolution of either language.
And extending, or not extending these sets does very little in allowing or
disallowing the use of these characters outside of comments and string
literals.

Corentin

* Actually, it is possible to imagine that $ is not part of the literal
character set but is part of the locale-specific execution encoding, if we
admit that these things are separate (and yes, this is a contrived scenario;
my point is that we should be explicit about what problem we are trying
to address).

> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2021-03-29 17:01:25