+ liaison (which I hope will forgive my use of C++ terminology)

On Mon, Mar 29, 2021 at 5:15 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

FYI, WG14 will be considering N2701, a paper proposing the addition of '@', '$', and '`' to the basic source and execution character sets.

Because I am fun at parties, some rambling on basic character sets.

(Note that the following is mostly a reflection on the current specification of C (and to some extent C++), rather than the proposal itself).

What does this proposal solve?

Both languages allow (but do not require) these characters or any characters in source files.

In C++, the basic character set is unrelated to source files.

In C, all lexing is done in source encoding.

But neither languages prescribe an encoding. So extending the set of abstract characters that source encodings are supposed to encode does not increase,

in any way, portability of source files. The only way to make source files portable is to mandate the recognition of specific encodings, hence C++'s P2295.

If an encoding were to be prescribed, it would imply a character set which wouldn't need further description (ie, no need to say UTF-8 must encode $, it just does).

Beyond phase 1 the set of codepoints that must be supported is described by the grammar. If a character set can't represent an open parenthesis,

it might be difficult to represent a C program and as such requirements on source encodings don't need to be further described.

Then there is evaluated string/characters literals.

Please realize that whether something can appear in a source files or can appear in an evaluated string/character (And by extension anything interpreted in a local-specific encoding*) are different concerns and should not be tied, except that both languages describe the grammar in terms of the basic character set, effectively preventing these different concerns to be handled

separately (I have good hope that in C++ that coupling can be removed without too much work, that does seem somewhat more challenging in C).

So I am left wondering what the goal is for character literals / execution character set?

In both C and C++ putting a character in the basic execution set guarantees that

The execution encoding characters can encode these characters. Do we want programs that never use these characters to be suddenly ill-formed or UB because they use an encoding that does not encode $ or @ (of which there are plenty https://en.wikipedia.org/wiki/ISO/IEC_646#Variant_comparison_chart - well, were plenty, not sure these get much use these days) - even if these characters are ever use?
By consequences, these characters will be encoded correctly. It is unfortunate that the handling of non encodable characters is implementation defined, and some implementations choose to replace such characters by a question mark.
Both languages like to put constraints on how many code units a given member of the execution character set is represented with. But because both languages do support stateful encodings, in the general case unless you know exactly what the execution encoding _is_, it is not wise to randomly access elements of a string, and so the single byte guarantee buys you very little.

And so the portability argument, while slightly stronger than for source, still implies that the code does not make specific assumptions about the execution encoding (most of my code would behave terribly if targeting a shift-jis environment).

And what does that give us? What if the motivation for "requiring" specific characters to be encodable even when neither the program or library or or the program make use of them?

It should be on the users to choose an encoding appropriate for the program (and on compilers to refuse to compile when the encodings can't encode some literals).

Lastly, and this is my main concern(my only concern really) with this paper's motivation, there is, fortunately nothing in the wording of either languages that suggests that extending the basic source character set

impacts identifiers, which define their own grammar.

But in no case should the paper suggest that implementers have suddenly the freedom to support @ in identifiers.

Symbols are too rare for such wastage and that they all support $ is problematic enough for future evolution of either language.

And extending, or not extending these sets does very little in allowing or disallowing the use of these characters outside of comments and string literals.

Corentin

* Actually, it is possible to imagine that $ is not part of the literal character set but is part of the local-specific execution encoding, if we admit that these things are separate (and yes, this is a contrived scenario,

my point is that we should be explicit in what is the problem we are trying to address)

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16