
Subject: Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?
From: Herring, Davis (herring_at_[hidden])
Date: 2019-08-13 09:27:00

> Is it politically feasible for C++ 23 and C 2x to require
> implementations to default to interpreting source files as either (i) 7
> bit ASCII or (ii) UTF-8? To be specific, char literals would thus be
> either 7 bit ASCII or UTF-8.

Answering a different question: it's not technically meaningful to do so, especially the "default to" part (as Ville has reminded us recently in a different context). We can no more specify the "actual form" of a source file than we can specify the "actual form" of program output. (We can of course require the implementation to provide functions like

  void writeUTF8(const char8_t*); // to stdout

but we can require neither that it be implemented "correctly" (e.g., with complete font support on a graphical terminal) nor that it "use UTF-8" to communicate with the environment. Recall, as always, the Van Eerd architecture that might have calligraphy as stdout.)

We could specify the source file directly as a sequence of ISO 10646 abstract characters, or even as a sequence of UTF-8 code units, but the implementation could choose to interpret the disk file to contain KOI-7 N1 with some sort of escape sequences for other characters. You might say "That's not UTF-8 on disk!", to which the implementation replies "That's how my operating system natively stores UTF-8." and the standard replies "What's a disk?".


SG16 list run by sg16-owner@lists.isocpp.org