sg16: [SG16-Unicode] UTF I/O : Source file and encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sun, 23 Jun 2019 11:17:33 +0200

Hello

It is clear that one of the challenges faced by SG16 is doing UTF-8 I/O in a
system very much not designed for Unicode.

I distinguish two classes of I/O:

   - Run-time I/O
   - Compile-time I/O, aka text embedded in source files

Run-time I/O is a complicated topic that I personally think will only be
solved by a careful redesign of the I/O and locale facilities... but it's a
much larger discussion.
Here I actually want to discuss compile-time I/O, aka string and character
literals.

It's a problem complicated enough that it was one of the reasons Python
went through a language break that people are still enduring more than a
decade later. Let's assume C++ doesn't want to go that route.

Let's further assume:

   - Unicode will not be replaced or superseded in our lifetime
   - Unicode is the only character set able to handle text, for it is the
   only one that is a strict superset of all previously existing
   character sets and strives to encompass all characters used by people
   - Poor Unicode support at the language and system levels has led to
   most developers having a poor understanding of text and encoding
   (regardless of skill level)
   - In many cases developers and engineers only care about sequences of
   bytes they can pronounce rather than text, and in these cases ASCII is
   sufficient
   - System and compiler vendors have a vested interest in supporting
   Unicode, which is already the most used encoding in user-facing systems:
   https://en.wikipedia.org/wiki/Unicode#Adoption

I propose that we work towards making Unicode the only supported _source_
character set. I realize this might take time, as far from all source files
are encoded in a UTF encoding; however, Unicode is designed to make that
possible.
This is also standard practice: both GCC and Clang assume a UTF-8
encoding.

In the meantime, I propose that:

   - Source files with characters outside of the basic source character set
   embedded in a UTF-8, UTF-16 or UTF-32 character or string literal are
   ill-formed *if and only if* the source/input character set is not
   Unicode.
   - We put out a guideline recommending that source files be UTF-8
   encoded.
   - We state in the standard that compilers should assume UTF-8 as the
   default input encoding of source files, absent implementation-defined
   out-of-band information (which would have no practical impact, other
   than to signal that we recommend supporting UTF-8).
   - We deprecate narrow and wide character and string literals (char,
   wchar_t) whose source representation contains characters not
   representable in the basic execution character set or wide execution
   character set respectively. We encourage implementers to emit a warning
   in these cases. The intent is to avoid loss of information when
   transcoding to the execution character set. This matches existing
   practice.
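
To make the first bullet concrete, here is a minimal sketch, assuming a
C++11-or-later compiler. The helper names (same_bytes, ucn_literal_size,
source_decoded_as_utf8) are made up for this illustration and are not part
of any proposal. It contrasts a literal spelled with a
universal-character-name, whose value does not depend on the source
encoding, with the same literal spelled with the raw character, whose
value depends on how the compiler decoded the file.

```cpp
#include <cstddef>
#include <cstring>

// Compares two literals byte for byte. Written as a template so it works
// whether u8 literals are arrays of char (before C++20) or char8_t (C++20).
template <class T, std::size_t N, std::size_t M>
bool same_bytes(const T (&a)[N], const T (&b)[M]) {
    return N == M && std::memcmp(a, b, N * sizeof(T)) == 0;
}

// U+00E9 spelled as a universal-character-name. In UTF-8 it is the two
// code units 0xC3 0xA9, so the array holds those plus the terminating NUL.
std::size_t ucn_literal_size() {
    return sizeof(u8"\u00e9");
}

// The same character typed directly into the source. This only matches the
// UCN spelling if the compiler decoded this file with the encoding it was
// actually saved in (assumed here to be UTF-8).
bool source_decoded_as_utf8() {
    return same_bytes(u8"é", u8"\u00e9");
}
```

If the file is saved as UTF-8 but the compiler decodes it as some other
encoding, source_decoded_as_utf8() would be expected to return false,
because the two bytes of the raw character get reinterpreted as two
separate characters. That silent mismatch is the situation the first
bullet would turn into a diagnosable error.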

The proposed changes aim to make it easier to use string literals and
Unicode string literals portably across platforms without loss of
information, by capitalizing on char8_t.
They would standardize existing practice, match common practice in other
languages (Go, Rust, Swift, Python) and avoid bugs related to loss of
information when transcoding arbitrary Unicode data to legacy encodings
able to represent only a very small subset of the characters defined by
Unicode.
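
As a rough illustration of the kind of information loss meant here, the
following sketch converts UTF-32 text to ISO-8859-1 (Latin-1), standing in
for a legacy execution character set. The Latin1Result type and the
to_latin1_lossy function are hypothetical helpers written for this example,
not standard facilities, and the '?' substitution is just one behavior a
real transcoder might choose.

```cpp
#include <cstddef>
#include <string>

// Outcome of the conversion: the Latin-1 bytes, plus how many characters
// could not be represented and were replaced.
struct Latin1Result {
    std::string bytes;
    std::size_t replaced;
};

// Hypothetical helper: Latin-1 covers exactly U+0000..U+00FF, so anything
// above that range is replaced by '?' and counted as lost.
Latin1Result to_latin1_lossy(const std::u32string& text) {
    Latin1Result out{std::string(), 0};
    for (char32_t c : text) {
        if (c <= 0xFF) {
            out.bytes.push_back(static_cast<char>(static_cast<unsigned char>(c)));
        } else {
            out.bytes.push_back('?');
            ++out.replaced;
        }
    }
    return out;
}
```

With this sketch, U"caf\u00e9" survives intact, while each character
outside Latin-1 (for example U+03C0 or U+2248) collapses to the same '?'
and can no longer be recovered.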

They would also make it feasible to have Unicode identifiers in the future,
as proposed by JF.

Looking forward to discussing these ideas further,
Corentin

Received on 2019-06-23 11:17:46