Date: Sun, 23 Jun 2019 11:17:33 +0200
Hello
It is clear that one of the challenges faced by SG-16 is doing UTF-8 I/O in a
system very much not designed for Unicode.
I distinguish 2 classes of I/O:
 - Run-time I/O
 - Compile-time I/O, aka text embedded in source files
Run-time I/O is a complicated topic that I personally think will only be
solved by a careful redesign of the I/O and locale facilities... but that is a
much larger discussion.
I actually want to discuss compile-time I/O, aka string and character
literals.
It's a problem complicated enough that it was one of the reasons Python
underwent a language break that people are still enduring a decade later.
Let's assume C++ doesn't want to go that route.
Let's further assume:
 - Unicode will not be replaced or superseded in our lifetime
 - Unicode is the only character set able to handle all text, for it is
 the only one that is a strict superset of all previously existing
 encodings and strives to encompass all characters used by people
 - Poor Unicode support at the language and system levels has led to
 most developers having a poor understanding of text and encodings
 (regardless of skill level)
- In many cases developers and engineers only care about sequences of
bytes they can pronounce rather than text, and in these cases ASCII is
sufficient
 - Systems and compiler vendors have a vested interest in supporting
 Unicode, which is already the most used encoding in user-facing systems:
 https://en.wikipedia.org/wiki/Unicode#Adoption
I propose that we work towards making Unicode the only supported _source_
character set. I realize this might take time, as far from all source files
are encoded in a UTF encoding; however, Unicode is designed to make that
possible.
This is also existing practice: both GCC and Clang assume a UTF-8 input
encoding by default.
In the meantime, I propose that:
 - Source files with characters outside of the basic source character set
 embedded in a UTF-8, UTF-16 or UTF-32 character or string literal are
 ill-formed *if and only if* the source/input character set is not
 Unicode.
 - We put out a guideline recommending that source files be UTF-8
 encoded
 - We put in the standard that compilers should assume UTF-8-encoded files
 as the default input encoding in the absence of implementation-defined
 out-of-band information (this would have no practical impact, but would
 signal that we recommend supporting UTF-8)
 - We deprecate narrow and wide character and string literals (char,
 wchar_t) whose source representation contains characters not representable
 in the basic execution character set or the wide execution character set,
 respectively. We encourage implementers to emit a warning in these cases;
 the intent is to avoid loss of information when transcoding to the
 execution character set. This matches existing practice.
The proposed changes aim to make it easier to use string literals and
Unicode string literals portably across platforms without loss of
information, by capitalizing on char8_t.
They would standardize existing practice, match common practice in other
languages (Go, Rust, Swift, Python), and avoid bugs related to loss of
information when transcoding arbitrary Unicode data to legacy encodings
able to represent only a very small subset of the characters defined by
Unicode.
They would also make it feasible to have Unicode identifiers in the future,
as proposed by JF.
Looking forward to discussing these ideas further,
Corentin
Received on 2019-06-23 11:17:46