> Let's further assume:
>
> - Unicode will not be replaced and superseded in our lifetime

Predictions have a spectacular rate of failure, but in this case, I would not bet against you :)

> - Unicode is the only character set able to handle text, for it is the
>   only encoding that is a strict superset of all previously existing
>   encodings and strives to encompass all characters used by people

Sure.

> - Poor Unicode support at the language and system levels has led to most
>   developers having a poor understanding of text and encoding (regardless
>   of skill level)

Poor Unicode support *might* be a contributing factor, but it is hardly the only one. Sometimes, poor support is a motivating factor for getting educated.

> - In many cases, developers and engineers only care about sequences of
>   bytes they can pronounce rather than text, and in these cases ASCII is
>   sufficient

Likewise, EBCDIC is sufficient on some systems.

> - Systems and compiler vendors have a vested interest in supporting
>   Unicode, which is already the most used encoding in user-facing systems:
>   https://en.wikipedia.org/wiki/Unicode#Adoption

Not necessarily. Some may take the view that Unicode support is an external library problem.
> I propose that we work towards making Unicode the only supported _source_
> character set. I realize this might take time, as far from all source
> files are encoded in a UTF encoding; however, Unicode is designed to make
> that possible. This is also standard practice, and both GCC and Clang
> will assume a UTF-8 encoding.

I don't see a need for the standard to impose a single supported source character set; I think it is better to let the market drive this. If a convergence occurs, that would be the appropriate time for the standard to reflect it. It is appropriate for the standard to lead in some cases, but in this case, there is considerable history that prohibits a wholesale migration. I agree with the general sentiment, though; I do want to encourage and make migration easier.

> In the meantime, I propose that:
>
> - Source files with characters outside of the basic source character set
>   embedded in a UTF-8, UTF-16, or UTF-32 character or string literal are
>   ill-formed if and only if the source/input character set is not Unicode.

That would break existing code for, in my opinion, little gain. I think the more pressing concern is a means to determine what encoding to interpret a source file as.
> - We put out a guideline recommending that source files be UTF-8 encoded.

Are you suggesting a standing document? I don't see much benefit in doing so.
> - We put in the standard that compilers should assume UTF-8 encoded files
>   as the default input encoding in the absence of implementation-defined
>   out-of-band information (this would have no practical impact, but would
>   signal that we recommend supporting UTF-8).

This would again break existing code. And the out-of-band information used today by the Microsoft compiler is the current locale (active code page).
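For context, such out-of-band information is already expressible on the command line today; a sketch of the relevant options (consult each compiler's documentation for exact behavior):

```shell
# GCC: interpret the source as Windows-1252 rather than the default UTF-8
g++ -finput-charset=cp1252 main.cpp

# MSVC: interpret the source per code page 1252, or force UTF-8 for both
# the source and execution character sets
cl /source-charset:.1252 main.cpp
cl /utf-8 main.cpp
```

Note that Clang also accepts -finput-charset but supports only UTF-8 as its value, consistent with the "Clang will assume a UTF-8 encoding" remark above.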
> - We deprecate narrow and wide character and string literals (char,
>   wchar_t) whose source representation contains characters not
>   representable in the basic execution character set or the wide
>   execution character set, respectively. We encourage implementers to
>   emit a warning in these cases. The intent is to avoid loss of
>   information when transcoding to the execution character set. This
>   matches existing practice.

This is not existing practice in my experience, and I'm not sure what techniques you have in mind for such encouragement. I think I can get on board with such deprecation for the non-basic [wide] (presumed) execution character set. E.g., I'd like to change the implementation-defined and conditionally-supported behavior in these clauses such that the program becomes ill-formed:
> The proposed changes hope to make it easier to use string literals and
> Unicode string literals, portably across platforms and without loss of
> information, by capitalizing on char8_t.

Maybe I'm missing it, but I don't see how what is proposed above helps capitalize on char8_t.

> They would standardize existing practice, match common practice in other
> languages (Go, Rust, Swift, Python), and avoid bugs related to loss of
> information when transcoding arbitrary Unicode data to legacy encodings
> able to represent only a very small subset of the characters defined by
> Unicode.

I'm not seeing the parallels here. I do agree with wanting to change source-to-execution character set transcoding problems into compile-time errors, though.
> It also makes it feasible to have Unicode identifiers in the future, as
> proposed by JF.

We actually have them now.
I started on a related paper for the pre-Cologne mailing, but didn't get it done in time. I hope to have it in enough of a state to present at least some of the content at the SG16 evening session in Cologne. One of the goals is to enable encodings to be determined on a per-source-file basis. Two options will be discussed:
Tom.
> Looking forward to discussing these ideas further,
> Corentin
_______________________________________________
SG16 Unicode mailing list
Unicode@isocpp.open-std.org
http://www.open-std.org/mailman/listinfo/unicode