On 6/23/19 5:17 AM, Corentin wrote:
Let's further assume
  • Unicode will not be replaced and superseded in our lifetime
Predictions have a spectacular rate of failure, but in this case, I would not bet against you :)
  • Unicode is the only character set able to handle all text, for it is the only encoding that is a strict superset of all previously existing encodings and strives to encompass all characters used by people
Sure.
  • A poor Unicode support at the language and system levels has led to most developers having a poor understanding of text and encoding (regardless of skill level)
Poor Unicode support *might* be a contributing factor, but is hardly the only one.  Sometimes, poor support is a motivating factor for getting educated.
  • In many cases developers and engineers only care about sequences of bytes they can pronounce rather than text, and in these cases ASCII is sufficient
Likewise, EBCDIC is sufficient on some systems.

Not necessarily.  Some may take the view that Unicode support is an external library problem.

I propose that we work towards making Unicode the only supported _source_ character set - I realize this might take time, as far from all source files are encoded in a UTF encoding; however, Unicode is designed to make that possible.
This is also standard practice: both GCC and Clang will assume a UTF-8 encoding by default.
I don't see a need for the standard to impose a single supported source character set; I think it is better to let the market drive this.  If a convergence occurs, that would be the appropriate time for the standard to reflect it.  It is appropriate for the standard to lead in some cases, but in this case, there is considerable history that prohibits a wholesale migration.  I agree with the general sentiment though; I do want to encourage and make migration easier.

In the meantime, I propose that:
  • Source containing characters outside of the basic source character set embedded in a UTF-8, UTF-16, or UTF-32 character or string literal is ill-formed if and only if the source/input character set is not Unicode.
That would break existing code for, in my opinion, little gain.  I think the more pressing concern is means to determine what encoding to interpret a source file as.
  • We put out a guideline recommending that source files be UTF-8 encoded
Are you suggesting a standing document?  I don't see much benefit in doing so.
  • We put in the standard that compilers should assume UTF-8-encoded files as the default input encoding in the absence of implementation-defined out-of-band information (which would have no practical impact, but would signal that we recommend supporting UTF-8)
This would again break existing code.  And the out-of-band information used today by the Microsoft compiler is the current locale (active code page).
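For reference, and assuming current compiler behavior, the out-of-band overrides available today look roughly like this (Clang simply assumes UTF-8):

    cl /source-charset:utf-8 foo.cpp      (MSVC: override the active code page)
    g++ -finput-charset=UTF-8 foo.cpp     (GCC: UTF-8 is already the default, modulo locale)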
  • We deprecate character and string literals, and wide character and string literals (char, wchar_t), whose source representation contains characters not representable in the basic execution character set or wide execution character set respectively. We encourage implementers to emit a warning in these cases; the intent is to avoid loss of information when transcoding to the execution character set. This matches existing practice

This is not existing practice in my experience, and I'm not sure what techniques you have in mind for such encouragement.  I think I can get on board with such deprecation for the non-basic [wide] (presumed) execution character set.  E.g., I'd like to change the implementation-defined and conditionally-supported behavior in the relevant clauses such that the program becomes ill-formed.
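For concreteness, a rough sketch of the kind of code that would be affected (illustrative only; it assumes an implementation whose narrow and wide execution character sets cannot encode U+00E9):

    const char    *a = "caf\u00e9";  // today: conditionally-supported with
                                     // implementation-defined behavior;
                                     // under the change above: ill-formed
    const wchar_t *b = L"caf\u00e9"; // likewise, if the wide execution
                                     // character set cannot encode it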


The proposed changes hope to make it easier to use string literals and Unicode string literals, without loss of information, portably across platforms by capitalizing on char8_t
Maybe I'm missing it, but I don't see how what is proposed above helps capitalize on char8_t.
They would standardize existing practice, match common practice in other languages (Go, Rust, Swift, Python), and avoid bugs related to loss of information when transcoding arbitrary Unicode data to legacy encodings able to represent only a very small subset of the characters defined by Unicode.
I'm not seeing the parallels here.  I do agree with wanting to change source-to-execution character set transcoding problems into compile-time errors though.

It also makes it feasible to have Unicode identifiers in the future, as proposed by JF.

We actually have them now.
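For example, the grammar already permits universal-character-names in identifiers (see [lex.name]), though implementation support for writing such characters directly in the chosen source encoding varies:

    int caf\u00e9 = 0;   // a valid identifier per [lex.name]; é spelled with a
                         // universal-character-name
    int café = 0;        // the same identifier spelled directly, where the
                         // implementation accepts the source encoding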

I started on a related paper for the pre-Cologne mailing, but didn't get it done in time.  I hope to have it in enough of a state to present at least some of the content at the SG16 evening session in Cologne.  One of the goals is to enable encodings to be determined on a per-source-file basis.  Two options will be discussed:

  1. Specifying that implementations check for a Unicode BOM.  BOMs aren't particularly popular, but are sufficiently well supported.
  2. Specifying a #pragma directive that indicates the encoding.  This is similar to features in Python and the HTML spec.
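For the second option, think of Python's PEP 263 coding declaration (# -*- coding: utf-8 -*-) or HTML's <meta charset=...>.  A C++ analogue might look something like this; the syntax is purely illustrative and not necessarily what the paper will end up proposing:

    #pragma encoding("utf-8")   // hypothetical per-file directive; would need to
                                // appear before any non-basic characters so the
                                // compiler can switch decoders early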

Tom.


Looking forward to discussing these ideas further,
Corentin

_______________________________________________
SG16 Unicode mailing list
Unicode@isocpp.open-std.org
http://www.open-std.org/mailman/listinfo/unicode