sg16: Re: [SG16-Unicode] UTF I/O : Source file and encoding

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 6 Jul 2019 23:04:30 -0400

On 6/23/19 5:17 AM, Corentin wrote:
> Let's further assume
>
> * Unicode will not be replaced and superseded in our lifetime
>
Predictions have a spectacular rate of failure, but in this case, I
would not bet against you :)
>
> * Unicode is the only character set to be able to handle text for it
> is the only encoding that is a strict superset of all previously
> existing encodings and strives to encompass all characters used by
> people
>
Sure.
>
> * A poor Unicode support at the language and system levels has led
> to most developers having a poor understanding of text and
> encoding (regardless of skill level)
>
Poor Unicode support *might* be a contributing factor, but is hardly the
only one. Sometimes, poor support is a motivating factor for getting
educated.
>
> * In many cases developers and engineers only care about sequences
> of bytes they can pronounce rather than text, and in these cases
> ASCII is sufficient
>
Likewise, EBCDIC is sufficient on some systems.
>
> * Systems and compiler vendors have a vested interest in supporting
> Unicode which is already the most used encoding in user-facing
> systems https://en.wikipedia.org/wiki/Unicode#Adoption
>
Not necessarily. Some may take the view that Unicode support is an
external library problem.

> I propose that we work towards making Unicode the only supported
> _source_ character set - I realize this might take time as far from
> all source files are encoded in a UTF encoding, however Unicode is
> designed to make that possible.
> This is also standard practice and both GCC and Clang will assume a
> UTF-8 encoding
I don't see a need for the standard to impose a single supported source
character set; I think it is better to let the market drive this. If a
convergence occurs, that would be the appropriate time for the standard
to reflect it. It is appropriate for the standard to lead in some
cases, but in this case, there is considerable history that prohibits a
wholesale migration. I agree with the general sentiment though; I do
want to encourage and make migration easier.
>
> In the meantime, I propose that:
>
> * Source with characters outside of the basic source character set
> embedded in a utf-8, utf-16 or utf-32 character or string literal
> are ill-formed *if and only if* the source/input character set is
> not Unicode.
>
That would break existing code for, in my opinion, little gain. I think
the more pressing concern is a means of determining which encoding to
use when interpreting a source file.
>
> * We put out a guideline recommending for source files to be utf-8
> encoded
>
Are you suggesting a standing document? I don't see much benefit in
doing so.
>
> * We put in the standard that compilers should assume utf8-encoded
> files as the default input encoding unless there is
> implementation-defined out-of-band information (which would have
> no practical impact, but to signal we recommend supporting utf-8)
>
This would again break existing code. And the out-of-band information
used today by the Microsoft compiler is the current locale (active code
page).
>
> * We deprecate string and wide character and string literals (char,
> wchar_t) whose source representation contains characters not
> representable in the basic execution character set or
> wide execution character set respectively. We encourage
> implementers to emit a warning in these cases - the intent is to
> avoid loss of information when transcoding to the execution
> character set - This matches existing practice
>
This is not existing practice in my experience, and I'm not sure what
techniques you have in mind for such encouragement. I think I can get
on board with such deprecation for the non-basic [wide] (presumed)
execution character set. E.g., I'd like to change the
implementation-defined and conditionally-supported behavior in these
clauses such that the program becomes ill-formed:

  * [lex.phases]p1.5 (http://eel.is/c++draft/lex.phases#1.5)
  * [lex.ccon]p2 (http://eel.is/c++draft/lex.ccon#2)
  * [lex.ccon]p6 (http://eel.is/c++draft/lex.ccon#6)

>
> The proposed changes hope to make it easier to use string literals and
> Unicode string literals without loss of information portably across
> platforms by capitalizing on char8_t
Maybe I'm missing it, but I don't see how what is proposed above helps
capitalize on char8_t.
> They would standardize existing practice, match common practice in
> other languages (Go, Rust, Swift, Python) and avoid bugs related to
> loss of information when transcoding arbitrary Unicode data to legacy
> encodings able to represent only a very small number of the characters
> defined by Unicode.
I'm not seeing the parallels here. I do agree with wanting to change
source-to-execution character set transcoding problems into compile-time
errors though.
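
To make the "compile-time error" idea concrete, here is a rough user-level sketch of the kind of check I would like the compiler to perform itself. It is not a proposal and not an existing facility: the helper name representable_in_ascii is made up for illustration, and ASCII merely stands in for whatever the narrow execution character set can actually represent.

```cpp
#include <cstddef>

// Hypothetical helper, not a standard facility. Returns true when every
// code unit of a UTF-8 literal is below 0x80, i.e. the text is ASCII-only.
// Assumption: ASCII stands in for the set of characters that the narrow
// execution character set can represent without loss.
template <typename CharT, std::size_t N>
constexpr bool representable_in_ascii(const CharT (&literal)[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        if (static_cast<unsigned char>(literal[i]) >= 0x80) {
            return false;  // part of a multi-byte UTF-8 sequence
        }
    }
    return true;
}

// The check runs during translation, so a violation stops the build
// instead of silently substituting a character.
static_assert(representable_in_ascii(u8"plain text"),
              "ASCII-only literal is accepted");
static_assert(!representable_in_ascii(u8"caf\u00e9"),
              "U+00E9 encodes to bytes >= 0x80 and is flagged");
```

The template parameter CharT lets the same sketch compile whether u8 literals are arrays of char (C++17) or char8_t (C++20). A real implementation would of course consult the actual execution character set rather than assume ASCII.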
>
> It also makes it feasible to have Unicode identifiers in the future, as
> proposed by JF

We actually have them now.

  * [lex.name]p1 (http://eel.is/c++draft/lex.name#1)

I started on a related paper for the pre-Cologne mailing, but didn't get
it done in time. I hope to have it far enough along to present at
least some of the content at the SG16 evening session in Cologne. One
of the goals is to enable encodings to be determined on a
per-source-file basis. Two options will be discussed:

1. Specifying that implementations check for a Unicode BOM. BOMs
    aren't particularly popular, but are sufficiently well supported.
2. Specifying a #pragma directive that indicates the encoding. This is
    similar to features in Python and the HTML spec.
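
For the first option, a minimal sketch of what a BOM check might look like is below. The names SourceEncoding, starts_with and detect_bom are hypothetical and not taken from any compiler or from the paper; the byte sequences are the standard Unicode BOM encodings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration only.
enum class SourceEncoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Returns true when `bytes` begins with the `n` bytes pointed to by `bom`.
// The explicit length is needed because some BOMs contain zero bytes.
inline bool starts_with(const std::string& bytes, const char* bom, std::size_t n) {
    return bytes.size() >= n && bytes.compare(0, n, bom, n) == 0;
}

// Inspects the first bytes of a source file for a Unicode BOM.
inline SourceEncoding detect_bom(const std::string& bytes) {
    // The UTF-32 checks come first: the UTF-32LE BOM (FF FE 00 00)
    // begins with the UTF-16LE BOM (FF FE).
    if (starts_with(bytes, "\x00\x00\xFE\xFF", 4)) return SourceEncoding::Utf32BE;
    if (starts_with(bytes, "\xFF\xFE\x00\x00", 4)) return SourceEncoding::Utf32LE;
    if (starts_with(bytes, "\xEF\xBB\xBF", 3))     return SourceEncoding::Utf8;
    if (starts_with(bytes, "\xFE\xFF", 2))         return SourceEncoding::Utf16BE;
    if (starts_with(bytes, "\xFF\xFE", 2))         return SourceEncoding::Utf16LE;
    // No BOM: fall back to whatever the implementation does today
    // (command line option, locale, and so on).
    return SourceEncoding::Unknown;
}
```

One known wrinkle with this ordering: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE. That should not occur in source files, but it is the sort of detail the specification would need to address.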

Tom.

>
> Looking forward to discussing these ideas further,
> Corentin

Received on 2019-07-07 05:04:34