On 9/9/19 10:31 AM, Tony V E wrote:


On Mon, Sep 9, 2019 at 3:31 AM Corentin <corentin.jabot@gmail.com> wrote:


On Mon, 9 Sep 2019 at 01:25, Tom Honermann <tom@honermann.net> wrote:

On Sep 8, 2019, at 3:31 PM, Tony V E via Lib <lib@lists.isocpp.org> wrote:

Do we have / could we have / should we have
a clear long term (20 years) direction for text in C++?

I would like that very much, but we don’t control the ecosystem, and will have to, to some degree, roll with where the community takes us. 

The community is waiting for us to catch up, and I do believe we have some control.

yep, every other language just decided for the community.

That is not correct.  Examples include C, Fortran, and COBOL.  In general, I think languages that decided for the community had a few advantages that we do not:

  1. Less history and legacy code to support.
  2. Fewer implementations.
  3. Designed with more abstractions (e.g., VM languages) that enabled sandboxing the language environment (with associated performance costs).
  4. Designed after Unicode was standardized.

As C++, we have to allow the user to do _anything_, but they already can.  And they will still be able to.
Indeed, but as a standard, one of our responsibilities is to produce a specification that reflects existing practice.  We can (and should) lead, but need to remain focused on support for existing code as well.  I worry about repeating the Python 2->3 experience if we aren't careful.


 


i.e., the long term direction is Unicode.
and/or specifically the long term direction is UTF8.

I think we do have widespread agreement on that, though UTF-16 is likely to remain strongly relevant in some niches. 

We expect everyone to use char8_t then?  Or we expect char to become utf8 someday?

I think it is very unlikely that there will be a mass migration to char8_t. My expectation is that it will be used for the internal encoding within some percentage of new projects and components. 

With regard to char, I expect it to remain the type used for text that may or may not be UTF-8.
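To illustrate what "may or may not be UTF-8" means for code that receives char data, here is a minimal sketch of a well-formedness check. The name is_well_formed_utf8 is hypothetical (it is not a standard library facility), and the byte ranges follow the Unicode well-formed UTF-8 table; C++17 is assumed for std::string_view.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not a standard facility: returns true if `s` is
// well-formed UTF-8 (rejects stray continuation bytes, overlong forms,
// surrogates, truncated sequences, and values above U+10FFFF).
inline bool is_well_formed_utf8(std::string_view s) {
    const auto byte = [&](std::size_t i) {
        return static_cast<unsigned char>(s[i]);
    };
    const auto in = [](unsigned char b, unsigned lo, unsigned hi) {
        return b >= lo && b <= hi;
    };

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = byte(i);
        if (b0 <= 0x7F) { ++i; continue; }      // ASCII

        std::size_t len = 0;                     // total sequence length
        unsigned lo = 0x80, hi = 0xBF;           // allowed range for 2nd byte
        if (in(b0, 0xC2, 0xDF))      { len = 2; }
        else if (b0 == 0xE0)         { len = 3; lo = 0xA0; }  // no overlongs
        else if (b0 == 0xED)         { len = 3; hi = 0x9F; }  // no surrogates
        else if (in(b0, 0xE1, 0xEF)) { len = 3; }
        else if (b0 == 0xF0)         { len = 4; lo = 0x90; }  // no overlongs
        else if (b0 == 0xF4)         { len = 4; hi = 0x8F; }  // <= U+10FFFF
        else if (in(b0, 0xF1, 0xF3)) { len = 4; }
        else return false;           // 0x80..0xC1 or 0xF5..0xFF as lead byte

        if (n - i < len) return false;                  // truncated sequence
        if (!in(byte(i + 1), lo, hi)) return false;     // constrained 2nd byte
        for (std::size_t k = 2; k < len; ++k) {
            if (!in(byte(i + k), 0x80, 0xBF)) return false;
        }
        i += len;
    }
    return true;
}
```

A check like this only says the bytes could be UTF-8; short text in other encodings can pass by coincidence, which is part of why char by itself carries no encoding guarantee.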

I think Microsoft will eventually provide (non-experimental) means to use UTF-8 with Win32 and that this will likely come in three forms 

1) support for UTF-8 as the system wide Active Code Page (ACP). This is already available as an experimental option. 

They di
 

2) support for executables to opt-in to a per-process override of the system wide ACP. In this mode, stdio would presumably traffic in the system wide ACP and require transcoding (I don’t think implicit transcoding is realistic). This is already available as an experimental option. 


They do

How do "override system wide ACP" and "stdio traffic in system wide ACP" fit together?  Either my process thinks the world is on the UTF8 ACP, or it doesn't.  I would expect transcoding or whatever else is required.  I would expect fopen to work, etc.
Basically, the option (a declaration in a manifest file) causes the Win32 "ANSI" APIs to work in UTF-8 mode for that process only.  Other processes on the system that don't opt in to the option run with whatever the system ACP is.  So, any information exchanged between them will require transcoding.  I would expect implicit transcoding for command line options and environment variables (those are already implicitly transcoded from their wide variants), but stdio is unaffected.  So, piped data between processes that both adhere to (their perception of) the ACP would require intervention.  But, stdio can be binary anyway.  And executables written in some other languages expect UTF-8 regardless, so I don't think this is a significant issue.
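As a rough illustration of the transcoding step for piped data, here is a minimal sketch that converts legacy single-byte text to UTF-8 before handing it to a UTF-8 process. It assumes ISO-8859-1 as the legacy encoding purely for simplicity; a real system ACP such as Windows-1252 needs a mapping table, and on Windows one would more likely go through the platform's conversion APIs. The name latin1_to_utf8 is hypothetical.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: transcodes ISO-8859-1 (assumed legacy encoding,
// for illustration only) to UTF-8. Each Latin-1 byte value equals its
// Unicode code point, so bytes >= 0x80 become a two-byte sequence.
inline std::string latin1_to_utf8(std::string_view in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte expands to two
    for (char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));                 // ASCII as-is
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));   // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F))); // continuation
        }
    }
    return out;
}
```

The point is only that the sender, the receiver, or something in between has to know both encodings; nothing in stdio does this implicitly.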

If that works, I believe almost every Windows developer will turn this on, and char will be utf8 (as it is on Linux, IIUC).
Most code will "just work".

Quite possibly.


In 10 years, it will be the assumption.
Representatives at Microsoft have so far stated that their testing of the UTF-8 ACP option revealed that it breaks too many widely deployed applications for them to make it a default at this point.  And their strong commitment to backward compatibility may invite a longer migration period.

I think we should steer in the direction that char becomes UTF8.
I agree, and that is what is already happening.

In the short term we could say char is whatever the system is in, but we encourage UTF8.  Or something like that.  Maybe the standard "assumes" UTF8, but implementations are allowed to vary.  Whatever "assumes" means for a given API.
I think that is the status quo.  We could add a non-normative note encouraging UTF-8, but I think it is highly unlikely that any greenfield project would pick anything else.
We could define things like fmt to be "if the system is UTF8, then behaviour is X, otherwise YMMV (i.e., implementation defined)".

We could.  But that makes the behavior locale dependent because, on most platforms, that is the reality.
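To make the locale dependence concrete, here is a minimal sketch of what "if the system is UTF8" could look like as a run-time query today. The function names are hypothetical. The POSIX branch uses nl_langinfo(CODESET) and so reflects whatever locale the program has selected; the Windows branch compares GetACP() against CP_UTF8.

```cpp
#include <cctype>
#include <string>
#include <string_view>
#ifdef _WIN32
#  include <windows.h>
#else
#  include <langinfo.h>
#endif

// Hypothetical helper: true if a codeset name spells UTF-8 in one of the
// common forms ("UTF-8", "utf8", "UTF_8", ...).
inline bool names_utf8(std::string_view codeset) {
    std::string norm;
    for (char ch : codeset) {
        if (ch == '-' || ch == '_') continue;  // ignore separators
        norm.push_back(static_cast<char>(
            std::toupper(static_cast<unsigned char>(ch))));
    }
    return norm == "UTF8";
}

// Hypothetical query: does the current environment use UTF-8 for char?
// On POSIX the answer depends on the locale currently in effect, so it
// stays tied to the default "C" locale until the program calls setlocale.
inline bool execution_environment_is_utf8() {
#ifdef _WIN32
    return GetACP() == CP_UTF8;               // active code page 65001
#else
    return names_utf8(nl_langinfo(CODESET));  // codeset of current locale
#endif
}
```

On POSIX systems a program would typically call std::setlocale(LC_ALL, "") before asking, and the answer can differ from one user or session to the next, which is exactly the locale dependence described above.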

Tom.



--
Be seeing you,
Tony