sg16: Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 9 Sep 2019 12:25:37 -0400

On 9/9/19 10:31 AM, Tony V E wrote:
>
>
> On Mon, Sep 9, 2019 at 3:31 AM Corentin <corentin.jabot_at_[hidden]
> <mailto:corentin.jabot_at_[hidden]>> wrote:
>
>
>
> On Mon, 9 Sep 2019 at 01:25, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
>
> On Sep 8, 2019, at 3:31 PM, Tony V E via Lib
> <lib_at_[hidden] <mailto:lib_at_[hidden]>> wrote:
>
>> Do we have / could we have / should we have
>> a clear long term (20 years) direction for text in C++?
>
> I would like that very much, but we don’t control the
> ecosystem, and will have to, to some degree, roll with where
> the community takes us.
>
>
> The community is waiting for us to catch up, and I do believe we
> have some control.
>
>
> yep, every other language just decided for the community.

That is not correct; counterexamples include C, Fortran, and COBOL. In
general, I think the languages that did decide for the community had a
few advantages that we do not:

1. Less history and legacy code to support.
2. Fewer implementations.
3. Designed with more abstractions (e.g., VM languages) that enabled
sandboxing the language environment (with associated performance costs).
4. Designed after Unicode was standardized.

>
> As C++, we have to allow the user to do _anything_, but they already
> can. And they will still be able to.
Indeed, but as a standard, one of our responsibilities is to produce a
specification that reflects existing practice. We can (and should)
lead, but need to remain focused on support for existing code as well.
I worry about repeating the Python 2->3 experience if we aren't careful.
>
>
>
>>
>> ie the long term direction is unicode.
>> and/or specifically the long term direction is UTF8.
>
> I think we do have widespread agreement on that, though
> UTF-16 is likely to remain strongly relevant in some niches.
>
>> We expect everyone to use char8_t then? Or we expect char to
>> become utf8 someday?
>
> I think it is very unlikely that there will be a mass
> migration to char8_t. My expectation is that it will be used
> for the internal encoding within some percentage of new
> projects and components.
>
> With regard to char, I expect it to remain the type used for
> text that may or may not be UTF-8.
>
> I think Microsoft will eventually provide (non-experimental)
> means to use UTF-8 with Win32 and that this will likely come
> in three forms
>
>
> 1) support for UTF-8 as the system wide Active Code Page
> (ACP). This is already available as an experimental option.
>
>
> They di
>
>
> 2) support for executables to opt-in to a per-process override
> of the system wide ACP. In this mode, stdio would presumably
> traffic in the system wide ACP and require transcoding (I
> don’t think implicit transcoding is realistic). This is
> already available as an experimental option.
>
>
>
> They do
>
>
> How do "override system wide ACP" and "stdio traffic in system wide
> ACP" fit together? Either my process thinks the world is on the UTF8
> ACP, or it doesn't. I would expect transcoding or whatever else is
> required. I would expect fopen to work, etc.
Basically, the option (a declaration in a manifest file) causes the
Win32 "ANSI" APIs to work in UTF-8 mode for that process only. Other
processes on the system that don't opt-in to the option run with
whatever the system ACP is. So, any information exchanged between them
will require transcoding. I would expect implicit transcoding for
command line options and environment variables (those are already
implicitly transcoded from their wide variants), but stdio is
unaffected. So, piped data between processes that both adhere to (their
perception of) the ACP would require intervention. But, stdio can be
binary anyway. And executables written in some other languages expect
UTF-8 regardless, so I don't think this is a significant issue.
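
To make the transcoding point concrete, here is a minimal sketch of what a
UTF-8 process might do with bytes received from a process running under a
legacy code page. It assumes, purely for illustration, that the legacy
encoding is ISO-8859-1, whose bytes map one-to-one onto U+0000..U+00FF; a
real Windows ACP such as Windows-1252 would need a lookup table for
0x80..0x9F. The name latin1_to_utf8 is hypothetical and not part of any
Win32 or standard library API.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: transcode ISO-8859-1 bytes to UTF-8.
// Assumption: the input really is ISO-8859-1, not Windows-1252.
std::string latin1_to_utf8(std::string_view in) {
    std::string out;
    out.reserve(in.size() * 2);  // each input byte yields at most 2 bytes
    for (char c : in) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            // ASCII is encoded identically in both encodings.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF becomes a two-byte sequence: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    assert(out.size() >= in.size());  // transcoding never shrinks the text
    return out;
}
```

Under these assumptions, latin1_to_utf8("caf\xE9") yields the five bytes
"caf\xC3\xA9". Nothing like this happens implicitly for piped stdio data,
which is why the two processes would need to agree on an encoding or
transcode explicitly.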
>
> If that works, I believe almost every Windows developer will turn this
> on, and char will be utf8 (as it is on linux, IIUC).
> Most code will "just work".

Quite possibly.

>
> In 10 years, it will be the assumption.
Representatives at Microsoft have so far stated that their testing of
the UTF-8 ACP option revealed that it breaks too many widely deployed
applications for them to make it a default at this point. And their
strong commitment to backward compatibility may invite a longer
migration period.
>
> I think we should steer in the direction that char becomes UTF8.
I agree, and that is what is already happening.
>
> In the short term we could say char is whatever the system is in, but
> we encourage UTF8. Or something like that. Maybe the standard
> "assumes" UTF8, but implementations are allowed to vary. Whatever
> "assumes" means for a given API.
I think that is the status quo. We could add a non-normative note
encouraging UTF-8, but I think it is highly unlikely that any greenfield
project would pick anything else.
> We could define things like fmt to be "if the system is UTF8, then
> behaviour is X, otherwise YMMV (ie implementation defined)".

We could. But that makes the behavior locale dependent because, on most
platforms, that is the reality.
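
As a rough sketch of what "if the system is UTF-8, then X, otherwise
implementation defined" could look like for a field width, consider the
following. The Encoding enum and the function names are hypothetical, the
input is assumed to be well-formed UTF-8 when the UTF-8 branch is taken, and
counting code points is only one of the candidate answers to the question in
the subject line; this is not what std::format is specified to do.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical tag for the encoding the runtime believes char text is in.
enum class Encoding { Utf8, Other };

// Width measured in code units: simply the number of char elements.
std::size_t count_code_units(std::string_view s) { return s.size(); }

// Width measured in code points, assuming well-formed UTF-8:
// every byte that is not a continuation byte (10xxxxxx) starts a code point.
std::size_t count_code_points_utf8(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
    }
    assert(n <= s.size());  // never more code points than code units
    return n;
}

// "If the system is UTF-8, behaviour is X, otherwise YMMV": the fallback
// here is code units, standing in for an implementation-defined choice.
std::size_t field_width(std::string_view s, Encoding enc) {
    return enc == Encoding::Utf8 ? count_code_points_utf8(s)
                                 : count_code_units(s);
}
```

For the five-byte string "caf\xC3\xA9", field_width reports 4 under
Encoding::Utf8 and 5 under Encoding::Other, so the same call pads
differently depending on what the runtime believes the encoding is. That
is the locale dependence I am referring to.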

Tom.

>
>
> --
> Be seeing you,
> Tony

Received on 2019-09-09 18:25:41