sg16: Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Zach Laine <whatwasthataddress_at_[hidden]>
Date: Sun, 8 Sep 2019 18:05:31 -0500

On Sun, Sep 8, 2019 at 3:00 PM Tom Honermann via Lib <lib_at_[hidden]>
wrote:

>
> On Sep 8, 2019, at 2:46 PM, Corentin via Lib <lib_at_[hidden]> wrote:
>
>
>
> On Sun, 8 Sep 2019 at 19:30, Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 9/8/19 12:40 PM, Corentin wrote:
>>
>>
>>
>> On Sun, 8 Sep 2019 at 18:12, Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 9/8/19 6:00 AM, Corentin via Lib wrote:
>>>
>>>
>>>
>>> On Sun, 8 Sep 2019 at 11:17, Corentin <corentin.jabot_at_[hidden]> wrote:
>>>
>>>>
>>>>
>>>> On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS) <bion_at_[hidden]>
>>>> wrote:
>>>>
>>>>> > I agree that EGCS is the best option. That doesn't drag locale
>>>>>
>>>>>
>>>>>
>>>>> Because we don’t get to assume that we’re talking about Unicode at
>>>>> all, it absolutely drags in locale.
>>>>>
>>>>
>>>> Sorry, I should have been more specific.
>>>> There is a non-tailored Unicode EGCS boundary algorithm (but it can be
>>>> tailored)
>>>> I didn't mean to imply that text manipulation can be done without
>>>> knowing its encoding and never use "locale" to mean encoding.
>>>>
>>>> EGCS are only defined for text whose character repertoire is Unicode,
>>>> other encodings deal with codepoints
>>>>
>>>
>>>
>>> To be clear, the difference of whether the EGC algorithm is required to
>>> be tailored or not is that tailoring for all intents and purposes requires
>>> ICU or something with CLDR, which restricts the platforms on which this
>>> can be implemented.
>>>
>>> Tailoring is not relevant to this discussion.
>>>
>> It is - see https://unicode.org/reports/tr29/ "ch" is 2 EGCS in most
>> locales but in Slovak it's 1. I don't make the rules :D
>>
>> It isn't relevant in determining how we resolve this issue. If the
>> resolution is that field widths are measured in EGCs, then we've already
>> decided that the width is locale dependent and tailoring becomes an
>> implementation detail.
>>
>
> No, format decided to be locale-independent (for good reason) and applying
> locale-specific behavior implicitly would be against that.
> I'm arguing for encoding-specific behavior.
>
>
> You seem to be missing the point that, for char and wchar_t, the encoding
> can’t be known (in general) without consulting the locale. Again, LANG=C vs
> LANG=C.UTF-8.
>
> Tom.
>

Tom, you seem to be missing the point that std::format does no such
consultation! It is locale-agnostic. It is assumed to be char-based, not
Windows-1252, not UTF-8, not even ASCII.

This means that defining width as a count of code units (CUs) is the de
facto status quo. I'm suggesting that later on, we pull a fast one and
specify that we meant that it should have been UTF-8-based instead of
char-based. This may mean that we need to add a char8_t overload, or it may
be palatable to just change the current interface's contract. I assume the
former will be necessary, since people tend to hate silent contract changes
(with good reason).

So, if we do nothing, we get what you want. If we *specify* that CUs are
the width, we color the future debate about the Unicode-aware version in a
Unicode-unfriendly direction.
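
To make the difference concrete, here is a rough sketch of two of the
candidate width measures. The function names are made up for illustration
and are not part of std::format or any proposal, and the code-point count
assumes the input is well-formed UTF-8, which is exactly the assumption a
char-based interface cannot make today.

```cpp
#include <cstddef>
#include <string_view>

// Width in code units: just the number of char elements, with no
// knowledge of the encoding. This is the de facto char-based measure.
constexpr std::size_t width_in_code_units(std::string_view s) {
    return s.size();
}

// Width in code points, ASSUMING s holds well-formed UTF-8: count every
// byte that is not a continuation byte (bit pattern 10xxxxxx).
constexpr std::size_t width_in_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// "café" with a precomposed U+00E9: 5 code units, 4 code points.
static_assert(width_in_code_units("caf\xC3\xA9") == 5);
static_assert(width_in_code_points("caf\xC3\xA9") == 4);

// "e" followed by U+0301 COMBINING ACUTE ACCENT: 3 code units and
// 2 code points, although it is a single EGC.
static_assert(width_in_code_units("e\xCC\x81") == 3);
static_assert(width_in_code_points("e\xCC\x81") == 2);
```

Neither function counts EGCs; that needs the segmentation rules and data
tables from UAX #29, which is why it is left out of the sketch.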

Zach

Received on 2019-09-09 01:05:43