Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 9 Sep 2019 09:26:36 +0200
On Mon, Sep 9, 2019, 4:34 AM Tom Honermann <tom_at_[hidden]> wrote:

> On 9/8/19 7:05 PM, Zach Laine wrote:
>
> On Sun, Sep 8, 2019 at 3:00 PM Tom Honermann via Lib <lib_at_[hidden]>
> wrote:
>
>>
>> On Sep 8, 2019, at 2:46 PM, Corentin via Lib <lib_at_[hidden]>
>> wrote:
>>
>>
>>
>> On Sun, 8 Sep 2019 at 19:30, Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 9/8/19 12:40 PM, Corentin wrote:
>>>
>>>
>>>
>>> On Sun, 8 Sep 2019 at 18:12, Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>>> On 9/8/19 6:00 AM, Corentin via Lib wrote:
>>>>
>>>>
>>>>
>>>> On Sun, 8 Sep 2019 at 11:17, Corentin <corentin.jabot_at_[hidden]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS) <
>>>>> bion_at_[hidden]> wrote:
>>>>>
>>>>>> > I agree that EGCS is the best option. That doesn't drag locale
>>>>>>
>>>>>>
>>>>>>
>>>>>> Because we don’t get to assume that we’re talking about Unicode at
>>>>>> all, it absolutely drags in locale.
>>>>>>
>>>>>
>>>>> Sorry, I should have been more specific.
>>>>> There is a non-tailored Unicode EGCS boundary algorithm (but it can be
>>>>> tailored).
>>>>> I didn't mean to imply that text manipulation can be done without
>>>>> knowing its encoding, and I never use "locale" to mean encoding.
>>>>>
>>>>> EGCS are only defined for text whose character repertoire is Unicode;
>>>>> other encodings deal with code points.
>>>>>
>>>>
>>>>
>>>> To be clear, whether the EGC algorithm is required to be tailored
>>>> matters because tailoring, for all intents and purposes, requires ICU
>>>> or something else with CLDR, which restricts the platforms on which
>>>> this can be implemented.
>>>>
>>>> Tailoring is not relevant to this discussion.
>>>>
>>> It is - see https://unicode.org/reports/tr29/ "ch" is 2 EGCS in most
>>> locales but in Slovak it's 1. I don't make the rules :D
>>>
>>> It isn't relevant in determining how we resolve this issue. If the
>>> resolution is that field widths are measured in EGCs, then we've already
>>> decided that the width is locale dependent and tailoring becomes an
>>> implementation detail.
>>>
>>
>> No, format decided to be locale-independent (for good reason) and
>> applying locale specific behavior implicitly would be against that.
>> I'm arguing for encoding-specific behavior.
>>
>>
>> You seem to be missing the point that, for char and wchar_t, the encoding
>> can’t be known (in general) without consulting the locale. Again, LANG=C vs
>> LANG=C.UTF-8.
>>
>> Tom.
>>
>
> Tom, you seem to be missing the point that std::format does no such
> consultation! It is locale-agnostic. It is assumed to be char-based, not
> Windows 1252, not UTF-8, not even ASCII.
>
> That is exactly my point! And why my proposed resolution was to specify
> width in terms of code units.
>
>
> This means that the definition of width as being a CU is the de facto
> status quo. I'm suggesting that later on, we pull a fast one and specify
> that we meant that it should have been UTF-8-based instead of char-based.
> This may mean that we need to add a char8_t overload, or it may be
> palatable to just change the current interface's contract. I assume the
> former will be necessary, since people tend to hate silent contract changes
> (with good reason).
>
> Victor's fmtlib implementation already effectively does what you suggest.
> See
> https://github.com/fmtlib/fmt/commit/38325248e5310ddbea41390974e496e8495f7324
> .
>
> I think this isn't a good state to be in though. If the current locale
> has a UTF-8 encoding, I would be disappointed if the following two calls
> produced different string contents:
>
> std::format( "{:3}", "\xC3\x81");   // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
> std::format(u8"{:3}", u8"\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
>
> If the width is code units for the char based overload and EGCs for the
> char8_t based one, then the first will produce "\xC3\x81\x20" (one inserted
> space) and the second "\xC3\x81\x20\x20" (two inserted spaces). I think
> users would find that surprising.
>
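
To make the difference in the quoted example concrete, here is a minimal
sketch that does not use std::format at all. The helper names
(count_code_points, pad_right, pad_by_code_units, pad_by_code_points) are
hypothetical and exist only for illustration. Counting code points is just a
stand-in for real EGC segmentation: it happens to agree for U+00C1, which is
one code point and one EGC, but it would not in general. The sketch also
assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: count UTF-8 code points by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes well-formed UTF-8.
std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Hypothetical helper: append spaces until `measured` reaches `width`.
// `measured` is whatever notion of width the caller has chosen for `s`.
std::string pad_right(std::string_view s, std::size_t width,
                      std::size_t measured) {
    std::string out(s);
    if (measured < width) {
        out.append(width - measured, ' ');
    }
    return out;
}

// Width measured in code units (bytes for char-based strings).
std::string pad_by_code_units(std::string_view s, std::size_t width) {
    return pad_right(s, width, s.size());
}

// Width measured in code points (an approximation of EGCs, see above).
std::string pad_by_code_points(std::string_view s, std::size_t width) {
    return pad_right(s, width, count_code_points(s));
}
```

With "\xC3\x81" and a width of 3, the code-unit version sees two units and
adds one space, while the code-point version sees one and adds two, which is
the discrepancy described in the quoted text.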

I think we are going there - we will have to if we take the code units
route.
It matches a discussion I recall we had, probably at Kona: at the moment
fmt is more of a bytes-formatting library, with the expectation that the u8
overload would format text.

So, if we do nothing, we get what you want. If we *specify* that CUs are
> the width, we color the future debate about the Unicode-aware version in a
> Unicode-unfriendly direction.
>
> +1


> If we do nothing, we are in the situation where different implementors may
> do different things
>
My preferred direction for exploration is a future extension that enables
> opt-in to field widths that are encoding dependent (and therefore locale
> dependent for char and wchar_t). For example (using 'L' appended to the
> width; 'L' doesn't conflict with the existing type options):
>
> std::format("{:3L}", "\xC3\x81"); // produces "\xC3\x81\x20\x20"; 3 EGCs.
>
std::format("{:3L}", "ch"); what does that produce?
Locale specifiers should only affect region-specific rules, not whether
something is interpreted as bytes or not.

> But again, I'm far from convinced that this is actually useful since EGCs
> don't suffice to ensure an aligned result anyway as nicely described in
> Henri's post (https://hsivonen.fi/string-length).
>
Agreed, but I think you know that code units are the least useful option in
this case, and I am concerned about choosing a bad option to make a fix easy.


> Tom.
>
>
> Zach
>
>
>

Received on 2019-09-09 09:26:50