Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 8 Sep 2019 22:34:41 -0400
On 9/8/19 7:05 PM, Zach Laine wrote:
> On Sun, Sep 8, 2019 at 3:00 PM Tom Honermann via Lib
> <lib_at_[hidden] <mailto:lib_at_[hidden]>> wrote:
>
>
> On Sep 8, 2019, at 2:46 PM, Corentin via Lib <lib_at_[hidden]
> <mailto:lib_at_[hidden]>> wrote:
>
>>
>>
>> On Sun, 8 Sep 2019 at 19:30, Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>>
>> On 9/8/19 12:40 PM, Corentin wrote:
>>>
>>>
>>> On Sun, 8 Sep 2019 at 18:12, Tom Honermann
>>> <tom_at_[hidden] <mailto:tom_at_[hidden]>> wrote:
>>>
>>> On 9/8/19 6:00 AM, Corentin via Lib wrote:
>>>>
>>>>
>>>> On Sun, 8 Sep 2019 at 11:17, Corentin
>>>> <corentin.jabot_at_[hidden]
>>>> <mailto:corentin.jabot_at_[hidden]>> wrote:
>>>>
>>>>
>>>>
>>>> On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS)
>>>> <bion_at_[hidden] <mailto:bion_at_[hidden]>> wrote:
>>>>
>>>> > I agree that EGCS is the best option. That
>>>> doesn't drag locale
>>>>
>>>> Because we don’t get to assume that we’re
>>>> talking about Unicode at all, it absolutely
>>>> drags in locale.
>>>>
>>>>
>>>> Sorry, I should have been more specific.
>>>> There is a non-tailored Unicode EGCS boundary
>>>> algorithm (but it can be tailored)
>>>> I didn't mean to imply that text manipulation can
>>>> be done without knowing its encoding and never use
>>>> "locale" to mean encoding.
>>>>
>>>> EGCS are only defined for text whose character
>>>> repertoire is Unicode, other encodings deal with
>>>> codepoints
>>>>
>>>>
>>>>
>>>> To be clear, the difference of whether the EGC
>>>> algorithm is required to be tailored or not is that
>>>> tailoring for all intents and purposes requires
>>>> ICU or something with CLDR, which restricts the
>>>> platforms on which this can be implemented
>>>
>>> Tailoring is not relevant to this discussion.
>>>
>>> It is - see https://unicode.org/reports/tr29/ "ch" is 2 EGCS
>>> in most locales but in Slovak it's 1. I don't make the rules :D
>> It isn't relevant in determining how we resolve this issue.
>> If the resolution is that field widths are measured in EGCs,
>> then we've already decided that the width is locale dependent
>> and tailoring becomes an implementation detail.
>>
>>
>> No, format decided to be locale-independent (for good reason) and
>> applying locale specific behavior implicitly would be against that.
>> I'm arguing for encoding specific behavior
>
> You seem to be missing the point that, for char and wchar_t, the
> encoding can’t be known (in general) without consulting the
> locale. Again, LANG=C vs LANG=C.UTF-8.
>
> Tom.
>
>
> Tom, you seem to be missing the point that std::format does no such
> consultation! It is locale-agnostic. It is assumed to be char-based,
> not Windows 1252, not UTF-8, not even ASCII.
That is exactly my point! And why my proposed resolution was to specify
width in terms of code units.
>
> This means that the definition of width as being a CU is the de facto
> status quo. I'm suggesting that later on, we pull a fast one and
> specify that we meant that it should have been UTF-8-based instead of
> char-based. This may mean that we need to add a char8_t overload, or
> it may be palatable to just change the current interface's contract. I
> assume the former will be necessary, since people tend to hate silent
> contract changes (with good reason).

Victor's fmtlib implementation already effectively does what you
suggest. See
https://github.com/fmtlib/fmt/commit/38325248e5310ddbea41390974e496e8495f7324.

I think this isn't a good state to be in though. If the current locale
has a UTF-8 encoding, I would be disappointed if the following two calls
produced different string contents:

std::format(  "{:3}",   "\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
std::format(u8"{:3}", u8"\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }

If the width is code units for the char based overload and EGCs for the
char8_t based one, then the first will produce "\xC3\x81\x20" (one
inserted space) and the second "\xC3\x81\x20\x20" (two inserted
spaces). I think users would find that surprising.
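
To make the arithmetic concrete, here is a minimal sketch of the two padding rules. It does not use std::format; pad_right, count_code_units, and count_code_points are hypothetical helpers invented for illustration. Counting code points stands in for counting EGCs only because U+00C1 in precomposed form is a single code point and a single EGC; a real EGC count would need the UAX #29 segmentation rules.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: width measured in code units (bytes for char).
std::size_t count_code_units(const std::string& s) {
    return s.size();
}

// Hypothetical helper: width measured in UTF-8 code points. Counts every
// byte that is not a continuation byte (10xxxxxx). Assumes well-formed
// UTF-8; this is only an approximation of an EGC count.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Hypothetical helper: left-align s in a field of `width` units, given
// how many units s was measured to occupy, padding with spaces.
std::string pad_right(const std::string& s, std::size_t measured,
                      std::size_t width) {
    std::string out = s;
    if (measured < width) {
        out.append(width - measured, ' ');
    }
    return out;
}
```

With s = "\xC3\x81", pad_right(s, count_code_units(s), 3) appends one space, while pad_right(s, count_code_points(s), 3) appends two, which is the divergence described above.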

>
> So, if we do nothing, we get what you want. If we *specify* that CUs
> are the width, we color the future debate about the Unicode-aware
> version in a Unicode-unfriendly direction.

If we do nothing, we are in the situation where different implementors
may do different things.

My preferred direction for exploration is a future extension that
enables opt-in to field widths that are encoding dependent (and
therefore locale dependent for char and wchar_t). For example (using
'L' appended to the width; 'L' doesn't conflict with the existing type
options):

std::format("{:3L}", "\xC3\x81"); // produces "\xC3\x81\x20\x20"; 3 EGCs.

But again, I'm far from convinced that this is actually useful since
EGCs don't suffice to ensure an aligned result anyway as nicely
described in Henri's post (https://hsivonen.fi/string-length).

Tom.

>
> Zach
>


Received on 2019-09-09 04:34:46