sg16: Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sun, 8 Sep 2019 20:46:41 +0200

On Sun, 8 Sep 2019 at 19:30, Tom Honermann <tom_at_[hidden]> wrote:

> On 9/8/19 12:40 PM, Corentin wrote:
>
>
>
> On Sun, 8 Sep 2019 at 18:12, Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 9/8/19 6:00 AM, Corentin via Lib wrote:
>>
>>
>>
>> On Sun, 8 Sep 2019 at 11:17, Corentin <corentin.jabot_at_[hidden]> wrote:
>>
>>>
>>>
>>> On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS) <bion_at_[hidden]>
>>> wrote:
>>>
>>>> > I agree that EGCS is the best option. That doesn't drag locale
>>>>
>>>>
>>>>
>>>> Because we don’t get to assume that we’re talking about Unicode at all,
>>>> it absolutely drags in locale.
>>>>
>>>
>>> Sorry, I should have been more specific.
>>> There is a non-tailored Unicode EGCS boundary algorithm (but it can be
>>> tailored).
>>> I didn't mean to imply that text manipulation can be done without
>>> knowing its encoding, and I never use "locale" to mean encoding.
>>>
>>> EGCS are only defined for text whose character repertoire is Unicode;
>>> other encodings deal with codepoints.
>>>
>>
>>
>> To be clear, the difference between requiring the EGC algorithm to be
>> tailored or not is that tailoring, for all intents and purposes, requires
>> ICU or something else with CLDR data, which restricts the platforms on
>> which this can be implemented.
>>
>> Tailoring is not relevant to this discussion.
>>
> It is - see https://unicode.org/reports/tr29/ "ch" is 2 EGCS in most
> locales but in Slovak it's 1. I don't make the rules :D
>
> It isn't relevant in determining how we resolve this issue. If the
> resolution is that field widths are measured in EGCs, then we've already
> decided that the width is locale dependent and tailoring becomes an
> implementation detail.
>

No, format decided to be locale-independent (for good reason) and applying
locale specific behavior implicitly would be against that.
I'm arguing for encoding-specific behavior.

>
> The locale dependency stems from the encoding itself being dependent on
>> locale. Again, LANG=C vs LANG=C.UTF-8. If the specified behavior is
>> encoding dependent (as it would have to be for field width to be a count of
>> any of code points, scalar values, or EGCs), then it is also locale
>> dependent (for char and wchar_t). Thus there is a trade off:
>>
>> 1. Either the behavior is locale dependent in which case, field
>> widths could be specified such that they count code points, scalar values,
>> or EGCs when the locale selects a Unicode encoding (and something else for
>> non-Unicode encodings), or
>> 2. The behavior is not locale dependent in which case, field widths
>> can only be specified in terms of code units.
>>
>>
> Agreed, but let me rephrase:
>
> Either a string is text, and therefore we need to know its encoding, or
> it is a sequence of bytes (in the case of char).
> I have an opinion about what we are dealing with in this context :D
>
> So your preference is for trade off #1 above and the cost is that
> std::format is no longer locale insensitive even in the cases where a
> std::locale argument is not provided.
>
It would be _encoding_ sensitive.
It would not change, for example, the decimal separator.

Whether or not Unicode is involved, I think it is important not to conflate
locale and encoding, even if C kinda amalgamates the two and derives one
from the other.
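
For illustration only, here is a minimal sketch of the two measures from the
original issue that need no Unicode data tables: code units and code points.
The helper names count_code_units and count_code_points are hypothetical, and
the code point counter assumes its input is well-formed UTF-8. Counting EGCs
would additionally need the UAX #29 segmentation rules and is not attempted
here.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of UTF-8 code units, i.e. the byte length.
constexpr std::size_t count_code_units(std::string_view s) {
    return s.size();
}

// Hypothetical helper: number of code points, ASSUMING s is well-formed
// UTF-8. Every byte that is not a continuation byte (bit pattern 10xxxxxx)
// starts a new code point.
constexpr std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

With the two strings from the original issue, "\xC3\x81" gives 2 code units
and 1 code point, while "\x41\xCC\x81" gives 3 code units and 2 code points.
Neither result depends on the global locale, only on the decision to treat
the bytes as UTF-8.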

> Since I don't think field width works for alignment, even if EGCs are used
> (see Henri's post - https://hsivonen.fi/string-length), I prefer trade
> off #2.
>
> Tom.
>
>
>
> Recall that, unless there is a call to std::setlocale, all C and C++
>> processes start with the locale set to "C"
>>
> Tom.
>>
>>
>>
>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>> Billy3
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* Lib <lib-bounces_at_[hidden]> on behalf of Corentin via
>>>> Lib <lib_at_[hidden]>
>>>> *Sent:* Saturday, September 7, 2019 11:08:25 PM
>>>> *To:* Library Working Group <lib_at_[hidden]>
>>>> *Cc:* Corentin <corentin.jabot_at_[hidden]>; Victor Zverovich <
>>>> victor.zverovich_at_[hidden]>; Tom Honermann <tom_at_[hidden]>;
>>>> unicode_at_[hidden] <unicode_at_[hidden]>
>>>> *Subject:* Re: [isocpp-lib] New issue: Are std::format field widths
>>>> code units, code points, or something else?
>>>>
>>>>
>>>>
>>>> On Sun, Sep 8, 2019, 5:30 AM Tom Honermann via Lib <
>>>> lib_at_[hidden]> wrote:
>>>>
>>>>> On 9/7/19 10:44 PM, Victor Zverovich wrote:
>>>>>
>>>>> > Is field width measured in code units, code points, or something
>>>>> else?
>>>>>
>>>>> I think the main consideration here is that width should be
>>>>> locale-independent by default for consistency with the rest of
>>>>> std::format's design.
>>>>>
>>>>> I agree with that goal, but...
>>>>>
>>>>> If we can say that width is measured in grapheme clusters or code
>>>>> points based on the execution encoding (or whatever the standardese term)
>>>>> without querying the locale then I suggest doing so.
>>>>>
>>>>> I don't know how to do that. From my response to Zach, if code units
>>>>> aren't used, then behavior should be different for LANG=C vs LANG=C.UTF-8.
>>>>>
>>>>> I have a slight preference for grapheme clusters since those correspond
>>>>> to user-perceived characters, but only have implementation experience with
>>>>> code points (this is what both the fmt library and Python do).
>>>>>
>>>>> I would definitely vote for EGCs over code points. I think code
>>>>> points are probably the worst of the options since it makes the results
>>>>> dependent on Unicode normalization form.
>>>>>
>>>>
>>>> I disagree. Code units are the worst option. For me, anything involving
>>>> code units is a big red flag. I agree that EGCS is the best option. That
>>>> doesn't drag locale, though it might be a bit involved for implementers
>>>> in 20. I don't think specifying EGCS for Unicode text and codepoints
>>>> otherwise would be too difficult. Implementation might be a bit
>>>> challenging on some platforms in the 20 time frame, but they could fall
>>>> back to codepoints in the meantime. Not perfect, but I think we need a
>>>> good long-term solution rather than a bad short-term one.
>>>>
>>>> Tom.
>>>>>
>>>>>
>>>>> Cheers,
>>>>> Victor
>>>>>
>>>>> On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib <
>>>>> lib_at_[hidden]> wrote:
>>>>>
>>>>>> [format.string.std]p7
>>>>>> <http://eel.is/c++draft/format#string.std-7>
>>>>>> states:
>>>>>>
>>>>>> The *positive-integer* in *width* is a decimal integer defining the
>>>>>> minimum field width. If *width* is not specified, there is no
>>>>>> minimum field width, and the field width is determined based on the content
>>>>>> of the field.
>>>>>>
>>>>>> Is field width measured in code units, code points, or something else?
>>>>>>
>>>>>> Consider the following example assuming a UTF-8 locale:
>>>>>>
>>>>>> std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
>>>>>> std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>>>>>>
>>>>>> In both cases, the arguments encode the same user-perceived character
>>>>>> (Á). The first uses two UTF-8 code units to encode a single code point
>>>>>> that represents a single glyph using a composed Unicode normalization
>>>>>> form. The second uses three code units to encode two code points that
>>>>>> represent the same glyph using a decomposed Unicode normalization form.
>>>>>>
>>>>>> How is the field width determined? If measured in code units, the
>>>>>> first has a width of 2 and the second of 3. If measured in code points,
>>>>>> the first has a width of 1 and the second of 2. If measured in grapheme
>>>>>> clusters, both have a width of 1. Is the determination locale dependent?
>>>>>>
>>>>>> *Proposed resolution:*
>>>>>>
>>>>>> Field widths are measured in code units and are not locale dependent.
>>>>>> Modify [format.string.std]p7
>>>>>> <http://eel.is/c++draft/format#string.std-7>
>>>>>> as follows:
>>>>>>
>>>>>> The *positive-integer* in *width* is a decimal integer defining the
>>>>>> minimum field width. If *width* is not specified, there is no
>>>>>> minimum field width, and the field width is determined based on the content
>>>>>> of the field. *Field width is measured in code units. Each byte of
>>>>>> a multibyte character contributes to the field width.*
>>>>>>
>>>>>> (*code unit* is not formally defined in the standard. Most uses
>>>>>> occur in UTF-8 and UTF-16 specific contexts, but [lex.ext]p5
>>>>>> <http://eel.is/c++draft/lex.ext#5>
>>>>>> uses it in an encoding agnostic context.)
>>>>>>
>>>>>> Tom.
>>>>>> _______________________________________________
>>>>>> Lib mailing list
>>>>>> Lib_at_[hidden]
>>>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>>>>>> Link to this post: http://lists.isocpp.org/lib/2019/09/13440.php
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Lib mailing list
>>>>> Lib_at_[hidden]
>>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>>>>> Link to this post: http://lists.isocpp.org/lib/2019/09/13446.php
>>>>>
>>>>
>> _______________________________________________
>> Lib mailing list
>> Lib_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>> Link to this post: http://lists.isocpp.org/lib/2019/09/13453.php
>>
>>
>>
>

Received on 2019-09-08 20:46:55