sg16: Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 8 Sep 2019 12:12:07 -0400

On 9/8/19 6:00 AM, Corentin via Lib wrote:
>
>
> On Sun, 8 Sep 2019 at 11:17, Corentin <corentin.jabot_at_[hidden]
> <mailto:corentin.jabot_at_[hidden]>> wrote:
>
>
>
> On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS)
> <bion_at_[hidden] <mailto:bion_at_[hidden]>> wrote:
>
> > I agree that EGCS is the best option. That doesn't drag locale
>
> Because we don’t get to assume that we’re talking about
> Unicode at all, it absolutely drags in locale.
>
>
> Sorry, I should have been more specific.
> There is a non-tailored Unicode EGCS boundary algorithm (though it
> can be tailored).
> I didn't mean to imply that text manipulation can be done without
> knowing its encoding, and I never use "locale" to mean encoding.
>
> EGCS are only defined for text whose character repertoire is
> Unicode; other encodings deal with codepoints.
>
>
>
> To be clear, whether the EGC algorithm is required to be tailored
> matters because tailoring, for all intents and purposes, requires
> ICU or something else with CLDR, which restricts the platforms on
> which this can be implemented.

Tailoring is not relevant to this discussion.

The locale dependency stems from the encoding itself being dependent on
the locale. Again, compare LANG=C with LANG=C.UTF-8. If the specified
behavior is encoding dependent (as it would have to be for field width to
be a count of code points, scalar values, or EGCs), then it is also locale
dependent (for char and wchar_t). Thus there is a trade-off:

1. Either the behavior is locale dependent, in which case field widths
    could be specified such that they count code points, scalar values,
    or EGCs when the locale selects a Unicode encoding (and something
    else for non-Unicode encodings), or
2. The behavior is not locale dependent, in which case field widths can
    only be specified in terms of code units.
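
As a rough illustration of the two options, here is a minimal sketch using
the two strings from the original issue. The function names are hypothetical
and are not part of std::format or of any proposal. The code point counter
assumes well-formed UTF-8 and does no validation. EGC counting is omitted
because it needs Unicode property data.

```cpp
#include <cstddef>
#include <string_view>

// Option 2: count code units. For char strings this is just the byte
// count, so no knowledge of the encoding (or the locale) is needed.
constexpr std::size_t width_in_code_units(std::string_view s) {
    return s.size();
}

// Option 1, when the encoding is known to be UTF-8: count code points by
// counting every byte that is not a continuation byte (10xxxxxx).
// Assumption: the input is well-formed UTF-8; nothing is validated.
constexpr std::size_t width_in_utf8_code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical dispatcher: the caller has to say what the encoding is,
// which for char is exactly the information the locale carries.
constexpr std::size_t field_width(std::string_view s, bool encoding_is_utf8) {
    return encoding_is_utf8 ? width_in_utf8_code_points(s)
                            : width_in_code_units(s);
}

// U+00C1 (composed): 2 code units, 1 code point.
static_assert(width_in_code_units("\xC3\x81") == 2);
static_assert(width_in_utf8_code_points("\xC3\x81") == 1);
// U+0041 U+0301 (decomposed): 3 code units, 2 code points.
static_assert(width_in_code_units("\x41\xCC\x81") == 3);
static_assert(width_in_utf8_code_points("\x41\xCC\x81") == 2);
```

The same bytes give a different answer from field_width depending on the
encoding_is_utf8 argument, which is the locale dependency in miniature.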

Recall that, unless there is a call to std::setlocale, all C and C++
processes start with the locale set to "C".
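
For reference, a small sketch of how a program can observe this. The helper
names are hypothetical. Passing a null pointer to std::setlocale queries the
current locale without changing it, and passing "" selects the locale named
by the environment.

```cpp
#include <clocale>
#include <string>

// Returns the name of the current global C locale without changing it.
inline std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? std::string(name) : std::string();
}

// Hypothetical helper: opt in to the environment's locale (LANG, LC_ALL, ...).
// Returns false if the platform cannot provide the locale the environment names.
inline bool adopt_environment_locale() {
    return std::setlocale(LC_ALL, "") != nullptr;
}
```

Until something like adopt_environment_locale() is called, current_c_locale()
is expected to report "C" regardless of what LANG is set to.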

Tom.

>
>
>
> Billy3
>
> ------------------------------------------------------------------------
> *From:* Lib <lib-bounces_at_[hidden]
> <mailto:lib-bounces_at_[hidden]>> on behalf of Corentin
> via Lib <lib_at_[hidden] <mailto:lib_at_[hidden]>>
> *Sent:* Saturday, September 7, 2019 11:08:25 PM
> *To:* Library Working Group <lib_at_[hidden]
> <mailto:lib_at_[hidden]>>
> *Cc:* Corentin <corentin.jabot_at_[hidden]
> <mailto:corentin.jabot_at_[hidden]>>; Victor Zverovich
> <victor.zverovich_at_[hidden]
> <mailto:victor.zverovich_at_[hidden]>>; Tom Honermann
> <tom_at_[hidden] <mailto:tom_at_[hidden]>>;
> unicode_at_[hidden]
> <mailto:unicode_at_[hidden]> <unicode_at_[hidden]
> <mailto:unicode_at_[hidden]>>
> *Subject:* Re: [isocpp-lib] New issue: Are std::format field
> widths code units, code points, or something else?
>
>
> On Sun, Sep 8, 2019, 5:30 AM Tom Honermann via Lib
> <lib_at_[hidden] <mailto:lib_at_[hidden]>> wrote:
>
> On 9/7/19 10:44 PM, Victor Zverovich wrote:
>> > Is field width measured in code units, code points, or
>> something else?
>>
>> I think the main consideration here is that width should
>> be locale-independent by default for consistency with the
>> rest of std::format's design.
> I agree with that goal, but...
>> If we can say that width is measured in grapheme clusters
>> or code points based on the execution encoding (or
>> whatever the standardese term) without querying the
>> locale then I suggest doing so.
> I don't know how to do that. From my response to Zach, if
> code units aren't used, then behavior should be different
> for LANG=C vs LANG=C.UTF-8.
>> I have slight preference for grapheme clusters since
>> those correspond to user-perceived characters, but only
>> have implementation experience with code points (this is
>> what both the fmt library and Python do).
>
> I would definitely vote for EGCs over code points. I
> think code points are probably the worst of the options
> since it makes the results dependent on Unicode
> normalization form.
>
>
> I disagree. Code units are the worst option. For me anything
> involving code units is a big red flag. I agree that EGCS is
> the best option. That doesn't drag locale, but might be a bit
> involved for implementers in 20.
> I think specifying EGCS for Unicode text and codepoints
> otherwise wouldn't be too difficult. Implementation might be
> a bit challenging on some platforms in the 20 time frame, but
> they could fall back to codepoints in the meantime. Not perfect,
> but I think we need a good long-term solution rather than a
> bad short-term one.
>
> Tom.
>
>>
>> Cheers,
>> Victor
>>
>> On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib
>> <lib_at_[hidden] <mailto:lib_at_[hidden]>> wrote:
>>
>> [format.string.std]p7
>> <http://eel.is/c++draft/format#string.std-7>
>> states:
>>
>>> The /positive-integer/ in /width/ is a decimal
>>> integer defining the minimum field width. If
>>> /width/ is not specified, there is no minimum field
>>> width, and the field width is determined based on
>>> the content of the field.
>>>
>> Is field width measured in code units, code points,
>> or something else?
>>
>> Consider the following example assuming a UTF-8 locale:
>>
>> // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
>> std::format("{}", "\xC3\x81");
>> // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>> std::format("{}", "\x41\xCC\x81");
>>
>> In both cases, the arguments encode the same
>> user-perceived character (Á). The first uses two
>> UTF-8 code units to encode a single code point that
>> represents a single glyph using a composed Unicode
>> normalization form. The second uses three code units
>> to encode two code points that represent the same
>> glyph using a decomposed Unicode normalization form.
>>
>> How is the field width determined? If measured in
>> code units, the first has a width of 2 and the second
>> of 3. If measured in code points, the first has a
>> width of 1 and the second of 2. If measured in
>> grapheme clusters, both have a width of 1. Is the
>> determination locale dependent?
>>
>> *Proposed resolution:*
>>
>> Field widths are measured in code units and are not
>> locale dependent. Modify [format.string.std]p7
>> <http://eel.is/c++draft/format#string.std-7>
>> as follows:
>>
>>> The /positive-integer/ in /width/ is a decimal
>>> integer defining the minimum field width. If
>>> /width/ is not specified, there is no minimum field
>>> width, and the field width is determined based on
>>> the content of the field. *Field width is measured
>>> in code units. Each byte of a multibyte character
>>> contributes to the field width.*
>>>
>> (/code unit/ is not formally defined in the
>> standard. Most uses occur in UTF-8 and UTF-16
>> specific contexts, but [lex.ext]p5
>> <http://eel.is/c++draft/lex.ext#5>
>> uses it in an encoding agnostic context.)
>>
>> Tom.
>>
>> _______________________________________________
>> Lib mailing list
>> Lib_at_[hidden] <mailto:Lib_at_[hidden]>
>> Subscription:
>> https://lists.isocpp.org/mailman/listinfo.cgi/lib
>> Link to this post:
>> http://lists.isocpp.org/lib/2019/09/13440.php
>>
>
> _______________________________________________
> Lib mailing list
> Lib_at_[hidden] <mailto:Lib_at_[hidden]>
> Subscription:
> https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post:
> http://lists.isocpp.org/lib/2019/09/13446.php
>
>
> _______________________________________________
> Lib mailing list
> Lib_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2019/09/13453.php

Received on 2019-09-08 18:12:11