sg16: Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 8 Sep 2019 13:30:18 -0400

On 9/8/19 12:40 PM, Corentin wrote:
>
>
> On Sun, 8 Sep 2019 at 18:12, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 9/8/19 6:00 AM, Corentin via Lib wrote:
>>
>>
>> On Sun, 8 Sep 2019 at 11:17, Corentin <corentin.jabot_at_[hidden]
>> <mailto:corentin.jabot_at_[hidden]>> wrote:
>>
>>
>>
>> On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS)
>> <bion_at_[hidden] <mailto:bion_at_[hidden]>> wrote:
>>
>> > I agree that EGCS is the best option. That doesn't drag
>> locale
>>
>> Because we don’t get to assume that we’re talking about
>> Unicode at all, it absolutely drags in locale.
>>
>>
>> Sorry, I should have been more specific.
>> There is a non-tailored Unicode EGCS boundary algorithm (but
>> it can be tailored)
>> I didn't mean to imply that text manipulation can be done
>> without knowing its encoding and never use "locale" to mean
>> encoding.
>>
>> EGCS are only defined for text whose character repertoire is
>> Unicode; other encodings deal with code points.
>>
>>
>>
>> To be clear, whether the EGC algorithm is required to be tailored
>> matters because tailoring, for all intents and purposes, requires
>> ICU or something with CLDR, which restricts the platforms on which
>> this can be implemented.
>
> Tailoring is not relevant to this discussion.
>
> It is - see https://unicode.org/reports/tr29/ "ch" is 2 EGCS in most
> locales but in Slovak it's 1. I don't make the rules :D
It isn't relevant in determining how we resolve this issue. If the
resolution is that field widths are measured in EGCs, then we've already
decided that the width is locale dependent and tailoring becomes an
implementation detail.
>
> The locale dependency stems from the encoding itself being
> dependent on locale. Again, LANG=C vs LANG=C.UTF-8. If the
> specified behavior is encoding dependent (as it would have to be
> for field width to be a count of any of code points, scalar
> values, or EGCs), then it is also locale dependent (for char and
> wchar_t). Thus there is a trade off:
>
> 1. Either the behavior is locale dependent in which case, field
> widths could be specified such that they count code points,
> scalar values, or EGCs when the locale selects a Unicode
> encoding (and something else for non-Unicode encodings), or
> 2. The behavior is not locale dependent in which case, field
> widths can only be specified in terms of code units.
>
>
> Agreed, but let me rephrase:
>
> Either a string is text, and therefore we need to know its
> encoding, or it is a sequence of bytes (in the case of char)
> I have an opinion about what we are dealing with in this context :D

So your preference is for trade off #1 above and the cost is that
std::format is no longer locale insensitive even in the cases where a
std::locale argument is not provided.

Since I don't think field width works for alignment, even if EGCs are
used (see Henri's post - https://hsivonen.fi/string-length), I prefer
trade off #2.

Tom.
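
For readers who want to reproduce the numbers in the original example quoted
below, here is a minimal sketch of the three candidate measures. The function
names are hypothetical and are not part of std::format or any proposal. The
input is assumed to be valid UTF-8 (no validation is done), and the cluster
count is only an approximation that treats code points in U+0300..U+036F as
extending the previous character; it is not the UAX #29 algorithm.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Code units: for char strings this is simply the number of bytes.
std::size_t count_code_units(std::string_view s) { return s.size(); }

// Decode UTF-8 into code points. ASSUMPTION: input is valid UTF-8;
// malformed sequences are not detected.
std::vector<char32_t> decode_utf8(std::string_view s) {
    std::vector<char32_t> out;
    for (unsigned char b : s) {
        if ((b & 0xC0) == 0x80) {
            // Continuation byte: append its 6 payload bits.
            if (!out.empty())
                out.back() = static_cast<char32_t>((out.back() << 6) | (b & 0x3F));
        } else if (b < 0x80) {
            out.push_back(static_cast<char32_t>(b));         // 1-byte sequence
        } else if ((b & 0xE0) == 0xC0) {
            out.push_back(static_cast<char32_t>(b & 0x1F));  // 2-byte lead
        } else if ((b & 0xF0) == 0xE0) {
            out.push_back(static_cast<char32_t>(b & 0x0F));  // 3-byte lead
        } else {
            out.push_back(static_cast<char32_t>(b & 0x07));  // 4-byte lead
        }
    }
    return out;
}

// Code points: one per decoded scalar value.
std::size_t count_code_points(std::string_view s) {
    return decode_utf8(s).size();
}

// Rough stand-in for grapheme clusters. ASSUMPTION: only the Combining
// Diacritical Marks block (U+0300..U+036F) extends a cluster. A real
// implementation would follow UAX #29.
std::size_t approx_cluster_count(std::string_view s) {
    std::size_t n = 0;
    for (char32_t cp : decode_utf8(s)) {
        if (cp < 0x0300 || cp > 0x036F) ++n;
    }
    return n;
}
```

With these helpers, "\xC3\x81" measures 2 code units, 1 code point, and 1
approximate cluster, while "\x41\xCC\x81" measures 3, 2, and 1, matching the
widths given in the original issue.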

>
>
> Recall that, unless there is a call to std::setlocale, all C and
> C++ processes start with the locale set to "C"
>
> Tom.
>
>>
>>
>>
>> Billy3
>>
>> ------------------------------------------------------------------------
>> *From:* Lib <lib-bounces_at_[hidden]
>> <mailto:lib-bounces_at_[hidden]>> on behalf of
>> Corentin via Lib <lib_at_[hidden]
>> <mailto:lib_at_[hidden]>>
>> *Sent:* Saturday, September 7, 2019 11:08:25 PM
>> *To:* Library Working Group <lib_at_[hidden]
>> <mailto:lib_at_[hidden]>>
>> *Cc:* Corentin <corentin.jabot_at_[hidden]
>> <mailto:corentin.jabot_at_[hidden]>>; Victor Zverovich
>> <victor.zverovich_at_[hidden]
>> <mailto:victor.zverovich_at_[hidden]>>; Tom Honermann
>> <tom_at_[hidden] <mailto:tom_at_[hidden]>>;
>> unicode_at_[hidden]
>> <mailto:unicode_at_[hidden]>
>> <unicode_at_[hidden] <mailto:unicode_at_[hidden]>>
>> *Subject:* Re: [isocpp-lib] New issue: Are std::format
>> field widths code units, code points, or something else?
>>
>>
>> On Sun, Sep 8, 2019, 5:30 AM Tom Honermann via Lib
>> <lib_at_[hidden] <mailto:lib_at_[hidden]>> wrote:
>>
>> On 9/7/19 10:44 PM, Victor Zverovich wrote:
>>> > Is field width measured in code units, code
>>> points, or something else?
>>>
>>> I think the main consideration here is that width
>>> should be locale-independent by default for
>>> consistency with the rest of std::format's design.
>> I agree with that goal, but...
>>> If we can say that width is measured in grapheme
>>> clusters or code points based on the execution
>>> encoding (or whatever the standardese term) without
>>> querying the locale then I suggest doing so.
>> I don't know how to do that. From my response to
>> Zach, if code units aren't used, then behavior should
>> be different for LANG=C vs LANG=C.UTF-8.
>>> I have slight preference for grapheme clusters since
>>> those correspond to user-perceived characters, but
>>> only have implementation experience with code points
>>> (this is what both the fmt library and Python do).
>>
>> I would definitely vote for EGCs over code points. I
>> think code points are probably the worst of the
>> options since it makes the results dependent on
>> Unicode normalization form.
>>
>>
>> I disagree. Code units are the worse option. For me,
>> anything involving code units is a big red flag. I agree
>> that EGCS is the best option. That doesn't drag locale, but
>> might be a bit involved for implementers in C++20.
>> I don't think specifying EGCS for Unicode text and
>> code points otherwise would be too difficult.
>> Implementation might be a bit challenging on some
>> platforms in the C++20 time frame, but they could fall back to
>> code points in the meantime. Not perfect, but I think we
>> need a good long-term solution rather than a bad short-term
>> one.
>>
>> Tom.
>>
>>>
>>> Cheers,
>>> Victor
>>>
>>> On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib
>>> <lib_at_[hidden] <mailto:lib_at_[hidden]>>
>>> wrote:
>>>
>>> [format.string.std]p7
>>> <http://eel.is/c++draft/format#string.std-7>
>>> states:
>>>
>>>> The /positive-integer/ in /width/ is a decimal
>>>> integer defining the minimum field width. If
>>>> /width/ is not specified, there is no minimum
>>>> field width, and the field width is determined
>>>> based on the content of the field.
>>>>
>>> Is field width measured in code units, code
>>> points, or something else?
>>>
>>> Consider the following example assuming a UTF-8
>>> locale:
>>>
>>> // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
>>> std::format("{}", "\xC3\x81");
>>> // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>>> std::format("{}", "\x41\xCC\x81");
>>>
>>> In both cases, the arguments encode the same
>>> user-perceived character (Á). The first uses
>>> two UTF-8 code units to encode a single code
>>> point that represents a single glyph using a
>>> composed Unicode normalization form. The second
>>> uses three code units to encode two code points
>>> that represent the same glyph using a decomposed
>>> Unicode normalization form.
>>>
>>> How is the field width determined? If measured
>>> in code units, the first has a width of 2 and
>>> the second of 3. If measured in code points,
>>> the first has a width of 1 and the second of 2.
>>> If measured in grapheme clusters, both have a
>>> width of 1. Is the determination locale dependent?
>>>
>>> *Proposed resolution:*
>>>
>>> Field widths are measured in code units and are
>>> not locale dependent. Modify
>>> [format.string.std]p7
>>> <http://eel.is/c++draft/format#string.std-7>
>>> as follows:
>>>
>>>> The /positive-integer/ in /width/ is a decimal
>>>> integer defining the minimum field width. If
>>>> /width/ is not specified, there is no minimum
>>>> field width, and the field width is determined
>>>> based on the content of the field. *Field width
>>>> is measured in code units. Each byte of a
>>>> multibyte character contributes to the field
>>>> width.*
>>>>
>>> (/code unit/ is not formally defined in the
>>> standard. Most uses occur in UTF-8 and UTF-16
>>> specific contexts, but [lex.ext]p5
>>> <http://eel.is/c++draft/lex.ext#5>
>>> uses it in an encoding agnostic context.)
>>>
>>> Tom.
>>>
>>> _______________________________________________
>>> Lib mailing list
>>> Lib_at_[hidden] <mailto:Lib_at_[hidden]>
>>> Subscription:
>>> https://lists.isocpp.org/mailman/listinfo.cgi/lib
>>> Link to this post:
>>> http://lists.isocpp.org/lib/2019/09/13440.php
>>>
>>
>> _______________________________________________
>> Lib mailing list
>> Lib_at_[hidden] <mailto:Lib_at_[hidden]>
>> Subscription:
>> https://lists.isocpp.org/mailman/listinfo.cgi/lib
>> Link to this post:
>> http://lists.isocpp.org/lib/2019/09/13446.php
>>
>>
>> _______________________________________________
>> Lib mailing list
>> Lib_at_[hidden] <mailto:Lib_at_[hidden]>
>> Subscription:https://lists.isocpp.org/mailman/listinfo.cgi/lib
>> Link to this post:http://lists.isocpp.org/lib/2019/09/13453.php
>
>

Received on 2019-09-08 19:30:23