sg16: Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 9 Sep 2019 09:31:47 +0200

On Mon, 9 Sep 2019 at 01:25, Tom Honermann <tom_at_[hidden]> wrote:

>
> On Sep 8, 2019, at 3:31 PM, Tony V E via Lib <lib_at_[hidden]> wrote:
>
> Do we have / could we have / should we have
> a clear long term (20 years) direction for text in C++?
>
>
> I would like that very much, but we don’t control the ecosystem, and will
> have to, to some degree, roll with where the community takes us.
>

The community is waiting for us to catch up, and I do believe we have some
control.

>
>
> i.e. the long term direction is Unicode.
> and/or specifically the long term direction is UTF-8.
>
>
> I think we do have widespread agreement on that, though UTF-16 is likely
> to remain strongly relevant in some niches.
>
> We expect everyone to use char8_t then? Or we expect char to become utf8
> someday?
>
>
> I think it is very unlikely that there will be a mass migration to
> char8_t. My expectation is that it will be used for the internal encoding
> within some percentage of new projects and components.
>
> With regard to char, I expect it to remain the type used for text that may
> or may not be UTF-8.
>
> I think Microsoft will eventually provide (non-experimental) means to use
> UTF-8 with Win32 and that this will likely come in three forms
>

> 1) support for UTF-8 as the system wide Active Code Page (ACP). This is
> already available as an experimental option.
>

They do

>
> 2) support for executables to opt-in to a per-process override of the
> system wide ACP. In this mode, stdio would presumably traffic in the system
> wide ACP and require transcoding (I don’t think implicit transcoding is
> realistic). This is already available as an experimental option.
>

They do

>
> 3) support for a subset of Win32 interfaces that take char8_t. E.g., U8
> variants of some existing A/W interfaces.
>

That seems unlikely?

>
> z/OS is a bit more interesting. Though EBCDIC based, ASCII interfaces that
> implicitly transcode to EBCDIC are available for a subset of C interfaces.
> As far as I am aware, there are no plans to extend this support to
> include UTF-8.
>

Their interest in text is limited; it is clearly a small minority here.
I think there is a difference between not breaking their use cases and
designing for that platform specifically.
Whatever we do, they will be fine.

>
> What do we want the long term future to look like?
>
>
> 🎵You can’t always get what you want 🎶
>
> deprecate std::string?
>
>
> Probably not.
>

We should supersede it and let the chips fall where they may.

>
>
> And then a list of short term stop-gap measures, like "we know we can't do
> X yet, so we do Y for now".
> Like we use char, but plan on switching to char8_t.
> Or QoI escape hatches. etc.
>
>
> I think we need to plan to support use of both char and char8_t for UTF-8
> text for the foreseeable future.
>
> Tom.
>
>
>
>
> On Sun, Sep 8, 2019 at 2:46 PM Corentin via Lib <lib_at_[hidden]>
> wrote:
>
>>
>>
>> On Sun, 8 Sep 2019 at 19:30, Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 9/8/19 12:40 PM, Corentin wrote:
>>>
>>>
>>>
>>> On Sun, 8 Sep 2019 at 18:12, Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>>> On 9/8/19 6:00 AM, Corentin via Lib wrote:
>>>>
>>>>
>>>>
>>>> On Sun, 8 Sep 2019 at 11:17, Corentin <corentin.jabot_at_[hidden]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS) <
>>>>> bion_at_[hidden]> wrote:
>>>>>
>>>>>> > I agree that EGCS is the best option. That doesn't drag locale
>>>>>>
>>>>>>
>>>>>>
>>>>>> Because we don’t get to assume that we’re talking about Unicode at
>>>>>> all, it absolutely drags in locale.
>>>>>>
>>>>>
>>>>> Sorry, I should have been more specific.
>>>>> There is a non-tailored Unicode EGCS boundary algorithm (though it can be
>>>>> tailored).
>>>>> I didn't mean to imply that text manipulation can be done without
>>>>> knowing its encoding, and I never use "locale" to mean encoding.
>>>>>
>>>>> EGCS are only defined for text whose character repertoire is Unicode;
>>>>> other encodings deal with code points.
>>>>>
>>>>
>>>>
>>>> To be clear, whether the EGC algorithm is required to be tailored
>>>> matters because tailoring, for all intents and purposes, requires ICU
>>>> or something else with CLDR data, which restricts the platforms on
>>>> which this can be implemented.
>>>>
>>>> Tailoring is not relevant to this discussion.
>>>>
>>> It is - see https://unicode.org/reports/tr29/ "ch" is 2 EGCS in most
>>> locales but in Slovak it's 1. I don't make the rules :D
>>>
>>> It isn't relevant in determining how we resolve this issue. If the
>>> resolution is that field widths are measured in EGCs, then we've already
>>> decided that the width is locale dependent and tailoring becomes an
>>> implementation detail.
>>>
>>
>> No, format decided to be locale-independent (for good reason), and
>> applying locale-specific behavior implicitly would go against that.
>> I'm arguing for encoding-specific behavior.
>>
>>
>>>
>>> The locale dependency stems from the encoding itself being dependent on
>>>> locale. Again, LANG=C vs LANG=C.UTF-8. If the specified behavior is
>>>> encoding dependent (as it would have to be for field width to be a count of
>>>> any of code points, scalar values, or EGCs), then it is also locale
>>>> dependent (for char and wchar_t). Thus there is a trade off:
>>>>
>>>> 1. Either the behavior is locale dependent in which case, field
>>>> widths could be specified such that they count code points, scalar values,
>>>> or EGCs when the locale selects a Unicode encoding (and something else for
>>>> non-Unicode encodings), or
>>>> 2. The behavior is not locale dependent in which case, field widths
>>>> can only be specified in terms of code units.
>>>>
>>>>
>>> Agreed, but let me rephrase:
>>>
>>> Either a string is text, and therefore we need to know its encoding,
>>> or it is a sequence of bytes (in the case of char).
>>> I have an opinion about what we are dealing with in this context :D
>>>
>>> So your preference is for trade off #1 above and the cost is that
>>> std::format is no longer locale insensitive even in the cases where a
>>> std::locale argument is not provided.
>>>
>> It would be _encoding_ sensitive.
>> It would not change, for example, the decimal separator.
>>
>> Whether or not Unicode is involved, I think it is important not to
>> conflate locale and encoding, even if C kinda amalgamates the two and
>> derives one from the other.
>>
>>
>>
>>
>>> Since I don't think field width works for alignment, even if EGCs are
>>> used (see Henri's post - https://hsivonen.fi/string-length), I prefer
>>> trade off #2.
>>>
>>> Tom.
>>>
>>>
>>>
>>> Recall that, unless there is a call to std::setlocale, all C and C++
>>>> processes start with the locale set to "C"
>>>>
>>> Tom.
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Billy3
>>>>>>
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Lib <lib-bounces_at_[hidden]> on behalf of Corentin via
>>>>>> Lib <lib_at_[hidden]>
>>>>>> *Sent:* Saturday, September 7, 2019 11:08:25 PM
>>>>>> *To:* Library Working Group <lib_at_[hidden]>
>>>>>> *Cc:* Corentin <corentin.jabot_at_[hidden]>; Victor Zverovich <
>>>>>> victor.zverovich_at_[hidden]>; Tom Honermann <tom_at_[hidden]>;
>>>>>> unicode_at_[hidden] <unicode_at_[hidden]>
>>>>>> *Subject:* Re: [isocpp-lib] New issue: Are std::format field widths
>>>>>> code units, code points, or something else?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Sep 8, 2019, 5:30 AM Tom Honermann via Lib <
>>>>>> lib_at_[hidden]> wrote:
>>>>>>
>>>>>>> On 9/7/19 10:44 PM, Victor Zverovich wrote:
>>>>>>>
>>>>>>> > Is field width measured in code units, code points, or something
>>>>>>> else?
>>>>>>>
>>>>>>> I think the main consideration here is that width should be
>>>>>>> locale-independent by default for consistency with the rest of
>>>>>>> std::format's design.
>>>>>>>
>>>>>>> I agree with that goal, but...
>>>>>>>
>>>>>>> If we can say that width is measured in grapheme clusters or code
>>>>>>> points based on the execution encoding (or whatever the standardese term)
>>>>>>> without querying the locale then I suggest doing so.
>>>>>>>
>>>>>>> I don't know how to do that. From my response to Zach, if code
>>>>>>> units aren't used, then behavior should be different for LANG=C vs
>>>>>>> LANG=C.UTF-8.
>>>>>>>
>>>>>>> I have slight preference for grapheme clusters since those
>>>>>>> correspond to user-perceived characters, but only have implementation
>>>>>>> experience with code points (this is what both the fmt library and Python
>>>>>>> do).
>>>>>>>
>>>>>>> I would definitely vote for EGCs over code points. I think code
>>>>>>> points are probably the worst of the options since it makes the results
>>>>>>> dependent on Unicode normalization form.
>>>>>>>
>>>>>>
>>>>>> I disagree. Code units are the worst option. For me, anything involving
>>>>>> code units is a big red flag. I agree that EGCS is the best option. That
>>>>>> doesn't drag in locale, though it might be a bit involved for implementers
>>>>>> in C++20.
>>>>>> I think specifying EGCS for Unicode text and code points otherwise
>>>>>> wouldn't be too difficult. Implementation might be a bit challenging on
>>>>>> some platforms in the C++20 time frame, but they could fall back to code
>>>>>> points in the meantime. Not perfect, but I think we need a good long-term
>>>>>> solution rather than a bad short-term one.
>>>>>>
>>>>>> Tom.
>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Victor
>>>>>>>
>>>>>>> On Sat, Sep 7, 2019 at 5:13 PM Tom Honermann via Lib <
>>>>>>> lib_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> [format.string.std]p7
>>>>>>>> <http://eel.is/c++draft/format#string.std-7>
>>>>>>>> states:
>>>>>>>>
>>>>>>>> The *positive-integer* in *width* is a decimal integer defining
>>>>>>>> the minimum field width. If *width* is not specified, there is no
>>>>>>>> minimum field width, and the field width is determined based on the content
>>>>>>>> of the field.
>>>>>>>>
>>>>>>>> Is field width measured in code units, code points, or something
>>>>>>>> else?
>>>>>>>>
>>>>>>>> Consider the following example assuming a UTF-8 locale:
>>>>>>>>
>>>>>>>> // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
>>>>>>>> std::format("{}", "\xC3\x81");
>>>>>>>> // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
>>>>>>>> std::format("{}", "\x41\xCC\x81");
>>>>>>>>
>>>>>>>> In both cases, the arguments encode the same user-perceived
>>>>>>>> character (Á). The first uses two UTF-8 code units to encode a single code
>>>>>>>> point that represents a single glyph using a composed Unicode normalization
>>>>>>>> form. The second uses three code units to encode two code points that
>>>>>>>> represent the same glyph using a decomposed Unicode normalization form.
>>>>>>>>
>>>>>>>> How is the field width determined? If measured in code units, the
>>>>>>>> first has a width of 2 and the second of 3. If measured in code points,
>>>>>>>> the first has a width of 1 and the second of 2. If measured in grapheme
>>>>>>>> clusters, both have a width of 1. Is the determination locale dependent?
>>>>>>>>
>>>>>>>> *Proposed resolution:*
>>>>>>>>
>>>>>>>> Field widths are measured in code units and are not locale
>>>>>>>> dependent. Modify [format.string.std]p7
>>>>>>>> <http://eel.is/c++draft/format#string.std-7>
>>>>>>>> as follows:
>>>>>>>>
>>>>>>>> The *positive-integer* in *width* is a decimal integer defining
>>>>>>>> the minimum field width. If *width* is not specified, there is no
>>>>>>>> minimum field width, and the field width is determined based on the content
>>>>>>>> of the field. *Field width is measured in code units. Each byte
>>>>>>>> of a multibyte character contributes to the field width.*
>>>>>>>>
>>>>>>>> (*code unit* is not formally defined in the standard. Most uses
>>>>>>>> occur in UTF-8 and UTF-16 specific contexts, but [lex.ext]p5
>>>>>>>> <http://eel.is/c++draft/lex.ext#5>
>>>>>>>> uses it in an encoding agnostic context.)
>>>>>>>>
>>>>>>>> Tom.
>>>>>>>> _______________________________________________
>>>>>>>> Lib mailing list
>>>>>>>> Lib_at_[hidden]
>>>>>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>>>>>>>> Link to this post: http://lists.isocpp.org/lib/2019/09/13440.php
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Lib mailing list
>>>>>>> Lib_at_[hidden]
>>>>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>>>>>>> Link to this post: http://lists.isocpp.org/lib/2019/09/13446.php
>>>>>>>
>>>>>>
>>>> _______________________________________________
>>>> Lib mailing list
>>>> Lib_at_[hidden]
>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>>>> Link to this post: http://lists.isocpp.org/lib/2019/09/13453.php
>>>>
>>>>
>>>>
>>> _______________________________________________
>> Lib mailing list
>> Lib_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>> Link to this post: http://lists.isocpp.org/lib/2019/09/13458.php
>>
>
>
> --
> Be seeing you,
> Tony
>
> _______________________________________________
> Lib mailing list
> Lib_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2019/09/13459.php
>
>

Received on 2019-09-09 09:32:01