sg16: Re: [SG16-Unicode] [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code Points" blog post

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 12 Nov 2019 16:09:35 -0500

If implementors aren't going to be willing to change these tables once
we ship, then I think we have a fairly serious issue.

Some have adamantly stated that these widths are estimates only and
should not be counted on to remain stable. Code that is sensitive to
the formatted size of the output should be calling std::formatted_size
and allocating appropriately. I take it your concern is about code
that calls std::format_to on the assumption that the provided output
buffer is large enough (or code that calls std::format and assumes
the size of the resulting std::string)?

Tom.

On 11/12/19 8:58 PM, Billy O'Neal (VC LIBS) wrote:
>
> My only point was that the specified behavior gives grapheme clusters
> a width of 1 or 2, but there exist characters like U+FDFD that are
> wider than 2 (and many that have a width of 0). I would be very
> nervous about changing the constants used after std::format ships
> because that could introduce unexpected buffer overruns or underruns
> in user programs. This is the kind of thing that becomes contractual
> very quickly (which is one of the reasons I was weakly against trying
> to open this can of worms).
>
> Billy3
>
> *From: *Tom Honermann <mailto:tom_at_[hidden]>
> *Sent: *Tuesday, November 12, 2019 12:53 PM
> *To: *lib-ext_at_[hidden] <mailto:lib-ext_at_[hidden]>;
> Corentin <mailto:corentin.jabot_at_[hidden]>
> *Cc: *Billy O'Neal (VC LIBS) <mailto:bion_at_[hidden]>;
> lib_at_[hidden] <mailto:lib_at_[hidden]>; SG16
> <mailto:unicode_at_[hidden]>; Victor Zverovich
> <mailto:victor.zverovich_at_[hidden]>
> *Subject: *Re: [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to
> Code Points" blog post
>
> On 11/12/19 6:11 PM, Billy O'Neal (VC LIBS) via Lib-Ext wrote:
>
> It came up in the context of that width thing in format and I was
> asking if I had permission to make wider-than-2 characters format
> properly, and the forwarded text doesn’t seem to allow that (which
> is OK, I just wanted to understand at the time); I was thinking of
> U+FDFD (﷽).
>
> Can you elaborate? My understanding of the forwarded wording is that
> the assumed encoding for the input text is implementation defined
> (though not locale sensitive) and that implementors are encouraged to
> use the Unicode code point ranges indicated in the wording, but are
> not required to (that is my interpretation of the use of the word
> "should" in the proposed wording).
>
> It does look like the provided code point ranges don't handle U+FDFD
> correctly.
>
> I don't know how much confidence should be placed on the listed code
> point ranges. But I think it is important that we consider them
> amenable to change. I suspect that U+FDFD is not the last code point
> we'll find that is not correctly handled.
>
> Tom.
>
> Billy3
>
> *From: *Corentin <mailto:corentin.jabot_at_[hidden]>
> *Sent: *Tuesday, November 12, 2019 8:42 AM
> *To: *C++ Library Evolution Working Group
> <mailto:lib-ext_at_[hidden]>
> *Cc: *lib_at_[hidden] <mailto:lib_at_[hidden]>; Billy
> O'Neal (VC LIBS) <mailto:bion_at_[hidden]>; SG16
> <mailto:unicode_at_[hidden]>
> *Subject: *Re: [isocpp-lib-ext] The "Let's Stop Ascribing Meaning
> to Code Points" blog post
>
> On Tue, 12 Nov 2019 at 16:58, Billy O'Neal (VC LIBS) via Lib-Ext
> <lib-ext_at_[hidden] <mailto:lib-ext_at_[hidden]>> wrote:
>
> During review of some Unicode stuff in LWG we had a mini
> discussion among some folks about grapheme clusters, and I
> mentioned that everyone who touches this stuff might understand the
> complexities better if they read this:
>
> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>
> +1
>
> FYI, SG16 is aware of that blog post and I think there is pretty
> strong agreement with it.
>
> Codepoints have some use (notably, the Unicode Character Database
> is really the Unicode Codepoint Database, and most Unicode
> algorithms work on codepoints), but any kind of user-facing UX
> should deal with EGCs (extended grapheme clusters).
>
> That is not always what applications choose to do, for a variety of
> reasons. Notably, Twitter's character count deals in codepoints, web
> browsers' search functions use codepoints so as to ignore diacritics,
> and comparisons can be done on (normalized) codepoint sequences.
>
> There is also not always a 1-1 mapping between what people
> understand as a "character", grapheme clusters, and glyphs.
>
> Billy3
>
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden] <mailto:Lib-Ext_at_[hidden]>
> Subscription:
> https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post:
> http://lists.isocpp.org/lib-ext/2019/11/13606.php
>
>
>
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden] <mailto:Lib-Ext_at_[hidden]>
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post: http://lists.isocpp.org/lib-ext/2019/11/13609.php
>

Received on 2019-11-12 22:09:40