sg16: Re: [SG16-Unicode] [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code Points" blog post

From: Corentin <corentin.jabot_at_[hidden]>
Date: Tue, 12 Nov 2019 22:19:29 +0100

On Tue, Nov 12, 2019, 22:09 Tom Honermann <tom_at_[hidden]> wrote:

> If implementors aren't going to be willing to change these tables once we
> ship, then I think we have a fairly serious issue.
>

+1

>
> Some have adamantly stated that these widths are estimates only and should
> not be counted on to remain stable. Code that is sensitive to the
> formatted size of the output should be calling std::formatted_size and
> allocating appropriately. I take it your concern is regarding code that
> calls std::format_to with an assumption that the provided output buffer is
> large enough? (or, code that calls std::format and assumes the size of the
> resulting std::string).
>
> Tom.
>
> On 11/12/19 8:58 PM, Billy O'Neal (VC LIBS) wrote:
>
> My only point was that the specified behavior gives grapheme clusters a
> width of 1 or 2, but there exist characters like U+FDFD that are wider than
> 2 (and many that have a width of 0). I would be very nervous about changing
> the constants used after std::format ships, because that could introduce
> unexpected buffer overruns or underruns in user programs. This is the kind
> of thing that becomes contractual very quickly (which is one of the reasons
> I was weakly against trying to open this can of worms).
>
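
To illustrate the 1-or-2 limitation, here is a minimal sketch of a table-driven
width estimate. The ranges are as best recalled from the proposed std::format
wording and should be treated as an assumption to check against the actual
text; estimated_width and kWideRanges are made-up names. U+FDFD falls outside
every listed range, so it is estimated at width 1 even though it renders much
wider.

```cpp
#include <cassert>
#include <cstdint>

// Inclusive range of code points estimated as two columns wide.
struct Range {
    std::uint32_t first;
    std::uint32_t last;
};

// Assumed table: ranges as recalled from the proposed wording, not verified.
constexpr Range kWideRanges[] = {
    {0x1100, 0x115F},   {0x2329, 0x232A},   {0x2E80, 0x303E},
    {0x3040, 0xA4CF},   {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},
    {0xFE10, 0xFE19},   {0xFE30, 0xFE6F},   {0xFF00, 0xFF60},
    {0xFFE0, 0xFFE6},   {0x1F300, 0x1F64F}, {0x1F900, 0x1F9FF},
    {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD},
};

// Hypothetical helper: returns 2 for code points in the table, 1 otherwise.
// There is no way for this scheme to answer 0 or anything above 2.
int estimated_width(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // must be a valid Unicode code point value
    for (const Range& r : kWideRanges) {
        if (cp >= r.first && cp <= r.last) {
            return 2;
        }
    }
    return 1;
}
```

With this table, estimated_width(0xFDFD) yields 1 and estimated_width(0x3042)
(HIRAGANA LETTER A) yields 2.
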
>
>
> Billy3
>
>
>
> *From: *Tom Honermann <tom_at_[hidden]>
> *Sent: *Tuesday, November 12, 2019 12:53 PM
> *To: *lib-ext_at_[hidden]; Corentin <corentin.jabot_at_[hidden]>
> *Cc: *Billy O'Neal (VC LIBS) <bion_at_[hidden]>; lib_at_[hidden];
> SG16 <unicode_at_[hidden]>; Victor Zverovich <victor.zverovich_at_[hidden]>
> *Subject: *Re: [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code
> Points" blog post
>
>
>
> On 11/12/19 6:11 PM, Billy O'Neal (VC LIBS) via Lib-Ext wrote:
>
> It came up in the context of the width estimation in std::format: I was
> asking whether I had permission to make wider-than-2 characters format
> properly, and the forwarded text doesn’t seem to allow that (which is OK, I
> just wanted to understand at the time). I was thinking of U+FDFD (﷽).
>
> Can you elaborate? My understanding of the forwarded wording is that the
> assumed encoding for the input text is implementation defined (though not
> locale sensitive) and that implementors are encouraged to use the Unicode
> code point ranges indicated in the wording, but are not required to (that
> is my interpretation of the use of the word "should" in the proposed
> wording).
>
> It does look like the provided code point ranges don't handle U+FDFD
> correctly.
>
> I don't know how much confidence should be placed in the listed code point
> ranges. But I think it is important that we consider them amenable to
> change. I suspect that U+FDFD is not the last code point we'll find that
> is not correctly handled.
>
> Tom.
>
>
>
> Billy3
>
>
>
> *From: *Corentin <corentin.jabot_at_[hidden]>
> *Sent: *Tuesday, November 12, 2019 8:42 AM
> *To: *C++ Library Evolution Working Group <lib-ext_at_[hidden]>
> *Cc: *lib_at_[hidden]; Billy O'Neal (VC LIBS) <bion_at_[hidden]>;
> SG16 <unicode_at_[hidden]>
> *Subject: *Re: [isocpp-lib-ext] The "Let's Stop Ascribing Meaning to Code
> Points" blog post
>
>
>
>
>
>
>
> On Tue, 12 Nov 2019 at 16:58, Billy O'Neal (VC LIBS) via Lib-Ext <
> lib-ext_at_[hidden]> wrote:
>
> During review of some Unicode stuff in LWG, some folks had a mini
> discussion about grapheme clusters, and I mentioned that everyone who
> touches this stuff might understand the complexities better if they read this:
>
>
>
>
> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
>
>
>
> +1
>
> FYI, SG16 is aware of that blog post and I think there is pretty strong
> agreement with it.
>
> Code points have some use (notably, the Unicode Character Database is really
> the Unicode Code Point Database, and most Unicode algorithms work on
> code points), but any kind of user-facing UX should deal with EGCs
> (extended grapheme clusters).
>
> That is not always what applications choose to do, for a variety of reasons.
> Notably, Twitter's character count deals in code points, web browsers'
> search functions use code points so as to ignore diacritics, and comparisons
> can be done on (normalized) code point sequences.
>
>
>
> There is also not always a 1-1 mapping between what people understand as a
> "character", grapheme clusters, and glyphs.
>
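
A small sketch of why code points and user-perceived characters diverge: the
function below counts code points in a UTF-8 string by skipping continuation
bytes. It assumes well-formed UTF-8 input and does no validation;
count_code_points is a hypothetical name. An "e" followed by U+0301 COMBINING
ACUTE ACCENT is one grapheme cluster to a reader, yet it is two code points and
three bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a well-formed UTF-8 string.
// Every code point contributes exactly one byte that is not a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Counting grapheme clusters instead would need the segmentation rules and data
tables from the Unicode standard, which this sketch deliberately leaves out.
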
>
>
>
>
> Billy3
>
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post: http://lists.isocpp.org/lib-ext/2019/11/13606.php
>
>
>
>
>
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post: http://lists.isocpp.org/lib-ext/2019/11/13609.php
>

Received on 2019-11-12 22:19:43