ISOCPP sg16 List: Re: LWG ISSUE: Format's width estimation is too approximate and not forward compatible.

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Sun, 18 Sep 2022 09:07:39 -0700

Windows is not a great example because width for anything nontrivial is
broken there, but xterm is interesting. This should totally go into a paper
explaining this design change. BTW you could also add iTerm2, which is a
popular terminal emulator with high-quality Unicode support: it gives a
width of 2.

- Victor

On Sun, Sep 18, 2022 at 8:55 AM Corentin <corentin.jabot_at_[hidden]> wrote:

> Let's see
> python3 -c 'print("|{}|\n|a|".format(chr(0x2E9A)))'
>
> Konsole:
> [image: image.png]
> Windows terminal
> [image: image.png]
> Mac
> [image: image.png]
> xterm
> [image: image.png]
>
>
> On Sun, Sep 18, 2022 at 4:38 PM Victor Zverovich <
> victor.zverovich_at_[hidden]> wrote:
>
>> Yes, and the proposed resolution changes the handling of such cases in an
>> incompatible way that goes against the original intent and makes width
>> estimation less useful. This is a design change.
>>
>> - Victor
>>
>> On Sun, Sep 18, 2022 at 7:20 AM Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Sun, Sep 18, 2022, 16:09 Victor Zverovich <victor.zverovich_at_[hidden]>
>>> wrote:
>>>
>>>> Even a cursory look at the issue shows a number of undesirable changes
>>>> that go against the intent of P1868R1. For example,
>>>>
>>>> U+2E9A
>>>>
>>>
>>> This codepoint does not appear to be assigned.
>>>
>>>>
>>>> has a display width of 2 on common terminals while the proposed
>>>> resolution changes it to 1.
>>>>
>>>> The part "it seems questionable to make an exception for those without
>>>> input from Unicode" is itself a questionable design change.
>>>>
>>>> - Victor
>>>>
>>>> On Thu, Sep 15, 2022 at 10:12 AM Corentin via SG16 <
>>>> sg16_at_[hidden]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Sep 15, 2022 at 5:59 PM Tom Honermann <tom_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> Thank you for looking into this, Corentin. It would be great to have
>>>>>> some kind of principled approach to how estimated widths are assigned.
>>>>>>
>>>>>> I don't think this should be handled as an LWG issue though. This is
>>>>>> a design change, so should go through LEWG (and probably SG16 as well).
>>>>>>
>>>>>> I think it would be helpful to have a paper that includes screenshots
>>>>>> (of select characters, not all 8500+ characters!) to demonstrate how common
>>>>>> terminals display them today. The goal being to ensure that they are
>>>>>> handled consistently across platforms before we standardize them.
>>>>>>
>>>>>> A paper would also present the opportunity to attribute an estimated
>>>>>> width of 0 (or a negative width!) to some characters. Perhaps it is worth
>>>>>> asking whether an estimated width of 1 really makes sense for CR, LF, NL,
>>>>>> VT, etc... It looks like these changes still don't cover edge cases like ﷽
>>>>>> (U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM).
>>>>>>
>>>>>
>>>>> I'm not trying to change the design, nor am I trying to make case-by-case
>>>>> decisions; the original paper covered that.
>>>>> I think these are questions worth asking, sure, at least for control
>>>>> characters, since for an arbitrary non-CJK, non-emoji code point neither
>>>>> data nor a specification exists.
>>>>> Note, however, that it's not straightforward:
>>>>> * There is no spec so it's handled manually
>>>>> https://github.com/termux/wcwidth/blob/master/wcwidth.c#L497
>>>>> * C1 control characters may be mapped differently on EBCDIC
>>>>> platforms
>>>>> * Combining characters not preceded by a grapheme starter are in
>>>>> fact rendered
>>>>> * We already ignore non-leading codepoints in graphemes
>>>>> * Negative width would be novel
>>>>>
>>>>> I do not intend to explore these further but I'd be happy to review
>>>>> such work :)
>>>>>
>>>>> However, the original paper intended for CJK characters and emojis to
>>>>> be covered, and Unicode 14 and 15 added a number of such characters, so I
>>>>> considered this a bug.
>>>>> But mostly, I consider specifying anything Unicode related in terms of
>>>>> partially-assigned ranges to be a bug reminiscent of the identifiers of old.
>>>>>
>>>>> Now, I expect SG16 will deal with this issue, but all the information
>>>>> is in the issue.
>>>>> The only design-ish question is whether we want to grandfather U+3248
>>>>> CIRCLED NUMBER TEN ON BLACK SQUARE..U+324F CIRCLED NUMBER EIGHTY ON BLACK
>>>>> SQUARE
>>>>> for backward compat reasons.
>>>>>
>>>>> Codepoints which would go from 1 to 2
>>>>> https://gist.github.com/cor3ntin/e5731f77574b146d806e39283e8c7cb7
>>>>> Screenshot of my terminal
>>>>> [image: image.png]
>>>>>
>>>>>> Tom.
>>>>>> On 9/15/22 5:02 AM, Corentin via SG16 wrote:
>>>>>>
>>>>>>
>>>>>> Affected clause [format.string.std]
>>>>>>
>>>>>> Issue:
>>>>>> For the purpose of width estimation, format considers ranges of
>>>>>> codepoints initially derived from an implementation of wcwidth with
>>>>>> modifications.
>>>>>> (See https://wg21.link/p1868r1)
>>>>>>
>>>>>> This, however, presents a number of challenges:
>>>>>>
>>>>>> - From a reading of the spec, it is not clear how these ranges
>>>>>> were selected.
>>>>>> - Poor forward compatibility with future Unicode versions. The
>>>>>> list will become less and less meaningful over time or require active
>>>>>> maintenance at each Unicode release (which we have already failed to do
>>>>>> for Unicode 14).
>>>>>> - Some of these codepoints are unassigned or otherwise reserved,
>>>>>> which is another forward compatibility concern.
>>>>>>
>>>>>> Instead, we propose to
>>>>>>
>>>>>> - Rely on UAX #11 (https://www.unicode.org/reports/tr11/) for most
>>>>>> of the codepoints
>>>>>> - Grandfather specific, fully assigned blocks of codepoints
>>>>>> to support additional pictograms per the original intent of the paper and
>>>>>> existing practice. We add the names of these blocks in the wording for
>>>>>> clarity.
>>>>>>
>>>>>> Note that per UAX-11
>>>>>>
>>>>>> - Most emojis are considered East_Asian_Width="W"
>>>>>> - By design, East_Asian_Width="W" includes specific unassigned
>>>>>> ranges, which should always be treated as Wide.
>>>>>>
>>>>>> This change:
>>>>>>
>>>>>> - Considers 8477 extra codepoints as having a width of 2 (as of
>>>>>> Unicode 15) (mostly Tangut Ideographs)
>>>>>> - Changes the width of 85 unassigned code points from 2 to 1
>>>>>> - Changes the width of 8 codepoints (in the range U+3248 CIRCLED
>>>>>> NUMBER TEN ON BLACK SQUARE ..U+324F CIRCLED NUMBER EIGHTY ON BLACK SQUARE)
>>>>>> from 2 to 1, because it seems questionable to make an exception for those
>>>>>> without input from Unicode
>>>>>>
>>>>>>
>>>>>> Proposed wording:
>>>>>>
>>>>>> Modify [format.string.std]/p12
>>>>>>
>>>>>> For a string in a Unicode encoding, implementations should estimate
>>>>>> the width of a string as the sum of estimated widths of the first code
>>>>>> points in its extended grapheme clusters. The extended grapheme clusters of
>>>>>> a string are defined by UAX #29. The estimated width of the following code
>>>>>> points is 2:
>>>>>>
>>>>>> - U+1100 – U+115F
>>>>>> - U+2329 – U+232A
>>>>>> - U+2E80 – U+303E
>>>>>> - U+3040 – U+A4CF
>>>>>> - U+AC00 – U+D7A3
>>>>>> - U+F900 – U+FAFF
>>>>>> - U+FE10 – U+FE19
>>>>>> - U+FE30 – U+FE6F
>>>>>> - U+FF00 – U+FF60
>>>>>> - U+FFE0 – U+FFE6
>>>>>> - U+1F300 – U+1F64F
>>>>>> - U+1F900 – U+1F9FF
>>>>>> - U+20000 – U+2FFFD
>>>>>> - U+30000 – U+3FFFD
>>>>>> - *Any codepoint with the East_Asian_Width="W"
>>>>>> or East_Asian_Width="F" Derived Extracted Property as described by UAX #44*
>>>>>> - *U+4DC0 - U+4DFF (Yijing Hexagram Symbols)*
>>>>>> - *U+1F300 - U+1F5FF (Miscellaneous Symbols and Pictographs)*
>>>>>> - *U+1F900 - U+1F9FF (Supplemental Symbols and Pictographs)*
>>>>>>
>>>>>> The estimated width of other code points is 1.
>>>>>>
>>>>>>
>>>>>> ---- END OF WORDING ----
>>>>>>
>>>>>> Annex: Differences introduced by this change
>>>>>>
>>>>>> // Missing from the standard (count: 8477)
>>>>>> // Used to be of width 1, becomes 2
>>>>>>
>>>>>> U+231A WATCH ..U+231B HOURGLASS
>>>>>> U+23E9 BLACK RIGHT-POINTING DOUBLE TRIANGLE ..U+23EC BLACK
>>>>>> DOWN-POINTING DOUBLE TRIANGLE
>>>>>> U+23F0 ALARM CLOCK
>>>>>> U+23F3 HOURGLASS WITH FLOWING SAND
>>>>>> U+25FD WHITE MEDIUM SMALL SQUARE ..U+25FE BLACK MEDIUM SMALL SQUARE
>>>>>> U+2614 UMBRELLA WITH RAIN DROPS ..U+2615 HOT BEVERAGE
>>>>>> U+2648 ARIES ..U+2653 PISCES
>>>>>> U+267F WHEELCHAIR SYMBOL
>>>>>> U+2693 ANCHOR
>>>>>> U+26A1 HIGH VOLTAGE SIGN
>>>>>> U+26AA MEDIUM WHITE CIRCLE ..U+26AB MEDIUM BLACK CIRCLE
>>>>>> U+26BD SOCCER BALL ..U+26BE BASEBALL
>>>>>> U+26C4 SNOWMAN WITHOUT SNOW ..U+26C5 SUN BEHIND CLOUD
>>>>>> U+26CE OPHIUCHUS
>>>>>> U+26D4 NO ENTRY
>>>>>> U+26EA CHURCH
>>>>>> U+26F2 FOUNTAIN ..U+26F3 FLAG IN HOLE
>>>>>> U+26F5 SAILBOAT
>>>>>> U+26FA TENT
>>>>>> U+26FD FUEL PUMP
>>>>>> U+2705 WHITE HEAVY CHECK MARK
>>>>>> U+270A RAISED FIST ..U+270B RAISED HAND
>>>>>> U+2728 SPARKLES
>>>>>> U+274C CROSS MARK
>>>>>> U+274E NEGATIVE SQUARED CROSS MARK
>>>>>> U+2753 BLACK QUESTION MARK ORNAMENT ..U+2755 WHITE EXCLAMATION MARK
>>>>>> ORNAMENT
>>>>>> U+2757 HEAVY EXCLAMATION MARK SYMBOL
>>>>>> U+2795 HEAVY PLUS SIGN ..U+2797 HEAVY DIVISION SIGN
>>>>>> U+27B0 CURLY LOOP
>>>>>> U+27BF DOUBLE CURLY LOOP
>>>>>> U+2B1B BLACK LARGE SQUARE ..U+2B1C WHITE LARGE SQUARE
>>>>>> U+2B50 WHITE MEDIUM STAR
>>>>>> U+2B55 HEAVY LARGE CIRCLE
>>>>>> U+A960 HANGUL CHOSEONG TIKEUT-MIEUM ..U+A97C HANGUL CHOSEONG
>>>>>> SSANGYEORINHIEUH
>>>>>> U+16FE0 TANGUT ITERATION MARK ..U+16FE4 KHITAN SMALL SCRIPT FILLER
>>>>>> U+16FF0 VIETNAMESE ALTERNATE READING MARK CA ..U+16FF1 VIETNAMESE
>>>>>> ALTERNATE READING MARK NHAY
>>>>>> U+17000 TANGUT IDEOGRAPH-# ..U+187F7 TANGUT IDEOGRAPH-#
>>>>>> U+18800 TANGUT COMPONENT-001 ..U+18CD5 KHITAN SMALL SCRIPT
>>>>>> CHARACTER-#
>>>>>> U+18D00 TANGUT IDEOGRAPH-# ..U+18D08 TANGUT IDEOGRAPH-#
>>>>>> U+1AFF0 KATAKANA LETTER MINNAN TONE-2 ..U+1AFF3 KATAKANA LETTER
>>>>>> MINNAN TONE-5
>>>>>> U+1AFF5 KATAKANA LETTER MINNAN TONE-7 ..U+1AFFB KATAKANA LETTER
>>>>>> MINNAN NASALIZED TONE-5
>>>>>> U+1AFFD KATAKANA LETTER MINNAN NASALIZED TONE-7 ..U+1AFFE KATAKANA
>>>>>> LETTER MINNAN NASALIZED TONE-8
>>>>>> U+1B000 KATAKANA LETTER ARCHAIC E ..U+1B122 KATAKANA LETTER ARCHAIC
>>>>>> WU
>>>>>> U+1B132 HIRAGANA LETTER SMALL KO
>>>>>> U+1B150 HIRAGANA LETTER SMALL WI ..U+1B152 HIRAGANA LETTER SMALL WO
>>>>>> U+1B155 KATAKANA LETTER SMALL KO
>>>>>> U+1B164 KATAKANA LETTER SMALL WI ..U+1B167 KATAKANA LETTER SMALL N
>>>>>> U+1B170 NUSHU CHARACTER-# ..U+1B2FB NUSHU CHARACTER-#
>>>>>> U+1F004 MAHJONG TILE RED DRAGON
>>>>>> U+1F0CF PLAYING CARD BLACK JOKER
>>>>>> U+1F18E NEGATIVE SQUARED AB
>>>>>> U+1F191 SQUARED CL ..U+1F19A SQUARED VS
>>>>>> U+1F200 SQUARE HIRAGANA HOKA ..U+1F202 SQUARED KATAKANA SA
>>>>>> U+1F210 SQUARED CJK UNIFIED IDEOGRAPH-624B ..U+1F23B SQUARED CJK
>>>>>> UNIFIED IDEOGRAPH-914D
>>>>>> U+1F240 TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-672C ..U+1F248
>>>>>> TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-6557
>>>>>> U+1F250 CIRCLED IDEOGRAPH ADVANTAGE ..U+1F251 CIRCLED IDEOGRAPH
>>>>>> ACCEPT
>>>>>> U+1F260 ROUNDED SYMBOL FOR FU ..U+1F265 ROUNDED SYMBOL FOR CAI
>>>>>> U+1F680 ROCKET ..U+1F6C5 LEFT LUGGAGE
>>>>>> U+1F6CC SLEEPING ACCOMMODATION
>>>>>> U+1F6D0 PLACE OF WORSHIP ..U+1F6D2 SHOPPING TROLLEY
>>>>>> U+1F6D5 HINDU TEMPLE ..U+1F6D7 ELEVATOR
>>>>>> U+1F6DC WIRELESS ..U+1F6DF RING BUOY
>>>>>> U+1F6EB AIRPLANE DEPARTURE ..U+1F6EC AIRPLANE ARRIVING
>>>>>> U+1F6F4 SCOOTER ..U+1F6FC ROLLER SKATE
>>>>>> U+1F7E0 LARGE ORANGE CIRCLE ..U+1F7EB LARGE BROWN SQUARE
>>>>>> U+1F7F0 HEAVY EQUALS SIGN
>>>>>> U+1FA70 BALLET SHOES ..U+1FA7C CRUTCH
>>>>>> U+1FA80 YO-YO ..U+1FA88 FLUTE
>>>>>> U+1FA90 RINGED PLANET ..U+1FABD WING
>>>>>> U+1FABF GOOSE ..U+1FAC5 PERSON WITH CROWN
>>>>>> U+1FACE MOOSE ..U+1FADB PEA POD
>>>>>> U+1FAE0 MELTING FACE ..U+1FAE8 SHAKING FACE
>>>>>> U+1FAF0 HAND WITH INDEX FINGER AND THUMB CROSSED ..U+1FAF8 RIGHTWARDS
>>>>>> PUSHING HAND
>>>>>>
>>>>>> // Reserved (count: 85) - Used to be of width 2, becomes 1
>>>>>>
>>>>>> U+2E9A
>>>>>> U+2EF4 ..U+2EFF
>>>>>> U+2FD6 ..U+2FEF
>>>>>> U+2FFC ..U+2FFF
>>>>>> U+3040
>>>>>> U+3097 ..U+3098
>>>>>> U+3100 ..U+3104
>>>>>> U+3130
>>>>>> U+318F
>>>>>> U+31E4 ..U+31EF
>>>>>> U+321F
>>>>>> U+A48D ..U+A48F
>>>>>> U+A4C7 ..U+A4CF
>>>>>> U+FE53
>>>>>> U+FE67
>>>>>> U+FE6C ..U+FE6F
>>>>>> U+FF00
>>>>>>
>>>>>> // Used to be of width 2, becomes 1
>>>>>> U+3248 CIRCLED NUMBER TEN ON BLACK SQUARE ..U+324F CIRCLED NUMBER
>>>>>> EIGHTY ON BLACK SQUARE
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>> SG16 mailing list
>>>>> SG16_at_[hidden]
>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>
>>>>

Received on 2022-09-18 16:07:51