ISOCPP sg16 List: Re: LWG ISSUE: Format's width estimation is too approximate and not forward compatible.

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 15 Sep 2022 16:25:43 -0400

On 9/15/22 1:11 PM, Corentin wrote:
>
>
> On Thu, Sep 15, 2022 at 5:59 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> Thank you for looking into this, Corentin. It would be great to
> have some kind of principled approach to how estimated widths are
> assigned.
>
> I don't think this should be handled as an LWG issue though. This
> is a design change, so should go through LEWG (and probably SG16
> as well).
>
> I think it would be helpful to have a paper that includes
> screenshots (of select characters, not all 8500+ characters!) to
> demonstrate how common terminals display them today. The goal
> being to ensure that they are handled consistently across
> platforms before we standardize them.
>
> A paper would also present the opportunity to attribute an
> estimated width of 0 (or a negative width!) to some characters.
> Perhaps it is worth asking whether an estimated width of 1 really
> makes sense for CR, LF, NL, VT, etc... It looks like these changes
> still don't cover edge cases like ﷽ (U+FDFD ARABIC LIGATURE
> BISMILLAH AR-RAHMAN AR-RAHEEM).
>
>
> I'm not trying to change the design, nor am I trying to make
> case-by-case decisions; the original paper covered that.
> I think these are questions worth asking, sure, at least for control
> characters, as for an arbitrary non-CJK, non-emoji character neither
> information nor a specification exists.
> Note, however, that it's not straightforward:
> * There is no spec so it's handled manually
> https://github.com/termux/wcwidth/blob/master/wcwidth.c#L497
> * C1 control characters may be mapped differently on EBCDIC platforms
> * Combining characters not preceded by a grapheme starter are in
> fact rendered
> * We already ignore non leading codepoints in graphemes
> * Negative width would be novel
>
> I do not intend to explore these further but I'd be happy to review
> such work :)
>
> However, the original paper intended for CJK characters and emojis to
> be covered. Unicode 14 and 15 added a number of such characters, so I
> considered this a bug.
> But mostly, I consider specifying anything Unicode related in terms of
> partially-assigned ranges to be a bug reminiscent of the identifiers
> of old.
>
> Now, I expect SG16 will deal with this issue, but all the information
> is in the issue.
> The only design-ish question is whether we want to grandfather U+3248
> CIRCLED NUMBER TEN ON BLACK SQUARE..U+324F CIRCLED NUMBER EIGHTY ON
> BLACK SQUARE
> for backward compat reasons.
>
> Codepoints which would go from 1 to 2
> https://gist.github.com/cor3ntin/e5731f77574b146d806e39283e8c7cb7
> Screenshot of my terminal
> image.png

My desire for a paper is to have a record containing evidence (from
multiple platforms) that supports a conclusion that these additional
characters should be given the proposed widths. I think this is
important precisely because there is no formal specification to refer to.

While I agree with addressing characters that were introduced in
versions of Unicode more recent than what was available when P1868 was
adopted, this proposal opts in additional characters (as noted in your
annex) that are not new. We very well may want those, but the argument
that the change is just a bug fix becomes more difficult in that case.

I think the following recent comment from the Unicode mailing list is
relevant with regard to the trustworthiness of East_Asian_Width for our
purposes (from
https://corp.unicode.org/pipermail/unicode/2022-September/010308.html):

> AFAIK, EAW is a data set to improve interoperability with environments
> using some double-byte encoding schemes where display widths are tied
> with byte sizes, so no meaningful value is defined for characters
> which are not found in those legacy character encodings.

Tom.

>
> Tom.
>
> On 9/15/22 5:02 AM, Corentin via SG16 wrote:
>>
>> Affected clause [format.string.std]
>>
>> Issue:
>> For the purpose of width estimation, format considers ranges of
>> codepoints initially derived from an implementation of wcwidth
>> with modifications.
>> (See https://wg21.link/p1868r1)
>>
>> This, however, presents a number of challenges:
>>
>> * From a reading of the spec, it is not clear how these ranges
>> were selected.
>> * Poor forward compatibility with future Unicode versions. The
>> list will become less and less meaningful over time or require
>> active maintenance at each Unicode release (which we have not
>> done for Unicode 14 already)
>> * Some of these codepoints are unassigned or otherwise
>> reserved, which is another forward compatibility concern.
>>
>> Instead, we propose to
>>
>> * Rely on UAX-11 (https://www.unicode.org/reports/tr11/) for most
>> of the codepoints
>> * Grandfather specific, fully assigned blocks of
>> codepoints to support additional pictograms per the original
>> intent of the paper and existing practices. We add the name
>> of these blocks in the wording for clarity.
>>
>> Note that per UAX-11
>>
>> * Most emojis are considered East_Asian_Width="W"
>> * By design, East_Asian_Width="W" includes specific unassigned
>> ranges, which should always be treated as Wide.
>>
>> This change:
>>
>> * Considers 8477 extra codepoints as having a width of 2 (as of
>> Unicode 15) (mostly Tangut Ideographs)
>> * Changes the width of 85 unassigned code points from 2 to 1
>> * Changes the width of 8 codepoints (in the range U+3248 CIRCLED
>> NUMBER TEN ON BLACK SQUARE ..U+324F CIRCLED NUMBER EIGHTY ON
>> BLACK SQUARE) from 2 to 1, because it seems questionable to
>> make an exception for those without input from Unicode
>>
>>
>> Proposed wording:
>>
>> Modify [format.string.std]/p12
>>
>> For a string in a Unicode encoding, implementations should
>> estimate the width of a string as the sum of estimated widths of
>> the first code points in its extended grapheme clusters. The
>> extended grapheme clusters of a string are defined by UAX #29.
>> The estimated width of the following code points is 2:
>>
>> * U+1100 – U+115F
>> * U+2329 – U+232A
>> * U+2E80 – U+303E
>> * U+3040 – U+A4CF
>> * U+AC00 – U+D7A3
>> * U+F900 – U+FAFF
>> * U+FE10 – U+FE19
>> * U+FE30 – U+FE6F
>> * U+FF00 – U+FF60
>> * U+FFE0 – U+FFE6
>> * U+1F300 – U+1F64F
>> * U+1F900 – U+1F9FF
>> * U+20000 – U+2FFFD
>> * U+30000 – U+3FFFD
>> * _Any codepoint with the East_Asian_Width="W" or
>> East_Asian_Width="F" Derived Extracted Property as described
>> by UAX #44_
>> * _U+4DC0 - U+4DFF (Yijing Hexagram Symbols)_
>> * _U+1F300 - U+1F5FF (Miscellaneous Symbols and Pictographs)_
>> * _U+1F900 - U+1F9FF (Supplemental Symbols and Pictographs)_
>>
>> The estimated width of other code points is 1.
>>
>>
>> ---- END OF WORDING ----
>>
>> Annex: Differences introduced by this change
>>
>> // Missing from the standard (count: 8477)
>> // Used to be of width 1, becomes 2
>>
>> U+231A WATCH ..U+231B HOURGLASS
>> U+23E9 BLACK RIGHT-POINTING DOUBLE TRIANGLE ..U+23EC BLACK
>> DOWN-POINTING DOUBLE TRIANGLE
>> U+23F0 ALARM CLOCK
>> U+23F3 HOURGLASS WITH FLOWING SAND
>> U+25FD WHITE MEDIUM SMALL SQUARE ..U+25FE BLACK MEDIUM SMALL SQUARE
>> U+2614 UMBRELLA WITH RAIN DROPS ..U+2615 HOT BEVERAGE
>> U+2648 ARIES ..U+2653 PISCES
>> U+267F WHEELCHAIR SYMBOL
>> U+2693 ANCHOR
>> U+26A1 HIGH VOLTAGE SIGN
>> U+26AA MEDIUM WHITE CIRCLE ..U+26AB MEDIUM BLACK CIRCLE
>> U+26BD SOCCER BALL ..U+26BE BASEBALL
>> U+26C4 SNOWMAN WITHOUT SNOW ..U+26C5 SUN BEHIND CLOUD
>> U+26CE OPHIUCHUS
>> U+26D4 NO ENTRY
>> U+26EA CHURCH
>> U+26F2 FOUNTAIN ..U+26F3 FLAG IN HOLE
>> U+26F5 SAILBOAT
>> U+26FA TENT
>> U+26FD FUEL PUMP
>> U+2705 WHITE HEAVY CHECK MARK
>> U+270A RAISED FIST ..U+270B RAISED HAND
>> U+2728 SPARKLES
>> U+274C CROSS MARK
>> U+274E NEGATIVE SQUARED CROSS MARK
>> U+2753 BLACK QUESTION MARK ORNAMENT ..U+2755 WHITE EXCLAMATION
>> MARK ORNAMENT
>> U+2757 HEAVY EXCLAMATION MARK SYMBOL
>> U+2795 HEAVY PLUS SIGN ..U+2797 HEAVY DIVISION SIGN
>> U+27B0 CURLY LOOP
>> U+27BF DOUBLE CURLY LOOP
>> U+2B1B BLACK LARGE SQUARE ..U+2B1C WHITE LARGE SQUARE
>> U+2B50 WHITE MEDIUM STAR
>> U+2B55 HEAVY LARGE CIRCLE
>> U+A960 HANGUL CHOSEONG TIKEUT-MIEUM ..U+A97C HANGUL CHOSEONG
>> SSANGYEORINHIEUH
>> U+16FE0 TANGUT ITERATION MARK ..U+16FE4 KHITAN SMALL SCRIPT FILLER
>> U+16FF0 VIETNAMESE ALTERNATE READING MARK CA ..U+16FF1 VIETNAMESE
>> ALTERNATE READING MARK NHAY
>> U+17000 TANGUT IDEOGRAPH-# ..U+187F7 TANGUT IDEOGRAPH-#
>> U+18800 TANGUT COMPONENT-001 ..U+18CD5 KHITAN SMALL SCRIPT
>> CHARACTER-#
>> U+18D00 TANGUT IDEOGRAPH-# ..U+18D08 TANGUT IDEOGRAPH-#
>> U+1AFF0 KATAKANA LETTER MINNAN TONE-2 ..U+1AFF3 KATAKANA LETTER
>> MINNAN TONE-5
>> U+1AFF5 KATAKANA LETTER MINNAN TONE-7 ..U+1AFFB KATAKANA LETTER
>> MINNAN NASALIZED TONE-5
>> U+1AFFD KATAKANA LETTER MINNAN NASALIZED TONE-7 ..U+1AFFE
>> KATAKANA LETTER MINNAN NASALIZED TONE-8
>> U+1B000 KATAKANA LETTER ARCHAIC E ..U+1B122 KATAKANA LETTER
>> ARCHAIC WU
>> U+1B132 HIRAGANA LETTER SMALL KO
>> U+1B150 HIRAGANA LETTER SMALL WI ..U+1B152 HIRAGANA LETTER SMALL WO
>> U+1B155 KATAKANA LETTER SMALL KO
>> U+1B164 KATAKANA LETTER SMALL WI ..U+1B167 KATAKANA LETTER SMALL N
>> U+1B170 NUSHU CHARACTER-# ..U+1B2FB NUSHU CHARACTER-#
>> U+1F004 MAHJONG TILE RED DRAGON
>> U+1F0CF PLAYING CARD BLACK JOKER
>> U+1F18E NEGATIVE SQUARED AB
>> U+1F191 SQUARED CL ..U+1F19A SQUARED VS
>> U+1F200 SQUARE HIRAGANA HOKA ..U+1F202 SQUARED KATAKANA SA
>> U+1F210 SQUARED CJK UNIFIED IDEOGRAPH-624B ..U+1F23B SQUARED CJK
>> UNIFIED IDEOGRAPH-914D
>> U+1F240 TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-672C
>> ..U+1F248 TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-6557
>> U+1F250 CIRCLED IDEOGRAPH ADVANTAGE ..U+1F251 CIRCLED IDEOGRAPH
>> ACCEPT
>> U+1F260 ROUNDED SYMBOL FOR FU ..U+1F265 ROUNDED SYMBOL FOR CAI
>> U+1F680 ROCKET ..U+1F6C5 LEFT LUGGAGE
>> U+1F6CC SLEEPING ACCOMMODATION
>> U+1F6D0 PLACE OF WORSHIP ..U+1F6D2 SHOPPING TROLLEY
>> U+1F6D5 HINDU TEMPLE ..U+1F6D7 ELEVATOR
>> U+1F6DC WIRELESS ..U+1F6DF RING BUOY
>> U+1F6EB AIRPLANE DEPARTURE ..U+1F6EC AIRPLANE ARRIVING
>> U+1F6F4 SCOOTER ..U+1F6FC ROLLER SKATE
>> U+1F7E0 LARGE ORANGE CIRCLE ..U+1F7EB LARGE BROWN SQUARE
>> U+1F7F0 HEAVY EQUALS SIGN
>> U+1FA70 BALLET SHOES ..U+1FA7C CRUTCH
>> U+1FA80 YO-YO ..U+1FA88 FLUTE
>> U+1FA90 RINGED PLANET ..U+1FABD WING
>> U+1FABF GOOSE ..U+1FAC5 PERSON WITH CROWN
>> U+1FACE MOOSE ..U+1FADB PEA POD
>> U+1FAE0 MELTING FACE ..U+1FAE8 SHAKING FACE
>> U+1FAF0 HAND WITH INDEX FINGER AND THUMB CROSSED ..U+1FAF8
>> RIGHTWARDS PUSHING HAND
>>
>> // Reserved (count: 85) - Used to be of width 2, becomes 1
>>
>> U+2E9A
>> U+2EF4 ..U+2EFF
>> U+2FD6 ..U+2FEF
>> U+2FFC ..U+2FFF
>> U+3040
>> U+3097 ..U+3098
>> U+3100 ..U+3104
>> U+3130
>> U+318F
>> U+31E4 ..U+31EF
>> U+321F
>> U+A48D ..U+A48F
>> U+A4C7 ..U+A4CF
>> U+FE53
>> U+FE67
>> U+FE6C ..U+FE6F
>> U+FF00
>>
>> // Used to be of width 2, becomes 1
>> U+3248 CIRCLED NUMBER TEN ON BLACK SQUARE ..U+324F CIRCLED NUMBER
>> EIGHTY ON BLACK SQUARE
>>
>>
>>
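
To make the quoted wording and annex concrete, here is a minimal C++ sketch that contrasts the existing range-based rule with the proposed property-based rule. It is an illustration under stated assumptions, not an implementation from the issue or from any standard library. The names Range, in_any, old_width, new_width and estimated_width are hypothetical. The table eaw_wide_excerpt is a small hand-picked excerpt standing in for a real East_Asian_Width lookup, so new_width is only meaningful for the code points it lists and for the three named blocks. Grapheme cluster segmentation (UAX #29) is assumed to have been done by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive range of code points.
struct Range {
    char32_t first;
    char32_t last;
};

// Returns true if cp falls inside any range of the table.
template <std::size_t N>
constexpr bool in_any(const Range (&ranges)[N], char32_t cp) {
    for (const Range& r : ranges)
        if (cp >= r.first && cp <= r.last) return true;
    return false;
}

// The 14 ranges listed in the existing wording of [format.string.std].
constexpr Range old_wide[] = {
    {0x1100, 0x115F},   {0x2329, 0x232A},   {0x2E80, 0x303E},
    {0x3040, 0xA4CF},   {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},
    {0xFE10, 0xFE19},   {0xFE30, 0xFE6F},   {0xFF00, 0xFF60},
    {0xFFE0, 0xFFE6},   {0x1F300, 0x1F64F}, {0x1F900, 0x1F9FF},
    {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD},
};

// ASSUMPTION: incomplete stand-in for "East_Asian_Width is W or F".
// A real implementation would generate this table from the Unicode
// Character Database instead of listing a handful of ranges by hand.
constexpr Range eaw_wide_excerpt[] = {
    {0x231A, 0x231B},   // WATCH..HOURGLASS (W)
    {0x4E00, 0x9FFF},   // CJK Unified Ideographs (W)
    {0xAC00, 0xD7A3},   // Hangul Syllables (W)
    {0xFF01, 0xFF60},   // Fullwidth forms (F)
    {0x17000, 0x187F7}, // Tangut ideographs (W)
};

// Blocks the proposed wording keeps by name.
constexpr Range grandfathered[] = {
    {0x4DC0, 0x4DFF},   // Yijing Hexagram Symbols
    {0x1F300, 0x1F5FF}, // Miscellaneous Symbols and Pictographs
    {0x1F900, 0x1F9FF}, // Supplemental Symbols and Pictographs
};

// Width under the existing wording: 2 inside the listed ranges, else 1.
constexpr int old_width(char32_t cp) {
    return in_any(old_wide, cp) ? 2 : 1;
}

// Width under the proposed wording, limited by the excerpt table above.
constexpr int new_width(char32_t cp) {
    return (in_any(eaw_wide_excerpt, cp) || in_any(grandfathered, cp)) ? 2 : 1;
}

// Sums the widths of the first code point of each extended grapheme
// cluster. The caller supplies those first code points already segmented.
inline int estimated_width(const std::vector<char32_t>& cluster_starts,
                           int (*width_of)(char32_t)) {
    int total = 0;
    for (char32_t cp : cluster_starts) total += width_of(cp);
    return total;
}
```

With these tables, U+231A WATCH moves from width 1 to 2, while U+3248 and the unassigned U+FF00 move from 2 to 1, which matches the three groups in the annex. Keeping the property lookup separate from the named blocks mirrors the shape of the proposed wording: the first table would track each Unicode release, and the second would stay fixed.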

Received on 2022-09-15 20:25:46