On Thu, Sep 15, 2022 at 10:12 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:

On Thu, Sep 15, 2022 at 5:59 PM Tom Honermann <tom@honermann.net> wrote:

Thank you for looking into this, Corentin. It would be great to have some kind of principled approach to how estimated widths are assigned.

I don't think this should be handled as an LWG issue though. This is a design change, so should go through LEWG (and probably SG16 as well).

I think it would be helpful to have a paper that includes screenshots (of select characters, not all 8500+ characters!) to demonstrate how common terminals display them today. The goal being to ensure that they are handled consistently across platforms before we standardize them.

A paper would also present the opportunity to attribute an estimated width of 0 (or a negative width!) to some characters. Perhaps it is worth asking whether an estimated width of 1 really makes sense for CR, LF, NL, VT, etc... It looks like these changes still don't cover edge cases like ﷽ (U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM).

I'm not trying to change the design, nor am I trying to make case-by-case decisions; the original paper covered that.
I think these are questions worth asking, sure, at least for control characters, since for an arbitrary non-CJK, non-emoji code point neither data nor a specification exists.
Note, however, that it's not straightforward:
* There is no spec so it's handled manually https://github.com/termux/wcwidth/blob/master/wcwidth.c#L497
* C1 control characters may be mapped differently on EBCDIC platforms
* Combining characters not preceded by a grapheme starter are in fact rendered
* We already ignore non leading codepoints in graphemes
* Negative width would be novel

I do not intend to explore these further but I'd be happy to review such work :)

However, the original paper intended for CJK characters and emojis to be covered, and Unicode 14 and 15 added a number of such characters; I considered this a bug.
But mostly, I consider specifying anything Unicode related in terms of partially-assigned ranges to be a bug reminiscent of the identifiers of old.

Now, I expect SG16 will deal with this issue, but all the information is in the issue.
The only design-ish question is whether we want to grandfather U+3248 CIRCLED NUMBER TEN ON BLACK SQUARE..U+324F CIRCLED NUMBER EIGHTY ON BLACK SQUARE for backward compatibility reasons.

Codepoints which would go from 1 to 2
https://gist.github.com/cor3ntin/e5731f77574b146d806e39283e8c7cb7
Screenshot of my terminal

Tom.

On 9/15/22 5:02 AM, Corentin via SG16 wrote:

Affected clause [format.string.std]

Issue:

For the purpose of width estimation, format considers ranges of codepoints initially derived from an implementation of wcwidth with modifications.

(See https://wg21.link/p1868r1)

This, however, presents a number of challenges:

From a reading of the spec, it is not clear how these ranges were selected.

Poor forward compatibility with future Unicode versions. The list will become less and less meaningful over time, or require active maintenance at each Unicode release (which we already have not done for Unicode 14).

Some of these codepoints are unassigned or otherwise reserved, which is another forward compatibility concern.

Instead, we propose to

Rely on UAX-11 (https://www.unicode.org/reports/tr11/) for most of the codepoints

Grandfather specific, fully assigned blocks of codepoints to support additional pictograms per the original intent of the paper and existing practice. We add the names of these blocks in the wording for clarity.

Note that per UAX-11

Most emojis are considered East_Asian_Width="W"

By design, East_Asian_Width="W" includes specific unassigned ranges, which should always be treated as Wide.
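An implementation relying on UAX-11 would typically generate its table from the EastAsianWidth.txt data file of the Unicode Character Database. The following is only a minimal sketch of parsing one line of that file, assuming lines shaped like `1100..115F;W # comment` (a single code point or a `first..last` range, a semicolon, the property value, then an optional comment). The names `eaw_entry`, `trim`, `parse_hex` and `parse_eaw_line` are hypothetical, and the sketch requires C++17.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// One parsed data line: an inclusive code point range and its property value.
struct eaw_entry {
    char32_t first;
    char32_t last;
    std::string property;  // e.g. "W", "F", "N"
};

// Strips leading and trailing spaces and tabs.
inline std::string_view trim(std::string_view s) {
    const auto b = s.find_first_not_of(" \t");
    if (b == std::string_view::npos) return {};
    const auto e = s.find_last_not_of(" \t");
    return s.substr(b, e - b + 1);
}

// Parses 1 to 6 hexadecimal digits; returns nullopt on anything else.
inline std::optional<char32_t> parse_hex(std::string_view s) {
    if (s.empty() || s.size() > 6) return std::nullopt;
    char32_t value = 0;
    for (char ch : s) {
        int digit = 0;
        if (ch >= '0' && ch <= '9') digit = ch - '0';
        else if (ch >= 'A' && ch <= 'F') digit = ch - 'A' + 10;
        else if (ch >= 'a' && ch <= 'f') digit = ch - 'a' + 10;
        else return std::nullopt;
        value = value * 16 + static_cast<char32_t>(digit);
    }
    return value;
}

// ASSUMPTION: the line has the shape "XXXX[..YYYY] ; PROP # comment".
// Returns nullopt for blank lines, comment-only lines and malformed input.
inline std::optional<eaw_entry> parse_eaw_line(std::string_view line) {
    line = trim(line.substr(0, line.find('#')));  // drop the trailing comment
    if (line.empty()) return std::nullopt;
    const auto semi = line.find(';');
    if (semi == std::string_view::npos) return std::nullopt;
    const std::string_view cps = trim(line.substr(0, semi));
    const std::string_view prop = trim(line.substr(semi + 1));
    const auto dots = cps.find("..");
    const auto first = parse_hex(cps.substr(0, dots));
    const auto last = dots == std::string_view::npos
                          ? first
                          : parse_hex(cps.substr(dots + 2));
    if (!first || !last || prop.empty()) return std::nullopt;
    return eaw_entry{*first, *last, std::string(prop)};
}
```

Keeping only the entries whose property is "W" or "F" would yield the table of width-2 code points, including the unassigned ranges that UAX-11 deliberately marks as wide.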

This change:

Considers 8477 extra codepoints as having a width of 2 (as of Unicode 15) (mostly Tangut Ideographs)

Changes the width of 85 unassigned code points from 2 to 1

Changes the width of 8 codepoints (in the range U+3248 CIRCLED NUMBER TEN ON BLACK SQUARE ..U+324F CIRCLED NUMBER EIGHTY ON BLACK SQUARE) from 2 to 1, because it seems questionable to make an exception for those without input from Unicode

Proposed wording:

Modify [format.string.std]/p12

For a string in a Unicode encoding, implementations should estimate the width of a string as the sum of estimated widths of the first code points in its extended grapheme clusters. The extended grapheme clusters of a string are defined by UAX #29. The estimated width of the following code points is 2:

~~U+1100 – U+115F~~

~~U+2329 – U+232A~~

~~U+2E80 – U+303E~~

~~U+3040 – U+A4CF~~

~~U+AC00 – U+D7A3~~

~~U+F900 – U+FAFF~~

~~U+FE10 – U+FE19~~

~~U+FE30 – U+FE6F~~

~~U+FF00 – U+FF60~~

~~U+FFE0 – U+FFE6~~

~~U+1F300 – U+1F64F~~

~~U+1F900 – U+1F9FF~~

~~U+20000 – U+2FFFD~~

~~U+30000 – U+3FFFD~~

Any codepoint with the East_Asian_Width="W" or East_Asian_Width="F" Derived Extracted Property as described by UAX #44

U+4DC0 - U+4DFF (Yijing Hexagram Symbols)

U+1F300 - U+1F5FF (Miscellaneous Symbols and Pictographs)

U+1F900 - U+1F9FF (Supplemental Symbols and Pictographs)

The estimated width of other code points is 1.

---- END OF WORDING ----
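To illustrate the shape of the proposed rule, here is a rough sketch of a per-code-point lookup. It is not how any implementation does it. The `eaw_wide_excerpt` table is a deliberately incomplete stand-in for the full East_Asian_Width W/F data and holds only three ranges copied from the annex below, so it will misreport most wide characters. All names (`range`, `contains`, `sketch_estimated_width`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

struct range {
    char32_t first;
    char32_t last;  // inclusive
};

// ASSUMPTION: a tiny, incomplete excerpt standing in for "any code point with
// East_Asian_Width W or F". A real table would be generated from Unicode data.
constexpr range eaw_wide_excerpt[] = {
    {0x231A, 0x231B},    // WATCH..HOURGLASS
    {0x17000, 0x187F7},  // Tangut ideographs
    {0x1F680, 0x1F6C5},  // ROCKET..LEFT LUGGAGE
};

// The three blocks named explicitly in the proposed wording.
constexpr range grandfathered_blocks[] = {
    {0x4DC0, 0x4DFF},    // Yijing Hexagram Symbols
    {0x1F300, 0x1F5FF},  // Miscellaneous Symbols and Pictographs
    {0x1F900, 0x1F9FF},  // Supplemental Symbols and Pictographs
};

// Linear search; a real table would be sorted and binary-searched.
template <std::size_t N>
constexpr bool contains(const range (&table)[N], char32_t c) {
    for (const range& r : table) {
        if (r.first <= c && c <= r.last) return true;
    }
    return false;
}

// Width 2 for W/F code points and the grandfathered blocks, 1 otherwise.
constexpr int sketch_estimated_width(char32_t c) {
    return (contains(eaw_wide_excerpt, c) || contains(grandfathered_blocks, c))
               ? 2
               : 1;
}
```

With this sketch, `sketch_estimated_width(0x231A)` and `sketch_estimated_width(0x4DC0)` yield 2, while `sketch_estimated_width(0x3248)` yields 1, matching the annex. The width of a whole string would then be the sum of this value over the first code point of each extended grapheme cluster, which is not shown here.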

Annex: Differences introduced by this change

// Missing from the standard (count: 8477)

// Used to be of width 1, becomes 2

U+231A WATCH ..U+231B HOURGLASS
U+23E9 BLACK RIGHT-POINTING DOUBLE TRIANGLE ..U+23EC BLACK DOWN-POINTING DOUBLE TRIANGLE
U+23F0 ALARM CLOCK
U+23F3 HOURGLASS WITH FLOWING SAND
U+25FD WHITE MEDIUM SMALL SQUARE ..U+25FE BLACK MEDIUM SMALL SQUARE
U+2614 UMBRELLA WITH RAIN DROPS ..U+2615 HOT BEVERAGE
U+2648 ARIES ..U+2653 PISCES
U+267F WHEELCHAIR SYMBOL
U+2693 ANCHOR
U+26A1 HIGH VOLTAGE SIGN
U+26AA MEDIUM WHITE CIRCLE ..U+26AB MEDIUM BLACK CIRCLE
U+26BD SOCCER BALL ..U+26BE BASEBALL
U+26C4 SNOWMAN WITHOUT SNOW ..U+26C5 SUN BEHIND CLOUD
U+26CE OPHIUCHUS
U+26D4 NO ENTRY
U+26EA CHURCH
U+26F2 FOUNTAIN ..U+26F3 FLAG IN HOLE
U+26F5 SAILBOAT
U+26FA TENT
U+26FD FUEL PUMP
U+2705 WHITE HEAVY CHECK MARK
U+270A RAISED FIST ..U+270B RAISED HAND
U+2728 SPARKLES
U+274C CROSS MARK
U+274E NEGATIVE SQUARED CROSS MARK
U+2753 BLACK QUESTION MARK ORNAMENT ..U+2755 WHITE EXCLAMATION MARK ORNAMENT
U+2757 HEAVY EXCLAMATION MARK SYMBOL
U+2795 HEAVY PLUS SIGN ..U+2797 HEAVY DIVISION SIGN
U+27B0 CURLY LOOP
U+27BF DOUBLE CURLY LOOP
U+2B1B BLACK LARGE SQUARE ..U+2B1C WHITE LARGE SQUARE
U+2B50 WHITE MEDIUM STAR
U+2B55 HEAVY LARGE CIRCLE
U+A960 HANGUL CHOSEONG TIKEUT-MIEUM ..U+A97C HANGUL CHOSEONG SSANGYEORINHIEUH
U+16FE0 TANGUT ITERATION MARK ..U+16FE4 KHITAN SMALL SCRIPT FILLER
U+16FF0 VIETNAMESE ALTERNATE READING MARK CA ..U+16FF1 VIETNAMESE ALTERNATE READING MARK NHAY
U+17000 TANGUT IDEOGRAPH-# ..U+187F7 TANGUT IDEOGRAPH-#
U+18800 TANGUT COMPONENT-001 ..U+18CD5 KHITAN SMALL SCRIPT CHARACTER-#
U+18D00 TANGUT IDEOGRAPH-# ..U+18D08 TANGUT IDEOGRAPH-#
U+1AFF0 KATAKANA LETTER MINNAN TONE-2 ..U+1AFF3 KATAKANA LETTER MINNAN TONE-5
U+1AFF5 KATAKANA LETTER MINNAN TONE-7 ..U+1AFFB KATAKANA LETTER MINNAN NASALIZED TONE-5
U+1AFFD KATAKANA LETTER MINNAN NASALIZED TONE-7 ..U+1AFFE KATAKANA LETTER MINNAN NASALIZED TONE-8
U+1B000 KATAKANA LETTER ARCHAIC E ..U+1B122 KATAKANA LETTER ARCHAIC WU
U+1B132 HIRAGANA LETTER SMALL KO
U+1B150 HIRAGANA LETTER SMALL WI ..U+1B152 HIRAGANA LETTER SMALL WO
U+1B155 KATAKANA LETTER SMALL KO
U+1B164 KATAKANA LETTER SMALL WI ..U+1B167 KATAKANA LETTER SMALL N
U+1B170 NUSHU CHARACTER-# ..U+1B2FB NUSHU CHARACTER-#
U+1F004 MAHJONG TILE RED DRAGON
U+1F0CF PLAYING CARD BLACK JOKER
U+1F18E NEGATIVE SQUARED AB
U+1F191 SQUARED CL ..U+1F19A SQUARED VS
U+1F200 SQUARE HIRAGANA HOKA ..U+1F202 SQUARED KATAKANA SA
U+1F210 SQUARED CJK UNIFIED IDEOGRAPH-624B ..U+1F23B SQUARED CJK UNIFIED IDEOGRAPH-914D
U+1F240 TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-672C ..U+1F248 TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-6557
U+1F250 CIRCLED IDEOGRAPH ADVANTAGE ..U+1F251 CIRCLED IDEOGRAPH ACCEPT
U+1F260 ROUNDED SYMBOL FOR FU ..U+1F265 ROUNDED SYMBOL FOR CAI
U+1F680 ROCKET ..U+1F6C5 LEFT LUGGAGE
U+1F6CC SLEEPING ACCOMMODATION
U+1F6D0 PLACE OF WORSHIP ..U+1F6D2 SHOPPING TROLLEY
U+1F6D5 HINDU TEMPLE ..U+1F6D7 ELEVATOR
U+1F6DC WIRELESS ..U+1F6DF RING BUOY
U+1F6EB AIRPLANE DEPARTURE ..U+1F6EC AIRPLANE ARRIVING
U+1F6F4 SCOOTER ..U+1F6FC ROLLER SKATE
U+1F7E0 LARGE ORANGE CIRCLE ..U+1F7EB LARGE BROWN SQUARE
U+1F7F0 HEAVY EQUALS SIGN
U+1FA70 BALLET SHOES ..U+1FA7C CRUTCH
U+1FA80 YO-YO ..U+1FA88 FLUTE
U+1FA90 RINGED PLANET ..U+1FABD WING
U+1FABF GOOSE ..U+1FAC5 PERSON WITH CROWN
U+1FACE MOOSE ..U+1FADB PEA POD
U+1FAE0 MELTING FACE ..U+1FAE8 SHAKING FACE
U+1FAF0 HAND WITH INDEX FINGER AND THUMB CROSSED ..U+1FAF8 RIGHTWARDS PUSHING HAND

// Reserved (count: 85) - Used to be of width 2, becomes 1

U+2E9A
U+2EF4 ..U+2EFF
U+2FD6 ..U+2FEF
U+2FFC ..U+2FFF
U+3040
U+3097 ..U+3098
U+3100 ..U+3104
U+3130
U+318F
U+31E4 ..U+31EF
U+321F
U+A48D ..U+A48F
U+A4C7 ..U+A4CF
U+FE53
U+FE67
U+FE6C ..U+FE6F
U+FF00

// Used to be of width 2, becomes 1
U+3248 CIRCLED NUMBER TEN ON BLACK SQUARE ..U+324F CIRCLED NUMBER EIGHTY ON BLACK SQUARE
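
The "used to be" side of this annex can be checked against the ranges struck from the wording above. The following is a small sketch under that assumption; the table is copied from the struck ranges, and the names `range` and `old_estimated_width` are hypothetical.

```cpp
#include <cassert>

struct range {
    char32_t first;
    char32_t last;  // inclusive
};

// The ranges removed from [format.string.std]/p12 by the proposed wording.
constexpr range old_wide_ranges[] = {
    {0x1100, 0x115F},   {0x2329, 0x232A},   {0x2E80, 0x303E},
    {0x3040, 0xA4CF},   {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},
    {0xFE10, 0xFE19},   {0xFE30, 0xFE6F},   {0xFF00, 0xFF60},
    {0xFFE0, 0xFFE6},   {0x1F300, 0x1F64F}, {0x1F900, 0x1F9FF},
    {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD},
};

// Width under the pre-change wording: 2 inside a listed range, 1 otherwise.
constexpr int old_estimated_width(char32_t c) {
    for (const range& r : old_wide_ranges) {
        if (r.first <= c && c <= r.last) return 2;
    }
    return 1;
}
```

For example, `old_estimated_width(0x231A)` (WATCH) is 1 and `old_estimated_width(0x3248)` is 2, which are the starting values the annex lists for those code points.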

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16