ISOCPP sg16 List: Re: LWG ISSUE: Format's width estimation is too approximate and not forward compatible.

From: Daniel Krügler <daniel.kruegler_at_[hidden]>
Date: Sat, 17 Sep 2022 19:18:12 +0200

On Thu, 15 Sept 2022 at 11:03, Corentin <corentin.jabot_at_[hidden]> wrote:
>
>
> Affected clause [format.string.std]
>
> Issue:
> For the purpose of width estimation, format considers ranges of codepoints initially derived from an implementation of wcwidth with modifications.
> (See https://wg21.link/p1868r1)
>
> This, however, presents a number of challenges:
>
> - From a reading of the spec, it is not clear how these ranges were selected.
> - Forward compatibility with future Unicode versions is poor. The list will become less and less meaningful over time, or will require active maintenance at each Unicode release (which we have already not done for Unicode 14).
> - Some of these codepoints are unassigned or otherwise reserved, which is another forward compatibility concern.
>
> Instead, we propose to:
>
> - Rely on UAX #11 (https://www.unicode.org/reports/tr11/) for most of the codepoints.
> - Grandfather specific, fully assigned blocks of codepoints to support additional pictograms, per the original intent of the paper and existing practice. We add the names of these blocks to the wording for clarity.
>
> Note that per UAX #11:
>
> - Most emojis are considered East_Asian_Width="W".
> - By design, East_Asian_Width="W" includes specific unassigned ranges, which should always be treated as Wide.
>
> This change:
>
> - Considers 8477 extra codepoints as having a width of 2 (as of Unicode 15), mostly Tangut Ideographs.
> - Changes the width of 85 unassigned codepoints from 2 to 1.
> - Changes the width of 8 codepoints (in the range U+3248 CIRCLED NUMBER TEN ON BLACK SQUARE ..U+324F CIRCLED NUMBER EIGHTY ON BLACK SQUARE) from 2 to 1, because it seems questionable to make an exception for those without input from Unicode.
>
>
> Proposed wording:
>
> Modify [format.string.std]/p12
>
> For a string in a Unicode encoding, implementations should estimate the width of a string as the sum of estimated widths of the first code points in its extended grapheme clusters. The extended grapheme clusters of a string are defined by UAX #29. The estimated width of the following code points is 2:
>
> [Removed:]
> U+1100 – U+115F
> U+2329 – U+232A
> U+2E80 – U+303E
> U+3040 – U+A4CF
> U+AC00 – U+D7A3
> U+F900 – U+FAFF
> U+FE10 – U+FE19
> U+FE30 – U+FE6F
> U+FF00 – U+FF60
> U+FFE0 – U+FFE6
> U+1F300 – U+1F64F
> U+1F900 – U+1F9FF
> U+20000 – U+2FFFD
> U+30000 – U+3FFFD
>
> [Added:]
> Any codepoint with the East_Asian_Width="W" or East_Asian_Width="F" Derived Extracted Property as described by UAX #44
> U+4DC0 - U+4DFF (Yijing Hexagram Symbols)
> U+1F300 - U+1F5FF (Miscellaneous Symbols and Pictographs)
> U+1F900 - U+1F9FF (Supplemental Symbols and Pictographs)
>
> The estimated width of other code points is 1.
>
>
> ---- END OF WORDING ----
>
> Annex: Differences introduced by this change
>
> // Missing from the standard (count: 8477)
> // Used to be of width 1, becomes 2
>
> U+231A WATCH ..U+231B HOURGLASS
> U+23E9 BLACK RIGHT-POINTING DOUBLE TRIANGLE ..U+23EC BLACK DOWN-POINTING DOUBLE TRIANGLE
> U+23F0 ALARM CLOCK
> U+23F3 HOURGLASS WITH FLOWING SAND
> U+25FD WHITE MEDIUM SMALL SQUARE ..U+25FE BLACK MEDIUM SMALL SQUARE
> U+2614 UMBRELLA WITH RAIN DROPS ..U+2615 HOT BEVERAGE
> U+2648 ARIES ..U+2653 PISCES
> U+267F WHEELCHAIR SYMBOL
> U+2693 ANCHOR
> U+26A1 HIGH VOLTAGE SIGN
> U+26AA MEDIUM WHITE CIRCLE ..U+26AB MEDIUM BLACK CIRCLE
> U+26BD SOCCER BALL ..U+26BE BASEBALL
> U+26C4 SNOWMAN WITHOUT SNOW ..U+26C5 SUN BEHIND CLOUD
> U+26CE OPHIUCHUS
> U+26D4 NO ENTRY
> U+26EA CHURCH
> U+26F2 FOUNTAIN ..U+26F3 FLAG IN HOLE
> U+26F5 SAILBOAT
> U+26FA TENT
> U+26FD FUEL PUMP
> U+2705 WHITE HEAVY CHECK MARK
> U+270A RAISED FIST ..U+270B RAISED HAND
> U+2728 SPARKLES
> U+274C CROSS MARK
> U+274E NEGATIVE SQUARED CROSS MARK
> U+2753 BLACK QUESTION MARK ORNAMENT ..U+2755 WHITE EXCLAMATION MARK ORNAMENT
> U+2757 HEAVY EXCLAMATION MARK SYMBOL
> U+2795 HEAVY PLUS SIGN ..U+2797 HEAVY DIVISION SIGN
> U+27B0 CURLY LOOP
> U+27BF DOUBLE CURLY LOOP
> U+2B1B BLACK LARGE SQUARE ..U+2B1C WHITE LARGE SQUARE
> U+2B50 WHITE MEDIUM STAR
> U+2B55 HEAVY LARGE CIRCLE
> U+A960 HANGUL CHOSEONG TIKEUT-MIEUM ..U+A97C HANGUL CHOSEONG SSANGYEORINHIEUH
> U+16FE0 TANGUT ITERATION MARK ..U+16FE4 KHITAN SMALL SCRIPT FILLER
> U+16FF0 VIETNAMESE ALTERNATE READING MARK CA ..U+16FF1 VIETNAMESE ALTERNATE READING MARK NHAY
> U+17000 TANGUT IDEOGRAPH-# ..U+187F7 TANGUT IDEOGRAPH-#
> U+18800 TANGUT COMPONENT-001 ..U+18CD5 KHITAN SMALL SCRIPT CHARACTER-#
> U+18D00 TANGUT IDEOGRAPH-# ..U+18D08 TANGUT IDEOGRAPH-#
> U+1AFF0 KATAKANA LETTER MINNAN TONE-2 ..U+1AFF3 KATAKANA LETTER MINNAN TONE-5
> U+1AFF5 KATAKANA LETTER MINNAN TONE-7 ..U+1AFFB KATAKANA LETTER MINNAN NASALIZED TONE-5
> U+1AFFD KATAKANA LETTER MINNAN NASALIZED TONE-7 ..U+1AFFE KATAKANA LETTER MINNAN NASALIZED TONE-8
> U+1B000 KATAKANA LETTER ARCHAIC E ..U+1B122 KATAKANA LETTER ARCHAIC WU
> U+1B132 HIRAGANA LETTER SMALL KO
> U+1B150 HIRAGANA LETTER SMALL WI ..U+1B152 HIRAGANA LETTER SMALL WO
> U+1B155 KATAKANA LETTER SMALL KO
> U+1B164 KATAKANA LETTER SMALL WI ..U+1B167 KATAKANA LETTER SMALL N
> U+1B170 NUSHU CHARACTER-# ..U+1B2FB NUSHU CHARACTER-#
> U+1F004 MAHJONG TILE RED DRAGON
> U+1F0CF PLAYING CARD BLACK JOKER
> U+1F18E NEGATIVE SQUARED AB
> U+1F191 SQUARED CL ..U+1F19A SQUARED VS
> U+1F200 SQUARE HIRAGANA HOKA ..U+1F202 SQUARED KATAKANA SA
> U+1F210 SQUARED CJK UNIFIED IDEOGRAPH-624B ..U+1F23B SQUARED CJK UNIFIED IDEOGRAPH-914D
> U+1F240 TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-672C ..U+1F248 TORTOISE SHELL BRACKETED CJK UNIFIED IDEOGRAPH-6557
> U+1F250 CIRCLED IDEOGRAPH ADVANTAGE ..U+1F251 CIRCLED IDEOGRAPH ACCEPT
> U+1F260 ROUNDED SYMBOL FOR FU ..U+1F265 ROUNDED SYMBOL FOR CAI
> U+1F680 ROCKET ..U+1F6C5 LEFT LUGGAGE
> U+1F6CC SLEEPING ACCOMMODATION
> U+1F6D0 PLACE OF WORSHIP ..U+1F6D2 SHOPPING TROLLEY
> U+1F6D5 HINDU TEMPLE ..U+1F6D7 ELEVATOR
> U+1F6DC WIRELESS ..U+1F6DF RING BUOY
> U+1F6EB AIRPLANE DEPARTURE ..U+1F6EC AIRPLANE ARRIVING
> U+1F6F4 SCOOTER ..U+1F6FC ROLLER SKATE
> U+1F7E0 LARGE ORANGE CIRCLE ..U+1F7EB LARGE BROWN SQUARE
> U+1F7F0 HEAVY EQUALS SIGN
> U+1FA70 BALLET SHOES ..U+1FA7C CRUTCH
> U+1FA80 YO-YO ..U+1FA88 FLUTE
> U+1FA90 RINGED PLANET ..U+1FABD WING
> U+1FABF GOOSE ..U+1FAC5 PERSON WITH CROWN
> U+1FACE MOOSE ..U+1FADB PEA POD
> U+1FAE0 MELTING FACE ..U+1FAE8 SHAKING FACE
> U+1FAF0 HAND WITH INDEX FINGER AND THUMB CROSSED ..U+1FAF8 RIGHTWARDS PUSHING HAND
>
> // Reserved (count: 85)
> // Used to be of width 2, becomes 1
>
> U+2E9A
> U+2EF4 ..U+2EFF
> U+2FD6 ..U+2FEF
> U+2FFC ..U+2FFF
> U+3040
> U+3097 ..U+3098
> U+3100 ..U+3104
> U+3130
> U+318F
> U+31E4 ..U+31EF
> U+321F
> U+A48D ..U+A48F
> U+A4C7 ..U+A4CF
> U+FE53
> U+FE67
> U+FE6C ..U+FE6F
> U+FF00
>
> // Used to be of width 2, becomes 1
> U+3248 CIRCLED NUMBER TEN ON BLACK SQUARE ..U+324F CIRCLED NUMBER EIGHTY ON BLACK SQUARE
>
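To make the effect of the quoted wording concrete, here is a rough C++17 sketch comparing the fixed range list with the property-based rule. It is only an illustration under stated assumptions, not how any standard library implements it: the names (CodePointRange, old_estimated_width, new_estimated_width) are made up for this example, and kEastAsianWideExcerpt is a tiny excerpt taken from the annex above, where a real implementation would generate the full W/F table from the Unicode Character Database.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper type: an inclusive range of code points.
struct CodePointRange {
    char32_t first;
    char32_t last;
};

// Linear search; returns true if cp falls inside any of the ranges.
template <std::size_t N>
constexpr bool contains(const std::array<CodePointRange, N>& ranges, char32_t cp) {
    for (const auto& r : ranges) {
        if (cp >= r.first && cp <= r.last) return true;
    }
    return false;
}

// The fixed list of width-2 ranges quoted in the wording above.
constexpr std::array<CodePointRange, 14> kOldWide{{
    {0x1100, 0x115F},   {0x2329, 0x232A},   {0x2E80, 0x303E},
    {0x3040, 0xA4CF},   {0xAC00, 0xD7A3},   {0xF900, 0xFAFF},
    {0xFE10, 0xFE19},   {0xFE30, 0xFE6F},   {0xFF00, 0xFF60},
    {0xFFE0, 0xFFE6},   {0x1F300, 0x1F64F}, {0x1F900, 0x1F9FF},
    {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD}}};

// The three named blocks that the wording keeps at width 2.
constexpr std::array<CodePointRange, 3> kGrandfathered{{
    {0x4DC0, 0x4DFF},     // Yijing Hexagram Symbols
    {0x1F300, 0x1F5FF},   // Miscellaneous Symbols and Pictographs
    {0x1F900, 0x1F9FF}}}; // Supplemental Symbols and Pictographs

// ASSUMPTION: a tiny excerpt of East_Asian_Width W/F ranges, copied from the
// "becomes 2" list in the annex. This is NOT the full property table.
constexpr std::array<CodePointRange, 3> kEastAsianWideExcerpt{{
    {0x231A, 0x231B},     // WATCH .. HOURGLASS
    {0x17000, 0x187F7},   // Tangut ideographs
    {0x1F680, 0x1F6C5}}}; // ROCKET .. LEFT LUGGAGE

// Estimate based on the fixed range list.
constexpr int old_estimated_width(char32_t cp) {
    return contains(kOldWide, cp) ? 2 : 1;
}

// Estimate based on the property (excerpt only) plus the grandfathered blocks.
constexpr int new_estimated_width(char32_t cp) {
    return (contains(kEastAsianWideExcerpt, cp) || contains(kGrandfathered, cp)) ? 2 : 1;
}

// U+231A WATCH: width 1 under the fixed list, 2 under the property-based rule.
static_assert(old_estimated_width(0x231A) == 1 && new_estimated_width(0x231A) == 2);
// U+3248 CIRCLED NUMBER TEN ON BLACK SQUARE: goes from 2 to 1.
static_assert(old_estimated_width(0x3248) == 2 && new_estimated_width(0x3248) == 1);
```

Because the excerpt table is incomplete, new_estimated_width is only meaningful for the code points listed in the annex; it would under-report ordinary CJK text until the full table is substituted.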

I created a new issue with the following changes and additional information:

- The standard uses the term "code point" instead of "codepoint".
- I think the associated annex would be Annex C, which is only
  informative and would typically only provide examples of actual
  differences. I tried to use the typical phrasing used in Annex C.
  I also used only two blocks of lists, one for 1 to 2 and one for 2 to
  1.
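
As a sanity check on the merged "2 to 1" block, the following C++17 sketch tallies the ranges from the quoted annex at compile time. The type and function names are made up for this example; the ranges themselves are copied from the two "becomes 1" lists above.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper type: an inclusive range of code points.
struct CodePointRange {
    char32_t first;
    char32_t last;
};

// Number of code points covered by a list of inclusive, non-overlapping ranges.
template <std::size_t N>
constexpr std::size_t count_code_points(const std::array<CodePointRange, N>& ranges) {
    std::size_t total = 0;
    for (const auto& r : ranges) {
        total += static_cast<std::size_t>(r.last - r.first) + 1;
    }
    return total;
}

// Reserved code points listed in the annex (width 2 becomes 1).
constexpr std::array<CodePointRange, 17> kReserved{{
    {0x2E9A, 0x2E9A}, {0x2EF4, 0x2EFF}, {0x2FD6, 0x2FEF}, {0x2FFC, 0x2FFF},
    {0x3040, 0x3040}, {0x3097, 0x3098}, {0x3100, 0x3104}, {0x3130, 0x3130},
    {0x318F, 0x318F}, {0x31E4, 0x31EF}, {0x321F, 0x321F}, {0xA48D, 0xA48F},
    {0xA4C7, 0xA4CF}, {0xFE53, 0xFE53}, {0xFE67, 0xFE67}, {0xFE6C, 0xFE6F},
    {0xFF00, 0xFF00}}};

// U+3248 .. U+324F, the circled numbers on black squares (width 2 becomes 1).
constexpr std::array<CodePointRange, 1> kCircledNumbers{{{0x3248, 0x324F}}};

// The counts match the figures given in the quoted message.
static_assert(count_code_points(kReserved) == 85);
static_assert(count_code_points(kCircledNumbers) == 8);

// Size of the combined "2 to 1" block.
constexpr std::size_t kMergedTwoToOne =
    count_code_points(kReserved) + count_code_points(kCircledNumbers);
```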

Please reload and double-check:

https://cplusplus.github.io/LWG/issue3780

Thanks,

- Daniel

Received on 2022-09-17 17:18:23