[SG16-Unicode] P0645 Text Formatting review followup

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Sun, 8 Jul 2018 14:03:47 -0700

Just a small followup on our discussion of P0645 Text Formatting during the
previous meeting.

1. Interpretation of width with multibyte encodings and combining
characters.

P0645R2 currently doesn't specify the units of width. Possible options are
(from lower to higher abstraction level):

* Code units
* Code points
* Grapheme clusters
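To make the first two units concrete, here is a minimal Python sketch that
measures the same text in UTF-8 code units, UTF-16 code units, and code
points. The helper name count_units is made up for illustration and is not
part of P0645 or fmt. Grapheme clusters are left out because the Python
standard library has no UAX #29 segmentation.

```python
def count_units(s: str) -> dict:
    """Measure a string in the candidate width units (hypothetical helper)."""
    return {
        # Number of bytes in the UTF-8 encoding.
        "utf8_code_units": len(s.encode("utf-8")),
        # Number of 16-bit units in the UTF-16 encoding (no BOM).
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        # Python's len() on str counts code points.
        "code_points": len(s),
    }


decomposed = "o\u0308"   # 'o' followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00f6"   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS

print(count_units(decomposed))   # 3 UTF-8 units, 2 UTF-16 units, 2 code points
print(count_units(precomposed))  # 2 UTF-8 units, 1 UTF-16 unit, 1 code point
```

Both strings render as one user-perceived character, yet none of the three
counts agrees across the two spellings.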

Python 3 uses code points, as the following example shows (the first string
is "o" followed by U+0308 COMBINING DIAERESIS; the second is the precomposed
U+00F6):

>>> o = b'\x6F\xCC\x88'.decode('utf8')
>>> o
'ö'
>>> '{:>2}'.format(o)
'ö' # note missing space
>>> o = b'\xC3\xB6'.decode('utf8')
>>> o
'ö'
>>> '{:>2}'.format(o)
' ö'

I have a slight preference for grapheme clusters because, according to
Unicode Standard Annex #29, Unicode Text Segmentation
<http://unicode.org/reports/tr29/>, they correspond to “user-perceived
characters” (at least that seems to be the intention; whether they succeed
in that is another question).

Zach provided an example of "👨\u200D👩\u200D👧\u200D👦", where \u200D is a
zero-width joiner (ZWJ), rendered as a single glyph representing a
family "👨‍👩‍👧‍👦".
However, if I interpret the following part of
http://www.unicode.org/reports/tr29/tr29-29.html#GB10 correctly:

> Do not break within emoji modifier sequences or emoji zwj sequences.

this is not a problem and "👨\u200D👩\u200D👧\u200D👦" will constitute a
single grapheme cluster.
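A rough Python sketch of what this means for the family example follows. The
function approx_grapheme_count is a hypothetical, deliberately simplified
stand-in for UAX #29: it only keeps combining marks and ZWJ-joined characters
inside the current cluster, and it does not check the emoji properties that
the real rules require.

```python
import unicodedata

ZWJ = "\u200d"


def approx_grapheme_count(s: str) -> int:
    """Very rough cluster count; NOT a full UAX #29 implementation."""
    count = 0
    prev = None
    for ch in s:
        joins_previous = prev is not None and (
            unicodedata.combining(ch) != 0  # combining mark extends the cluster
            or ch == ZWJ                    # the joiner itself extends the cluster
            or prev == ZWJ                  # simplification: anything after ZWJ joins
        )
        if not joins_previous:
            count += 1
        prev = ch
    return count


# MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(family))                    # 7 code points
print(len(family.encode("utf-8")))    # 25 UTF-8 code units
print(approx_grapheme_count(family))  # 1 under this simplified rule
print(len("{:>8}".format(family)))    # 8: Python pads by code points, adding 1 space
```

With code points as the width unit, the single family glyph already counts
as 7 toward the field width.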

That said, making grapheme clusters the unit of width may add significant
complexity to the implementation for minor benefit, so I'm fine going with
code points, especially since there is established precedent for doing this
(Python) and it's already an improvement over stdio and iostreams.

2. Interpretation of fill.

There seemed to be general agreement that fill should be a single code
point, but please let me know if you have other ideas.
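For comparison, Python's format specification also takes a single code point
as fill. The sketch below uses a made-up helper, try_fill, to show the
consequence: a precomposed character works as fill, while the same character
spelled as two code points is rejected. This illustrates Python's behavior
only and says nothing about how fmt will handle it.

```python
def try_fill(fill: str) -> str:
    """Right-align 'a' in a field of width 3 using the given fill."""
    try:
        return format("a", f"{fill}>3")
    except ValueError as exc:
        # Python rejects a fill that is more than one code point.
        return f"ValueError: {exc}"


precomposed = "\u00f6"  # one code point
decomposed = "o\u0308"  # two code points, same user-perceived character

print(try_fill("*"))          # **a
print(try_fill(precomposed))  # two precomposed fill characters, then 'a'
print(try_fill(decomposed))   # a ValueError message
```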

3. There was a question about signed and unsigned char. I checked, and there
is no special handling for these types, which means that they are treated as
integral types. Only char and wchar_t are treated specially as character
types (charN_t will be added later).

4. Interpretation of n in format_to_n.

There was no agreement on whether n should be specified in code units or code
points. An argument in favor of code units is that n often gives the output
buffer size. On the other hand, using code points would be more consistent
with width.
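The trade-off can be sketched in Python with two hypothetical truncation
helpers, one per interpretation of n. They only model the counting question
and are not how format_to_n is implemented: counting code units respects the
buffer size but can cut a code point in half, while counting code points
keeps the text valid but may need more than n code units of storage.

```python
def truncate_code_units(s: str, n: int) -> bytes:
    """Keep at most n UTF-8 code units; may split a code point."""
    return s.encode("utf-8")[:n]


def truncate_code_points(s: str, n: int) -> bytes:
    """Keep at most n code points; the byte length of the result varies."""
    return s[:n].encode("utf-8")


text = "a\u00f6b"  # 3 code points, 4 UTF-8 code units

by_units = truncate_code_units(text, 2)
by_points = truncate_code_points(text, 2)

print(by_units)        # b'a\xc3' (fits in 2 bytes, ends mid code point)
print(by_points)       # b'a\xc3\xb6' (valid UTF-8, needs 3 bytes)

try:
    by_units.decode("utf-8")
except UnicodeDecodeError:
    print("code-unit truncation produced invalid UTF-8")
```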

I plan to add support for specifying width and fill as code points in fmt (Zach
gave some useful pointers on how to do that, thanks!) and will report back
with any user feedback.

Cheers,
Victor

Received on 2018-07-08 23:04:08