Re: [SG16-Unicode] P0645 Text Formatting review followup

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 23 Jul 2018 00:04:48 -0400
SG16 discussed this topic some more at our last meeting. Notes are
available at:
-
https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#july-11th-2018

Victor suggested the idea of defining field widths using an
encoding-agnostic concept that maps to extended grapheme clusters (EGCs)
for Unicode encodings. For other encodings, it would presumably map in
some implementation-defined way, probably 1:1 to code points. I like this
idea. During the discussion, an experiment was suggested: take Eric
Niebler's range-v3 calendar example
<https://github.com/ericniebler/range-v3/blob/master/example/calendar.cpp>
[1] and modify it to print emojis for holidays. For example, to
substitute U+1F384 Christmas Tree for December 25th:

         October               November              December
                1  2  3   1  2  3  4  5  6  7         1  2  3  4  5
    4  5  6  7  8  9 10   8  9 10 11 12 13 14   6  7  8  9 10 11 12
   11 12 13 14 15 16 17  15 16 17 18 19 20 21  13 14 15 16 17 18 19
   18 19 20 21 22 23 24  22 23 24 25  🦃 27 28  20 21 22 23 24  🎄 26
   25 26 27 28 29 30  🎃  29 30                 27 28 29 30 31

This email is formatted to use a fixed width font for the calendar data
above. On my system, each emoji is rendered so that it consumes more
than one (but less than two!) columns of output, breaking the intended
presentation. This presumably occurs because fixed width variants of
these characters are not available. This makes me wonder how useful
EGCs will be as field width units in practice.
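
As a rough illustration of the mismatch, the sketch below measures the
three holiday emoji in UTF-8 code units, in code points, and in the
column count that a common terminal-style heuristic would assign. It
relies on Python's standard unicodedata.east_asian_width; the columns()
and measure() helpers and the "Wide or Fullwidth counts as two columns"
rule are assumptions made for illustration, not anything P0645 or fmt
specifies.

```python
import unicodedata

def columns(text):
    """Hypothetical heuristic: Wide/Fullwidth code points take 2 columns, others 1."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

def measure(text):
    """Report the size of text in three candidate units."""
    return {
        "code_units_utf8": len(text.encode("utf-8")),
        "code_points": len(text),
        "columns": columns(text),
    }

HOLIDAYS = {
    "Halloween": "\U0001F383",     # JACK-O-LANTERN
    "Thanksgiving": "\U0001F983",  # TURKEY
    "Christmas": "\U0001F384",     # CHRISTMAS TREE
}

for name, emoji in HOLIDAYS.items():
    print(name, measure(emoji))
```

Each emoji is a single code point (and a single EGC), yet the heuristic
assigns it two columns, so a field padded on the assumption of one
column per EGC will not line up with two-digit day numbers even when
the font does provide a true double-width glyph.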

Mark Davis recently posted the following link to the (not SG16) Unicode
mailing list. This discusses, amongst a number of other interesting
topics, rendering of emojis as single user perceived characters vs
multiple user perceived characters. This is relevant for the discussion
of family emojis.
-
https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview
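
For concreteness, here is a small sketch that takes apart the family
emoji from Zach's example (quoted below) into its code points, using
only Python's standard unicodedata module. The describe() helper and
the variable names are hypothetical and exist only for this
illustration.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Man, woman, girl, boy joined by ZWJ, as in the example quoted below.
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]

for code_point, name in describe(family):
    print(code_point, name)

# The person emoji that a renderer may show separately if it lacks a
# glyph for the joined sequence.
visible = [ch for ch in family if ch != ZWJ]
print(len(family), "code points,", len(visible), "person emoji,",
      len(family.encode("utf-8")), "UTF-8 code units")
```

Whether this sequence is perceived as one character or four therefore
depends on the renderer and font, not on the string itself.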

With regard to interpretation of fill characters, I think there needs to
be a requirement that the fill "character" consume exactly one unit of
field width, however that is defined.
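
A minimal sketch of that requirement, assuming the unit of field width
is supplied as a measuring function. The names pad_right_aligned,
measure, and utf8_units are hypothetical; this is not the fmt or P0645
interface.

```python
def pad_right_aligned(text, width, fill=" ", measure=len):
    """Right-align text in a field of `width` units, as counted by `measure`.

    The fill is rejected unless it occupies exactly one unit of width.
    """
    if measure(fill) != 1:
        raise ValueError("fill must occupy exactly one unit of field width")
    missing = max(0, width - measure(text))
    return fill * missing + text

def utf8_units(text):
    """Measure text in UTF-8 code units."""
    return len(text.encode("utf-8"))

# With code points as the unit (Python's len), an emoji is a valid fill.
print(pad_right_aligned("25", 4, "*"))
print(pad_right_aligned("7", 3, "\U0001F384"))

# With UTF-8 code units as the unit, the same emoji measures 4 units
# and is rejected as a fill.
try:
    pad_right_aligned("7", 3, "\U0001F384", measure=utf8_units)
except ValueError as error:
    print("rejected:", error)
```

The same fill can be acceptable under one definition of width and
invalid under another, which is why the requirement has to be stated
in terms of whatever unit is chosen.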

Tom.

[1]:
https://github.com/ericniebler/range-v3/blob/master/example/calendar.cpp

On 07/08/2018 05:24 PM, Corentin wrote:
> The *implementation* complexity of using grapheme clusters (there is
> also a runtime complexity) would only come from the order in which
> things are standardized.
> I think a grapheme cluster iterator is something SG16 wants, and the
> day we have that, it would be easy to add it in fmt (since the wording
> would be well defined, and the implementor would have to implement
> support for grapheme clusters anyway).
>
> That's why it's probably wise to ignore all charX_t overloads and
> specializations (for both parameters and format string), until such a
> time that we can use these basic building blocks and common definition.
>
> Otherwise, I think you are probably right that grapheme cluster for
> Unicode strings (those using charX_t) makes sense.
>
> Except of course this is completely in the hands of the renderer. Your
> family emoji renders as 4 emojis because my computer probably lacks
> the appropriate font.
> We must accept that we cannot provide a way for the value of width to
> match anything that will be rendered, i.e. if people use it for text
> alignment, it will never be right.
> The way graphical software deals with that is to rely on font
> metrics, i.e. it computes a width from the actual font used to do the
> rendering.
>
>
> 4/ For me (I think it was lost in the chat), the semantics of N
> should depend on the value_type of the output iterator/function
>
> Corentin
>
>
>
> Le dim. 8 juil. 2018 à 23:04, Victor Zverovich
> <victor.zverovich_at_[hidden] <mailto:victor.zverovich_at_[hidden]>> a écrit :
>
> Just a small followup on our discussion of P0645 Text Formatting
> during the previous meeting.
>
> 1. Interpretation of width with multibyte encodings and combining
> characters.
>
> P0645R2 currently doesn't specify the units of width. Possible
> options are (from lower to higher abstraction level):
>
> * Code units
> * Code points
> * Grapheme clusters
>
> Python 3 uses code points as can be seen from the following example:
>
> >>> o = b'\x6F\xCC\x88'.decode('utf8')
> >>> o
> 'ö'
> >>> '{:>2}'.format(o)
> 'ö' # note missing space
> >>> o = b'\xC3\xB6'.decode('utf8')
> >>> o
> 'ö'
> >>> '{:>2}'.format(o)
> ' ö'
>
> I have a slight preference for grapheme clusters because according to
> Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION
> <http://unicode.org/reports/tr29/> they correspond
> to “user-perceived characters” (at least that seems to be the
> intention, whether they are successful in that is another question).
>
> Zach provided an example of "👨\u200D👩\u200D👧\u200D👦", where
> \u200D is a zero-width joiner (ZWJ), rendered as a single glyph
> representing a family "👨‍👩‍👧‍👦". However, if I interpret the
> following part of
> http://www.unicode.org/reports/tr29/tr29-29.html#GB10 correctly:
>
> > Do not break within emoji modifier sequences or emoji zwj sequences.
>
> this is not a problem and "👨\u200D👩\u200D👧\u200D👦" will
> constitute a single grapheme cluster.
>
> That said, making grapheme clusters width units may add
> significant complexity to the implementation with minor benefits,
> so I'm fine going with code points especially since there is an
> established example of doing this (Python) and it's already an
> improvement over stdio & iostreams.
>
> 2. Interpretation of fill.
>
> It seems there was a general agreement that fill should be a code
> point but please let me know if you have other ideas.
>
> 3. There was a question about signed and unsigned char. I checked
> and there is no special handling for these types which means that
> they are treated as integral types, only char and wchar_t are
> treated specially as character types (and later charN_t will be
> added).
>
> 4. Interpretation of n in format_to_n.
>
> There was no agreement whether n should be specified in code units
> or code points. An argument in favor of code units is that n often
> gives the output buffer size. On the other hand, using code points
> would be more consistent with width.
>
> I plan to add support for specifying width and fill as code points
> in fmt (Zach gave some useful pointers on how to do that, thanks!)
> and will report back with any user feedback.
>
> Cheers,
> Victor
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden] <mailto:Unicode_at_[hidden]>
> http://www.open-std.org/mailman/listinfo/unicode
>
>
>



Received on 2018-07-23 06:04:52