Date: Sun, 8 Jul 2018 23:24:18 +0200
The *implementation* complexity of using grapheme clusters (there is also
a runtime complexity) would only come from the order in which things are
standardized.
I think a grapheme cluster iterator is something SG16 wants, and the day we
have that, it would be easy to add to fmt (since the wording would be
well defined, and the implementer would have to implement support for
grapheme clusters anyway).
That's why it's probably wise to ignore all charX_t overloads and
specializations (for both parameters and format strings) until such a time
that we can use these basic building blocks and common definitions.
Otherwise, I think you are probably right that grapheme clusters make sense
for Unicode strings (those using charX_t).
Except, of course, that this is completely in the hands of the renderer. Your
family emoji renders as 4 emojis because my computer probably lacks the
appropriate font.
We must accept that we cannot provide a way for the value of width to
match anything that will be rendered; that is, if people use it for text
alignment, it will never be right.
The way graphical software deals with that is by relying on font metrics,
i.e. it computes a width from the actual font used to do the rendering.
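
To make the three candidate units concrete, here is a small Python sketch
(following the Python example in Victor's mail below). The helper name
width_units is hypothetical, and the "grapheme cluster" count is only a
rough approximation that skips combining marks; it is not the UAX #29
algorithm, and it deliberately shows where the approximation breaks down.

```python
import unicodedata

def width_units(s):
    """Return (UTF-8 code units, code points, approximate grapheme clusters).

    The last value is a crude stand-in: it counts code points whose
    canonical combining class is 0. It is NOT UAX #29 segmentation.
    """
    code_units = len(s.encode("utf-8"))
    code_points = len(s)
    approx_clusters = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return code_units, code_points, approx_clusters

decomposed = "o\u0308"    # 'o' followed by COMBINING DIAERESIS
precomposed = "\u00f6"    # LATIN SMALL LETTER O WITH DIAERESIS
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(width_units(decomposed))   # (3, 2, 1)
print(width_units(precomposed))  # (2, 1, 1)
print(width_units(family))       # (25, 7, 7)
```

The last line is the interesting one: the ZWJ sequence should be a single
grapheme cluster, but the naive count reports 7, which is why a properly
specified grapheme cluster iterator is needed before fmt could rely on it.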
4/ For me (I think it was lost in the chat), the semantics of N should
depend on the value_type of the output iterator/function.
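
One possible reading of this, sketched in Python: n counts elements of the
output's value type, so the same n admits different amounts of text
depending on whether the output holds UTF-8, UTF-16 or UTF-32 code units.
The function truncate_to_n_units and the UNIT_SIZE table are hypothetical,
and the policy of never splitting a code point is an assumption of this
sketch, not something P0645 specifies.

```python
# Bytes per code unit for each (BOM-less) encoding used in this sketch.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def truncate_to_n_units(text, n, encoding):
    """Keep whole code points from text while they fit in n code units.

    Returns (kept_text, code_units_used). Assumption: a code point that
    does not fit entirely is dropped rather than split.
    """
    unit = UNIT_SIZE[encoding]
    kept = []
    used = 0
    for ch in text:
        cost = len(ch.encode(encoding)) // unit  # code units for this code point
        if used + cost > n:
            break
        kept.append(ch)
        used += cost
    return "".join(kept), used

text = "\u00f6\U0001F468"  # 'ö' followed by the MAN emoji

print(truncate_to_n_units(text, 2, "utf-8"))      # ('ö', 2)
print(truncate_to_n_units(text, 2, "utf-16-le"))  # ('ö', 1)
print(truncate_to_n_units(text, 2, "utf-32-le"))  # ('ö👨', 2)
```

With n = 2, only the UTF-32 output has room for both code points, so n
expressed in units of the output's value_type still tells the caller how
large the buffer needs to be.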
Corentin
On Sun, 8 Jul 2018 at 23:04, Victor Zverovich <victor.zverovich_at_[hidden]>
wrote:
> Just a small followup on our discussion of P0645 Text Formatting during
> the previous meeting.
>
> 1. Interpretation of width with multibyte encodings and combining
> characters.
>
> P0645R2 currently doesn't specify the units of width. Possible options are
> (from lower to higher abstraction level):
>
> * Code units
> * Code points
> * Grapheme clusters
>
> Python 3 uses code points as can be seen from the following example:
>
> >>> o = b'\x6F\xCC\x88'.decode('utf8')
> >>> o
> 'ö'
> >>> '{:>2}'.format(o)
> 'ö' # note missing space
> >>> o = b'\xC3\xB6'.decode('utf8')
> >>> o
> 'ö'
> >>> '{:>2}'.format(o)
> ' ö'
>
> I have a slight preference for grapheme clusters because according to Unicode
> Standard Annex #29 UNICODE TEXT SEGMENTATION
> <http://unicode.org/reports/tr29/> they correspond to “user-perceived
> characters” (at least that seems to be the intention; whether they are
> successful in that is another question).
>
> Zach provided an example of "👨\u200D👩\u200D👧\u200D👦", where \u200D is
> a zero-width joiner (ZWJ), rendered as a single glyph representing a
> family "👨👩👧👦". However, if I interpret the following part of
> http://www.unicode.org/reports/tr29/tr29-29.html#GB10 correctly:
>
> > Do not break within emoji modifier sequences or emoji zwj sequences.
>
> this is not a problem and "👨\u200D👩\u200D👧\u200D👦" will constitute a
> single grapheme cluster.
>
> That said, making grapheme clusters width units may add significant
> complexity to the implementation with minor benefits, so I'm fine going
> with code points especially since there is an established example of doing
> this (Python) and it's already an improvement over stdio & iostreams.
>
> 2. Interpretation of fill.
>
> It seems there was a general agreement that fill should be a code point
> but please let me know if you have other ideas.
>
> 3. There was a question about signed and unsigned char. I checked and
> there is no special handling for these types, which means that they are
> treated as integral types; only char and wchar_t are treated specially as
> character types (and later charN_t will be added).
>
> 4. Interpretation of n in format_to_n.
>
> There was no agreement whether n should be specified in code units or code
> points. An argument in favor of code units is that n often gives the output
> buffer size. On the other hand, using code points would be more consistent
> with width.
>
> I plan to add support for specifying width and fill as code points in fmt (Zach
> gave some useful pointers on how to do that, thanks!) and will report back
> with any user feedback.
>
> Cheers,
> Victor
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>
Received on 2018-07-08 23:24:32