Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 7 Sep 2019 23:25:43 -0400
On 9/7/19 9:11 PM, Zach Laine wrote:
> On Sat, Sep 7, 2019 at 7:31 PM Tom Honermann via Lib
> <lib_at_[hidden]> wrote:
> On 9/7/19 8:27 PM, Tony V E wrote:
>> I think we would want it to be measured in glyphs.
> I agree that would be ideal, but...
> Stop right there. If that's ideal, let's do that. Or at least, let's
> leave room for it to be done at some point. Specifying CUs now
> prevents the ideal from ever being realized.
There are other options. For example, a future extension could allow
specifying what units are to be used for field width.
>> Are you suggesting code points because glyphs are too hard?
> I don't know how to achieve that. Field width doesn't really work
> for alignment unless one assumes a monospace font. We could
> measure in terms of extended grapheme clusters, but EGC width has
> changed over time (e.g., family emoji). That makes alignment
> dependent on both display properties and Unicode version. And, of
> course, this would drag in locale dependence as well.
> If you just count N=EGCs, you get the "right" answer. If your
> terminal shows more or less than N characters, get a new terminal.
> What I mean by this is that there should be no consideration of fonts.
I see field width as either indicating storage (number of code units) or
alignment. The number of user perceived characters is not useful for
aligning text unless a monospace font is assumed. Therefore, storage
seems like the more useful measurement. This also aligns with
format_to_n and formatted_size which, unless I'm mistaken, work in code
units. (It would be nice to clarify the wording for these as well; what
is meant by "number of characters in the character representation"?)
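To make the difference between these measurements concrete, here is a minimal sketch that counts UTF-8 code units and code points for the same string. The function names count_code_units and count_code_points are hypothetical helpers, not part of std::format or any proposal, and the code point counter assumes well-formed UTF-8 input. Counting extended grapheme clusters is deliberately left out, since that requires Unicode data tables.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-8 code units (bytes),
// i.e. the storage size that std::string::size() reports.
std::size_t count_code_units(const std::string& s) {
    return s.size();
}

// Hypothetical helper: number of code points, ASSUMING s is
// well-formed UTF-8. Every byte that is not a continuation
// byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the family emoji sequence U+1F468 U+200D U+1F469 U+200D U+1F467, written as the bytes "\xF0\x9F\x91\xA8\xE2\x80\x8D\xF0\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA7", these helpers report 18 code units and 5 code points, while a renderer that supports the sequence would typically show one user-perceived character. A field width of 5 therefore means something quite different under each interpretation.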
> As for the need for a locale, I don't get that. Grapheme breaking is
> simple, and requires no locale info. Do you mean Unicode data?
> Picking a version and sticking with it should be sufficient. No
> system that I know of has multiple Unicode versions to pick from
> programmatically.
For char and wchar_t, encoding is locale dependent. Think POSIX LANG=C
(probably ASCII or ISO-8859-1) vs LANG=C.UTF-8.
> Zach

Received on 2019-09-08 05:25:48