Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Wed, 18 Sep 2019 12:30:24 -0700
Based on the discussion in this thread and additional research, I have come
to the conclusion that the current proposed resolution is incorrect. I will
submit a paper explaining why in detail and proposing a better alternative.

Cheers,
Victor

On Wed, Sep 18, 2019 at 10:29 AM Daniel Krügler via Lib <
lib_at_[hidden]> wrote:

> On Sun, Sep 8, 2019 at 02:13, Tom Honermann via Lib
> <lib_at_[hidden]> wrote:
> >
> > [format.string.std]p7 states:
> >
> > The positive-integer in width is a decimal integer defining the minimum
> > field width. If width is not specified, there is no minimum field width,
> > and the field width is determined based on the content of the field.
> >
> > Is field width measured in code units, code points, or something else?
> >
> > Consider the following example assuming a UTF-8 locale:
> >
> > std::format("{}", "\xC3\x81");     // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
> > std::format("{}", "\x41\xCC\x81"); // U+0041 U+0301 { LATIN CAPITAL LETTER A } { COMBINING ACUTE ACCENT }
> >
> > In both cases, the arguments encode the same user-perceived character
> > (Á). The first uses two UTF-8 code units to encode a single code point
> > that represents a single glyph using a composed Unicode normalization
> > form. The second uses three code units to encode two code points that
> > represent the same glyph using a decomposed Unicode normalization form.
> >
> > How is the field width determined? If measured in code units, the first
> > has a width of 2 and the second of 3. If measured in code points, the
> > first has a width of 1 and the second of 2. If measured in grapheme
> > clusters, both have a width of 1. Is the determination locale dependent?
> >
> > Proposed resolution:
> >
> > Field widths are measured in code units and are not locale dependent.
> > Modify [format.string.std]p7 as follows:
> >
> > The positive-integer in width is a decimal integer defining the minimum
> > field width. If width is not specified, there is no minimum field width,
> > and the field width is determined based on the content of the field. Field
> > width is measured in code units. Each byte of a multibyte character
> > contributes to the field width.
> >
> > ("Code unit" is not formally defined in the standard. Most uses occur in
> > UTF-8 and UTF-16 specific contexts, but [lex.ext]p5 uses it in an
> > encoding-agnostic context.)
> >
> > Tom.
>
> Unfortunately, the issue submission and the LWG reflector announcement
> were combined, and a long thread was the consequence. In the future,
> please consider separating issue submission from LWG discussion.
>
> In this case a new issue has been created, please reload and double-check:
>
> https://cplusplus.github.io/LWG/issue3290
>
> Thanks,
>
> - Daniel
> _______________________________________________
> Lib mailing list
> Lib_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2019/09/13615.php
>
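
The three candidate measurements from Tom's example can be sketched as follows. This is a minimal illustration, not anything proposed in the thread: the function names are hypothetical, the input is assumed to be well-formed UTF-8, and the grapheme count is only an approximation that treats code points in the Combining Diacritical Marks block (U+0300 to U+036F) as extending the previous character. Real grapheme cluster segmentation follows Unicode UAX #29 and needs the Unicode character database.

```cpp
#include <cstddef>
#include <string>

// Width in UTF-8 code units: simply the number of bytes.
std::size_t count_code_units(const std::string& s) {
    return s.size();
}

// Width in code points, assuming well-formed UTF-8: every byte that is
// not a continuation byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Rough grapheme cluster count (an assumption-laden approximation):
// decode each code point and skip those in U+0300..U+036F, which attach
// to the preceding base character. Not a UAX #29 implementation.
std::size_t approx_grapheme_clusters(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        // Sequence length from the lead byte (well-formed input assumed).
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Payload bits of the lead byte, then 6 bits per continuation byte.
        char32_t cp = lead < 0x80 ? lead : lead & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        if (cp < 0x0300 || cp > 0x036F) ++n;
        i += len;
    }
    return n;
}
```

Applied to the two arguments above, "\xC3\x81" measures 2 code units, 1 code point, and 1 approximate grapheme cluster, while "\x41\xCC\x81" measures 3, 2, and 1 respectively, matching the numbers in the original question.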

Received on 2019-09-18 21:30:37