sg16: Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 10 Sep 2019 10:07:26 -0400

On 9/9/19 3:47 PM, Corentin wrote:
>
>
> On Mon, 9 Sep 2019 at 21:29, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 9/9/19 3:26 AM, Corentin wrote:
>>
>> On Mon, Sep 9, 2019, 4:34 AM Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>>
>> My preferred direction for exploration is a future extension
>> that enables opt-in to field widths that are encoding
>> dependent (and therefore locale dependent for char and
>> wchar_t). For example (using 'L' appended to the width; 'L'
>> doesn't conflict with the existing type options):
>>
>> std::format("{:3L}", "\xC3\x81"); // produces "\xC3\x81\x20\x20"; 3 EGCs.
>>
>> std::format("{:3L}", "ch"); What does that produce?
> "ch " (one trailing space). The implied constraint with respect
> to literals is that they must be compatible with whatever the
> locale dependent encoding is. If your question was intended to
> ask whether transliteration should occur here or whether "ch"
> might be presented with a ligature, well that is yet another
> dimension of why field widths don't really work for aligning text
> (in general, it works just fine for characters for which one code
> unit == one code point == one glyph that can be presented in a
> monospace font).
>
>
> See https://en.wikipedia.org/wiki/Slovak_orthography
Ah, digraphs. Unicode doesn't provide general support for digraphs so
whether "ch" represents the individual Slovak "c" and "h" characters or
the letter "ch" is not apparent. If "c" and "h" were intended, then
U+034F { COMBINING GRAPHEME JOINER } could be used to indicate that (the
name of this joiner is a misnomer). U+200C { ZERO WIDTH NON-JOINER } and
U+200D { ZERO WIDTH JOINER } could be used to prevent ligation, but they
don't help to determine which character is intended. I tend to think
Unicode is deficient in this area, but I'm no expert in it. Regardless,
this is more support for field widths being insufficient for display
alignment.
>
>> Locale specifiers should only affect region specific rules, not
>> whether something is interpreted as bytes or not
> Ideally I agree, but that isn't the reality we are faced with.
>
>
> I feel like we are completely talking past each other, and I am sorry
> I haven't made my point clear.
> Yes, the encoding is currently derived from the locale, no, it does
> not have to be.
>
> It is possible to answer the question "what is the encoding of the
> current process" without pulling in the <locale> header.
> Pulling the locale header does NOT give you that information.
I don't see how the <locale> header is relevant here. The standard
doesn't have to answer the question of where the locale information
comes from. LANG=C vs LANG=C.UTF-8 isn't (currently) reflected in <locale>.
> And yes on some systems (linux), it is attached to the idea of locale.
All POSIX systems and Windows.
>
> It is important to separate the two when dealing with Unicode
We're not dealing solely with Unicode here. We're discussing char and
wchar_t which may or may not (depending on platform and locale) indicate
a Unicode or non-Unicode encoding. I don't see a way to separate them
today.
>
>> But again, I'm far from convinced that this is actually
>> useful since EGCs don't suffice to ensure an aligned result
>> anyway as nicely described in Henri's post
>> (https://hsivonen.fi/string-length).
>>
>> Agreed but i think you know that code units is the least useful
>> option in this case and i am concerned about choosing a bad
>> option to make a fix easy.
>
> I didn't propose code units in order to make an easy fix. The
> intent was to choose the best option given the trade offs
> involved. Since none of code units, code points, scalar values,
> or EGCs would result in reliable alignment and most uses of such
> alignment (e.g., via printf) are used in situations where
> characters outside the basic source character set are unlikely to
> appear [citation needed], I felt that avoiding the locale
> dependency was the more important goal.
>
> I think the user intent is more important. I don't want an emoji to
> be considered 17 width units, to quote Henri's post.
> EGCs are the least bad approximation.
I guess that is one place we disagree.
>
> But stating that the char overload is bytes and the upcoming char8_t
> one is text would be okay, I think. Maybe, even if surprising.

And this is another. Repeating what I stated earlier, if the current
locale has a UTF-8 encoding, I would be disappointed if the following
two calls produced different string contents:

// "\xC3\x81" is U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE } in UTF-8.
std::format(  "{:3}",   "\xC3\x81");
std::format(u8"{:3}", u8"\xC3\x81");
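
To make the concern concrete, here is a minimal sketch of how the same UTF-8 input pads differently depending on whether the width is measured in code units or in code points. The helper names (width_in_code_units, width_in_code_points, pad_right) are made up for this illustration and are not part of std::format, and the code point counter assumes well-formed UTF-8 without validating it.

```cpp
#include <cstddef>
#include <string>

// Width in code units: for UTF-8 held in char, this is the size in bytes.
std::size_t width_in_code_units(const std::string& s) {
    return s.size();
}

// Width in code points, ASSUMING s is well-formed UTF-8 (not validated):
// count every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t width_in_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Hypothetical helper: left-align s in a field of `width` units, where
// `measured` is the content width computed by one of the functions above.
std::string pad_right(const std::string& s, std::size_t measured,
                      std::size_t width) {
    std::string out = s;
    if (measured < width) {
        out.append(width - measured, ' ');
    }
    return out;
}
```

With a width of 3, measuring "\xC3\x81" in code units yields one trailing space, while measuring it in code points yields two, so an overload that counted bytes and an overload that counted code points would produce different string contents for the same text.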

Perhaps it would be helpful to enumerate what we expect to be portable
uses of field widths. My personal take is that they are useful to
specify widths for fields where the content is restricted to members of
the basic source character set where we already have a guarantee that
each character can be represented with one code unit. That is
sufficient to allow field widths to portably work as expected (assuming
a monospace font if display is relevant) for formatting of arithmetic
and pointer types as none of those require characters outside of the
basic source character set. It is also sufficient for character and
string literals restricted to the basic source character set. I think
it is reasonable to require that, for text in general, some other means
is required to achieve alignment. Those restrictions make the
distinction between code unit, code point, scalar values, and EGCs
meaningless in the context of field widths.
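
The restriction described above can be checked mechanically. The following is only a sketch, assuming the 96-character basic source character set as defined through C++20; in_basic_source_set and widths_agree are hypothetical names chosen for this example. When widths_agree returns true, the code unit count, code point count, scalar value count, and EGC count of the string are all the same number.

```cpp
#include <string>

// True if c is a member of the basic source character set (C++20 definition,
// 96 characters: letters, digits, 29 punctuation characters, space, and the
// horizontal tab, vertical tab, form feed, and new-line control characters).
bool in_basic_source_set(char c) {
    static const std::string members =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
        " \t\v\f\n";
    return members.find(c) != std::string::npos;
}

// True if every character of s is in the basic source character set, in
// which case each character occupies exactly one code unit and the choice
// of width unit makes no difference.
bool widths_agree(const std::string& s) {
    for (char c : s) {
        if (!in_basic_source_set(c)) {
            return false;
        }
    }
    return true;
}
```

Typical formatted output of arithmetic and pointer types, such as "-3.14e+10" or "0x7ffe", passes this check; "\xC3\x81" does not.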

Tom.

Received on 2019-09-10 16:07:29