sg16: Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 9 Sep 2019 21:47:46 +0200

On Mon, 9 Sep 2019 at 21:29, Tom Honermann <tom_at_[hidden]> wrote:

> On 9/9/19 3:26 AM, Corentin wrote:
>
>
> On Mon, Sep 9, 2019, 4:34 AM Tom Honermann <tom_at_[hidden]> wrote:
>
>> My preferred direction for exploration is a future extension that enables
>> opt-in to field widths that are encoding dependent (and therefore locale
>> dependent for char and wchar_t). For example (using 'L' appended to the
>> width; 'L' doesn't conflict with the existing type options):
>>
>> std::format("{:3L}", "\xC3\x81"); // produces "\xC3\x81\x20\x20"; 3 EGCs.
>>
> std::format("{:3L}", "ch"); what does that produce?
>
> "ch " (one trailing space). The implied constraint with respect to
> literals is that they must be compatible with whatever the locale dependent
> encoding is. If your question was intended to ask whether transliteration
> should occur here or whether "ch" might be presented with a ligature, well
> that is yet another dimension of why field widths don't really work for
> aligning text (in general, it works just fine for characters for which one
> code unit == one code point == one glyph that can be presented in a
> monospace font).
>

See https://en.wikipedia.org/wiki/Slovak_orthography: in Slovak, "ch" is a
single letter of the alphabet.

> Locale specifiers should only affect region specific rules, not whether
> something is interpreted as bytes or not
>
> Ideally I agree, but that isn't the reality we are faced with.
>

I feel like we are completely talking past each other, and I am sorry I have
not made my point clear.
Yes, the encoding is currently derived from the locale; no, it does not
have to be.

It is possible to answer the question "what is the encoding of the current
process" without pulling in the <locale> header.
Pulling in the <locale> header does NOT give you that information.
And yes, on some systems (Linux), it is attached to the idea of locale.

It is important to separate the two when dealing with Unicode.

>> But again, I'm far from convinced that this is actually useful since EGCs
>> don't suffice to ensure an aligned result anyway as nicely described in
>> Henri's post (https://hsivonen.fi/string-length).
>>
> Agreed, but I think you know that code units are the least useful option in
> this case, and I am concerned about choosing a bad option to make a fix easy.
>
>
> I didn't propose code units in order to make an easy fix. The intent was
> to choose the best option given the trade offs involved. Since none of
> code units, code points, scalar values, or EGCs would result in reliable
> alignment and most uses of such alignment (e.g., via printf) are used in
> situations where characters outside the basic source character set are
> unlikely to appear [citation needed], I felt that avoiding the locale
> dependency was the more important goal.
>
I think the user intent is more important. I don't want an emoji to be
considered 17 width units, to quote Henri's post.
EGCs are the least bad approximation.
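
To make the numbers concrete, here is a minimal sketch (not anything from
the std::format wording) that counts UTF-8 code units and code points for the
facepalm emoji discussed in Henri's post. The helper names count_code_units,
count_code_points and facepalm_utf8 are made up for illustration, and the
input is assumed to be well-formed UTF-8. Counting EGCs would additionally
need the Unicode segmentation data, so it is not attempted here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of UTF-8 code units: simply the byte length of the string.
std::size_t count_code_units(const std::string& utf8) {
    return utf8.size();
}

// Number of code points, ASSUMING `utf8` is well-formed UTF-8:
// every byte that is not a continuation byte (10xxxxxx) starts a code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// U+1F926 U+1F3FC U+200D U+2642 U+FE0F, spelled out as UTF-8 bytes so the
// source file encoding does not matter. A user sees a single emoji (one EGC).
std::string facepalm_utf8() {
    return "\xF0\x9F\xA4\xA6"   // U+1F926
           "\xF0\x9F\x8F\xBC"   // U+1F3FC
           "\xE2\x80\x8D"       // U+200D
           "\xE2\x99\x82"       // U+2642
           "\xEF\xB8\x8F";      // U+FE0F
}

// Hypothetical self-check: 17 code units, 5 code points, for one emoji.
void check_facepalm_counts() {
    assert(count_code_units(facepalm_utf8()) == 17);
    assert(count_code_points(facepalm_utf8()) == 5);
}
```

Neither number matches what the user sees, which is a single emoji, but a
width of 17 is by far the most surprising of the candidates.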

But stating that the char overload is bytes and the upcoming char8_t one is
text would be okay, I think. Maybe, even if surprising.
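
As a rough illustration of how the two interpretations diverge for the
"\xC3\x81" example from earlier in the thread, the sketch below pads a string
to a field width measured either in code units ("bytes") or in code points.
WidthUnit, measure and pad_right are hypothetical names, not part of
std::format, and code points stand in for "text" here only because proper
EGC counting needs Unicode data tables. Well-formed UTF-8 input is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical knob, not part of std::format: how the width is measured.
enum class WidthUnit { CodeUnits, CodePoints };

// Measure `s` in the requested unit, ASSUMING well-formed UTF-8.
std::size_t measure(const std::string& s, WidthUnit unit) {
    if (unit == WidthUnit::CodeUnits) {
        return s.size();
    }
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {  // skip continuation bytes
            ++n;
        }
    }
    return n;
}

// Left-align `s` in a field of `width` units, padding with spaces.
// Strings already at least `width` units long are returned unchanged.
std::string pad_right(const std::string& s, std::size_t width, WidthUnit unit) {
    const std::size_t used = measure(s, unit);
    std::string out = s;
    if (used < width) {
        out.append(width - used, ' ');
    }
    return out;
}

// Hypothetical self-check: the same width request pads differently.
void check_padding() {
    assert(pad_right("\xC3\x81", 3, WidthUnit::CodeUnits) == "\xC3\x81\x20");
    assert(pad_right("\xC3\x81", 3, WidthUnit::CodePoints) == "\xC3\x81\x20\x20");
}
```

With a width of 3, the byte interpretation appends one space and the code
point interpretation appends two; for pure ASCII input such as "ch" both
give the same result.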

> Tom.
>

Received on 2019-09-09 21:48:00