On Mon, Sep 9, 2019, 4:34 AM Tom Honermann <tom@honermann.net> wrote:
On 9/8/19 7:05 PM, Zach Laine wrote:
On Sun, Sep 8, 2019 at 3:00 PM Tom Honermann via Lib <lib@lists.isocpp.org> wrote:

On Sep 8, 2019, at 2:46 PM, Corentin via Lib <lib@lists.isocpp.org> wrote:



On Sun, 8 Sep 2019 at 19:30, Tom Honermann <tom@honermann.net> wrote:
On 9/8/19 12:40 PM, Corentin wrote:


On Sun, 8 Sep 2019 at 18:12, Tom Honermann <tom@honermann.net> wrote:
On 9/8/19 6:00 AM, Corentin via Lib wrote:


On Sun, 8 Sep 2019 at 11:17, Corentin <corentin.jabot@gmail.com> wrote:


On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS) <bion@microsoft.com> wrote:

> I agree that EGCS is the best option. That doesn't drag locale

 

Because we don’t get to assume that we’re talking about Unicode at all, it absolutely drags in locale.


Sorry, I should have been more specific.
There is a non-tailored Unicode EGC boundary algorithm (but it can be tailored).
I didn't mean to imply that text manipulation can be done without knowing its encoding, and I never use "locale" to mean encoding.

EGCs are only defined for text whose character repertoire is Unicode; other encodings deal in code points.


To be clear, the practical difference between requiring the EGC algorithm to be tailored or not is that tailoring, for all intents and purposes, requires ICU or something else with CLDR data, which restricts the platforms on which this can be implemented.
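To illustrate that dependency, here is a minimal sketch using ICU's C++ BreakIterator API. It only shows what pulling in locale-tailored segmentation looks like; it makes no claim about whether stock CLDR data actually tailors Slovak "ch", and cluster_count is just an illustrative helper, not proposed wording:

#include <memory>
#include <unicode/brkiter.h>   // icu::BreakIterator
#include <unicode/locid.h>     // icu::Locale
#include <unicode/unistr.h>    // icu::UnicodeString

// Sketch: count grapheme clusters with a locale-tailored break iterator.
int cluster_count(const char* utf8, const icu::Locale& loc) {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createCharacterInstance(loc, status));
    if (U_FAILURE(status)) return -1;
    icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);
    bi->setText(text);
    int count = 0;
    for (int32_t pos = bi->first();
         (pos = bi->next()) != icu::BreakIterator::DONE; )
        ++count;
    return count;
}

// e.g. cluster_count("ch", icu::Locale("sk")) vs cluster_count("ch", icu::Locale::getRoot())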

Tailoring is not relevant to this discussion.

It is - see https://unicode.org/reports/tr29/: "ch" is 2 EGCs in most locales, but in Slovak it's 1. I don't make the rules :D
It isn't relevant in determining how we resolve this issue.  If the resolution is that field widths are measured in EGCs, then we've already decided that the width is locale dependent and tailoring becomes an implementation detail.

No, it was decided that format is locale-independent (for good reason), and applying locale-specific behavior implicitly would go against that.
I'm arguing for encoding-specific behavior.

You seem to be missing the point that, for char and wchar_t, the encoding can’t be known (in general) without consulting the locale. Again, LANG=C vs LANG=C.UTF-8. 

Tom. 

Tom, you seem to be missing the point that std::format does no such consultation!  It is locale-agnostic.  It is assumed to be char-based, not Windows-1252, not UTF-8, not even ASCII.
That is exactly my point!  And why my proposed resolution was to specify width in terms of code units.

This means that the definition of width in terms of code units is the de facto status quo.  I'm suggesting that later on, we pull a fast one and specify that we meant it to be UTF-8-based instead of char-based.  This may mean that we need to add a char8_t overload, or it may be palatable to just change the current interface's contract.  I assume the former will be necessary, since people tend to hate silent contract changes (with good reason).
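For concreteness, a sketch of the shape such an overload might take (the signature and return type are invented for illustration, not proposed wording):

// Hypothetical char8_t-based overload; the field width would then be
// defined against UTF-8 text rather than raw char code units.
template <class... Args>
std::u8string format(std::u8string_view fmt, const Args&... args);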

Victor's fmtlib implementation already effectively does what you suggest.  See https://github.com/fmtlib/fmt/commit/38325248e5310ddbea41390974e496e8495f7324.

I think this isn't a good state to be in though.  If the current locale has a UTF-8 encoding, I would be disappointed if the following two calls produced different string contents:

std::format(  "{:3}",   "\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }
std::format(u8"{:3}", u8"\xC3\x81"); // U+00C1 { LATIN CAPITAL LETTER A WITH ACUTE }

If the width is code units for the char-based overload and EGCs for the char8_t-based one, then the first will produce "\xC3\x81\x20" (one inserted space) and the second "\xC3\x81\x20\x20" (two inserted spaces).  I think users would find that surprising.
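To spell out the arithmetic (a minimal sketch; egc_count is a hypothetical helper applying the default, non-tailored UAX #29 segmentation):

std::string_view s = "\xC3\x81";   // U+00C1: 2 UTF-8 code units, 1 EGC
auto pad_cu  = 3 - s.size();       // 1 -> one space appended:  "\xC3\x81\x20"
auto pad_egc = 3 - egc_count(s);   // 2 -> two spaces appended: "\xC3\x81\x20\x20"
                                   // (egc_count: hypothetical helper, not an existing API)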


I think we are going there - we will have to if we take the code units route.
It matches a discussion I recall we had, probably at Kona, that at the moment fmt is more of a bytes-formatting library - with the expectation that a u8 overload would format text.

So, if we do nothing, we get what you want.  If we *specify* that CUs are the width, we color the future debate about the Unicode-aware version in a Unicode-unfriendly direction.
+1
 

If we do nothing, we are in a situation where different implementors may do different things.

My preferred direction for exploration is a future extension that enables opt-in to field widths that are encoding dependent (and therefore locale dependent for char and wchar_t).  For example (using 'L' appended to the width; 'L' doesn't conflict with the existing type options):

std::format("{:3L}", "\xC3\x81"); // produces "\xC3\x81\x20\x20"; 3 EGCs.

std::format("{:3L}", "ch"); // what does that produce?
Locale specifiers should only affect region-specific rules, not whether something is interpreted as bytes or not.

But again, I'm far from convinced that this is actually useful, since EGCs don't suffice to ensure an aligned result anyway, as nicely described in Henri's post (https://hsivonen.fi/string-length).
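For reference, the counts from that post for one emoji ZWJ sequence, sketched as static_asserts (the "two terminal columns" figure is the typical rendering, not something any standard guarantees):

// U+1F926 U+1F3FC U+200D U+2642 U+FE0F ("face palm" + skin tone + ZWJ + male sign + VS-16)
static_assert(sizeof(u8"\U0001F926\U0001F3FC\u200D\u2642\uFE0F") - 1 == 17);                  // UTF-8 code units
static_assert(sizeof(u"\U0001F926\U0001F3FC\u200D\u2642\uFE0F") / sizeof(char16_t) - 1 == 7); // UTF-16 code units
// 5 code points, 1 EGC, and usually 2 terminal columns - so neither code
// units nor EGCs predict how wide the field actually renders.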

Agreed, but I think you know that code units are the least useful option in this case, and I am concerned about choosing a bad option just to make a fix easy.
 

Tom.


Zach