On Sun, Sep 8, 2019 at 3:00 PM Tom Honermann via Lib <lib@lists.isocpp.org> wrote:

On Sep 8, 2019, at 2:46 PM, Corentin via Lib <lib@lists.isocpp.org> wrote:

On Sun, 8 Sep 2019 at 19:30, Tom Honermann <tom@honermann.net> wrote:

On 9/8/19 12:40 PM, Corentin wrote:

On Sun, 8 Sep 2019 at 18:12, Tom Honermann <tom@honermann.net> wrote:

On 9/8/19 6:00 AM, Corentin via Lib wrote:

On Sun, 8 Sep 2019 at 11:17, Corentin <corentin.jabot@gmail.com> wrote:

On Sun, 8 Sep 2019 at 09:52, Billy O'Neal (VC LIBS) <bion@microsoft.com> wrote:

> I agree that EGCS is the best option. That doesn't drag locale

Because we don’t get to assume that we’re talking about Unicode at all, it absolutely drags in locale.

Sorry, I should have been more specific.

There is a non-tailored Unicode EGCS boundary algorithm (though it can be tailored).

I didn't mean to imply that text manipulation can be done without knowing its encoding, and I never use "locale" to mean encoding.

EGCS are only defined for text whose character repertoire is Unicode; other encodings deal with codepoints.

To be clear, the difference between requiring the EGC algorithm to be tailored or not is that tailoring, for all intents and purposes, requires ICU or something with CLDR, which restricts the platforms on which this can be implemented.

Tailoring is not relevant to this discussion.

It is - see https://unicode.org/reports/tr29/. "ch" is 2 EGCS in most locales, but in Slovak it's 1. I don't make the rules :D

It isn't relevant in determining how we resolve this issue. If the resolution is that field widths are measured in EGCs, then we've already decided that the width is locale dependent and tailoring becomes an implementation detail.

No, format decided to be locale-independent (for good reason), and applying locale-specific behavior implicitly would go against that.
I'm arguing for encoding-specific behavior.

You seem to be missing the point that, for char and wchar_t, the encoding can’t be known (in general) without consulting the locale. Again, LANG=C vs LANG=C.UTF-8.
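
To make the LANG=C vs LANG=C.UTF-8 point concrete, here is a minimal sketch, assuming a POSIX system where nl_langinfo(CODESET) is available. The helper name codeset_for is hypothetical and used only for illustration; the locale names available, and the codeset string reported for each, vary by platform.

```cpp
#include <clocale>
#include <langinfo.h>  // POSIX only: nl_langinfo, CODESET
#include <string>

// Hypothetical helper: report the codeset (encoding name) implied by a locale.
// Note: this changes the process-wide LC_CTYPE locale as a side effect.
// Returns an empty string if the named locale is not installed.
std::string codeset_for(const char* locale_name) {
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return {};
    }
    return nl_langinfo(CODESET);
}
```

On a system that provides both locales, codeset_for("C") and codeset_for("C.UTF-8") would be expected to report different codesets, even though the char data passed around in the program has the same type in both cases.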

Tom.

Tom, you seem to be missing the point that std::format does no such consultation! It is locale-agnostic. It is assumed to be char-based, not Windows 1252, not UTF-8, not even ASCII.

This means that the definition of width as being a CU is the de facto status quo. I'm suggesting that later on, we pull a fast one and specify that we meant that it should have been UTF-8-based instead of char-based. This may mean that we need to add a char8_t overload, or it may be palatable to just change the current interface's contract. I assume the former will be necessary, since people tend to hate silent contract changes (with good reason).

So, if we do nothing, we get what you want. If we *specify* that CUs are the width, we color the future debate about the Unicode-aware version in a Unicode-unfriendly direction.
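
For illustration, a minimal sketch of the two width measures in question. The function names are hypothetical, and the second function assumes its input is valid UTF-8, which is exactly the assumption std::format does not currently make. Neither function measures EGCs; that would need the TR29 segmentation rules.

```cpp
#include <cstddef>
#include <string_view>

// Width in code units: the number of char elements, whatever the encoding is.
constexpr std::size_t width_in_code_units(std::string_view s) {
    return s.size();
}

// Width in code points, ASSUMING s is valid UTF-8: count every byte that is
// not a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
constexpr std::size_t width_in_code_points_utf8(std::string_view s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For "caf\xC3\xA9" ("café" encoded as UTF-8), the first function gives 5 and the second gives 4, so the choice of unit changes the padding a formatted field receives.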

Zach