sg16: Re: [SG16-Unicode] [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

From: Billy O'Neal (VC LIBS) <bion_at_[hidden]>
Date: Mon, 9 Sep 2019 19:33:05 +0000

> everything within the programs assumes ACP, if you are trying to say output to the console, and the ACP is utf8, the console has to expect utf8

The probability of ACP being UTF-8 is approximately zero.

Billy3

________________________________
From: Tom Honermann <tom_at_[hidden]>
Sent: Monday, September 9, 2019 12:29:41 PM
To: Corentin <corentin.jabot_at_[hidden]>
Cc: Zach Laine <whatwasthataddress_at_[hidden]>; Library Working Group <lib_at_[hidden]>; Victor Zverovich <victor.zverovich_at_[hidden]>; Billy O'Neal (VC LIBS) <bion_at_[hidden]>; unicode_at_[hidden] <unicode_at_[hidden]>
Subject: Re: [isocpp-lib] New issue: Are std::format field widths code units, code points, or something else?

On 9/9/19 3:26 AM, Corentin wrote:

On Mon, Sep 9, 2019, 4:34 AM Tom Honermann <tom_at_[hidden]> wrote:

My preferred direction for exploration is a future extension that enables opt-in to field widths that are encoding dependent (and therefore locale dependent for char and wchar_t). For example (using 'L' appended to the width; 'L' doesn't conflict with the existing type options):

std::format("{:3L}", "\xC3\x81"); // produces "\xC3\x81\x20\x20"; 3 EGCs.

std::format("{:3L}", "ch"); what does that produce?
"ch " (one trailing space). The implied constraint with respect to literals is that they must be compatible with whatever the locale-dependent encoding is. If your question was intended to ask whether transliteration should occur here, or whether "ch" might be presented with a ligature, well, that is yet another dimension of why field widths don't really work for aligning text. (In general, they work just fine for characters for which one code unit == one code point == one glyph that can be presented in a monospace font.)
Locale specifiers should only affect region-specific rules, not whether something is interpreted as bytes or not.
Ideally I agree, but that isn't the reality we are faced with.

But again, I'm far from convinced that this is actually useful since EGCs don't suffice to ensure an aligned result anyway, as nicely described in Henri's post (https://hsivonen.fi/string-length).

Agreed, but I think you know that code units is the least useful option in this case, and I am concerned about choosing a bad option to make a fix easy.

I didn't propose code units in order to make an easy fix. The intent was to choose the best option given the trade-offs involved. Since none of code units, code points, scalar values, or EGCs would result in reliable alignment, and most uses of such alignment (e.g., via printf) are in situations where characters outside the basic source character set are unlikely to appear [citation needed], I felt that avoiding the locale dependency was the more important goal.
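
To make the trade-off concrete, here is a minimal sketch of how the padding differs depending on what the width counts. It is not part of any proposal and does not use std::format; the helper names (code_units, code_points, pad_right) are hypothetical, and the code point counter assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Number of UTF-8 code units (bytes) in s.
std::size_t code_units(std::string_view s) { return s.size(); }

// Number of code points, ASSUMING s is well-formed UTF-8:
// count every byte that is not a continuation byte (10xxxxxx).
std::size_t code_points(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}

// Hypothetical helper: left-align s in a field of `width`,
// where `measure` decides what one unit of width means.
template <class Measure>
std::string pad_right(std::string_view s, std::size_t width, Measure measure) {
    std::string out(s);
    std::size_t w = measure(s);
    if (w < width) out.append(width - w, ' ');
    return out;
}
```

With a width of 3, "\xC3\x81" (U+00C1 in UTF-8) gets one space of padding when measured in code units and two when measured in code points. The decomposed form "A\xCC\x81" is three code units and two code points but still a single EGC, so neither measure matches what a reader sees, and counting EGCs would need Unicode segmentation data that this sketch does not attempt.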

Tom.

Received on 2019-09-09 21:33:10