sg16: Re: [SG16] Wording strategy for Unicode std::format

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 26 Apr 2021 23:08:00 -0400

On 4/19/21 5:19 AM, Corentin Jabot via SG16 wrote:
>
>
> On Mon, Apr 19, 2021 at 11:01 AM Peter Brett via SG16
> <sg16_at_[hidden]> wrote:
>
> Good, I think that deliberately forbidding implicit transcoding is
> the way to go here.
>
>
> I would like to challenge that
>
> Converting between UTF-X and UTF-Y is a lossless operation.
> Say users want to debug a u32_string: they should not have to use a
> u32 format string. It requires more storage for the format string, and
> there might be an extra conversion in the opposite direction to be able
> to do anything useful with it (no system will be able to print a UTF-32
> string natively).
>
> So what is it that we gain by not allowing format(u8"{}", u"");
> and format(u8"{}", U"");?
>
> I do agree that format(u8"{}", ""); is a bit dicey and not allowing it
> is a good starting point.
>
> (and of course, format("{}", u8""); would have strong opposition from me)

Could you elaborate regarding these last two? Why is the first only "a
bit dicey" while the latter prompts "strong opposition"? Is it because
the latter has more likelihood to involve a lossy conversion, at least
for code intended to be portable?
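To make the lossless/lossy distinction concrete, here is a minimal sketch, assuming well-formed input and doing no validation. The helper names (utf32_to_utf8, utf8_to_utf32, to_ascii_lossy) are hypothetical and only for illustration; they are not proposed library facilities. ASCII stands in for an arbitrary non-UTF narrow encoding.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Encode UTF-32 as UTF-8 bytes (stored in std::string for simplicity).
// Assumes every input value is a valid Unicode scalar value.
std::string utf32_to_utf8(std::u32string_view in) {
    std::string out;
    for (char32_t c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Decode UTF-8 back to UTF-32. Assumes well-formed input.
std::u32string utf8_to_utf32(std::string_view in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        // Number of continuation bytes implied by the lead byte.
        const int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        char32_t c = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int j = 1; j <= extra; ++j) {
            c = (c << 6) | (static_cast<unsigned char>(in[i + j]) & 0x3F);
        }
        out += c;
        i += static_cast<std::size_t>(extra) + 1;
    }
    return out;
}

// Convert to ASCII, replacing anything unrepresentable with '?'.
// This direction cannot be undone: the original code points are gone.
std::string to_ascii_lossy(std::u32string_view in) {
    std::string out;
    for (char32_t c : in) {
        out += c < 0x80 ? static_cast<char>(c) : '?';
    }
    return out;
}
```

The round trip through utf32_to_utf8 and utf8_to_utf32 returns the original string, whereas to_ascii_lossy discards every non-ASCII code point.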

Tom.

>
> What should the following code do?
>
> #include <format>
> #include <locale>
>
> int main() {
>     std::locale::global(std::locale("en_GB"));
>     auto a = std::format("{:} {:L}", 10000, 20000);
>     auto b = std::format(u8"{:} {:L}", 10000, 20000);
> }
>
> Options:
>
> 1. *a = "10000 20,000", b = u8"10000 20,000"*. In effect, mandate
> that implementations supply UTF-8/16/32 locale.
> 2. *a = "10000 20,000", b = u8"10000 20000"*. Silently ignore the
> ‘L’ option, setting up for an ABI break if we want to
> introduce Better Global Locale at some point in the future. It
> may also cause non-obvious changes in behaviour for users who
> start using std::format with 'L', but then want to u8-ify it
> later.
> 3. *a = "10000 20,000", b = /<std::format_error>/.* Forbid
> implementers from even attempting to make 'L' option work in
> UTF-8/16/32 formatting.
> 4. *a = "10000 20,000", b = /<conditionally-supported>/.* Many
> implementers will choose to throw std::format_error. However,
> if wchar_t is 32-bit, and the wide execution encoding is
> UTF-32, then some implementers will actually be able to make ‘L’
> Just Work in UTF-32 formatting, and it doesn’t seem
> unreasonable to allow them to do that.
>
> “std::locale in its current form is pretty much useless,” may be a
> true statement but it doesn’t help me make progress.
>
>
> I interpret it as: let's tread carefully and let's not back ourselves
> into a corner.
> However, I think that Victor's point that we should not create
> inconsistencies between the different overloads is also a very good point.
>
> Peter
>
> *From:* Victor Zverovich <victor.zverovich_at_[hidden]>
> *Sent:* 18 April 2021 23:23
> *To:* Peter Brett <pbrett_at_[hidden]>
> *Cc:* SG16 <sg16_at_[hidden]>
> *Subject:* Re: Wording strategy for Unicode std::format
>
> > {fmt} already does exactly this, right?
>
> {fmt} explicitly disallows any implicit transcoding at the
> formatting level.
>
> > How can we word this so as to make `{L}` substitutions for
> > UTF-8/16/32 formatting conditionally-supported, depending on whether
> > the implementation provides the necessary specializations of <locale>
> > facilities?
>
> I'm less concerned with how we word this and more with the fact
> that std::locale in its current form is pretty much useless. I
> would recommend looking into this instead of blindly extending it
> to new code unit types.
>
> Cheers,
>
> Victor
>
> On Fri, Apr 16, 2021 at 10:08 AM Peter Brett <pbrett_at_[hidden]> wrote:
>
> Hi all (esp. Victor),
>
> We discussed adding C++23 support for homogeneous formatting in UTF-8,
> UTF-16 and UTF-32. For C++23, we would like to allow UTF-8 format
> strings with UTF-8 substitutions, UTF-16 format strings with UTF-16
> substitutions, etc. In a future version of the standard (where UTF
> transcoding is guaranteed to be available) we would like to extend this
> to allowing e.g. UTF-32 substitutions into UTF-8 format strings.
>
> Victor: {fmt} already does exactly this, right?
>
> As far as I can tell, most of the wording is already in place for this,
> and it will only be necessary to mandate the addition of specific
> overloads and template specialisations.
>
> My current sticking point is the way we have specified the
> locale-specific form (with the `L` option). Take the `{L}` substitution
> for bool, for example. In P1892 I chose to specify this in terms of
> std::numpunct<charT>, but the standard only requires the standard
> library to provide numpunct<char> and numpunct<wchar_t> specializations.
> Similar problems arise for `L` with other standard format specifiers.
>
> How can we word this so as to make `{L}` substitutions for UTF-8/16/32
> formatting conditionally-supported, depending on whether the
> implementation provides the necessary specializations of <locale>
> facilities?
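The numpunct dependency described above can be sketched roughly as follows. The function name localized_bool is hypothetical and this is not the P1892 wording. It only shows the shape of a bool substitution specified in terms of std::numpunct<charT>, exercised here for the two character types the standard requires.

```cpp
#include <locale>
#include <string>

// Pick the locale-specific spelling of a bool via numpunct<CharT>.
// The standard guarantees this facet only for CharT = char and wchar_t;
// for char8_t, char16_t or char32_t an implementation need not provide
// it, so the use_facet lookup is not guaranteed to succeed there.
template <class CharT>
std::basic_string<CharT> localized_bool(bool value, const std::locale& loc) {
    const auto& punct = std::use_facet<std::numpunct<CharT>>(loc);
    return value ? punct.truename() : punct.falsename();
}
```

In the classic locale, localized_bool<char>(true, loc) yields "true" and localized_bool<wchar_t>(false, loc) yields L"false".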
>
> Advice appreciated.
>
> Peter
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>

Received on 2021-04-26 22:08:05