sg16: Re: [SG16] Wording strategy for Unicode std::format

From: Peter Brett <pbrett_at_[hidden]>
Date: Mon, 26 Apr 2021 17:56:54 +0000

> There has been very little interest in char*_t overloads though.

I’ve been thinking about this quite a bit over the weekend (esp. during our impromptu Twitter-based SG16 meeting).

I’ve come to the conclusion that char8_t is actually a solution in search of a problem. All you know when you see a u8string is that its associated encoding is UTF-8, but this isn’t actually very useful for anyone; you still have to treat it as a bag of bytes of indeterminate encoding because, by (lack of) contract, a u8string can contain any old nonsense.

Therefore, if I have the option to build my code with UTF-8 as the literal encoding (and nearly everyone has that option), all char8_t does is to provide me with two mutually-incompatible ways to express “some unknown bytes that might be text.”

I’ve therefore decided not to spend any more of my limited available committee time on any language or library changes related to charN_t. Instead, I think it would be more productive for me to focus on adding features that provide text codecs and validation (thanks, JeanHeyd!).
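
[Editor's illustration, not part of the original message.] The point that a string type alone carries no validity guarantee can be made concrete with a small well-formedness check. The helper name is_valid_utf8 is hypothetical and is not taken from any proposal mentioned in this thread. The sketch takes std::string_view so that it builds as C++17; the same logic would apply unchanged to the code units of a u8string.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if s is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// encoded surrogates, and values above U+10FFFF.
inline bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;   // total length of this sequence in bytes
        char32_t cp = 0;       // code point being decoded
        char32_t min = 0;      // smallest code point allowed for this length
        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;     // stray continuation byte or invalid lead byte

        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // encoded surrogate
        if (cp > 0x10FFFF) return false;                  // outside Unicode
        i += len;
    }
    return true;
}
```

Nothing in the type system prevents a byte sequence such as "\xC0\x80" from being stored in a string whose associated encoding is UTF-8; only a run-time check of this kind (or a type that enforces it on construction) establishes validity.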

Thanks again for all your valuable contributions to this discussion, Victor!

               Peter

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Sent: 26 April 2021 18:36
To: Peter Brett <pbrett_at_[hidden]>
Cc: sg16_at_[hidden]
Subject: Re: [SG16] Wording strategy for Unicode std::format

> Corentin:
> If our requirements are ... - all of which I agree with individually - I don't see a path forward.

I don't see any problem. Since we are at the beginning of the C++23 cycle, it makes sense to tackle the more difficult problems, such as locale, first. Adding format overloads is not hard and makes more sense to do later, when we have a better understanding of, and hopefully a solution for, the locale issues.

> A non-broken Unicode locale support calls for research and implementation experience outside of the standard.

That's exactly what I intend to do; contributors are welcome.

> Providing something that is consistent and *not more* broken than the narrow overload seems useful in the short term.

First, I don't think it's necessary, per my comment above, and second, it will likely severely constrain our design space for fixing these issues in the future.

> Peter:
> Just for clarity, what does {fmt} currently do?

Unfortunately, {fmt} cannot fix the standard library, so it does something completely crazy in the char*_t overloads: it uses a narrow locale and casts the characters. There has been very little interest in char*_t overloads though.

Cheers,
Victor

On Thu, Apr 22, 2021 at 8:57 AM Peter Brett <pbrett_at_[hidden]> wrote:
Hi Victor,

This is helpful, thank you. I will put full ‘L’ handling for UTF-8/16/32 as the preferred option in the paper.

Just for clarity, what does {fmt} currently do? Obviously if it currently does something different then I will have to do some work to demonstrate implementability.

Best wishes,

           Peter

From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Victor Zverovich via SG16
Sent: 22 April 2021 15:27
To: SG16 <sg16_at_[hidden]>
Cc: Victor Zverovich <victor.zverovich_at_[hidden]>
Subject: Re: [SG16] Wording strategy for Unicode std::format

> Peter:
> What should the following code do?

I think (1) is the only acceptable option because all the rest are inconsistent with existing std::format overloads.

> “std::locale in its current form is pretty much useless,” may be a true statement but it doesn’t help me make progress.

Maybe we are trying to make "progress" in the wrong direction? We don't have to quickly hack something together for new std::format overloads. We didn't have a chance to look at locale in C++20 but now is a great time.

> Corentin:
> Converting between UTF-X and UTF-Y is a lossless operation.

Only for valid input. There is still the question of how to handle transcoding errors.

> what is it that we gain by not allowing format(u8"{}", u""); and format(u8"{}", U"");?

We gain consistency between all std::format overloads, a simple specification, and not having to deal with transcoding errors. I am not suggesting that it shouldn't be possible, but that it should be explicit, e.g.

  format(u8"{}", xcode(u""))

With an explicit approach you can easily configure error handling.
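
[Editor's illustration, not part of the original message.] A minimal sketch of what an explicit transcoding step with configurable error handling could look like. The xcode name comes from the message above, but its signature, the on_error policy, and the append_utf8 helper are all assumptions made for this example; no such facility is specified in the thread. Output is returned as std::string holding UTF-8 code units so that the sketch builds as C++17; a char8_t version would return std::u8string instead.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical error policy for ill-formed input.
enum class on_error { replace, fail };

// Append the UTF-8 encoding of a Unicode scalar value to out.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical stand-in for xcode(): UTF-16 in, UTF-8 code units out.
// The caller chooses what happens when a lone surrogate is encountered.
inline std::string xcode(std::u16string_view in,
                         on_error policy = on_error::replace) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t u = in[i];
        const bool is_high = u >= 0xD800 && u <= 0xDBFF;
        const bool is_surrogate = u >= 0xD800 && u <= 0xDFFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: combine into one scalar value.
            const char32_t low = in[i + 1];
            append_utf8(out, 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00));
            ++i;
        } else if (is_surrogate) {
            // Lone surrogate: this is the transcoding error case.
            if (policy == on_error::fail)
                throw std::invalid_argument("lone surrogate in UTF-16 input");
            append_utf8(out, 0xFFFD);  // U+FFFD REPLACEMENT CHARACTER
        } else {
            append_utf8(out, u);
        }
    }
    return out;
}
```

With a call such as xcode(text, on_error::fail), the error-handling decision is visible at the call site, whereas an implicit conversion inside format would have to pick one policy for every caller.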

> we could provide only the u8 overload

Sure, provided that we have transcoding facilities. I don't think u16 and u32 overloads are particularly useful, since you can't do much with the result.

> mandate that the existing locale be specialized for char8_t

If this specialization inherits all existing locale problems then I think it's not a good idea.

- Victor

On Mon, Apr 19, 2021 at 3:18 AM Corentin Jabot via SG16 <sg16_at_[hidden]> wrote:
Talking with Peter, we realized we could provide only the u8 overload and mandate that the existing locale be specialized for char8_t.
We believe this would

  * Satisfy Victor's excellent remark about the need not to be gratuitously inconsistent
  * Put minimum strain on implementers
  * Let us move forward with having a Unicode overload in C++23.

--
SG16 mailing list
SG16_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2021-04-26 12:57:22