Re: [SG16] Wording strategy for Unicode std::format

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 26 Apr 2021 15:33:57 -0400
On 4/26/21 1:56 PM, Peter Brett via SG16 wrote:
> /There has been very little interest in char*_t overloads though./
> I’ve been thinking about this quite a bit over the weekend (esp.
> during our impromptu Twitter-based SG16 meeting).
> I’ve come to the conclusion that char8_t is actually a solution in
> search of a problem. All you know when you see a u8string is that its
> associated encoding is UTF-8, but this isn’t actually very useful for
> anyone; you still have to treat it as a bag of bytes of indeterminate
> encoding because, by (lack of) contract, a u8string can contain any
> old nonsense.
Some of this is true, but not all of it.

It is correct that a char8_t sequence or a std::u8string may contain
garbage. This is not unique to char8_t; it is true for all of the other
character types as well. It is also true of Rust's std::string::String;
it is possible to put garbage in Rust's string types too. The Rust
standard library has a precondition that a String contains
well-formed UTF-8. Rust's string type is much better designed to avoid
such garbage than std::basic_string is; no one is going to claim otherwise.

But that lack of a strict well-formed guarantee does not prevent solving
real problems. The introduction of char8_t solved a real problem in the
standard library itself; it enabled std::filesystem::path constructor
overloads to support UTF-8 and removed the need for the
std::filesystem::u8path() factory function workaround.

char8_t was never intended to solve all UTF-8 related problems. It is
intended to enable type safe use of UTF-8 without accidentally mixing
UTF-8 text with text in other encodings (stored in char).

Anyone using char8_t for anything other than UTF-8 (except for the
special case of creating ill-formed UTF-8 text for the purposes of
testing that such text is correctly rejected/handled) is abusing the
type. Full stop.

We can certainly add a type that enforces well-formed UTF-8 using either
char or char8_t; I suspect it would look a lot like Rust's string type,
including unchecked functions that enable construction with garbage
because we don't want to pay the overhead of constant re-validation.
Functions that accept text will always have to have a wide contract or a
precondition on well-formed text. That cannot be avoided.

> Therefore, if I have the option to build my code with UTF-8 as the
> literal encoding (and nearly everyone has that option), all char8_t
> does is to provide me with two mutually-incompatible ways to express
> “some unknown bytes that might be text.”
The standard library does not have that option. Nor does a lot of
commercial software, including nearly all software that runs in some
particular ecosystems. There are 3-5 million C++ developers (or
whatever the current estimate is) and experiences vary considerably.
Please don't assume that the options available to you, or your
experience, translate widely throughout the community. I have
personally worked on code bases that, for both migration cost and
technical reasons, would find switching to UTF-8 literal encoding

> I’ve therefore decided not to spend any more of my limited available
> committee time on any language or library changes related to charN_t.
> Instead I think it would be more productive for me to focus on adding
> features that provide text codecs and validation (thanks, JeanHeyd!)
That is certainly your choice and your contributions are of course welcome.


> Thanks again for all your valuable contributions to this discussion,
> Victor!
> Peter
> *From:*Victor Zverovich <victor.zverovich_at_[hidden]>
> *Sent:* 26 April 2021 18:36
> *To:* Peter Brett <pbrett_at_[hidden]>
> *Cc:* sg16_at_[hidden]
> *Subject:* Re: [SG16] Wording strategy for Unicode std::format
> > Corentin:
> > If our requirements are ... - all of which I agree with individually
> - I don't see a path forward.
> I don't see any problem. Since we are in the beginning of the C++23
> cycle it makes sense to tackle more difficult problems such as locale
> first. Adding format overloads is not hard and makes more sense to do
> later when we have a better understanding and hopefully solution for
> locale issues.
> > A non-broken Unicode locale support calls for research and
> implementation experience outside of the standard.
> That's exactly what I intend to do, contributors are welcome.
> > Providing something that is consistent and *not more* broken than
> the narrow overload seems useful in the short term.
> First I don't think it's necessary per my comment above and second it
> will likely severely constrain our design space for fixing these
> issues in the future.
> > Peter:
> > Just for clarity, what does {fmt} currently do?
> Unfortunately {fmt} cannot fix the standard library so it does
> something completely crazy in char*_t overloads, namely uses a narrow
> locale and casts characters. There has been very little interest in
> char*_t overloads though.
> Cheers,
> Victor
> On Thu, Apr 22, 2021 at 8:57 AM Peter Brett <pbrett_at_[hidden]
> <mailto:pbrett_at_[hidden]>> wrote:
> Hi Victor,
> This is helpful, thank you. I will put full ‘L’ handling for
> UTF-8/16/32 as the preferred option in the paper.
> Just for clarity, what does {fmt} currently do? Obviously if it
> currently does something different then I will have to do some
> work to demonstrate implementability.
> Best wishes,
> Peter
> *From:*SG16 <sg16-bounces_at_[hidden]
> <mailto:sg16-bounces_at_[hidden]>> *On Behalf Of *Victor
> Zverovich via SG16
> *Sent:* 22 April 2021 15:27
> *To:* SG16 <sg16_at_[hidden] <mailto:sg16_at_[hidden]>>
> *Cc:* Victor Zverovich <victor.zverovich_at_[hidden]
> <mailto:victor.zverovich_at_[hidden]>>
> *Subject:* Re: [SG16] Wording strategy for Unicode std::format
> > Peter:
> > What should the following code do?
> I think (1) is the only acceptable option because all the rest are
> inconsistent with existing std::format overloads.
> > “std::locale in its current form is pretty much useless,” may be
> a true statement but it doesn’t help me make progress.
> Maybe we are trying to make "progress" in the wrong direction? We
> don't have to quickly hack something together for new std::format
> overloads. We didn't have a chance to look at locale in C++20 but
> now is a great time.
> > Corentin:
> > Converting between UTF-X and UTF-Y is a lossless operation.
> Only valid ones. There is still a question of handling transcoding
> errors.
> > what is it that we gain by not allowing format(u8"{}", u"");
> and format(u8"{}", U"");?
> We gain consistency between all std::format overloads, simple
> specification, not having to deal with transcoding errors. I am
> not suggesting that it shouldn't be possible but that it should be
> explicit, e.g.
> format(u8"{}", xcode(u""))
> With explicit approach you can easily configure error handling.
> > we could provide only the u8 overload
> Sure provided that we have transcoding facilities. I don't think
> u16 and u32 overloads are particularly useful since you can't do
> much with the result.
> > mandate that the existing locale be specialized for char8_t
> If this specialization inherits all existing locale problems then
> I think it's not a good idea.
> - Victor
> On Mon, Apr 19, 2021 at 3:18 AM Corentin Jabot via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
> Talking with Peter, we realized we could provide only the u8
> overload and mandate that the existing locale be specialized
> for char8_t
> We believe this would
> * Satisfy Victor's excellent remark about the need not to be
> gratuitously inconsistent
> * Put minimum strain on implementers
> * Let us move forward with having a Unicode overload in 23.
> --
> SG16 mailing list
> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2021-04-26 14:34:03