sg16: Re: [SG16] Wording strategy for Unicode std::format

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 26 Apr 2021 22:58:10 -0400

On 4/26/21 4:01 PM, Corentin Jabot wrote:
>
>
> On Mon, Apr 26, 2021 at 9:34 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> On 4/26/21 1:56 PM, Peter Brett via SG16 wrote:
>>
>> /There has been very little interest in char*_t overloads though./
>>
>> I’ve been thinking about this quite a bit over the weekend (esp.
>> during our impromptu Twitter-based SG16 meeting).
>>
>> I’ve come to the conclusion that char8_t is actually a solution
>> in search of a problem. All you know when you see a u8string is
>> that its associated encoding is UTF-8, but this isn’t actually
>> very useful for anyone; you still have to treat it as a bag of
>> bytes of indeterminate encoding because, by (lack of) contract, a
>> u8string can contain any old nonsense.
>>
> Some of this is true, but not all of it.
>
> It is correct that a char8_t sequence or a std::u8string may
> contain garbage. This is not unique to char8_t; it is true for
> all of the other character types as well. It is also true of
> Rust's String type (see from_utf8_unchecked()
> <https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_unchecked>);
> it is possible to put garbage in Rust's string types too. The
> Rust standard library has a precondition that a String contains
> well-formed UTF-8. Rust's string type is much better designed to
> avoid such garbage than std::basic_string is; no one is going to
> claim otherwise.
>
> Rust is putting preconditions on from_utf8_unchecked() - doing the
> same thing in C++ would make the value proposition of char8_t A LOT
> MORE interesting
We can do that with a new type (e.g., std::text). This wasn't a
reasonable option for u8string. And the invariants needed can't be
maintained by a code unit type.
>
> But that lack of a strict well-formed guarantee does not prevent
> solving real problems. The introduction of char8_t solved a real
> problem in the standard library itself; it enabled
> std::filesystem::path constructor overloads to support UTF-8 and
> made the std::filesystem::u8path() factory function workaround
> unnecessary.
>
> char8_t was never intended to solve all UTF-8 related problems.
> It is intended to enable type safe use of UTF-8 without
> accidentally mixing UTF-8 text with text in other encodings
> (stored in char).
>
> Anyone using char8_t for anything other than UTF-8 (except for the
> special case of creating ill-formed UTF-8 text for the purposes of
> testing that such text is correctly rejected/handled) is abusing
> the type. Full stop.
>
> We need the wording to explicitly say that. And enforce it
The only way we can do that for the existing interfaces is to add
preconditions. Zach looked into that (remember P1880
<https://wg21.link/p1880>?) and determined the required wording updates
were not worth the effort (see discussion in our 2020-05-13 telecon
<https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2020.md#may-13th-2020>
and https://github.com/cplusplus/papers/issues/630).
>
> We can certainly add a type that enforces well-formed UTF-8 using
> either char or char8_t; I suspect it would look a lot like Rust's
> string type, including unchecked functions that enable
> construction with garbage because we don't want to pay the
> overhead of constant re-validation. Functions that accept text
> will always have to have a wide contract or a precondition on
> well-formed text. That cannot be avoided.
>
> Yes. We need to put these preconditions in place.
I'm not so sure about "need", but I agree it would be a good improvement
if a suitable wording strategy can be identified.
>
>> Therefore, if I have the option to build my code with UTF-8 as
>> the literal encoding (and nearly everyone has that option), all
>> char8_t does is to provide me with two mutually-incompatible ways
>> to express “some unknown bytes that might be text.”
>>
> The standard library does not have that option. Nor does a lot of
> commercial software, including nearly all software that runs in
> some particular ecosystems. There are 3-5 million C++ developers
> (or whatever the current estimate is) and experiences vary
> considerably. Please don't assume that the options available to
> you, or your experience, translate widely throughout the
> community. I have personally worked on code bases that, for both
> migration cost and technical reasons, would find switching to
> UTF-8 literal encoding infeasible.
>
>> I’ve therefore decided not to spend any more of my limited
>> available committee time on any language or library changes
>> related to charN_t. Instead I think it would be more productive
>> for me to focus on adding features that provide text codecs and
>> validation (thanks, JeanHeyd!)
>>
> That is certainly your choice and your contributions are of course
> welcome.
>
>
> The point I think Peter is making is that the number one issue with
> both char and uint8_t is that they carry no semantics.
> uint8_t is just an integer; char is either a narrow code unit, a byte,
> an integer of some kind, or a code unit in some other encoding.
>
> The problem with that is they are not useful for establishing
> expectations. You get something that different libraries will
> interpret differently, which results in mojibake.
Agreed.
>
> The _only_ way we have to avoid this issue is to do the best we can to
> ensure that there is an ecosystem-wide reasonable expectation that
> char8_t, but more importantly u8string_view and u8string, denote
> utf-8 (meaning, tautologically, valid utf-8).
Agreed, I don't know of anyone suggesting char8_t should be used for
anything other than UTF-8 data.
>
> This *doesn't* mean that there cannot be functions accepting these
> types with a wide contract, such as validating/decoding/transcoding
> functions.
> But from within the center of the Unicode sandwich, that expectation
> should be materialized by preconditions that these types when passed
> to text functions such as fmt denote UTF-8.
> We should be giving a lot more consideration to P1880 even if it means
> doing a survey of all text functions in the standard library (which we
> should be doing anyhow).
> Note that most functions operating on xxstring(_view) have no text
> semantics; they just manipulate code units.
Oh, you clearly do remember P1880! I agree with all of this; we need
someone to do the work.
>
> The reasoning extends to ALL text functions, that there is a
> precondition that whatever is passed to a function that accepts text
> is in the encoding expected by that function
> - and a text/character function is defined tautologically in a
> roundabout way as something that interprets its input using an encoding.
>
> This doesn't change the status quo; it just gives us a tool to explain
> exactly why mojibake happens, lets us recognize that mojibake is bad
> and the consequence of a non-well-formed program,
> and gives us an opportunity to trace a well-defined path.
>
> By the same logic
>
> * Narrow string literals have the same abstract values when
> interpreted as in the narrow literal or narrow execution encoding
> * Ditto for wide
> * Changing the locale changes the execution encoding and so affects
> preconditions, which in practice means that utf8 literal encoding
> interpreted as Big5 will hit UB pretty fast, which will result in
> mojibake, which is the status quo
>
So far, we seem to be on the same page.
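A tiny sketch of the mechanism, for anyone following along. It uses ISO-8859-1 rather than Big5 as the mismatched encoding, purely because Latin-1 decoding fits in one line; the helper names are hypothetical.

```cpp
#include <string>
#include <string_view>

// ISO-8859-1 maps each byte to the code point with the same numeric value.
constexpr char32_t latin1_to_code_point(char byte) {
    return static_cast<char32_t>(static_cast<unsigned char>(byte));
}

// Interprets a byte sequence as Latin-1, whatever it was originally encoded in.
inline std::u32string decode_as_latin1(std::string_view bytes) {
    std::u32string out;
    out.reserve(bytes.size());
    for (char b : bytes) out.push_back(latin1_to_code_point(b));
    return out;
}
```

The UTF-8 encoding of U+00E9 is the two bytes 0xC3 0xA9. Passed to decode_as_latin1, they come back as U+00C3 U+00A9, two characters instead of one. Nothing failed; the function's encoding expectation was simply violated.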
> And from there we realize that utf8 string literals do not do much
> more than adding another footgun to our users' arsenal.
You lost me here. There are no locale sensitive functions in the
standard library that work on char8_t based data. char8_t removed the
footgun of char-based UTF-8 data getting passed to char-based
functions that expect the literal or locale encoding.
> And it's problematic because string literals are created everywhere
> from within the unicode sandwich so they make precondition violations
> more likely.
My understanding is that the vast majority of text comes from data
brought in from outside the program, not from string literals. String
literals are clearly of more relevance for formatting facilities of course.
> And sure we could say to our users "don't do that" but users will, and
> everything will keep being terrible.

I'm not sure what you are referring to here. What would we tell
programmers not to do? Can you provide an example?

Tom.

>
>
>
> Tom.
>
>> Thanks again for all your valuable contributions to this
>> discussion, Victor!
>>
>> Peter
>>
>> *From:* Victor Zverovich <victor.zverovich_at_[hidden]>
>> *Sent:* 26 April 2021 18:36
>> *To:* Peter Brett <pbrett_at_[hidden]>
>> *Cc:* sg16_at_[hidden]
>> *Subject:* Re: [SG16] Wording strategy for Unicode std::format
>>
>> > Corentin:
>> > If our requirements are ... - all of which I agree with
>> individually - I don't see a path forward.
>>
>> I don't see any problem. Since we are in the beginning of the
>> C++23 cycle it makes sense to tackle more difficult problems such
>> as locale first. Adding format overloads is not hard and makes
>> more sense to do later when we have a better understanding and
>> hopefully solution for locale issues.
>>
>> > A non-broken Unicode locale support calls for research and
>> implementation experience outside of the standard.
>>
>> That's exactly what I intend to do, contributors are welcome.
>>
>> > Providing something that is consistent and *not more* broken
>> than the narrow overload seems useful in the short term.
>>
>> First I don't think it's necessary per my comment above and
>> second it will likely severely constrain our design space for
>> fixing these issues in the future.
>>
>> > Peter:
>>
>> > Just for clarity, what does {fmt} currently do?
>>
>> Unfortunately {fmt} cannot fix the standard library so it does
>> something completely crazy in char*_t overloads, namely uses a
>> narrow locale and casts characters. There has been very little
>> interest in char*_t overloads though.
>>
>> Cheers,
>>
>> Victor
>>
>> On Thu, Apr 22, 2021 at 8:57 AM Peter Brett <pbrett_at_[hidden]> wrote:
>>
>> Hi Victor,
>>
>> This is helpful, thank you. I will put full ‘L’ handling for
>> UTF-8/16/32 as the preferred option in the paper.
>>
>> Just for clarity, what does {fmt} currently do? Obviously if
>> it currently does something different then I will have to do
>> some work to demonstrate implementability.
>>
>> Best wishes,
>>
>> Peter
>>
>> *From:* SG16 <sg16-bounces_at_[hidden]> *On Behalf Of* Victor
>> Zverovich via SG16
>> *Sent:* 22 April 2021 15:27
>> *To:* SG16 <sg16_at_[hidden]>
>> *Cc:* Victor Zverovich <victor.zverovich_at_[hidden]>
>> *Subject:* Re: [SG16] Wording strategy for Unicode std::format
>>
>> > Peter:
>>
>> > What should the following code do?
>>
>> I think (1) is the only acceptable option because all the
>> rest are inconsistent with existing std::format overloads.
>>
>> > “std::locale in its current form is pretty much useless,”
>> may be a true statement but it doesn’t help me make progress.
>>
>> Maybe we are trying to make "progress" in the wrong
>> direction? We don't have to quickly hack something together
>> for new std::format overloads. We didn't have a chance to
>> look at locale in C++20 but now is a great time.
>>
>> > Corentin:
>>
>> > Converting between UTF-X and UTF-Y is a lossless operation.
>>
>> Only for valid input. There is still the question of how to handle
>> transcoding errors.
>>
>> > what is it that we gain by not allowing format(u8"{}",
>> u""); and format(u8"{}", U"");?
>>
>> We gain consistency between all std::format overloads, simple
>> specification, not having to deal with transcoding errors. I
>> am not suggesting that it shouldn't be possible but that it
>> should be explicit, e.g.
>>
>> format(u8"{}", xcode(u""))
>>
>> With an explicit approach you can easily configure error handling.
>>
>> > we could provide only the u8 overload
>>
>> Sure provided that we have transcoding facilities. I don't
>> think u16 and u32 overloads are particularly useful since you
>> can't do much with the result.
>>
>> > mandate that the existing locale be specialized for char8_t
>>
>> If this specialization inherits all existing locale problems
>> then I think it's not a good idea.
>>
>> - Victor
>>
>> On Mon, Apr 19, 2021 at 3:18 AM Corentin Jabot via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> Talking with Peter, we realized we could provide only the
>> u8 overload and mandate that the existing locale be
>> specialized for char8_t
>>
>> We believe this would
>>
>> * Satisfy Victor's excellent remark about the need not
>> to be gratuitously inconsistent
>> * Put minimum strain on implementers
>> * Let us move forward with having a Unicode overload in 23.
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>>
>
>

Received on 2021-04-26 21:58:15