Thanks Victor, a few comments based on today's meeting below...

On 6/12/24 3:36 PM, Victor Zverovich via SG16 wrote:
Forwarding Peter's email to SG16 since it's relevant to today's discussion and contains the list of options that we have.

- Victor

---------- Forwarded message ---------
From: Peter Dimov via Lib <lib@lists.isocpp.org>
Date: Tue, May 7, 2024 at 12:12 PM
Subject: Re: [isocpp-lib] Issue 4087: Standard exception messages have unspecified encoding
To: Tom Honermann <tom@honermann.net>, <lib@lists.isocpp.org>
Cc: Peter Dimov <pdimov@gmail.com>


Tom Honermann wrote:
> On 5/7/24 12:54 PM, Peter Dimov via Lib wrote:
> > Tom Honermann wrote:
> >> I don't think the proposed resolution is implementable since the
> >> ordinary literal encoding is not necessarily known at run-time and
> >> there is no way for it to be provided from a call site (at least, not
> >> without plumbing an additional parameter with a default argument to
> >> everywhere that might throw a filesystem_error object).
> > The literal encoding is a constant, so there's no need for it to be
> > passed as a parameter.
>
> In the standard, yes, but less so in the real world where it can vary from one
> TU to another (and does in practice).

I don't think this is a scenario we need to concern ourselves with; but even
if we hypothetically did, passing a parameter to the throw point is not going
to fix things.

That's because it's what() that needs to return the literal encoding of the
translation unit calling it, not the throw point (which doesn't know who will
call what().)

In practice, the implementation will transcode to the literal encoding in
the constructor of filesystem_error. There's nothing better to do.

Also in practice, we have the following options:

1. Return a mishmash of encodings from exception::what;
2. Return a string in the literal encoding from exception::what;
3. Return a string in UTF-8 from exception::what;
4. Return a string in "the runtime locale encoding" from exception::what.

(4) is in quotes because there's no "the" runtime locale encoding.

(3) is what everyone needs and wants, but it's a breaking change.

(1) is what happens today, but we don't want it.

(2) is the only option we can realistically adopt, and in practice, most of
the time, that's what we get today, in whole or in part.

During today's meeting, we (mostly Jens) dove into the C++ Standard's use of NTBS and NTMBS and the references from [exception]p6. We determined that the standard is most in line with the intent of (4) above, with the understanding that the global locale may change between the construction of an exception object and a call to its what() member, and that there are no requirements on either users or implementors to provide messages in any particular encoding. In particular, many of the standard library exception types (e.g., those in the <stdexcept> header ([stdexcept.syn])) have a postcondition that what() returns a pointer to a string that exactly matches the user-provided string passed as what_arg to the constructor (with only an implied precondition that the argument is an NTBS). We therefore aren't at liberty to specify any particular encoding.
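To make the <stdexcept> postcondition concrete, here is a minimal sketch (the helper names what_round_trips and check_examples are mine, purely for illustration). It shows that std::runtime_error hands back exactly the bytes it was given, whatever encoding they happen to be in:

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if what() reproduces the constructor
// argument byte for byte, as the <stdexcept> postcondition
// strcmp(what(), what_arg.c_str()) == 0 requires.
bool what_round_trips(const std::string& what_arg) {
    std::runtime_error e(what_arg);
    return std::strcmp(e.what(), what_arg.c_str()) == 0;
}

// The library does not inspect or transcode the bytes, so all of these hold.
void check_examples() {
    assert(what_round_trips("file not found"));          // plain ASCII
    assert(what_round_trips("caf\xC3\xA9 not found"));   // e-acute as UTF-8
    assert(what_round_trips("caf\xE9 not found"));       // e-acute as Latin-1
}
```

Since the same call accepts both the UTF-8 and the Latin-1 spelling unchanged, any encoding requirement we added to what() would be one that these types cannot enforce.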

Subject to agreement from implementors, we could specify that paths passed as arguments to std::filesystem::filesystem_error constructors are formatted as an escaped string ([format.string.escaped]) in the encoding of the global C++ locale.

For std::format support of std::exception objects, we can then specify that the message returned from what() is interpreted as being in the encoding of the global C++ locale and converted to the desired encoding, with substitution characters (or escape sequences) replacing any portions that are not well-formed according to the global C++ locale encoding. This will at least ensure a validly encoded result in the event that the message returned by what() is not well-formed (as in the case of paths containing arbitrary byte sequences) or contains text in an unexpected encoding. We can debate the use of substitution characters (which lose information) vs the use of escape sequences (which can preserve the original message content).
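As a rough illustration of the substitution vs escape trade-off, here is a sketch under some loud assumptions: it treats UTF-8 as both the assumed message encoding and the target encoding (so no real transcoding happens), it handles each ill-formed byte individually, and the names utf8_sequence_length, sanitize_message and check_sanitize_examples are hypothetical, not proposed wording or an existing API. The \x{..} spelling is modeled on [format.string.escaped].

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Length of the well-formed UTF-8 sequence starting at s[i], or 0 if the
// bytes at that position are ill-formed (bad lead byte, overlong form,
// surrogate, value above U+10FFFF, or truncated sequence).
std::size_t utf8_sequence_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = byte(i);
    if (b0 < 0x80) return 1;
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;  // rejects overlong encodings
        if (b0 == 0xED) hi = 0x9F;  // rejects surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;  // rejects overlong encodings
        if (b0 == 0xF4) hi = 0x8F;  // rejects values above U+10FFFF
    } else {
        return 0;  // stray continuation byte or invalid lead byte
    }
    if (i + len > s.size()) return 0;  // truncated sequence
    if (byte(i + 1) < lo || byte(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k)
        if (byte(i + k) < 0x80 || byte(i + k) > 0xBF) return 0;
    return len;
}

// Copies well-formed sequences through unchanged. Each ill-formed byte becomes
// either an escape such as \x{ff} (keeps the original byte value) or U+FFFD
// (loses it), depending on use_escapes.
std::string sanitize_message(std::string_view msg, bool use_escapes) {
    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < msg.size();) {
        if (const std::size_t n = utf8_sequence_length(msg, i)) {
            out.append(msg.substr(i, n));
            i += n;
        } else {
            const unsigned char b = static_cast<unsigned char>(msg[i]);
            if (use_escapes) {
                out += "\\x{";
                out += hex[b >> 4];
                out += hex[b & 0x0F];
                out += '}';
            } else {
                out += "\xEF\xBF\xBD";  // U+FFFD REPLACEMENT CHARACTER
            }
            ++i;
        }
    }
    return out;
}

void check_sanitize_examples() {
    assert(sanitize_message("caf\xC3\xA9", true) == "caf\xC3\xA9");
    assert(sanitize_message("a\xFF" "b", true) == "a\\x{ff}b");
    assert(sanitize_message("a\xFF" "b", false) == "a\xEF\xBF\xBD" "b");
}
```

With escapes, a reader can recover the original bytes of, say, a path from the output; with U+FFFD, every ill-formed byte collapses to the same character. A real specification would also have to say how escape sequences are distinguished from a literal backslash in the message, which this sketch ignores.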

Tom.


_______________________________________________
Lib mailing list
Lib@lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
Link to this post: http://lists.isocpp.org/lib/2024/05/28203.php