Re: [isocpp-sg16] Fwd: [isocpp-lib] Issue 4087: Standard exception messages have unspecified encoding

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 12 Jun 2024 18:05:28 -0400

Thanks Victor, a few comments based on today's meeting below...

On 6/12/24 3:36 PM, Victor Zverovich via SG16 wrote:
> Forwarding Peter's email to SG16 since it's relevant to today's
> discussion and contains the list of options that we have.
>
> - Victor
>
> ---------- Forwarded message ---------
> From: *Peter Dimov via Lib* <lib_at_[hidden]>
> Date: Tue, May 7, 2024 at 12:12 PM
> Subject: Re: [isocpp-lib] Issue 4087: Standard exception messages have
> unspecified encoding
> To: Tom Honermann <tom_at_[hidden]>, <lib_at_[hidden]>
> Cc: Peter Dimov <pdimov_at_[hidden]>
>
>
> Tom Honermann wrote:
> > On 5/7/24 12:54 PM, Peter Dimov via Lib wrote:
> > > Tom Honermann wrote:
> > >> I don't think the proposed resolution is implementable since the
> > >> ordinary literal encoding is not necessarily known at run-time and
> > >> there is no way for it to be provided from a call site (at least, not
> > >> without plumbing an additional parameter with a default argument to
> > >> everywhere that might throw a filesystem_error object).
> > > The literal encoding is a constant, so there's no need for it to be
> > > passed as a parameter.
> >
> > In the standard, yes, but less so in the real world where it can vary
> > from one TU to another (and does in practice).
>
> I don't think this is a scenario we need to concern ourselves with; but
> even if we hypothetically did, passing a parameter to the throw point is
> not going to fix things.
>
> That's because it's what() that needs to return the literal encoding of
> the translation unit calling it, not the throw point (which doesn't know
> who will call what().)
>
> In practice, the implementation will transcode to the literal encoding in
> the constructor of filesystem_error. There's nothing better to do.
>
> Also in practice, we have the following options:
>
> 1. Return a mishmash of encodings from exception::what;
> 2. Return a string in the literal encoding from exception::what;
> 3. Return a string in UTF-8 from exception::what;
> 4. Return a string in "the runtime locale encoding" from exception::what.
>
> (4) is in quotes because there's no "the" runtime locale encoding.
>
> (3) is what everyone needs and wants, but it's a breaking change.
>
> (1) is what happens today, but we don't want it.
>
> (2) is the only option we can realistically adopt, and in practice, most
> of the time, that's what we get today, in whole or in part.
>
During today's meeting, we (mostly Jens) dove into the C++ Standard's
use of NTBS and NTMBS and references from [exception]p6
<http://eel.is/c++draft/exception#6>. We determined that the standard is
most in line with the intent of (4) above, with the understanding that
the global locale may change between the construction of an exception
object and a call to its what() member and that there are no
requirements on either users or implementors to provide messages in any
particular encoding. In particular, many of the standard library
exception types (e.g., those in the <stdexcept> header ([stdexcept.syn]
<http://eel.is/c++draft/std.exceptions#stdexcept.syn>)) have a
postcondition that what() returns a reference to a string that exactly
matches the user-provided string passed as what_arg to the constructor
(with only an implied precondition that the argument is an NTBS). We
therefore aren't at liberty to specify any particular encoding.
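
To make the postcondition concrete, here is a minimal sketch (not taken
from any implementation) that checks that what() hands back the bytes
given to the constructor unchanged, even when those bytes are not valid
in a particular encoding. The helper name round_trips is hypothetical
and exists only for this illustration.

```cpp
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration only: returns true if what()
// reproduces the constructor argument byte for byte, which is what the
// <stdexcept> postcondition strcmp(what(), what_arg.c_str()) == 0 requires.
bool round_trips(const std::string& what_arg) {
    const std::runtime_error e(what_arg);
    return std::strcmp(e.what(), what_arg.c_str()) == 0;
}
```

For example, round_trips("caf\xE9") holds even though the lone 0xE9
byte is Latin-1 and not well-formed UTF-8; the library neither inspects
nor transcodes the message.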

Subject to agreement from implementors, we could specify that paths
passed as arguments to std::filesystem::filesystem_error constructors
are formatted as an escaped string ([format.string.escaped]
<https://eel.is/c++draft/format.string.escaped>) in the encoding of the
global C++ locale.

For std::format support of std::exception objects, we can then specify
that the message returned from what() is interpreted as being in the
encoding of the global C++ locale and converted to the desired encoding
with substitution characters (or escape sequences) replacing
portions that are not well-formed according to the global C++ locale
encoding. This will at least ensure a validly encoded result in the
event that the message returned by what() is not well-formed (as in the
case of paths containing arbitrary byte sequences) or contains text in
an unexpected encoding. We can debate the use of substitution characters
(which lose information) vs the use of escape sequences (which can
preserve the original message content).
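
As a rough illustration of the two policies, the sketch below assumes
that UTF-8 stands in for the global C++ locale encoding (the real
wording would have to handle whatever encoding the locale uses) and
requires C++17. The names utf8_sequence_length, ill_formed_policy, and
sanitize are hypothetical. Each byte that is not part of a well-formed
sequence is replaced either with U+FFFD or with a \x{..} escape in the
style of [format.string.escaped]. Unlike the real escaped-string
formatting, this sketch does not escape backslashes, quotes, or
non-printable characters, so its escaped output is not unambiguous on
its own.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Length of the well-formed UTF-8 sequence starting at s[i], or 0 if the
// bytes at that position are ill-formed (overlong, surrogate, truncated...).
std::size_t utf8_sequence_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = byte(i);
    if (b0 < 0x80) return 1;
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;       // reject overlong forms
        if (b0 == 0xED) hi = 0x9F;       // reject surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;       // reject overlong forms
        if (b0 == 0xF4) hi = 0x8F;       // reject values above U+10FFFF
    } else {
        return 0;
    }
    if (i + len > s.size()) return 0;
    if (byte(i + 1) < lo || byte(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k)
        if (byte(i + k) < 0x80 || byte(i + k) > 0xBF) return 0;
    return len;
}

enum class ill_formed_policy { substitute, escape };

// Copies well-formed sequences through and replaces each ill-formed byte
// with U+FFFD (loses information) or a \x{hh} escape (keeps the byte value).
std::string sanitize(std::string_view msg, ill_formed_policy policy) {
    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < msg.size();) {
        if (const std::size_t n = utf8_sequence_length(msg, i)) {
            out.append(msg.substr(i, n));
            i += n;
        } else if (policy == ill_formed_policy::substitute) {
            out += "\xEF\xBF\xBD";       // U+FFFD encoded as UTF-8
            ++i;
        } else {
            const unsigned char b = static_cast<unsigned char>(msg[i]);
            out += "\\x{";
            out += hex[b >> 4];
            out += hex[b & 0x0F];
            out += '}';
            ++i;
        }
    }
    return out;
}
```

With a message such as "caf\xE9.txt", the substitute policy yields a
well-formed string in which the original byte can no longer be
recovered, while the escape policy yields caf\x{e9}.txt, which keeps
the byte value visible.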

Tom.

>
> _______________________________________________
> Lib mailing list
> Lib_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2024/05/28203.php
>

Received on 2024-06-12 22:05:30