Date: Tue, 30 Jul 2024 18:42:11 -0400
On 7/30/24 6:38 PM, Tom Honermann via SG16 wrote:
>
> SG16 will hold a meeting *tomorrow* on Wednesday, July 31st, at 19:30
> UTC (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20240731T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>
> The agenda follows.
>
> * P3068R2: Allowing exception throwing in constant-evaluation
> <https://wg21.link/p3068r2>.
> * LWG issue 4087: Standard exception messages have unspecified
> encoding <https://cplusplus.github.io/LWG/issue4087>.
>
> LEWG has requested that we review P3068R2 with respect to
> std::exception and related types and encoding concerns for the message
> provided by the what() member function. The concerns are effectively
> the same as those reported in LWG 4087, but in the special case of
> constant evaluation.
>
> We discussed LWG 4087 during the 2024-06-12 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings#june-12th-2024>.
> Unfortunately, I still haven't published the meeting summary for that
> meeting (work, life, burnout), so that link isn't helpful right now.
> I'll respond to this email with a copy of the (excellent) minutes that
> Eddie Nolan took for that meeting. We spent much of that meeting
> discovering what the status quo is with regard to the standard
> wording. We didn't poll any direction. The status quo appears to be:
>
> * what() returns an implementation-defined NTBS per [exception]p5
> <http://eel.is/c++draft/exception#5>.
> * what() permits return of an NTMBS per [exception]p6
> <http://eel.is/c++draft/exception#6>.
> * The NTMBS encoding is dependent on the C++ locale; it is the
> encoding that the std::codecvt<wchar_t, char, std::mbstate_t>
> facet uses on the char side of the conversion per the reference in
> [exception]p6 <http://eel.is/c++draft/exception#6>.
> * There is no guarantee that the C++ locale has not changed in
> between construction of an exception object and a call to what()
> for that same object.
> * The postconditions of the std::exception copy constructor and
> assignment operator and the constructors of the exception classes
> declared in <stdexcept> all require that what() return a pointer
> to an exact copy of the what_arg string provided when the
> exception object was constructed; no transcoding is permitted. The
> postconditions of std::filesystem::filesystem_error are similar
> per [fs.filesystem.error.members]
> <http://eel.is/c++draft/fs.filesystem.error.members>.
> * We might be able to strengthen the requirements for handling of
> encodings for std::filesystem::filesystem_error::what()
> specifically; normative encouragement is present per
> [fs.filesystem.error.members]p7
> <http://eel.is/c++draft/fs.filesystem.error.members#7>.
>
> The status quo suggests that, for the purposes of std::format(), the
> string returned by what() should be treated as containing (possibly
> ill-formed) text in the NTMBS encoding of the current C++ locale (or
> perhaps an explicitly provided std::locale argument).
>
> With respect to P3068R2, there is currently no notion of a locale
> dependent NTMBS encoding during constant evaluation. We'll need to
> discuss the ramifications of this, presumably identify an encoding to
> use instead (presumably the ordinary literal encoding), and determine
> how to adjust wording accordingly.
>
Here are the rough minutes from the 2024-06-12 SG16 meeting for
reference. Thank you again to Eddie Nolan for capturing these!
Attendance:
- (SD) Steve Downey
- (JM) Jens Maurer
- (TH) Tom Honermann
- (VZ) Victor Zverovich
- (MD) Mark de Wever
- (BG) Braden Ganetsky
- (NO) Nathan Owen
SD: Our agenda is to discuss the LWG issues. We'll be discussing 4070,
4087, and 4090.
SD: (reading the issue) If CharT is char, path::value_type is wchar_t,
and the literal encoding is UTF-8, then the escaped path is transcoded
from the native encoding for wide character strings to UTF-8 with
maximal subparts of ill-formed subsequences substituted with U+FFFD
replacement character per the Unicode Standard [...]. Otherwise,
transcoding is implementation-defined.
This seems to mean that the Unicode substitutions are only done for an
escaped path, i.e. when the ? option is used. Otherwise, the form of
transcoding is completely implementation-defined. However, this makes no
sense. An escaped string will have no ill-formed subsequences, because
they will already have been replaced.
So only unescaped strings can have ill-formed sequences by the time we
do transcoding to char, but whether or not any U+FFFD substitution
occurs is just implementation-defined.
I believe we want to specify the substitutions are done when transcoding
an unescaped path (and it doesn't matter whether we specify it for
escaped paths, because it's a no-op if escaping happens first, as is
apparently intended).
It does matter whether we escape first or perform substitutions first.
If we escape first then every code unit in an ill-formed sequence is
individually escaped as \x{hex-digit-sequence}. So an ill-formed
sequence of two wchar_t values will be escaped as two \x{...} strings,
which are then transcoded to UTF-8. If we transcode (with substitutions)
first, then the entire ill-formed sequence is replaced with a single
replacement character, which will then be escaped as \x{fffd}. SG16
should be asked to confirm that escaping first is intended, so that an
escaped string shows the original invalid code units. For a non-escaped
string, we want the ill-formed sequence to be formatted as �, which the
proposed resolution tries to ensure.
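Illustration (not from the meeting): the ordering question above can be sketched with a small transcoder. This is a simplified sketch under stated assumptions, not the proposed wording: it uses char16_t (UTF-16) rather than wchar_t for portability, treats each unpaired surrogate as one ill-formed unit, and all function names are hypothetical. With escaping enabled the original code unit stays visible as \x{...}; without it the unit is replaced by U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Append one Unicode scalar value to `out` as UTF-8.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: transcode UTF-16 to UTF-8.
// escape_ill_formed == true  models "escape first": a lone surrogate is
//                            written as \x{hex} so the original unit is visible.
// escape_ill_formed == false models the unescaped case: a lone surrogate
//                            is substituted with U+FFFD.
inline std::string utf16_to_utf8(const std::u16string& in, bool escape_ill_formed) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        if (is_high_surrogate(u) && i + 1 < in.size() && is_low_surrogate(in[i + 1])) {
            // Well-formed surrogate pair: combine into one scalar value.
            const char32_t cp = 0x10000
                + ((static_cast<char32_t>(u) - 0xD800) << 10)
                + (static_cast<char32_t>(in[i + 1]) - 0xDC00);
            append_utf8(out, cp);
            ++i;
        } else if (is_high_surrogate(u) || is_low_surrogate(u)) {
            if (escape_ill_formed) {
                char buf[16];
                std::snprintf(buf, sizeof buf, "\\x{%x}", static_cast<unsigned>(u));
                out += buf;
            } else {
                append_utf8(out, 0xFFFD);
            }
        } else {
            append_utf8(out, u);
        }
    }
    return out;
}

// Usage: the same ill-formed input under both treatments.
inline void demo_orderings() {
    const std::u16string s = {u'a', char16_t(0xD800), u'b'};
    assert(utf16_to_utf8(s, true) == "a\\x{d800}b");
    assert(utf16_to_utf8(s, false) == "a\xEF\xBF\xBD" "b");
}
```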
VZ: As an author of the paper I'd like to confirm that it's indeed
intended to first do escaping and then do transcoding. That's why the
wording is the way it is. I agree with Jonathan that it misses the
important bit for non-escaped paths. I think the resolution is mostly
correct, except I think Tom commented in the email that the second part
of the resolution, which is new to me, is a little bit incorrect. I think
what we want is to kind of invert the condition there, but this does
something completely different.
TH: I think what we want to say there is just "and the literal encoding
is not UTF-8". wchar_t encoding is still implementation defined so
there's still an implementation defined aspect there. I don't think we
need to add anything to the implementation-definedness.
SD: What we're saying is that if you're fully in Unicode, there's no
implementation defined behavior, we're completely mandating the behavior.
TH: Specifically, when the literal encoding is UTF-8.
SD: And this is an index entry that just links back, so it's just trying
to identify-- this is just an index entry, there isn't any larger
context. It's trying to describe it well enough so someone looking at
the implementation-defined behaviors can find it.
SD: I'll admit I haven't really thought about this a lot.
VZ: I agree with Tom, this is a mistake in the table of implementation
defined behavior. It should do what Tom says-- we should replace, "not
converting from wchar_t to UTF-8" with "when the literal encoding is not
UTF-8." And the first part, I think, is correct.
SD: Okay. So in the text of the standard itself we want to basically
strike "escaped path" and replace it with "(possibly escaped) string".
VZ: That part is fine.
SD: But defining the implementation-defined behavior is not correct.
VZ: They should just take the wording as it is and put it in the index.
SD: All the wordings in that index are very short summaries of what the
implementation-defined category is.
JM: It just tries to give a headline. It shouldn't be wrong but it's not
necessarily complete. As long as we satisfy that, it's good enough.
TH: Victor, it sounds like you have a good handle on it. Want to paste
the recommendation in chat?
VZ: That's what I'm typing.
VZ (in chat): "the literal encoding is not UTF-8" instead of "not
converting from wchar_t to UTF-8"
SD: That probably covers any interesting case.
JM: It's not fully right because it's not implementation-defined only if
CharT is char and path::value_type is wchar_t.
Presumably, if CharT is char16_t, everything's also implementation
defined but I don't know whether that's a possibility.
JM: And it talks about literal encoding when it might want to talk about
ordinary literal encodings. Is it talking about the literal encoding for
wide strings or for char? But that question's not on our plate.
JM: Because the wide literal encoding could be UTF-16 or something, or
UCS-2 or whatever.
JM: So I like Victor's suggestion for the implementation defined
behavior index.
JM: We already have "literal encoding" in the normative text, so if it's
ambiguous there it should be the same ambiguity in the
implementation-defined index.
TH: Should it say "ordinary literal encoding?"
JM: Maybe but that's not the question of this issue.
SD: There are many places we've already made this mistake. Cleaning it
up should be a one-time thing where we go through and clarify whether we
actually mean ordinary or literal encoding. The sense I'm getting is
that we want to change
JM: All we're doing here is correctly quoting the normative text.
SD: So the resolution is we accept the resolution for clause 1, and for
the second part accept Victor's recommendation.
JM: And what was the concern why this doesn't work? Because what we have
here is the text about the literal encoding thing, right? Let me see
Tom's email.
SD: This doesn't constrain what an implementation can define it to be--
they could perfectly well convert to UTF-8 when the ordinary literal
encoding's not UTF-8 but it's up to implementations to serve their users.
JM: Yes, okay, great. So, Tom, are you happy with not introducing
"ordinary" for the sake of quoting the text, or should we make this a
bigger issue?
TH: No, I'm fine with that, like Steve said, we can do a separate
cleanup issue or file an LWG issue.
JM: So the green text should be "and the literal encoding is not UTF-8."
TH: Yes, that sounds good.
SD: Moving on to 4087.
VZ: std::exception is one of a few remaining standard types that aren't
formattable. I looked into it and found the problem that we don't
actually specify what encoding the string returned by what() is in. We
just say that it's something that can be converted to wstring somehow.
Which is very vague. So it's impossible to implement a formatter
properly because you don't know the encoding to convert from or whether
conversion is needed. I gave an example with a path, but it's a more
general problem-- path is one of the most obvious and outrageous cases
because, as part of the path, you can get the filename. So the exception
encoding has one encoding and you get a filename in a possibly different
encoding and try to format it with the literal encoding and you get
three encodings in one message-- simply a mess. My proposed resolution
is incomplete-- it's just a first attempt to propose something to start
the discussion. I'm saying it should probably be compatible with the
ordinary literal encoding. That's what people normally do, combine it
with literal strings and output. I had an email forwarded to SG16 which
had 4 options which nicely summarize what we can choose. I think Tom
separately suggested using the locale encoding.
SD: I would expect this, barring any external constraints, to be in the
current execution encoding. Which isn't necessarily the literal
encoding. That is a common source of broken text, but that is the state
of the world. If I'm handed a char* and no other information, it's the
execution encoding.
VZ: At the very least we should specify the encoding. Now it doesn't say
anything.
JM: Fully agreed.
SD: Especially because this is instructing end users what they should be
stuffing in these things.
TH: Does multibyte not imply the locale encoding?
JM: It doesn't.
TH: Because we have the association with mbstowcs.
TH: This has always been very vaguely specified.
SD: Does NTBS include multibyte?
JM: Yes. Well, no, wait. The other way around, I thought. Wait wait
wait. A null-terminated byte string, NTBS, is a char sequence whose
highest addressed element with defined content has value 0. No other
element has value 0. An NTMBS is an NTBS that has a sequence of valid
multibyte characters. So an NTBS is everything, an NTMBS is one that has
valid multibyte characters.
TH: Whatever those are.
JM: Now the question is what this NTBS-- it's in the C standard.
TH: mbstowcs.
JM: The mbstowcs function converts a sequence of multibyte characters
that begins in the initial shift state-- it just returns a
null-terminated byte sequence-- the conversion function into a sequence
of corresponding wide characters. Each is converted as if by a call to
mbtowc function. Except that the conversion state of the mbtowc function
is not affected. So for the specific conversion it defers to the other
function.
SD: This does seem overall to be just a whole class of interesting ways
of producing broken text. The whole what() facility, assembling
user-specified data with string literals and doing something to them in
the hopes that someone can reconstruct something intelligible.
JM: The heading for this mbtowc function says, the behavior of the
multibyte character functions is affected by the LC_CTYPE category of
the current locale. Apparently LC_CTYPE can change what a multibyte
character sequence is. That means, essentially, the definition of what a
multibyte character string is is dependent on the LC_CTYPE locale
category because the definition of a multibyte character sequence says
it must be a valid sequence, and the locale tells me what's valid and
what's not. Presumably that means it's actually the global locale or
thread-related locale.
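Illustration (not from the meeting): the point that validity of a multibyte string depends on LC_CTYPE can be probed with the C conversion functions. The helper name below is hypothetical. It calls std::mbsrtowcs with a null destination, which only counts the wide characters and reports an invalid sequence with a return value of (size_t)-1. The same bytes may pass or fail depending on the C locale in effect when the helper is called.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical helper: is `s` a valid multibyte string (an NTMBS) under the
// C locale's LC_CTYPE category as it is set right now?
inline bool is_valid_ntmbs_now(const char* s) {
    std::mbstate_t state{};          // start in the initial shift state
    const char* src = s;
    // Null destination: nothing is stored, the length argument is ignored,
    // and only validity and the resulting length are computed.
    const std::size_t n = std::mbsrtowcs(nullptr, &src, 0, &state);
    return n != static_cast<std::size_t>(-1);
}

// Usage: plain ASCII is valid in the "C" locale that is active at startup.
inline void demo_ntmbs_check() {
    assert(is_valid_ntmbs_now("file not found"));
}
```

Whether a string with bytes above 0x7F passes is locale- and implementation-dependent, which is the issue being discussed.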
SD: Or in our current terminology, the execution encoding.
JM: Which is unfortunate, because usually C++ tries to make the locale
explicit in the interface. In iostreams you can imbue the locale of your
choice, you don't need global state which is broken by design.
SD: Plus the built in race condition during the exception.
TH: Passing in a locale wouldn't work because the message is constructed
much earlier.
JM: You want to pass it in at the place the exception is generated, not
when the what function is called.
TH: But the locale could have changed when you invoke what().
JM: At least it's well-defined. If you call what() and can't make sense
out of it, then it's your fault. But it's hypothetical, because there's
no way to pass a locale at the point where the filesystem is generated.
Is there locale stuff on filesystem?
TH: There is.
VZ: No, it just says "a system-specific encoding."
TH: I think some of the functions do actually take a locale, it's used
to do a conversion to the encoding you're talking about.
JM: The example in the issue where file size is being queried doesn't
seem like somewhere a locale fits in.
SD: This is an example of the general problem.
JM: Looking at the example, there's nowhere to pass in a locale. No one
expects to pass a locale to a file size query function.
TH: The only way you get non-mojibake out is if the global locale was
consistent from the time the message was created to when it was received.
VZ: Did we figure out that NTBS is always in the global locale?
JM: NTMBS is in the global locale.
SD: Unless it's specifically a string literal.
VZ: But what we have is NTBS.
TH: The remarks say NTMBS but the text says NTBS. It's not consistent.
JM: The returns says NTBS, which is any kind of null-terminated byte
sequence. The remarks say, we already told you earlier that NTMBS is a
valid NTBS. That just gives permission to the implementation to give you
an NTMBS as opposed to just an NTBS. What we can do is, for the case of
an NTMBS, where we already say it's suitable for conversion and display
as a wstring, we might want to clarify that it was suitable at the time
of construction of the exception for wstring, because it needs to
evaluate the LC_CTYPE at construction time, not when you call what.
That's what's missing from the remarks, otherwise we already have the
cross-reference to codecvt so we already know what's happening. For the
returns part, which are the minimum requirements, we haven't solved
anything. So far the standard has even refrained from telling you it
must be a multibyte string if multibyte strings are on your platform. An
implementation can return an ASCII only string even if it could return a
multibyte string. We have two things we should do. One is to clarify the
remarks with respect to when the suitable for conversion and display
holds. That holds only immediately after construction and not later (or
we restrict changing the LC_CTYPE or whatever). What we do for plain
NTBS's-- I don't know. Maybe the best thing is not to talk about it.
TH: For solving Victor's problem, we have two concerns. The file path,
incorporating it into a message. That's one issue. As for taking this
NTBS that comes out and getting it formatted, we can specify "as if
using the C function." We can just specify to use the global locale and
you get what you get. Sometimes there might be some weird translations.
VZ: To clarify, by global locale we mean global C locale? There's
multiple sets of locales. C++ locale and C locale. You can separately
set both of them, they're unrelated.
SD: You can change various parts of the locale bits independently.
JM: No, we're talking about the function call set_locale. There's a C
variant of the global set_locale, that takes a C variant, and there's an
equivalent for the C++ locale structures, which sets an independent
locale state.
TH: It may also set the C locale.
JM: But it's not required to. At the start of the program you can expect
that they're the same. The question is, which one do we take. Presumably
the C++ locale.
TH: Except that we have the reference to the conversion functions which
are C-based and use the C locale.
JM: Where?
TH: Maybe I misunderstood before. mbstowcs?
JM: That's not what we do here. The actual cross reference is to the
codecvt facet. The class codecvt is for use when converting from one
character encoding to another... . We have ctype, wchar_t, and mbstate.
Presumably wchar_t is the internal encoding, the external encoding is
char, and the state_t is an mbstate which is a transformation. codecvt
converts between the native character set for ordinary and wide characters.
TH: This might be a case where it'd be good to try to ... some
implementations and set the C and C++ locales differently and see which
one you get.
JM: Where do you want to get what?
TH: Produce an exception object but have locale set.. but we don't specify
JM: We want to specify transcoding..
TH: There is transcoding of file paths on the Windows side.
JM: This text talks about the OS-dependent current encoding for path
names which in this case is CP1251.
VZ: I think path is a red herring because it has its own unrelated
transcoding. What we need to specify is, what's the target encoding for
exception? And specify what the output of the path method should be
converted into. What Tom is suggesting is to look at what path does,
that's not correct.
SD: For anyone producing this string, what should they be trying to do?
I think they should be targeting the current execution encoding as
defined by locale.
EN: Which one?
JM: The global C++ locale. No reference to the C locale in the cross
reference. It says locale codecvt, which you get from the global C++ locale.
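Illustration (not from the meeting): a minimal sketch of converting a what() string to wide characters through the codecvt facet of an explicitly supplied C++ locale, as the cross reference suggests. The function name is hypothetical, and the sketch assumes that one input byte never produces more than one wchar_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: widen `what` using the
// std::codecvt<wchar_t, char, std::mbstate_t> facet of `loc`.
// Returns false if the facet reports anything other than a complete,
// successful conversion.
inline bool widen_with_locale(const char* what, const std::locale& loc,
                              std::wstring& out) {
    using facet_type = std::codecvt<wchar_t, char, std::mbstate_t>;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    const std::size_t len = std::strlen(what);
    out.assign(len, L'\0');            // assumption: at most one wchar_t per byte
    if (len == 0) return true;

    std::mbstate_t state{};
    const char* from_next = what;
    wchar_t* to_begin = &out[0];
    wchar_t* to_next = to_begin;
    const std::codecvt_base::result r =
        cvt.in(state, what, what + len, from_next,
               to_begin, to_begin + out.size(), to_next);

    out.resize(static_cast<std::size_t>(to_next - to_begin));
    return r == std::codecvt_base::ok && from_next == what + len;
}

// Usage: ASCII text converts in the classic ("C") locale.
inline void demo_widen() {
    std::wstring w;
    assert(widen_with_locale("file not found", std::locale::classic(), w));
    assert(w == L"file not found");
}
```

Passing std::locale() instead of std::locale::classic() would use a copy of whatever the global C++ locale is at the time of the call, which is exactly the timing question raised in the discussion.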
VZ: I have a question to Jens. Clarify: the what() makes sense after you
construct, because locale can change
SD: At the point of construction.
VZ: Locale can be changed asynchronously-- what do you mean by that?
SD: That you have a problem if someone does that.
JM: Well, no. Where's the global C++ locale query function? Is that the
default ctor of the locale class?
TH: Maybe? There might also be a global static factory function.
JM: Yes, there's a classic thing (useless) and a global locale function...
JM: If we have a named locale, you get the C locale set to the same
thing, otherwise all bets are off. locale() is the constructor of the
locale class which gets you a copy of the global C++ locale. Race
conditions aren't relevant here-- it's the ctor of a class, no special
rules on race conditions, you can call the locale global setting
function unsynchronized and the stdlib has to deal with it.
SD: You have a logical race condition between starting this process and
who interprets it, but that's baked in.
VZ: But the ctor might have multiple arguments, what if the locale
changes in between?
JM: Your program is broken.
VZ: Why?
JM: Because we don't prevent anyone from changing the locale midway. The
best atomicity guarantee is the default ctor of locale. If you call it
multiple times in close proximity and get different results, tough luck.
TH: So you're supposed to acquire your own copy and reuse it.
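Illustration (not from the meeting): "acquire your own copy and reuse it" could look like the sketch below. The exception type is hypothetical and is not something the standard library or the issue proposes. It only shows that a default-constructed std::locale is a snapshot of the global C++ locale and is unaffected by later calls to std::locale::global.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical exception type that remembers the global C++ locale that was
// in effect when the exception object was constructed.
class located_error : public std::runtime_error {
public:
    explicit located_error(const std::string& msg)
        : std::runtime_error(msg), loc_() {}  // std::locale() copies the global locale

    const std::locale& construction_locale() const noexcept { return loc_; }

private:
    std::locale loc_;
};

// Usage: changing the global locale afterwards does not change the snapshot.
inline void demo_snapshot() {
    const std::locale before;                 // snapshot of the current global locale
    located_error e("disk full");
    assert(e.construction_locale() == before);

    // An unnamed locale that differs from the classic one by a single facet.
    const std::locale other(std::locale::classic(), new std::numpunct<char>);
    std::locale::global(other);
    assert(std::locale() == other);
    assert(e.construction_locale() == before);

    std::locale::global(before);              // restore the previous global locale
}
```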
VZ: We should specify that somehow, that it's in one locale and not a
multiple of locales.
SD: That's instructing programmers not to do broken things.
VZ: One exception object with potentially multiple things it needs
locale for.
JM: What multiple things? It will use the locale in effect when
initiating the ctor call of the exceptions. For user exceptions that,
e.g., combine system_errors into one exception, all we can do is throw
up our hands. We can't even query which encodings were used. Changing
the global locale is a bad idea and as much as we should be able to
convey that, we should do that.
SD: The best we can do is, for exception definitions, what does what()
return? An NTBS to be interpreted in the locale that was in effect when
the exception was constructed.
JM: I don't know about this NTBS part. I'm pretty sure, if we have an
implementation that is an NTMBS, then that should be in the locale at
the time of construction. That's the easy part. If we just have an NTBS,
that is not a multibyte string, which isn't Victor's example, by the
way, because it combines UTF-8 with an odd encoding in the exception
string, but for the NTBS case where the implementation chooses not to
provide an NTMBS, just an NTBS, which doesn't have enough capabilities
to represent the union of the characters in the explanatory string and
the path, I don't know what to do.
JM: Maybe all we can do is say, an implementation defined NTBS, and stop
there, and you get what you get, but you can give a remarks
recommendation for what happens for the multibyte case. If the
implementation tries to be helpful to you, you should get something we
know how to interpret, but it's in principle QOI. Maybe you're on a
small system where there's no practical choice of encoding, so the NTBS
of your system is all that counts.
SD: It's possible that an NTMBS is still a single byte encoding. It's
about who's promising what and when. But I agree. In the remarks when we
clarify this, we can say if someone hasn't handed you a string in the
locale when the string was constructed, they're breaking your contract.
JM: So do we have agreement that we fine tune the NTMBS wording because
we have machinery how to interpret NTMBS strings, and we leave the
guarantee alone because there isn't any guarantee?
SD: The guarantee is that it's null terminated.
JM: Which is not very useful. But again, the remarks say, not as clearly
as they could, they say this is how you can be helpful to your users.
And it's implementation defined so presumably you can ask your implementer.
SD: I propose that I'll take on drafting something after this meeting,
that we can propose as the resolution.
VZ: It's not sufficient. The core of the issue is you can't say anything
about what() and we're not fixing that.
SD: That's the state of the world. The remarks are guiding QOI.
VZ: It's broken and we're keeping it broken. We're not doing anything.
SD: I don't see a way of telling everyone generally, because there's
user data that can show up in what(). The file name can be misencoded.
So there's no way to guarantee that this can be put into a properly
encoded string of anything.
TH: But what we can say is that for the purposes of std::format, if you
call what() on an exception object, interpret it as an NTMBS, and for
anything that doesn't convert you do escaping.
SD: Yes. When trying to format one of these strings, you're going to
have to be suspicious because it's foreign data. It's an NTMBS in the
execution encoding, do your best to produce output in the requested
format. But what() should be in the execution encoding.
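Illustration (not from the meeting): "interpret it as an NTMBS, and for anything that doesn't convert you do escaping" might be sketched as below. This is a simplification with a hypothetical function name. It walks the string with std::mbrtowc, which consults the C locale, whereas the discussion above points at the codecvt facet of the C++ locale. Bytes that do not decode are written as \x{hh}, and valid multibyte characters are copied through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <cwchar>
#include <string>

// Hypothetical helper: copy the valid multibyte characters of `what` and
// escape every byte that cannot be decoded in the current C locale.
inline std::string escape_unconvertible(const char* what) {
    std::string out;
    std::mbstate_t state{};
    const char* p = what;
    const char* end = what + std::strlen(what);
    while (p < end) {
        wchar_t wc;
        const std::size_t n =
            std::mbrtowc(&wc, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2)) {
            // Invalid or truncated sequence: escape one byte and resynchronize.
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\x{%02x}",
                          static_cast<unsigned>(static_cast<unsigned char>(*p)));
            out += buf;
            ++p;
            state = std::mbstate_t{};
        } else if (n == 0) {
            break;                     // decoded a null character: stop
        } else {
            out.append(p, n);          // valid character: copy its bytes as-is
            p += n;
        }
    }
    return out;
}

// Usage: ASCII passes through untouched in the startup "C" locale.
inline void demo_escape() {
    assert(escape_unconvertible("file not found") == "file not found");
}
```

What happens to bytes above 0x7F depends on the locale and the implementation, so no output is claimed for them here.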
EN: Couldn't we require NTMBS that's not NTBS to be forbidden?
JM: Do we have an overview of standard library implementations? If we
tighten the rules on what what() can return, we tighten the rules on
what can be passed as ctor args to e.g. logic_error. Because what()
returns byte-wise what was passed in. So no transcoding can happen in
the ctor. But now that we require a valid MBS in what() we require a
valid MBS as the ctor to the exceptions. Those people don't care about
encoding-- they just want to say what they put in they get out. Do we
want to invalidate them?
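Illustration (not from the meeting): the byte-wise guarantee mentioned here can be checked directly. The helper name is hypothetical. The input deliberately contains bytes that are not valid UTF-8 to show that the constructor neither validates nor transcodes.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>

// Hypothetical helper: does what() return exactly the bytes passed as what_arg?
inline bool what_round_trips(const char* what_arg) {
    const std::logic_error e(what_arg);
    return std::strcmp(e.what(), what_arg) == 0;
}

// Usage: arbitrary bytes survive unchanged.
inline void demo_round_trip() {
    assert(what_round_trips("\xFF\xFE is not valid UTF-8"));
}
```

Requiring a valid NTMBS from what() would therefore become a precondition on this constructor argument, which is the concern raised above.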
EN: Seems convincing that we shouldn't.
SD: I think that, first off, if the execution encoding and ordinary
literal encoding aren't compatible, you have deep problems producing any
output whatsoever. Hello world starts to fail.
JM: That's all fine but the point is my program has library UB if I
violate the preconditions of a library function. That's not a good place
to be in.
TH: I strongly agree. We don't want to invalidate any user code.
SD: Not in the exception or what parts. In producing a formatter, things
are in play.
JM: The formatter has all rights to say, I expect the what string to be
an NTMBS. That is totally fine and good. Then you convert from that
NTMBS to whatever you want and go from there. That seems plausible. But
that's the formatter at the point it wants to output that stuff. It
needs to understand the details. We can strengthen the words to say
something like, we recommend that implementations when constructing the
what string on your own, as opposed to a user one, should create a valid
NTMBS. I don't think we can do more than that. Life would be so much
better if we just said UTF-8 everywhere, but that's not our life.
VZ: I think there was some mischaracterization of the example. Something
like, because it's path we can't do anything. In fact we can do a lot to
improve the situation. We can have all the info, even though we don't
now. If we know the encoding of the path, we can get a perfect output
even in the constraints of the current system. You can display arbitrary
binary data through escaping.
JM: Do you want to expose the filesystem path itself in the exception
object so the formatter can use it?
VZ: It's already exposed as part of the message and should be aligned
with the rest of the text, not in a collection of the text.
JM: We agree that what() should not be in multiple encoding. It *should*
as in implementation recommended practice, definitely. That's what we're
trying to formulate.
SD: I think the phrase here, in the native format, is woefully vague,
and a source of confusion as part of this. As Jens already pointed out,
there are other exceptions which just take a string and copy it, which
have all the same potential issues as Victor identified. This is more
remediable by an implementer.
JM: We can address the filesystem issue, I think, at least we can push
implementations in the right direction. I'm not sure we can do anything
reasonable for exceptions as a whole.
TH: I agree. Victor, I think if a path is being included in the message,
we'd want to reinterpret it and escape it using the mechanism
std::format uses. Can we get away with convincing implementers to change
existing code?
JM: Certainly, it's an untenable situation that we have one NTMBS that
uses two encodings inside. That will never ever work.
VZ: We have an implementer here. Mark, would you be willing to change
exception messages?
MD: I'm not sure we want to. This might have implications on users.
Would be good to investigate further. With libc++ we are typically UTF-8
only which makes our life easier. On Windows people have multiple encodings.
JM: The filesystem_error members take not only two components but
actually three components. A what arg, a literal string, a path, and
an error code. We have three components: user-defined (unknown
encoding), path (known), error_code converted to system_error string
(known to implementation). The problem is slightly larger than we
thought. We also need to require from users that their what() arg is of
the right kind, which is probably okay for filesystem_error because we
have good reasons to require that at that point.
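Illustration (not from the meeting): the three components can be seen in the constructor call itself. The message text and the path below are made up for the example. The exact format of what() is unspecified, so the sketch only checks what the standard promises: the stored path and error code are retrievable, and what() incorporates the what_arg.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical sample: one exception built from three differently sourced parts.
inline std::filesystem::filesystem_error make_sample_error() {
    return std::filesystem::filesystem_error(
        "copy failed",                              // what_arg: user text, encoding unknown
        std::filesystem::path("data/input.txt"),    // path: held in the native encoding
        std::make_error_code(std::errc::no_such_file_or_directory));  // error_code
}

// Usage: the components stay individually accessible next to the combined string.
inline void demo_filesystem_error() {
    const std::filesystem::filesystem_error e = make_sample_error();
    assert(e.path1() == std::filesystem::path("data/input.txt"));
    assert(e.code() == std::errc::no_such_file_or_directory);
    assert(std::string(e.what()).find("copy failed") != std::string::npos);
}
```

A formatter that wanted a consistently encoded message could in principle start from path1() and code() rather than from the combined what() string.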
SD: I'll do some drafting work about the remarks.
JM: And there also needs to be a change in the guarantee of the
filesystem_error what. It just says that what() returns an NTBS. We can
tighten that part and add preconditions to the what() arg of the ctors.
We might be able to convince implementers to improve that, specifically
for filesystem_error.
JM: Introducing 4090. We have std::format, we have variants of
std::format that take an explicit std::locale parameter. Let's focus on
those. Then we have an L format specifier that uses the locale you
passed in. We have a statement, "For integral types, the locale-specific
form causes the context's locale to be used to insert the appropriate
digit group separator characters." There's probably something similar
for floating point. We have several options to get that promise done and
remember that locales are user-configurable and therefore the user
actually sees which virtual functions are being invoked. For iostreams
we have specific rules under which circumstances functions are being
invoked. When outputting numbers num_put is invoked. We don't say this
here, we should say something. Is it enough for users to override the
num_put facility to get different formatting, or do they also or instead
of, replace the numpunct facility? There are also _byname facets, are
those relevant? And so that's the fundamental question here. The problem
is that the num_put facility may not actually insert the appropriate
digit group separator characters even though numpunct may specify which
ones are appropriate, because the user may ignore numpunct and do
num_put the way I want. numpunct also allows only single-byte characters
as digit separators. If we have UTF-8 and some Asian locale, we could do
interesting separator characters. We can't do that with numpunct.
There's a practical benefit of requiring a call to num_put.
SD: I could make the digit separator a half-width comma.
JM: Something like that.
TH: We could find out what implementations are doing.
SD: Because these are user-creatable, they're user-perceivable.
JM: User-observable, so we need to be precise or expressly imprecise.
SD: Either tell users precisely what's going to happen or tell them to
bring it up with their implementers.
JM: Or tell them they're doing something wrong and invoking UB.
JM: Tom suggested we should have an implementation survey of existing
std::format implementations.
TH: I'm looking at the msft implementation, was hoping Mark would know
offhand.
MD: I don't know offhand, I can take a look. It's different from streams.
JM: Iostreams requires num_put. num_put just uses character ranges, not
streams, right?
TH: The msft implementation does use numpunct somewhere. In an internal
function called write_integral. And no uses of num_put.
JM: And num_put takes a reference to an ios_base as a parameter, which
is an alien concept to construct in a formatter, but not something
that'd be impossible to construct. But it does take an iter_type, which
is an output iterator, which is a template parameter, which by default
is an ostreambuf_iterator, but that's presumably configurable-- except
that then we have to tell the world what we actually use if not an
ostreambuf_iterator. So we might conclude that num_put is too tied to
iostreams to bother with in the format context.
TH: Has std::format been implemented in a shipping libstdc++?
MD: Yes in 13. Not complete, 14 has more improvements.
TH: libstdc++ seems to use numpunct as well. Lots of uses of numpunct
and no uses of num_put.
JM: I agree. Looks like numpunct wins.
SD: We should specify that it's doing numpunct. That does mean that you
only get a char type for it.
JM: Well, we already know the locale interface is broken. If we come up
with a better way, maybe we'll have a new overload of std::format.
MD: IMO, this opinion should also address floating point and boolean
values (the true name and the false name).
TH: Jens, will you offer a PR for your issue?
JM: I don't know, why? It's broken, you can keep all the pieces. I'm not
supplying glue.
TH: Well, but in terms of actually specifying that numpunct is used?
JM: Yes, so?
TH: What do you think we should do with the issue you filed?
JM: We should tell LWG that SG16 resolved that after implementation
review, numpunct is the winner and use of numpunct is explicitly
specified, as is true name, false name, and for floating point numpunct
is the only thing. We can pass that on as prose text and someone can
morph it into a PR if they want to.
Tom.
Here are the rough minutes from the 2024-06-12 SG16 meeting for
reference. Thank you again to Eddie Nolan for capturing these!
Attendance:
- (SD) Steve Downey
- (JM) Jens Maurer
- (TH) Tom Honermann
- (VZ) Victor Zverovich
- (MD) Mark de Wever
- (BG) Braden Ganetsky
- (NO) Nathan Owen
SD: Our agenda is to discuss the LWG issues. We'll be discussing 4070,
4087, and 4090.
SD: (reading the issue) If CharT is char, path::value_type is wchar_t,
and the literal encoding is UTF-8, then the escaped path is transcoded
from the native encoding for wide character strings to UTF-8 with
maximal subparts of ill-formed subsequences substituted with U+FFFD
replacement character per the Unicode Standard [...]. Otherwise,
transcoding is implementation-defined.
This seems to mean that the Unicode substitutions are only done for an
escaped path, i.e. when the ? option is used. Otherwise, the form of
transcoding is completely implementation-defined. However, this makes no
sense. An escaped string will have no ill-formed subsequences, because
they will already have been replaced.
So only unescaped strings can have ill-formed sequences by the time we
do transcoding to char, but whether or not any U+FFFD substitution
occurs is just implementation-defined.
I believe we want to specify the substitutions are done when transcoding
an unescaped path (and it doesn't matter whether we specify it for
escaped paths, because it's a no-op if escaping happens first, as is
apparently intended).
It does matter whether we escape first or perform substitutions first.
If we escape first then every code unit in an ill-formed sequence is
individually escaped as \x{hex-digit-sequence}. So an ill-formed
sequence of two wchar_t values will be escaped as two \x{...} strings,
which are then transcoded to UTF-8. If we transcode (with substitutions)
first then the entire ill-formed sequence is replaced with a single
replacement character, which will then be escaped as \x{fffd}. SG16
should be asked to confirm that escaping first is intended, so that an
escaped string shows the original invalid code units. For a non-escaped
string, we want the ill-formed sequence to be formatted as �, which the
proposed resolution tries to ensure.
VZ: As an author of the paper I'd like to confirm that it's indeed
intended to first do escaping and then do transcoding. That's why the
wording is that. I agree with Jonathan that it misses the important bit
for non-escaped paths. I think the resolution is mostly correct,
except I think Tom commented in the email that the second part of the
resolution, which is new to me, is a little bit incorrect. I think what
we want is to kind of invert the condition there, but this does something
completely different.
TH: I think what we want to say there is just "and the literal encoding
is not UTF-8". wchar_t encoding is still implementation defined so
there's still an implementation defined aspect there. I don't think we
need to add anything to the implementation-definedness.
SD: What we're saying is that if you're fully in Unicode, there's no
implementation defined behavior, we're completely mandating the behavior.
TH: Specifically, when the literal encoding is UTF-8.
SD: And this is an index entry that just links back, so it's just trying
to identify-- this is just an index entry, there isn't any larger
context. It's trying to describe it well enough so someone looking at
the implementation-defined behaviors can find it.
SD: I'll admit I haven't really thought about this a lot.
VZ: I agree with Tom, this is a mistake in the table of implementation
defined behavior. It should do what Tom says-- we should replace, "not
converting from wchar_t to UTF-8" with "when the literal encoding is not
UTF-8." And the first part, I think, is correct.
SD: Okay. So in the text of the standard itself we want to basically
strike "escaped path" and replace it with "(possibly escaped) string".
VZ: That part is fine.
SD: But defining the implementation-defined behavior is not correct.
VZ: They should just take the wording as it is and put it in the index.
SD: All the wordings in that index are very short summaries of what the
implementation-defined category is.
JM: It just tries to give a headline. It shouldn't be wrong but it's not
necessarily complete. As long as we satisfy that, it's good enough.
TH: Victor, it sounds like you have a good handle on it. Want to paste
the recommendation in chat?
VZ: That's what I'm typing.
VZ (in chat): "the literal encoding is not UTF-8" instead of "not
converting from wchar_t to UTF-8"
SD: That probably covers any interesting case.
JM: It's not fully right because it's not implementation-defined only if
CharT is char and path::value_type is wchar_t.
Presumably if CharT is char16_t, everything's also implementation
defined, but I don't know whether that's a possibility.
JM: And it talks about literal encoding when it might want to talk about
ordinary literal encodings. Is it talking about the literal encoding for
wide strings or for char? But that question's not on our plate.
JM: Because the wide literal encoding could be UTF-16 or something, or
UCS-2 or whatever.
JM: So I like Victor's suggestion for the implementation defined
behavior index.
JM: We already have "literal encoding" in the normative text, so if it's
ambiguous there it should be the same ambiguity in the
implementation-defined index.
TH: Should it say "ordinary literal encoding?"
JM: Maybe but that's not the question of this issue.
SD: There are many places we've already made this mistake. Cleaning it
up should be a one-time thing where we go through and clarify whether we
actually mean ordinary or literal encoding. The sense I'm getting is
that we want to change
JM: All we're doing here is correctly quoting the normative text.
SD: So the resolution is we accept the resolution for clause 1, and for
the second part accept Victor's recommendation.
JM: And what was the concern why this doesn't work? Because what we have
here is the text about the literal encoding thing, right? Let me see
Tom's email.
SD: This doesn't constrain what an implementation can define it to be--
they could perfectly well convert to UTF-8 when the ordinary literal
encoding's not UTF-8 but it's up to implementations to serve their users.
JM: Yes, okay, great. So, Tom, are you happy with not introducing
"ordinary" for the sake of quoting the text, or should we make this a
bigger issue?
TH: No, I'm fine with that, like Steve said, we can do a separate
cleanup issue or file an LWG issue.
JM: So the green text should be "and the literal encoding is not UTF-8."
TH: Yes, that sounds good.
SD: Moving on to 4087.
VZ: std::exception is one of the few remaining standard types that isn't
formattable. I looked into it and found the problem that we don't
actually specify what encoding the string returned by what() is in. We
just say that it's something that can be converted to wstring somehow.
Which is very vague. So it's impossible to implement a formatter
properly because you don't know the encoding to convert from or whether
conversion is needed. I gave an example with a path, but it's a more
general problem-- path is one of the most obvious and outrageous cases
because, as part of the path, you can get the filename. So the exception
message has one encoding and you get a filename in a possibly different
encoding and try to format it with the literal encoding and you get
three encodings in one message-- simply a mess. My proposed resolution
is incomplete-- it's just a first attempt to propose something to start
the discussion. I'm saying it should probably be compatible with the
ordinary literal encoding. That's what people normally do, combine it
with literal strings and output. I had an email forwarded to SG16 which
had 4 options which nicely summarize what we can choose. I think Tom
separately suggested using the locale encoding.
SD: I would expect this, barring any external constraints, to be in the
current execution encoding. Which isn't necessarily the literal
encoding. That is a common source of broken text, but that is the state
of the world. If I'm handed a char* and no other information, it's the
execution encoding.
VZ: At the very least we should specify the encoding. Now it doesn't say
anything.
JM: Fully agreed.
SD: Especially because this is instructing end users what they should be
stuffing in these things.
TH: Does multibyte not imply the locale encoding?
JM: It doesn't.
TH: Because we have the association with mbstowcs.
TH: This has always been very vaguely specified.
SD: Does NTBS include multibyte?
JM: Yes. Well, no, wait. The other way around, I thought. Wait wait
wait. A null-terminated byte string, NTBS, is a char sequence whose
highest addressed element with defined content has value 0. No other
element has value 0. An NTMBS is an NTBS that has a sequence of valid
multibyte characters. So an NTBS is everything, an NTMBS is one that has
valid multibyte characters.
TH: Whatever those are.
JM: Now the question is what this NTBS-- it's in the C standard.
TH: mbstowcs.
JM: The mbstowcs function converts a sequence of multibyte characters
that begins in the initial shift state-- it just returns a
null-terminated byte sequence-- the conversion function into a sequence
of corresponding wide characters. Each is converted as if by a call to
mbtowc function. Except that the conversion state of the mbtowc function
is not affected. So for the specific conversion it defers to the other
function.
SD: This does seem overall to be just a whole class of interesting ways
of producing broken text. The whole what() facility, assembling
user-specified data with string literals and doing something to them in
the hopes that someone can reconstruct something intelligible.
JM: The heading for this mbtowc function says, the behavior of the
multibyte character functions is affected by the LC_CTYPE category of
the current locale. Apparently LC_CTYPE can change what a multibyte
character sequence is. That means, essentially, the definition of what a
multibyte character string is is dependent on the LC_CTYPE locale
category because the definition of a multibyte character sequence says
it must be a valid sequence, and the locale tells me what's valid and
what's not. Presumably that means it's actually the global locale or
thread-related locale.
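For illustration only (not part of the discussion): a minimal sketch of the dependence described above, where the same bytes are or are not a valid multibyte string depending on the C library's current LC_CTYPE. The helper names decode_ntmbs and decode_ntmbs_in are hypothetical; only std::mbstowcs and std::setlocale are standard. Which named locales exist is platform-specific, so only the "C" locale is assumed to be available.

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: decode an NTMBS under the C library's current LC_CTYPE.
// Returns false if the bytes are not a valid multibyte sequence in that locale.
bool decode_ntmbs(const char* s, std::wstring& out) {
    // One wide character per input byte (plus terminator) is always enough room.
    std::wstring buf(std::strlen(s) + 1, L'\0');
    const std::size_t n = std::mbstowcs(&buf[0], s, buf.size());
    if (n == static_cast<std::size_t>(-1)) return false;  // invalid sequence
    buf.resize(n);
    out = buf;
    return true;
}

// Hypothetical helper: decode the same bytes under a named LC_CTYPE, then
// restore the previous setting. Returns false if the locale is not installed
// or the bytes are invalid in it.
bool decode_ntmbs_in(const char* locale_name, const char* s, std::wstring& out) {
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) return false;
    const bool ok = decode_ntmbs(s, out);
    std::setlocale(LC_CTYPE, saved.c_str());
    return ok;
}
```

Calling decode_ntmbs_in with two different locale names on the same non-ASCII bytes may succeed in one and fail in the other, which is the sense in which the locale "tells me what's valid and what's not".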
SD: Or in our current terminology, the execution encoding.
JM: Which is unfortunate, because usually C++ tries to make the locale
explicit in the interface. In iostreams you can imbue the locale of your
choice, you don't need global state which is broken by design.
SD: Plus the built in race condition during the exception.
TH: Passing in a locale wouldn't work because the message is constructed
much earlier.
JM: You want to pass it in at the place the exception is generated, not
when the what function is called.
TH: But the locale could have changed when you invoke what().
JM: At least it's well-defined. If you call what() and can't make sense
out of it, then it's your fault. But it's hypothetical, because there's
no way to pass a locale at the point where the filesystem error is generated.
Is there locale stuff on filesystem?
TH: There is.
VZ: No, it just says "a system-specific encoding."
TH: I think some of the functions do actually take a locale, it's used
to do a conversion to the encoding you're talking about.
JM: The example in the issue where file size is being queried doesn't
seem like somewhere a locale fits in.
SD: This is an example of the general problem.
JM: Looking at the example, there's nowhere to pass in a locale. No one
expects to pass a locale to a file size query function.
TH: The only way you get non-mojibake out is if the global locale was
consistent from the time the message was created to when it was received.
VZ: Did we figure out that NTBS is always in the global locale?
JM: NTMBS is in the global locale.
SD: Unless it's specifically a string literal.
VZ: But what we have is NTBS.
TH: The remarks say NTMBS but the text says NTBS. It's not consistent.
JM: The returns says NTBS, which is any kind of null-terminated byte
sequence. The remarks say, we already told you earlier that NTMBS is a
valid NTBS. That just gives permission to the implementation to give you
an NTMBS as opposed to just an NTBS. What we can do is, for the case of
an NTMBS, where we already say it's suitable for conversion and display
as a wstring, we might want to clarify that it was suitable at the time
of construction of the exception for wstring, because it needs to
evaluate the LC_CTYPE at construction time, not when you call what().
That's what's missing from the remarks, otherwise we already have the
cross-reference to codecvt so we already know what's happening. For the
returns part, which are the minimum requirements, we haven't solved
anything. So far the standard has even refrained from telling you it
must be a multibyte string if multibyte strings are on your platform. An
implementation can return an ASCII only string even if it could return a
multibyte string. We have two things we should do. One is to clarify the
remarks with respect to when the suitable for conversion and display
holds. That holds only immediately after construction and not later (or
we restrict changing the LC_CTYPE or whatever). What we do for plain
NTBS's-- I don't know. Maybe the best thing is not to talk about it.
TH: For solving Victor's problem, we have two concerns. The file path,
incorporating it into a message. That's one issue. As for taking this
NTBS that comes out and getting it formatted, we can specify "as if
using the C function." We can just specify to use the global locale and
you get what you get. Sometimes there might be some weird translations.
VZ: To clarify, by global locale we mean global C locale? There's
multiple sets of locales. LCC locale and C locale. You can separately
set both of them, they're unrelated.
SD: You can change various parts of the locale bits independently.
JM: No, we're talking about the function call setlocale. There's a C
variant of the global setlocale, that takes a C variant, and there's an
equivalent for the C++ locale structures, which sets an independent
locale state.
TH: It may also set the C locale.
JM: But it's not required to. At the start of the program you can expect
that they're the same. The question is, which one do we take. Presumably
the C++ locale.
TH: Except that we have the reference to the conversion functions which
are C-based and use the C locale.
JM: Where?
TH: Maybe I misunderstood before. mbstowcs?
JM: That's not what we do here. The actual cross reference is to the
codecvt facet. The class codecvt is for use when converting from one
character encoding to another... . We have char, wchar_t, and mbstate_t.
Presumably wchar_t is the internal encoding, the external encoding is
char, and the state_t is an mbstate which is a transformation. codecvt
converts between the native character set for ordinary and wide characters.
TH: This might be a case where it'd be good to try to ... some
implementations and set the C and C++ locales differently and see which
one you get.
JM: Where do you want to get what?
TH: Produce an exception object but have locale set.. but we don't specify
JM: We want to specify transcoding..
TH: There is transcoding of file paths on the Windows side.
JM: This text talks about the OS-dependent current encoding for path
names which in this case is CP1251.
VZ: I think path is a red herring because it has its own unrelated
transcoding. What we need to specify is, what's the target encoding for
exception? And specify what the output of the path method should be
converted into. What Tom is suggesting is to look at what path does;
that's not correct.
SD: For anyone producing this string, what should they be trying to do?
I think they should be targeting the current execution encoding as
defined by locale.
EN: Which one?
JM: The global C++ locale. No reference to the C locale in the cross
reference. It says locale codecvt, which you get from the global C++ locale.
VZ: I have a question to Jens. Clarify: the what() makes sense after you
construct, because locale can change
SD: At the point of construction.
VZ: Locale can be changed asynchronously-- what do you mean by that?
SD: That you have a problem if someone does that.
JM: Well, no. Where's the global C++ locale query function? Is that the
default ctor of the locale class?
TH: Maybe? There might also be a global static factory function.
JM: Yes, there's a classic thing (useless) and a global locale function...
JM: If we have a named locale, you get the C locale set to the same
thing, otherwise all bets are off. locale() is the constructor of the
locale class which gets you a copy of the global c++ locale. Race
conditions aren't relevant here-- it's the ctor of a class, no special
rules on race conditions, you can call the locale global setting
function unsynchronized and the stdlib has to deal with it.
SD: You have a logical race condition between starting this process and
who interprets it, but that's baked in.
VZ: But the ctor might have multiple arguments, what if the locale
changes in between?
JM: Your program is broken.
VZ: Why?
JM: Because we don't prevent anyone from changing the locale midway. The
best atomicity guarantee is the default ctor of locale. If you call it
multiple times in close proximity and get different results, tough luck.
TH: So you're supposed to acquire your own copy and reuse it.
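For illustration only (not part of the discussion): a minimal sketch of "acquire your own copy and reuse it". The default constructor of std::locale yields a copy of the global C++ locale at that moment, so an exception type could record it at construction. The names snapshot_punct, located_error, thousands_sep_of and snapshot_survives_global_change are hypothetical; the custom facet is used only so the example does not depend on any installed named locale.

```cpp
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical facet that makes the custom locale distinguishable from "C".
struct snapshot_punct : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return '_'; }
};

// Hypothetical exception type that records the global C++ locale in effect
// when the exception object is constructed.
class located_error : public std::runtime_error {
public:
    explicit located_error(const std::string& what_arg)
        : std::runtime_error(what_arg), loc_() {}  // std::locale() copies the global locale
    const std::locale& construction_locale() const noexcept { return loc_; }
private:
    std::locale loc_;
};

char thousands_sep_of(const std::locale& loc) {
    return std::use_facet<std::numpunct<char>>(loc).thousands_sep();
}

// Returns true if the recorded locale is unaffected by a later global change.
bool snapshot_survives_global_change() {
    const std::locale custom(std::locale::classic(), new snapshot_punct);
    const std::locale previous = std::locale::global(custom);
    const located_error err("disk full");
    std::locale::global(std::locale::classic());  // global changes after construction
    const bool ok = thousands_sep_of(err.construction_locale()) == '_'
                 && thousands_sep_of(std::locale()) == ',';
    std::locale::global(previous);  // restore the original global locale
    return ok;
}
```

This does not help with the standard exception types, which offer no place to store or pass a locale; it only shows what a snapshot at construction time would look like.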
VZ: We should specify that somehow, that it's in one locale and not
multiple locales.
SD: That's instructing programmers not to do broken things.
VZ: One exception object with potentially multiple things it needs
locale for.
JM: What multiple things? It will use the locale in effect when
initiating the ctor call of the exceptions. For user exceptions that,
e.g., combine system_errors into one exception, all we can do is throw
up our hands. We can't even query which encodings were used. Changing
the global locale is a bad idea and as much as we should be able to
convey that, we should do that.
SD: The best we can do is, for exception definitions, what does what()
return? An NTBS to be interpreted in the locale that was in effect when
the exception was constructed.
JM: I don't know about this NTBS part. I'm pretty sure, if we have an
implementation that is an NTMBS, then that should be in the locale at
the time of construction. That's the easy part. If we just have an NTBS,
that is not a multibyte string, which isn't Victor's example, by the
way, because it combines UTF-8 with an odd encoding in the exception
string, but for the NTBS case where the implementation chooses not to
provide an NTMBS, just an NTBS, which doesn't have enough capabilities
to represent the union of the characters in the explanatory string and
the path, I don't know what to do.
JM: Maybe all we can do is say, an implementation defined NTBS, and stop
there, and you get what you get, but you can give a remarks
recommendation for what happens for the multibyte case. If the
implementation tries to be helpful to you, you should get something we
know how to interpret, but it's in principle QOI. Maybe you're on a
small system where there's no practical choice of encoding, so the NTBS
of your system is all that counts.
SD: It's possible that an NTMBS is still a single byte encoding. It's
about who's promising what and when. But I agree. In the remarks when we
clarify this, we can say if someone hasn't handed you a string in the
locale when the string was constructed, they're breaking your contract.
JM: So do we have agreement that we fine tune the NTMBS wording because
we have machinery how to interpret NTMBS strings, and we leave the
guarantee alone because there isn't any guarantee?
SD: The guarantee is that it's null terminated.
JM: Which is not very useful. But again, the remarks say, not as clearly
as they could, they say this is how you can be helpful to your users.
And it's implementation defined so presumably you can ask your implementer.
SD: I propose that I'll take on drafting something after this meeting,
that we can propose as the resolution.
VZ: It's not sufficient. The core of the issue is you can't say anything
about what() and we're not fixing that.
SD: That's the state of the world. The remarks are guiding QOI.
VZ: It's broken and we're keeping it broken. We're not doing anything.
SD: I don't see a way of telling everyone generally, because there's
user data that can show up in what(). The file name can be misencoded.
So there's no way to guarantee that this can be put into a properly
encoded string of anything.
TH: But what we can say is that for the purposes of std::format, if you
call what() on an exception object, interpret it as an NTMBS, and for
anything that doesn't convert you do escaping.
SD: Yes. When trying to format one of these strings, you're going to
have to be suspicious because it's foreign data. It's an NTMBS in the
execution encoding, do your best to produce output in the requested
format. But what() should be in the execution encoding.
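For illustration only (not part of the discussion): a deliberately simplified sketch of treating what() as foreign data and escaping whatever cannot be trusted. It assumes that only printable ASCII is safe to pass through and escapes every other byte as \x{hh}; a real formatter would first attempt conversion from the NTMBS encoding and escape only what fails to convert. The function name escape_untrusted is hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: copy printable ASCII through unchanged and escape every
// other byte as \x{hh}. Assumption: non-ASCII bytes are treated as unconvertible.
std::string escape_untrusted(const char* s) {
    std::string out;
    for (; *s != '\0'; ++s) {
        const unsigned char byte = static_cast<unsigned char>(*s);
        if (byte >= 0x20 && byte < 0x7f) {
            out += static_cast<char>(byte);
        } else {
            char buf[8];  // "\x{hh}" is six characters plus the terminator
            std::snprintf(buf, sizeof buf, "\\x{%02x}", static_cast<unsigned>(byte));
            out += buf;
        }
    }
    return out;
}
```

The sketch does not escape the backslash itself, so its output is for display and is not reversible.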
EN: Couldn't we require NTBS that's not NTMBS to be forbidden?
JM: Do we have an overview of standard library implementations? If we
tighten the rules on what what() can return, we tighten the rules on
what can be passed as ctor args to e.g. logic_error. Because what()
returns byte-wise what was passed in. So no transcoding can happen in
the ctor. But now that we require a valid MBS in what() we require a
valid MBS as the ctor to the exceptions. Those people don't care about
encoding-- they just want to say what they put in they get out. Do we
want to invalidate them?
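For illustration only (not part of the discussion): a minimal sketch of the byte-wise guarantee mentioned above. The constructors of the <stdexcept> classes have the postcondition strcmp(what(), what_arg) == 0, so arbitrary non-zero bytes round-trip unchanged, including bytes that are not valid text in any particular encoding. The function name what_round_trips is hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>

// Hypothetical helper: checks that what() returns exactly the bytes passed in.
bool what_round_trips(const char* what_arg) {
    const std::logic_error err(what_arg);
    return std::strcmp(err.what(), what_arg) == 0;
}
```

Requiring a valid NTMBS from what() would therefore turn a call such as what_round_trips("bad byte: \xff") into a precondition violation in the constructor.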
EN: Seems convincing that we shouldn't.
SD: I think that, first off, if the execution encoding and ordinary
literal encoding aren't compatible, you have deep problems producing any
output whatsoever. Hello world starts to fail.
JM: That's all fine but the point is my program has library UB if I
violate the preconditions of a library function. That's not a good place
to be in.
TH: I strongly agree. We don't want to invalidate any user code.
SD: Not in the exception or what parts. In producing a formatter, things
are in play.
JM: The formatter has all rights to say, I expect the what string to be
an NTMBS. That is totally fine and good. Then you convert from that
NTMBS to whatever you want and go from there. That seems plausible. But
that's the formatter at the point it wants to output that stuff. It
needs to understand the details. We can strengthen the words to say
something like, we recommend that implementations when constructing the
what string on your own, as opposed to a user one, should create a valid
NTMBS. I don't think we can do more than that. Life would be so much
better if we just said UTF-8 everywhere, but that's not our life.
VZ: I think there was some mischaracterization of the example. Something
like, because it's path we can't do anything. In fact we can do a lot to
improve the situation. We can have all the info, even though we don't
now. If we know the encoding of the path, we can get a perfect output
even in the constraints of the current system. You can display arbitrary
binary data through escaping.
JM: Do you want to expose the filesystem path itself in the exception
object so the formatter can use it?
VZ: It's already exposed as part of the message and should be aligned
with the rest of the text, not in a collection of the text.
JM: We agree that what() should not be in multiple encodings. It *should*
as in implementation recommended practice, definitely. That's what we're
trying to formulate.
SD: I think the phrase here, in the native format, is woefully vague,
and a source of confusion as part of this. As Jens already pointed out,
there are other exceptions which just take a string and copy it, which
have all the same potential issues as Victor identified. This is more
remediable by an implementer.
JM: We can address the filesystem issue, I think, at least we can push
implementations in the right direction. I'm not sure we can do anything
reasonable for exceptions as a whole.
TH: I agree. Victor, I think if a path is being included in the message,
we'd want to reinterpret it and escape it using the mechanism
std::format uses. Can we get away with convincing implementers to change
existing code?
JM: Certainly, it's an untenable situation that we have one NTMBS that
uses two encodings inside. That will never ever work.
VZ: We have an implementer here. Mark, would you be willing to change
exception messages?
MD: I'm not sure we want to. This might have implications on users.
Would be good to investigate further. With libc++ we are typically UTF-8
only which makes our life easier. On Windows people have multiple encodings.
JM: The filesystem_error members not only take two components but
actually three components. A what arg (a literal string), a path, and
an error code. We have three components: user-defined (unknown
encoding), path (known), error_code converted to system_error string
(known to implementation). The problem is slightly larger than we
thought. We also need to require from users that their what() arg is of
the right kind, which is probably okay for filesystem_error because we
have good reasons to require that at that point.
SD: I'll do some drafting work about the remarks.
JM: And there also needs to be a change in the guarantee of the
filesystem_error what. It just says that what() returns an NTBS. We can
tighten that part and add preconditions to the what() arg of the ctors.
We might be able to convince implementers to improve that, specifically
for filesystem_error.
JM: Introducing 4090. We have std::format, we have variants of
std::format that take an explicit std::locale parameter. Let's focus on
those. Then we have an L format specifier that uses the locale you
passed in. We have a statement, "For integral types, the locale-specific
form causes the context's locale to be used to insert the appropriate
digit group separator characters." There's probably something similar
for floating point. We have several options to get that promise done and
remember that locales are user-configurable and therefore users
actually see which virtual functions are being invoked. For iostreams
we have specific rules under which circumstances functions are being
invoked. When outputting numbers num_put is invoked. We don't say this
here, we should say something. Is it enough for users to override the
num_put facility to get different formatting, or do they also, or
instead, replace the numpunct facility? There are also _byname facets, are
those relevant? And so that's the fundamental question here. The problem
is that the num_put facility may not actually insert the appropriate
digit group separator characters even though numpunct may specify which
ones are appropriate, because the user may ignore numpunct and do
num_put the way I want. numpunct also allows only single-byte characters
as digit separators. If we have UTF-8 and some Asian locale, we could do
interesting separator characters. We can't do that with numpunct.
There's a practical benefit of requiring a call to num_put.
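For illustration only (not part of the discussion): a minimal sketch of why facet calls are user-observable. For iostreams, inserting a number calls num_put, so a user-supplied num_put facet changes the output. The names bracket_num_put and stream_format are hypothetical. Whether std::format with the L specifier would call this facet is exactly the open question of the issue.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: wraps every formatted long in angle brackets.
struct bracket_num_put : std::num_put<char> {
protected:
    using std::num_put<char>::do_put;  // keep the other overloads visible
    iter_type do_put(iter_type out, std::ios_base& io, char_type fill,
                     long v) const override {
        *out++ = '<';
        out = std::num_put<char>::do_put(out, io, fill, v);  // default digits
        *out++ = '>';
        return out;
    }
};

// Formats a value through an ostream whose locale carries the custom facet.
std::string stream_format(long value) {
    std::ostringstream os;
    os.imbue(std::locale(std::locale::classic(), new bracket_num_put));
    os << value;  // operator<< dispatches to num_put<char>::put
    return os.str();
}
```

With this facet installed, stream_format(42) yields "<42>", so a user can tell whether a given formatting path consults num_put.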
SD: I could make the digit separator a half-width comma.
JM: Something like that.
TH: We could find out what implementations are doing.
SD: Because these are user-creatable, they're user-perceivable.
JM: User-observable, so we need to be precise or expressly imprecise.
SD: Either tell users precisely what's going to happen or tell them to
bring it up with their implementers.
JM: Or tell them they're doing something wrong and invoking UB.
JM: Tom suggested we should have an implementation survey of existing
std::format implementations.
TH: I'm looking at the msft implementation, was hoping Mark would know
offhand.
MD: I don't know offhand, I can take a look. It's different from streams.
JM: Iostreams requires num_put. num_put just uses character ranges, not
streams, right?
TH: The msft implementation does use numpunct somewhere. In an internal
function called write_integral. And no uses of num_put.
JM: And num_put takes a reference to an ios_base as a parameter, which
is an alien concept to construct in a formatter, but not something
that'd be impossible to construct. But it does take an iter_type, which
is an output iterator, which is a template parameter, which by default
is an ostreambuf_iterator, but that's presumably configurable-- except
that then we have to tell the world what we actually use if not an
ostreambuf_iterator. So we might conclude that num_put is too tied to
iostreams to bother with in the format context.
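The shape of that interface can be sketched as follows: num_put instantiated with a non-stream output iterator, plus a stand-alone ios_base object that exists only to carry the flags and the locale. The names out_iter, string_num_put, apostrophe_grouping and put_grouped are hypothetical, and this is only an illustration of the plumbing a formatter would need, not a description of what any std::format implementation does.

```cpp
#include <ios>
#include <iterator>
#include <locale>
#include <string>

// Output iterator that is not an ostreambuf_iterator.
using out_iter = std::back_insert_iterator<std::string>;

// Facet destructors are protected; deriving allows a local instance.
struct string_num_put : std::num_put<char, out_iter> {};

// Hypothetical facet: groups digits in threes with an apostrophe.
struct apostrophe_grouping : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return '\''; }
    std::string do_grouping() const override { return "\3"; }
};

std::string put_grouped(long value) {
    std::locale loc(std::locale::classic(), new apostrophe_grouping);

    // num_put::put requires an ios_base for flags, width, and the locale.
    // A basic_ios with no stream buffer is enough to carry that state.
    std::ios state(nullptr);
    state.imbue(loc);

    std::string out;
    string_num_put facet;
    facet.put(std::back_inserter(out), state, ' ', value);
    return out;
}
```

Here put_grouped(1234567) produces "1'234'567": num_put itself looks up numpunct through the ios_base's locale. The extra ios_base object and the need to name the iterator type are the costs being weighed in the discussion above.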
TH: Has std::format been implemented in a shipping libstdc++?
MD: Yes in 13. Not complete, 14 has more improvements.
TH: libstdc++ seems to use numpunct as well. Lots of uses of numpunct
and no uses of num_put.
JM: I agree. Looks like numpunct wins.
SD: We should specify that it's doing numpunct. That does mean that you
only get a char type for it.
JM: Well, we already know the locale interface is broken. If we come up
with a better way, maybe we'll have a new overload of std::format.
MD: IMO, this opinion should also address floating point and boolean
values (the true name and the false name).
TH: Jens, will you offer a PR for your issue?
JM: I don't know, why? It's broken, you can keep all the pieces. I'm not
supplying glue.
TH: Well, but in terms of actually specifying that numpunct is used?
JM: Yes, so?
TH: What do you think we should do with the issue you filed?
JM: We should tell LWG that SG16 resolved that after implementation
review, numpunct is the winner and use of numpunct is explicitly
specified, as is true name, false name, and for floating point numpunct
is the only thing. We can pass that on as prose text and someone can
morph it into a PR if they want to.
Tom.
Received on 2024-07-30 22:42:17