On 7/30/24 6:38 PM, Tom Honermann via SG16 wrote:

SG16 will hold a meeting tomorrow on Wednesday, July 31st, at 19:30 UTC (timezone conversion).

The agenda follows.

LEWG has requested that we review P3068R2 with respect to std::exception and related types and encoding concerns for the message provided by the what() member function. The concerns are effectively the same as those reported in LWG 4087, but in the special case of constant evaluation.

We discussed LWG 4087 during the 2024-06-12 SG16 meeting. Unfortunately, I still haven't published the meeting summary for that meeting (work, life, burnout), so that link isn't helpful right now. I'll respond to this email with a copy of the (excellent) minutes that Eddie Nolan took for that meeting. We spent much of that meeting discovering what the status quo is with regard to the standard wording. We didn't poll any direction. The status quo appears to be:

The status quo suggests that, for the purposes of std::format(), the string returned by what() should be treated as containing (possibly ill-formed) text in the NTMBS encoding of the current C++ locale (or perhaps an explicitly provided std::locale argument).

With respect to P3068R2, there is currently no notion of a locale dependent NTMBS encoding during constant evaluation. We'll need to discuss the ramifications of this, presumably identify an encoding to use instead (presumably the ordinary literal encoding), and determine how to adjust wording accordingly.

Here are the rough minutes from the 2024-06-12 SG16 meeting for reference. Thank you again to Eddie Nolan for capturing these!

Attendance:
- (SD) Steve Downey
- (JM) Jens Maurer
- (TH) Tom Honermann
- (VZ) Victor Zverovich
- (MD) Mark de Wever
- (BG) Braden Ganetsky
- (NO) Nathan Owen

SD: Our agenda is to discuss the LWG issues. We'll be discussing 4070, 4087, and 4090.

SD: (reading the issue) If CharT is char, path::value_type is wchar_t, and the literal encoding is UTF-8, then the escaped path is transcoded from the native encoding for wide character strings to UTF-8 with maximal subparts of ill-formed subsequences substituted with u+fffd replacement character per the Unicode Standard [...]. Otherwise, transcoding is implementation-defined.

This seems to mean that the Unicode substitutions are only done for an escaped path, i.e. when the ? option is used. Otherwise, the form of transcoding is completely implementation-defined. However, this makes no sense. An escaped string will have no ill-formed subsequences, because they will already have been replaced

So only unescaped strings can have ill-formed sequences by the time we do transcoding to char, but whether or not any u+fffd substitution occurs is just implementation-defined.

I believe we want to specify the substitutions are done when transcoding an unescaped path (and it doesn't matter whether we specify it for escaped paths, because it's a no-op if escaping happens first, as is apparently intended).

It does matter whether we escape first or perform substitutions first. If we escape first then every code unit in an ill-formed sequence is individually escaped as \x{hex-digit-sequence}. So an ill-formed sequence of two wchar_t values will be escaped as two \x{...} strings, which are then transcoded to UTF-8. If we transcode (with substitutions first) then the entire ill-formed sequence is replaced with a single replacement character, which will then be escaped as \x{fffd}. SG16 should be asked to confirm that escaping first is intended, so that an escaped string shows the original invalid code units. For a non-escaped string, we want the ill-formed sequence to be formatted as �, which the proposed resolution tries to ensure.
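The difference between the two treatments of an ill-formed code unit can be sketched in a few lines. This is only an illustration, not the wording or any implementation of `std::format`: the helper `transcode_utf16_to_utf8` and the `IllFormedPolicy` enum are hypothetical names, the input is assumed to be UTF-16, and the later step of escaping an already substituted string is not modelled.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Append the UTF-8 encoding of one Unicode scalar value.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical policy: what to do with a lone surrogate.
enum class IllFormedPolicy { escape, substitute };

// Hypothetical helper: transcode UTF-16 to UTF-8, handling lone surrogates
// either by escaping the original code unit or by substituting U+FFFD.
std::string transcode_utf16_to_utf8(const std::u16string& in, IllFormedPolicy policy) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t u = in[i];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        const bool low = u >= 0xDC00 && u <= 0xDFFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: combine into one scalar value.
            const char32_t cp = 0x10000 + ((static_cast<char32_t>(u) - 0xD800) << 10)
                                + (static_cast<char32_t>(in[i + 1]) - 0xDC00);
            append_utf8(out, cp);
            ++i;
        } else if (high || low) {
            if (policy == IllFormedPolicy::escape) {
                // Escape first: the original invalid code unit stays visible.
                char buf[16];
                std::snprintf(buf, sizeof buf, "\\x{%x}", static_cast<unsigned>(u));
                out += buf;
            } else {
                // Substitute first: the original value is lost.
                append_utf8(out, 0xFFFD);
            }
        } else {
            append_utf8(out, u);
        }
    }
    return out;
}
```

For the input `a`, lone `0xD800`, `z`, the escape policy keeps the offending value readable in the output, while the substitute policy produces the UTF-8 bytes of U+FFFD in its place.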

VZ: As an author of the paper I'd like to confirm that it's indeed intended to first do escaping and then do transcoding. That's why the wording is the way it is. I agree with Jonathan that it misses the important bit for non-escaped paths. I think the resolution is mostly correct, except I think Tom commented in the email that the second part of the resolution, which is new to me, is a little bit incorrect. I think what we want is to kind of invert the condition there, but this does something completely different.

TH: I think what we want to say there is just "and the literal encoding is not UTF-8". wchar_t encoding is still implementation defined so there's still an implementation defined aspect there. I don't think we need to add anything to the implementation-definedness.

SD: What we're saying is that if you're fully in Unicode, there's no implementation defined behavior, we're completely mandating the behavior.

TH: Specifically, when the literal encoding is UTF-8.

SD: And this is an index entry that just links back, so it's just trying to identify-- this is just an index entry, there isn't any larger context. It's trying to describe it well enough so someone looking at the implementation-defined behaviors can find it.

SD: I'll admit I haven't really thought about this a lot.

VZ: I agree with Tom, this is a mistake in the table of implementation defined behavior. It should do what Tom says-- we should replace, "not converting from wchar_t to UTF-8" with "when the literal encoding is not UTF-8." And the first part, I think, is correct.

SD: Okay. So in the text of the standard itself we want to basically strike "escaped path" and replace it with that "(possibly escaped) string"

VZ: That part is fine.

SD: But defining the implementation-defined behavior is not correct.

VZ: They should just take the wording as it is and put it in the index.

SD: All the wordings in that index are very short summaries of what the implementation-defined category is.

JM: It just tries to give a headline. It shouldn't be wrong but it's not necessarily complete. As long as we satisfy that, it's good enough.

TH: Victor, it sounds like you have a good handle on it. Want to paste the recommendation in chat?

VZ: That's what I'm typing.

VZ (in chat): "the literal encoding is not UTF-8" instead of "not converting from wchar_t to UTF-8"

SD: That probably covers any interesting case.

JM: It's not fully right because it's not implementation-defined only if CharT is char and path::value_type is wchar_t.

Presumably if CharT is char16_t, everything's also implementation-defined, but I don't know whether that's a possibility.

JM: And it talks about literal encoding when it might want to talk about ordinary literal encodings. Is it talking about the literal encoding for wide strings or for char? But that question's not on our plate.

JM: Because the wide literal encoding could be UTF-16 or something, or UCS-2 or whatever.

JM: So I like Victor's suggestion for the implementation-defined behavior index.

JM: We already have "literal encoding" in the normative text, so if it's ambiguous there it should be the same ambiguity in the implementation-defined index.

TH: Should it say "ordinary literal encoding?"

JM: Maybe but that's not the question of this issue.

SD: There are many places we've already made this mistake. Cleaning it up should be a one-time thing where we go through and clarify whether we actually mean ordinary or literal encoding. The sense I'm getting is that we want to change

JM: All we're doing here is correctly quoting the normative text.

SD: So the resolution is we accept the resolution for clause 1, and for the second part accept Victor's recommendation.

JM: And what was the concern why this doesn't work? Because what we have here is the text about the literal encoding thing, right? Let me see Tom's email.

SD: This doesn't constrain what an implementation can define it to be-- they could perfectly well convert to UTF-8 when the ordinary literal encoding's not UTF-8 but it's up to implementations to serve their users.

JM: Yes, okay, great. So, Tom, are you happy with not introducing "ordinary" for the sake of quoting the text, or should we make this a bigger issue?

TH: No, I'm fine with that, like Steve said, we can do a separate cleanup issue or file an LWG issue.

JM: So the green text should be "and the literal encoding is not UTF-8."

TH: Yes, that sounds good.

SD: Moving on to 4087.

VZ: std::exception is one of the few remaining standard types that isn't formattable. I looked into it and found the problem that we don't actually specify what encoding the string returned by what() is in. We just say that it's something that can be converted to wstring somehow. Which is very vague. So it's impossible to implement a formatter properly because you don't know the encoding to convert from or whether conversion is needed. I gave an example with a path, but it's a more general problem-- path is one of the most obvious and outrageous cases because, as part of the path, you can get the filename. So the exception message has one encoding and you get a filename in a possibly different encoding and try to format it with the literal encoding and you get three encodings in one message-- simply a mess. My proposed resolution is incomplete-- it's just a first attempt to propose something to start the discussion. I'm saying it should probably be compatible with the ordinary literal encoding. That's what people normally do, combine it with literal strings and output. I had an email forwarded to SG16 which had 4 options which nicely summarize what we can choose. I think Tom separately suggested using the locale encoding.

SD: I would expect this, barring any external constraints, to be in the current execution encoding. Which isn't necessarily the literal encoding. That is a common source of broken text, but that is the state of the world. If I'm handed a char* and no other information, it's the execution encoding.

VZ: At the very least we should specify the encoding. Now it doesn't say anything.

JM: Fully agreed.

SD: Especially because this is instructing end users what they should be stuffing in these things.

TH: Does multibyte not imply the locale encoding?

JM: It doesn't.

TH: Because we have the association with mbstowcs.

TH: This has always been very vaguely specified.

SD: Does NTBS include multibyte?

JM: Yes. Well, no, wait. The other way around, I thought. Wait wait wait. A null-terminated byte string, NTBS, is a char sequence whose highest addressed element with defined content has value 0. No other element has value 0. An NTMBS is an NTBS that has a sequence of valid multibyte characters. So an NTBS is everything, an NTMBS is one that has valid multibyte characters.
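The distinction can be probed with the C conversion functions. The following is a minimal sketch, assuming that "valid multibyte characters" is judged by the LC_CTYPE category of the current C locale; `is_valid_ntmbs` is a hypothetical helper name, and the check only approximates the definition (it does not verify that the string ends in the initial shift state).

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical helper: true if `s` converts cleanly as a multibyte string
// under the LC_CTYPE category of the current C locale.
bool is_valid_ntmbs(const char* s) {
    std::mbstate_t state{};  // start in the initial shift state
    const char* src = s;
    // With a null destination, mbsrtowcs only validates and counts;
    // it returns (size_t)-1 when it meets an invalid sequence.
    return std::mbsrtowcs(nullptr, &src, 0, &state) != static_cast<std::size_t>(-1);
}
```

Every NTBS can be passed to this function, but which ones it accepts depends on the locale that is active at the time of the call.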

TH: Whatever those are.

JM: Now the question is what this NTBS-- it's in the C standard.

TH: mbstowcs.

JM: The mbstowcs function converts a sequence of multibyte characters that begins in the initial shift state-- it just returns a null-terminated byte sequence-- the conversion function into a sequence of corresponding wide characters. Each is converted as if by a call to mbtowc function. Except that the conversion state of the mbtowc function is not affected. So for the specific conversion it defers to the other function.

SD: This does seem overall to be just a whole class of interesting ways of producing broken text. The whole what() facility, assembling user-specified data with string literals and doing something to them in the hopes that someone can reconstruct something intelligible.

JM: The heading for this mbtowc function says, the behavior of the multibyte character functions is affected by the LC_CTYPE category of the current locale. Apparently LC_CTYPE can change what a multibyte character sequence is. That means, essentially, the definition of what a multibyte character string is is dependent on the LC_CTYPE locale category because the definition of a multibyte character sequence says it must be a valid sequence, and the locale tells me what's valid and what's not. Presumably that means it's actually the global locale or thread-related locale.

SD: Or in our current terminology, the execution encoding.

JM: Which is unfortunate, because usually C++ tries to make the locale explicit in the interface. In iostreams you can imbue the locale of your choice, you don't need global state which is broken by design.

SD: Plus the built in race condition during the exception.

TH: Passing in a locale wouldn't work because the message is constructed much earlier.

JM: You want to pass it in at the place the exception is generated, not when the what function is called.

TH: But the locale could have changed when you invoke what().

JM: At least it's well-defined. If you call what() and can't make sense out of it, then it's your fault. But it's hypothetical, because there's no way to pass a locale at the point where the filesystem_error is generated. Is there locale stuff on filesystem?

TH: There is.

VZ: No, it just says "a system-specific encoding."

TH: I think some of the functions do actually take a locale, it's used to do a conversion to the encoding you're talking about.

JM: The example in the issue where file size is being queried doesn't seem like somewhere a locale fits in.

SD: This is an example of the general problem.

JM: Looking at the example, there's nowhere to pass in a locale. No one expects to pass a locale to a file size query function.

TH: The only way you get non-mojibake out is if the global locale was consistent from the time the message was created to when it was received.

VZ: Did we figure out that NTBS is always in the global locale?

JM: NTMBS is in the global locale.

SD: Unless it's specifically a string literal.

VZ: But what we have is NTBS.

TH: The remarks say NTMBS but the text says NTBS. It's not consistent.

JM: The returns says NTBS, which is any kind of null-terminated byte sequence. The remarks say, we already told you earlier that NTMBS is a valid NTBS. That just gives permission to the implementation to give you an NTMBS as opposed to just an NTBS. What we can do is, for the case of an NTMBS, where we already say it's suitable for conversion and display as a wstring, we might want to clarify that it was suitable at the time of construction of the exception for wstring, because it needs to evaluate LC_CTYPE at construction time, not when you call what. That's what's missing from the remarks, otherwise we already have the cross-reference to codecvt so we already know what's happening. For the returns part, which are the minimum requirements, we haven't solved anything. So far the standard has even refrained from telling you it must be a multibyte string if multibyte strings are on your platform. An implementation can return an ASCII only string even if it could return a multibyte string. We have two things we should do. One is to clarify the remarks with respect to when the suitable for conversion and display holds. That holds only immediately after construction and not later (or we restrict changing LC_CTYPE or whatever). What we do for plain NTBS's-- I don't know. Maybe the best thing is not to talk about it.

TH: For solving Victor's problem, we have two concerns. The file path, incorporating it into a message. That's one issue. As for taking this NTBS that comes out and getting it formatted, we can specify "as if using the C function." We can just specify to use the global locale and you get what you get. Sometimes there might be some weird translations.

VZ: To clarify, by global locale we mean global C locale? There's multiple sets of locales. C++ locale and C locale. You can separately set both of them, they're unrelated.

SD: You can change various parts of the locale bits independently.

JM: No, we're talking about the function call set_locale. There's a C variant of the global set_locale, that takes a C variant, and there's an equivalent for the C++ locale structures, which sets an independent locale state.

TH: It may also set the C locale.

JM: But it's not required to. At the start of the program you can expect that they're the same. The question is, which one do we take. Presumably the C++ locale.
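The two pieces of global state can be observed side by side. This is a small sketch, not part of the minutes' proposal; `LocaleNames` and `current_locale_names` are hypothetical names introduced for illustration.

```cpp
#include <clocale>
#include <locale>
#include <string>

// Hypothetical pair of names: the C locale's LC_CTYPE category and the
// global C++ locale, queried independently.
struct LocaleNames {
    std::string c_ctype;
    std::string cpp_global;
};

LocaleNames current_locale_names() {
    // Passing nullptr queries the C locale without changing it.
    const char* c_name = std::setlocale(LC_CTYPE, nullptr);
    // A default-constructed std::locale is a copy of the global C++ locale.
    return {c_name ? c_name : "", std::locale().name()};
}
```

Calling `std::setlocale(LC_CTYPE, "")` changes only the first field. Calling `std::locale::global()` with a named locale changes the second and also sets the C locale, which is the one-way coupling discussed above.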

TH: Except that we have the reference to the conversion functions which are C-based and use the C locale.

JM: Where?

TH: Maybe I misunderstood before. mbstowcs?

JM: That's not what we do here. The actual cross reference is to the codecvt facet. The class codecvt is for use when converting from one character encoding to another... . We have ctype, wchar_t, and mbstate. Presumably wchar_t is the internal encoding, the external encoding is char, and the state_t is an mbstate which is a transformation. codecvt converts between the native character set for ordinary and wide characters.

TH: This might be a case where it'd be good to try to ... some implementations and set the C and C++ locales differently and see which one you get.

JM: Where do you want to get what?

TH: Produce an exception object but have locale set.. but we don't specify

JM: We want to specify transcoding..

TH: There is transcoding of file paths on the Windows side.

JM: This text talks about the OS-dependent current encoding for path names which in this case is CP1251.

VZ: I think path is a red herring because it has its own unrelated transcoding. What we need to specify is, what's the target encoding for exception? And specify what the output of the path method should be converted into. What Tom is suggesting is to look at what path does, that's not correct.

SD: For anyone producing this string, what should they be trying to do? I think they should be targeting the current execution encoding as defined by locale.

EN: Which one.

JM: The global C++ locale. No reference to the C locale in the cross reference. It says locale codecvt, which you get from the global C++ locale.

VZ: I have a question to Jens. Clarify: the what() makes sense after you construct, because locale can change

SD: At the point of construction.

VZ: Locale can be changed asynchronously-- what do you mean by that?

SD: That you have a problem if someone does that.

JM: Well, no. Where's the global C++ locale query function? Is that the default ctor of the locale class?

TH: Maybe? There might also be a global static factory function.

JM: Yes, there's a classic thing (useless) and a global locale function...

JM: If we have a named locale, you get the C locale set to the same thing, otherwise all bets are off. locale() is the constructor of the locale class which gets you a copy of the global C++ locale. Race conditions aren't relevant here-- it's the ctor of a class, no special rules on race conditions, you can call the locale global setting function unsynchronized and the stdlib has to deal with it.

SD: You have a logical race condition between starting this process and who interprets it, but that's baked in.

VZ: But the ctor might have multiple arguments, what if the locale changes in between?

JM: Your program is broken.

VZ: Why?

JM: Because we don't prevent anyone from changing the locale midway. The best atomicity guarantee is the default ctor of locale. If you call it multiple times in close proximity and get different results, tough luck.

TH: So you're supposed to acquire your own copy and reuse it.
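One way to picture "acquire your own copy and reuse it" is an exception type that snapshots the global C++ locale once, when it is constructed. This is a sketch only; `localized_error` and `message_locale` are hypothetical names, and nothing in the standard library records a locale this way.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical exception type that remembers which global C++ locale was in
// effect when the message was built.
class localized_error : public std::runtime_error {
public:
    explicit localized_error(const std::string& message)
        : std::runtime_error(message),
          locale_()  // std::locale() copies the current global C++ locale
    {}

    // The locale a consumer could use later to interpret what().
    const std::locale& message_locale() const noexcept { return locale_; }

private:
    std::locale locale_;
};
```

A handler that later formats `what()` could consult `message_locale()` instead of whatever the global locale has become by then.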

VZ: We should specify that somehow, that it's in one locale and not a multiple of locales.

SD: That's instructing programmers not to do broken things.

VZ: One exception object with potentially multiple things it needs locale for.

JM: What multiple things? It will use the locale in effect when initiating the ctor call of the exception. For user exceptions that, e.g., combine system_errors into one exception, all we can do is throw up our hands. We can't even query which encodings were used. Changing the global locale is a bad idea and as much as we should be able to convey that, we should do that.

SD: The best we can do is, for exception definitions, what does what() return? An NTBS to be interpreted in the locale that was in effect when the exception was constructed.

JM: I don't know about this NTBS part. I'm pretty sure, if we have an implementation that is an NTMBS, then that should be in the locale at the time of construction. That's the easy part. If we just have an NTBS, that is not a multibyte string, which isn't Victor's example, by the way, because it combines UTF-8 with an odd encoding in the exception string, but for the NTBS case where the implementation chooses not to provide an NTMBS, just an NTBS, which doesn't have enough capabilities to represent the union of the characters in the explanatory string and the path, I don't know what to do.

JM: Maybe all we can do is say, an implementation defined NTBS, and stop there, and you get what you get, but you can give a remarks recommendation for what happens for the multibyte case. If the implementation tries to be helpful to you, you should get something we know how to interpret, but it's in principle QOI. Maybe you're on a small system where there's no practical choice of encoding, so the NTBS of your system is all that counts.

SD: It's possible that an NTMBS is still a single byte encoding. It's about who's promising what and when. But I agree. In the remarks when we clarify this, we can say if someone hasn't handed you a string in the locale when the string was constructed, they're breaking your contract.

JM: So do we have agreement that we fine tune the NTMBS wording because we have machinery how to interpret NTMBS strings, and we leave the guarantee alone because there isn't any guarantee?

SD: The guarantee is that it's null terminated.

JM: Which is not very useful. But again, the remarks say, not as clearly as they could, they say this is how you can be helpful to your users. And it's implementation defined so presumably you can ask your implementer.

SD: I propose that I'll take on drafting something after this meeting, that we can propose as the resolution.

VZ: It's not sufficient. The core of the issue is you can't say anything about what() and we're not fixing that.

SD: That's the state of the world. The remarks are guiding QOI.

VZ: It's broken and we're keeping it broken. We're not doing anything.

SD: I don't see a way of telling everyone generally, because there's user data that can show up in what(). The file name can be misencoded. So there's no way to guarantee that this can be put into a properly encoded string of anything.

TH: But what we can say is that for the purposes of std::format, if you call what() on an exception object, interpret it as an NTMBS, and for anything that doesn't convert you do escaping.
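A rough sketch of that idea follows, assuming the message is interpreted under the LC_CTYPE category of the current C locale. `escape_invalid_multibyte` is a hypothetical helper, not how any `std::format` implementation works, and the `\x{..}` spelling merely imitates the escaped-string style discussed earlier.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <cwchar>
#include <string>

// Hypothetical helper: keep every byte sequence the current C locale accepts
// as a multibyte character, and replace each rejected byte with \x{..}.
std::string escape_invalid_multibyte(const char* what) {
    std::string out;
    std::mbstate_t state{};
    const char* p = what;
    std::size_t remaining = std::strlen(what);
    while (remaining > 0) {
        wchar_t wc;
        const std::size_t n = std::mbrtowc(&wc, p, remaining, &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2)) {
            // Invalid or truncated sequence: escape one byte and resynchronize.
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\x{%02x}",
                          static_cast<unsigned>(static_cast<unsigned char>(*p)));
            out += buf;
            state = std::mbstate_t{};
            ++p;
            --remaining;
        } else if (n == 0) {
            break;  // a null character was decoded; treat it as the end
        } else {
            out.append(p, n);  // valid multibyte character, copied unchanged
            p += n;
            remaining -= n;
        }
    }
    return out;
}
```

Which bytes end up escaped depends entirely on the locale active when the helper runs, which is the weakness raised in this discussion.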

SD: Yes. When trying to format one of these strings, you're going to have to be suspicious because it's foreign data. It's an NTMBS in the execution encoding, do your best to produce output in the requested format. But what() should be in the execution encoding.

EN: Couldn't we forbid an NTBS that's not an NTMBS?

JM: Do we have an overview of standard library implementations? If we tighten the rules on what what() can return, we tighten the rules on what can be passed as ctor args to e.g. logic_error. Because what() returns byte-wise what was passed in. So no transcoding can happen in the ctor. But now that we require a valid MBS in what() we require a valid MBS as the ctor to the exceptions. Those people don't care about encoding-- they just want to say what they put in they get out. Do we want to invalidate them?
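The byte-wise pass-through is easy to observe. In this minimal sketch `what_round_trips` is a hypothetical helper name; the comparison itself relies on the standard's postcondition that `strcmp(what(), what_arg) == 0` for the `const char*` constructor of `std::logic_error`.

```cpp
#include <cstring>
#include <stdexcept>

// Hypothetical helper: true if what() hands back exactly the bytes that were
// passed to the constructor, whatever encoding they may be in.
bool what_round_trips(const char* bytes) {
    std::logic_error e(bytes);
    return std::strcmp(e.what(), bytes) == 0;
}
```

The check succeeds even for bytes that are not valid in any multibyte encoding, which is why a new precondition on `what()` would also become a precondition on the constructor arguments.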

EN: Seems convincing that we shouldn't.

SD: I think that, first off, if the execution encoding and ordinary literal encoding aren't compatible, you have deep problems producing any output whatsoever. Hello world starts to fail.

JM: That's all fine but the point is my program has library UB if I violate the preconditions of a library function. That's not a good place to be in.

TH: I strongly agree. We don't want to invalidate any user code.

SD: Not in the exception or what parts. In producing a formatter, things are in play.

JM: The formatter has all rights to say, I expect the what string to be an NTMBS. That is totally fine and good. Then you convert from that NTMBS to whatever you want and go from there. That seems plausible. But that's the formatter at the point it wants to output that stuff. It needs to understand the details. We can strengthen the words to say something like, we recommend that implementations when constructing the what string on your own, as opposed to a user one, should create a valid NTMBS. I don't think we can do more than that. Life would be so much better if we just said UTF-8 everywhere, but that's not our life.

VZ: I think there was some mischaracterization of the example. Something like, because it's path we can't do anything. In fact we can do a lot to improve the situation. We can have all the info, even though we don't now. If we know the encoding of the path, we can get a perfect output even in the constraints of the current system. You can display arbitrary binary data through escaping.

JM: Do you want to expose the filesystem path itself in the exception object so the formatter can use it?

VZ: It's already exposed as part of the message and should be aligned with the rest of the text, not in a collection of the text.

JM: We agree that what() should not be in multiple encodings. It *should* as in implementation recommended practice, definitely. That's what we're trying to formulate.

SD: I think the phrase here, in the native format, is woefully vague, and a source of confusion as part of this. As Jens already pointed out, there are other exceptions which just take a string and copy it, which have all the same potential issues as Victor identified. This is more remediable by an implementer.

JM: We can address the filesystem issue, I think, at least we can push implementations in the right direction. I'm not sure we can do anything reasonable for exceptions as a whole.

TH: I agree. Victor, I think if a path is being included in the message, we'd want to reinterpret it and escape it using the mechanism std::format uses. Can we get away with convincing implementers to change existing code?

JM: Certainly, it's an untenable situation that we have one NTMBS that uses two encodings inside. That will never ever work.

VZ: We have an implementer here. Mark, would you be willing to change exception messages?

MD: I'm not sure we want to. This might have implications on users. Would be good to investigate further. With libc++ we are typically UTF-8 only which makes our life easier. On Windows people have multiple encodings.

JM: The filesystem_error constructors take not only two components but actually three: a what arg (a literal string), a path, and an error code. We have three components: user-defined (unknown encoding), path (known), error_code converted to system_error string (known to implementation). The problem is slightly larger than we thought. We also need to require from users that their what() arg is of the right kind, which is probably okay for filesystem_error because we have good reasons to require that at that point.

SD: I'll do some drafting work about the remarks.

JM: And there also needs to be a change in the guarantee of the filesystem_error what. It just says that what() returns an NTBS. We can tighten that part and add preconditions to the what() arg of the ctors. We might be able to convince implementers to improve that, specifically for filesystem_error.

JM: Introducing 4090. We have std::format, we have variants of std::format that take an explicit std::locale parameter. Let's focus on those. Then we have an L format specifier that uses the locale you passed in. We have a statement, "For integral types, the locale-specific form causes the context's locale to be used to insert the appropriate digit group separator characters." There's probably something similar for floating point. We have several options to get that promise done and remember that locales are user-configurable and therefore the user actually sees which virtual functions are being invoked. For iostreams we have specific rules under which circumstances functions are being invoked. When outputting numbers num_put is invoked. We don't say this here, we should say something. Is it enough for users to override the num_put facility to get different formatting, or do they also, or instead, replace the numpunct facility? There are also _byname facets, are those relevant? And so that's the fundamental question here. The problem is that the num_put facility may not actually insert the appropriate digit group separator characters even though numpunct may specify which ones are appropriate, because the user may ignore numpunct and do num_put the way they want. numpunct also allows only single-byte characters as digit separators. If we have UTF-8 and some Asian locale, we could do interesting separator characters. We can't do that with numpunct. There's a practical benefit of requiring a call to num_put.
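What "user-observable" means for a facet can be shown with iostreams, where the rules are already specified. In this sketch `underscore_grouping` and `format_grouped` are hypothetical names; the example deliberately uses an `ostringstream` rather than `std::format`, because whether the `L` option consults the same facet is exactly the open question in the issue.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical user facet: groups digits in threes with '_'.
struct underscore_grouping : std::numpunct<char> {
protected:
    // The separator is a single char, so a multi-byte UTF-8 separator
    // cannot be expressed through this interface.
    char do_thousands_sep() const override { return '_'; }
    std::string do_grouping() const override { return "\3"; }
};

// Format an integer through iostreams with the custom facet installed.
std::string format_grouped(long value) {
    std::ostringstream os;
    // The locale takes ownership of the facet and deletes it when done.
    os.imbue(std::locale(std::locale::classic(), new underscore_grouping));
    os << value;  // num_put consults the numpunct facet of the imbued locale
    return os.str();
}
```

With this facet installed, `format_grouped(1234567)` yields `1_234_567`. A user who overrides `num_put` instead would be bypassed by any formatter that reads `numpunct` directly.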

SD: I could make the digit separator a half-width comma.

JM: Something like that.

TH: We could find out what implementations are doing.

SD: Because these are user-creatable, they're user-perceivable.

JM: User-observable, so we need to be precise or expressly imprecise.

SD: Either tell users precisely what's going to happen or tell them to bring it up with their implementers.

JM: Or tell them they're doing something wrong and invoking UB.

JM: Tom suggested we should have an implementation survey of existing std::format implementations.

TH: I'm looking at the msft implementation, was hoping Mark would know offhand.

MD: I don't know offhand, I can take a look. It's different from streams.

JM: Iostreams requires num_put. num_put just uses character ranges, not streams, right?

TH: The msft implementation does use numpunct somewhere. In an internal function called write_integral. And no uses of num_put.

JM: and num_put takes a reference to an ios_base as a parameter, which is an alien concept to construct in a formatter, but not something that'd be impossible to construct. But it does take an iter_type, which is an output iterator, which is a template parameter, which by default is an ostreambuf_iterator, but that's presumably configurable-- except that then we have to tell the world what we actually use if not an ostreambuf_iterator. So we might conclude that num_put is too tied to iostreams to bother with in the format context.

TH: Has std::format been implemented in a shipping libstdc++?

MD: Yes in 13. Not complete, 14 has more improvements.

TH: libstdc++ seems to use numpunct as well. Lots of uses of numpunct and no uses of num_put.

JM: I agree. Looks like numpunct wins.

SD: We should specify that it's doing numpunct. That does mean that you only get a char type for it.

JM: Well, we already know the locale interface is broken. If we come up with a better way, maybe we'll have a new overload of std::format.

MD: IMO, this opinion should also address floating point and boolean values (the true name and the false name).

TH: Jens, will you offer a PR for your issue?

JM: I don't know, why? It's broken, you can keep all the pieces. I'm not supplying glue.

TH: Well, but in terms of actually specifying that numpunct is used?

JM: Yes, so?

TH: What do you think we should do with the issue you filed?

JM: We should tell LWG that SG16 resolved that after implementation review, numpunct is the winner and use of numpunct is explicitly specified, as are truename and falsename, and for floating point numpunct is the only thing. We can pass that on as prose text and someone can morph it into a PR if they want to.

Tom.