The Microsoft implementation doesn’t have to deal with any encodings in which the basic format string control characters aren’t invariant. We do have to deal with a slightly different issue: in some encodings, some bytes of multibyte characters are bit-identical to format control characters. This means that if we detect that such an encoding may be in use, we can’t skip any bytes when searching through the format string for control characters. If we did, we’d have no way of knowing whether a control character we found was actually a multibyte continuation byte (there may be clever backtracking tricks we could do).
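To make the problem concrete, here is a minimal sketch, not the actual implementation. It assumes a simplified Shift-JIS rule (bytes 0x81-0x9F and 0xE0-0xFC start a two-byte character) and an ASCII-compatible execution character set; the function names are made up for illustration.

```cpp
#include <cstddef>
#include <string_view>

// Assumption: simplified Shift-JIS rule, where bytes 0x81-0x9F and 0xE0-0xFC
// start a two-byte character. Real code pages have more cases.
constexpr bool is_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive search: treats every 0x7B byte as '{', even inside a two-byte character.
constexpr std::size_t count_open_braces_naive(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if (c == '{') ++n;
    }
    return n;
}

// Encoding-aware search: walks from the start and steps over trail bytes.
constexpr std::size_t count_open_braces_aware(std::string_view s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if (is_lead_byte(b)) {
            ++i;  // the next byte belongs to this character, whatever its value
        } else if (b == '{') {
            ++n;
        }
    }
    return n;
}
```

For the byte sequence 0x83 0x7B followed by `{}`, the naive search reports two opening braces while the encoding-aware one reports one, because the 0x7B after the lead byte 0x83 is a trail byte rather than a control character.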
Additionally, because we have nothing like P1885 in our compiler, we can’t parse the format string at compile time unless we detect that the literal encoding is UTF-8. (We detect UTF-8 by forming a literal string at constexpr time and looking at its bytes; this may be possible for other encodings, but it would be encoding-specific.) If we can’t detect UTF-8, we _assume_ that the literal encoding is the same as the _runtime_ system locale (this locale is different from the one std::locale() gives, and can’t be changed in Windows without a reboot). This is, to some extent, wrong, and it means that if you don’t use the `/utf-8` compiler option then in some cases your program won’t run correctly on systems with a different locale; however, it’s the usual/“expected” behavior. We then use the runtime system locale and various CRT functions to correctly iterate over the format string.
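The UTF-8 detection trick can be sketched roughly as follows. This is an illustration of the idea only, not the code we ship; the function names and the choice of U+00E9 as the probe character are assumptions.

```cpp
#include <cstddef>

// Checks whether a byte array is the UTF-8 encoding of U+00E9 (0xC3 0xA9)
// plus the terminating NUL.
template <std::size_t N>
constexpr bool is_utf8_e_acute(const char (&bytes)[N]) {
    return N == 3
        && static_cast<unsigned char>(bytes[0]) == 0xC3
        && static_cast<unsigned char>(bytes[1]) == 0xA9;
}

// The compiler translates the literal into the execution character set, so the
// bytes seen here reveal whether that character set is UTF-8.
constexpr bool literal_encoding_looks_like_utf8() {
    return is_utf8_e_acute("\u00E9");
}
```

Under a single-byte code page that contains U+00E9, the literal would be one byte plus the NUL, so the check fails and the implementation would fall back to the runtime path.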
On the subject of P1885: we don’t actually need P1885 to be standardized to implement constexpr format string parsing on non-UTF execution character sets. We just need the compiler magic that enables a P1885 implementation, and as far as I understand (I’m not an expert in our compiler frontend) that is not hard to implement; we just haven’t gotten to it yet.
To do compile-time parsing for such encodings we’d need to get access to some of the CRT’s encoding database at compile time. That’s very possible, but a little annoying. On the upside, once we do that our behavior will be correct even if the program is run on a system with a different “system” locale than the one implied by `/execution-charset`.
Also, the “status quo” is that mixing object files compiled using different execution charsets is unlikely to always work. It can affect ABI: if you have basic_format_args or _format-arg-store_ as part of your ABI, then passing that object to a version of std::format that’s been instantiated using `/utf-8` won’t work correctly if the format string contains any non-ASCII characters. In any case, I think the standard can assume that all TUs are compiled with the same execution character set; if not, then your code is non-standard and may not work correctly.
I sometimes wish users weren’t allowed to pass around _format-arg-stores_ or basic_format_args. I suppose it’s required to some extent for user-specified formatters, but it’s one of those types that’s vulnerable to being locked into a non-optimal data layout because of ABI. basic_format_args in particular implements a clever data layout (custom type erasure) that it would be nice to be able to change around. Oh well.
On Mon, Jun 7, 2021 at 10:42 AM Hubert Tong <email@example.com> wrote:
[off-list: note CC] I'm not sending to the list this close to plenary but I would like to send a response to Tom before plenary starts.
[+list] Now that plenary is over...
On Mon, Jun 7, 2021 at 1:13 AM Tom Honermann <firstname.lastname@example.org> wrote:
On 6/6/21 8:15 PM, Hubert Tong wrote:
I am not aware of implementation experience for this paper in environments where characters significant to the interpretation of the format string are not locale-invariant. There is, however, reason to believe that an implementation can be realistically deployed to such environments while giving the user some ability to choose the text encoding under which format strings are parsed. As it is, the paper uses format-string and wformat-string as exposition-only types in the signature of the `format` functions. It is possible for an implementation to version these functions (across translation unit boundaries) by embedding the text encoding information into these types. A mechanism such as std::text_encoding::literal().mib() from P1885R5 (which has not yet advanced to plenary) could be used.
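As a rough sketch of what embedding the encoding into the type could look like: everything named here is hypothetical, and the enum merely stands in for whatever std::text_encoding::literal().mib() would return (106 and 17 are the IANA MIBenum values for UTF-8 and Shift_JIS).

```cpp
#include <string_view>

// Hypothetical stand-in for the value of std::text_encoding::literal().mib().
enum class literal_encoding : int { utf8 = 106, shift_jis = 17 };

// Hypothetical exposition-only format-string type carrying the encoding in its type.
template <literal_encoding Enc>
struct tagged_format_string {
    std::string_view str;
};

// Each encoding yields a distinct specialization, so translation units built
// with different literal encodings never share an instantiation of the parser.
template <literal_encoding Enc>
constexpr literal_encoding parse_encoding(tagged_format_string<Enc>) {
    return Enc;
}
```

The point of the sketch is only that the encoding becomes part of the function's identity across translation unit boundaries; how an implementation would obtain the value per TU is left open here.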
That mechanism does not appear to be an option for the vformat_to() or vformat() overloads since their signatures do not include format-string or wformat-string. It doesn't look to me like that information can be smuggled through the types of basic_format_args or basic_format_context either, though perhaps they could be used to store a value that indicates the literal encoding.
Thanks for pointing this out; it looks like there is more for me to plan on looking into for the implementation on my end.
With a per-TU limitation, working some magic into make_format_args will work (I think). I admit to not being an experienced library implementer.
[P1885 is currently scheduled for consideration by LEWG during its 2021-08-03 telecon]
Thanks for the info.
The above approach is perhaps more limiting than strictly necessary in the presence of extensions that allow the translation of string literals to be changed within a translation unit. It is noted that P1885R5 exposes the literal encoding as a consteval function, which is compatible with context-sensitive evaluation by the implementation. In case there is an appetite to allow for such context-sensitivity for format strings, it is probably the case that updating the text to allow for exposition-only extra parameters in the signature is purely a specification matter and does not affect implementations where such scenarios do not occur. It is also rather likely that implementations which do employ such extra parameters are conforming anyway (because the extra parameters are only observable when a user applies an extension). Nevertheless, the paper may be just the beginning of a number of changes that are candidates for being considered retroactive to C++20.
It looks to me like construction of a basic_format_context specialization is effectively unspecified due to the lack of constructors and the presence of exposition-only data members. Perhaps more of it can be specified as exposition only.
Incidentally, I think the specification of basic_format_context may be missing an exposition-only std::locale data member corresponding to any locale passed to a formatting function.
I think, implementation-wise, something can still be done with make_format_args for this. Not sure how much of what's needed will fit within the leeway of the wording.
TL;DR: The paper sets a direction (but does not actually spell out that it does) of using the encoding associated with literal translation for parsing format strings. The work around improving the management and handling of said encodings is still ongoing; therefore, where this paper leads us is not as clear as it could be given additional time. Nevertheless, it is probably the case that further incremental improvements can be made on top of this paper without compatibility breakage for implementations that choose to deploy earlier. In certain environments, quality-of-implementation around this paper may be dependent on additional improvements to the specification. Since this paper is being considered to be retroactive to C++20, it is reasonable to expect that improvements of the aforementioned kind would also be considered for retroactive inclusion as they are discovered.
I agree, though the words "probably the case" give me pause. Regardless, at least for me, this does not translate to a desire to delay adopting this paper.
I agree that establishing sufficient consensus on the broad design intent is important.