On Wed, Dec 8, 2021, 23:40 Tom Honermann <tom@honermann.net> wrote:
On 12/5/21 2:26 PM, Jens Maurer wrote:
On 05/12/2021 01.04, Tom Honermann wrote:
On 12/4/21 6:05 AM, Jens Maurer wrote:
If we impose a requirement for a code unit -> code point decoder for the literal encoding at compile-time, we should make such a facility generally available instead of hiding it in the guts of the std::format parser.
I think JeanHeyd's work on P1629 <https://wg21.link/p1629> will fill this niche. It would be nice if the features he proposes in N2730 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2730.htm> were usable at compile-time as well, but that will likely have to await some kind of constexpr support in C.
Why? We've made functions constexpr that are inherited from C before.
Sure, we have, and could do so again. In this case, there are behaviors that we would have to specify that should be decided in conjunction with WG14. For example, the N2730 "mc" and "mwc" function variants operate on the locale dependent execution encoding. We would have to specify what that means for compile-time evaluation. The obvious answer is, of course, that it means the ordinary/wide literal encoding. Since that encoding may differ from the run-time execution encoding, this presumably means defining a locale (or at least the LC_CTYPE locale category) for use at compile-time. We would then have to tie the behavior to std::is_constant_evaluated() (so that the separation of compile-time vs run-time is rigorously defined) for which there is presently no corresponding C facility.
These are not necessarily simple functions that can readily be inlined or made builtins. As we've previously discussed, EBCDIC code pages do not all consistently encode '{' and '}'. An ISO-2022 escape mechanism that allows switching character sets presumably would require the implementation to track shift state and have access to character set tables in order to recognize all encodings of these characters. Though, perhaps such an encoding is disallowed by [lex.charset]p6? It isn't clear to me how to apply that wording to shift-state encodings.
Nothing precludes shift-state literal encodings; see the note in the same paragraph.
That note only applies to characters outside the basic literal character set. It doesn't apply (normatively or otherwise) to the scenario I presented.
To elaborate, the scenario I had in mind concerns something we
recently discussed; that '{' and '}' are mapped to 0xC0 and 0xD0
respectively in IBM-1047,
but mapped to 0x43 and 0xDC in IBM-273
and 0x51 and 0x54 in IBM-297.
When used with an ISO-2022 encoding that supports invoking those
code pages via escape sequences, it is possible to encounter
multiple encodings of those characters. However, I haven't been
able to determine if any compiler that targets an EBCDIC
environment supports such an encoding. IBM xlC supports SI/SO
sequences for switching between single-byte and double-byte
encoding, but I haven't found any documentation that suggests
escape sequence invocation of code pages is supported. Perhaps
Hubert can provide more information.
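To make the code page differences above concrete, here is a minimal sketch that tabulates the quoted code units and checks which byte encodes '{' under a given page. The names brace_encoding, brace_table, and encodes_open_brace are hypothetical and exist only for illustration; the byte values are the ones quoted in this message.

```cpp
#include <cstddef>
#include <cstdint>

// Code units for '{' and '}' in one EBCDIC code page (values as quoted above).
struct brace_encoding {
    const char* code_page;
    std::uint8_t open_brace;
    std::uint8_t close_brace;
};

// Hypothetical table; index 0 = IBM-1047, 1 = IBM-273, 2 = IBM-297.
constexpr brace_encoding brace_table[] = {
    {"IBM-1047", 0xC0, 0xD0},
    {"IBM-273",  0x43, 0xDC},
    {"IBM-297",  0x51, 0x54},
};

// Hypothetical helper: does `unit` encode '{' in the page at index `page`?
constexpr bool encodes_open_brace(std::size_t page, std::uint8_t unit) {
    return brace_table[page].open_brace == unit;
}

// Hypothetical helper: in how many of the tabulated pages does `unit` encode '{'?
constexpr int pages_encoding_open_brace(std::uint8_t unit) {
    int count = 0;
    for (const auto& entry : brace_table) {
        if (entry.open_brace == unit) {
            ++count;
        }
    }
    return count;
}

// The byte that means '{' in IBM-1047 does not mean '{' in IBM-273.
static_assert(encodes_open_brace(0, 0xC0), "0xC0 is '{' in IBM-1047");
static_assert(!encodes_open_brace(1, 0xC0), "0xC0 is not '{' in IBM-273");
```

A parser that has to recognize '{' in a stream that can switch between these pages would need to know which page is active at each position, which is the shift-state tracking concern raised above.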
A similar scenario applies for ISO-2022 encodings like
ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR though. An escape
sequence could invoke the ASCII character set over GR such that '{' is encoded at both 0x7B and 0xFB. I
don't know if that should be considered to violate [lex.charset]p6.
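As a rough sketch of the GR scenario above: in an 8-bit ISO-2022 environment, a 94-character set invoked over GR uses the GL code unit with the high bit set, which is where 0xFB comes from. The helpers gl_to_gr and is_open_brace are hypothetical, and the predicate assumes ASCII is invoked over both GL and GR at the same time, which is the hypothetical case described here rather than something a particular compiler is known to support.

```cpp
#include <cstdint>

// Map a GL code unit (0x21..0x7E) to the corresponding GR code unit
// (0xA1..0xFE) by setting the high bit.
constexpr std::uint8_t gl_to_gr(std::uint8_t gl_unit) {
    return static_cast<std::uint8_t>(gl_unit | 0x80);
}

// Hypothetical predicate: does `unit` encode '{' when ASCII is invoked
// over both GL and GR? (Assumption made for illustration only.)
constexpr bool is_open_brace(std::uint8_t unit) {
    return unit == 0x7B || unit == gl_to_gr(0x7B);
}

// '{' is 0x7B over GL and 0xFB over GR.
static_assert(gl_to_gr(0x7B) == 0xFB, "GR form of '{'");
static_assert(is_open_brace(0x7B) && is_open_brace(0xFB), "two encodings of '{'");
```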
Tom.