sg16: Re: [SG16] Agenda for the 2021-12-01 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Thu, 9 Dec 2021 18:31:51 +0100

On Thu, Dec 9, 2021 at 5:39 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 12/9/21 12:28 AM, Corentin Jabot wrote:
>
>
>
> On Wed, Dec 8, 2021, 23:40 Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 12/5/21 2:26 PM, Jens Maurer wrote:
>>
>> On 05/12/2021 01.04, Tom Honermann wrote:
>>
>> On 12/4/21 6:05 AM, Jens Maurer wrote:
>>
>> If we impose a requirement for a code unit -> code point decoder for the
>> literal encoding at compile-time, we should make such a facility generally
>> available instead of hiding it in the guts of the std::format parser.
>>
>> I think JeanHeyd's work on P1629 <https://wg21.link/p1629> will fill this niche. It would be nice if the features he proposes in N2730 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2730.htm> were usable at compile-time as well, but that will likely have to await some kind of constexpr support in C.
>>
>> Why? We've made functions constexpr that are inherited from C
>> before.
>>
>> Sure, we have, and could do so again. In this case, there are behaviors
>> that we would have to specify that should be decided in conjunction with
>> WG14. For example, the N2730 "mc" and "mwc" function variants operate on
>> the locale dependent execution encoding. We would have to specify what that
>> means for compile-time evaluation. The obvious answer is, of course, that
>> it means the ordinary/wide literal encoding. Since that encoding may differ
>> from the run-time execution encoding, this presumably means defining a
>> locale (or at least the LC_CTYPE locale category) for use at
>> compile-time. We would then have to tie the behavior to
>> std::is_constant_evaluated() (so that the separation of compile-time vs
>> run-time is rigorously defined) for which there is presently no
>> corresponding C facility.
>>
>> These are not necessarily simple functions that can readily be inlined
>> or made builtins. As we've previously discussed, EBCDIC code pages do not
>> all consistently encode '{' and '}'. An ISO-2022 escape mechanism that
>> allows switching character sets presumably would require the implementation
>> to track shift state and have access to character set tables in order to
>> recognize all encodings of these characters. Though, perhaps such an
>> encoding is disallowed by [lex.charset]p6
>> <http://eel.is/c++draft/lex.charset#6>? It isn't clear to me how to
>> apply that wording to shift-state encodings.
>>
>
> Nothing precludes shift state literal encodings, see note in the same
> paragraph.
>
> That note only applies to characters outside the basic literal character
> set. It doesn't apply (normatively or otherwise) to the scenario I
> presented.
>
> To elaborate, the scenario I had in mind concerns something we recently
> discussed; that '{' and '}' are mapped to 0xC0 and 0xD0 respectively in
> IBM-1047
> <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-1047_P100-1995&s=IBM>,
> but mapped to 0x43 and 0xDC in IBM-273
> <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-273_P100-1995&s=IBM>
> and 0x51 and 0x54 in IBM-297
> <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-297_P100-1995&s=IBM>.
> When used with an ISO-2022 encoding that supports invoking those code pages
> via escape sequences, it is possible to encounter multiple encodings of
> those characters. However, I haven't been able to determine if any compiler
> that targets an EBCDIC environment supports such an encoding. IBM xlC
> supports SI/SO sequences for switching between single-byte and double-byte
> encoding, but I haven't found any documentation that suggests escape
> sequence invocation of code pages is supported. Perhaps Hubert can provide
> more information.
>
> A similar scenario applies for ISO-2022 encodings like ISO-2022-CN,
> ISO-2022-JP, and ISO-2022-KR though. An escape sequence could invoke the
> ASCII character set over GR such that '{' is encoded at both 0x7B and
> 0xFB. I don't know if that should be considered to violate [lex.charset]p6
> <http://eel.is/c++draft/lex.charset#6>.
>
That an abstract character can be mapped to a single code unit does not
imply that no other code unit sequence can represent that abstract character
(this happens even in Unicode: U+212B ANGSTROM SIGN, for example, is
canonically equivalent to U+00C5).
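
As a minimal sketch of the Unicode example (the Utf8 struct and the
encode_utf8 and same_code_units helpers are hypothetical names introduced
here, not part of any paper), U+212B ANGSTROM SIGN and U+00C5 LATIN CAPITAL
LETTER A WITH RING ABOVE are canonically equivalent, yet they encode to
different UTF-8 code unit sequences. The encoder assumes a valid scalar
value and does not reject surrogates.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper type: up to four UTF-8 code units for one code point.
struct Utf8 {
    std::array<unsigned char, 4> bytes{};
    std::size_t size = 0;
};

// Encodes one code point as UTF-8. Assumes cp is a valid scalar value.
constexpr Utf8 encode_utf8(char32_t cp) {
    Utf8 out{};
    if (cp < 0x80) {
        out.bytes[0] = static_cast<unsigned char>(cp);
        out.size = 1;
    } else if (cp < 0x800) {
        out.bytes[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        out.bytes[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.size = 2;
    } else if (cp < 0x10000) {
        out.bytes[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        out.bytes[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out.bytes[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.size = 3;
    } else {
        out.bytes[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        out.bytes[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        out.bytes[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out.bytes[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.size = 4;
    }
    return out;
}

// Plain code unit comparison; no normalization is applied.
constexpr bool same_code_units(const Utf8& a, const Utf8& b) {
    if (a.size != b.size) return false;
    for (std::size_t i = 0; i < a.size; ++i)
        if (a.bytes[i] != b.bytes[i]) return false;
    return true;
}

constexpr Utf8 angstrom_sign = encode_utf8(0x212B); // E2 84 AB
constexpr Utf8 a_with_ring   = encode_utf8(0x00C5); // C3 85

// Same abstract character, different code unit sequences.
static_assert(!same_code_units(angstrom_sign, a_with_ring));
```

A comparison that should treat the two as equal would need normalization,
which the sketch deliberately leaves out.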

In any case, the existence of stateful encodings prevents a naive approach
that only looks at code units (and a comparison with a single char, should
you attempt one, may or may not have the intended result).
In the presence of shift-state encodings, the guarantees given by p6 seem
of limited usefulness (and yet they are somewhat necessary for one to use
any of the ctype functions).
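
To make the problem with the naive approach concrete, here is a rough
sketch using a simplified subset of ISO-2022-JP. It handles only two escape
sequences: ESC $ B (switch to two-byte JIS X 0208) and ESC ( B (switch back
to ASCII). The function names and the sample bytes are hypothetical, and a
real decoder would need to handle more escape sequences and malformed
input. In two-byte mode, 0x7B can appear as half of a character, so a scan
that only compares code units against '{' over-counts.

```cpp
#include <cstddef>
#include <string_view>

constexpr char lbrace = 0x7B;                        // '{' in ASCII
constexpr char esc_two_byte[] = {0x1B, 0x24, 0x42};  // ESC $ B
constexpr char esc_ascii[]    = {0x1B, 0x28, 0x42};  // ESC ( B

// Naive scan: every code unit equal to 0x7B is counted as '{'.
constexpr std::size_t count_lbrace_naive(std::string_view s) {
    std::size_t n = 0;
    for (char c : s)
        if (c == lbrace) ++n;
    return n;
}

// Shift-state aware scan for the simplified subset described above.
constexpr std::size_t count_lbrace_stateful(std::string_view s) {
    const std::string_view to_two_byte(esc_two_byte, 3);
    const std::string_view to_ascii(esc_ascii, 3);
    bool two_byte = false;
    std::size_t n = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s.substr(i, 3) == to_two_byte) {
            two_byte = true;
            i += 3;
        } else if (s.substr(i, 3) == to_ascii) {
            two_byte = false;
            i += 3;
        } else if (two_byte) {
            i += 2;  // skip one two-byte character without inspecting it
        } else {
            if (s[i] == lbrace) ++n;
            ++i;
        }
    }
    return n;
}

// Hypothetical sample: ESC $ B, one two-byte character (0x30 0x7B),
// ESC ( B, then a single ASCII '{'.
constexpr char sample_bytes[] = {0x1B, 0x24, 0x42, 0x30, 0x7B,
                                 0x1B, 0x28, 0x42, 0x7B};
constexpr std::string_view sample(sample_bytes, sizeof sample_bytes);

static_assert(count_lbrace_naive(sample) == 2);     // false positive
static_assert(count_lbrace_stateful(sample) == 1);  // only the real '{'
```

The sketch compares against 0x7B rather than the literal '{' so that it
does not depend on the ordinary literal encoding of the compiler used to
build it.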


Received on 2021-12-09 11:32:05