sg16: Re: [SG16] Agenda for the 2021-12-01 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 9 Dec 2021 17:08:07 -0500

On 12/9/21 12:31 PM, Corentin Jabot wrote:
>
>
> On Thu, Dec 9, 2021 at 5:39 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 12/9/21 12:28 AM, Corentin Jabot wrote:
>>
>>
>> On Wed, Dec 8, 2021, 23:40 Tom Honermann <tom_at_[hidden]> wrote:
>>
>> On 12/5/21 2:26 PM, Jens Maurer wrote:
>>> On 05/12/2021 01.04, Tom Honermann wrote:
>>>> On 12/4/21 6:05 AM, Jens Maurer wrote:
>>>>> If we impose a requirement for a code unit -> code point decoder for the
>>>>> literal encoding at compile-time, we should make such a facility generally
>>>>> available instead of hiding it in the guts of the std::format parser.
>>>> I think JeanHeyd's work on P1629 <https://wg21.link/p1629> will fill this niche. It would be nice if the features he proposes in N2730 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2730.htm> were usable at compile-time as well, but that will likely have to await some kind of constexpr support in C.
>>> Why? We've made functions constexpr that are inherited from C
>>> before.
>>
>> Sure, we have, and could do so again. In this case, there are
>> behaviors that we would have to specify that should be
>> decided in conjunction with WG14. For example, the N2730 "mc"
>> and "mwc" function variants operate on the locale dependent
>> execution encoding. We would have to specify what that means
>> for compile-time evaluation. The obvious answer is, of
>> course, that it means the ordinary/wide literal encoding.
>> Since that encoding may differ from the run-time execution
>> encoding, this presumably means defining a locale (or at
>> least the LC_CTYPE locale category) for use at compile-time.
>> We would then have to tie the behavior to
>> std::is_constant_evaluated() (so that the separation of
>> compile-time vs run-time is rigorously defined) for which
>> there is presently no corresponding C facility.
>>
>> These are not necessarily simple functions that can
>> readily be inlined or made builtins. As we've previously
>> discussed, EBCDIC code pages do not all consistently encode
>> '{' and '}'. An ISO-2022 escape mechanism that allows
>> switching character sets presumably would require the
>> implementation to track shift state and have access to
>> character set tables in order to recognize all encodings of
>> these characters. Though, perhaps such an encoding is
>> disallowed by [lex.charset]p6
>> <http://eel.is/c++draft/lex.charset#6>? It isn't clear to me
>> how to apply that wording to shift-state encodings.
>>
>>
>> Nothing precludes shift state literal encodings, see note in the
>> same paragraph.
>
> That note only applies to characters outside the basic literal
> character set. It doesn't apply (normatively or otherwise) to the
> scenario I presented.
>
> To elaborate, the scenario I had in mind concerns something we
> recently discussed; that '{' and '}' are mapped to 0xC0 and 0xD0
> respectively in IBM-1047
> <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-1047_P100-1995&s=IBM>,
> but mapped to 0x43 and 0xDC in IBM-273
> <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-273_P100-1995&s=IBM>
> and 0x51 and 0x54 in IBM-297
> <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-297_P100-1995&s=IBM>.
> When used with an ISO-2022 encoding that supports invoking those
> code pages via escape sequences, it is possible to encounter
> multiple encodings of those characters. However, I haven't been
> able to determine if any compiler that targets an EBCDIC
> environment supports such an encoding. IBM xlC supports SI/SO
> sequences for switching between single-byte and double-byte
> encoding, but I haven't found any documentation that suggests
> escape sequence invocation of code pages is supported. Perhaps
> Hubert can provide more information.
>
> A similar scenario applies for ISO-2022 encodings like
> ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR though. An escape
> sequence could invoke the ASCII character set over GR such that
> '{' is encoded at both 0x7B and 0xFB. I don't know if that should
> be considered to violate [lex.charset]p6
> <http://eel.is/c++draft/lex.charset#6>.
>
> That an abstract character can be mapped to a single code unit does
> not imply multiple code unit sequences cannot represent said abstract
> character (even in Unicode, for example U+212B).
In general, sure. But [lex.charset]p6
<http://eel.is/c++draft/lex.charset#6> states, "encodes each element of
the basic literal character set as a single code unit with non-negative
value, distinct from the code unit for any other such element". If it
instead stated, "encodes each element of the basic literal character set
as _one or more_ code unit_s_ with non-negative value, distinct
from the code unit_s_ for any other such element", there would be no
question that the same element can have multiple encodings. But I guess
the absence of explicit prohibition is implicit allowance in this case.
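
For illustration, the p6 guarantee and the EBCDIC values quoted above can be
written down as a minimal C++ sketch. It is not taken from any implementation;
`distinct_nonnegative`, `BracePair`, `ebcdic_braces`, and `all_pages_agree`
are hypothetical names, and the code units are the ones cited earlier in this
thread for IBM-1047, IBM-273, and IBM-297.

```cpp
#include <cassert>

// Hypothetical helper: true if two elements of the basic literal character
// set are encoded as distinct, non-negative single code units in the
// ordinary literal encoding of the compiler translating this file.
constexpr bool distinct_nonnegative(char a, char b) {
    return a >= 0 && b >= 0 && a != b;
}

// Code units for '{' and '}' as quoted in this thread.
struct BracePair {
    const char* code_page;
    unsigned char open;   // code unit for '{'
    unsigned char close;  // code unit for '}'
};

constexpr BracePair ebcdic_braces[] = {
    {"IBM-1047", 0xC0, 0xD0},
    {"IBM-273",  0x43, 0xDC},
    {"IBM-297",  0x51, 0x54},
};

// True only if every listed code page encodes the braces identically.
constexpr bool all_pages_agree() {
    for (const BracePair& p : ebcdic_braces) {
        if (p.open != ebcdic_braces[0].open ||
            p.close != ebcdic_braces[0].close) {
            return false;
        }
    }
    return true;
}

// Within one literal encoding the braces are distinct single code units...
static_assert(distinct_nonnegative('{', '}'), "p6-style guarantee");
// ...but the code units differ from one code page to the next.
static_assert(!all_pages_agree(), "brace code units vary across code pages");
```

The first assertion only says something about the single literal encoding in
effect for the translation; it says nothing about text that switches code
pages mid-stream.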
>
> In any case, the existence of stateful encodings prevents a naive
> approach that only looks at code units (and comparison with a single
> char, should you attempt that, may or may not have the intended result).
> In the presence of shift state encodings, the guarantees given by p6
> seem of limited usefulness (and yet are somewhat necessary for one to
> use any of the ctype functions).

Agreed.
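
A minimal sketch of why a code-unit-only scan falls short, under a toy
encoding assumed purely for illustration: SO (0x0E) switches to a double-byte
mode and SI (0x0F) switches back. This is not the behavior of any particular
compiler or code page, and `count_naive` and `count_stateful` are hypothetical
names.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Toy shift controls (an assumption of this sketch, modeled on SI/SO use).
constexpr unsigned char SO = 0x0E;  // shift out: enter double-byte mode
constexpr unsigned char SI = 0x0F;  // shift in: return to single-byte mode

// Naive approach: count every code unit equal to `target`,
// ignoring shift state entirely.
constexpr std::size_t count_naive(std::string_view s, char target) {
    std::size_t n = 0;
    for (char c : s) {
        if (c == target) ++n;
    }
    return n;
}

// Shift-state aware: a code unit counts as `target` only in single-byte
// mode; in double-byte mode each character consumes two code units.
constexpr std::size_t count_stateful(std::string_view s, char target) {
    std::size_t n = 0;
    bool double_byte = false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char u = static_cast<unsigned char>(s[i]);
        if (u == SO) { double_byte = true;  continue; }
        if (u == SI) { double_byte = false; continue; }
        if (double_byte) { ++i; continue; }  // skip the trail code unit
        if (s[i] == target) ++n;
    }
    return n;
}
```

For example, with the six code units `'{'`, SO, 0x7B, 0x7B, SI, `'}'`, the
two functions disagree on any platform where `'{'` is 0x7B: the naive scan
also counts the two code units of the double-byte character.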

Tom.

> Tom.
>
>>
>> Tom.
>>
>

Received on 2021-12-09 16:08:10