sg16: Re: [SG16] Agenda for the 2021-12-01 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 9 Dec 2021 11:39:37 -0500

On 12/9/21 12:28 AM, Corentin Jabot wrote:
>
>
> On Wed, Dec 8, 2021, 23:40 Tom Honermann <tom_at_[hidden]> wrote:
>
> On 12/5/21 2:26 PM, Jens Maurer wrote:
>> On 05/12/2021 01.04, Tom Honermann wrote:
>>> On 12/4/21 6:05 AM, Jens Maurer wrote:
>>>> If we impose a requirement for a code unit -> code point decoder for the
>>>> literal encoding at compile-time, we should make such a facility generally
>>>> available instead of hiding it in the guts of the std::format parser.
>>> I think JeanHeyd's work on P1629 <https://wg21.link/p1629> will fill this niche. It would be nice if the features he proposes in N2730 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2730.htm> were usable at compile-time as well, but that will likely have to await some kind of constexpr support in C.
>> Why? We've made functions constexpr that are inherited from C
>> before.
>
> Sure, we have, and could do so again. In this case, there are
> behaviors that we would have to specify that should be decided in
> conjunction with WG14. For example, the N2730 "mc" and "mwc"
> function variants operate on the locale dependent execution
> encoding. We would have to specify what that means for
> compile-time evaluation. The obvious answer is, of course, that it
> means the ordinary/wide literal encoding. Since that encoding may
> differ from the run-time execution encoding, this presumably means
> defining a locale (or at least the LC_CTYPE locale category) for
> use at compile-time. We would then have to tie the behavior to
> std::is_constant_evaluated() (so that the separation of
> compile-time vs run-time is rigorously defined) for which there is
> presently no corresponding C facility.
>
> These are not necessarily simple functions that can readily be
> inlined or made builtins. As we've previously discussed, EBCDIC
> code pages do not all consistently encode '{' and '}'. An ISO-2022
> escape mechanism that allows switching character sets presumably
> would require the implementation to track shift state and have
> access to character set tables in order to recognize all encodings
> of these characters. Though, perhaps such an encoding is
> disallowed by [lex.charset]p6
> <http://eel.is/c++draft/lex.charset#6>? It isn't clear to me how
> to apply that wording to shift-state encodings.
>
>
> Nothing precludes shift state literal encodings, see note in the same
> paragraph.

That note only applies to characters outside the basic literal character
set. It doesn't apply (normatively or otherwise) to the scenario I
presented.

To elaborate, the scenario I had in mind concerns something we recently
discussed: '{' and '}' are mapped to 0xC0 and 0xD0 respectively in
IBM-1047
<https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-1047_P100-1995&s=IBM>,
but mapped to 0x43 and 0xDC in IBM-273
<https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-273_P100-1995&s=IBM>
and 0x51 and 0x54 in IBM-297
<https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-297_P100-1995&s=IBM>.
When used with an ISO-2022 encoding that supports invoking those code
pages via escape sequences, it is possible to encounter multiple
encodings of those characters. However, I haven't been able to determine
if any compiler that targets an EBCDIC environment supports such an
encoding. IBM xlC supports SI/SO (shift-in/shift-out) sequences for
switching between single-byte and double-byte encodings, but I haven't found any
documentation that suggests escape sequence invocation of code pages is
supported. Perhaps Hubert can provide more information.
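
As a rough illustration of why this matters for a compile-time format
string parser, here is a minimal sketch using only the code unit values
cited above. The names code_page, brace_units, braces_for, and
is_open_brace are hypothetical and do not correspond to any existing
implementation or proposal; the sketch also assumes the parser somehow
knows which code page is active.

```cpp
#include <cstdint>

// Hypothetical identifiers for the three EBCDIC code pages mentioned above.
enum class code_page { ibm1047, ibm273, ibm297 };

// Code units that encode '{' and '}' in a given code page.
struct brace_units {
    std::uint8_t open;   // code unit for '{'
    std::uint8_t close;  // code unit for '}'
};

// Values taken from the mappings cited in this message.
constexpr brace_units braces_for(code_page cp) {
    return cp == code_page::ibm1047 ? brace_units{0xC0, 0xD0}
         : cp == code_page::ibm273  ? brace_units{0x43, 0xDC}
         :                            brace_units{0x51, 0x54};
}

// A parser cannot compare against a single fixed code unit; the answer
// depends on which code page is active (an assumption of this sketch).
constexpr bool is_open_brace(code_page active, std::uint8_t unit) {
    return braces_for(active).open == unit;
}

static_assert(is_open_brace(code_page::ibm1047, 0xC0),
              "0xC0 is '{' in IBM-1047");
static_assert(!is_open_brace(code_page::ibm273, 0xC0),
              "0xC0 is not '{' in IBM-273");
```

If escape sequences could switch code pages within a literal, the
"active" argument would have to be tracked as decoder state rather than
fixed per translation unit.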

A similar scenario does apply, though, to ISO-2022 encodings like ISO-2022-CN,
ISO-2022-JP, and ISO-2022-KR. An escape sequence could invoke the
ASCII character set over GR such that '{' is encoded at both 0x7B and
0xFB. I don't know if that should be considered to violate
[lex.charset]p6 <http://eel.is/c++draft/lex.charset#6>.
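
To sketch what recognizing both encodings would involve, the following
is a heavily simplified model. It assumes ASCII stays invoked over GL for
the whole string and reduces the shift state to a single flag; the names
shift_state and encodes_open_brace are hypothetical, and no actual escape
sequence parsing is shown.

```cpp
#include <cstdint>

// Hypothetical, heavily simplified decoder state: records only whether an
// earlier escape sequence invoked the ASCII character set over GR.
struct shift_state {
    bool ascii_in_gr;
};

// Assumes ASCII is invoked over GL throughout, so 0x7B always encodes '{'.
// When ASCII is also invoked over GR, 0xFB (0x7B with the high bit set)
// encodes '{' as well.
constexpr bool encodes_open_brace(shift_state s, std::uint8_t unit) {
    return unit == 0x7B || (s.ascii_in_gr && unit == 0xFB);
}

static_assert(encodes_open_brace(shift_state{false}, 0x7B),
              "GL encoding of '{'");
static_assert(!encodes_open_brace(shift_state{false}, 0xFB),
              "0xFB is not '{' unless ASCII is invoked over GR");
static_assert(encodes_open_brace(shift_state{true}, 0xFB),
              "GR encoding of '{'");
```

The point is only that the comparison depends on state accumulated while
scanning the string, which is what makes a simple inline or builtin
implementation harder.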

Tom.


Received on 2021-12-09 10:39:41