sg16: Re: [SG16] Agenda for the 2021-12-01 SG16 telecon

From: Charlie Barto <Charles.Barto_at_[hidden]>
Date: Thu, 9 Dec 2021 22:16:55 +0000

[lex.charset]p6 says only that each element of the basic source character set needs to be distinct from every other element, and representable in one code unit (prohibiting actual JIS X0208 but allowing CP932). It does not say that each basic source character needs to be one code unit that's distinct from all other possible valid code-unit sequences, just that it's distinct from all other basic source characters. This doesn't prohibit GBK-based encodings, as they _do_ encode all the basic characters to just one byte (all ASCII characters are unchanged in GBK), but the code units of some other (multi-byte) characters may have the same values.
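
To make the GBK point concrete, here is a minimal sketch (not taken from any implementation). It assumes GBK lead bytes are 0x81..0xFE, that every lead byte is followed by exactly one trail byte, that a trail byte may take the value 0x7B, and that the ordinary literal encoding is ASCII-compatible. The helper names are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Assumption: in GBK a lead byte is 0x81..0xFE and is always followed by
// exactly one trail byte (0x40..0x7E or 0x80..0xFE). Hypothetical helper.
constexpr bool is_gbk_lead(unsigned char b) { return b >= 0x81 && b <= 0xFE; }

// Counts every byte equal to target, ignoring character boundaries.
constexpr std::size_t count_naive(std::string_view s, char target) {
    std::size_t n = 0;
    for (char c : s) {
        if (c == target) ++n;
    }
    return n;
}

// Counts target only where it stands alone as a single-byte character.
constexpr std::size_t count_gbk_aware(std::string_view s, char target) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_gbk_lead(static_cast<unsigned char>(s[i]))) {
            ++i;  // skip the trail byte of a two-byte character
            continue;
        }
        if (s[i] == target) ++n;
    }
    return n;
}

// '{', then the two-byte sequence 0x81 0x7B, then '}'.
constexpr std::string_view sample{"{\x81\x7B}", 4};
static_assert(count_naive(sample, '{') == 2);      // trail byte miscounted
static_assert(count_gbk_aware(sample, '{') == 1);  // only the real '{'
```

The basic characters are still single, distinct code units here, which is all p6 asks for; the byte-wise scan goes wrong only because it ignores character boundaries.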

From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Tom Honermann via SG16
Sent: Thursday, December 9, 2021 2:08 PM
To: Corentin Jabot <corentinjabot_at_[hidden]>
Cc: Tom Honermann <tom_at_[hidden]>; SG16 <sg16_at_[hidden]>; Barry Revzin <barry.revzin_at_[hidden]>
Subject: Re: [SG16] Agenda for the 2021-12-01 SG16 telecon

On 12/9/21 12:31 PM, Corentin Jabot wrote:

On Thu, Dec 9, 2021 at 5:39 PM Tom Honermann <tom_at_[hidden]> wrote:
On 12/9/21 12:28 AM, Corentin Jabot wrote:

On Wed, Dec 8, 2021, 23:40 Tom Honermann <tom_at_[hidden]> wrote:
On 12/5/21 2:26 PM, Jens Maurer wrote:

On 05/12/2021 01.04, Tom Honermann wrote:

On 12/4/21 6:05 AM, Jens Maurer wrote:

If we impose a requirement for a code unit -> code point decoder for the

literal encoding at compile-time, we should make such a facility generally

available instead of hiding it in the guts of the std::format parser.

I think JeanHeyd's work on P1629 <https://wg21.link/p1629> will fill this niche. It would be nice if the features he proposes in N2730 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2730.htm> were usable at compile-time as well, but that will likely have to await some kind of constexpr support in C.

Why? We've made functions constexpr that are inherited from C

before.

Sure, we have, and could do so again. In this case, there are behaviors that we would have to specify that should be decided in conjunction with WG14. For example, the N2730 "mc" and "mwc" function variants operate on the locale dependent execution encoding. We would have to specify what that means for compile-time evaluation. The obvious answer is, of course, that it means the ordinary/wide literal encoding. Since that encoding may differ from the run-time execution encoding, this presumably means defining a locale (or at least the LC_CTYPE locale category) for use at compile-time. We would then have to tie the behavior to std::is_constant_evaluated() (so that the separation of compile-time vs run-time is rigorously defined) for which there is presently no corresponding C facility.

These are not necessarily simple functions that can readily be inlined or made builtins. As we've previously discussed, EBCDIC code pages do not all consistently encode '{' and '}'. An ISO-2022 escape mechanism that allows switching character sets presumably would require the implementation to track shift state and have access to character set tables in order to recognize all encodings of these characters. Though, perhaps such an encoding is disallowed by [lex.charset]p6 <http://eel.is/c++draft/lex.charset#6>? It isn't clear to me how to apply that wording to shift-state encodings.

Nothing precludes shift state literal encodings, see note in the same paragraph.

That note only applies to characters outside the basic literal character set. It doesn't apply (normatively or otherwise) to the scenario I presented.

To elaborate, the scenario I had in mind concerns something we recently discussed; that '{' and '}' are mapped to 0xC0 and 0xD0 respectively in IBM-1047 <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-1047_P100-1995&s=IBM>, but mapped to 0x43 and 0xDC in IBM-273 <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-273_P100-1995&s=IBM> and 0x51 and 0x54 in IBM-297 <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-297_P100-1995&s=IBM>. When used with an ISO-2022 encoding that supports invoking those code pages via escape sequences, it is possible to encounter multiple encodings of those characters. However, I haven't been able to determine if any compiler that targets an EBCDIC environment supports such an encoding.
IBM xlC supports SI/SO sequences for switching between single-byte and double-byte encoding, but I haven't found any documentation that suggests escape sequence invocation of code pages is supported. Perhaps Hubert can provide more information.
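
The code page differences described above can be sketched as a small lookup table. The code unit values are the ones quoted in this message; the struct and function names are hypothetical, and a real implementation would need full character set tables rather than three rows.

```cpp
#include <array>
#include <cstdint>
#include <string_view>

// One row per code page: the code units for '{' and '}'.
struct brace_units {
    std::string_view code_page;
    std::uint8_t open;
    std::uint8_t close;
};

// Values as quoted in the message above.
inline constexpr std::array<brace_units, 3> brace_table{{
    {"IBM-1047", 0xC0, 0xD0},
    {"IBM-273",  0x43, 0xDC},
    {"IBM-297",  0x51, 0x54},
}};

// True if unit encodes '{' in the named code page; false for unknown pages.
constexpr bool is_open_brace(std::string_view code_page, std::uint8_t unit) {
    for (const auto& e : brace_table) {
        if (e.code_page == code_page) return e.open == unit;
    }
    return false;
}

static_assert(is_open_brace("IBM-1047", 0xC0));
static_assert(!is_open_brace("IBM-273", 0xC0));  // same unit, other page
static_assert(is_open_brace("IBM-273", 0x43));
```

A parser that compares against a single fixed code unit only works if the active code page is known and never changes within the string.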

A similar scenario applies for ISO-2022 encodings like ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR though. An escape sequence could invoke the ASCII character set over GR such that '{' is encoded at both 0x7B and 0xFB. I don't know if that should be considered to violate [lex.charset]p6 <http://eel.is/c++draft/lex.charset#6>.
That an abstract character can be mapped to a single code unit does not imply that multiple code unit sequences cannot represent said abstract character (even in Unicode, for example U+212B).
In general, sure. But [lex.charset]p6 <http://eel.is/c++draft/lex.charset#6> states, "encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element". If it instead stated, "encodes each element of the basic literal character set as one or more code units with non-negative value, distinct from the code units for any other such element", there would be no question that the same element can have multiple encodings. But I guess the absence of explicit prohibition is implicit allowance in this case.
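
For the Unicode example, a short sketch of what "multiple code unit sequences for one abstract character" looks like in UTF-8. The variable names are hypothetical, and hex escapes are used so the example does not depend on the source file encoding.

```cpp
#include <string_view>

// Three UTF-8 code unit sequences for what is canonically the same character.
constexpr std::string_view angstrom_sign    = "\xE2\x84\xAB";  // U+212B
constexpr std::string_view a_with_ring      = "\xC3\x85";      // U+00C5
constexpr std::string_view a_plus_combining = "\x41\xCC\x8A";  // U+0041 U+030A

// A code-unit-wise comparison treats all three as different strings.
static_assert(angstrom_sign != a_with_ring);
static_assert(a_with_ring != a_plus_combining);
static_assert(angstrom_sign.size() == 3 && a_with_ring.size() == 2);
```

Recognizing these as the same character requires normalization, not a comparison of code units.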

In any case, the existence of stateful encodings prevents a naive approach that only looks at code units (and a comparison with a single char, should you attempt that, may or may not have the intended result).
In the presence of shift state encodings, the guarantees given by p6 seem of limited usefulness (and yet are somewhat necessary for one to use any of the ctype functions).
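
A minimal sketch of why a stateful encoding defeats a code-unit scan, using a simplified model of ISO-2022-JP. It assumes only two designations are tracked (ESC ( B for ASCII and ESC $ B for JIS X 0208) and that the ordinary literal encoding is ASCII-compatible. The function names are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Counts every byte equal to target, ignoring shift state.
constexpr std::size_t count_bytes(std::string_view s, char target) {
    std::size_t n = 0;
    for (char c : s) {
        if (c == target) ++n;
    }
    return n;
}

// Counts target only while the single-byte (ASCII) set is designated.
// Simplified model: ESC ( B selects ASCII, ESC $ B selects JIS X 0208.
constexpr std::size_t count_stateful(std::string_view s, char target) {
    bool double_byte = false;
    std::size_t n = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] == '\x1B' && i + 2 < s.size()) {
            if (s[i + 1] == '(' && s[i + 2] == 'B') {
                double_byte = false;
                i += 3;
                continue;
            }
            if (s[i + 1] == '$' && s[i + 2] == 'B') {
                double_byte = true;
                i += 3;
                continue;
            }
        }
        if (double_byte) {
            i += 2;  // one two-byte character; neither byte is ASCII here
            continue;
        }
        if (s[i] == target) ++n;
        ++i;
    }
    return n;
}

// '{', ESC $ B, the two-byte character 0x30 0x7B, ESC ( B, '}'.
constexpr std::string_view shifted = "{\x1B$B\x30\x7B\x1B(B}";
static_assert(count_bytes(shifted, '{') == 2);     // false positive at 0x7B
static_assert(count_stateful(shifted, '{') == 1);  // only the real '{'
```

The second count is right only because the scanner carries state across code units, which is exactly what a comparison against a single char cannot do.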

Agreed.

Tom.

Received on 2021-12-09 16:16:59