Date: Thu, 9 Dec 2021 23:51:20 +0100
On 09/12/2021 17.39, Tom Honermann wrote:
> To elaborate, the scenario I had in mind concerns something we recently discussed; that '{' and '}' are mapped to 0xC0 and 0xD0 respectively in IBM-1047 <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-1047_P100-1995&s=IBM>, but mapped to 0x43 and 0xDC in IBM-273 <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-273_P100-1995&s=IBM> and 0x51 and 0x54 in IBM-297 <https://icu4c-demos.unicode.org/icu-bin/convexp?conv=ibm-297_P100-1995&s=IBM>. When used with an ISO-2022 encoding that supports invoking those code pages via escape sequences, it is possible to encounter multiple encodings of those characters. However, I haven't been able to determine if any compiler that targets an EBCDIC environment supports such an encoding. IBM xlC supports SI/SO sequences for switching between single-byte and double-byte encoding, but I haven't found any documentation that suggests escape sequence invocation of code pages is supported. Perhaps Hubert can provide more information.
>
> A similar scenario applies for ISO-2022 encodings like ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR though. An escape sequence could invoke the ASCII character set over GR such that '{' is encoded at both 0x7B and 0xFB. I don't know if that should be considered to violate [lex.charset]p6 <http://eel.is/c++draft/lex.charset#6>.
I don't think the situation where a single character may be encoded using
different code unit sequences (excluding shift-in/shift-out no-ops)
was considered when this text was written.
Jens
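
To make the byte values in the quoted message concrete, here is a minimal sketch that collects the distinct encodings of '{' and '}' across the three EBCDIC code pages mentioned, and computes the GR counterpart of an ASCII byte. The table is copied from the figures quoted above rather than read from any converter, and the names BRACE_BYTES, encodings_of, and gr_byte are hypothetical, chosen only for illustration.

```python
# Byte values for '{' and '}' exactly as quoted in the message above.
# Assumption: this hand-written table stands in for real converter data.
BRACE_BYTES = {
    "IBM-1047": {"{": 0xC0, "}": 0xD0},
    "IBM-273": {"{": 0x43, "}": 0xDC},
    "IBM-297": {"{": 0x51, "}": 0x54},
}


def encodings_of(ch, tables=BRACE_BYTES):
    """Return the sorted distinct byte values that encode ch across the tables."""
    return sorted({table[ch] for table in tables.values() if ch in table})


def gr_byte(gl_byte):
    """Return the byte for a 7-bit code when its set is invoked over GR.

    Invoking a set over GR uses the same code with the high bit set.
    """
    if not 0x20 <= gl_byte <= 0x7F:
        raise ValueError("expected a 7-bit GL code")
    return gl_byte | 0x80


if __name__ == "__main__":
    for ch in "{}":
        values = ", ".join(f"0x{b:02X}" for b in encodings_of(ch))
        print(f"{ch!r}: {values}")
    print(f"'{{' over GR: 0x{gr_byte(ord('{')):02X}")
```

If a single source file could switch among these code pages, each brace would have three distinct single-byte encodings, which is the situation the quoted paragraph asks about.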
Received on 2021-12-09 16:51:29