sg16: Re: [SG16] What do we want from source to internal conversion?

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 15 Jun 2020 13:21:49 -0400

On 6/15/20 12:50 PM, Corentin wrote:
>
>
> On Mon, 15 Jun 2020 at 18:41, Tom Honermann <tom_at_[hidden]> wrote:
>
> On 6/15/20 12:30 PM, Corentin wrote:
>>
>>
>> On Mon, 15 Jun 2020 at 17:56, Tom Honermann <tom_at_[hidden]> wrote:
>>
>> On 6/15/20 11:38 AM, Corentin wrote:
>>>
>>>
>>> On Mon, 15 Jun 2020 at 17:17, Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>> On 6/15/20 7:14 AM, Corentin via SG16 wrote:
>>>
>>> Hubert has specifically requested better support for
>>> unmappable characters, so I don't agree with the
>>> parenthetical.
>>>
>>> I don't think that's a fair characterisation. Again, there is
>>> a mapping for all characters in EBCDIC. That mapping is
>>> prescriptive rather than semantic, but both Unicode and IBM
>>> agree on that mapping (the code points they map to have no
>>> associated semantics whatsoever and are meant to be used
>>> that way). The wording trick will be to make sure we don't
>>> prevent that mapping.
>>
>> The claim that Unicode and IBM agree on this mapping seems
>> overreaching to me. Yes, there is a specification for how
>> EBCDIC code pages can be mapped to Unicode code points in a
>> way that preserves round tripping. I don't think that should
>> be read as an endorsement for conflating the semantic
>> meanings of those characters that represent distinct abstract
>> characters before/after such a mapping. I believe there have
>> been requests to be able to differentiate the presence of one
>> of these control characters in the source input from the
>> mapped Unicode code point being written as a UCN.
>>
>>
>> The Unicode characters they map to do not have associated semantics:
>>
>> There are 65 code points set aside in the Unicode Standard for
>> compatibility with the C0 and C1 control codes defined in the
>> ISO/IEC 2022 framework. The ranges of these code points are
>> U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to
>> the 8-bit controls 00₁₆ to 1F₁₆ (C0 controls), 7F₁₆ (delete),
>> and 80₁₆ to 9F₁₆ (C1 controls), respectively. For example, the
>> 8-bit legacy control code character tabulation (or tab) is the
>> byte value 09₁₆; the Unicode Standard encodes the corresponding
>> control code at U+0009. The Unicode Standard provides for the
>> intact interchange of these code points, neither adding to nor
>> subtracting from their semantics. The semantics of the control
>> codes are generally determined by the application with which they
>> are used. However, in the absence of specific application uses,
>> they may be interpreted according to the control function
>> semantics specified in ISO/IEC 6429:1992. In general, the use of
>> control codes constitutes a higher-level protocol and is beyond
>> the scope of the Unicode Standard. For example, the use of
>> ISO/IEC 6429 control sequences for controlling bidirectional
>> formatting would be a legitimate higher-level protocol layered on
>> top of the plain text of the Unicode Standard. Higher-level
>> protocols are not specified by the Unicode Standard; their
>> existence cannot be assumed without a separate agreement between
>> the parties interchanging such data.
>
> Yes, I'm aware, but this point still stands: I believe there have
> been requests to be able to differentiate the presence of one of
> these control characters in the source input vs the mapped Unicode
> code point being written as a UCN.
>
>
> What is the use case?

I'll defer to Hubert. In
https://lists.isocpp.org/sg16/2020/06/1465.php, he stated, "I would like
to allow characters not present in Unicode within character literals,
string literals, comments, and header names. More abstractly, I would
like to allow source -> encoding-used-for-output conversion."

That was almost a whole week ago, so given all of the recent discussion,
opinions may have changed :)
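
For reference, the C0/C1 ranges in the Unicode Standard passage quoted
earlier in this thread can be written as a small predicate. This is only
an illustrative sketch; the names is_c0_or_c1_control and count_controls
are hypothetical and do not come from any proposal or library.

```cpp
#include <cassert>

// Hypothetical helper: true for the code points the Unicode Standard sets
// aside for C0 controls (U+0000..U+001F), DELETE (U+007F), and
// C1 controls (U+0080..U+009F).
constexpr bool is_c0_or_c1_control(char32_t cp) {
    return cp <= 0x1F || cp == 0x7F || (cp >= 0x80 && cp <= 0x9F);
}

// Counts the control code points in the 8-bit range U+0000..U+00FF.
constexpr int count_controls() {
    int n = 0;
    for (char32_t cp = 0; cp <= 0xFF; ++cp) {
        if (is_c0_or_c1_control(cp)) {
            ++n;
        }
    }
    return n;
}

// 32 C0 controls + DELETE + 32 C1 controls = 65, matching the quoted text.
static_assert(count_controls() == 65, "expected 65 control code points");
static_assert(is_c0_or_c1_control(0x09), "tab (U+0009) is a C0 control");
static_assert(!is_c0_or_c1_control(0xA0), "U+00A0 is not a control code");
```

The predicate says nothing about semantics, which is the point of the
quoted passage: the standard reserves the code points and leaves their
meaning to the application.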

From my perspective, the standard should place the fewest restrictions
necessary, particularly where implementation-defined behavior is
necessary. I'm more interested in the raw literal aspect of this scenario.
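
As a rough illustration of why the raw literal aspect matters, the sketch
below shows that the spelling of a UCN is observable in a raw string
literal but not in an ordinary one. It does not reproduce the EBCDIC
scenario, which depends on an implementation-defined source encoding; the
variable names are hypothetical.

```cpp
#include <cassert>

// Ordinary UTF-8 literal: the UCN denotes the single code point U+0085,
// encoded as the two bytes C2 85, plus the terminating null.
constexpr auto& utf8_nel = u8"\u0085";

// Raw UTF-8 literal: no UCN processing takes place, so the literal holds
// the six characters '\', 'u', '0', '0', '8', '5', plus the terminating null.
constexpr auto& raw_text = u8R"(\u0085)";

static_assert(sizeof(utf8_nel) == 3, "two UTF-8 code units plus null");
static_assert(sizeof(raw_text) == 7, "six characters plus null");
static_assert(static_cast<unsigned char>(utf8_nel[0]) == 0xC2, "lead byte");
static_assert(static_cast<unsigned char>(utf8_nel[1]) == 0x85, "trail byte");
static_assert(raw_text[0] == '\\', "backslash is kept verbatim");
```

If a source character were replaced by a UCN before the literal is
formed, the raw form would have to restore the original spelling, so the
two inputs cannot simply be treated as identical.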

>
> Either:
>
> * the escape sequence appears in a narrow or wide EBCDIC-encoded
> string or character literal, in which case the Unicode C1
> character would not be representable and the program would be
> ill-formed (which would be easy for IBM to support if we only
> convert escape sequences when literals are formed, btw)
> * the control character appears in a UTF literal and the program
> would be ill-formed because there is no other representation than
> the one prescribed by UTF-EBCDIC?
>
Why couldn't there be another representation?
>
> I would really like to know what problem is being solved here.
> UTF-EBCDIC was written by Unicode people, and the distinction you
> suggest was supposedly not considered at the time?
I doubt that UTF-EBCDIC was written with C++ translation phase 1 in mind :)
> On an even more practical level, it doesn't seem like something Clang
> would be able to support, except at great cost?

I don't see why supporting this would pose a challenge for Clang. But,
that isn't really worth discussing until there is an actual proposal.

Tom.

Received on 2020-06-15 12:25:00