Re: [SG16] What do we want from source to internal conversion?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 15 Jun 2020 19:40:49 +0200
On Mon, Jun 15, 2020, 19:21 Tom Honermann <tom_at_[hidden]> wrote:

> On 6/15/20 12:50 PM, Corentin wrote:
>
>
>
> On Mon, 15 Jun 2020 at 18:41, Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/15/20 12:30 PM, Corentin wrote:
>>
>>
>>
>> On Mon, 15 Jun 2020 at 17:56, Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 6/15/20 11:38 AM, Corentin wrote:
>>>
>>>
>>>
>>> On Mon, 15 Jun 2020 at 17:17, Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>>> On 6/15/20 7:14 AM, Corentin via SG16 wrote:
>>>>
>>>> Hubert has specifically requested better support for unmappable
>>>> characters, so I don't agree with the parenthetical.
>>>>
>>> I don't think that's a fair characterisation. Again, there is a mapping
>>> for all characters in EBCDIC. That mapping is prescriptive rather than
>>> semantic, but both Unicode and IBM agree on it (the code points these
>>> characters map to have no associated semantics whatsoever and are meant
>>> to be used that way). The wording trick will be to make sure we don't
>>> prevent that mapping.
>>>
>>> The claim that Unicode and IBM agree on this mapping seems overreaching
>>> to me. Yes, there is a specification for how EBCDIC code pages can be
>>> mapped to Unicode code points in a way that preserves round tripping. I
>>> don't think that should be read as an endorsement for conflating the
>>> semantic meanings of those characters that represent distinct abstract
>>> characters before/after such a mapping. I believe there have been
>>> requests to be able to differentiate between the presence of one of
>>> these control characters in the source input and the mapped Unicode
>>> code point being written as a UCN.
>>>
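For concreteness, a minimal sketch of the kind of prescriptive,
round-trippable mapping under discussion. This is illustrative only: the
entries assume the IBM-1047 control assignments, and the exact table
varies by code page.

    #include <cstdio>
    #include <map>

    int main() {
        // EBCDIC byte -> Unicode code point (a few C0/C1 controls,
        // per the IBM-1047 table; not authoritative).
        const std::map<unsigned char, char32_t> to_unicode = {
            {0x05, U'\u0009'},  // HT
            {0x0D, U'\u000D'},  // CR
            {0x15, U'\u0085'},  // NL -> NEL, a C1 control
            {0x25, U'\u000A'},  // LF
        };
        // The mapping is a bijection on these bytes, so EBCDIC text
        // survives a round trip through Unicode unchanged.
        for (auto [ebcdic, cp] : to_unicode)
            std::printf("EBCDIC 0x%02X <-> U+%04X\n",
                        static_cast<unsigned>(ebcdic),
                        static_cast<unsigned>(cp));
    }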
>>
>> The Unicode characters they map to do not have associated semantics:
>>
>> There are 65 code points set aside in the Unicode Standard for
>> compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022
>> framework. The ranges of these code points are U+0000..U+001F, U+007F, and
>> U+0080..U+009F, which correspond to the 8-bit controls 00₁₆ to 1F₁₆ (C0
>> controls), 7F₁₆ (delete), and 80₁₆ to 9F₁₆ (C1 controls), respectively. For
>> example, the 8-bit legacy control code character tabulation (or tab) is the
>> byte value 09₁₆; the Unicode Standard encodes the corresponding control
>> code at U+0009. The Unicode Standard provides for the intact interchange of
>> these code points, neither adding to nor subtracting from their semantics.
>> The semantics of the control codes are generally determined by the
>> application with which they are used. However, in the absence of specific
>> application uses, they may be interpreted according to the control function
>> semantics specified in ISO/IEC 6429:1992. In general, the use of control
>> codes constitutes a higher-level protocol and is beyond the scope of the
>> Unicode Standard. For example, the use of ISO/IEC 6429 control sequences
>> for controlling bidirectional formatting would be a legitimate higher-level
>> protocol layered on top of the plain text of the Unicode Standard.
>> Higher-level protocols are not specified by the Unicode Standard; their
>> existence cannot be assumed without a separate agreement between the
>> parties interchanging such data.
>>
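In other words, Unicode just carries these code points through. A small
C++20 sketch of what that looks like at the byte level (the byte values
are what UTF-8 mechanically produces; the dump helper is hypothetical,
written for this sketch only):

    #include <cstdio>

    // Print the bytes of a UTF-8 literal so we can see what the
    // encoding actually stores.
    void dump(const char8_t* s, const char* label) {
        std::printf("%s:", label);
        for (; *s; ++s) std::printf(" %02X", static_cast<unsigned>(*s));
        std::printf("\n");
    }

    int main() {
        dump(u8"\u0009", "U+0009 (tab)");  // prints: 09
        dump(u8"\u0085", "U+0085 (NEL)");  // prints: C2 85
    }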
>> Yes, I'm aware, but the point still stands: I believe there have been
>> requests to be able to differentiate between the presence of one of these
>> control characters in the source input and the mapped Unicode code point
>> being written as a UCN.
>>
>
> What is the use case?
>
> I'll defer to Hubert. In https://lists.isocpp.org/sg16/2020/06/1465.php,
> he stated, "I would like to allow characters not present in Unicode within
> character literals, string literals, comments, and header names. More
> abstractly, I would like to allow source -> encoding-used-for-output
> conversion."
>
> That was almost a whole week ago, so given all of the recent discussion,
> opinions may have changed :)
>
> From my perspective, the standard should place the fewest restrictions
> necessary, particularly where implementation-defined behavior is
> involved. I'm more interested in the raw literal aspect of this scenario.
>
>
> Either:
>
>    - the escape sequence appears in a narrow or wide EBCDIC-encoded
>    string or character literal, in which case the Unicode C1 character
>    would not be representable and the program would be ill-formed
>    (which, by the way, would be easy for IBM to support if we only
>    convert escape sequences when literals are formed), or
>    - the control character appears in a UTF literal, and the program
>    would be ill-formed because there is no representation other than
>    the one prescribed by UTF-EBCDIC? (See the sketch after this list.)
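For illustration, here is roughly what those two cases could look like on
a hypothetical implementation whose narrow literal encoding is IBM-1047.
The EBCDIC byte value assumes the IBM-1047 round-trip mapping; whether
that case is accepted at all is exactly the open question.

    const char    a[] = "\u0085";   // EBCDIC narrow literal: either the
                                    // round-trip byte 0x15 (NL), or
                                    // ill-formed if we reject it.
    const char8_t b[] = u8"\u0085"; // UTF-8 literal: always C2 85.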
>
> Why couldn't there be another representation?
>

No existing encoding can represent both?

>
> I would really like to know what problem is being solved here.
> UTF-EBCDIC was written by Unicode people, and the distinction you
> suggest was presumably not considered at the time?
>
> I doubt that UTF-EBCDIC was written with C++ translation phase 1 in mind :)
>

It was written with the intent to convert any encoding to and from EBCDIC
without loss of information; phase 1 is not special.
(Also, I should say that we do not care about UTF-EBCDIC beyond its
description of the mapping from control characters to C1 code points; I
wonder whether that mapping is documented elsewhere.)

> On an even more practical level, it doesn't seem like something Clang
> would be able to support, except at great cost?
>
> I don't see why supporting this would pose a challenge for Clang. But
> that isn't really worth discussing until there is an actual proposal.
>
Because there is (using existing terminology) no other possible mapping to
universal-character-names, if we accept that the PUA is for users. (I
guess they could map to a different C0 or C1 character; nothing
problematic with that.)

So supporting this use case would force the entire lexing to be done in the
source character set, which is a step backward, would have a complicated
wording impact, and would go against existing practice.
And in the end, Clang would probably still use Unicode internally, so the
benefit to IBM users would still be zero?
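To make that last point concrete, a rough sketch of why the distinction
is lost early. This is my own illustration, not Clang's actual
implementation, and ConvertByte is hypothetical: a compiler that works in
Unicode internally converts source bytes to code points up front, so by
the time literals are formed, a source 0x15 byte and a \u0085 UCN are
indistinguishable.

    #include <cstdint>
    #include <vector>

    // Hypothetical per-code-page lookup (IBM-1047 -> Unicode); only a
    // couple of entries shown, purely for illustration.
    char32_t ConvertByte(std::uint8_t b) {
        switch (b) {
            case 0x15: return U'\u0085';  // NL -> NEL
            case 0x25: return U'\u000A';  // LF
            default:   return U'\uFFFD';  // stand-in for "the rest"
        }
    }

    // Phase-1-style conversion: every byte becomes a Unicode code
    // point; the identity of the original source byte ends here, so a
    // later phase cannot tell a source 0x15 from a \u0085 UCN.
    std::vector<char32_t> Phase1(const std::vector<std::uint8_t>& source) {
        std::vector<char32_t> out;
        out.reserve(source.size());
        for (std::uint8_t b : source)
            out.push_back(ConvertByte(b));
        return out;
    }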

> Tom.
>

Received on 2020-06-15 12:44:09