Re: [SG16] What do we want from source to internal conversion?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 15 Jun 2020 18:30:03 +0200
On Mon, 15 Jun 2020 at 17:56, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/15/20 11:38 AM, Corentin wrote:
> On Mon, 15 Jun 2020 at 17:17, Tom Honermann <tom_at_[hidden]> wrote:
>> On 6/15/20 7:14 AM, Corentin via SG16 wrote:
>> Hubert has specifically requested better support for unmappable
>> characters, so I don't agree with the parenthetical.
> I don't think that's a fair characterisation. Again, there is a mapping for
> all characters in EBCDIC. That mapping is prescriptive rather than
> semantic, but both Unicode and IBM agree on that mapping (the code points
> they map to have no associated semantics whatsoever and are meant to be
> used that way). The wording trick will be to make sure we don't prevent
> that mapping.
> The claim that Unicode and IBM agree on this mapping seems overreaching to
> me. Yes, there is a specification for how EBCDIC code pages can be mapped
> to Unicode code points in a way that preserves round tripping. I don't
> think that should be read as an endorsement for conflating the semantic
> meanings of those characters that represent distinct abstract characters
> before/after such a mapping. I believe there have been requests to be able
> to differentiate the presence of one of these control characters in the
> source input and the mapped Unicode code point being written as a UCN.

The Unicode characters they map to do not have associated semantics:

There are 65 code points set aside in the Unicode Standard for
compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022
framework. The ranges of these code points are U+0000..U+001F, U+007F, and
U+0080..U+009F, which correspond to the 8-bit controls 00₁₆ to 1F₁₆ (C0
controls), 7F₁₆ (delete), and 80₁₆ to 9F₁₆ (C1 controls), respectively. For
example, the 8-bit legacy control code character tabulation (or tab) is the
byte value 09₁₆; the Unicode Standard encodes the corresponding control
code at U+0009. The Unicode Standard provides for the intact interchange of
these code points, neither adding to nor subtracting from their semantics.
The semantics of the control codes are generally determined by the
application with which they are used. However, in the absence of specific
application uses, they may be interpreted according to the control function
semantics specified in ISO/IEC 6429:1992. In general, the use of control
codes constitutes a higher-level protocol and is beyond the scope of the
Unicode Standard. For example, the use of ISO/IEC 6429 control sequences
for controlling bidirectional formatting would be a legitimate higher-level
protocol layered on top of the plain text of the Unicode Standard.
Higher-level protocols are not specified by the Unicode Standard; their
existence cannot be assumed without a separate agreement between the
parties interchanging such data.
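
To make the ranges in the quoted passage concrete, here is a minimal sketch of a classifier for those 65 code points. The function names are mine, purely for illustration; they come from neither the Unicode Standard nor any library, and the loop in a constexpr function assumes C++14 or later.

```cpp
// Sketch only: helper names are hypothetical, chosen for this example.

// C0 controls: U+0000..U+001F
constexpr bool is_c0_control(char32_t cp) { return cp <= 0x1F; }

// Delete: U+007F
constexpr bool is_delete(char32_t cp) { return cp == 0x7F; }

// C1 controls: U+0080..U+009F
constexpr bool is_c1_control(char32_t cp) { return cp >= 0x80 && cp <= 0x9F; }

// True for any of the code points set aside for C0/C1 compatibility.
constexpr bool is_control_code_point(char32_t cp) {
    return is_c0_control(cp) || is_delete(cp) || is_c1_control(cp);
}

// Counts the control code points in U+0000..U+00FF.
constexpr int count_control_code_points() {
    int n = 0;
    for (char32_t cp = 0; cp <= 0xFF; ++cp) {
        if (is_control_code_point(cp)) ++n;
    }
    return n;
}

// 32 (C0) + 1 (delete) + 32 (C1) = 65, matching the quoted text.
static_assert(count_control_code_points() == 65,
              "expected 65 control code points");
```

Note that this only says which code points are in the set; it deliberately assigns them no meaning, which is the point of the passage above.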

> Tom.

Received on 2020-06-15 11:33:24