Subject: Re: Agreeing with Corentin's point re: problem with strict use of abstract characters
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-14 13:57:14
On Sun, 14 Jun 2020 at 20:44, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> Suppose I'm on an EBCDIC-only system.
> If I understand correctly, EBCDIC has a host of control characters
> that can't be represented in Unicode (while preserving semantics);
> the mapping that exists re-uses some Unicode control characters,
> but with different semantics.
Yes, the C1 control characters.
> Suppose one of those EBCDIC control characters is mapped to \u1234.
> What if \u1234 also appears as such in my source code, obviously
> intending to mean the Unicode semantics?
> I think I'd like to have at least the option of getting a
> syntax error (I asked for a Unicode control character that
> doesn't exist as such on EBCDIC), but it seems the mapping
> will give me the EBCDIC control character with different
> semantics. (All of this in a string literal, of course,
> so it hurts when performing output.)
The idea is that the C1 control characters on EBCDIC platforms are
always considered to be the EBCDIC characters they map to.
Unicode defines them as follows (Unicode 13, section 23.1, Control Codes):
There are 65 code points set aside in the Unicode Standard for
compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022
framework. The ranges of these code points are U+0000..U+001F, U+007F, and
U+0080..U+009F, which correspond to the 8-bit controls 00₁₆ to 1F₁₆ (C0
controls), 7F₁₆ (delete), and 80₁₆ to 9F₁₆ (C1 controls), respectively. For
example, the 8-bit legacy control code character tabulation (or tab) is the
byte value 09₁₆; the Unicode Standard encodes the corresponding control
code at U+0009. The Unicode Standard provides for the intact interchange of
these code points, neither adding to nor subtracting from their semantics.
The semantics of the control codes are generally determined by the
application with which they are used. However, in the absence of specific
application uses, they may be interpreted according to the control function
semantics specified in ISO/IEC 6429:1992. In general, the use of control
codes constitutes a higher-level protocol and is beyond the scope of the
Unicode Standard. For example, the use of ISO/IEC 6429 control sequences
for controlling bidirectional formatting would be a legitimate higher-level
protocol layered on top of the plain text of the Unicode Standard.
Higher-level protocols are not specified by the Unicode Standard; their
existence cannot be assumed without a separate agreement between the
parties interchanging such data.
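
To make the mapping discussed above concrete, here is a minimal sketch that
lists which EBCDIC bytes are paired with C1 code points. It assumes Python's
built-in cp037 codec as a stand-in for "an EBCDIC platform"; the thread does
not name a specific code page, and the helper name is hypothetical.

```python
# Sketch only: cp037 (EBCDIC US/Canada) is an assumed example code page.
EBCDIC = "cp037"

def ebcdic_bytes_mapped_to_c1(codec=EBCDIC):
    """Return {ebcdic_byte: code_point} for bytes that decode to U+0080..U+009F."""
    mapping = {}
    for byte in range(256):
        ch = bytes([byte]).decode(codec)
        if 0x80 <= ord(ch) <= 0x9F:
            mapping[byte] = ord(ch)
    return mapping

c1_map = ebcdic_bytes_mapped_to_c1()
for byte, cp in sorted(c1_map.items()):
    print(f"EBCDIC 0x{byte:02X} <-> U+{cp:04X}")

# A \u0085 written in a string literal would come out as the EBCDIC
# control at 0x15 under this code page, whatever semantics the author
# had in mind for U+0085.
print("\u0085".encode(EBCDIC))
```

Under this mapping the round trip is lossless at the code point level, which
is the property the "always considered to be the EBCDIC character" rule
relies on; whether the semantics match is exactly the question raised above.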
> Hubert, is my understanding above correct?
> On 14/06/2020 20.03, Hubert Tong wrote:
> > On Sun, Jun 14, 2020 at 5:03 AM Corentin Jabot <corentinjabot_at_[hidden]
> <mailto:corentinjabot_at_[hidden]>> wrote:
> > On Sun, 14 Jun 2020 at 08:59, Jens Maurer via SG16 <
> sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
> > On 11/06/2020 00.06, Hubert Tong wrote:
> > > On Wed, Jun 10, 2020 at 5:39 PM Jens Maurer <
> Jens.Maurer_at_[hidden] <mailto:Jens.Maurer_at_[hidden]> <mailto:
> Jens.Maurer_at_[hidden] <mailto:Jens.Maurer_at_[hidden]>>> wrote:
> > >
> > > On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
> > > > I agree with Corentin's point that the strict use of
> abstract characters introduces problems where a coded character set
> contains multiple values for a single abstract character/contains
> characters that are canonically the same but assigned different values.
> > To be clear, the statement I made above is an indication that I now
> believe that the notion of "coded character set" is necessary for making
> progress for our purposes.
> > >
> > > I have a hard time imagining such a thing. Can you give
> an example?
> > >
> > > Yes, U+FA9A as described in
> https://en.wikipedia.org/wiki/Han_unification has this situation with
> > > These characters are distinct as members of a coded character
> set, but as abstract characters, I do not believe we can easily say the
> > I would expect these to be two different abstract characters in
> the C++ sense.
> > Roughly, anything you can distinguish in the source should be an
> > "abstract character", if only for the benefit of raw string
> > "abstract character" is a notion for humans
> > +1
> > and to talk about mapping from one set to another.
> > After phase 1, C++ deals with code points such that two sequences of
> code points are identical if and only if they have the same values
> > This seems to be a source of getting hung up on terminology. I think
> this could help: The above sentence can be read as a tautology. A "code
> point" (within a coded character set) is synonymous with the value
> component of a coded character within that coded character set.
> Unfortunately, "value in the UCS codespace" is chosen as the "definition"
> for "code point" in ISO/IEC 10646.
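
As an illustration of the value-based comparison discussed above, the sketch
below contrasts "same code point values" with Unicode canonical equivalence.
It uses U+FA9A from the thread; pairing it with U+6F22 is an assumption based
on the Unicode Character Database, since the quoted message is cut off before
naming the second character. The function names are hypothetical.

```python
import unicodedata

COMPAT = "\uFA9A"   # CJK COMPATIBILITY IDEOGRAPH-FA9A (mentioned in the thread)
UNIFIED = "\u6F22"  # assumed counterpart: its canonical decomposition target

def same_code_points(a: str, b: str) -> bool:
    """Identity as 'same sequence of code point values'."""
    return [ord(c) for c in a] == [ord(c) for c in b]

def canonically_equivalent(a: str, b: str) -> bool:
    """Equivalence after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(same_code_points(COMPAT, UNIFIED))        # distinct coded characters
print(canonically_equivalent(COMPAT, UNIFIED))  # canonically the same
```

The first comparison is the only one a purely code-point-based rule would
apply after phase 1; the second is the kind of "same abstract character"
judgment the thread argues C++ should not have to make.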
> > I don't think we should entertain any notion of "same character"
> in C++,
> > beyond value comparisons in the execution encoding and
> "identity" as
> > needed for "same identifier".
> > We need to in/before phase 1, but I think we reached the consensus
> that we otherwise
> > shouldn't and wouldn't
> > To be clear, we need to make sure we are on the same page with respect
> to the meta (notion of) notion of "same character":
> > By "character", do we mean an "abstract character" or a "coded
> > For example, if some hypothetical input format differentiates
> red and
> > green letters that are otherwise "the same", I'd still expect a
> red A
> > to be a different abstract character than a green A. (Ok, that
> > work for the basic source character set, but should work for
> > beyond that.)
> > It doesn't work as there isn't any culture on earth that makes that
> distinction such that there exists no universal-character-name to make that
> > It is best left to people of letters to decide whether colors carry
> meaning (and they sometimes do
> > If that means the term "character" or "abstract character" is
> too loaded
> > to be used here, so be it. (The terminology space is already
> > crowded due to Unicode, so it's hard to find unused phrases that
> give the
> > right connotation.)
> > The terminology used by Unicode people isn't Unicode specific. In
> particular, "abstract character" is meaningful independently of
> > any computer system.
> > I think that the relationships between terms represent an ideal that is
> not met in practice. "Abstract character" is a meaningful notion; however,
> the ideal that coded character sets are a bijective function between values
> in a codespace and abstract characters has not been clearly attained.