SG16

Subject: Re: Terminology
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-15 04:02:59


My understanding of abstract character was incorrect; here is a
clarification from the Unicode mailing list:

*Now the abstract character A-diaeresis (Ä) is encoded by a single code
point and also has a canonically equivalent representation by a combining
sequence. In effect, the whole sequence "encodes" a single abstract
character, but that is formally not how Unicode defines it. A diaeresis is a
recognizable item of the writing system; if used as an umlaut, it tends to
act as a decoration of character that is more-or-less seen as a new entity
(particularly in Swedish) and less a modified letter A. If used as a
diaeresis, it acts more like a punctuation mark that has a function of its
own (forcing separate pronunciation). Even though it's graphically applied
to a vowel, it can be understood as its own abstract character.*

*Treating the diaeresis as its own independent abstract character makes
logical and not just formal sense. That may not be the case equally for all
types of diacritical marks. However, since they can all be named, and thus
arguably exist as their own concepts at least on a descriptive level, it
becomes effectively a non-problem.*

*The way combining marks are treated in other scripts, they can all be on
different points of the scale as logically independent entities, and some
are even on different points of the scale in terms of graphically combining
(they may be graphically indistinguishable from regular spacing letters).*

*To recap, an "abstract" character is a conceptual character, something
that forms the atom of a writing system (smallest divisible particle) as
viewed from the process of encoding, which associates with it a single code
point. "Abstract" characters may exist that are not encoded; and some of
them can be analyzed as series of smaller abstract characters, and thus be
represented as code point sequences. Some abstract characters are more like
small molecules; they can be encoded as such, or they can also have a more
atomic sequence that represents them. The rationale for allowing this dual
nature is historical compatibility, not logical necessity, hence the model
is in some ways not "pure" (just practical).*
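As an illustration of the dual representation described above (not part of the quoted reply), here is a minimal C++17 sketch. The names `precomposed`, `decomposed`, and `same_code_points` are made up for this example. It only shows that the two spellings of Ä are different code point sequences; it does not implement canonical equivalence, which would need a normalization library.

```cpp
#include <string_view>

// U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS: one code point.
constexpr std::u32string_view precomposed = U"\u00C4";

// U+0041 LATIN CAPITAL LETTER A + U+0308 COMBINING DIAERESIS: two code points.
constexpr std::u32string_view decomposed = U"A\u0308";

// Plain code-point-wise comparison; no Unicode normalization is applied.
constexpr bool same_code_points(std::u32string_view a, std::u32string_view b) {
    return a == b;
}

static_assert(precomposed.size() == 1);
static_assert(decomposed.size() == 2);
// Canonically equivalent according to Unicode, yet not equal code point by code point.
static_assert(!same_code_points(precomposed, decomposed));
```

Both strings stand for the same abstract character, so any comparison that should treat them as equal has to normalize first.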

On Thu, 11 Jun 2020 at 15:22, Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

> As I said on the telecon, here is my understanding of how it all works.
>
> I make no guarantee that my understanding is accurate, but I'm doing my
> best!
>
>
> I should preface by saying that
>
> - It's all very complicated: our understanding of text has changed a lot
> over the course of a century while much of the terminology stayed the
> same, so the same term can have different meanings depending on who uses
> it, in which context, and in which time period. And text is complicated.
> - Similarly, multiple terms are used to describe the same thing.
> - The Unicode terminology, which is the most complete, doesn't just
> refer to Unicode, but also to how the Unicode people think about text. And
> given they have a more refined mental model, I'll refer to Unicode
> terminology most of the time.
> - To the best I can tell, the Unicode terminology was introduced around
> Unicode 3.0; I'll explain why.
>
>
> Very short history
>
>
> Up until the early 90s, the model was fairly simple:
>
> Some bit pattern (which fits in some variable or fixed number of bits) =>
> some abstract character.
>
> I use "abstract character" in the modern sense here. What a bit pattern
> mapped to varied between systems: sometimes the mapping was directly to a
> glyph, sometimes the mapping was done manually (early telegraph).
>
>
> In this model, the character encoding designates a character set, where
> the set of characters corresponds to the characters that can be encoded.
>
> It's all very tautological. In this model, the value of a coded character
> is its bit pattern. Values are necessary to define an order.
>
>
> The issue with that model is that a given set of characters can only be
> encoded one way, and changing the encoding duplicates the character set.
>
> It is also a fairly inflexible model: adding characters to unused bit
> patterns often requires duplicating the encoding.
>
>
> This became a problem in the early 90s as
>
> - Some people thought that 2 bytes were way too wasteful, which led to
> Ken Thompson drafting UTF-8 on a napkin
> - Some people thought that 2 bytes were not enough at all to represent
> characters, which led to the Unicode code space being extended to 21 bits,
> the surrogate mechanism, UTF-16 and UTF-32
>
> At that point, there are multiple encodings describing the same character
> set... and just like that, the notion diverged.
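To make "multiple encodings describing the same character set" concrete, here is a minimal C++ sketch that turns one Unicode code point into UTF-8 and UTF-16 code units. The function names `to_utf8` and `to_utf16` are hypothetical, and the sketch assumes its input is a valid scalar value (at most 0x10FFFF and not a surrogate); it performs no validation.

```cpp
#include <cstdint>
#include <vector>

// Encode one code point as UTF-8 code units.
// Assumption: cp is a valid Unicode scalar value.
std::vector<std::uint8_t> to_utf8(char32_t cp) {
    if (cp < 0x80)
        return {static_cast<std::uint8_t>(cp)};
    if (cp < 0x800)
        return {static_cast<std::uint8_t>(0xC0 | (cp >> 6)),
                static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
    if (cp < 0x10000)
        return {static_cast<std::uint8_t>(0xE0 | (cp >> 12)),
                static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
                static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
    return {static_cast<std::uint8_t>(0xF0 | (cp >> 18)),
            static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)),
            static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),
            static_cast<std::uint8_t>(0x80 | (cp & 0x3F))};
}

// Encode one code point as UTF-16 code units, using a surrogate pair
// for code points above U+FFFF. Same assumption as above.
std::vector<char16_t> to_utf16(char32_t cp) {
    if (cp < 0x10000)
        return {static_cast<char16_t>(cp)};
    const char32_t v = cp - 0x10000;
    return {static_cast<char16_t>(0xD800 | (v >> 10)),
            static_cast<char16_t>(0xDC00 | (v & 0x3FF))};
}
```

For U+00C4 this gives the UTF-8 code units C3 84 and the single UTF-16 code unit 00C4; for U+1F600 it gives F0 9F 98 80 and the surrogate pair D83D DE00. The code point is the same in each case, only the encoding differs.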
>
>
> AFAICT, Unicode / Universal Coded Character Set (different specifications,
> same character set) and GB18030 are the two character sets that have
> multiple encodings and for which the distinction between encoding and
> coded character set matters.
>
>
> For any other encoding, the terms encoding, character set, and coded
> character set are interchangeable.
>
>
> For any encoding there exists a character set. There is some subtlety
> here, as GB18030 and Unicode are tantalizingly close to being isomorphic
> but not quite: UTF-8, for example, can encode either GB18030 or Unicode.
> But ignoring that difference, 1 encoding => 1 character set.
>
>
> Inversely, a character set can be represented by 1 or more encodings.
>
>
> Definitions:
>
>
> Abstract Character and Character Repertoire
>
>
> An Abstract Character is what people would colloquially refer to as a
> character outside of the context of computers.
>
> They do carry some semantics, but they do not have a value or any
> representation.
>
>
> The notion of Abstract Character is useful to compare character sets
> between one another.
>
>
> A *character Repertoire* is a set of abstract characters.
>
>
> A coded character set, which I have colloquially referred to as a
> character set (although there is a slight difference between the two), is
> a set of abstract characters each assigned a value, referred to as a
> codepoint. BUT:
>
>
> * The same abstract character can be assigned multiple values - this is
> usually done for compatibility reasons.
>
> * Multiple different abstract characters can be assigned the same value -
> this is notably the case with Han unification. The character set or an
> encoding thereof then isn't sufficient to convey the exact semantic
> meaning of a piece of text or to convert that codepoint to a glyph - more
> context, such as knowing the script or language used, is necessary.
>
> * A single abstract character can be represented by a sequence of multiple
> codepoints. This is the case for some Latin letters with diacritics and
> for some emojis.
>
> * A single abstract character can be represented by different sequences of
> codepoints of different lengths.
>
>
> A *coded character set* is the result of such mapping.
>
>
> For example, a repertoire may contain the letter "Ê", which might be
> represented in a character set by a codepoint for E, one for the circumflex
> accent, and maybe one for the combined form, perhaps for compatibility
> purposes.
>
> Then a *Coded* Character Set may decide to assign the number 1 to E, the
> number 47 to Ê, and the number 622 to the circumflex accent.
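The toy assignment above can be written down as a small table. This is only a sketch: the type alias `code_point`, the table `toy_ccs`, the informal character labels, and the function `spellings_of_e_circumflex` are all invented for illustration, and the numbers 1, 47 and 622 are the ones from the example, not real Unicode values.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using code_point = std::uint32_t;

// Toy coded character set: abstract characters (informal labels) mapped to values.
const std::map<std::string, code_point> toy_ccs = {
    {"E", 1},
    {"E WITH CIRCUMFLEX", 47},
    {"CIRCUMFLEX ACCENT", 622},
};

// The abstract character "Ê" can be spelled in two ways in this toy set:
// as the combined form, or as E followed by the circumflex accent.
std::vector<std::vector<code_point>> spellings_of_e_circumflex() {
    return {
        {toy_ccs.at("E WITH CIRCUMFLEX")},
        {toy_ccs.at("E"), toy_ccs.at("CIRCUMFLEX ACCENT")},
    };
}
```

The repertoire is the set of keys; the coded character set is the whole mapping. Nothing here says how the values are serialized, which is the job of an encoding.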
>
>
> In practice, Character Sets are always *Coded* Character Sets, and both
> terms are colloquially interchangeable, as the goal is to design something
> representable by computers.
>
> Notice that while the definitions of *Coded* Character Set and Character
> Set are distinct, there exists no term to describe the individual elements
> of a character set which is not a *Coded* Character Set.
>
> As such, there exists no character set which is not a coded character set,
> and while it might be useful to define character set properly somewhere,
> I am not sure the distinction is ever necessary for our purpose. Even
> when we don't care about what the values are, the values exist. Values
> are necessary to define an order and bytewise equality, and to talk
> about Unicode properties.
>
>
> A character encoding, in Unicode parlance, is a mapping from a coded
> character set to some serialized form.
>
> With the exception of Unicode and GB18030, a text encoding is also a
> mapping to a character repertoire, as the character set and the character
> repertoire are isomorphic.
>
>
> Character Encoding, Character Encoding Form, And Character Encoding Scheme
>
>
> These are Unicode-specific terms, which I do not think we care about much.
> They exist because Unicode defines encodings with different endianness:
>
> They first map a codepoint to a sequence of *code units* (where code
> units are 8, 16, or 32 bits), then convert these to a sequence of 8-bit
> bytes, applying byte swapping to obtain the desired endian order.
>
> I do not think these distinctions matter in the standard at all - and I
> recommend using the term *character encoding* (which applies to all
> character encodings, whereas CEF/CES are Unicode-specific), BUT we may want
> to specify the endianness of UTF-16 and UTF-32 to be implementation-defined.
>
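The second step of that two-step mapping (code units to bytes in a chosen byte order) can be sketched as follows. The enum `byte_order` and the function `serialize_utf16` are hypothetical names for this illustration; the sketch assumes the input is already a sequence of UTF-16 code units and emits no byte order mark.

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

enum class byte_order { big, little };

// Serialize UTF-16 code units (the encoding form) into bytes
// (the encoding scheme) in the requested byte order.
std::vector<std::uint8_t> serialize_utf16(std::u16string_view units, byte_order order) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t u : units) {
        const auto hi = static_cast<std::uint8_t>(u >> 8);
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == byte_order::big) {
            bytes.push_back(hi);
            bytes.push_back(lo);
        } else {
            bytes.push_back(lo);
            bytes.push_back(hi);
        }
    }
    return bytes;
}
```

For the single code unit 00C4, big-endian order gives the bytes 00 C4 and little-endian order gives C4 00. The code unit sequence is identical in both cases; only the byte serialization differs.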
>
> A code unit is the minimal unit that can represent a character in a
> multi-byte encoding (7 bits for ASCII, 8 for UTF-8, 16 for UTF-16, etc.).
>
> These map to C++ character types (char, char8_t, char16_t, etc.).
>
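A small C++17 sketch of how code units show up in the language: the same code point occupies a different number of code units depending on the literal prefix. The helper `code_unit_count` is a made-up name for this example.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// U+00C4: two UTF-8 code units, one UTF-16 code unit, one UTF-32 code unit.
static_assert(code_unit_count(u8"\u00C4") == 2);
static_assert(code_unit_count(u"\u00C4") == 1);
static_assert(code_unit_count(U"\u00C4") == 1);

// U+1F600: four UTF-8 code units, two UTF-16 code units (a surrogate pair),
// one UTF-32 code unit.
static_assert(code_unit_count(u8"\U0001F600") == 4);
static_assert(code_unit_count(u"\U0001F600") == 2);
static_assert(code_unit_count(U"\U0001F600") == 1);
```

Ordinary and wide literals are left out on purpose, since their code unit counts depend on the implementation's execution encodings.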
>
> *Code units* and *Code points* are Unicode terms, which can be used to
> describe any encoding, including non-Unicode encodings.
>
> Not all code unit sequences represent codepoints, and not all codepoints
> represent abstract characters.
>
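One way to see that not every code unit sequence represents code points is UTF-16, where a surrogate code unit is only meaningful as half of a pair. The following C++17 sketch checks that property; the names `is_well_formed_utf16` and `lone_high_surrogate` are assumptions for illustration, not anything from the standard.

```cpp
#include <cstddef>
#include <string_view>

constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// True if every surrogate code unit is part of a high/low surrogate pair.
constexpr bool is_well_formed_utf16(std::u16string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 == s.size() || !is_low_surrogate(s[i + 1])) return false;
            ++i;  // skip the low surrogate that completes the pair
        } else if (is_low_surrogate(s[i])) {
            return false;  // low surrogate without a preceding high surrogate
        }
    }
    return true;
}

// A code unit sequence that does not decode to any code point.
constexpr char16_t lone_high_surrogate[] = {0xD83D};

static_assert(is_well_formed_utf16(u"\U0001F600"));
static_assert(!is_well_formed_utf16(std::u16string_view(lone_high_surrogate, 1)));
```

The second half of the statement (code points that represent no abstract character, such as unassigned code points) cannot be checked this way; it requires the Unicode character database.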
>
>
> These are the main terms, let me know if I should clarify further.
>
>
>
> In the context of C++
>
> - *Abstract character* is useful when talking about conversion between
> character sets. This is notably the case in phase 1, where "physical
> source file characters" and "the set of physical source file characters"
> do, I believe, refer to abstract characters and a character repertoire
> respectively. This might change if we want to say something specific about
> UTF-8 and normalization forms. But talking about an "Abstract Character
> Sequence" here lets us not care at all about memory representation. A jpg
> of text is still an abstract character sequence.
> - (Phase 1 assumes each "physical source file character" maps to
> exactly one member of the "basic source character set" or one UCN, which
> is not a correct assumption.)
> - The rest of the lexing is clearly done on a character set, as there
> is no ambiguous mapping of characters to grammar elements. There is exactly
> one way to represent the sequence "constexpr". In particular, normalization
> of UCN sequences remains constant through phases 2-4.
> - Abstract character is also useful in phase 5, when more conversion is
> done, to talk about the representability of characters in the execution
> encoding.
> - It is true that until character literals are formed (modulo a weird
> thing we can deprecate in the preprocessor), we do not _care_ about the
> order or values of codepoints, nor which encoding is used by the internal
> representation. But it doesn't seem useful to pretend there isn't a
> representation either, which is why I recommend not distinguishing coded
> character sets and character sets.
> - The Basic Source and Basic Execution Character Sets are clearly
> repertoires, but the Execution Character and Execution Wide-Character Sets
> are character sets (the standard cares about a value existing, not
> necessarily what that value is). And again, the existence of an encoding
> implies the existence of the corresponding character set.
> - The constraints on the values of "0" to "9" etc. in [lex.charset]
> actually apply to code *units*.
>
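The last point can be illustrated with a short C++17 sketch. The digit guarantee in [lex.charset] is what makes the usual `c - '0'` idiom portable, and it is stated in terms of the values of character literals, that is, code units. The function name `digit_value` is made up for this example.

```cpp
// The values of the decimal digits are contiguous and increasing in the
// ordinary and wide execution character sets, whatever the actual values are.
static_assert('1' - '0' == 1 && '9' - '0' == 9);
static_assert(L'1' - L'0' == 1 && L'9' - L'0' == 9);
static_assert(u'9' - u'0' == 9 && U'9' - U'0' == 9);

// Relies only on the contiguity guarantee, not on a specific encoding.
// Returns -1 for anything that is not a decimal digit.
constexpr int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

static_assert(digit_value('7') == 7);
```

No comparable guarantee exists for letters, so `c - 'a'` is not portable in the same way.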
>
> Did I miss anything?
>
> Should I clarify further?
>
> I hope it helps!
>
> On Wed, 10 Jun 2020 at 21:59, Peter Brett via SG16 <sg16_at_[hidden]>
> wrote:
>
> Also pretty much the whole of [character.seq]
> <http://eel.is/c++draft/character.seq> needs to be looked at.
>
>
>
> *From:* SG16 <sg16-bounces_at_[hidden]> *On Behalf Of *Tom Honermann
> via SG16
>
> *Sent:* 10 June 2020 20:28
>
> *To:* SG16 <sg16_at_[hidden]>
>
> *Cc:* Tom Honermann <tom_at_[hidden]>
>
> *Subject:* [SG16] Terminology
>
>
>
> I'm sending the following as a potential guide for discussion in today's
> SG16 telecon. My apologies for the short notice.
>
> The following lists "things" that we may need (new) names for. For those
> already present in the standard, the current terms used are included in
> parentheses. If you can think of others, please reply.
>
> - *The encoding of source files.*
> - (Physical source file character set; [lex.phases]p1.1
> <http://eel.is/c++draft/lex.phases#1.1>)
> - *The source character repertoire.*
> - (Basic source character set; [lex.charset]p1
> <http://eel.is/c++draft/lex.charset#1>)
> - *The compiler's internal character encoding.*
> - (Internal encoding; [lex.phases]p1
> <http://eel.is/c++draft/lex.phases#1.1>)
> - *The character set requirements for the encoding of character and
> string literals.*
> - (basic execution character set; [lex.charset]p3
> <http://eel.is/c++draft/lex.charset#3>)
> - *The character set requirements for the encoding of wide character
> and string literals.*
> - (basic execution wide-character set; [lex.charset]p3
> <http://eel.is/c++draft/lex.charset#3>)
> - *The encoding of character and string literals.*
> - (execution character set; [lex.charset]p3
> <http://eel.is/c++draft/lex.charset#3>)
> - *The encoding of wide character and string literals.*
> - (execution wide-character set; [lex.charset]p3
> <http://eel.is/c++draft/lex.charset#3>)
> - *The encoding of character literals when used in conditional
> preprocessing directives.*
> - (; [cpp.cond]p12
> <http://eel.is/c++draft/cpp.cond#12>)
> - *The encoding of wide character literals when used in conditional
> preprocessing directives.*
> - (; [cpp.cond]p12
> <http://eel.is/c++draft/cpp.cond#12>)
> - *The encoding of file names.*
> - (Native encoding; [fs.path.type.cvt]p1
> <http://eel.is/c++draft/fs.path.type.cvt#1>)
> - *The encoding of wide file names.*
> - (Native encoding; [fs.path.type.cvt]p1
> <http://eel.is/c++draft/fs.path.type.cvt#1>)
> - *The Unicode character set.*
> - (ISO/IEC 10646; [lex.charset]p2
> <http://eel.is/c++draft/lex.charset#2>)
> - *The encoding of characters and strings at run-time.*
> - ()
> - *The terminal/console encoding*
> - ()
>
> Tom.
>
> --
>
> SG16 mailing list
>
> SG16_at_[hidden]
>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16


