As I said on the telecon, here is my understanding of how it all works.

I make no guarantee that my understanding is accurate, but I'm doing my best!


I should preface by saying that

  • It's all very complicated: our understanding of text has changed a lot over the course of a century, but much of the terminology has stayed the same, so the same term can have different meanings depending on who uses it, in which context, and in which time period. And text itself is complicated.
  • Similarly, multiple terms are used to describe the same thing.
  • The Unicode terminology, which is the most complete, doesn't just apply to Unicode; it reflects how the Unicode people think about text. Given that they have a more refined mental model, I'll use Unicode terminology most of the time.
  • As best I can tell, the Unicode terminology was introduced around Unicode 3.0; I'll explain why.


Very short history


Up until the early 90s, the model was fairly simple:

Some bit pattern (which fits in a fixed or variable number of bits) => some abstract character.

I use "abstract character" in the modern sense here. The details varied between systems: sometimes the mapping was directly to a glyph, and sometimes the mapping was done manually (early telegraph).


In this model, the character encoding designates a character set, where the set of characters corresponds to the characters that can be encoded.

It's all very tautological. In this model, the value of a coded character is its bit pattern. Values are necessary to define an order.


The issue with that model is that a given set of characters can only be encoded one way, and changing the encoding duplicates the character set.

It is also a fairly inflexible model: adding characters to unused bit patterns often requires duplicating the encoding.


This became a problem in the early 90s as:

  • Some people thought that 2 bytes were way too wasteful, which led to Ken Thompson drafting UTF-8 on a diner placemat
  • Some people thought that 2 bytes were not nearly enough to represent characters, which led to the Unicode code space being extended to 21 bits, the surrogate mechanism, UTF-16 and UTF-32

At that point, there are multiple encodings describing the same character set... and just like that, the notions diverged.


AFAICT, Unicode / the Universal Coded Character Set (different specifications, same character set) and GB18030 are the two character sets that have multiple encodings, and for which the distinction between encoding and Coded Character Set matters.


For any other encoding, the terms encoding, character set, and coded character set are interchangeable.


For any encoding there exists a character set. There is some subtlety there, as GB18030 and Unicode are tantalizingly close to being isomorphic but not quite; UTF-8, for example, can encode either GB18030 or Unicode. But ignoring that difference, 1 encoding => 1 character set.


Inversely, a character set can be represented by 1 or more encodings.
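
To make the "1 character set, several encodings" point concrete, here is a small sketch using Python's built-in codecs (the choice of "Ê" and of these four encodings is mine, purely for illustration). The same coded character, U+00CA, serializes to different bytes under each encoding, yet decodes back to the same codepoint every time:

```python
# One coded character (U+00CA) serialized by several encodings of the same set.
ch = "\u00ca"  # LATIN CAPITAL LETTER E WITH CIRCUMFLEX

encoded = {
    name: ch.encode(name)
    for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le")
}

for name, data in encoded.items():
    # The bytes differ, but every encoding decodes back to the same codepoint.
    codepoint = ord(data.decode(name))
    print(f"{name:10} {data.hex()}  ->  U+{codepoint:04X}")
```

The codepoint is a property of the coded character set; only the byte sequences belong to the individual encodings.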


Definitions:


Abstract Character and Character Repertoire 


An Abstract Character is what people would colloquially refer to as a character outside of the context of computers.

Abstract characters do carry some semantics, but they do not have a value or any representation.


The notion of Abstract Character is useful to compare character sets between one another.


A Character Repertoire is a set of abstract characters.


A coded character set, which I have colloquially referred to as a character set (although there is a slight difference between the two), is a set of abstract characters each assigned to a value, referred to as a codepoint. BUT:


* The same abstract character can be assigned multiple values - this is usually done for compatibility reasons

* Multiple different abstract characters can be assigned the same value - this is notably the case with Han unification. In that case, the character set (or an encoding thereof) isn't sufficient to convey the exact semantic meaning of a piece of text or to convert that codepoint to a glyph - more context, such as knowing the script or language used, is necessary.

* A single abstract character can be represented by a sequence of multiple codepoints. This is the case for some Latin letters with diacritics, and for emojis

* A single abstract character can be represented by different sequences of codepoints of different sizes


A coded character set is the result of such mapping.


For example, a repertoire may contain the letter "Ê", which might be represented in a character set by a codepoint for E, one for the circumflex accent, and maybe one for the combined form, possibly for compatibility purposes.

Then a Coded Character Set may decide to assign the number 1 to E, the number 47 to Ê, and the number 622 to the circumflex accent.
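
Unicode happens to do exactly this for "Ê", so the example can be checked directly. The sketch below uses Python's standard unicodedata module; the Ångström example is my own addition, to illustrate the "same abstract character, multiple values" case from the list above:

```python
import unicodedata

precomposed = "\u00ca"   # Ê as a single codepoint
decomposed = "E\u0302"   # E followed by COMBINING CIRCUMFLEX ACCENT

# Two different codepoint sequences, of different lengths...
lengths = (len(precomposed), len(decomposed))
same_codepoints = precomposed == decomposed

# ...which normalization maps onto each other, i.e. the same abstract character.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# A compatibility duplicate: ANGSTROM SIGN normalizes to
# LATIN CAPITAL LETTER A WITH RING ABOVE.
angstrom_nfc = unicodedata.normalize("NFC", "\u212b")

print(lengths, same_codepoints)
print([f"U+{ord(c):04X}" for c in nfc], [f"U+{ord(c):04X}" for c in nfd])
print(f"U+{ord(angstrom_nfc):04X}")
```

Codepoint-wise comparison says the two spellings of "Ê" differ; only normalization recovers the fact that they denote the same abstract character.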


In practice, Character Sets are always Coded Character Sets, and both terms are colloquially interchangeable, as the goal is to design something representable by computers.

Notice that while the definitions of Coded Character Set and Character Set are distinct, there exists no term to describe the individual elements of a character set which is not a Coded Character Set.

As such, there exists no character set which is not a coded character set, and while it might be useful to define character set properly somewhere, I am not sure the distinction is ever necessary for our purpose. Even when we don't care about what the values are, the values exist. Values are necessary to define an order and bytewise equality, and to talk about Unicode properties.


A character encoding, in Unicode parlance, is a mapping from a coded character set to some serialized form.

With the exception of Unicode and GB18030, a text encoding is also a mapping to a character repertoire, as the character set and the character repertoire are isomorphic.


Character Encoding, Character Encoding Form, And Character Encoding Scheme


These are Unicode-specific terms, which I do not think we care about much. They exist because Unicode defines encodings with different endianness:

They first map a codepoint to a sequence of code units (where code units are 8, 16, or 32 bits), then convert these to a sequence of 8-bit bytes, applying byte swapping to obtain the desired endian order.

I do not think these distinctions matter in the standard at all, and I recommend using the term character encoding (which applies to all character encodings, whereas CEF/CES are Unicode-specific). BUT we may want to specify the endianness of UTF-16 and UTF-32 to be implementation-defined.
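
For those who want to see the two steps separately, here is a rough sketch for UTF-16. The function names utf16_code_units and serialize are made up for this illustration; the surrogate arithmetic is the standard UTF-16 one, and the result is cross-checked against Python's own codecs:

```python
import struct

def utf16_code_units(codepoint: int) -> list:
    """Encoding form: map one codepoint to one or two 16-bit code units."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if codepoint < 0x10000:
        return [codepoint]
    offset = codepoint - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]

def serialize(units: list, big_endian: bool) -> bytes:
    """Encoding scheme: lay the code units out as bytes in a given byte order."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)

units = utf16_code_units(0x1F600)            # U+1F600 GRINNING FACE
as_be = serialize(units, big_endian=True)
as_le = serialize(units, big_endian=False)
print([hex(u) for u in units], as_be.hex(), as_le.hex())
```

The code units are identical in both cases; only the second step, the byte order, differs between UTF-16BE and UTF-16LE.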


A code unit is the minimal bit combination that can represent a unit of encoded text in a given encoding (7 bits for ASCII, 8 for UTF-8, 16 for UTF-16, etc.).

These map to C++ character types (char, char8_t, char16_t, etc.).


Code units and code points are Unicode terms, which can be used to describe any encoding, including non-Unicode encodings.

Not all code unit sequences represent codepoints, and not all codepoints represent abstract characters.
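
Both halves of that last sentence can be observed from Python, as a quick sketch (the specific byte and codepoints are just examples I picked): the byte 0xFF is not part of any valid UTF-8 sequence, and surrogates and noncharacters are codepoints with no abstract character behind them.

```python
import unicodedata

# A code unit sequence that represents no codepoint: 0xFF never occurs in UTF-8.
try:
    b"\xff".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Codepoints that represent no abstract character.
surrogate_category = unicodedata.category("\ud800")     # surrogate codepoint
noncharacter_category = unicodedata.category("\uffff")  # noncharacter

print(decoded_ok, surrogate_category, noncharacter_category)
```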



These are the main terms, let me know if I should clarify further.



In the context of C++

  • Abstract character is useful when talking about conversion between character sets. This is notably the case in phase 1, where "physical source file characters" and "the set of physical source file characters" do, I believe, refer to abstract characters and a character repertoire respectively. This might change if we want to say something specific about UTF-8 and normalization forms. But talking about an "Abstract Character Sequence" here lets us not care at all about memory representation. A jpg of text is still an abstract character sequence.
  • (Phase 1 assumes each "physical source file character" maps to exactly one member of the "basic source character set" or to one UCN, which is not a correct assumption.)
  • The rest of the lexing is clearly done on a character set, as there is no ambiguous mapping of characters to grammar elements. There is exactly one way to represent the sequence "constexpr". In particular, normalization of UCN sequences remains constant through phases 2-4.
  • Abstract character is also useful in phase 5, when more conversion is done, to talk about the representability of characters in the execution encoding.
  • It is true that until character literals are formed (modulo a weird thing in the preprocessor that we can deprecate), we do not _care_ about the order or values of codepoints, nor which encoding is used by the internal representation. But it doesn't seem useful to pretend there isn't a representation either, which is why I recommend not distinguishing coded character sets and character sets.
  • The Basic Source and Basic Execution Character Sets are clearly repertoires, but the Execution Character and Execution Wide-Character Sets are character sets (the standard cares about a value existing, not necessarily what that value is). And again, the existence of an encoding implies the existence of the corresponding character set.
  • The constraints on the values of "0" to "9" etc. in [lex.charset] actually apply to code units.
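
On the last point, a quick way to see that the digit constraint is about code unit values rather than codepoints: the sketch below (the helper digits_are_contiguous is hypothetical, and cp037 is simply the EBCDIC code page that Python ships) checks that "0" to "9" encode to single code units with consecutive values, even though those values differ between encodings.

```python
def digits_are_contiguous(encoding: str) -> bool:
    """True if '0'..'9' encode to single code units with consecutive values."""
    units = [str(d).encode(encoding) for d in range(10)]
    if any(len(u) != 1 for u in units):
        return False
    values = [u[0] for u in units]
    return values == list(range(values[0], values[0] + 10))

zero_values = {}
for name in ("ascii", "utf-8", "cp037"):  # cp037 is an EBCDIC code page
    zero_values[name] = "0".encode(name)[0]
    print(name, hex(zero_values[name]), digits_are_contiguous(name))
```

The value of "0" is 0x30 in ASCII and 0xF0 in this EBCDIC code page, but in both cases each following digit is one greater than the previous one, which is all the constraint asks for.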


Did I miss anything?

Should I clarify further?

I hope it helps!











  


On Wed, 10 Jun 2020 at 21:59, Peter Brett via SG16 <sg16@lists.isocpp.org> wrote:

Also pretty much the whole of [character.seq] needs to be looked at.

 

From: SG16 <sg16-bounces@lists.isocpp.org> On Behalf Of Tom Honermann via SG16
Sent: 10 June 2020 20:28
To: SG16 <sg16@lists.isocpp.org>
Cc: Tom Honermann <tom@honermann.net>
Subject: [SG16] Terminology

 

I'm sending the following as a potential guide for discussion in today's SG16 telecon. My apologies for the short notice.

The following lists "things" that we may need (new) names for. For those already present in the standard, the current terms used are included in parenthesis. If you can think of others, please reply.

  • The encoding of source files.
    (Physical source file character set; [lex.phases]p1.1)
  • The source character repertoire.
    (Basic source character set; [lex.charset]p1)
  • The compiler's internal character encoding.
    (Internal encoding; [lex.phases]p1)
  • The character set requirements for the encoding of character and string literals.
    (basic execution character set; [lex.charset]p3)
  • The character set requirements for the encoding of wide character and string literals.
    (basic execution wide-character set; [lex.charset]p3)
  • The encoding of character and string literals.
    (execution character set; [lex.charset]p3)
  • The encoding of wide character and string literals.
    (execution wide-character set; [lex.charset]p3)
  • The encoding of character literals when used in conditional preprocessing directive.
    (; [cpp.cond]p12)
  • The encoding of wide character literals when used in conditional preprocessing directive.
    (; [cpp.cond]p12)
  • The encoding of file names.
    (Native encoding; [fs.path.type.cvt]p1)
  • The encoding of wide file names.
    (Native encoding; [fs.path.type.cvt]p1)
  • The Unicode character set.
    (ISO/IEC 10646; [lex.charset]p2)
  • The encoding of characters and strings at run-time.
    ()
  • The terminal/console encoding
    ()

Tom.


 

 

 

 

 

 

 

 

 

 

-- 

 SG16 mailing list

 SG16@lists.isocpp.org

 https://lists.isocpp.org/mailman/listinfo.cgi/sg16









  
