As I said on the telecon, here is my understanding of how it all works.
I make no guarantee that my understanding is accurate, but I'm doing my best!
I should preface by saying that
Very short history
Up until the early 90s, the model was fairly simple:
Some bit pattern (which fits in a fixed or variable number of bits) => some abstract character.
I use "abstract character" in the modern sense here; what it meant in practice varied between systems: sometimes the mapping was directly to a glyph, and sometimes the mapping was done manually (early telegraph).
In this model, the character encoding designates a character set, where the set of characters corresponds to the characters that can be encoded.
It's all very tautological. In this model, the value of a coded character is its bit pattern. Values are necessary to define an order.
The issue with that model is that a given set of characters can only be encoded one way, and changing the encoding duplicates the character set.
It is also a fairly inflexible model: adding characters to unused bit patterns often requires duplicating the encoding.
This became a problem in the early 90s as
At that point, there are multiple encodings describing the same character set... and just like that, the notions diverged.
AFAICT, Unicode / the Universal Coded Character Set (different specifications, same character set) and GB18030 are the two character sets that have multiple encodings and for which
the distinction between encoding and Coded Character Set matters.
For any other encoding, the terms encoding, character set, and coded character set are interchangeable.
For any encoding there exists a character set. There is some subtlety there, as GB18030 and Unicode are tantalizingly close to being isomorphic but not quite;
UTF-8, for example, can encode either GB18030 or Unicode. But ignoring that difference, 1 encoding => 1 character set.
Conversely, a character set can be represented by 1 or more encodings.
Definitions:
Abstract Character and Character Repertoire
An Abstract Character is what people would colloquially refer to as a character outside of the context of computers.
They do carry some semantics, but they do not have a value or any representation.
The notion of Abstract Character is useful to compare character sets with one another.
A Character Repertoire is a set of abstract characters.
A coded character set, which I have colloquially referred to as a character set (although there is a slight difference between the two),
is a set of abstract characters, each assigned a value referred to as a codepoint. BUT:
* The same abstract character can be assigned multiple values; this is usually done for compatibility reasons.
* Multiple different abstract characters can be assigned the same value; Han unification is the notable example. In that case the character set, or an encoding thereof, isn't sufficient to convey the exact semantic meaning of a piece of text or to convert that codepoint to a glyph: more context, such as knowing the script or language used, is necessary.
* A single abstract character can be assigned multiple codepoints. This is the case for some Latin letters with diacritics and for emojis.
* A single abstract character can be represented by different sequences of codepoints of different sizes.
A coded character set is the result of such mapping.
For example, a repertoire may contain the letter "Ê", which might be represented in a character set by a codepoint for E, one for the circumflex accent, and maybe one for the combined form,
the latter possibly for compatibility purposes.
Then a Coded Character Set may decide to assign the number 1 to E, the number 47 to Ê, and the number 622 to the circumflex accent.
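To make this concrete with Unicode's actual assignments rather than the made-up numbers above: Unicode assigns U+00CA to Ê, U+0045 to E, and U+0302 to the combining circumflex accent. Here is a minimal C++ sketch of the two spellings; the variable names and the codepoint_count helper are mine and purely illustrative.

```cpp
#include <cstddef>

// One abstract character, "Ê", spelled as two different codepoint sequences.
// U+00CA LATIN CAPITAL LETTER E WITH CIRCUMFLEX (the combined form).
constexpr char32_t precomposed[] = U"\u00CA";
// U+0045 LATIN CAPITAL LETTER E followed by U+0302 COMBINING CIRCUMFLEX ACCENT.
constexpr char32_t decomposed[] = U"E\u0302";

// Hypothetical helper: number of codepoints in a UTF-32 array, not counting
// the terminating null. In UTF-32 one code unit is exactly one codepoint.
template <std::size_t N>
constexpr std::size_t codepoint_count(const char32_t (&)[N]) {
    return N - 1;
}

static_assert(codepoint_count(precomposed) == 1, "combined form is one codepoint");
static_assert(codepoint_count(decomposed) == 2, "base letter plus accent is two codepoints");
```

Both arrays denote the same abstract character, yet they differ in length and in values, which is why comparing text codepoint by codepoint is not the same as comparing abstract characters.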
In practice, Character Sets are always Coded Character Sets, and both terms are colloquially interchangeable as the goal is to design something representable by computers.
Notice that while the definitions of Coded Character Set and Character Set are distinct, there exists no term to describe the individual elements of a character set which is not a Coded Character Set.
As such there exists no character set which is not a coded character set, and while it might be useful to define character set properly somewhere, I am not sure the distinction is ever necessary for our purposes. Even when we don't care about what the values are, the values exist. Those values are necessary to define an order and bytewise equality, and to talk about Unicode properties.
A character encoding, in Unicode parlance, is a mapping from a coded character set to some serialized form.
With the exception of Unicode and GB18030, a text encoding is also a mapping to a character repertoire, as the character set and the character repertoires are isomorphic.
Character Encoding, Character Encoding Form, And Character Encoding Scheme
These are Unicode-specific terms, which I do not think we care about much, and which exist because Unicode defines encodings with different endianness:
They first map a codepoint to a sequence of code units (where code units are 8, 16, or 32 bits), then convert these to a sequence of 8-bit bytes, applying byte swapping to obtain the desired endian order.
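The two steps can be sketched for UTF-16 roughly as follows. This is only an illustration under my own naming (utf16_units, to_utf16, utf16_bytes, byte_order and to_bytes are all hypothetical, not taken from any specification), and it assumes the input is a valid Unicode scalar value.

```cpp
#include <cstddef>
#include <cstdint>

// Step 1 (encoding form): one codepoint becomes one or two 16-bit code units.
struct utf16_units {
    char16_t units[2];
    std::size_t count;
};

// Assumes cp is at most 0x10FFFF and is not a surrogate (0xD800..0xDFFF).
constexpr utf16_units to_utf16(char32_t cp) {
    return cp < 0x10000
        ? utf16_units{{static_cast<char16_t>(cp), 0}, 1}
        : utf16_units{{static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
                       static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))},
                      2};
}

// Step 2 (encoding scheme): one code unit becomes two bytes in a chosen order.
struct utf16_bytes {
    std::uint8_t first;
    std::uint8_t second;
};

enum class byte_order { big, little };

constexpr utf16_bytes to_bytes(char16_t unit, byte_order order) {
    return order == byte_order::big
        ? utf16_bytes{static_cast<std::uint8_t>(unit >> 8),
                      static_cast<std::uint8_t>(unit & 0xFF)}
        : utf16_bytes{static_cast<std::uint8_t>(unit & 0xFF),
                      static_cast<std::uint8_t>(unit >> 8)};
}

// U+1F600 needs a surrogate pair: code units 0xD83D 0xDE00.
static_assert(to_utf16(0x1F600).count == 2, "outside the BMP: two code units");
static_assert(to_utf16(0x1F600).units[0] == 0xD83D, "high surrogate");
static_assert(to_utf16(0x1F600).units[1] == 0xDE00, "low surrogate");
```

The code units are the same in both cases; only the second step differs, so 0xD83D serializes as the bytes D8 3D in big-endian order and 3D D8 in little-endian order.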
I do not think these distinctions matter in the standard at all - and I recommend using the term character encoding (which applies to all character encodings, whereas CEF/CES are Unicode specific), BUT we may want to specify the endianness of UTF-16 and UTF-32 to be implementation-defined.
A code unit is the minimal unit of storage used to represent a character in an encoding (7 bits for ASCII, 8 for UTF-8, 16 for UTF-16, etc.).
These map to C++ character types (char, char8_t, char16_t, etc.).
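As a quick illustration of how the same codepoint takes a different number of code units depending on the encoding, here is a small C++ sketch using the u8, u, and U literal prefixes (UTF-8, UTF-16, and UTF-32 respectively). The code_units helper is a name I made up for this example.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal, not counting
// the terminating null. CharT is char8_t (or char before C++20) for u8"",
// char16_t for u"", and char32_t for U"".
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00CA (Ê): two UTF-8 code units, one UTF-16 code unit, one UTF-32 code unit.
static_assert(code_units(u8"\u00CA") == 2, "UTF-8");
static_assert(code_units(u"\u00CA") == 1, "UTF-16");
static_assert(code_units(U"\u00CA") == 1, "UTF-32");

// U+1F600: four UTF-8 code units, two UTF-16 code units, one UTF-32 code unit.
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32");
```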
Code units and Code points are Unicode terms, which can be used to describe any encoding, including non-Unicode encodings.
Not all code unit sequences represent codepoints, and not all codepoints represent abstract characters.
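Two Unicode examples of this, as a sketch rather than a complete validator: in UTF-16, a surrogate code unit that is not part of a high/low pair does not represent any codepoint, and Unicode designates 66 codepoints (U+FDD0..U+FDEF plus the last two codepoints of each plane) as noncharacters. The function names below are my own.

```cpp
#include <cstddef>

constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// True if every surrogate in the null-terminated array is part of a
// high/low pair. Requires C++14 for the loop in a constexpr function.
template <std::size_t N>
constexpr bool is_well_formed_utf16(const char16_t (&s)[N]) {
    for (std::size_t i = 0; i + 1 < N; ++i) {  // stop before the terminating null
        if (is_high_surrogate(s[i])) {
            // s[i + 1] is always in bounds; the terminating null is not a
            // low surrogate, so a trailing high surrogate is rejected here.
            if (!is_low_surrogate(s[i + 1])) return false;
            ++i;  // skip the low surrogate we just consumed
        } else if (is_low_surrogate(s[i])) {
            return false;  // low surrogate without a preceding high surrogate
        }
    }
    return true;
}

// True for the codepoints Unicode permanently reserves as noncharacters.
constexpr bool is_noncharacter(char32_t cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) ||
           (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

static_assert(is_well_formed_utf16(u"\U0001F600"), "a proper surrogate pair");
static_assert(!is_well_formed_utf16(u"\xD800"), "a lone high surrogate");
static_assert(is_noncharacter(0xFFFF), "U+FFFF is a noncharacter");
```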
These are the main terms, let me know if I should clarify further.
In the context of C++
Did I miss anything?
Should I clarify further?
I hope it helps!
On Wed, 10 Jun 2020 at 21:59, Peter Brett via SG16 <sg16@lists.isocpp.org> wrote:
Also pretty much the whole of [character.seq] needs to be looked at.
From: SG16 <sg16-bounces@lists.isocpp.org> On Behalf Of Tom Honermann via SG16
Sent: 10 June 2020 20:28
To: SG16 <sg16@lists.isocpp.org>
Cc: Tom Honermann <tom@honermann.net>
Subject: [SG16] Terminology
I'm sending the following as a potential guide for discussion in today's SG16 telecon. My apologies for the short notice.
The following lists "things" that we may need (new) names for. For those already present in the standard, the current terms used are included in parenthesis. If you can think of others, please reply.
- The encoding of source files.
  (Physical source file character set; [lex.phases]p1.1)
- The source character repertoire.
  (Basic source character set; [lex.charset]p1)
- The compiler's internal character encoding.
  (Internal encoding; [lex.phases]p1)
- The character set requirements for the encoding of character and string literals.
  (basic execution character set; [lex.charset]p3)
- The character set requirements for the encoding of wide character and string literals.
  (basic execution wide-character set; [lex.charset]p3)
- The encoding of character and string literals.
  (execution character set; [lex.charset]p3)
- The encoding of wide character and string literals.
  (execution wide-character set; [lex.charset]p3)
- The encoding of character literals when used in conditional preprocessing directive.
  (; [cpp.cond]p12)
- The encoding of wide character literals when used in conditional preprocessing directive.
  (; [cpp.cond]p12)
- The encoding of file names.
  (Native encoding; [fs.path.type.cvt]p1)
- The encoding of wide file names.
  (Native encoding; [fs.path.type.cvt]p1)
- The Unicode character set.
  (ISO/IEC 10646; [lex.charset]p2)
- The encoding of characters and strings at run-time.
  ()
- The terminal/console encoding
  ()
Tom.
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16