Re: [SG16] Terminology

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Thu, 11 Jun 2020 15:22:08 +0200
As I said on the telecon, here is my understanding of how it all works.

I make no guarantee that my understanding is accurate, but I'm doing my
best!


I should preface this by saying that:

   - It's all very complicated: our understanding of text has changed a
   lot over the course of a century, while some of the terminology has
   stayed the same, so the same terms have different meanings depending on
   who uses them, in which context, and in which time period. And text is
   complicated.
   - Similarly, multiple terms are used to describe the same thing.
   - The Unicode terminology, which is the most complete, doesn't just
   refer to Unicode, but to how the Unicode people think about text. And
   given they have a more refined mental model, I'll refer to Unicode
   terminology most of the time.
   - As best I can tell, the Unicode terminology was introduced around
   Unicode 3.0; I'll explain why.


Very short history


Up until the early 90s, the model was fairly simple:

Some bit pattern (which fits in some variable or fixed number of bits) =>
some abstract character.

I use "abstract character" in the modern sense here; what it meant varied
between systems: sometimes the mapping was directly to a glyph, sometimes
the mapping was done manually (early telegraph).


In this model, the character encoding designates a character set, where the
set of characters corresponds to the characters that can be encoded.

It's all very tautological. In this model, the value of a coded character
is its bit pattern. Values are necessary to define an order.


The issue with that model is that a given set of characters can only be
encoded one way, and changing the encoding duplicates the character set.

It is also a fairly inflexible model: adding characters to unused bit
patterns often requires duplicating the encoding.


This became a problem in the early 90s, as:

   - Some people thought that 2 bytes were way too wasteful, which led to
   Ken Thompson drafting UTF-8 on a napkin.
   - Some people thought that 2 bytes were not enough at all to represent
   characters, which led to the Unicode code space being extended to 21
   bits, the surrogate mechanism, UTF-16 and UTF-32.

At that point, there were multiple encodings describing the same character
set... and just like that, the notions diverged.


AFAICT, Unicode / Universal Coded Character Set (different specifications,
same character set) and GB18030 are the two character sets that have
multiple encodings and for which the distinction between encoding and
coded character set matters.


For any other encoding, the terms encoding, character set, and coded
character set are interchangeable.


For any encoding there exists a character set. There is some subtlety
here, as GB18030 and Unicode are tantalizingly close to being isomorphic,
but not quite; UTF-8, for example, can encode either GB18030 or Unicode.
But ignoring that difference, 1 encoding => 1 character set.


Conversely, a character set can be represented by 1 or more encodings.


Definitions:


Abstract Character and Character Repertoire


An Abstract Character is what people would colloquially refer to as a
character outside of the context of computers.

They do carry some semantics, but they do not have a value or any
representation.


The notion of Abstract Character is useful to compare character sets
between one another.


A *character repertoire* is a set of abstract characters.


A coded character set, which I have colloquially referred to as a character
set, although there is a slight difference between the two, is a set of
abstract characters, each assigned to a value referred to as a codepoint.
BUT:


* The same abstract character can be assigned multiple values. This is
usually done for compatibility reasons.

* Multiple different abstract characters can be assigned the same value.
This is notably the case with Han unification, in which case the character
set or an encoding thereof isn't sufficient to convey the exact semantic
meaning of a piece of text or to convert that codepoint to a glyph: more
context, such as knowing the script or language used, is necessary.

* A single abstract character can require a sequence of several
codepoints. This is the case for some Latin letters with diacritics, and
for some emojis.

* A single abstract character can be represented by different sequences of
codepoints of different lengths.


A *coded character set* is the result of such a mapping.


For example, a repertoire may contain the letter "Ê", which might be
represented in a character set by a codepoint for E, one for the circumflex
accent, and maybe one for the combined form, perhaps for compatibility
purposes.

Then a *Coded* Character Set may decide to assign the number 1 to E, the
number 47 to Ê, and the number 622 to the circumflex accent.
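To make this concrete with Unicode's actual assignments rather than the made-up numbers above, here is a minimal sketch, assuming a C++17 compiler. The variable names are mine, chosen only for illustration. It shows the same abstract character "Ê" spelled as one codepoint (precomposed) and as two codepoints (base letter plus combining accent):

```cpp
#include <string_view>

// U+00CA LATIN CAPITAL LETTER E WITH CIRCUMFLEX: the precomposed form.
constexpr std::u32string_view precomposed = U"\u00CA";

// U+0045 LATIN CAPITAL LETTER E followed by
// U+0302 COMBINING CIRCUMFLEX ACCENT: the decomposed form.
constexpr std::u32string_view decomposed = U"E\u0302";

// In a char32_t string each element holds one codepoint,
// so size() counts codepoints here.
static_assert(precomposed.size() == 1);
static_assert(decomposed.size() == 2);

// Same abstract character, but the codepoint sequences differ, so a
// plain element-wise comparison reports them as different.
static_assert(precomposed != decomposed);
```

Treating the two spellings as equal requires normalization, which the C++ standard library does not provide.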


In practice, Character Sets are always *Coded* Character Sets, and both
terms are colloquially interchangeable, as the goal is to design something
representable by computers.

Notice that while the definitions of *Coded* Character Set and Character
Set are distinct, there exists no term to describe the individual elements
of a character set which is not a *Coded* Character Set.

As such, there exists no character set which is not a coded character set,
and while it might be useful to define character set properly somewhere, I
am not sure the distinction is ever necessary for our purposes. Even when
we don't care about what the values are, the values exist, and they are
necessary to define an order, to define bytewise equality, and to talk
about Unicode properties.


A character encoding, in Unicode parlance, is a mapping of a coded
character set to some serialized form.

With the exception of Unicode and GB18030, a text encoding is also a
mapping of a character repertoire, as the character set and the character
repertoire are isomorphic.


Character Encoding, Character Encoding Form, And Character Encoding Scheme


These are Unicode-specific terms, which I do not think we care about much,
and which exist because Unicode defines encodings with different
endianness:

They first map a codepoint to a sequence of *code units* (where code units
are 8, 16, or 32 bits), then convert these to a sequence of 8-bit bytes,
applying byte swapping to obtain the desired endian order.

I do not think these distinctions matter in the standard at all, and I
recommend using the term *character encoding* (which applies to all
character encodings, whereas CEF/CES are Unicode-specific), BUT we may want
to specify the endianness of UTF-16 and UTF-32 to be implementation-defined.
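The two steps can be sketched for UTF-16 as follows. This is an illustration under stated assumptions, not anything from the standard: the names `Utf16Units`, `to_utf16`, `Bytes2` and `serialize` are hypothetical, the input is assumed to be a valid Unicode scalar value (not checked), and C++17 is assumed.

```cpp
#include <cstdint>

// Result of the encoding *form* step: one or two 16-bit code units.
struct Utf16Units {
    char16_t units[2];
    int count;
};

// Encoding form: codepoint -> UTF-16 code units.
// Assumes cp <= 0x10FFFF and cp is not a surrogate (not checked here).
constexpr Utf16Units to_utf16(char32_t cp) {
    if (cp < 0x10000) {
        return {{static_cast<char16_t>(cp), 0}, 1};
    }
    const char32_t v = cp - 0x10000;  // 20 bits, split into two halves
    return {{static_cast<char16_t>(0xD800 + (v >> 10)),
             static_cast<char16_t>(0xDC00 + (v & 0x3FF))}, 2};
}

// Encoding *scheme* step: one code unit -> two bytes in a chosen order.
struct Bytes2 {
    std::uint8_t b[2];
};

constexpr Bytes2 serialize(char16_t unit, bool big_endian) {
    const auto hi = static_cast<std::uint8_t>(unit >> 8);
    const auto lo = static_cast<std::uint8_t>(unit & 0xFF);
    return big_endian ? Bytes2{{hi, lo}} : Bytes2{{lo, hi}};
}

// U+00CA fits in one code unit; U+1F600 needs a surrogate pair.
static_assert(to_utf16(U'\u00CA').count == 1);
static_assert(to_utf16(U'\U0001F600').count == 2);
static_assert(to_utf16(U'\U0001F600').units[0] == 0xD83D);
static_assert(to_utf16(U'\U0001F600').units[1] == 0xDE00);

// The same code unit serializes differently as UTF-16BE and UTF-16LE.
static_assert(serialize(0xD83D, true).b[0] == 0xD8);
static_assert(serialize(0xD83D, false).b[0] == 0x3D);
```

Only the second step depends on byte order, which is why the form/scheme split exists for UTF-16 and UTF-32 but is moot for UTF-8.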


A code unit is the smallest unit of storage an encoding uses; a codepoint
is represented by one or more code units (7 bits for ASCII, 8 for UTF-8,
16 for UTF-16, etc.).

These map to C++ character types (char, char8_t, char16_t, etc.).
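As a small illustration, the number of code units needed for one codepoint can be read off the size of a string literal. This is a sketch assuming C++17 or later; the helper `code_units` is a name I made up. It avoids naming the element type of `u8` literals, which is `char` in C++17 and `char8_t` in C++20.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating
// null code unit. Works for char, char8_t, char16_t and char32_t arrays.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00CA (Ê): two 8-bit code units in UTF-8, one in UTF-16 and UTF-32.
static_assert(code_units(u8"\u00CA") == 2);
static_assert(code_units(u"\u00CA") == 1);
static_assert(code_units(U"\u00CA") == 1);

// U+1F600: four code units in UTF-8, two in UTF-16, one in UTF-32.
static_assert(code_units(u8"\U0001F600") == 4);
static_assert(code_units(u"\U0001F600") == 2);
static_assert(code_units(U"\U0001F600") == 1);
```

Ordinary and wide literals are left out on purpose, since their encodings are implementation-defined and the counts would vary between implementations.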


*Code units* and *codepoints* are Unicode terms, which can be used to
describe any encoding, including non-Unicode encodings.

Not all code unit sequences represent codepoints, and not all codepoints
represent abstract characters.
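Both halves of that statement can be illustrated with Unicode. This is a minimal sketch with hypothetical function names, not an exhaustive validator: surrogates are codepoints that are never assigned to an abstract character, and some byte values can never begin a well-formed UTF-8 code unit sequence.

```cpp
// Unicode codepoints range from 0 to 0x10FFFF.
constexpr bool is_code_point(char32_t v) {
    return v <= 0x10FFFF;
}

// Surrogates (0xD800..0xDFFF) are codepoints reserved for UTF-16;
// they do not represent abstract characters.
constexpr bool is_surrogate(char32_t v) {
    return v >= 0xD800 && v <= 0xDFFF;
}

// A scalar value is any codepoint that is not a surrogate.
constexpr bool is_scalar_value(char32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

// In well-formed UTF-8, a sequence starts with 0x00..0x7F or 0xC2..0xF4.
// Continuation bytes (0x80..0xBF) and 0xC0, 0xC1, 0xF5..0xFF cannot
// start one. This checks the first byte only, not the whole sequence.
constexpr bool can_start_utf8_sequence(unsigned char b) {
    return b < 0x80 || (b >= 0xC2 && b <= 0xF4);
}

static_assert(is_code_point(0xD800) && !is_scalar_value(0xD800));
static_assert(is_scalar_value(0x00CA));
static_assert(!is_code_point(0x110000));
static_assert(!can_start_utf8_sequence(0x80));  // lone continuation byte
static_assert(!can_start_utf8_sequence(0xFF));  // never valid in UTF-8
static_assert(can_start_utf8_sequence(0xC3));   // starts a 2-byte sequence
```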



These are the main terms, let me know if I should clarify further.



In the context of C++

   - *Abstract character* is useful when talking about conversion between
   character sets. This is notably the case in phase 1, where "physical
   source file characters" and "the set of physical source file characters"
   do, I believe, refer to abstract characters and a character repertoire
   respectively. This might change if we want to say something specific about
   UTF-8 and normalization forms. But talking about an "abstract character
   sequence" here lets us not care at all about memory representation. A jpg
   of text is still an abstract character sequence.
   - (Phase 1 assumes each "physical source file character" maps to exactly
   one member of the "basic source character set" or one UCN, which is not a
   correct assumption.)
   - The rest of the lexing is clearly done on a character set, as there is
   no ambiguous mapping of characters to grammar elements. There is exactly
   one way to represent the sequence "constexpr". In particular, the
   normalization of UCN sequences remains constant through phases 2-4.
   - Abstract character is also useful in phase 5, where more conversion is
   done, to talk about the representability of characters in the execution
   encoding.
   - It is true that until character literals are formed (modulo a weird
   thing we can deprecate in the preprocessor), we do not _care_ about the
   order or values of codepoints, nor which encoding is used by the internal
   representation. But it doesn't seem useful to pretend there isn't a
   representation either, which is why I recommend not distinguishing coded
   character sets and character sets.
   - The Basic Source and Basic Execution Character Sets are clearly
   repertoires, but the Execution Character and Execution Wide-Character Sets
   are character sets (the standard cares about a value existing, not
   necessarily what that value is). And again, the existence of an encoding
   implies the existence of the corresponding character set.
   - The constraints on the values of "0" to "9" etc. in [lex.charset]
   actually apply to code *units*.

Did I miss anything?

Should I clarify further?

I hope it helps!

On Wed, 10 Jun 2020 at 21:59, Peter Brett via SG16 <sg16_at_[hidden]>
wrote:

Also pretty much the whole of [character.seq]
<http://eel.is/c++draft/character.seq> needs to be looked at.



*From:* SG16 <sg16-bounces_at_[hidden]> *On Behalf Of* Tom Honermann via SG16
*Sent:* 10 June 2020 20:28
*To:* SG16 <sg16_at_[hidden]>
*Cc:* Tom Honermann <tom_at_[hidden]>
*Subject:* [SG16] Terminology



I'm sending the following as a potential guide for discussion in today's
SG16 telecon. My apologies for the short notice.

The following lists "things" that we may need (new) names for. For those
already present in the standard, the current terms used are included in
parentheses. If you can think of others, please reply.

   - *The encoding of source files.*
   (Physical source file character set; [lex.phases]p1.1
   <http://eel.is/c++draft/lex.phases#1.1>)
   - *The source character repertoire.*
   (Basic source character set; [lex.charset]p1
   <http://eel.is/c++draft/lex.charset#1>)
   - *The compiler's internal character encoding.*
   (Internal encoding; [lex.phases]p1
   <http://eel.is/c++draft/lex.phases#1.1>)
   - *The character set requirements for the encoding of character and
   string literals.*
   (basic execution character set; [lex.charset]p3
   <http://eel.is/c++draft/lex.charset#3>)
   - *The character set requirements for the encoding of wide character and
   string literals.*
   (basic execution wide-character set; [lex.charset]p3
   <http://eel.is/c++draft/lex.charset#3>)
   - *The encoding of character and string literals.*
   (execution character set; [lex.charset]p3
   <http://eel.is/c++draft/lex.charset#3>)
   - *The encoding of wide character and string literals.*
   (execution wide-character set; [lex.charset]p3
   <http://eel.is/c++draft/lex.charset#3>)
   - *The encoding of character literals when used in conditional
   preprocessing directive.*
   (; [cpp.cond]p12
   <http://eel.is/c++draft/cpp.cond#12>)
   - *The encoding of wide character literals when used in conditional
   preprocessing directive.*
   (; [cpp.cond]p12
   <http://eel.is/c++draft/cpp.cond#12>)
   - *The encoding of file names.*
   (Native encoding; [fs.path.type.cvt]p1
   <http://eel.is/c++draft/fs.path.type.cvt#1>)
   - *The encoding of wide file names.*
   (Native encoding; [fs.path.type.cvt]p1
   <http://eel.is/c++draft/fs.path.type.cvt#1>)
   - *The Unicode character set.*
   (ISO/IEC 10646; [lex.charset]p2
   <http://eel.is/c++draft/lex.charset#2>)
   - *The encoding of characters and strings at run-time.*
   ()
   - *The terminal/console encoding*
   ()

Tom.

-- 
 SG16 mailing list
 SG16_at_[hidden]
 https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2020-06-11 08:25:35