sg16: Re: [SG16] Terminology

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Sun, 14 Jun 2020 12:00:03 +0300

(Sorry about breaking threading: I got removed from the mailing list
for bounces again, so I need to copypaste from the archive.)

> At that point, there are multiple encodings describing the same character
> set... and just like that, the notion diverged.

(JIS X 0208 had multiple encodings before Unicode had multiple encodings.)

> AFAICT, Unicode / Universal Coded Character Set (different specification,
> same character set), and GB18030 are the two character sets that have
> multiple encodings and for which the distinction between encoded and
> Coded Character Set matters

It also matters for JIS X 0208 and, to a lesser practical extent, for
other CJK coded character sets with the same structure. However, this
isn't relevant to C++, because C++ doesn't need to deal with JIS X
0208 itself--only with converting its encodings to and from Unicode.
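
To illustrate what "multiple encodings of one coded character set"
means here, a minimal sketch follows. It maps one JIS X 0208 row/cell
pair to its two-byte forms in EUC-JP and Shift_JIS. The names Kuten,
to_euc_jp, and to_shift_jis are hypothetical, and no range checking
is done (row and cell are assumed to be in 1..94).

```cpp
#include <array>

// A JIS X 0208 code point is a (row, cell) pair, each in 1..94.
struct Kuten { unsigned row; unsigned cell; };

// EUC-JP: add 0xA0 to both the row and the cell.
constexpr std::array<unsigned char, 2> to_euc_jp(Kuten k) {
    return {static_cast<unsigned char>(k.row + 0xA0),
            static_cast<unsigned char>(k.cell + 0xA0)};
}

// Shift_JIS: two rows share one lead byte; the trail byte skips 0x7F.
constexpr std::array<unsigned char, 2> to_shift_jis(Kuten k) {
    const unsigned lead = (k.row - 1) / 2 + (k.row <= 62 ? 0x81 : 0xC1);
    const unsigned trail = (k.row % 2 == 1)
        ? k.cell + (k.cell >= 64 ? 0x40 : 0x3F)  // odd row
        : k.cell + 0x9E;                         // even row
    return {static_cast<unsigned char>(lead),
            static_cast<unsigned char>(trail)};
}
```

For row 4, cell 2 (HIRAGANA LETTER A) this gives A4 A2 in EUC-JP and
82 A0 in Shift_JIS: same coded character, two byte-level encodings.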

> For any encoding there exists a character set - There are some subtlety
> there, as GB18030 and Unicode are tantalizing close to being isomorphic but
> not quite,

Are you referring to GB18030 assigning semantics to a handful of
Unicode private-use scalar values or to the de facto Web-flavored
GB18030 encoding being unable to encode one Unicode scalar value?

I'm a bit worried about the C++ discussion putting too much weight on
perceived distinctions between GB18030 and Unicode. De jure,
GB18030-the-encoding seeks to encode the Unicode scalar value space
(with the de facto exception of one scalar value).
GB18030-the-repertoire identifies a subset of Unicode whose support
can be evaluated for fonts, layout capability, and such, but that
subset isn't relevant to carve out at the C++ level.

> UTF-8 for example can encode either GB18030 or Unicode. But ignoring that
> difference, 1 encoding => 1 character set

(Arguably, various legacy CJK encodings encode at least two coded
character sets: one ASCII-ish and one grid-like. Some encode more
coded character sets.)

> Multiple different abstract characters can be assigned the same value -
> this is notably the Han unification.

Isn't it rather the point that characters that have locale-dependent
glyph variation were analyzed to be the _same_ abstract character in
Han unification?

> As such there exist no character set which is not a coded character set,
> and while it might be useful to define character set properly somewhere
> one, I am not sure the distinction is ever necessary for our purpose.

I agree the distinction doesn't matter for C++.

> With the exception of Unicode and GB18030, a text encoding is also a
> mapping to a character repertoire, as the character set and the character
> repertoires are isomorphic.

How are Unicode and GB18030 different in this regard if you view the
set of possible Unicode scalar values as the repertoire, some of which
are unassigned? Legacy encodings also have unassigned code space.

> Character Encoding, Character Encoding Form, And Character Encoding Scheme
>
> These are Unicode specific terms, which I do not think we care about much,
> and exist because Unicode defines encoding with different endianness:
>
> They first map a codepoint to a sequence of *code units* (where code units
> are 8, 16, or 32 bits), then convert these to a sequence of 8 bits bytes
> applying byte swapping to obtain the desired endian order.

I think this analysis, while suggested by Unicode, leads to confusion
about the non-UTF-8 cases and, worse, can lead API design astray.

Specifically, the domain modeling error that I'd like C++ to avoid is
to have an API with unaligned 8-bit units that identifies an encoding
as the pair (UTF-16, little-endian) rather than identifying it as the
single item UTF-16LE.

I think a more useful way of looking at these is:

A character encoding scheme always uses 8-bit bytes as its code unit,
has no alignment requirements, and, therefore, can be used for
byte-based I/O.

A character encoding form can have code units that are larger than a
byte (and aligned accordingly) and, when that is the case, can exist
in RAM but is unsuited for byte-based I/O.

Unfortunately, Unicode gives overlapping names so that there is an
encoding form called UTF-16 and an encoding scheme called UTF-16 in
addition to the encoding schemes called UTF-16LE and UTF-16BE.
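
A minimal sketch of the scheme/form split described above, under
stated assumptions: ByteEncoding and to_utf16_form are hypothetical
names, not taken from any proposal. The byte-facing side names each
encoding scheme as a single item, and the result is the UTF-16
encoding form as aligned char16_t code units.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical identifiers: each encoding scheme is one item, not a
// (form, byte order) pair. A real list would also name UTF-8 etc.
enum class ByteEncoding { Utf16Le, Utf16Be };

// Converts unaligned bytes (an encoding scheme) into aligned char16_t
// code units (the UTF-16 encoding form). Surrogates are not validated.
std::u16string to_utf16_form(ByteEncoding enc,
                             const std::vector<unsigned char>& bytes) {
    std::u16string out;
    std::size_t i = 0;
    for (; i + 1 < bytes.size(); i += 2) {
        const bool le = (enc == ByteEncoding::Utf16Le);
        const unsigned lo = le ? bytes[i] : bytes[i + 1];
        const unsigned hi = le ? bytes[i + 1] : bytes[i];
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    if (i < bytes.size()) {
        out.push_back(u'\uFFFD');  // dangling odd byte
    }
    return out;
}
```

Note that the signature never exposes 16-bit groups on the input
side; the caller hands over bytes and an encoding name only.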

The way Microsoft implemented things led to a different reality, one
that is documented in the Encoding Standard for the Web Platform and
probably is the applicable reality for Microsoft-influenced non-Web
things as well. See the note in the "Encodings" section of the
Encoding Standard:
https://encoding.spec.whatwg.org/#encodings

> I do not think these distinctions matter in the standard at all - and I
> recommend using the term *character encoding *(which applies to all
> character encodings, whereas CEF/CES are Unicode specific), BUT we may want
> to specify the endianness of UTF-16 and UTF-32 to be implementation-defined.

I disagree. I think making the distinction that encoding schemes are
what you use for I/O and encoding forms are what you use in RAM is
important if you want to get APIs right.

Also, in-RAM operations generally don't want to remove the BOM, but
I/O operations do.
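
As a rough illustration of BOM handling living at the I/O boundary,
here is a sketch of a byte-level BOM sniffer. SniffedEncoding,
BomMatch, and sniff_bom are hypothetical names. The sketch assumes
only UTF-8 and UTF-16 BOMs are recognized, so a UTF-32LE BOM would be
reported as UTF-16LE.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

enum class SniffedEncoding { Utf8, Utf16Le, Utf16Be };

struct BomMatch {
    SniffedEncoding encoding;
    std::size_t length;  // number of BOM bytes an I/O decoder skips
};

// Inspects the first bytes of an input stream. Only the I/O layer
// calls this; code working on text already in RAM leaves U+FEFF alone.
std::optional<BomMatch> sniff_bom(const std::vector<unsigned char>& b) {
    if (b.size() >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return BomMatch{SniffedEncoding::Utf8, 3};
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return BomMatch{SniffedEncoding::Utf16Le, 2};
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return BomMatch{SniffedEncoding::Utf16Be, 2};
    return std::nullopt;
}
```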

> A code unit is the minimal unit that can represent a character in a
> multi-byte encoding (7 for ASCII, 8 for utf8, 16 for utf16, etc)

I think this definition can lead API domain modeling astray when
applied to UTF-16LE or UTF-32LE. Specifically an API that deals with
UTF-16LE (among other encodings) should not care at all in the API
signature about 16-bit groups having any internal significance in
UTF-16LE. (The Unicode Glossary supports your definition but leaves
the application of the definition to encoding schemes as fuzzy.)

I think it makes sense to either only define "code unit" for encoding
forms or to say that it's the smallest addressable unit (so that a
code unit for UTF-16LE can be 8 bits even if it's too small to
represent any single character in UTF-16LE).

> *Code units* and *Code points* are Unicode terms, which can be used to
> describe any encoding, including non-Unicode encodings.

Sadly, Unicode made a mess of "code point" by introducing surrogates,
so that when discussing Unicode, most often the right term is "scalar
value". This suggests that non-Unicode should be discussed in terms of
scalar values, except 1) it's not customary and 2) code points in JIS
X 0208 and the coded character sets inspired by it are not scalars but
pairs of scalars (row and column). :-(
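
The code point vs. scalar value distinction can be stated in a few
lines. This is a sketch with hypothetical function names; the ranges
themselves are the ones Unicode defines.

```cpp
// Code points span U+0000..U+10FFFF, surrogates included.
constexpr bool is_code_point(char32_t c) { return c <= 0x10FFFF; }

// Surrogates (U+D800..U+DFFF) are code points but not scalar values.
constexpr bool is_surrogate(char32_t c) {
    return c >= 0xD800 && c <= 0xDFFF;
}

// A scalar value is any code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) {
    return is_code_point(c) && !is_surrogate(c);
}

static_assert(is_code_point(0xD800) && !is_scalar_value(0xD800),
              "a surrogate is a code point but not a scalar value");
static_assert(is_scalar_value(0x10FFFF) && !is_scalar_value(0x110000),
              "scalar values end at U+10FFFF");
```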

> - *Abstract character *is useful when talking about conversion between
> character sets. This is notably the case in phase one where "physical
> source file characters" and "The set of physical source file characters"
> do, I believe, refer to abstract characters and character repertoire
> respectively. This might change if we want to say something specific about
> UTF-8 and normalization form. But talking about "Abstract Character
> Sequence" here lets us not care at all about memory representation. A jpg
> of text is still an abstract character sequence.

I think "abstract character" is a distraction in specs whose core
domain isn't analyzing what text units to assign numbers for. Since
C++ isn't in that business, it is probably more worthwhile to find
ways not to talk about "abstract character" in the C++ spec _at all_
but to talk only about Unicode scalar values and processes that map
sequences of bytes, char8_t, char16_t or char32_t to sequences of
Unicode scalar values and vice versa.
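
One such process, sketched under assumptions: the function name is
hypothetical, and replacing each unpaired surrogate with U+FFFD is
one common error-handling policy rather than the only possible one.
It maps a char16_t sequence to a sequence of Unicode scalar values.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Maps UTF-16 code units to Unicode scalar values (held in char32_t).
// Each unpaired surrogate becomes U+FFFD REPLACEMENT CHARACTER.
std::u32string utf16_to_scalar_values(std::u16string_view in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t u = in[i];
        const bool high = (u >= 0xD800 && u <= 0xDBFF);
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            const char32_t lo = in[i + 1];
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
            ++i;  // the low surrogate has been consumed
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            out.push_back(0xFFFD);  // unpaired surrogate
        } else {
            out.push_back(u);  // BMP scalar value, copied as-is
        }
    }
    return out;
}
```

The output never contains a surrogate, so every element is a scalar
value and downstream code needs no "code point" caveats.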

-- 
Henri Sivonen
hsivonen_at_[hidden]
https://hsivonen.fi/

Received on 2020-06-14 04:03:28