Re: [SG16] Terminology

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 14 Jun 2020 17:55:36 +0200
On Sun, 14 Jun 2020 at 17:39, Henri Sivonen via SG16 <sg16_at_[hidden]>
wrote:

> (Sorry about breaking threading: I got removed from the mailing list
> for bounces again, so I need to copypaste from the archive.)
> > At that point, there are multiple encodings describing the same character
> > set... and just like that, the notion diverged.
> (JIS X 0208 had multiple encodings before Unicode had multiple encodings.)

I wasn't aware, thanks
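
For anyone else who wasn't aware, a minimal sketch of that point: a single
JIS X 0208 code point, written as a (row, cell) pair, ends up as different
bytes in ISO-2022-JP, EUC-JP, and Shift_JIS. The type and function names are
made up for illustration, and escape sequences, ASCII, and vendor extensions
are ignored.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// A JIS X 0208 code point is a (row, cell) pair, each in 1..94.
// "Kuten" and the function names below are illustrative, not from any library.
struct Kuten {
    int row;
    int cell;
};

using TwoBytes = std::array<std::uint8_t, 2>;

inline bool in_range(Kuten k) {
    return k.row >= 1 && k.row <= 94 && k.cell >= 1 && k.cell <= 94;
}

// ISO-2022-JP, once JIS X 0208 has been selected: add 0x20 to each half.
inline TwoBytes to_iso2022jp(Kuten k) {
    assert(in_range(k));
    return {static_cast<std::uint8_t>(k.row + 0x20),
            static_cast<std::uint8_t>(k.cell + 0x20)};
}

// EUC-JP: add 0xA0 to each half.
inline TwoBytes to_eucjp(Kuten k) {
    assert(in_range(k));
    return {static_cast<std::uint8_t>(k.row + 0xA0),
            static_cast<std::uint8_t>(k.cell + 0xA0)};
}

// Shift_JIS: two rows share one lead byte; the trail byte says which row.
inline TwoBytes to_shift_jis(Kuten k) {
    assert(in_range(k));
    const int j1 = k.row + 0x20;
    const int j2 = k.cell + 0x20;
    const int s1 = ((j1 + 1) >> 1) + (j1 < 0x5F ? 0x70 : 0xB0);
    const int s2 = (j1 % 2 != 0) ? j2 + (j2 < 0x60 ? 0x1F : 0x20)
                                 : j2 + 0x7E;
    return {static_cast<std::uint8_t>(s1), static_cast<std::uint8_t>(s2)};
}
```

With row 4, cell 2 (HIRAGANA LETTER A) this gives 24 22, A4 A2, and 82 A0
respectively: one coded character set, three encodings.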

> > AFAICT, Unicode / Universal Coded Character Set (different specification,
> > same character set), and GB18030 are the two character sets that have
> > multiple encodings and for which the distinction between encoded and
> > Coded Character Set matters
> It also matters for JIS X 0208 and, to a lesser practical extent, to
> other CJK coded character sets with the same structure. However, this
> isn't relevant to C++, because C++ doesn't need to deal with JIS X
> 0208 itself--only with converting its encodings to and from Unicode.

Yeah, we are just trying to talk about which terminology to use in the core

> > For any encoding there exists a character set - There are some subtlety
> > there, as GB18030 and Unicode are tantalizing close to being isomorphic but
> > not quite,
> Are you referring to GB18030 assigning semantics to a handful of
> Unicode private-use scalar values or to the de facto Web-flavored
> GB18030 encoding being unable to encode one Unicode scalar value?


> I'm a bit worried about C++ discussion putting too much weight on
> perceived distinctions between GB18030 and Unicode when de jure
> GB18030-the-encoding seeks to encode the Unicode scalar value space
> (with the de facto exception of one scalar value) and
> GB18030-the-repertoire identifies a subset of Unicode whose support
> can be evaluated for fonts, layout capability, and such, but that
> isn't relevant to carve out on the C++ level.


> > UTF-8 for example can encode either GB18030 or Unicode. But ignoring that
> > difference, 1 encoding => 1 character set
> (Arguably, various legacy CJK encodings encode at least two coded
> character sets: one ASCII-ish and one grid-like. Some encode more
> coded character sets.)
> > Multiple different abstract characters can be assigned the same value -
> > this is notably the Han unification.
> Isn't it rather the point that characters that have locale-dependent
> glyph variation were analyzed to be the _same_ abstract character in
> Han unification?

Yes, but that analysis was not always correct, afaik (they are actively
adding characters to fix these mistakes).

> > As such there exist no character set which is not a coded character set,
> > and while it might be useful to define character set properly somewhere
> > one, I am not sure the distinction is ever necessary for our purpose.
> I agree the distinction doesn't matter for C++.
> > With the exception of Unicode and GB18030, a text encoding is also a
> > mapping to a character repertoire, as the character set and the character
> > repertoires are isomorphic.
> How are Unicode and GB18030 different in this regard if you view the
> set of possible Unicode scalar values as the repertoire, some of which
> are unassigned? Legacy encodings also have unassigned code space.

Do they? Or are they just invalid code unit sequences?

> > Character Encoding, Character Encoding Form, And Character Encoding Scheme
> >
> > These are Unicode specific terms, which I do not think we care about much,
> > and exist because Unicode defines encoding with different endianness:
> >
> > They first map a codepoint to a sequence of *code units* (where code units
> > are 8, 16, or 32 bits), then convert these to a sequence of 8 bits bytes
> > applying byte swapping to obtain the desired endian order.
> I think this analysis, while suggested by Unicode, leads to confusion
> about the non-UTF-8 cases and, worse, can lead API design astray.

I agree
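
To make the two-step model quoted above concrete, here is a rough sketch for
UTF-16 only: a scalar value is first mapped to 16-bit code units, and the
code units are then laid out as bytes in a chosen order. The function names
are invented for this example, and it is an illustration of the model being
discussed, not a proposed API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Step 1: Unicode scalar value -> UTF-16 code units (one or two of them).
inline std::vector<char16_t> to_utf16_units(char32_t scalar) {
    // Precondition: a scalar value, i.e. in range and not a surrogate.
    assert(scalar <= 0x10FFFF && !(scalar >= 0xD800 && scalar <= 0xDFFF));
    if (scalar < 0x10000) {
        return {static_cast<char16_t>(scalar)};
    }
    const char32_t v = scalar - 0x10000;
    return {static_cast<char16_t>(0xD800 + (v >> 10)),
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))};
}

// Step 2a: code units -> bytes, least significant byte first (UTF-16LE).
inline std::vector<std::uint8_t> to_utf16le_bytes(const std::vector<char16_t>& units) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t u : units) {
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));
    }
    return bytes;
}

// Step 2b: code units -> bytes, most significant byte first (UTF-16BE).
inline std::vector<std::uint8_t> to_utf16be_bytes(const std::vector<char16_t>& units) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t u : units) {
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
    }
    return bytes;
}
```

For U+1F600 the code units are D83D DE00; the little-endian bytes are
3D D8 00 DE and the big-endian bytes are D8 3D DE 00.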

> Specifically, the domain modeling error that I'd like C++ to avoid is
> to have an API with unaligned 8-bit units that identifies an encoding
> as the pair (UTF-16, little-endian) rather than identifying it as the
> single item UTF-16LE.
> I think a more useful way of looking at these is:
> A character encoding scheme always uses 8-bit bytes as its code unit,
> has no alignment requirements, and, therefore, can be used for
> byte-based I/O.
> A character encoding form can have code units that are larger than a
> byte (and aligned accordingly) and, when that is the case, can exist
> in RAM but are unsuited for byte-based I/O.
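
One possible reading of the modeling suggested above, as a sketch: the
byte-oriented encodings (encoding schemes) are identified by a single label
each, input is a plain sequence of unaligned bytes, and the result is a
sequence of char16_t code units (encoding form) living in RAM. ByteEncoding
and decode_to_utf16 are hypothetical names, and surrogate pairing is
deliberately not validated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical label type: UTF-16LE and UTF-16BE are each a single item,
// not a pair such as (UTF-16, little-endian).
enum class ByteEncoding { utf16le, utf16be };

// Decodes unaligned bytes into UTF-16 code units held in memory.
// Sketch only: a trailing odd byte becomes U+FFFD, and unpaired
// surrogates are passed through unchecked.
inline std::u16string decode_to_utf16(ByteEncoding encoding,
                                      const std::vector<std::uint8_t>& bytes) {
    assert(encoding == ByteEncoding::utf16le ||
           encoding == ByteEncoding::utf16be);
    const bool little = (encoding == ByteEncoding::utf16le);
    std::u16string units;
    units.reserve(bytes.size() / 2 + 1);
    std::size_t i = 0;
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned first = bytes[i];
        const unsigned second = bytes[i + 1];
        const unsigned value = little ? (first | (second << 8))
                                      : ((first << 8) | second);
        units.push_back(static_cast<char16_t>(value));
    }
    if (i < bytes.size()) {
        units.push_back(u'\uFFFD');  // incomplete final code unit
    }
    return units;
}
```

Nothing in the signature exposes 16-bit groups on the byte side; the caller
only sees bytes going in and code units coming out.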

We are talking about terminology for the core language.
I did not address some comments below, as they are (very relevant) library
concerns that do not seem useful to describe the encoding of the various
character and string literals in a compiled

> Unfortunately, Unicode gives overlapping names so that there is an
> encoding form called UTF-16 and an encoding scheme called UTF-16 in
> addition to the encoding schemes called UTF-16LE and UTF-16BE.
> The way Microsoft implemented things led to the different reality that
> is documented in the Encoding Standard for the Web Platform and
> probably is the applicable reality to Microsoft-influenced non-Web
> things as well. See the note in the "Encodings" section of the
> Encoding Standard:
> https://encoding.spec.whatwg.org/#encodings
> > I do not think these distinctions matter in the standard at all - and I
> > recommend using the term *character encoding *(which applies to all
> > character encodings, whereas CEF/CES are Unicode specific), BUT we may want
> > to specify the endianness of UTF-16 and UTF-32 to be implementation-defined.
> I disagree. I think making the distinction that encoding schemes are
> what you use for I/O and encoding forms are what you use in RAM is
> important if you want to get APIs right.
> Also, in RAM operations generally don't want to remove the BOM, but
> I/O operations do.
> > A code unit is the minimal unit that can represent a character in a
> > multi-byte encoding (7 for ASCII, 8 for utf8, 16 for utf16, etc)
> I think this definition can lead API domain modeling astray when
> applied to UTF-16LE or UTF-32LE. Specifically an API that deals with
> UTF-16LE (among other encodings) should not care at all in the API
> signature about 16-bit groups having any internal significance in
> UTF-16LE. (The Unicode Glossary supports your definition but leaves
> the application of the definition to encoding schemes as fuzzy.)
> I think it makes sense to either only define "code unit" for encoding
> forms or to say that it's the smallest addressable unit (so that a
> code unit for UTF-16LE can be 8 bits even if it's too small to
> represent any single character in UTF-16LE).
> > *Code units* and *Code points* are Unicode terms, which can be used to
> > describe any encoding, including non-Unicode encodings.
> Sadly, Unicode made a mess of "code point" by introducing surrogates,
> so that when discussing Unicode, most often the right thing is "scalar
> value". This suggests that non-Unicode should be discussed as scalar
> values, except 1) it's not customary and 2) code points in JIS X 0208
> and inspired code character sets are not scalars but pairs of scalars
> (row and column). :-(

Yes, I did expect that to come up sooner in the discussion!
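
For reference, the code point / scalar value distinction mentioned above
fits in a few lines. This is only a sketch with made-up function names:
every surrogate is a code point, but no surrogate is a scalar value.

```cpp
// Any value in the Unicode codespace, surrogates included.
constexpr bool is_code_point(char32_t v) { return v <= 0x10FFFF; }

// Code points reserved for UTF-16 surrogate pairs.
constexpr bool is_surrogate(char32_t v) { return v >= 0xD800 && v <= 0xDFFF; }

// A scalar value is a code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t v) {
    return is_code_point(v) && !is_surrogate(v);
}

static_assert(is_code_point(0xD800) && !is_scalar_value(0xD800),
              "a surrogate is a code point but not a scalar value");
static_assert(is_scalar_value(0x1F600), "U+1F600 is a scalar value");
static_assert(!is_code_point(0x110000), "outside the codespace");
```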

> > - *Abstract character *is useful when talking about conversion between
> > character sets. This is notably the case in phase one where "physical
> > source file characters" and "The set of physical source file characters"
> > do, I believe, refer to abstract characters and character repertoire
> > respectively. This might change if we want to say something specific about
> > UTF-8 and normalization form. But talking about "Abstract Character
> > Sequence" here lets us not care at all about memory representation. A jpg
> > of text is still an abstract character sequence.
> I think "abstract character" is a distraction in specs whose core
> domain isn't analyzing what text units to assign numbers for. Since
> C++ isn't in that business, it is probably more worthwhile to find
> ways not to talk about "abstract character" in the C++ spec _at all_
> but to talk only about Unicode scalar values and processes that map
> sequences of bytes, char8_t, char16_t or char32_t to sequences of
> Unicode scalar values and vice versa.

Agreed, but it might be useful in phase 1 of translation, which deals with
different character repertoires, and where we might want to avoid
mandating that source files are actually source files haha.
I don't believe it's useful elsewhere, especially as there is some
agreement that conversion to string literals during compilation
should be done on a per code point basis rather than a per grapheme basis.
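
A small sketch of what "per code point" means in practice, using ISO-8859-1
as a stand-in target encoding (my choice for the example, not something
implied by the discussion). Each code point is converted on its own, so
U+00E9 converts, while the sequence U+0065 U+0301, which is the same
grapheme, fails because U+0301 has no ISO-8859-1 equivalent and nothing
composes the two first. The function name is hypothetical, and std::optional
assumes C++17.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Converts a sequence of Unicode scalar values to ISO-8859-1, one code
// point at a time. No normalization or grapheme-level matching is applied.
inline std::optional<std::string> to_latin1_per_code_point(const std::u32string& text) {
    std::string out;
    out.reserve(text.size());
    for (char32_t cp : text) {
        // Precondition: each element is a scalar value.
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp > 0xFF) {
            return std::nullopt;  // not representable in the target encoding
        }
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}
```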

