sg16

Re: Core formal unicode classifications functions

From: Thiago Macieira <thiago_at_[hidden]>
Date: Sun, 03 Sep 2023 19:00:41 -0700
On Sunday, 3 September 2023 17:59:21 PDT Steve Downey wrote:
> I know I don't mess up checking for lead before trail, and have to look up
> high vs low. But I don't really have a strong opinion, just an appeal to
> authority.

I understand, sympathise and even agree that their names are better and more
descriptive. Even to reply to this thread, I had to look up code where I knew
this was in use, to refresh my memory.

However, those aren't the names Unicode gave them. They are clearly labelled
"High Surrogates" and "Low Surrogates" in the charts:
https://www.unicode.org/charts/

And the fact that the High Surrogates have lower code unit values means that
the name can't be about values, but something else. Instead, they carry the
high-order bits of the code point (after subtracting 0x10000) and, like most
encodings and protocols designed without thought to the prevailing hardware,
they come first in memory order (Big Endian).
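
To make the "high portion" point concrete, here is a minimal sketch of the
split. The function names are hypothetical, chosen only for this example, and
are not taken from any paper or existing API.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical names, for illustration only.
constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Splits a supplementary-plane code point into {high, low} surrogates.
// Precondition: 0x10000 <= cp <= 0x10FFFF.
constexpr std::pair<char16_t, char16_t> to_surrogates(char32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000; // 20-bit value
    return {
        static_cast<char16_t>(0xD800 + (v >> 10)),   // high: upper 10 bits, stored first
        static_cast<char16_t>(0xDC00 + (v & 0x3FF))  // low: lower 10 bits, stored second
    };
}
```

For U+1F600 this gives 0xD83D followed by 0xDE00: the "high" unit has the
smaller numeric value but holds the more significant bits.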

(Note that this effectively means we have a PDP or Boustrophedon order when
encoding UTF-16 in Little Endian.)
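
As a rough illustration of that mixed order, the sketch below serialises a
surrogate pair as UTF-16LE bytes. The name utf16le_bytes is hypothetical and
exists only for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: UTF-16LE bytes of a supplementary-plane code point.
// Precondition: 0x10000 <= cp <= 0x10FFFF.
constexpr std::array<std::uint8_t, 4> utf16le_bytes(char32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v    = static_cast<std::uint32_t>(cp) - 0x10000;
    const std::uint32_t high = 0xD800 + (v >> 10);   // bits 19..10 of v
    const std::uint32_t low  = 0xDC00 + (v & 0x3FF); // bits  9..0  of v
    return {{
        static_cast<std::uint8_t>(high & 0xFF), // bits 17..10 of v
        static_cast<std::uint8_t>(high >> 8),   // 0xD8 | bits 19..18
        static_cast<std::uint8_t>(low & 0xFF),  // bits  7..0
        static_cast<std::uint8_t>(low >> 8)     // 0xDC | bits  9..8
    }};
}
```

For U+1F600 the bytes are 3D D8 00 DE. The code units run most significant
first, while the bytes inside each unit run least significant first, so the
payload bits are in neither purely ascending nor purely descending order.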

I am all for using names that convey meaning and use in APIs, but I am also
for using established practices where cross-domain knowledge is useful. So the
question is whether having a different name from other APIs benefits us more
than it detracts. For example, someone who already knows UTF-16 encoding might
search the API for "high" and find nothing; searching for "surrogate" would
present them with two terms they aren't familiar with, leading and trailing.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DCAI Cloud Engineering

Received on 2023-09-04 02:00:43