Re: Core formal unicode classifications functions

From: Thiago Macieira <thiago_at_[hidden]>
Date: Sun, 03 Sep 2023 12:17:54 -0700

On Saturday, 2 September 2023 23:18:50 PDT Jens Maurer via SG16 wrote:
> If not, and those are your own inventions,
> is_lead, is_trail -> add _surrogate
> and maybe "is_leading_surrogate"

Everyone knows them as high and low surrogates. Those are poor names because
it's unclear which one comes first and which one comes second, but the names
are now established. Using different names means that anyone porting code over
would need to check which one is which, and that could lead to subtle mistakes.

Sticking to the standard Unicode names, however poor they are, is probably
best.
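
To make the ordering concrete, here is a minimal sketch using the Unicode
names. The function names are hypothetical and only for illustration, not
proposed wording; the ranges are the ones Unicode defines (high surrogates
come first in a pair, low surrogates second).

```cpp
// Hypothetical helper names, chosen to match the Unicode terminology.

// High surrogate: U+D800..U+DBFF, the FIRST code unit of a UTF-16 pair.
constexpr bool is_high_surrogate(char16_t c) noexcept
{
    return c >= 0xD800 && c <= 0xDBFF;
}

// Low surrogate: U+DC00..U+DFFF, the SECOND code unit of a UTF-16 pair.
constexpr bool is_low_surrogate(char16_t c) noexcept
{
    return c >= 0xDC00 && c <= 0xDFFF;
}

// Combines a pair into a code point. The caller is assumed to have checked
// is_high_surrogate(high) and is_low_surrogate(low) beforehand.
constexpr char32_t combine_surrogates(char16_t high, char16_t low) noexcept
{
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, the pair 0xD83D 0xDE00 combines to U+1F600.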

> Is there any 8-bit value >= 0xc2 that is NOT a lead byte?

Yes, quite a few:

0xfe & 0xff - never permitted
0xfc & 0xfd - 6-byte UTF-8 sequences (non-Unicode)
0xf8 - 0xfb - 5-byte UTF-8 sequences (non-Unicode)
0xf5 - 0xf7 - 4-byte UTF-8 sequences outside Unicode range (above 0x10FFFF)

Whether 0xf5 through 0xfd can be considered "non lead byte" or just a failed
decode is a different story. But it's the same determination as 0xc0 and 0xc1.
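
A rough sketch of one way to express that determination, assuming a
classifier that keeps the questionable ranges as separate categories so the
caller can decide whether they count as lead bytes. The enum and function
names are made up for this example.

```cpp
// Hypothetical classification of a single UTF-8 code unit.
enum class utf8_byte_class {
    ascii,              // 0x00 - 0x7f
    continuation,       // 0x80 - 0xbf
    overlong_lead,      // 0xc0 - 0xc1: would start an overlong 2-byte sequence
    lead2,              // 0xc2 - 0xdf
    lead3,              // 0xe0 - 0xef
    lead4,              // 0xf0 - 0xf4
    out_of_range_lead,  // 0xf5 - 0xfd: 4-, 5- and 6-byte forms beyond Unicode
    never_valid         // 0xfe - 0xff
};

constexpr utf8_byte_class classify_utf8_byte(unsigned char b) noexcept
{
    return b <= 0x7f ? utf8_byte_class::ascii
         : b <= 0xbf ? utf8_byte_class::continuation
         : b <= 0xc1 ? utf8_byte_class::overlong_lead
         : b <= 0xdf ? utf8_byte_class::lead2
         : b <= 0xef ? utf8_byte_class::lead3
         : b <= 0xf4 ? utf8_byte_class::lead4
         : b <= 0xfd ? utf8_byte_class::out_of_range_lead
         :             utf8_byte_class::never_valid;
}

// Strict reading: only bytes that can begin a well-formed Unicode sequence.
constexpr bool is_lead_byte_strict(unsigned char b) noexcept
{
    return b >= 0xc2 && b <= 0xf4;
}
```

Under the strict reading, 0xc0, 0xc1 and 0xf5 through 0xff are all rejected
the same way; a looser reading would accept overlong_lead and
out_of_range_lead as lead bytes and leave the failure to the decoder.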

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DCAI Cloud Engineering

Received on 2023-09-03 19:17:55