
Re: [SG16-Unicode] [isocpp-core] What is the proper term for the locale dependent run-time character set/encoding used for the character classification and conversion functions?

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Thu, 15 Aug 2019 00:00:22 +0100
> For everyone except Niall, the 16-bit codepage is always UTF-16 (he'll tell
> you it's actually binary 16-bit, with no surrogate interpretation).

Oh, for the GUI layers, COM layers, font rendering, etc., it's UTF-16 alright.

For Win32 there's a reasonable attempt at UTF-16, with corner cases.

But for the NT kernel, it really is byte arrays, and nothing but byte
arrays. struct UNICODE_STRING takes a byte length for its wchar_t*
input, not a character length.
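
For illustration, here's a minimal user-mode sketch (assuming a Windows
SDK toolchain; UNICODE_STRING is declared in <winternl.h> for user mode)
showing that the Length field counts bytes, not wchar_t units:

    // Minimal sketch: UNICODE_STRING's Length field counts bytes, not
    // wchar_t units.
    #include <windows.h>
    #include <winternl.h>
    #include <cstdio>
    #include <cwchar>

    int main()
    {
        const wchar_t *text = L"hello";
        UNICODE_STRING us;
        us.Buffer = const_cast<PWSTR>(text);
        us.Length = static_cast<USHORT>(std::wcslen(text) * sizeof(wchar_t));   // bytes
        us.MaximumLength = static_cast<USHORT>(us.Length + sizeof(wchar_t));    // bytes, incl. terminator

        // Prints 10, not 5: five 16-bit code units at two bytes each.
        std::printf("Length = %u bytes\n", static_cast<unsigned>(us.Length));
    }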

The NT kernel provides a comprehensive suite of functions for testing
byte arrays against one another for equivalence or ordering according to
varying metrics. UTF-16 is but one of those comparison metrics; off the
top of my head there are also UTF-8, ASCII, ANSI, OEM, and of course
bitwise. The difference between this and other approaches is that said
byte arrays carry no guarantee of encoding correctness, and nothing
assumes that they do. Because lengths are specified in bytes, even a
code unit may be split.
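
You can poke at one of those metrics from user mode, since ntdll exports
it. A hedged sketch: RtlEqualUnicodeString() is documented for kernel
mode, so the prototype below is declared by hand and resolved at run
time, and note that nothing here validates the bytes as well-formed
UTF-16:

    // RtlEqualUnicodeString is a kernel-mode API, but ntdll.dll exports
    // it too, so declare the prototype by hand and resolve at run time.
    #include <windows.h>
    #include <winternl.h>
    #include <cstdio>
    #include <cwchar>

    using RtlEqualUnicodeString_t = BOOLEAN(NTAPI *)(
        const UNICODE_STRING *, const UNICODE_STRING *, BOOLEAN /*CaseInSensitive*/);

    static UNICODE_STRING make_us(const wchar_t *s)
    {
        UNICODE_STRING us;
        us.Buffer = const_cast<PWSTR>(s);
        us.Length = static_cast<USHORT>(std::wcslen(s) * sizeof(wchar_t));  // bytes
        us.MaximumLength = us.Length;
        return us;
    }

    int main()
    {
        auto RtlEqualUnicodeString = reinterpret_cast<RtlEqualUnicodeString_t>(
            GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "RtlEqualUnicodeString"));
        if (!RtlEqualUnicodeString)
            return 1;

        UNICODE_STRING a = make_us(L"ReadMe.TXT");
        UNICODE_STRING b = make_us(L"readme.txt");
        std::printf("case-sensitive:   %d\n", RtlEqualUnicodeString(&a, &b, FALSE));  // 0
        std::printf("case-insensitive: %d\n", RtlEqualUnicodeString(&a, &b, TRUE));   // 1
    }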

In terms of surrogate interpretation, recent NT kernels have fairly
up-to-date tables. Case-insensitive string comparison works by
converting the front part of each string to upper case and indexing it;
lookups shortlist the matches based on that upper-cased front part, and
then perform RtlEqualUnicodeString() on the candidates as needed.
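
To be clear, the following is not NT code, just a portable sketch of
that lookup shape: index by upper-cased prefix, shortlist candidates by
that prefix, then confirm with a full case-insensitive comparison.
towupper() stands in for the kernel's upcase tables, and PREFIX_UNITS is
a made-up tuning constant:

    // Illustrative only: upcase-prefix index with full comparison to confirm.
    #include <cstddef>
    #include <cwctype>
    #include <string>
    #include <unordered_map>

    constexpr std::size_t PREFIX_UNITS = 4;  // hypothetical prefix length

    static std::wstring upcase_prefix(const std::wstring &s)
    {
        std::wstring p = s.substr(0, PREFIX_UNITS);
        for (wchar_t &c : p)
            c = static_cast<wchar_t>(std::towupper(c));
        return p;
    }

    static bool equal_case_insensitive(const std::wstring &a, const std::wstring &b)
    {
        if (a.size() != b.size())
            return false;
        for (std::size_t i = 0; i < a.size(); ++i)
            if (std::towupper(a[i]) != std::towupper(b[i]))
                return false;
        return true;
    }

    struct Index
    {
        // Entries are indexed by their upper-cased front part.
        std::unordered_multimap<std::wstring, std::wstring> by_prefix;

        void insert(const std::wstring &name)
        {
            by_prefix.emplace(upcase_prefix(name), name);
        }

        // Shortlist by upper-cased prefix, then confirm each candidate
        // with a full comparison (the RtlEqualUnicodeString() step above).
        bool contains(const std::wstring &name) const
        {
            auto [lo, hi] = by_prefix.equal_range(upcase_prefix(name));
            for (auto it = lo; it != hi; ++it)
                if (equal_case_insensitive(it->second, name))
                    return true;
            return false;
        }
    };

    int main()
    {
        Index idx;
        idx.insert(L"ReadMe.TXT");
        return idx.contains(L"readme.txt") ? 0 : 1;  // exits 0: found
    }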

In that sense, yes, the NT kernel supports UTF-16. But if you accept
that sense, then it also supports UTF-8, ASCII, ANSI, OEM and so on;
which of those is chosen, and how it is implemented, depends on the
driver in question.

So really one ought to speak of a specific bit of the NT kernel, because
each bit has its own algorithm. Linux is similar in this regard, in fact
(though it hides the variance better), and Linux is in the process of
breaking the encoding handling out of individual drivers and into a
self-contained kernel layer. That will enforce much-needed consistency
across the Linux kernel, once everything gets ported over.

I can't say for sure, but given Windows 10's ever-improving UTF-8
support, I wouldn't be surprised if Windows goes mixed UTF-8/16 in the
near future. Something like NTFS will remain UTF-16, but other NT kernel
components could use UTF-8 and nothing would notice. The kernel itself
is agnostic.

If you're feeling brave, you can already legally configure your Windows
10 to default to UTF-8 for all 8-bit encodings, and it appears to
actually work. That means main() gets UTF-8 in argv, and so on, just
like on POSIX. I would assume (Billy?) that VS2019's runtime is UTF-8
clean for char by now.
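
If you want to check what you got, something like this (assuming Windows
10 1903 or later with the beta "Use Unicode UTF-8 for worldwide language
support" option turned on) should report 65001 and echo argv back as
UTF-8 bytes:

    // Probe the active code page; with the UTF-8 option on it is
    // CP_UTF8 (65001) process-wide.
    #include <windows.h>
    #include <cstdio>

    int main(int argc, char *argv[])
    {
        std::printf("ACP = %u (CP_UTF8 is %u)\n",
                    GetACP(), static_cast<unsigned>(CP_UTF8));
        // With the UTF-8 code page active, non-ASCII arguments arrive
        // here as UTF-8 bytes, just as they would on POSIX.
        for (int i = 1; i < argc; ++i)
            std::printf("argv[%d] = %s\n", i, argv[i]);
        return 0;
    }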

Niall

Received on 2019-08-15 01:00:24