Re: [SG16-Unicode] Questions about some corner cases of proposed std::basic_text encoding implementation

From: Lyberta <lyberta_at_[hidden]>
Date: Sat, 02 Nov 2019 12:11:00 +0000
Ansel Sermersheim:
> 1) There was some discussion about whether or not char32_t is guaranteed
> to be a Unicode Code Point. JeanHeyd pointed me to
> https://wg21.link/p1041, which makes it clear that for string literals
> at least this is guaranteed.

Yes, char32_t is a bad type, along with char8_t and char16_t, because
none of them guarantees well-formed contents. For that reason I'm
proposing strong types with proper guarantees:

https://github.com/Lyberta/cpp-unicode/blob/master/Fundamental.md

You can also put ill-formed Unicode into string literals via numeric
escape sequences, which is bad as well.
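
To illustrate the escape-sequence problem, here is a minimal sketch,
assuming C++17. The helpers is_scalar_value and all_scalar_values are
hypothetical names made up for this example; they are not part of any
proposal. The hex escape \xD800 puts a lone surrogate into a char32_t
literal, which the compiler accepts even though it is not a Unicode
scalar value.

```cpp
#include <cstddef>

// Hypothetical helper (illustration only): true if `c` is a Unicode
// scalar value, i.e. in [0, 0x10FFFF] and not a surrogate.
constexpr bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical helper: checks every code unit of a char32_t literal,
// skipping the terminating null.
template <std::size_t N>
constexpr bool all_scalar_values(const char32_t (&s)[N]) {
    for (std::size_t i = 0; i + 1 < N; ++i) {
        if (!is_scalar_value(s[i])) return false;
    }
    return true;
}

// A hex escape stores the raw value: this literal holds a lone surrogate.
constexpr char32_t lone_surrogate[] = U"\xD800";

// A universal-character-name must name a valid code point.
constexpr char32_t well_formed[] = U"\U0001F600";
```

With these definitions, all_scalar_values(well_formed) holds and
all_scalar_values(lone_surrogate) does not, so the type char32_t alone
tells you nothing about the validity of the contents.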

>
> However, this is not sufficiently specified for all cases. For instance,
> a GB 18030 encoding *must* use codepoints in the PUA. If a string
> literal contains a PUA code point, how can you know the interpretation?
> Making this a compile error seems problematic, but the right answer is
> not clear to me.

This can probably be solved by a custom instance of
std::unicode::character_database.
>
> 2) The issue of PUA usage also comes up in the implementation of
> Encoding Objects. It seems likely that the current direction will
> necessitate some third party library to handle encodings other than the
> main UTF ones. That seems reasonable. But without some sort of standard
> mechanism that at least enumerates other common interpretations, and
> allows third party libraries to declare their support for such, there
> will be a combinatorial explosion of mutually incompatible encodings.

I think providing conversions to and from Unicode scalar values is enough.
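
As a rough sketch of what I mean, assuming C++17: the type
latin1_encoding and its members decode and encode are hypothetical
names for illustration, not the interface of the proposed encoding
objects. It also only covers an encoding with one code unit per
character; a real one would need multi-unit sequences and error
handling.

```cpp
#include <optional>

// Hypothetical encoding object (illustration only).
struct latin1_encoding {
    using code_unit = unsigned char;

    // Every Latin-1 code unit maps to the scalar value with the same
    // numeric value (U+0000..U+00FF).
    static constexpr char32_t decode(code_unit u) {
        return static_cast<char32_t>(u);
    }

    // Only U+0000..U+00FF can be represented; anything else yields
    // an empty optional.
    static constexpr std::optional<code_unit> encode(char32_t scalar) {
        if (scalar > 0xFF) return std::nullopt;
        return static_cast<code_unit>(scalar);
    }
};
```

If every third-party encoding offers this pair of conversions, any two
of them can interoperate by going through scalar values, without
knowing about each other.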

>
> 3) By a similar construction and often overlapping concerns, the
> availability of a standardized way for encodings to declare which
> version of Unicode they support is quite important. It's also not clear
> how some of the round trip encodings can possibly be fully specified in
> the type system. For example, how could I properly encode "UTF-8 Unicode
> version 10" text containing emoji into "UTF-16 Unicode version 5" text
> using the PUA for representation for display on OS X 10.7?

Different versions of Unicode and PUA are a job for
std::unicode::character_database.

>
> 4) The behavior of std::basic_text with respect to null termination is
> valid but seems potentially risky. As I understand it, std::basic_text
> will be null terminated if the underlying container is the default
> std::basic_string. This seems likely to result in encoding
> implementations which inadvertently assume null termination on their
> operands. Our work on early versions of the CsString library persuaded
> us that optional null termination is the source of some really obscure
> bugs of the buffer overrun variety, and we eventually elected to force
> null termination for all strings.

I think null termination is just bad design. Pointer + length is the way
to go.
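
A small sketch of the difference, assuming C++17 (the variable names
are made up for this example). The buffer contains an embedded null
character, which a null-terminated interpretation treats as the end of
the string, while a pointer + length view such as std::string_view
keeps all five code units.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Five code units of payload, one of them a null character, plus the
// terminating null added by the literal.
constexpr char raw[] = "ab\0cd";

// Null-terminated interpretation: scan until the first '\0'.
constexpr std::size_t c_length = std::char_traits<char>::length(raw);

// Pointer + length interpretation: the length is carried explicitly.
constexpr std::string_view view(raw, sizeof(raw) - 1);
```

Here c_length is 2 and view.size() is 5, so code that assumes null
termination silently drops "cd".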


Received on 2019-11-02 13:11:27