Subject: Re: [SG16-Unicode] Questions about some corner cases of proposed std::basic_text encoding implementation
From: JeanHeyd Meneide (phdofthehouse_at_[hidden])
Date: 2019-11-08 18:21:42


Dear Lyberta and Ansel,

     Thank you for the comments sent here! My responses are below. Let me
know if I have misunderstood anything, as I am not sure I captured all of
the concerns properly.

On Sat, Nov 2, 2019 at 12:11 PM Lyberta <lyberta_at_[hidden]> wrote:

> Ansel Sermersheim:
> > 1) There was some discussion about whether or not char32_t is guaranteed
> > to be a Unicode Code Point. JeanHeyd pointed me to
> > https://wg21.link/p1041, which makes it clear that for string literals
> > at least this is guaranteed.
>
> Yes, char32_t is a bad type along with char8_t and char16_t. For that
> reason I'm proposing strong types with proper guarantees:
>
> https://github.com/Lyberta/cpp-unicode/blob/master/Fundamental.md
>
> You can put ill-formed Unicode in string literals via escape codes. This
> is also bad.
>

Responding to both bites at once: you can do bad things with escape codes,
but I am okay with that because you've written it using _escape codes_.
Non-escaped text and \u and \U qualified text is mandated to be well-formed
(and people are working on tightening the constraints here). I am perfectly
okay with people having back doors and ways out, as long as the back doors
and ways out are sufficiently grep-able / easy to identify.
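
As a small illustration of the difference (a sketch only; the variable
names are made up for this example): numeric escapes set code unit values
directly, so they can smuggle in ill-formed data, while \u and \U have to
name a valid code point.

```cpp
// Numeric escapes (\x...) write code unit values directly, so the compiler
// accepts a lone surrogate here even though it is not well-formed UTF-32.
// The "\x" is the grep-able marker that something unusual is going on.
constexpr char32_t raw_escape[] = U"\xD800";

// \U names a code point, and that code point has to be valid.
constexpr char32_t named[] = U"\U0001F600";

// Ill-formed if uncommented: a universal-character-name may not name a
// surrogate code point.
// constexpr char32_t rejected[] = U"\U0000D800";
```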

> >
> > However, this is not sufficiently specified for all cases. For instance,
> > a GB 18030 encoding *must* use codepoints in the PUA. If a string
> > literal contains a PUA code point, how can you know the interpretation?
> > Making this a compile error seems problematic, but the right answer is
> > not clear to me.
>
> Can probably be solved by custom instance of
> std::unicode::character_database.
>

I use char32_t solely because SG16 has not fully solved this problem yet.
There was great interest in providing strong types for
unicode_code_point; even in my implementation, I use the name
unicode_code_point. Right now it defaults to an alias of char32_t since we
have not fully decided which direction is worth taking here:

   - people with field experience and existing codebases want char32_t /
   uint32_t here;
   - previous implementations of text_view/text from Tom use an
   implementation where each encoding gets its own strong code point type;
   - other implementations just use char32_t and deem that to be fine
   enough.

I am heavily leaning towards char32_t representing a UTF-32 code point, with
the caveat that certain encodings which may use 32-bit types to represent
their PUA characters will still be able to define a strong code_point type on
their encoding and then use the various levers present in the paper and
implementation to make it clear that their "code point" is different from
the typical unicode code point, because it carries different semantic
meaning. For example, a gb18030 and gb18030_web encoding object would use a
gb_code_point type, which is more or less morally equivalent to a char32_t
save for some different interpretations with PUA characters. This allows
the type to clearly differentiate between "gb18030 code points" and "normal
unicode code points". This may also be where -- as Lyberta has pointed out
-- room for a custom character database in use with algorithms
(normalization, collation, etc.) will come in handy.

All in all, Encoding Objects have a code_point type for exactly this
reason. I suspect the majority will want to use char32_t for interoperability
and cohesion with the ecosystem at large. If someone uses special Private
Use Area characters but still wants to present as a normal "unicode code
point", they can do that. If they want a far more strict conversion policy,
they can put a stronger type in there. The flexibility here is important to
serve all use cases, and I think it will work well here.
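
To make that concrete, here is a minimal sketch of the idea. Everything in
it is hypothetical and simplified for illustration: utf32_sketch,
gb18030_sketch and the exact shape of gb_code_point are stand-ins, not the
names or interfaces in the paper or the implementation.

```cpp
#include <cstdint>
#include <type_traits>

// The "normal" Unicode code point: currently just an alias of char32_t.
using unicode_code_point = char32_t;

// Hypothetical strong type for encodings whose PUA values carry their own
// meaning. It holds the same bits as a char32_t, but it does not convert
// to or from one implicitly.
class gb_code_point {
public:
    constexpr explicit gb_code_point(std::uint32_t value) noexcept
        : value_(value) {}
    constexpr std::uint32_t value() const noexcept { return value_; }

private:
    std::uint32_t value_;
};

// Hypothetical Encoding Objects: each one names its own code_point type.
struct utf32_sketch {
    using code_unit = char32_t;
    using code_point = unicode_code_point;
};

struct gb18030_sketch {
    using code_unit = char;
    using code_point = gb_code_point;
};

// Generic code can ask whether two encodings agree on what a code point is
// before it passes values between them unchanged.
template <typename From, typename To>
inline constexpr bool same_code_point_v =
    std::is_same_v<typename From::code_point, typename To::code_point>;
```

With something like this, mixing "gb18030 code points" and "normal unicode
code points" by accident becomes a compile error, and an encoding that is
happy to present as plain Unicode just keeps code_point as char32_t.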

> >
> > 2) The issue of PUA usage also comes up in the implementation of
> > Encoding Objects. It seems likely that the current direction will
> > necessitate some third party library to handle encodings other than the
> > main UTF ones. That seems reasonable. But without some sort of standard
> > mechanism that at least enumerates other common interpretations, and
> > allows third party libraries to declare their support for such, there
> > will be a combinatorial explosion of mutually incompatible encodings.
>
> I think providing conversions to and from Unicode scalar values is enough.
>

My long-term plan is that the entire slew of WHATWG encodings should end up
in the Standard Library. But that is not feasible or rational as a "Step
0"; it is important to remember that as far as Unicode support goes, C++
and C are at Step -1 in the Standard. utf8/16/32, ascii, and
narrow_dynamic_encoding/wide_dynamic_encoding come first, so we can get
people from String Literal / Platform Strings -> Unicode, and then we can
address the large body of encodings that exist outside of what is currently
present and ships with "The System". The design, as pointed out in the last
answer,
is intentionally open to make sure we are not cutting CJK encodings or PUA
encodings off at the legs here and asking them to walk when it comes time
to make sure we include a broad suite of encodings.

> > 3) By a similar construction and often overlapping concerns, the
> > availability of a standardized way for encodings to declare which
> > version of unicode they support is quite important. It's also not clear
> > how some of the round trip encodings can possibly be fully specified in
> > the type system. For example, how could I properly encode "UTF-8 Unicode
> > version 10" text containing emoji into "UTF-16 Unicode version 5" text
> > using the PUA for representation for display on OS X 10.7?
>
> Different versions of Unicode and PUA are a job for
> std::unicode::character_database.
>

Perhaps it is a job for the database, but I want to be clear: what this
proposal wants to deal with are encodings and -- potentially --
Normalization Forms. Encodings do not affect the interpretation of Emoji,
and Normalization Forms have forward-compatibility guarantees since Unicode
version 3.x. If emojis not defined in Unicode Version 5 are given to an
application that only has a knowledge of Unicode Version 5, then the
default answer is {however the application handles error state in its
text}. For example, web browsers that do not understand X emoji display the
codepoint value boxes. Other applications display "?". Some display the
missing-value "�". It is up to the application to handle characters that
are within the 21 bits allotted by Unicode, but that it has no
understanding of, however it sees fit. That is on the application, its
text renderer, etc.

The encoding layer only checks for valid encodings (e.g., "fits in the
21 bits of Unicode and is not an invalid sequence according to the
specification of this encoding"): ascribing meaning to that is a layer
above this and something every application has different answers to. Some
error and crash, some display replacement text, some ignore the text
entirely, some tank the HTTP request, etc.
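
For UTF-32, that kind of check could look roughly like the sketch below.
The function name is hypothetical and this is not the implementation's
code; it only shows how little the encoding layer needs to know. Note that
the actual upper bound is U+10FFFF, which is tighter than "anything that
fits in 21 bits".

```cpp
// Hypothetical helper: true if `c` is a Unicode scalar value, that is, it
// lies in [0, 0x10FFFF] and is not a surrogate (0xD800 to 0xDFFF).
// It consults no character database, so it cannot tell an assigned code
// point from an unassigned one; that question belongs to a higher layer.
constexpr bool is_unicode_scalar_value(char32_t c) noexcept {
    constexpr char32_t max_code_point = 0x10FFFF;
    constexpr char32_t first_surrogate = 0xD800;
    constexpr char32_t last_surrogate = 0xDFFF;
    return c <= max_code_point &&
           !(c >= first_surrogate && c <= last_surrogate);
}
```

An emoji that postdates the application's Unicode version passes this
check just like any other scalar value; what to draw for it is left to the
application.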

>
> > 4) The behavior of std::basic_text with respect to null termination is
> > valid but seems potentially risky. As I understand it, std::basic_text
> > will be null terminated if the underlying container is the default
> > std::basic_string. This seems likely to result in encoding
> > implementations which inadvertently assume null termination on their
> > operands. Our work on early versions of the CsString library persuaded
> > us that optional null termination is the source of some really obscure
> > bugs of the buffer overrun variety, and we eventually elected to force
> > null termination for all strings.
>
> I think null termination is just bad design. Pointer + length is the way
> to go.
>

Whether or not Null Termination is bad design, C and C++ are about
compatibility with enormous hulking old beasts of existing practice and
operating system APIs, many of which work in null termination land.

The current design does optionally null terminate. Many people have
problems with the null terminator and are desperately trying to design
performance around it (
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1072r5.html). It
may be dangerous, but the choice should be given to the user.
std::vector<T> is as much a valid backing code unit storage as
std::basic_string, and I do not think I should force users to only have
basic_string-like semantics.

Absolutely, the default should be basic_string. I plan to make sure that
`.c_str()` is a member of a text object if and only if the backing storage
is basic_string. Otherwise, there will be no `.c_str()` member, and code
that uses .c_str() will stop compiling the moment someone changes the
underlying container. Calls to .data() may still be
problematic. I am glad for CsString's experience here; I will need to be
very careful and bring this up to the Design Groups.
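
A minimal sketch of how the conditional `.c_str()` could be arranged,
written against C++17. basic_text_sketch, is_basic_string and has_c_str
are hypothetical names for this example only; the real basic_text has a
different and much larger interface.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical trait: true only for std::basic_string specializations.
template <typename T>
struct is_basic_string : std::false_type {};

template <typename CharT, typename Traits, typename Alloc>
struct is_basic_string<std::basic_string<CharT, Traits, Alloc>>
    : std::true_type {};

// Hypothetical, heavily simplified stand-in for the proposed text type.
template <typename Container = std::string>
class basic_text_sketch {
public:
    using value_type = typename Container::value_type;

    basic_text_sketch() = default;
    explicit basic_text_sketch(Container storage)
        : storage_(std::move(storage)) {}

    // Always available, but makes no promise about null termination.
    const value_type* data() const noexcept { return storage_.data(); }
    std::size_t size() const noexcept { return storage_.size(); }

    // Only exists when the storage is a basic_string, so only then can the
    // caller rely on a null-terminated result.
    template <typename C = Container,
              std::enable_if_t<is_basic_string<C>::value, int> = 0>
    const value_type* c_str() const noexcept { return storage_.c_str(); }

private:
    Container storage_;
};

// Detection helper used to show which instantiations expose c_str().
template <typename T, typename = void>
struct has_c_str : std::false_type {};

template <typename T>
struct has_c_str<T, std::void_t<decltype(std::declval<const T&>().c_str())>>
    : std::true_type {};
```

With this shape, has_c_str is true for basic_text_sketch<std::string> and
false for basic_text_sketch<std::vector<char>>, so switching the container
turns every .c_str() call into a compile error. It does nothing about
.data(), which is exactly the part that still worries me.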


