Re: [SG16-Unicode] Questions about some corner cases of proposed std::basic_text encoding implementation

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 6 Nov 2019 13:35:36 +0000
Hi, Ansel. I just wanted to offer a quick thank-you for the email and an
apology for not following up yet. We've been busy preparing for the Belfast
meeting.

Tom.

On 11/2/19 5:16 AM, Ansel Sermersheim wrote:
> Hello all,
>
> This email is an attempt to summarize for the mailing list some areas of
> concern I had after JeanHeyd's very helpful and explanatory presentation
> at CppCon regarding some of the current thinking on standardizing the
> Unicode story in C++. I hope these concerns are either unfounded, or
> developments since our conversation have rendered them moot.
> Nevertheless, I thought it would be beneficial to bring them up to this
> group for consideration.
>
> 1) There was some discussion about whether or not char32_t is guaranteed
> to be a Unicode Code Point. JeanHeyd pointed me to
> https://wg21.link/p1041, which makes it clear that for string literals
> at least this is guaranteed.
>
> However, this is not sufficiently specified for all cases. For instance,
> a GB 18030 encoding *must* use code points in the PUA. If a string
> literal contains a PUA code point, how can you know the interpretation?
> Making this a compile error seems problematic, but the right answer is
> not clear to me.
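
To make the PUA question in point 1 concrete, here is a minimal sketch of
a check for whether a char32_t value falls in one of the three Unicode
Private Use ranges. The helper name is_private_use is hypothetical and is
not taken from P1041 or from any proposed std::basic_text interface. The
check can only report that a code point is private use; it cannot say
which interpretation (GB 18030 or otherwise) was intended, which is the
gap described above.

```cpp
#include <cassert>

// Hypothetical helper (not from any proposal): true if cp lies in one of
// the three Private Use ranges defined by Unicode.
constexpr bool is_private_use(char32_t cp) {
    return (cp >= 0xE000 && cp <= 0xF8FF)        // BMP Private Use Area
        || (cp >= 0xF0000 && cp <= 0xFFFFD)      // Supplementary Private Use Area-A
        || (cp >= 0x100000 && cp <= 0x10FFFD);   // Supplementary Private Use Area-B
}

// A UTF-32 literal can carry a PUA code point without any diagnostic.
constexpr char32_t sample[] = U"A\uE000";
static_assert(!is_private_use(sample[0]), "U+0041 is not private use");
static_assert(is_private_use(sample[1]), "U+E000 is private use");
```
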
>
> 2) The issue of PUA usage also comes up in the implementation of
> Encoding Objects. It seems likely that the current direction will
> necessitate some third party library to handle encodings other than the
> main UTF ones. That seems reasonable. But without some sort of standard
> mechanism that at least enumerates other common interpretations, and
> allows third party libraries to declare their support for such, there
> will be a combinatorial explosion of mutually incompatible encodings.
>
> 3) By a similar construction and often overlapping concerns, the
> availability of a standardized way for encodings to declare which
> version of Unicode they support is quite important. It's also not clear
> how some of the round-trip encodings can possibly be fully specified in
> the type system. For example, how could I properly encode "UTF-8 Unicode
> version 10" text containing emoji into "UTF-16 Unicode version 5" text
> using the PUA for representation for display on OS X 10.7?
>
> 4) The behavior of std::basic_text with respect to null termination is
> valid but seems potentially risky. As I understand it, std::basic_text
> will be null terminated if the underlying container is the default
> std::basic_string. This seems likely to result in encoding
> implementations which inadvertently assume null termination on their
> operands. Our work on early versions of the CsString library persuaded
> us that optional null termination is the source of some really obscure
> bugs of the buffer overrun variety, and we eventually elected to force
> null termination for all strings.
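
The risk in point 4 can be sketched as follows, assuming two hypothetical
helpers (count_units and count_units_until_null) that stand in for
encoding code. Neither is part of std::basic_text or CsString. The first
relies on an explicit length; the second assumes a terminator and is only
safe when the underlying storage guarantees one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: uses the explicit length and never reads past the view.
constexpr std::size_t count_units(std::u32string_view text) {
    return text.size();
}

// Hypothetical helper: assumes a terminating U'\0' exists somewhere after p.
// This is the kind of inadvertent assumption described in point 4.
inline std::size_t count_units_until_null(const char32_t* p) {
    std::size_t n = 0;
    while (p[n] != U'\0') {
        ++n;
    }
    return n;
}

// Demonstrates that the two helpers disagree on a view of a larger string.
inline bool helpers_disagree_on_subview() {
    std::u32string storage = U"hello";              // data() is null terminated
    std::u32string_view view(storage.data(), 3);    // "hel", no terminator of its own
    return count_units(view) == 3
        && count_units_until_null(view.data()) == 5; // runs on to storage's terminator
}
```

Here the second helper stays in bounds only because the view happens to
sit on top of a std::u32string. With a container that does not guarantee
a terminator, such as std::vector<char32_t>, the same call would read
past the end of the buffer.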
>
> Thanks for reading and I hope these comments are of value to inform the
> eventual standard,
>
> Ansel Sermersheim
>
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode

Received on 2019-11-06 14:35:40