Date: Fri, 1 Nov 2019 22:16:14 -0700
Hello all,
This email is an attempt to summarize for the mailing list some areas of
concern I had after JeanHeyd's very helpful and explanatory presentation
at CppCon regarding some of the current thinking on standardizing the
Unicode story in C++. I hope these concerns are either unfounded or
have been rendered moot by developments since our conversation.
Nevertheless, I thought it would be beneficial to bring them up to this
group for consideration.
1) There was some discussion about whether or not char32_t is guaranteed
to be a Unicode Code Point. JeanHeyd pointed me to
https://wg21.link/p1041, which makes it clear that for string literals
at least this is guaranteed.
However, this is not sufficiently specified for all cases. For instance,
a GB 18030 encoding *must* use code points in the PUA. If a string
literal contains a PUA code point, how can you know the interpretation?
Making this a compile error seems problematic, but the right answer is
not clear to me.
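To make the ambiguity concrete, here is a minimal sketch of a check for
Private Use Area code points. The helper names is_private_use and
contains_private_use are hypothetical and not part of any proposal; the
ranges are the three private use ranges defined by Unicode. The check
can tell you that a literal contains a PUA code point, but nothing in
the char32_t value says whose private agreement assigns it a meaning.

```cpp
#include <string_view>

// Hypothetical helper: true if c lies in one of the three Unicode
// private use ranges.
constexpr bool is_private_use(char32_t c) noexcept {
    return (c >= 0xE000 && c <= 0xF8FF)        // BMP Private Use Area
        || (c >= 0xF0000 && c <= 0xFFFFD)      // Supplementary PUA-A
        || (c >= 0x100000 && c <= 0x10FFFD);   // Supplementary PUA-B
}

// Hypothetical helper: true if any code point in text is private use.
constexpr bool contains_private_use(std::u32string_view text) noexcept {
    for (char32_t c : text) {
        if (is_private_use(c)) return true;
    }
    return false;
}

// The literal is well-formed, yet the meaning of U+E000 is unknown here.
static_assert(contains_private_use(U"A\uE000B"), "PUA code point present");
```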
2) The issue of PUA usage also comes up in the implementation of
Encoding Objects. It seems likely that the current direction will
necessitate some third-party library to handle encodings other than the
main UTF ones. That seems reasonable. But without some sort of standard
mechanism that at least enumerates other common interpretations, and
allows third-party libraries to declare their support for such, there
will be a combinatorial explosion of mutually incompatible encodings.
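As a rough illustration of the kind of mechanism meant here, the sketch
below assumes a shared enumeration of well-known encodings plus a trait
that a third-party encoding type specializes to declare which one it
implements. Every name in it (encoding_id, encoding_identity,
same_interpretation, the vendor types) is hypothetical; it is one
possible shape under those assumptions, not a proposed API.

```cpp
// Hypothetical shared list of commonly used interpretations.
enum class encoding_id { unknown, utf8, utf16, utf32, gb18030 };

// Hypothetical trait: by default an encoding type declares nothing.
template <class Encoding>
struct encoding_identity {
    static constexpr encoding_id value = encoding_id::unknown;
};

// Two independent libraries, each with its own GB 18030 type.
struct vendor_a_gb18030 {};
struct vendor_b_gb18030 {};
struct unregistered_encoding {};

template <>
struct encoding_identity<vendor_a_gb18030> {
    static constexpr encoding_id value = encoding_id::gb18030;
};
template <>
struct encoding_identity<vendor_b_gb18030> {
    static constexpr encoding_id value = encoding_id::gb18030;
};

// Two encoding types are known to agree only if both declare the same
// well-known identity; an undeclared type never matches anything.
template <class A, class B>
constexpr bool same_interpretation() noexcept {
    return encoding_identity<A>::value != encoding_id::unknown
        && encoding_identity<A>::value == encoding_identity<B>::value;
}
```

With something along these lines, code receiving text from one library
could at least detect that another library's type claims the same
interpretation, instead of treating every pair of types as unrelated.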
3) By a similar construction and often overlapping concerns, the
availability of a standardized way for encodings to declare which
version of Unicode they support is quite important. It's also not clear
how some of the round-trip encodings can possibly be fully specified in
the type system. For example, how could I properly encode "UTF-8 Unicode
version 10" text containing emoji into "UTF-16 Unicode version 5" text
using the PUA for representation for display on OS X 10.7?
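One way to picture the problem: the target Unicode version is often
known only at run time, so it may be more natural as a value than as
part of a type. The sketch below assumes a hypothetical unicode_version
value and a hypothetical needs_fallback predicate that a transcoder
could consult before deciding to substitute a PUA mapping. It does not
consult any real character database; the versions are inputs.

```cpp
// Hypothetical value type describing a Unicode version.
struct unicode_version {
    int major_part;
    int minor_part;
};

constexpr bool operator<(unicode_version a, unicode_version b) noexcept {
    return a.major_part < b.major_part
        || (a.major_part == b.major_part && a.minor_part < b.minor_part);
}

// Hypothetical predicate: a character first assigned in `introduced`
// is not in the repertoire of an older `target` version, so the
// transcoder would need some fallback such as a PUA mapping.
constexpr bool needs_fallback(unicode_version introduced,
                              unicode_version target) noexcept {
    return target < introduced;
}
```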
4) The behavior of std::basic_text with respect to null termination is
valid but seems potentially risky. As I understand it, std::basic_text
will be null terminated if the underlying container is the default
std::basic_string. This seems likely to result in encoding
implementations which inadvertently assume null termination on their
operands. Our work on early versions of the CsString library persuaded
us that optional null termination is the source of some really obscure
bugs of the buffer overrun variety, and we eventually elected to force
null termination for all strings.
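A minimal sketch of how an implementation could make this assumption
explicit rather than accidental. The trait guarantees_null_terminator
and the function terminated_data are hypothetical names. The sketch
relies only on the fact that std::basic_string guarantees a terminator
at data()[size()], while a container such as std::vector does not.

```cpp
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical trait: does this container guarantee a terminator
// at data()[size()]? Conservatively false unless stated otherwise.
template <class Container>
struct guarantees_null_terminator : std::false_type {};

template <class CharT, class Traits, class Alloc>
struct guarantees_null_terminator<std::basic_string<CharT, Traits, Alloc>>
    : std::true_type {};

// Hypothetical accessor for code that hands a pointer to a C-style API.
// It refuses to compile for containers without the guarantee, so the
// dependence on null termination cannot creep in unnoticed.
template <class Container>
const typename Container::value_type*
terminated_data(const Container& c) noexcept {
    static_assert(guarantees_null_terminator<Container>::value,
                  "container does not guarantee null termination");
    return c.data();
}
```

Code that works purely from data() and size() never needs the
accessor and stays correct for either kind of container.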
Thanks for reading and I hope these comments are of value to inform the
eventual standard,
Ansel Sermersheim
Received on 2019-11-02 06:16:19