[SG16] WG14 N2761: Length modifiers for Unicode character and string types

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 11 Jul 2021 11:50:21 -0400
FYI, WG14 N2761
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2761.pdf> proposes
extending the printf() format syntax to allow passing char16_t and
char32_t strings. From the abstract:

> The formatting family of functions (printf, scanf, etc.; hereafter
> referred to as "format functions") have supported the l type modifier
> for the c and s format specifiers for a while; the point of the l
> modifier is to add support for "wide" characters/strings.
>
> The concept of "wide" characters/strings comes from Unicode's history,
> back in the late 80s/early 90s when it was thought that Unicode could
> be contained within 65535 characters, which has not been true since
> the advent of the UTF-16 and UTF-32 encoding forms, which were
> presented as part of Unicode 2.0 in July 1996.
>
> On the internet, according to Web Technology Surveys, Unicode and its
> derivatives/ancestors (ISO-8859-1, ASCII, Windows-1252) make up 98.0%
> of webpages, as of December 2020.
>
> Combine that fact with the fact that the only 16- and 32-bit character
> sets I can find in my research are UTF-16, UCS-2, UTF-32, and UCS-4,
> all Unicode encodings. UCS-4 is an alias for UTF-32, and UCS-2 is an
> ancestor encoding that UTF-16 superseded by adding Surrogate Pairs.
> Surrogate Pairs are encoded in such a way that they can't affect the
> decoding/encoding of char16_t codepoints by any Unicode-compatible
> codec from the last quarter century, so it becomes clear that in use,
> char16_t and char32_t can ONLY contain Unicode.
>
> The C standard itself, as of C11, introduces typedefs for these
> characters in uchar.h: char16_t and char32_t.
>
> But there are still problems today with wide characters and strings
> and Unicode character and string types. Take this simple program for
> example [0]. This simple program produces no output on my computer,
> compiled with Clang 11 on MacOS or Windows.
>
> In short, wide characters and strings are a broken and obsolete feature.
>
> But, I'm not here to wrestle with the committee about removing or even
> deprecating wide characters/string support from the standard library.
>
> Instead, I'm here to propose a more sane solution to this mess: Add
> two length modifiers to format specifiers, l16 and l32 for c and s
> specifier types for UTF-16 and UTF-32 support respectively.
>
> So format specifiers would look like %l16c, %l16s, %l32c, %l32s
>
> I've implemented support for the 16 and 32 extension to the l (ell)
> length modifier in Clang already.
>
Tom.
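
For readers who want a concrete picture of what a %l16s conversion would
have to do internally, here is a minimal sketch in C. It is not taken from
N2761 or from the Clang implementation mentioned in the abstract. The helper
name utf16_to_utf8 is hypothetical, and the sketch assumes the output
encoding is UTF-8 and that unpaired surrogates should be replaced with
U+FFFD; a real implementation would presumably target the locale's
multibyte encoding instead. It also assumes the platform provides <uchar.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <uchar.h>   /* char16_t (C11); not available on every platform */

/* Hypothetical helper, not part of any standard library.
 * Transcodes a NUL-terminated UTF-16 string to UTF-8.
 * Writes at most cap - 1 bytes plus a terminating NUL into dst and
 * returns the number of bytes written, not counting the NUL.
 * Output is truncated at a code point boundary if dst is too small. */
static size_t utf16_to_utf8(const char16_t *src, char *dst, size_t cap)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return 0;

    while (*src != 0) {
        uint32_t cp = *src++;

        if (cp >= 0xD800 && cp <= 0xDBFF && *src >= 0xDC00 && *src <= 0xDFFF) {
            /* High surrogate followed by low surrogate: combine the pair. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*src++ - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            /* Unpaired surrogate: assumption, substitute U+FFFD. */
            cp = 0xFFFD;
        }

        size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + len >= cap)
            break;                      /* keep room for the NUL */

        if (len == 1) {
            dst[n++] = (char)cp;
        } else if (len == 2) {
            dst[n++] = (char)(0xC0 | (cp >> 6));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (len == 3) {
            dst[n++] = (char)(0xE0 | (cp >> 12));
            dst[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            dst[n++] = (char)(0xF0 | (cp >> 18));
            dst[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            dst[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }

    dst[n] = '\0';
    return n;
}
```

With a helper like this, a caller today can convert a u"..." literal into a
char buffer and pass that buffer to printf with a plain %s. The proposed
%l16s would move that step into the format function itself.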

Received on 2021-07-11 10:50:25