[SG16-Unicode] Draft: char8_t backward compatibility remediation paper

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 5 Dec 2018 12:54:10 -0500
Attached is a (very) rough draft of a paper intended for the Kona
pre-meeting mailing. I had hoped to get it into better shape than this
before sharing it, but am sharing now so that we can discuss in today's
SG16 teleconference.

It proposes some minor changes to the standard library to address some
unintended and unnecessary impact from P0482R6. Most of the paper
discusses options projects can use to work around the (intended) backward
compatibility impact from P0482R6.

Tom.


char8_t backward compatibility remediation

Introduction

The support for char8_t as adopted for C++20 via P0482R6 [P0482R6] affects backward compatibility for existing programs in at least the following ways:

This paper presents a set of remediation strategies for addressing backward compatibility issues as well as a few minor changes to the C++ standard to better facilitate migration to C++20.

Examples

Code:
  std::cout << u8'x';
  std::cout << u8"text";
C++17: Writes a sequence of UTF-8 code units as characters (mojibake if the execution character encoding is not UTF-8).
C++20 with P0482R6: Writes an integer or pointer value (consistent with handling of char16_t and char32_t literals).
C++20 with this proposal: Ill-formed (for all of char8_t, char16_t, and char32_t literals).

Code:
  std::string s(u8"text");
C++17: Constructs a string object with UTF-8 encoded data.
C++20 with P0482R6: Ill-formed.
C++20 with this proposal: Ill-formed.

Code:
  std::filesystem::path p = ...;
  std::string s = p.u8string();
C++17: Constructs a string object with UTF-8 encoded data.
C++20 with P0482R6: Ill-formed.
C++20 with this proposal: Ill-formed.

Code:
  std::filesystem::path(u8"filename");
C++17: Constructs a path object with a UTF-8 filename.
C++20 with P0482R6: Ill-formed.
C++20 with this proposal: Constructs a path object with a UTF-8 filename.

Proposal

Remediation strategies

A single approach to addressing backward compatibility impact is unlikely to be the best approach for all projects. This section presents a number of options to address various types of backward compatibility impact. In some cases, the best solution may involve a mix of these options.

Shut this off, shut these all off

The simplest short-term solution is to disable the new features completely. Clang and gcc will allow disabling char8_t features in both the language and standard library via a -fno-char8_t option. It is expected that Microsoft and EDG based compilers will offer a similar option.

This option should be considered a short-term solution to enable testing existing C++17 code compiled as C++20 with minimal effort. This isn't a viable long-term option as continued use would complicate interoperation with code that depends on the new features.

New keyword and std members

The lack of a standard char8_t type has prompted some projects to define their own char8_t type alias and corresponding u8string type. For open source projects reviewed by the author, switching to the new standard features is straightforward at the source level, though binary compatibility may be affected. Such projects can retain binary compatibility by continuing to use a type alias, but with a name other than char8_t.
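The renaming option might look like the following sketch. It assumes a project whose alias was based on char; the namespace legacy_project and the names utf8_char_t, utf8_string, and greeting are hypothetical.

```cpp
#include <string>
#include <type_traits>

namespace legacy_project {  // hypothetical project namespace

// Before C++20 the project might have declared:
//   using char8_t = char;   // ill-formed once char8_t is a keyword
// Renaming the alias keeps the underlying type, and so the ABI, unchanged.
using utf8_char_t = char;
using utf8_string = std::basic_string<utf8_char_t>;

// The renamed string type is still std::string, so existing binary
// interfaces that exchange it are unaffected.
static_assert(std::is_same<utf8_string, std::string>::value,
              "utf8_string is expected to remain std::string");

// Example function whose signature is unchanged by the rename.
inline utf8_string greeting() { return utf8_string("text"); }

}  // namespace legacy_project
```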

Changed return type for std::filesystem::path member functions

FIXME

UTF-8 literals remediation

Each of these strategies assumes a requirement for continued use of UTF-8 encoded literals with char based types. For most projects, such a requirement is expected to be temporary while the project is fully migrated to C++20. However, some projects may retain a sustained need for such literals. For those projects, the complex remediation approach provides a long-term solution.

Simple remediation for common scenarios

Common uses of u8 literals can be handled in a backward compatible manner through use of reinterpret_cast or by adding new function overloads. Note that use of reinterpret_cast is ok in these situations since lvalues of type char may be used to access values of other types.

This approach may suffice when there are just a few uses of UTF-8 literals that need to be addressed. In general, sprinkling reinterpret_cast all over a code base is not desirable.

Before:
  const char &r = u8'x';
After:
  const char &r = reinterpret_cast<const char &>(u8'x');

Before:
  const char *p = u8"text";
After:
  const char *p = reinterpret_cast<const char *>(u8"text");

Before:
  template<int N>
  int ft(const char(&)[N]);

  ft(u8"text");
After:
  template<int N>
  int ft(const char(&)[N]);
  template<int N>
  int ft(const char8_t(&)[N]);

  ft(u8"text");

Before:
  int operator ""_udl(const char*, std::size_t);

  int v = u8"text"_udl;
After:
  int operator ""_udl(const char*, std::size_t);
  int operator ""_udl(const char8_t*, std::size_t);

  int v = u8"text"_udl;
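Where several call sites need the pointer cast, one way to avoid repeating reinterpret_cast is to confine it to a small overloaded helper. This is a sketch only; as_char_ptr and make_utf8_string are hypothetical names, and the char overload is there so the same source also compiles when char8_t is not enabled.

```cpp
#include <cstring>
#include <string>

#if defined(__cpp_char8_t)
// char8_t enabled: view the UTF-8 code units through a const char pointer.
inline const char* as_char_ptr(const char8_t* p) {
  return reinterpret_cast<const char*>(p);
}
#endif

// char based UTF-8 literals (C++17, or char8_t disabled) pass through unchanged.
inline const char* as_char_ptr(const char* p) { return p; }

// Example use: build a std::string holding UTF-8 encoded data.
inline std::string make_utf8_string() {
  return std::string(as_char_ptr(u8"text"));
}
```

Keeping the cast in one function makes it easier to find and remove once a project has fully migrated to char8_t based interfaces.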

Complex remediation for uncommon scenarios

The techniques applied here also apply to the common scenarios discussed in the prior section. This approach makes use of P0732R2 [P0732R2] to enable constexpr UTF-8 encoded char based literals using a user-defined literal. The example code below defines overloaded character and string UDL operators named _as_char. These UDLs can then be used in place of existing UTF-8 character and string literals.

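#include <compare>  // required for the defaulted operator<=>
#include <cstddef>  // std::size_t
#include <utility>  // std::index_sequence, std::make_index_sequence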

template<std::size_t N>
struct char8_t_string_literal {
  static constexpr inline std::size_t size = N;
  template<std::size_t... I>
  constexpr char8_t_string_literal(
    const char8_t (&r)[N],
    std::index_sequence<I...>)
  :
    s{r[I]...}
  {}
  constexpr char8_t_string_literal(
    const char8_t (&r)[N])
  :
    char8_t_string_literal(r, std::make_index_sequence<N>())
  {}
  auto operator <=>(const char8_t_string_literal&) const = default;
  char8_t s[N];
};

template<char8_t_string_literal L, std::size_t... I>
constexpr inline const char as_char_buffer[sizeof...(I)] =
  { static_cast<char>(L.s[I])... };

template<char8_t_string_literal L, std::size_t... I>
constexpr auto& make_as_char_buffer(std::index_sequence<I...>) {
  return as_char_buffer<L, I...>;
}

constexpr char operator ""_as_char(char8_t c) {
  return c;
}

template<char8_t_string_literal L>
constexpr auto& operator""_as_char() {
  return make_as_char_buffer<L>(std::make_index_sequence<decltype(L)::size>());
}

Before:
  constexpr const char &r = u8'x';
After:
  constexpr const char &r = u8'x'_as_char;

Before:
  constexpr const char *p = u8"text";
After:
  constexpr const char *p = u8"text"_as_char;

Before:
  // gcc extension in C++17; standard C++ doesn't permit conversion
  // to arrays of unknown bound.
  constexpr const char (&r)[] = u8"text";
After:
  // Ok in C++20 with P0388R2 [P0388R2]
  constexpr const char (&r)[] = u8"text"_as_char;

When wrapped in macros, the above UDL can be used to retain source compatibility across C++17 and C++20 for all known scenarios except for array initialization.

#include <utility>

#if defined(__cpp_char8_t)
#define U8(x) u8##x##_as_char
#else
#define U8(x) u8##x
#endif

constexpr char c = U8('x');
constexpr const char &rc = U8('x');
constexpr const char *ps = U8("text");
constexpr const char (&rac)[] = U8("text"); // Ok (with gcc extension or P0388).

Array initialization

Before:
  char a[] = u8"text";
After:
  FIXME

Before:
  constexpr char a[] = u8"text";
After:
  FIXME

Formal wording

These changes are relative to N4762 [N4762]

Library wording

Change in table 35 of 16.3.1 [support.limits.general] paragraph 3:

Table 35 — Standard library feature-test macros
Macro name | Value | Header(s)
[…] | […] | […]
__cpp_lib_char8_t | [deleted: 201811L] [inserted: 201812L ** placeholder **] | <atomic> <filesystem> <istream> <limits> <locale> <ostream> <string> <string_view>
[…] | […] | […]

Drafting note: the final value for the __cpp_lib_char8_t feature test macro will be selected by the project editor to reflect the date of approval.

Append new paragraphs in 27.7.5.2.4 [ostream.inserters.character]:

template<class traits>
  basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char8_t c) = delete;
template<class traits>
  basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char16_t c) = delete;
template<class traits>
  basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char32_t c) = delete;
6. [ Note: These overloads prevent formatting character values as numeric values. — end note ]
template<class traits>
  basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char8_t* s) = delete;
template<class traits>
  basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char16_t* s) = delete;
template<class traits>
  basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char32_t* s) = delete;
7. [ Note: These overloads prevent formatting strings as pointer values. — end note ]
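If these deleted overloads were adopted, code that still needs to write UTF-8 data to a char based stream would have to convert explicitly. The following is a rough sketch of one way to do that, not part of the proposed wording; utf8_unit, write_utf8, and utf8_to_stream_text are hypothetical names.

```cpp
#include <ostream>
#include <sstream>
#include <string>

#if defined(__cpp_char8_t)
using utf8_unit = char8_t;  // element type of u8 literals when char8_t is enabled
#else
using utf8_unit = char;     // element type of u8 literals in C++17
#endif

// Writes the UTF-8 code units unchanged. Whether they display correctly
// depends on the encoding expected by the stream's destination.
inline std::ostream& write_utf8(std::ostream& out, const utf8_unit* s) {
  return out << reinterpret_cast<const char*>(s);
}

// Convenience wrapper that captures the written bytes in a std::string.
inline std::string utf8_to_stream_text(const utf8_unit* s) {
  std::ostringstream os;
  write_utf8(os, s);
  return os.str();
}
```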

Annex C Compatibility wording

Change in C.5.11 [diff.cpp17.input.output] paragraph 2:

Affected subclause: 27.7.5.2.4
Change: Overload resolution for ostream inserters used with UTF-8 literals.
Rationale: Required for new features.
Effect on original feature: Valid ISO C++ 2017 code that passes UTF-8 literals to basic_ostream<char, ...>::operator<< [deleted: no longer calls character related overloads] [inserted: is now ill-formed].
std::cout << u8"text";       // Previously called operator<<(const char*) and printed a string.
                             // Now [deleted: calls operator<<(const void*) and prints a pointer value] [inserted: ill-formed].
std::cout << u8'X';          // Previously called operator<<(char) and printed a character.
                             // Now [deleted: calls operator<<(int) and prints an integer value] [inserted: ill-formed].

Add a new paragraph after C.5.11 [diff.cpp17.input.output] paragraph 2:

Affected subclause: 27.7.5.2.4
Change: Overload resolution for ostream inserters used with char16_t and char32_t types.
Rationale: Removal of surprising behavior.
Effect on original feature: Valid ISO C++ 2017 code that passes char16_t and char32_t characters or strings to basic_ostream<char, ...>::operator<< is now ill-formed.
std::cout << u"text";        // Previously called operator<<(const void*) and printed a pointer value.
                             // Now ill-formed.
std::cout << u'X';           // Previously called operator<<(int) and printed an integer value.
                             // Now ill-formed.

Annex D Compatibility features wording

Change in D.16 [depr.fs.path.factory] paragraph 1:

Requires: The source and [first, last) sequences are UTF-8 encoded. The value type of Source and InputIterator is char or char8_t. Source meets the requirements specified in 27.11.7.3.

References

[N4762] "Working Draft, Standard for Programming Language C++", N4762, 2018.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/n4762.pdf
[P0388R2] Robert Haberlach, "Permit conversions to arrays of unknown bound", P0388R2, 2018.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0388r2.html
[P0482R6] Tom Honermann, "char8_t: A type for UTF-8 characters and strings (Revision 6)", P0482R6, 2018.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r6.html
[P0732R2] Jeff Snyder and Louis Dionne, "Class Types in Non-Type Template Parameters", P0732R2, 2018.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0732r2.pdf

Received on 2018-12-05 19:01:43