Date: Wed, 5 Dec 2018 12:54:10 -0500
pre-meeting mailing. I had hoped to get it into better shape than this
before sharing it, but am sharing now so that we can discuss in today's
SG16 teleconference.
It proposes some minor changes to the standard library to address some
unintended and unnecessary impact from P0482R6. Most of the paper
discusses options projects can use to work around the (intended) backward
compatibility impact from P0482R6.
Tom.
Document Number: | PXXXXR0 Draft |
---|---|
Date: | 2018-12-05 |
Audience: | Library Evolution Working Group |
Reply-to: | Tom Honermann <tom@honermann.net> |
char8_t backward compatibility remediation
Introduction
The support for char8_t as adopted for C++20 via P0482R6 [P0482R6] affects backward compatibility for existing programs in at least the following ways:
- Introduction of a new char8_t keyword.
- Change of type for u8 character and string literals.
- Introduction of new std::u8string, std::u8string_view, std::u8streampos type aliases and std::mbrtoc8 and std::c8rtomb functions; these names may conflict with existing uses of these names due to ADL or use of using namespace std.
- Change of return type for std::filesystem::path member functions u8string and generic_u8string.
This paper presents a set of remediation strategies for addressing backward compatibility issues as well as a few minor changes to the C++ standard to better facilitate migration to C++20.
Examples
Code | C++17 | C++20 with P0482R6 | C++20 with this proposal |
---|---|---|---|
 | Writes a sequence of UTF-8 code units as characters. (mojibake if the execution character encoding is not UTF-8) | Writes an integer or pointer value. (consistent with handling of char16_t and char32_t literals) | Ill-formed. (for all of char8_t, char16_t, and char32_t literals) |
 | Constructs a string object with UTF-8 encoded data. | Ill-formed. | Ill-formed. |
 | Constructs a string object with UTF-8 encoded data. | Ill-formed. | Ill-formed. |
 | Constructs a path object with a UTF-8 filename. | Ill-formed. | Constructs a path object with a UTF-8 filename. |
Proposal
- Add deleted overloads of basic_ostream<char, ...>::operator<< for char8_t character and string types. This avoids the silent and surprising behavior change introduced by P0482R6 [P0482R6] that resulted in UTF-8 characters being formatted as numeric values and UTF-8 strings being formatted as pointers.
- Add deleted overloads of basic_ostream<char, ...>::operator<< for char16_t and char32_t character and string types. This removes surprising behavior that has been present since C++11: characters are formatted as numeric values and strings are formatted as pointers.
- Modify std::filesystem::u8path to accept ranges and iterators with char8_t value types. This allows existing code that passes UTF-8 string literals to remain well-formed.
u8path(u8"filename"); // Ok; ill-formed following P0482R6 [P0482R6].
- Update the __cpp_lib_char8_t feature test macro to reflect proposed changes in library behavior.
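The effect of the proposed deleted overloads can be illustrated without touching the standard library. The following is a minimal sketch, not the proposed library wording: sink is a hypothetical stand-in for basic_ostream<char, ...>, and can_insert is a hypothetical detection trait introduced only for this illustration.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for basic_ostream<char, ...>; not a standard type.
struct sink {};

inline sink& operator<<(sink& s, const char*) { return s; }  // formats a string
inline sink& operator<<(sink& s, const void*) { return s; }  // formats a pointer value

// Without this declaration, a const char16_t* argument would convert to
// const void* and be formatted as a pointer. Deleting the exact match makes
// the insertion ill-formed instead.
sink& operator<<(sink&, const char16_t*) = delete;

// Hypothetical trait: detects whether "sink << T" is well-formed.
template<class T, class = void>
struct can_insert : std::false_type {};

template<class T>
struct can_insert<T, decltype(void(std::declval<sink&>() << std::declval<T>()))>
    : std::true_type {};

static_assert(can_insert<const char*>::value, "char strings are still accepted");
static_assert(can_insert<const int*>::value, "other pointers still format as pointers");
static_assert(!can_insert<const char16_t*>::value, "char16_t strings are rejected");
```

Because the deleted overload is an exact match, overload resolution selects it ahead of the const void* overload, and the program is rejected at compile time rather than silently printing a pointer value.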
Remediation strategies
A single approach to addressing backward compatibility impact is unlikely to be the best approach for all projects. This section presents a number of options to address various types of backward compatibility impact. In some cases, the best solution may involve a mix of these options.
Shut this off, shut these all off
The simplest short-term solution is to disable the new features completely. Clang and GCC will allow disabling char8_t features in both the language and the standard library via a -fno-char8_t option. It is expected that Microsoft and EDG based compilers will offer a similar option.
This option should be considered a short-term solution to enable testing existing C++17 code compiled as C++20 with minimal effort. This isn't a viable long-term option as continued use would complicate interoperation with code that depends on the new features.
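Code that must build both with and without char8_t support can check the feature test macros rather than the language version. The sketch below assumes that the compiler defines __cpp_char8_t exactly when u8 literals have char8_t based types; the names language_has_char8_t, library_has_char8_t, u8_code_unit, and u8_literals_are_char are hypothetical and used only for illustration.

```cpp
#include <string>       // one of the headers that defines __cpp_lib_char8_t
#include <type_traits>

#if defined(__cpp_char8_t)
constexpr bool language_has_char8_t = true;
#else
constexpr bool language_has_char8_t = false;   // e.g. C++17, or -fno-char8_t
#endif

#if defined(__cpp_lib_char8_t)
constexpr bool library_has_char8_t = true;
#else
constexpr bool library_has_char8_t = false;
#endif

// Element type of a u8 string literal: char8_t when the feature is enabled,
// char otherwise.
using u8_code_unit =
    std::remove_cv<std::remove_reference<decltype(u8""[0])>::type>::type;

constexpr bool u8_literals_are_char = std::is_same<u8_code_unit, char>::value;

static_assert(language_has_char8_t != u8_literals_are_char,
              "u8 literals are char based exactly when char8_t is disabled");
```

Conditional code elsewhere in a project can then branch on __cpp_char8_t, so the same sources compile whether or not the feature has been disabled.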
New keyword and std members
The lack of a standard char8_t type has prompted some projects to define their own char8_t type alias and corresponding u8string type. For open source projects reviewed by the author, switching to the new standard features is straightforward at the source level, though binary compatibility may be affected. Such projects can retain binary compatibility by continuing to use a type alias, but with a name other than char8_t.
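As a rough sketch of the renaming option, assume a project that previously declared its own char8_t alias for char. The namespace legacy and the names utf8_char and utf8_string are hypothetical; a real project would pick names that fit its own conventions and keep whatever underlying type it already used.

```cpp
#include <string>
#include <type_traits>

namespace legacy {  // hypothetical project namespace

// Before (ill-formed in C++20 because char8_t is now a keyword):
//   using char8_t  = char;
//   using u8string = std::basic_string<char8_t>;

// After: same underlying types, so mangled names and object layout are
// unchanged, but the alias names no longer collide with the new keyword
// or with the std::u8string alias.
using utf8_char   = char;
using utf8_string = std::basic_string<utf8_char>;

}  // namespace legacy

static_assert(std::is_same<legacy::utf8_string, std::string>::value,
              "the alias still names the type used before the rename");
#if defined(__cpp_char8_t)
static_assert(!std::is_same<legacy::utf8_char, char8_t>::value,
              "the project alias and the standard type remain distinct");
#endif
```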
Changed return type for std::filesystem::path member functions
FIXME
UTF-8 literals remediation
Each of these strategies assumes a requirement for continued use of UTF-8 encoded literals with char based types. For most projects, such a requirement is expected to be temporary while the project is fully migrated to C++20. However, some projects may retain a sustained need for such literals. For those projects, the complex remediation approach provides a long-term solution.
Simple remediation for common scenarios
Common uses of u8 literals can be handled in a backward compatible manner through use of reinterpret_cast or by adding new function overloads. Note that use of reinterpret_cast is ok in these situations since lvalues of type char may be used to access values of other types.
This approach may suffice when there are just a few uses of UTF-8 literals that need to be addressed. In general, sprinkling reinterpret_cast all over a code base is not desirable.
Before | After |
---|---|
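The two simple techniques might look as follows. This is a sketch under stated assumptions: utf8_byte_count stands in for any existing interface that expects UTF-8 data in char based storage, and as_char, make_greeting, and greeting_bytes are hypothetical helpers introduced only for this illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical existing C++17 interface that expects UTF-8 data in char.
inline std::size_t utf8_byte_count(const char* s) { return std::strlen(s); }

// Identity overload: keeps as_char usable while u8 literals are char based.
inline const char* as_char(const char* s) { return s; }

#if defined(__cpp_char8_t)
// Technique 1: reinterpret_cast, here wrapped in a small helper. Accessing
// the char8_t code units through an lvalue of type char is permitted.
inline const char* as_char(const char8_t* s) {
  return reinterpret_cast<const char*>(s);
}

// Technique 2: a new overload, so existing call sites compile unchanged.
inline std::size_t utf8_byte_count(const char8_t* s) {
  return utf8_byte_count(as_char(s));
}
#endif

// Both call sites compile as C++17 and as C++20.
inline std::string make_greeting() { return std::string(as_char(u8"text")); }
inline std::size_t greeting_bytes() { return utf8_byte_count(u8"text"); }
```

Note that reinterpret_cast cannot appear in a constant expression, so this technique does not help where the converted literal must be usable at compile time.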
Complex remediation for uncommon scenarios
The techniques applied here also apply to the common scenarios discussed in the prior section. This approach makes use of P0732 [P0732R2] to enable constexpr UTF-8 encoded char based literals using a user-defined literal. The example code below defines overloaded character and string UDL operators named _as_char. These UDLs can then be used in place of existing UTF-8 character and string literals.
Before | After |
---|---|
When wrapped in macros, the above UDL can be used to retain source compatibility across C++17 and C++20 for all known scenarios except for array initialization.
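One possible shape for such a UDL and its wrapping macro is sketched below. It assumes a compiler that implements class types in non-type template parameters and reports this through __cpp_nontype_template_args; the names char8_t_string_literal, as_char_buffer, make_as_char_buffer, u8_as_char, U8, and greeting are hypothetical, and the middle branch is an assumed fallback for compilers that provide char8_t without that feature.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>

#if defined(__cpp_char8_t) && defined(__cpp_nontype_template_args) && \
    __cpp_nontype_template_args >= 201911L

// Structural wrapper so that a u8 string literal can be a template argument.
template<std::size_t N>
struct char8_t_string_literal {
  constexpr char8_t_string_literal(const char8_t (&r)[N]) {
    for (std::size_t i = 0; i < N; ++i) s[i] = r[i];
  }
  char8_t s[N] = {};
};

// One static char array per distinct literal, converted code unit by code unit.
template<char8_t_string_literal L, std::size_t... I>
inline constexpr char as_char_buffer[sizeof...(I)] = { static_cast<char>(L.s[I])... };

template<char8_t_string_literal L, std::size_t... I>
constexpr auto& make_as_char_buffer(std::index_sequence<I...>) {
  return as_char_buffer<L, I...>;
}

// Character UDL: u8'x'_as_char yields a char.
constexpr char operator""_as_char(char8_t c) { return static_cast<char>(c); }

// String UDL: u8"text"_as_char yields a reference to a const char array.
template<char8_t_string_literal L>
constexpr auto& operator""_as_char() {
  return make_as_char_buffer<L>(std::make_index_sequence<sizeof(L.s)>());
}

#define U8(x) u8##x##_as_char

#elif defined(__cpp_char8_t)

// Assumed fallback: char8_t without class type template parameters.
// The string form is not usable in constant expressions.
inline const char* u8_as_char(const char8_t* s) {
  return reinterpret_cast<const char*>(s);
}
constexpr char u8_as_char(char8_t c) { return static_cast<char>(c); }

#define U8(x) u8_as_char(u8##x)

#else

// No char8_t: u8 literals are already char based.
#define U8(x) u8##x

#endif

// The same source line compiles in every configuration above.
inline const char* greeting() { return U8("text"); }
```

The macro relies on token pasting to attach both the u8 prefix and the UDL suffix, so call sites write U8("text") and never spell the literal prefix themselves.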
Array initialization
Before | After |
---|---|
Formal wording
These changes are relative to N4762 [N4762].
Library wording
Change in table 35 of 16.3.1 [support.limits.general] paragraph 3:
Table 35 — Standard library feature-test macros
Macro name | Value | Header(s) |
---|---|---|
[…] | […] | […] |
__cpp_lib_char8_t | 201812L ** placeholder ** [N4762 value: 201811L] | <atomic> <filesystem> <istream> <limits> <locale> <ostream> <string> <string_view> |
[…] | […] | […] |
Drafting note: the final value for the __cpp_lib_char8_t feature test macro will be selected by the project editor to reflect the date of approval.
Append new paragraphs in 27.7.5.2.4 [ostream.inserters.character]:
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char8_t c) = delete;
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char16_t c) = delete;
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, char32_t c) = delete;
6. [ Note: These overloads prevent formatting character values as numeric values. — end note ]
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char8_t* s) = delete;
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char16_t* s) = delete;
template<class traits>
basic_ostream<char, traits>& operator<<(basic_ostream<char, traits>& out, const char32_t* s) = delete;
7. [ Note: These overloads prevent formatting strings as pointer values. — end note ]
Annex C Compatibility wording
Change in C.5.11 [diff.cpp17.input.output] paragraph 2:
Affected subclause: 27.7.5.2.4
Change: Overload resolution for ostream inserters used with UTF-8 literals.
Rationale: Required for new features.
Effect on original feature: Valid ISO C++ 2017 code that passes UTF-8 literals to basic_ostream<char, ...>::operator<< is now ill-formed. [N4762 wording: "no longer calls character related overloads".]
std::cout << u8"text"; // Previously called operator<<(const char*) and printed a string.
                       // Now ill-formed. [N4762 wording: calls operator<<(const void*) and prints a pointer value.]
std::cout << u8'X';    // Previously called operator<<(char) and printed a character.
                       // Now ill-formed. [N4762 wording: calls operator<<(int) and prints an integer value.]
Add a new paragraph after C.5.11 [diff.cpp17.input.output] paragraph 2:
Affected subclause: 27.7.5.2.4
Change: Overload resolution for ostream inserters used with char16_t and char32_t types.
Rationale: Removal of surprising behavior.
Effect on original feature: Valid ISO C++ 2017 code that passes char16_t and char32_t characters or strings to basic_ostream<char, ...>::operator<< is now ill-formed.
std::cout << u"text"; // Previously called operator<<(const void*) and printed a pointer value.
                      // Now ill-formed.
std::cout << u'X';    // Previously called operator<<(int) and printed an integer value.
                      // Now ill-formed.
Annex D Compatibility features wording
Change in D.16 [depr.fs.path.factory] paragraph 1:
Requires: The source and [first, last) sequences are UTF-8 encoded. The value type of Source and InputIterator is char or char8_t. Source meets the requirements specified in 27.11.7.3.
References
[N4762] |
"Working Draft, Standard for Programming Language C++", N4762, 2018. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/n4762.pdf |
[P0388R2] |
Robert Haberlach,
"Permit conversions to arrays of unknown bound", P0388R2, 2018. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0388r2.html |
[P0482R6] |
Tom Honermann,
"char8_t: A type for UTF-8 characters and strings (Revision 6)", P0482R6, 2018. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r6.html |
[P0732R2] |
Jeff Snyder and Louis Dionne,
"Class Types in Non-Type Template Parameters", P0732R2, 2018. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0732r2.pdf |
Received on 2018-12-05 19:01:43