Formatters converting sequences of char to sequences of wchar_t

From: Mark de Wever <koraq_at_[hidden]>
Date: Wed, 31 May 2023 18:20:37 +0200
I noticed some interesting behavior introduced by the range-based
formatters in C++23:

  // Ill-formed in C++20 and C++23
  const char* cstr = "hello";
  char* str = const_cast<char*>(cstr);
  std::format(L"{}", str);

  // Ill-formed in C++20
  // In C++23 they give L"['h', 'e', 'l', 'l', 'o']"
  std::format(L"{}", "hello"); // A libc++ bug prevents this from working.
  std::format(L"{}", std::string_view("hello"));
  std::format(L"{}", std::string("hello"));
  std::format(L"{}", std::vector{'h', 'e', 'l', 'l', 'o'});

An example is shown here [1]. This only shows libc++ since libstdc++ and
MSVC STL have not implemented the formatting ranges papers yet.

The difference between C++20 and C++23 is the existence of range
formatters. These formatters use the formatter specialization
formatter<char, wchar_t> which converts the sequence of chars to a
sequence of wchar_ts.

In this conversion same_as<char, charT> is false, thus the requirements
of the range-type s and ?s ([tab:formatter.range.type]) aren't met. So
the following is ill-formed:

  std::format(L"{:s}", std::string("hello")); // Not L"hello"

It is surprising that some string types can be formatted as a sequence
of wide characters, but others cannot. A sequence of characters can be a
sequence of UTF-8 code units. This is explicitly supported in the width
estimation of string types. The conversion of char to wchar_t will
convert the individual code units, which will give incorrect results for
multi-byte code points. It will not transcode UTF-8 to UTF-16/32. The
current behavior is not in line with the note in

  [Note 1: Specializations such as formatter<wchar_t, char> and
  formatter<const char*, wchar_t> that would require implicit
  multibyte / wide string or character conversion are
  disabled. — end note]
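
To make the code unit problem concrete, here is a minimal sketch of what
element-wise widening does to UTF-8 input. The helper name
widen_code_units is hypothetical, and the cast through unsigned char is
an assumption about how an implementation might avoid sign extension;
it is not taken from the standard wording.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: widens every char code unit on its own.
// No transcoding takes place. The cast through unsigned char is an
// assumption made here to avoid sign extension of bytes >= 0x80.
inline std::wstring widen_code_units(const std::string& narrow) {
    std::wstring wide;
    wide.reserve(narrow.size());
    for (char c : narrow)
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    return wide;
}

// U+00E9 is a single code point, but two code units in UTF-8.
inline void check_widening() {
    const std::string utf8 = "\xC3\xA9";
    const std::wstring wide = widen_code_units(utf8);

    // Two wide characters come out, one per input code unit.
    assert(wide.size() == 2);
    assert(wide[0] == static_cast<wchar_t>(0xC3));
    assert(wide[1] == static_cast<wchar_t>(0xA9));

    // A real UTF-8 to UTF-16/32 transcoding would give one U+00E9.
    assert(wide != std::wstring(1, static_cast<wchar_t>(0xE9)));
}
```

Pure ASCII input survives this unchanged, which is probably why the
problem is easy to overlook.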

Disabling this could be done by explicitly disabling the char to wchar_t
sequence formatter. Something along the lines of

  template <ranges::input_range R>
    requires(format_kind<R> == range_format::sequence &&
             same_as<remove_cvref_t<ranges::range_reference_t<R>>, char>)
  struct formatter<R, wchar_t> : __disabled_formatter {};

where __disabled_formatter satisfies [format.formatter.spec]/5, would
do the trick. This disables the conversion for all sequences, not only
for the string types. So vector, array, span, etc. would be disabled too.
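
As a rough illustration of what "satisfies [format.formatter.spec]/5"
could look like, the sketch below uses a hypothetical stand-in named
disabled_formatter. It only checks the five type traits that paragraph
requires to be false; fake_formatter is likewise a made-up name and not
a std::formatter specialization.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the exposition-only __disabled_formatter.
// Deleting the default constructor and the copy operations also
// suppresses the implicit move operations.
struct disabled_formatter {
    disabled_formatter() = delete;
    disabled_formatter(const disabled_formatter&) = delete;
    disabled_formatter& operator=(const disabled_formatter&) = delete;
};

// Hypothetical specialization body that inherits the disabled state,
// mirroring "struct formatter<R, wchar_t> : __disabled_formatter {};".
struct fake_formatter : disabled_formatter {};

// Returns true when all five traits named for disabled specializations
// are false for F.
template <class F>
constexpr bool is_disabled() {
    return !std::is_default_constructible<F>::value &&
           !std::is_copy_constructible<F>::value &&
           !std::is_move_constructible<F>::value &&
           !std::is_copy_assignable<F>::value &&
           !std::is_move_assignable<F>::value;
}

static_assert(is_disabled<disabled_formatter>(), "base must be disabled");
static_assert(is_disabled<fake_formatter>(), "derived must be disabled");
```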

This does not disable the conversion in the range_formatter. This allows
users to explicitly opt in to this formatter for their own

An alternative would be to only disable this conversion for string type
specializations ([format.formatter.spec]/2.2) where char to wchar_t is

  template<size_t N> struct formatter<charT[N], charT>;
  template<class traits, class Allocator>
    struct formatter<basic_string<charT, traits, Allocator>, charT>;
  template<class traits>
    struct formatter<basic_string_view<charT, traits>, charT>;
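
To show which types this narrower alternative would touch, here is a
small sketch with a hypothetical trait, is_narrow_string_type. It
matches the three shapes listed above with charT fixed to char. The
trait is for illustration only and is not proposed wording.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical trait: true only for the three narrow string shapes.
template <class T>
struct is_narrow_string_type : std::false_type {};

template <std::size_t N>
struct is_narrow_string_type<char[N]> : std::true_type {};

template <class Traits, class Allocator>
struct is_narrow_string_type<std::basic_string<char, Traits, Allocator>>
    : std::true_type {};

template <class Traits>
struct is_narrow_string_type<std::basic_string_view<char, Traits>>
    : std::true_type {};

// The string types would lose their wchar_t formatter ...
static_assert(is_narrow_string_type<char[6]>::value, "char array");
static_assert(is_narrow_string_type<std::string>::value, "string");
static_assert(is_narrow_string_type<std::string_view>::value, "view");

// ... while other sequences of char would keep the conversion.
static_assert(!is_narrow_string_type<std::vector<char>>::value, "vector");
static_assert(!is_narrow_string_type<std::wstring>::value, "wstring");
```

With this approach std::vector<char> would still format as a sequence
of wchar_t, which is the difference from the first option.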

Disabling the following two is not strictly required:

  template<> struct formatter<char*, wchar_t>;
  template<> struct formatter<const char*, wchar_t>;

However, if (const) char* becomes an input_range in a future version of
C++, these formatters would become enabled. Disabling all five instead
of the three required specializations seems like a future proof

Since there is no enabled narrowing formatter specialization
  template<> struct formatter<wchar_t, char>;
there are no issues for wchar_t to char conversions.

Before filing an LWG issue I would like to get some feedback on which
direction we want to go, specifically:

- Do we want to allow string types of chars to be formatted as
  sequences of wchar_ts?
- Do we want to allow non string type sequences of chars to be
  formatted as sequences of wchar_ts?
- Should we disable char to wchar_t conversion in the range_formatter?

Personally I vote no for these three questions.

[1] https://godbolt.org/z/P9E6TK3YW


