Re: Formatters converting sequences of char to sequences of wchar_t

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Wed, 31 May 2023 10:29:09 -0700

Thanks for catching this; it looks like a serious problem.

- Do we want to allow string types of chars to be formatted as
  sequences of wchar_ts?

Definitely not. In general this requires transcoding and goes against the
design of std::format.

- Do we want to allow non string type sequences of chars to be
  formatted as sequences of wchar_ts?

Similarly no.

- Should we disable char to wchar_t conversion in the range_formatter?

Probably yes but I haven't looked in detail yet.

Please open an LWG issue and we'll need to discuss it in SG16.

- Victor



On Wed, May 31, 2023 at 9:20 AM Mark de Wever <koraq_at_[hidden]> wrote:

> I noticed some interesting features introduced by the range-based
> formatters in C++23:
>
> // Ill-formed in C++20 and C++23
> const char* cstr = "hello";
> char* str = const_cast<char*>(cstr);
> std::format(L"{}", str);
> std::format(L"{}", cstr);
>
> // Ill-formed in C++20
> // In C++23 they give L"['h', 'e', 'l', 'l', 'o']"
> std::format(L"{}", "hello"); // A libc++ bug prevents this from working.
> std::format(L"{}", std::string_view("hello"));
> std::format(L"{}", std::string("hello"));
> std::format(L"{}", std::vector{'h', 'e', 'l', 'l', 'o'});
>
> An example is shown here [1]. This only shows libc++ since libstdc++ and
> MSVC STL have not implemented the formatting ranges papers yet.
>
> The difference between C++20 and C++23 is the existence of range
> formatters. These formatters use the formatter specialization
> formatter<char, wchar_t> which converts the sequence of chars to a
> sequence of wchar_ts.
>
> In this conversion same_as<char, charT> is false, thus the requirements
> of the range-type s and ?s ([tab:formatter.range.type]) aren't met. So
> the following is ill-formed:
>
> std::format(L"{:s}", std::string("hello")); // Not L"hello"
>
> It is surprising that some string types can be formatted as a sequence
> of wide characters, but others cannot. A sequence of characters can be a
> sequence of UTF-8 code units. This is explicitly supported in the width
> estimation of string types. The conversion of char to wchar_t will
> convert the individual code units, which will give incorrect results for
> multi-byte code points. It will not transcode UTF-8 to UTF-16/32. The
> current behavior is not in line with the note in
> [format.formatter.spec]/2:
>
> [Note 1: Specializations such as formatter<wchar_t, char> and
> formatter<const char*, wchar_t> that would require implicit
> multibyte / wide string or character conversion are
> disabled. — end note]
>
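To make the code-unit point concrete, here is a minimal sketch of what
per-code-unit widening does to a multi-byte code point. Both helpers
(widen_code_units and decode_two_byte_utf8) are hypothetical names made up
for this illustration; they are not standard facilities, and the cast through
unsigned char is an assumption, not a claim about how any library performs
the conversion.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper, not a standard facility: widens each char on its
// own, with no decoding, mimicking a per-code-unit char -> wchar_t
// conversion.
inline std::wstring widen_code_units(std::string_view utf8) {
    std::wstring out;
    out.reserve(utf8.size());
    for (char c : utf8) {
        // Assumption: go through unsigned char so bytes >= 0x80 are not
        // sign-extended on platforms where char is signed.
        out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    return out;
}

// Hypothetical helper: decodes one two-byte UTF-8 sequence
// (110xxxxx 10xxxxxx) into its code point. No validation is attempted.
inline char32_t decode_two_byte_utf8(std::string_view utf8) {
    const auto lead  = static_cast<unsigned char>(utf8[0]);
    const auto trail = static_cast<unsigned char>(utf8[1]);
    return static_cast<char32_t>(((lead & 0x1Fu) << 6) | (trail & 0x3Fu));
}
```

For the UTF-8 encoding of U+00E9 (the bytes 0xC3 0xA9), widen_code_units
yields two wide characters with the values 0xC3 and 0xA9, whereas an actual
transcoding step would yield the single code point 0xE9.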
> Disabling this could be done by explicitly disabling the char to wchar_t
> sequence formatter. Something along the lines of
>
> template <ranges::input_range R>
>   requires(format_kind<R> == range_format::sequence &&
>            same_as<remove_cvref_t<ranges::range_reference_t<R>>, char>)
> struct formatter<R, wchar_t> : __disabled_formatter {};
>
> where __disabled_formatter satisfies [format.formatter.spec]/5, would
> do the trick. This disables the conversion for all sequences, not only
> the string types. So vector, array, span, etc. would be disabled.
>
> This does not disable the conversion in the range_formatter. This allows
> users to explicitly opt in to this formatter for their own
> specializations.
>
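As a rough way to see which types an element-type constraint like the one
above would catch, the following sketch checks only the "elements are char"
part. It is an approximation under stated assumptions: the trait name
is_narrow_char_sequence is hypothetical, it uses C++17 type traits and
std::begin in place of ranges::input_range, and it ignores the format_kind
condition entirely.

```cpp
#include <array>
#include <cassert>
#include <iterator>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical trait, not part of any library: true when std::begin works
// on an lvalue of R and dereferencing the iterator yields (cv-qualified)
// char. This stands in for the same_as<..., char> half of the constraint.
template <class R, class = void>
struct is_narrow_char_sequence : std::false_type {};

template <class R>
struct is_narrow_char_sequence<
    R, std::void_t<decltype(*std::begin(std::declval<R&>()))>>
    : std::is_same<std::remove_cv_t<std::remove_reference_t<
                       decltype(*std::begin(std::declval<R&>()))>>,
                   char> {};

template <class R>
inline constexpr bool is_narrow_char_sequence_v =
    is_narrow_char_sequence<R>::value;

// String types and plain containers of char are matched alike.
static_assert(is_narrow_char_sequence_v<std::string>);
static_assert(is_narrow_char_sequence_v<std::string_view>);
static_assert(is_narrow_char_sequence_v<char[6]>);
static_assert(is_narrow_char_sequence_v<std::vector<char>>);
static_assert(is_narrow_char_sequence_v<std::array<char, 5>>);

// Other element types are not matched, and a pointer is not a range.
static_assert(!is_narrow_char_sequence_v<std::wstring>);
static_assert(!is_narrow_char_sequence_v<std::vector<int>>);
static_assert(!is_narrow_char_sequence_v<const char*>);
```

The last assertion reflects that (const) char* is not a range today, so a
range-based constraint does not reach the pointer specializations.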
> An alternative would be to only disable this conversion for string type
> specializations ([format.formatter.spec]/2.2) where char to wchar_t is
> used:
>
> template<size_t N> struct formatter<charT[N], charT>;
> template<class traits, class Allocator>
> struct formatter<basic_string<charT, traits, Allocator>, charT>;
> template<class traits>
> struct formatter<basic_string_view<charT, traits>, charT>;
>
> Disabling the following two is not strictly required:
>
> template<> struct formatter<char*, wchar_t>;
> template<> struct formatter<const char*, wchar_t>;
>
> However, if (const) char* becomes an input_range in a future version of
> C++, these formatters would become enabled. Disabling all five instead
> of the three required specializations seems like a future-proof
> solution.
>
> Since there is no enabled narrowing formatter specialization
> template<> struct formatter<wchar_t, char>;
> there are no issues for wchar_t to char conversions.
>
> Before filing an LWG issue I would like to get some feedback on which
> direction we want to go, specifically:
>
> - Do we want to allow string types of chars to be formatted as
> sequences of wchar_ts?
> - Do we want to allow non string type sequences of chars to be
> formatted as sequences of wchar_ts?
> - Should we disable char to wchar_t conversion in the range_formatter?
>
> Personally I vote no for these three questions.
>
>
> [1] https://godbolt.org/z/P9E6TK3YW
>
> Cheers,
> Mark
>

Received on 2023-05-31 17:29:22