
sg16


Re: [isocpp-lib-ext] Formatters converting sequences of char to sequences of wchar_t

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 1 Jun 2023 12:20:59 -0400
On 5/31/23 1:29 PM, Victor Zverovich via Lib-Ext wrote:
> Thanks for catching this, it looks like a serious problem.
>
> - Do we want to allow string types of chars to be formatted as
> sequences of wchar_ts?
>
> Definitely not. In general this requires transcoding and goes against
> the design of std::format.
>
> - Do we want to allow non string type sequences of chars to be
> formatted as sequences of wchar_ts?
>
> Similarly no.
>
> - Should we disable char to wchar_t conversion in the range_formatter?
>
> Probably yes but I haven't looked in detail yet.
My intuition matches Victor's for all three questions.
>
> Please open an LWG issue and we'll need to discuss it in SG16.

Please copy me on the LWG issue request and I'll schedule it for
discussion in SG16 once it has been created.

Tom.

>
> - Victor
>
>
>
> On Wed, May 31, 2023 at 9:20 AM Mark de Wever <koraq_at_[hidden]> wrote:
>
> I noticed some interesting features introduced by the range-based
> formatters in C++23:
>
> // Ill-formed in C++20 and C++23
> const char* cstr = "hello";
> char* str = const_cast<char*>(cstr);
> std::format(L"{}", str);
> std::format(L"{}", cstr);
>
> // Ill-formed in C++20
> // In C++23 they give L"['h', 'e', 'l', 'l', 'o']"
> std::format(L"{}", "hello"); // A libc++ bug prevents this from working.
> std::format(L"{}", std::string_view("hello"));
> std::format(L"{}", std::string("hello"));
> std::format(L"{}", std::vector{'h', 'e', 'l', 'l', 'o'});
>
> An example is shown here [1]. This only shows libc++ since libstdc++ and
> MSVC STL have not implemented the formatting ranges papers yet.
>
> The difference between C++20 and C++23 is the existence of range
> formatters. These formatters use the formatter specialization
> formatter<char, wchar_t> which converts the sequence of chars to a
> sequence of wchar_ts.
>
> In this conversion same_as<char, charT> is false, thus the requirements
> of the range-type s and ?s ([tab:formatter.range.type]) aren't met. So
> the following is ill-formed:
>
> std::format(L"{:s}", std::string("hello")); // Not L"hello"
>
> It is surprising that some string types can be formatted as a sequence
> of wide characters, but others cannot. A sequence of characters can be a
> sequence of UTF-8 code units. This is explicitly supported in the width
> estimation of string types. The conversion of char to wchar_t will
> convert the individual code units, which will give incorrect results for
> multi-byte code points. It will not transcode UTF-8 to UTF-16/32. The
> current behavior is not in line with the note in
> [format.formatter.spec]/2:
>
> [Note 1: Specializations such as formatter<wchar_t, char> and
> formatter<const char*, wchar_t> that would require implicit
> multibyte / wide string or character conversion are
> disabled. — end note]
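To make the code-unit concern above concrete, here is a minimal sketch. It assumes the element conversion behaves like a plain value conversion of each char; the names widen_code_unit, count_code_points and e_acute are hypothetical, and routing through unsigned char is an assumption made here to avoid sign extension, not something the standard or libc++ is known to do.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: widen a single code unit without any decoding,
// which is what a per-element char -> wchar_t conversion amounts to.
constexpr wchar_t widen_code_unit(char c) {
    return static_cast<wchar_t>(static_cast<unsigned char>(c));
}

// Hypothetical helper: count code points in well-formed UTF-8 by
// skipping continuation bytes (bit pattern 10xxxxxx).
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (char c : utf8)
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
    return n;
}

// U+00E9 (e with acute accent) encoded in UTF-8 is the two code units 0xC3 0xA9.
inline constexpr std::string_view e_acute = "\xC3\xA9";

// One code point, but two code units: widening element by element
// produces two wide characters, neither of which is U+00E9.
static_assert(count_code_points(e_acute) == 1);
static_assert(e_acute.size() == 2);
static_assert(widen_code_unit(e_acute[0]) == L'\xC3');
static_assert(widen_code_unit(e_acute[1]) == L'\xA9');
```

A transcoding conversion would instead have to decode both code units into the single code point U+00E9 before producing wide output.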
>
> Disabling this could be done by explicitly disabling the char to wchar_t
> sequence formatter. Something along the lines of
>
> template <ranges::input_range R>
> requires(format_kind<R> == range_format::sequence &&
> same_as<remove_cvref_t<ranges::range_reference_t<R>>, char>)
> struct formatter<R, wchar_t> : __disabled_formatter {};
>
> where __disabled_formatter satisfies [format.formatter.spec]/5, would
> do the trick. This disables the conversion for all sequences not only
> the string types. So vector, array, span, etc. would be disabled.
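As a rough illustration of what a type satisfying [format.formatter.spec]/5 could look like, the sketch below uses a hypothetical disabled_formatter base (not libc++'s actual __disabled_formatter) and a hypothetical stand-in type, char_sequence_as_wide, for a specialization that derives from it. The trait is_disabled_formatter is also invented here; it only checks the five type traits that paragraph lists.

```cpp
#include <type_traits>

// Hypothetical base: deleting these members makes the type (and anything
// deriving from it) non-default-constructible, non-copyable and non-movable.
struct disabled_formatter {
    disabled_formatter() = delete;
    disabled_formatter(const disabled_formatter&) = delete;
    disabled_formatter& operator=(const disabled_formatter&) = delete;
};

// Stand-in for a specialization such as formatter<R, wchar_t> that
// inherits from the disabled base.
struct char_sequence_as_wide : disabled_formatter {};

// Hypothetical trait: true when all five properties named in
// [format.formatter.spec]/5 are false for F.
template <class F>
inline constexpr bool is_disabled_formatter =
    !std::is_default_constructible_v<F> &&
    !std::is_copy_constructible_v<F> &&
    !std::is_move_constructible_v<F> &&
    !std::is_copy_assignable_v<F> &&
    !std::is_move_assignable_v<F>;

static_assert(is_disabled_formatter<disabled_formatter>);
static_assert(is_disabled_formatter<char_sequence_as_wide>);
```

Deriving from one shared base keeps each disabled specialization to a single line, which matches the shape of the proposed declaration above.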
>
> This does not disable the conversion in the range_formatter. This allows
> users to explicitly opt in to this formatter for their own
> specializations.
>
> An alternative would be to only disable this conversion for string type
> specializations ([format.formatter.spec]/2.2) where char to wchar_t is
> used:
>
> template<size_t N> struct formatter<charT[N], charT>;
> template<class traits, class Allocator>
> struct formatter<basic_string<charT, traits, Allocator>, charT>;
> template<class traits>
> struct formatter<basic_string_view<charT, traits>, charT>;
>
> Disabling the following two is not strictly required:
>
> template<> struct formatter<char*, wchar_t>;
> template<> struct formatter<const char*, wchar_t>;
>
> However, if (const) char* becomes an input_range in a future version of
> C++, these formatters would become enabled. Disabling all five instead
> of the three required specializations seems like a future-proof
> solution.
>
> Since there is no enabled narrowing formatter specialization
> template<> struct formatter<wchar_t, char>;
> there are no issues for wchar_t to char conversions.
>
> Before filing an LWG issue I would like to get some feedback on which
> direction we want to go, specifically:
>
> - Do we want to allow string types of chars to be formatted as
> sequences of wchar_ts?
> - Do we want to allow non string type sequences of chars to be
> formatted as sequences of wchar_ts?
> - Should we disable char to wchar_t conversion in the range_formatter?
>
> Personally I vote no for these three questions.
>
>
> [1] https://godbolt.org/z/P9E6TK3YW
>
> Cheers,
> Mark
>
>
> _______________________________________________
> Lib-Ext mailing list
> Lib-Ext_at_[hidden]
> Subscription:https://lists.isocpp.org/mailman/listinfo.cgi/lib-ext
> Link to this post:http://lists.isocpp.org/lib-ext/2023/05/25189.php

Received on 2023-06-01 16:21:01