Date: Fri, 3 May 2024 17:05:28 -0400
On 4/29/24 7:08 PM, Peter Dimov via SG16 wrote:
> Tom Honermann wrote:
>>> And we don't want to make std::cout << u8"..." do that, because it
>>> can, in principle, do better?
>> Not because it can do better, but because there is more uncertainty about
>> what the user might expect. If the user writes std::cout << std::format(...),
>> then that is an explicit opt in to the behavior that
>> std::format() exhibits. But they might also want to just write UTF-8 bytes
>> unmodified regardless of what the ordinary literal encoding is. Or they might
>> expect implicit transcoding to either the current locale or the environment
>> locale or even the terminal locale. By not providing a default behavior, we give
>> the programmer the opportunity to think about what they are actually trying
>> to do.
> I'm not sure I buy all that. Once format() returns, we are left with a string
> in the literal encoding. That string goes to std::cout. There's not much
> difference between sending a string in the literal encoding to std::cout,
> and sending a string in UTF-8 to std::cout, especially when the literal encoding
> is UTF-8, but also in principle.
I responded to these concerns elsewhere with a tl;dr of: no,
std::format() does not necessarily produce a string in the ordinary
literal encoding, even when it is working as intended.
>
> Namely,
>
>> iostreams implicitly consults either an imbued locale facet or the global locale
>> for formatting operations.
> this remains true for either of our string encodings. There's absolutely no
> guarantee that the imbued locale facet is more suitable for outputting the
> literal encoding than it's for outputting UTF-8. In fact it may very well be less
> suitable.
We need to be a little careful with what we mean by "locale facet" when
discussing encodings. A locale has an associated encoding (that is not
exposed as a facet, though perhaps we should consider doing that; I
think Jonathan Wakely might be planning to do so already for internal
use in libstdc++) that is distinct from whatever encodings a
std::codecvt facet is set up to convert between.
If "imbued locale facet" was intended to mean std::codecvt, then I agree.
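
As a rough illustration of that distinction: a std::codecvt facet can be
retrieved from a locale and queried, whereas the locale's associated encoding
has no standard accessor, only an implementation-defined name. The struct and
function names below (locale_info, describe) are hypothetical.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical helper: collects what the standard library exposes here.
struct locale_info {
    std::string name;            // implementation-defined locale name, e.g. "C"
    bool narrow_codecvt_noconv;  // true if codecvt<char, char, mbstate_t> never converts
};

locale_info describe(const std::locale& loc) {
    using narrow_cvt = std::codecvt<char, char, std::mbstate_t>;
    // The codecvt facet is a queryable object attached to the locale.
    const narrow_cvt& cvt = std::use_facet<narrow_cvt>(loc);
    // There is no comparable facet that reports the locale's encoding;
    // name() is the closest standard hook, and its format is unspecified.
    return {loc.name(), cvt.always_noconv()};
}
```

For the classic "C" locale this reports the name "C" and a narrow codecvt
facet that performs no conversion, which says nothing about the encoding a
terminal or file on the other end expects.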
>
>> In the latter case, we have to assume that some_std_string holds text in the
>> encoding expected on the other end of the stream.
> Again, I don't see why that would be true. If you are going to invoke CP437
> in the UTF-8 case, I don't see why we suddenly need to ignore its existence
> in the literal encoding case.
>
> There's nothing stopping us from making std::cout << u8"..." _at least as
> good as_ std::cout << std::format( "{}", u8"..." ) - we just make it transcode
> to the literal encoding. Yes, it's potentially possible to do better than that,
> but it needn't be any worse, and in the common case of the literal
> encoding being UTF-8, both will be as good as can be achieved.
I responded elsewhere; I think this neglects long-standing use of
locales with code page based encodings.
>
> Now, had the proposal on the table been std::print( u8"{}", u8"..." )... that's
> another story altogether. But we aren't talking about that.
Not yet anyway :)
Tom.
Received on 2024-05-03 21:05:31