Re: [SG16] LWG 3565 (Handling of encodings in localized formatting of chrono types is underspecified)

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 30 Jul 2021 11:38:10 -0400
Avoiding multiple localization mechanisms is desirable.

I think the problem we're having boils down to this: Do we want
std::format() (and the proposed std::print()) to manipulate strings
(NTBSs with ambiguous or polyglot encoding; e.g., mojibake) or text
(well-formed code unit sequences for a particular encoding)? The
existing locale facilities do not support the latter because there are
multiple possible encodings at play (the ordinary literal encoding or
the locale encoding, neither of which necessarily matches the
programmer's intent; the programmer may be using UTF-8 encoded strings
with a literal encoding of Windows-1252 running in a Windows-1251
locale). The PR for the issue tries to split the difference by choosing
the former if the literal encoding is not UTF-8 and the latter
otherwise. This inconsistency is concerning to some.
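
To make the strings-versus-text distinction concrete, here is a minimal
sketch, not taken from the issue or the PR. It assumes a UTF-8 encoded
format fragment and a locale-provided fragment in Windows-1251; the
helper names (is_well_formed_utf8, utf8_format_fragment,
windows1251_locale_fragment, check_mixed_encoding_example) are
hypothetical. Concatenating the two fragments byte-wise is a perfectly
good string, but it is not well-formed UTF-8 text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Returns true if `s` is a well-formed UTF-8 code unit sequence
// (rejects overlong forms, surrogates, and values above U+10FFFF).
bool is_well_formed_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }   // ASCII
        std::size_t len = 0;
        unsigned lo = 0x80, hi = 0xBF;       // allowed range of the second byte
        if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; }
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; }
        else if (b0 == 0xEE || b0 == 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; }
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; }
        else return false;                   // invalid lead byte
        if (s.size() - i < len) return false; // truncated sequence
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}

// Hypothetical format-string fragment: "café: " encoded as UTF-8.
std::string utf8_format_fragment() { return "caf\xC3\xA9: "; }

// Hypothetical locale-provided fragment: two Cyrillic letters
// encoded as Windows-1251 (bytes 0xEF 0xF2).
std::string windows1251_locale_fragment() { return "\xEF\xF2"; }

void check_mixed_encoding_example() {
    const std::string fmt = utf8_format_fragment();
    const std::string loc = windows1251_locale_fragment();
    assert(is_well_formed_utf8(fmt));        // text on its own
    assert(!is_well_formed_utf8(fmt + loc)); // a string, but not UTF-8 text
}
```

Under these assumptions, a formatting function that simply concatenates
fragments treats its inputs as strings; producing text would require
knowing, and possibly converting, the encoding of each fragment.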

Speaking solely for myself, I'm leaning towards these utilities
manipulating strings (not text) in all existing cases. This puts the
burden of producing valid text on the programmer (e.g., if the format
string is UTF-8 and the locale provides Windows-1251, then it is up to
the programmer to accept the mojibake possibility or do something
explicit to prevent it). This is consistent with how the existing locale
facilities work and allows these utilities to function as drop-in
replacements for printf(), including support for formatting binary data.

A possible way forward would be to allow the programmer to express
encoding intent by passing a P1885 <https://wg21.link/p1885> encoding
identifier so that formatting functions can produce text in the expected
encoding. This doesn't necessarily eliminate all encoding confusion,
however: should the format string be interpreted using the literal
encoding or the explicitly provided encoding? When the literal encoding
is Windows-1252, how should something like
std::format(std::text_encoding::UTF8, "téxt") be handled (note that the
encoding of "é" is different in Windows-1252 vs UTF-8)? In this case,
it seems rather obvious that the implementation should use Windows-1252
to interpret the format string and then transcode it to UTF-8. Note
that such transcoding would have to be performed a fragment at a time
since not all fragments necessarily originate in the same encoding.
This would, of course, impose overhead, but only on an opt-in basis.
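
The fragment-at-a-time transcoding described above could be sketched
roughly as follows. This is an illustration under stated assumptions,
not a proposed design: the Encoding enum, the Fragment struct, and the
function names are hypothetical stand-ins (they are not the P1885
interface), only Windows-1252 and UTF-8 are handled, and mapping the
five undefined Windows-1252 bytes to U+FFFD is one possible choice.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical encoding tag; stands in for a P1885-style identifier.
enum class Encoding { windows1252, utf8 };

// A piece of output together with the encoding it originated in.
struct Fragment {
    std::string_view bytes;
    Encoding encoding;
};

// Appends the UTF-8 encoding of a BMP code point to `out`.
void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0xFFFF); // Windows-1252 only maps into the BMP
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Maps one Windows-1252 byte to a Unicode code point.
// Undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) become U+FFFD here.
std::uint32_t windows1252_to_code_point(unsigned char b) {
    static constexpr std::uint16_t high[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178};
    if (b >= 0x80 && b <= 0x9F) return high[b - 0x80];
    return b; // 0x00-0x7F and 0xA0-0xFF map to the same code point value
}

// Transcodes each fragment from its own encoding to UTF-8 and
// concatenates the results. UTF-8 fragments are copied unchanged
// (no validation is performed in this sketch).
std::string transcode_to_utf8(const std::vector<Fragment>& fragments) {
    std::string out;
    for (const Fragment& f : fragments) {
        if (f.encoding == Encoding::utf8) {
            out.append(f.bytes.data(), f.bytes.size());
        } else {
            for (unsigned char b : f.bytes) {
                append_utf8(out, windows1252_to_code_point(b));
            }
        }
    }
    return out;
}
```

With a Windows-1252 format-string fragment "t\xE9xt" and an argument
fragment that is already UTF-8, only the former is converted; the "é"
byte 0xE9 becomes the two UTF-8 code units 0xC3 0xA9. The per-fragment
loop is where the opt-in overhead mentioned above would be incurred.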


On 7/30/21 9:59 AM, Howard Hinnant wrote:
> The intent here is that the implementor uses the same machinery as for http://eel.is/c++draft/locale.time.put. I do not think we want to burden the std::lib with two independent localization mechanisms.
> Howard
> On Jul 30, 2021, at 8:46 AM, Jonathan Wakely via Lib <lib_at_[hidden]> wrote:
>> On Fri, 30 Jul 2021 at 13:45, Corentin via Lib <lib_at_[hidden]> wrote:
>> We decided we want a paper to deal with the issue.
>> We definitely want to postpone!
>> OK, thanks.
>> On Fri, Jul 30, 2021 at 1:05 PM Jeff Garland <jeff_at_[hidden]> wrote:
>> Thanks Tom —
>> Are there wiki notes or anything? We may want to defer discussion until you’ve had more time.
>> Jeff
>>> On Jul 29, 2021, at 11:41 PM, Tom Honermann <tom_at_[hidden]> wrote:
>>> Hi, Jeff. SG16 did discuss LWG 3565 this week. We haven’t reached a conclusion yet but the consensus appears to be heading in a direction that will lead to a different resolution than what is proposed in the issue. I’ll follow up more once I have the meeting summary and polls posted.
>>> Tom.
>>>> On Jul 29, 2021, at 8:10 PM, Jeff Garland via Lib <lib_at_[hidden]> wrote:
>>>> Apologies for the late notice. All new papers for this week:
>>>> P1072 basic_string::resize_and_overwrite
>>>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1072r8.html
>>>> P2372R1 (LWG 3547) Fixing locale handling in chrono formatters ** c++20 bug fix **
>>>> https://wg21.link/P2372R1
>>>> related issues:
>>>> LWG 3547 Time formatters should not be locale sensitive by default
>>>> https://cplusplus.github.io/LWG/issue3547
>>>> LWG 3565 Handling of encodings in localized formatting of chrono types is underspecified
>>>> https://cplusplus.github.io/LWG/issue3565
>>>> P1636 Formatters for Library Types
>>>> https://wg21.link/p1636r2
>>>> ——
>>>> The zoom details for this meeting (and all following LWG meetings) are:
>>>> Join from PC, Mac, Linux, iOS or Android: https://iso.zoom.us/j/99098440581?pwd=K01lM0VyVTB1NjRJN2lRbzFMTit3QT09
>>>> Password: template
>>>> Or iPhone one-tap :
>>>> US: +12532158782,,99098440581# or +13017158592,,99098440581#
>>>> Or Telephone:
>>>> Dial(for higher quality, dial a number based on your current location):
>>>> US: +1 253 215 8782 or +1 301 715 8592 or +1 312 626 6799 or +1 346 248 7799 or +1 408 638 0968 or +1 646 876 9923 or +1 669 900 6833 or 877 853 5247 (Toll Free)
>>>> Meeting ID: 990 9844 0581
>>>> Password: 07955058
>>>> International numbers available: https://iso.zoom.us/u/a4YcGUHwU
>>>> Or Skype for Business (Lync):
>>>> https://iso.zoom.us/skype/99098440581
>>>> _______________________________________________
>>>> Lib mailing list
>>>> Lib_at_[hidden]
>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>>>> Link to this post: http://lists.isocpp.org/lib/2021/07/19950.php
>> _______________________________________________
>> Lib mailing list
>> Lib_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>> Link to this post: http://lists.isocpp.org/lib/2021/07/19954.php
>> _______________________________________________
>> Lib mailing list
>> Lib_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
>> Link to this post: http://lists.isocpp.org/lib/2021/07/19955.php

Received on 2021-07-30 10:38:14