Date: Fri, 30 Jul 2021 17:56:03 +0200
On Fri, Jul 30, 2021 at 5:38 PM Tom Honermann <tom_at_[hidden]> wrote:
> Avoiding multiple localization mechanisms is desirable.
>
> I think the problem we're having boils down to this: Do we want
> std::format() (and the proposed std::print()) to manipulate strings
> (NTBSs with ambiguous or polyglot encoding; e.g., mojibake) or text (well
> formed code unit sequences for a particular encoding)? The existing locale
> facilities do not support the latter because there are multiple possible
> encodings at play (the ordinary literal encoding or the locale encoding,
> neither of which necessarily matches the programmer's intent; the programmer
> may be using UTF-8 encoded strings with a literal encoding of Windows-1252
> running in a Windows-1251 locale). The PR for the issue tries to split the
> difference by choosing the former if the literal encoding is not UTF-8 and
> the latter otherwise. This inconsistency is concerning to some.
>
> Speaking solely for myself, I'm leaning towards these utilities
> manipulating strings (not text) in all existing cases. This puts the
> burden of producing valid text on the programmer (e.g., if the format
> string is UTF-8 and the locale provides Windows-1251, then it is up to the
> programmer to accept the mojibake possibility or do something explicit to
> prevent it). This is consistent with how the existing locale facilities
> work and allows these utilities to function as drop-in replacements for
> printf(), including support for formatting binary data.
>
> A possible way forward would be to allow the programmer to express
> encoding intent by passing a P1885 <https://wg21.link/p1885> encoding
> identifier so that formatting functions can produce text in the expected
> encoding. This doesn't necessarily eliminate all encoding confusion
> however; should the format string be interpreted using the literal encoding
> or the explicitly provided encoding? When the literal encoding is
> Windows-1252, how should something like std::format(std::text_encoding::UTF8,
> "téxt") be handled (note that the encoding of "é" is different in
> Windows-1252 vs UTF-8)? In this case, it seems rather obvious that the
> implementation should use Windows-1252 to interpret the format string and
> then transcode it to UTF-8. Note that such transcoding would have to be
> performed a fragment at a time since not all fragments necessarily
> originate in the same encoding. This would, of course, impose overhead,
> but only on an opt-in basis.
>
I also think having a single localization facility would be best, and
whatever fix we provide for this specific issue will not change that.
That being said, by asking for "the Russian name for Monday" you are
definitely and unambiguously asking for text, and it stands to reason that
the burden of ensuring that this text is delivered in an encoding compatible
with the rest of your system should fall on the standard.
It would be incredibly hostile if our long-term solution were to force users
to write code along the lines of

    std::locale russian("ru-RU");
    std::format("День недели: {}", transcode(utf8, russian.encoding(),
                format(russian, "{:L}", std::chrono::Monday)));

The current locale facilities conflate encoding and localization, which is
one (but not the sole) shortcoming they have.
wg21.link/P2020 goes into more detail.
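
To make the fragment-at-a-time transcoding point from Tom's message concrete, here is a minimal sketch of what it could look like. Everything in it is hypothetical: Encoding, Fragment, append_utf8, decode_byte and to_utf8 are illustrative names, not part of P1885 or of any proposed std::format interface. The decoding tables are deliberately partial (ASCII, the Latin-1 range of Windows-1252, and the А-я block of Windows-1251); any other byte is replaced with U+FFFD.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical encoding tag; stands in for something like a P1885 identifier.
enum class Encoding { Windows1251, Windows1252, Utf8 };

// A piece of formatter output together with the encoding it arrived in,
// e.g. the format string (literal encoding) or a locale-provided day name.
struct Fragment {
    Encoding encoding;
    std::string bytes;
};

// Append the UTF-8 encoding of a code point below U+10000.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp < 0x10000);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode one byte of a single-byte encoding (partial tables, see above).
char32_t decode_byte(Encoding enc, unsigned char b) {
    if (b < 0x80) return b;                                   // ASCII
    if (enc == Encoding::Windows1252 && b >= 0xA0) return b;  // same as Latin-1
    if (enc == Encoding::Windows1251 && b >= 0xC0)            // U+0410..U+044F
        return static_cast<char32_t>(0x0410 + (b - 0xC0));
    return 0xFFFD;                                            // not handled here
}

// Transcode each fragment from its own encoding and concatenate as UTF-8.
std::string to_utf8(const std::vector<Fragment>& fragments) {
    std::string out;
    for (const Fragment& f : fragments) {
        if (f.encoding == Encoding::Utf8) {
            out += f.bytes;  // already in the target encoding
            continue;
        }
        for (unsigned char b : f.bytes) {
            append_utf8(out, decode_byte(f.encoding, b));
        }
    }
    return out;
}
```

With this sketch, a Windows-1252 fragment "t\xE9xt: " followed by a Windows-1251 fragment "\xCF\xED" comes out of to_utf8 as the UTF-8 bytes for "téxt: Пн". The per-fragment encoding tag is the overhead being discussed: the formatter has to track where each piece came from rather than concatenating bytes blindly.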
>
> Tom.
>
> On 7/30/21 9:59 AM, Howard Hinnant wrote:
>
> The intent here is that the implementor uses the same machinery as for http://eel.is/c++draft/locale.time.put. I do not think we want to burden the std::lib with two independent localization mechanisms.
>
> Howard
>
> On Jul 30, 2021, at 8:46 AM, Jonathan Wakely via Lib <lib_at_[hidden]> wrote:
>
> On Fri, 30 Jul 2021 at 13:45, Corentin via Lib <lib_at_[hidden]> wrote:
> We decided we want a paper to deal with the issue.
> We definitely want to postpone!
>
> OK, thanks.
>
>
>
> On Fri, Jul 30, 2021 at 1:05 PM Jeff Garland <jeff_at_[hidden]> wrote:
> Thanks Tom —
>
> Are there wiki notes or anything? We may want to defer discussion until you’ve had more time.
>
> Jeff
>
>
> On Jul 29, 2021, at 11:41 PM, Tom Honermann <tom_at_[hidden]> wrote:
>
> Hi, Jeff. SG16 did discuss LWG 3565 this week. We haven’t reached a conclusion yet but the consensus appears to be heading in a direction that will lead to a different resolution than what is proposed in the issue. I’ll follow up more once I have the meeting summary and polls posted.
>
> Tom.
>
>
> On Jul 29, 2021, at 8:10 PM, Jeff Garland via Lib <lib_at_[hidden]> wrote:
>
>
> Apologies for the late notice. All new papers for this week:
>
>
> P1072 basic_string::resize_and_overwrite http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1072r8.html
>
>
> P2372R1 (LWG 3547) Fixing locale handling in chrono formatters ** c++20 bug fix ** https://wg21.link/P2372R1
>
> related issues:
> LWG 3547 Time formatters should not be locale sensitive by default https://cplusplus.github.io/LWG/issue3547
>
> LWG 3565 Handling of encodings in localized formatting of chrono types is underspecified https://cplusplus.github.io/LWG/issue3565
>
> P1636 Formatters for Library Types https://wg21.link/p1636r2
>
> ——
>
> The zoom details for this meeting (and all following LWG meetings) are:
>
> Join from PC, Mac, Linux, iOS or Android: https://iso.zoom.us/j/99098440581?pwd=K01lM0VyVTB1NjRJN2lRbzFMTit3QT09
> Password: template
>
> Or iPhone one-tap :
> US: +12532158782,,99098440581# or +13017158592,,99098440581#
> Or Telephone:
> Dial(for higher quality, dial a number based on your current location):
> US: +1 253 215 8782 or +1 301 715 8592 or +1 312 626 6799 or +1 346 248 7799 or +1 408 638 0968 or +1 646 876 9923 or +1 669 900 6833 or 877 853 5247 (Toll Free)
> Meeting ID: 990 9844 0581
> Password: 07955058
> International numbers available: https://iso.zoom.us/u/a4YcGUHwU
>
> Or Skype for Business (Lync):
> https://iso.zoom.us/skype/99098440581
> _______________________________________________
> Lib mailing list Lib_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2021/07/19950.php
>
> _______________________________________________
> Lib mailing list Lib_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2021/07/19954.php
> _______________________________________________
> Lib mailing list Lib_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/lib
> Link to this post: http://lists.isocpp.org/lib/2021/07/19955.php
>
>
>
Received on 2021-07-30 10:56:17