Date: Tue, 22 Jun 2021 21:48:20 +0000
The current resolution is also ... questionable in the case where the string literal encoding isn't UTF-8, although there's nothing we could do in that case that's always right (we could transcode into GB18030, since that's a UTF, but not into other code pages).
I think the current behavior of C library functions in the case where the locale is non-Unicode and they get something they can't represent is to transliterate, which is a very bad default (and also not something I want to have to implement in the standard library, at least not unless it's part of some larger Unicode support facility).
I tentatively support resolving this issue as "never transcode", and if you specify a locale that has a different encoding to what's in your format string you just get mojibake.
Another thing:
If users only use format control characters and other invariant characters in their format string, then the resulting encoding will be whatever the encoding of their parameters is. I expect users to rely on this when using plain std::format, and it seems odd to break that with chrono formatting.
There may be some implementations where the above isn't true, because the format control characters are not invariant across all supported code pages, but MSVC isn't such an implementation. (As I've previously said, the reason we care about the literal encoding in MSVC's implementation is that, while the control characters are invariant, some of the encodings we support are _not_ self-synchronizing, so the control characters can show up as trailing bytes of multi-byte "shift" sequences.)
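To make the self-synchronization point concrete, here is a minimal sketch, assuming Shift-JIS with lead bytes in 0x81-0x9F and 0xE0-0xFC. The function names are hypothetical and this is not MSVC's actual code; it only shows why a byte-wise scan for '{' is not enough in such an encoding.

```cpp
#include <cstddef>
#include <string_view>

// Assumption: lead bytes of two-byte Shift-JIS characters are
// 0x81-0x9F and 0xE0-0xFC. Trail bytes may fall in the ASCII range,
// including 0x7B ('{') and 0x7D ('}').
constexpr bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive scan: treats every 0x7B byte as an opening brace.
constexpr std::size_t count_open_braces_naive(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if (c == '{') ++n;
    }
    return n;
}

// Encoding-aware scan: skips the trail byte that follows a lead byte,
// so a 0x7B inside a two-byte character is not counted.
constexpr std::size_t count_open_braces_sjis(std::string_view s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead(b)) {
            ++i;  // skip the trail byte of this two-byte character
            continue;
        }
        if (b == '{') ++n;
    }
    return n;
}
```

For the byte string "\x83\x7B{}" (one two-byte character followed by an empty replacement field), the naive scan reports two opening braces and the encoding-aware scan reports one.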
All this to say that just because the literal string encoding is UTF-8 _does not mean_ the encoding of the output of std::format is expected to _also_ be UTF-8.
From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Corentin Jabot via SG16
Sent: Friday, June 18, 2021 1:34 AM
To: Peter Brett <pbrett_at_[hidden]>
Cc: Corentin Jabot <corentinjabot_at_[hidden]>; sg16_at_[hidden]
Subject: Re: [SG16] Alternative approach for LWG3565 "Handling of encodings in localized chrono formatting"
On Fri, Jun 18, 2021 at 10:22 AM Peter Brett <pbrett_at_[hidden]> wrote:
Hi Corentin,
The requirement to perform transcoding makes me uncomfortable because I don't think it's actually implementable in the general case.
Users of the standard library can create customized locale objects with bespoke time_put facets, and there is literally no way for the chrono formatter to know which codeset a user-specified locale facet is using or how to transcode its output.
Totally happy for you to shoot down my alternative proposal, but I'm opposed to the current proposed resolution because std::locale just doesn't work like that.
The locale objects themselves do have an encoding (with the assumption that facets will respect that encoding).
The answer here is P1885, which makes that information publicly accessible. In the absence of that, implementers have the information.
Well, some of them do (glibc, Microsoft), but indeed on some platforms the information does not exist because nl_langinfo is not part of the POSIX spec, so P1885 will give you unknown information.
Is that an issue?
My understanding is that the set of scenarios in which
* There exist both an XXX and an XXX.UTF-8 locale and the implementation knows how to go from one to the other
* The implementation doesn't know the encoding of XXX
is empty or very small.
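As a rough illustration of what "the implementation knows the encoding of XXX" can mean in practice, here is a sketch for systems that provide nl_langinfo in <langinfo.h>. The helper name codeset_of is hypothetical, the returned string is whatever the C library reports (spellings vary between platforms), and the function is not thread-safe because it temporarily changes the process-wide C locale.

```cpp
#include <clocale>
#include <langinfo.h>
#include <string>

// Returns the codeset name the C library reports for `locale_name`,
// or an empty string if that locale cannot be selected.
// Not thread-safe: it temporarily switches the global LC_CTYPE.
std::string codeset_of(const char* locale_name) {
    // Remember the current LC_CTYPE setting so it can be restored.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";

    std::string result;
    if (std::setlocale(LC_CTYPE, locale_name) != nullptr) {
        result = nl_langinfo(CODESET);
        std::setlocale(LC_CTYPE, saved.c_str());
    }
    return result;
}
```

An empty result corresponds to the "unknown" case discussed above; a non-empty one is what a transcoding implementation would have to map to a known encoding.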
I think you are right that we probably don't say how custom facets behave with respect to encodings, but we certainly expect them to behave a certain way!
Best regards,
Peter
From: Corentin Jabot <corentinjabot_at_[hidden]>
Sent: 18 June 2021 09:14
To: SG16 <sg16_at_[hidden]>
Cc: Peter Brett <pbrett_at_[hidden]>
Subject: Re: [SG16] Alternative approach for LWG3565 "Handling of encodings in localized chrono formatting"
On Thu, Jun 17, 2021 at 10:57 PM Peter Brett via SG16 <sg16_at_[hidden]> wrote:
Hi all,
The current proposed resolution for LWG3565 (https://wg21.link/LWG3565)
involves transcoding from the locale encoding to UTF-8. This makes me a
little uncomfortable.
Can you clarify what makes you uncomfortable?
Is it possible instead to say that, if the string literal encoding is
UTF-8, then the effective locale is _as if_ the specified or global
locale was modified by replacing the associated codeset with UTF-8?
So, the following code:
std::locale l1("Russian.1251");
auto s = std::format(l1, "День недели: {:L}", std::chrono::Monday);
Would behave as if replaced by:
std::locale l1("Russian.1251");
std::locale l2(l1, std::locale("Russian.UTF-8"), std::locale::time);
auto s = std::format(l2, "День недели: {:L}", std::chrono::Monday);
This would permit an implementation that has UTF-8 locale data available
to use it directly, rather than being required to use the 1251 codeset
locale data and transcode in order to conform to the standard.
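The "as if" substitution above can be sketched as a small helper. This is only an illustration: the name with_time_from is hypothetical, locale names such as "Russian.UTF-8" are platform conventions rather than anything the standard guarantees, and the fallback to the unmodified locale is my own assumption about what to do when the named locale is unavailable.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Returns a copy of `base` whose time category is taken from the locale
// named `time_locale_name`. If that name is not recognised on this
// platform, std::locale's constructor throws std::runtime_error and the
// sketch falls back to returning `base` unchanged.
std::locale with_time_from(const std::locale& base,
                           const std::string& time_locale_name) {
    try {
        return std::locale(base, std::locale(time_locale_name),
                           std::locale::time);
    } catch (const std::runtime_error&) {
        return base;
    }
}
```

With this, the second snippet would read `auto l2 = with_time_from(l1, "Russian.UTF-8");`, and the silent fallback shows the weak spot: nothing tells the caller whether a UTF-8 variant actually existed.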
"associated codeset with UTF-8" is not really a thing.
The ".UTF-8" locales merely exist by convention on some platforms.
There is no spec that says that:
* Russian.1251 is not UTF-8
* Russian.1251.UTF-8 exists
* Russian.1251 and Russian.1251.UTF-8 only differ by encoding if both exist
Transcoding is therefore more generally applicable.
Note that I have my own reservations about this issue, namely: how much effort are we willing to put
into mending a system that only works for a narrow subset of cultures, languages and circumstances?
That being said, even if that issue amounts to putting duct tape over a giant crack in the wall,
it also doesn't hurt.
It is undoubtedly more correct than the status quo, and it might make the life of our Windows users a bit less painful
as a stopgap solution.
Peter
P.S. How would one go about writing a locale object that customizes
chrono formatting with std::format? Does anyone have a code sample?
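One possible shape for such a sample, offered as a sketch rather than a definitive answer: derive from std::time_put<char>, override do_put, and install the facet into a locale. The class name shouting_time_put and the helper functions are hypothetical. The sketch drives the facet directly through std::use_facet, because whether an implementation's chrono formatter consults a user-installed time_put facet for {:L} is, as far as I can tell, implementation-dependent. Note that the facet has no way to announce which encoding it emits, which is the problem discussed in this thread.

```cpp
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical custom facet: prints the full weekday name (%A) in upper
// case and defers every other conversion to the default behaviour.
class shouting_time_put : public std::time_put<char> {
protected:
    iter_type do_put(iter_type out, std::ios_base& io, char_type fill,
                     const std::tm* t, char format,
                     char modifier) const override {
        if (format == 'A' && t->tm_wday >= 0 && t->tm_wday <= 6) {
            static const char* const names[] = {
                "SUNDAY", "MONDAY", "TUESDAY", "WEDNESDAY",
                "THURSDAY", "FRIDAY", "SATURDAY"};
            for (const char* p = names[t->tm_wday]; *p != '\0'; ++p) {
                *out++ = *p;
            }
            return out;
        }
        return std::time_put<char>::do_put(out, io, fill, t, format, modifier);
    }
};

// Builds a locale equal to the classic "C" locale except for time_put.
// The locale takes ownership of the facet.
std::locale make_custom_locale() {
    return std::locale(std::locale::classic(), new shouting_time_put);
}

// Formats the weekday `wday` (0 = Sunday) using the time_put facet of `loc`.
std::string weekday_name(const std::locale& loc, int wday) {
    std::tm t{};
    t.tm_wday = wday;
    std::ostringstream os;
    os.imbue(loc);
    const auto& facet = std::use_facet<std::time_put<char>>(loc);
    facet.put(std::ostreambuf_iterator<char>(os), os, ' ', &t, 'A');
    return os.str();
}
```

Calling weekday_name(make_custom_locale(), 1) yields "MONDAY", while the same call with std::locale::classic() goes through the default facet.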
-- SG16 mailing list SG16_at_[hidden] https://lists.isocpp.org/mailman/listinfo.cgi/sg16
Received on 2021-06-22 16:48:25