Re: [SG16] Additional concerns for LWG3565: Handling of encodings in localized formatting of chrono types is underspecified

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Sat, 7 Aug 2021 07:47:42 -0700
> My concern is that the wording not require us to keep track of the actual
> locale name, so we can add special facets that just store the right thing,
> or use a codecvt facet from the supported locale itself.

Completely agree.
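
To illustrate why the locale name is a poor key, here is a minimal sketch. The names marker_facet, combine_result and inspect_combined_locale are made up for this example and come from no implementation. It adds one unrelated facet to the "C" locale and reports that the result is unnamed while still sharing the very same time_put facet object. Facets are not copyable, so the combined locale has to reference the original.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical user facet with no behaviour; it exists only so that there is
// something new to add to a locale.
struct marker_facet : std::locale::facet {
    static std::locale::id id;
};
std::locale::id marker_facet::id;

struct combine_result {
    std::string base_name;      // name of the original locale
    std::string combined_name;  // name after adding one facet
    bool same_time_put;         // do both locales share one time_put object?
};

combine_result inspect_combined_locale() {
    const std::locale base = std::locale::classic();
    // locale(const locale&, Facet*) copies every facet of `base` except the
    // one being installed; the resulting locale has no name.
    const std::locale combined(base, new marker_facet);
    assert(std::has_facet<marker_facet>(combined));

    const auto& a = std::use_facet<std::time_put<char>>(base);
    const auto& b = std::use_facet<std::time_put<char>>(combined);
    return {base.name(), combined.name(), &a == &b};
}
```

An unnamed locale reports "*" from name(), so a check based on the name would reject this locale even though its time_put output is unchanged.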

> With respect to the time_put output: did you pass through the locale’s
> codecvt<char32_t, char> then through std::codecvt_utf8?

Almost. We pass it through codecvt<char32_t, char> or codecvt<wchar_t, char>
depending on the standard library implementation and then do the UTF
conversion which is trivial and doesn't require dealing with facets.
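
Roughly, the pipeline looks like the sketch below. This is not the actual {fmt} code; the function names are invented for illustration, only the codecvt<wchar_t, char> variant is shown, and it assumes wchar_t holds a whole code point (true on Linux; on Windows wchar_t is UTF-16 and surrogate pairs would have to be combined first).

```cpp
#include <cassert>
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// Step 1: let the locale's time_put facet format one field ('A' means %A).
// The returned bytes are in whatever encoding the locale uses.
std::string put_time_field(const std::locale& loc, const std::tm& tm, char spec) {
    std::ostringstream os;
    os.imbue(loc);
    const auto& tp = std::use_facet<std::time_put<char>>(loc);
    tp.put(std::ostreambuf_iterator<char>(os), os, ' ', &tm, spec);
    return os.str();
}

// Step 2: decode those bytes with the same locale's codecvt<wchar_t, char>.
std::wstring decode_with_locale(const std::locale& loc, const std::string& bytes) {
    if (bytes.empty()) return {};
    using codecvt_t = std::codecvt<wchar_t, char, std::mbstate_t>;
    const auto& cvt = std::use_facet<codecvt_t>(loc);
    std::mbstate_t state{};
    std::wstring wide(bytes.size(), L'\0');  // one wchar_t per byte is enough
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const auto result =
        cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
               &wide[0], &wide[0] + wide.size(), to_next);
    if (result != std::codecvt_base::ok)
        throw std::runtime_error("locale text could not be decoded");
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));
    return wide;
}

// Step 3: plain UTF-8 encoding of one code point; no facets involved.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// All three steps together: locale-encoded time_put output in, UTF-8 out.
std::string localized_field_utf8(const std::locale& loc, const std::tm& tm, char spec) {
    std::string utf8;
    for (wchar_t wc : decode_with_locale(loc, put_time_field(loc, tm, spec)))
        append_utf8(utf8, static_cast<char32_t>(wc));
    return utf8;
}
```

Because both time_put and codecvt are looked up on the same locale object, this keeps working for user-defined and combined locales as long as their facets agree on the encoding.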

> It would be nice to be able to use a std::codecvt<char, char8_t> facet
> that’s on the locale as well, I doubt we’ll bother.

Right, this would eliminate the UTF conversion step but I don't think it's
essential.

Cheers,
Victor


On Wed, Aug 4, 2021 at 10:19 AM Charlie Barto <Charles.Barto_at_[hidden]>
wrote:

> I agree that we shouldn’t be too specific. My concern is that the wording
> not require us to keep track of the actual locale name, so we can add
> special facets that just store the right thing, or use a codecvt facet from
> the supported locale itself. It’s also possible that it’ll be really hard
> to standardize anything that actually has any normative meaning here, maybe
> just a note of clarification suggesting the conversion as a QoI
> improvement.
>
>
>
> Maybe all we need is like: “if the encoding of any locale specific data
> does not match the string literal encoding the resulting replacement
> characters are implementation specified”.
>
> With respect to the time_put output: did you pass through the locale’s
> codecvt<char32_t, char> then through std::codecvt_utf8?
>
>
>
> That does indeed seem like it would work “OK”
>
>
>
> It would be nice to be able to use a std::codecvt<char, char8_t> facet
> that’s on the locale as well, I doubt we’ll bother.
>
>
>
> Charlie
>
>
>
> *From:* Victor Zverovich <victor.zverovich_at_[hidden]>
> *Sent:* Wednesday, August 4, 2021 10:04 AM
> *To:* SG16 <sg16_at_[hidden]>
> *Cc:* Corentin Jabot <corentinjabot_at_[hidden]>; Charlie Barto <
> Charles.Barto_at_[hidden]>
> *Subject:* Re: [SG16] Additional concerns for LWG3565: Handling of
> encodings in localized formatting of chrono types is underspecified
>
>
>
> A simple way to implement LWG3565 is by passing the time_put output
> through codecvt although there might be other ways. This is what {fmt} does
> and it worked surprisingly well. And of course there can be fast paths for
> known or common locales/facets such as the "C" locale. The cool thing is
> that it will even work with user-defined locales unless they do something
> completely crazy and have incompatible encodings in different facets.
>
>
>
> I personally don't think we need to be very specific about the exact
> implementation mechanism. It's enough that the resolution is implementable
> and libraries have freedom to do what they want.
>
>
>
> Cheers,
>
> Victor
>
>
>
> On Mon, Aug 2, 2021 at 2:50 PM Charlie Barto via SG16 <
> sg16_at_[hidden]> wrote:
>
> It’s tricky for such locales because they can have facets in common with
> the “known locales” but be a different locale. So if the standard says we
> can’t transcode for them, it produces extremely surprising behavior.
>
>
>
> Better wording would be something along the lines of “for some subset of
> the locales available on a system “%a %A %b, %B %c %C %p %x %X” must
> produce UTF-8 encoded output when the string literal encoding is UTF-8.
> Implementations should support UTF-8 output for these specifiers for common
> locales.” Since any set of supported locales/facets is
> implementation-defined, we can’t really say much here, I think. Also, I think
> all implementations do produce UTF-8 output for at least one locale, and I
> think that applying the fix to add special UTF-8 facets to be used by
> chrono would be conforming to the current standard.
>
>
>
> *From:* Corentin Jabot <corentinjabot_at_[hidden]>
> *Sent:* Monday, August 2, 2021 2:44 PM
> *To:* Charlie Barto <Charles.Barto_at_[hidden]>
> *Cc:* SG16 <sg16_at_[hidden]>
> *Subject:* Re: [SG16] Additional concerns for LWG3565: Handling of
> encodings in localized formatting of chrono types is underspecified
>
>
> On Mon, Aug 2, 2021 at 11:25 PM Charlie Barto <Charles.Barto_at_[hidden]>
> wrote:
>
> ➢ or if the locale is not in the implementation-defined set of known
> locales, the value of ... is locale-specific
>
> This is highly problematic, this would disallow using the "utf8" version
> of time_put in a user locale that was formed by copying some "known" locale
> and adding facets, requiring the slow implementation. It's better if we
> talk about known facets rather than known locales so that the behavior for
> copied / combined locales stays consistent.
>
>
>
> But didn't you just say it was tricky for such locales?
>
> If you can identify them easily, you can put them in the
> implementation-defined set of known locales ("known locales" here is a bit of
> a misnomer; it really means "known not to produce mojibake for the purpose of
> the chrono formatter").
>
>
>
>
> From: Corentin Jabot <corentinjabot_at_[hidden]>
> Sent: Monday, August 2, 2021 2:21 PM
> To: SG16 <sg16_at_[hidden]>
> Cc: Charlie Barto <Charles.Barto_at_[hidden]>
> Subject: Re: [SG16] Additional concerns for LWG3565: Handling of encodings
> in localized formatting of chrono types is underspecified
>
> There is no mention in the wording that time_put is used.
> Ex:
>
> > %a The locale's abbreviated weekday name. If the value does not contain
> a valid weekday, an exception of type format_error is thrown.
>
> Can we say something roughly like:
>
> If the format string encoding is UTF-8, for any locale in the
> implementation-defined set of known locales, the value of %a %A %b, %B %c
> %C %p %x %X
> is UTF-8 encoded. [Note: Whether time_put is called is unspecified].
> If the format string encoding is not UTF-8, or if the locale is not in the
> implementation-defined set of known locales, the value of ... is
> locale-specific
> [Note: The locale-specific value may not be in the same text encoding as the
> format string]
>
> With some bonus wording if we want to support other utf encodings?
>
>
> On Mon, Aug 2, 2021 at 9:51 PM Charlie Barto via SG16 <mailto:
> sg16_at_[hidden]> wrote:
> I was discussing this with a coworker (Billy) and he brought up the point
> that even if we have an allow-list of locales in which to do the
> transcoding the proposed resolution is _still_ quite difficult to
> implement, because users can call “combine” or use
> “locale::locale(locale&,facet*)” (or various other constructors) to shove
> arbitrary new facets into a locale. The returned locale will have a
> different name, or be unnamed, but we still need to handle the facets from
> the original locale the same as before (i.e. do the transcoding). Otherwise
> users will be chugging along happily transcoding locale specific text,
> decide they’d like to add a new facet, and suddenly get mojibake.
>
> To support this in a way that doesn’t have this surprise we’d have to take
> the facet we’d like to use and compare it against every single “supported”
> version of that facet. This means if we allow-list locale names we would
> have to allocate _every single locale’s_ version of a given facet the first
> time a locale-sensitive chrono specifier was used. The comparison may
> involve a dynamic_cast (although I’m not 100% sure that’s really
> necessary), making it potentially quite expensive. For me this is a bit of
> a dealbreaker if LWG3565 is applied without P2372.
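
For comparison, a type-based check needs only one dynamic_cast per lookup and no per-locale allocation. The sketch below assumes the implementation ships its own facet type whose output is known to be UTF-8; utf8_time_put, has_known_time_put and with_utf8_time_put are hypothetical names, not part of any standard library.

```cpp
#include <cassert>
#include <locale>

// Hypothetical implementation-provided facet whose output is known to be
// UTF-8. It inherits std::time_put<char>::id, so it occupies the ordinary
// time_put<char> slot of a locale.
struct utf8_time_put : std::time_put<char> {};

// True if the locale's time_put<char> facet is the known UTF-8 one.
// Only the facet type is inspected; the locale name is never consulted.
bool has_known_time_put(const std::locale& loc) {
    const auto& facet = std::use_facet<std::time_put<char>>(loc);
    return dynamic_cast<const utf8_time_put*>(&facet) != nullptr;
}

// Returns a copy of `base` with the UTF-8 facet installed. The cast makes it
// explicit that the facet is installed as the locale's time_put<char>.
std::locale with_utf8_time_put(const std::locale& base) {
    std::locale result(base, static_cast<std::time_put<char>*>(new utf8_time_put));
    assert(has_known_time_put(result));  // the facet is now discoverable by type
    return result;
}
```

Since combine and the facet-adding constructors carry existing facets over unchanged, a check like this gives the same answer before and after the user adds an unrelated facet.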
>
> It may be better to just allow implementations to use some custom facet
> type (maybe time_get<uchar_t>, or maybe _Utf8_time_get<char>, etc) when
> doing chrono formatting. This may already be allowed by the wording in
> [time.format]. It does not appear to actually say we must use a particular
> locale, although that may be specified in ISO 8601:2004, a copy of which I
> am trying to get a hold of. We could also say that conversion goes through
> a hypothetical std::codecvt<char, char8_t, std::mbstate_t> facet.
>
> In general, I’m OK with the “{:L}” specifiers being a little broken,
> there’s only so much we can do since facets don’t have a way of
> communicating the encoding of their output (besides the codecvt facets).
> If/when we add some better locale handling support, we can always add an
> “{:LEx}” specifier (maybe we could use “{:ℒ}” or “{:𝔏}” 😊).
>
> +1 on that last point
>
>
> Charlie
> --
> SG16 mailing list
> mailto:SG16_at_[hidden]
>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2021-08-07 09:47:57