sg16: Re: [SG16] Additional concerns for LWG3565: Handling of encodings in localized formatting of chrono types is underspecified

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Wed, 4 Aug 2021 10:03:48 -0700

A simple way to implement LWG3565 is by passing the time_put output through
codecvt although there might be other ways. This is what {fmt} does and it
worked surprisingly well. And of course there can be fast paths for known
or common locales/facets such as the "C" locale. The cool thing is that it
will even work with user-defined locales unless they do something
completely crazy and have incompatible encodings in different facets.
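
For readers who want something concrete, the approach above could be sketched
roughly as follows. This is an illustration only, not {fmt}'s actual code: the
helper names append_utf8 and localized_utf8 are made up here, and the sketch
assumes that wchar_t holds Unicode (UTF-32, or UTF-16 where wchar_t is 16 bits
wide), which is true on common platforms but not guaranteed by the standard.

```cpp
#include <cstddef>
#include <ctime>
#include <cwchar>
#include <iterator>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// Append one Unicode code point to `out` as UTF-8.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: format one strftime-style specifier with the locale's
// time_put<char>, then transcode the result to UTF-8 through the same
// locale's codecvt<wchar_t, char, mbstate_t> facet.
inline std::string localized_utf8(const std::locale& loc, const std::tm& tm,
                                  char spec) {
    // Step 1: time_put output, in whatever narrow encoding the locale uses.
    std::ostringstream os;
    os.imbue(loc);
    const auto& tp = std::use_facet<std::time_put<char>>(loc);
    tp.put(std::ostreambuf_iterator<char>(os), os, ' ', &tm, spec);
    const std::string narrow = os.str();
    if (narrow.empty()) return narrow;

    // Step 2: narrow -> wide using the locale's own codecvt facet.
    const auto& cvt =
        std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(loc);
    std::mbstate_t state{};
    std::wstring wide(narrow.size(), L'\0');  // at most one unit per byte
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const auto res =
        cvt.in(state, narrow.data(), narrow.data() + narrow.size(), from_next,
               &wide[0], &wide[0] + wide.size(), to_next);
    if (res != std::codecvt_base::ok)
        throw std::runtime_error("transcoding failed");
    wide.resize(static_cast<std::size_t>(to_next - &wide[0]));

    // Step 3: wide -> UTF-8 (assumption: wchar_t is UTF-32 or UTF-16).
    std::string utf8;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        char32_t cp = static_cast<char32_t>(wide[i]);
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF &&
            i + 1 < wide.size()) {
            // Combine a UTF-16 surrogate pair into one code point.
            const char32_t low = static_cast<char32_t>(wide[++i]);
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        append_utf8(utf8, cp);
    }
    return utf8;
}
```

Calling localized_utf8(loc, tm, 'A') asks for the full weekday name. A fast
path for the "C" locale could return the time_put output directly, since that
output is plain ASCII.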

I personally don't think we need to be very specific about the exact
implementation mechanism. It's enough that the resolution is implementable
and libraries have freedom to do what they want.

Cheers,
Victor

On Mon, Aug 2, 2021 at 2:50 PM Charlie Barto via SG16 <sg16_at_[hidden]>
wrote:

> It’s tricky for such locales because they can have facets in common with
> the “known locales” but be a different locale. So if the standard says we
> can’t transcode for them, it produces extremely surprising behavior.
>
>
>
> Better wording would be something along the lines of “for some subset of
> the locales available on a system, “%a %A %b %B %c %C %p %x %X” must
> produce UTF-8 encoded output when the string literal encoding is UTF-8.
> Implementations should support UTF-8 output for these specifiers for common
> locales.” Since any set of supported locales/facets is implementation
> defined, we can’t really say much here, I think. Also, I think all
> implementations do produce UTF-8 output for at least one locale, and I
> think that applying the fix to add special UTF-8 facets to be used by
> chrono would be conforming to the current standard.
>
>
>
> *From:* Corentin Jabot <corentinjabot_at_[hidden]>
> *Sent:* Monday, August 2, 2021 2:44 PM
> *To:* Charlie Barto <Charles.Barto_at_[hidden]>
> *Cc:* SG16 <sg16_at_[hidden]>
> *Subject:* Re: [SG16] Additional concerns for LWG3565: Handling of
> encodings in localized formatting of chrono types is underspecified
>
> On Mon, Aug 2, 2021 at 11:25 PM Charlie Barto <Charles.Barto_at_[hidden]>
> wrote:
>
> ➢ or if the locale is not in the implementation-defined set of known
> locales, the value of ... is locale-specific
>
> This is highly problematic: it would disallow using the "utf8" version
> of time_put in a user locale that was formed by copying some "known" locale
> and adding facets, requiring the slow implementation instead. It's better
> if we talk about known facets rather than known locales so that the
> behavior for copied / combined locales stays consistent.
>
>
>
> But didn't you just say it was tricky for such locales?
>
> If you can identify them easily, you can put them in the
> implementation-defined set of known locales ("known locales" here is a bit
> of a misnomer; it means "known not to produce mojibake for the purpose of
> the chrono formatter").
>
>
>
>
> From: Corentin Jabot <corentinjabot_at_[hidden]>
> Sent: Monday, August 2, 2021 2:21 PM
> To: SG16 <sg16_at_[hidden]>
> Cc: Charlie Barto <Charles.Barto_at_[hidden]>
> Subject: Re: [SG16] Additional concerns for LWG3565: Handling of encodings
> in localized formatting of chrono types is underspecified
>
> There is no mention in the wording that time_put is used.
> Ex:
>
> > %a The locale's abbreviated weekday name. If the value does not contain
> a valid weekday, an exception of type format_error is thrown.
>
> Can we say something roughly like:
>
> If the format string encoding is UTF-8, for any locale in the
> implementation-defined set of known locales, the value of %a %A %b %B %c
> %C %p %x %X
> is UTF-8 encoded. [Note: Whether time_put is called is unspecified].
> If the format string encoding is not UTF-8, or if the locale is not in the
> implementation-defined set of known locales, the value of ... is
> locale-specific.
> [Note: the locale-specific value may not be in the same text encoding as
> the format string]
>
> With some bonus wording if we want to support other utf encodings?
>
>
> On Mon, Aug 2, 2021 at 9:51 PM Charlie Barto via SG16 <mailto:
> sg16_at_[hidden]> wrote:
> I was discussing this with a coworker (Billy) and he brought up the point
> that even if we have an allow-list of locales in which to do the
> transcoding the proposed resolution is _still_ quite difficult to
> implement, because users can call “combine” or use
> “locale::locale(locale&,facet*)” (or various other constructors) to shove
> arbitrary new facets into a locale. The returned locale will have a
> different name, or be unnamed, but we still need to handle the facets from
> the original locale the same as before (i.e. do the transcoding). Otherwise
> users will be chugging along happily transcoding locale specific text,
> decide they’d like to add a new facet, and suddenly get mojibake.
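
The situation described in the paragraph above can be reproduced with a few
lines. This is only a sketch: user_facet and same_time_put are hypothetical
names invented for the illustration, and the observation that the derived
locale shares the very same time_put object with its base reflects how common
implementations reference-count facets rather than explicit standard wording.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical user facet, present only to have something to add to a locale.
struct user_facet : std::locale::facet {
    static std::locale::id id;
    explicit user_facet(std::size_t refs = 0) : std::locale::facet(refs) {}
};
std::locale::id user_facet::id;

// True if both locales hand out the same time_put<char> facet object.
inline bool same_time_put(const std::locale& a, const std::locale& b) {
    return &std::use_facet<std::time_put<char>>(a) ==
           &std::use_facet<std::time_put<char>>(b);
}

// Build a locale from `base` plus one extra facet. The result is unnamed
// (name() returns "*"), so a check against an allow-list of locale names
// no longer matches even though the time_put facet came from `base`.
inline std::locale with_user_facet(const std::locale& base) {
    return std::locale(base, new user_facet);
}
```

A name-based allow-list would treat with_user_facet(base) as unknown, whereas
a facet-based check such as same_time_put(base, derived) would still recognize
where the time_put facet came from.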
>
> To support this in a way that doesn’t have this surprise we’d have to take
> the facet we’d like to use and compare it against every single “supported”
> version of that facet. This means if we allow-list locale names we would
> have to allocate _every single locale’s_ version of a given facet the first
> time a locale-sensitive chrono specifier was used. The comparison may
> involve a dynamic_cast (although I’m not 100% sure that’s really
> necessary), making it potentially quite expensive. For me this is a bit of
> a dealbreaker if LWG3565 is applied without P2372.
>
> It may be better to just allow implementations to use some custom facet
> type (maybe time_get<uchar_t>, or maybe _Utf8_time_get<char>, etc) when
> doing chrono formatting. This may already be allowed by the wording in
> [time.format]. It does not appear to actually say we must use a particular
> locale, although that may be specified in ISO8601:2004, a copy of which I
> am trying to get a hold of. We could also say that conversion goes through
> a hypothetical std::codecvt<char, char8_t, std::mbstate_t> facet.
>
> In general, I’m OK with the “{:L}” specifiers being a little broken,
> there’s only so much we can do since facets don’t have a way of
> communicating the encoding of their output (besides the codecvt facets).
> If/when we add some better locale handling support, we can always add an
> “{:LEx}” specifier (maybe we could use “{:ℒ}” or “{:𝔏}” 😊).
>
> +1 on that last point
>
>
> Charlie
> --
> SG16 mailing list
> mailto:SG16_at_[hidden]
>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2021-08-04 12:04:03