SG16

Subject: Re: LWG issue: Time formatters should not be locale sensitive by default
From: Corentin (corentin.jabot_at_[hidden])
Date: 2021-05-02 05:52:06


Did more research,
I was wrong

There does not seem to be an end to the brokenness, which I guess is to be
expected when dealing with POSIX.
Reading the code of strftime reveals that it never uses the locale to
format numbers without O, even when it should.

In fact, it never uses the digits property at all! Neither does printf
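
To make the plain vs. O distinction concrete, here is a minimal sketch that formats the day of the month with %d and with %Od. The helper name format_day is made up for this example. It runs in the default "C" locale, where the C standard says the E and O modifiers are ignored, so both calls should produce the same text; what happens in other locales depends on the implementation and on whether the locale defines alt_digits.

```cpp
#include <cstddef>
#include <ctime>
#include <string>

// Hypothetical helper: format the day of the month with the given
// strftime format string, using whatever locale is currently active.
std::string format_day(const char* fmt, int day_of_month) {
    std::tm tm{};              // zero-initialize the broken-down time
    tm.tm_mday = day_of_month; // day of the month, 1-31
    tm.tm_mon = 4;             // May (months are 0-based)
    tm.tm_year = 121;          // 2021 (years since 1900)
    char buf[64];
    std::size_t n = std::strftime(buf, sizeof buf, fmt, &tm);
    return std::string(buf, n);
}
```

Calling format_day("%d", 2) and format_day("%Od", 2) without first calling setlocale compares the two paths in the "C" locale.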

There is a glibc extension:
> glibc 2.2 adds one further flag character. *I*: For decimal integer
> conversion (*i*, *d*, *u*) the output uses the locale's alternative output
> digits, if any. For example, since glibc 2.2.3 this will give Arabic-Indic
> digits in the Persian ("fa_IR") locale.

Fortunately, fmt here is consistent -
http://eel.is/c++draft/format#string.std-15 - nobody actually localizes
numbers. Should we? Probably.
Anyway, that's a separate question.

It turns out that strftime uses a completely separate set of digits from
the CTYPE one, namely alt_digits, as mentioned previously.
Except that most locales that do have non-Latin digits do not list these
non-Latin digits in alt_digits.

What it means is that it is impossible to correctly get a localized date in
most locales that do have different digits but no alt_digits.
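
For anyone who wants to check a given locale, here is a minimal sketch that asks whether the locale defines any alt_digits. It assumes a POSIX system that provides nl_langinfo and the ALT_DIGITS item in <langinfo.h>; the function name has_alt_digits is made up for this example. An empty string is taken to mean that the locale defines no alternative digits.

```cpp
#include <clocale>
#include <langinfo.h>  // POSIX: nl_langinfo, ALT_DIGITS
#include <optional>
#include <string>

// Hypothetical helper: returns true if the named locale defines
// alt_digits, false if it does not, and std::nullopt if the locale
// is not installed on this system.
std::optional<bool> has_alt_digits(const char* locale_name) {
    // Save the current LC_TIME locale so it can be restored afterwards.
    const char* current = std::setlocale(LC_TIME, nullptr);
    std::string saved = current ? current : "C";

    if (!std::setlocale(LC_TIME, locale_name)) {
        return std::nullopt;  // locale unavailable
    }

    const char* alt = nl_langinfo(ALT_DIGITS);
    bool defined = alt != nullptr && alt[0] != '\0';

    std::setlocale(LC_TIME, saved.c_str());
    return defined;
}
```

The "C" locale has no alternative digits, so has_alt_digits("C") should report false; results for other locale names depend on what is installed.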

I thought I would clarify that, as my earlier message was incorrect (and
based on the flawed assumption that POSIX locales had gotten something right).
It doesn't change my main point: the L applies to the entire
string, and users may want to convert names and not numbers, so neither the
specifiers nor the proposed resolution needs modifying.

On Sat, May 1, 2021 at 8:22 PM Tom Honermann <tom_at_[hidden]> wrote:

> Thank you, Corentin. This is very useful.
>
> For reference, here are the polls we took:
>
> Poll: LWG3547 raises a valid design defect in [time.format] in C++20.
>
> SF F N A SA
> 7 2 2 0 0
>
> Attendance: 11
>
> Consensus: Strong consensus that this is a design defect.
>
>
> Poll: The proposed LWG3547 resolution as written should be applied to
> C++23.
>
> SF F N A SA
> 0 4 2 4 1
>
> Attendance: 11
>
> Consensus: No consensus for the resolution
>
> SA motivation: Migrating things embedded in a string literal is very
> difficult. There are options to deal with this in an additive way.
> Needless break in backwards compatibility.
>
> Speaking for myself, my position in that 2nd poll was weak and I could
> have been influenced in either direction. I suspect that is true for some
> others as well. I therefore recommend attributing little weight to that
> poll, especially given new information.
>
> It sounds like LEWG may take up this issue on Monday. We’ll see whether
> there is a need for SG16 to revisit.
>
> Tom.
>
> On Apr 28, 2021, at 7:55 PM, Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
> I wanted to address the "locale-independent" specifications.
> There are *none*, in the POSIX strftime spec.
>
> https://pubs.opengroup.org/onlinepubs/009695399/functions/strftime.html
> The strftime() function shall place bytes into the array pointed to by s
> as controlled by the string pointed to by format. The format is a character
> string, beginning and ending in its initial shift state, if any. The format
> string consists of zero or more conversion specifications and ordinary
> characters. A conversion specification consists of a '%' character,
> possibly followed by an E or O modifier, and a terminating conversion
> specifier character that determines the conversion specification's
> behavior. All ordinary characters (including the terminating null byte) are
> copied unchanged into the array. If copying takes place between objects
> that overlap, the behavior is undefined. No more than maxsize bytes are
> placed into the array. Each conversion specifier is replaced by appropriate
> characters as described in the following list. The appropriate characters
> are determined using the LC_TIME category of the current locale and by the
> values of zero or more members of the broken-down time structure pointed to
> by timeptr, as specified in brackets in the description. If any of the
> specified values are outside the normal range, the characters stored are
> unspecified.
>
> The %O modifiers are an opt-in to the locale's alternative numeral system.
> You might want to have dates with Arabic numerals and names in Hindi, for
> example.
>
> So "1 AM" can be either "१ पूर्वाह्न" or "1 पूर्वाह्न", depending on
> whether you want to use the Devanagari numerals or not.
>
>
> See also
>
> https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html#tag_07_03_05
> *alt_digits* Define alternative symbols for digits, corresponding to the %O modified
> conversion specification. [...] The %O modifier shall indicate that the
> string corresponding to the value specified via the conversion
> specification shall be used instead of the value.
>
> The way this is handled by CLDR (and therefore most programming
> languages) is that the desired numbering system is attached to the locale
> name, or is provided as part of supplementary options.
>
>
> Here is an example using javascript (tested locally with node)
> <image.png>
>
> Notice that
>
> - The concerns of numeral system and formatting are separate
> - Most locales default to Latin digits (but not Arabic in this
> example); I am not exactly sure why
> - Few programming languages offer a per-specifier choice of numbering
> systems; these things are not usually mixed.
>
>
> Now whether the %O specifier of POSIX makes sense or not is an interesting
> question, but I wanted to point out that the %O specifiers are no more or
> less dependent on locale than other specifiers.
>
> {:L%u} formats a weekday number using the locale's primary numeral system
> {:L%Ou} formats a weekday number using the locale's alternative numeral
> system
>
> What if you pass the C locale?
> Well, the C locale's primary numeral system is Arabic numerals; it does not
> have an alternative numeral system.
>
> In all cases, it does what it says it does.
>
> Sorry I didn't catch that concern during the meeting.
> *I hope you will reconsider the second poll as we clearly missed some
> pretty critical information!*
>
>
>
> PS:
> You will notice that this raises more questions than it answers.
> What if the global locale uses a non-Arabic numeral system? What is the
> default numeral system? Why is there a primary and an alternative? What if
> you need a third?
> Why does time formatting care about that when none of the other locale
> facilities seem to?
>
> But this is clearly out of scope of this issue!
>
>
> More references
>
> http://cldr.unicode.org/translation/-core-data/numbering-systems
>
> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/NumberFormat/NumberFormat
> https://lh.2xlibre.net/values/alt_digits/
> https://unicode-org.github.io/icu/userguide/locale/
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>



SG16 list run by sg16-owner@lists.isocpp.org