Re: [SG16] New Unicode Working Group: Message Formatting

From: Steven R. Loomis <srl295_at_[hidden]>
Date: Wed, 15 Jan 2020 14:10:07 -0800
--
Steven R. Loomis | @srl295 | git.io/srl295
> On Jan 15, 2020, at 1:12 PM, Victor Zverovich <victor.zverovich_at_[hidden]> wrote:
> 
> Thanks, Steven, for reaching out to SG16, and thanks to Corentin for summarizing the current state of formatting in C++. I'm glad to see more work in the area of message formatting. So far, most of the focus in std::format has been on providing locale-independent formatting and, to some extent, on giving the user control over the use of C++ locales (which are somewhat limited), but it would be interesting to extend it to localized formatting.
> 
> > something like “User {} requests {}.” is not as localizable because the order may need to change.
> 
> You can think of “User {} requests {}.” as “User {0} requests {1}.” so it should also be localizable as if indices were specified explicitly.
OK, that makes sense. The translator or some process would need to map {},{} to {0},{1} etc.
One of the reasons named parameters are useful here, though, is the opportunity to provide more context, such as “User {username} requests {ticketNumber}.”, for example. 
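
The {} to {0},{1} mapping step mentioned above could be sketched roughly as follows. This is only an illustration: number_placeholders is a hypothetical helper, not part of std::format or fmt, and it assumes the pattern uses only bare "{}" fields plus the usual "{{" and "}}" escapes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of std::format or fmt): rewrite automatic
// "{}" fields as explicitly indexed "{0}", "{1}", ... so that a translator
// can reorder them. Escaped braces "{{" and "}}" are copied through.
// Assumption: only bare "{}" fields occur; fields that carry a format spec
// are out of scope for this sketch and are copied unchanged.
std::string number_placeholders(const std::string& pattern) {
    std::string out;
    std::size_t next_index = 0;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        const bool has_next = i + 1 < pattern.size();
        if (c == '{' && has_next && pattern[i + 1] == '{') {
            out += "{{";  // escaped opening brace
            ++i;
        } else if (c == '{' && has_next && pattern[i + 1] == '}') {
            // automatic field: assign the next index
            out += '{' + std::to_string(next_index++) + '}';
            ++i;
        } else if (c == '}' && has_next && pattern[i + 1] == '}') {
            out += "}}";  // escaped closing brace
            ++i;
        } else {
            out += c;     // ordinary character
        }
    }
    return out;
}
```

With this, "User {} requests {}." becomes "User {0} requests {1}.", which a translator could then reorder, for example to "{1} was requested by user {0}.".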
> 
> > I am afraid that identifier-based positional arguments would result in more cumbersome and less efficient APIs for C++, as it would require some kind of dictionary
> 
> FWIW the fmt library supports named arguments and the API is indeed somewhat cumbersome due to language limitations. I'm not sure if efficiency of argument access is a big concern for a localization facility.
It may depend on the use case. Code that logs thousands of error messages inside a real-time OS device driver may need efficient argument access.
An operation that constructs a string for a GUI control, or for text-to-speech in a voice UI, where latency is on a human scale, can tolerate argument access that is less time-critical.
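
To make the cost concrete, here is a minimal sketch of dictionary-based named arguments. It is not the fmt or std::format API: format_named is a hypothetical function, the values are assumed to be pre-formatted strings, and brace escaping is ignored. Each field costs one map lookup, which is the overhead under discussion.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sketch, not the fmt or std::format API: substitute "{name}"
// fields by looking each name up in a dictionary of pre-formatted values.
// Assumption: no brace escaping and no format specs inside the fields.
std::string format_named(const std::string& pattern,
                         const std::map<std::string, std::string>& args) {
    std::string out;
    std::size_t pos = 0;
    while (pos < pattern.size()) {
        const std::size_t open = pattern.find('{', pos);
        if (open == std::string::npos) {
            out.append(pattern, pos, std::string::npos);  // no more fields
            break;
        }
        const std::size_t close = pattern.find('}', open + 1);
        if (close == std::string::npos) {
            out.append(pattern, pos, std::string::npos);  // unterminated field: copy as-is
            break;
        }
        out.append(pattern, pos, open - pos);  // literal text before the field
        const std::string name = pattern.substr(open + 1, close - open - 1);
        const auto it = args.find(name);       // one dictionary lookup per field
        if (it != args.end()) {
            out += it->second;
        } else {
            out.append(pattern, open, close - open + 1);  // unknown name: keep the field
        }
        pos = close + 1;
    }
    return out;
}
```

Given a dictionary with username and ticketNumber entries, the same call works for “User {username} requests {ticketNumber}.” and for a translation that puts {ticketNumber} first, at the price of a lookup per field.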
> - Victor
> 
> 
>> 
> 
> On Tue, Jan 14, 2020 at 2:57 PM Steven R. Loomis via SG16 <sg16_at_[hidden]> wrote:
> Hi Corentin,
> 
> --
> Steven R. Loomis | @srl295 | git.io/srl295
> 
> 
> 
>> On Jan 13, 2020, at 5:46 PM, Corentin Jabot <corentinjabot_at_[hidden]> wrote:
>> 
>> + message-format-wg_at_[hidden]
>> 
>> Hello.
>> Let me (try to) describe the current state of things in C++ and future directions.
>> Of course, these are my opinions and not necessarily those of SG16 or WG21.
> 
> Thank you!
> 
>> 
>> C++20 (which is on course to be approved next month) will provide a new feature named std::format, derived from the popular fmt library (https://fmt.dev/), itself heavily inspired by, and sharing the syntax of, Python's format function.
>> 
>> std::format("Hello {}", "World") -> "Hello World";
>> std::format("{2} + {1} = {0}", 3, 1.0, 2) -> "2 + 1 = 3";
>> 
>> Of interest to Unicode and localization:
>> For now this function is mostly byte based, in that it is encoding agnostic.
>> However, we made the interesting decision that padding is based on display width (which is fuzzily specified), as we realized the primary use case for padding was the creation of console interfaces.
> 
> Display width is complex. Unicode’s East Asian Width is often used for character width, but there’s more to it than that (see https://www.unicode.org/reports/tr11/#Scope). See for example https://github.com/nodejs/node/blob/b0a762157793b0d9143eaa7c270da91932f2a64f/src/node_i18n.cc#L729 in Node.js, which goes beyond wcwidth and similar functions, which do not reflect many terminal emulators’ behavior.
> 
> Is the function itself mostly designed for the console, or for general use in applications (such as non-terminal UIs)?
>> By default, this function will format all types, notably numbers, using the C locale. Locale is explicitly opt-in: std::format(std::locale("fr_FR"), "{:L}", 1.5) -> "1,5";
>> It is not a translation facility, but does support positional arguments with index. I am afraid that identifier-based positional arguments would result in more cumbersome and less efficient APIs for C++, as it would require some kind of dictionary
> Positional arguments with an index are then localizable; something like “User {} requests {}.” is not as localizable because the order may need to change.
>> Each type of entity can have a set of options which are determined by its type. Formatters are user defined and the standard does provide formatting for numbers, strings, date/time and a few other things.
> 
> 
>> There is some consensus that we should in the future extend that interface rather than iostream as it is much more efficient and easier to use.
>> I think that sharing a syntax which is easy to use between C++ and Python is a great benefit and it would be interesting to see if Unicode can build on it too.
> 
> Yes. Please provide input to the MFWG then. You might update or create an issue at https://github.com/unicode-org/message-format-wg/issues
>> 
>> That is for C++20.
>> As for the future and things we might benefit from:
>> 
>> I don't think anyone is looking at translations in the C++ standard and, to be honest, we are spread a bit thin. We do have std::messages, which is a wrapper over gettext, does not support pluralization and, as far as I can tell, has very few uses.
> 
> That’s why MessageFormat was created, first in Java and soon after in ICU C++/Java. (Note that the MFWG is not standardizing ICU, it’s creating a follow-on to ICU’s format.) Localization of these messages is a critical requirement. The message formats represented by the working group are in heavy use.
> 
>> If we were to look at translations someday, it is clear that having a spec we can reference would be almost necessary.
>> And we wouldn't want to create something new specific to standard C++; implementing a Unicode specification would have a lot more value.
> 
> Very good.
> 
>> At the same time, we are currently looking at measurements and units APIs.
>> I tried to make the point that we should provide localized formatting for measurements and units if such an API is provided.
>> Alas, I found that there is no spec for that, nor a UAX, and that CLDR was not complete (some units would have kilo versions, some not, things like that),
> 
> CLDR specifies arbitrary SI prefixes for all units, see https://unicode-org.atlassian.net/browse/CLDR-13057 So kilo-anything is supported. What are the other shortcomings?
> 
> CLDR units are implemented in ICU, and are part of ECMA-402: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/NumberFormat
> 
> new Intl.NumberFormat("pt-PT",  { style: 'unit',  unit: "mile-per-hour"}).format(50);
> // → 50 mi/h
> 
> 
>> which would add quite a burden
>> for the C++ committee to specify and we would most likely get it wrong.
>> I think it would be tremendously helpful for us to have a specification on how to format measurement units.
> 
> I would recommend CLDR, http://unicode.org/reports/tr35/tr35-general.html#Unit_Elements
> 
> Unit specification (as part of the message format) is in-scope for the message format working group. I think ICU already supports units in message format (I can’t find it on a quick search).
> 
>> Similarly, all string-to-number and number-to-string conversions in the standard, including integral and floating-point ones, assume the Hindu-Arabic numeral system.
>> A specification telling us when and how to use other numeral systems would be beneficial.
> 
> Please see https://unicode.org/reports/tr35/tr35-numbers.html#otherNumberingSystems for example. ECMAScript and many others have adopted this.
> 
>> I have no idea if either of these points fall into the purview of your group.
>> 
>> I may be forgetting many things, but I think it's a fair overview of the current state of things in C++ as far as formatting is concerned.
>> I hope that helps.
> 
> Thanks!
> 
> 
>> 
>> Regards, 
>> 
>> 
>> Corentin
>> 
>>  
>> 
>> 
>> 
>> On Fri, 10 Jan 2020 at 23:54, Steven R. Loomis via SG16 <sg16_at_[hidden]> wrote:
>> FYI. This might be of interest as far as std::format goes.
>> 
>> Steven, IBM/ICU
>> 
>> 
>> --
>> Steven R. Loomis | @srl295 | git.io/srl295
>> 
>> 
>> 
>>> Begin forwarded message:
>>> 
>>> From: announcements_at_[hidden]
>>> Subject: New Unicode Working Group: Message Formatting
>>> Date: January 10, 2020, 1:55:35 PM PST
>>> To: announcements_at_[hidden]
>>> Reply-To: root_at_[hidden]
>>> 
>>> One of the challenges in adapting programs to work with different languages is message formatting. This is the process of formatting and inserting data values into messages in the user’s language. For example, “The package will arrive at {time} on {date}” could be translated into German as “Das Paket wird am {date} um {time} geliefert”, and the particular {time} and {date} variables would be automatically formatted for German, and inserted in the right places.
>>> 
>>> The Unicode Consortium has provided message formatting for some time via the ICU programming libraries and CLDR locale data repository. But until now we have not had a syntax for localizable message strings standardized by Unicode. Furthermore, the current ICU MessageFormat is relatively complex for existing operations, such as plural forms, and it does not scale well to other language properties, such as gender and inflections.
>>> 
>>> The Unicode CLDR Technical Committee is formalizing a new working group to develop a technical specification for message format that addresses these issues. That working group is called the Message Format Working Group and is chaired by Romulo Cintra from CaixaBank. Other participants currently represented are Amazon, Dropbox, Facebook, Google, IBM, Mozilla, OpenJSF, and Paypal.
>>> 
>>> For information on how to get involved, visit the working group’s GitHub page: https://github.com/unicode-org/message-format-wg
>>> 
>>> Open discussions will take place on GitHub, and written notes will be posted after every meeting.
>>> 
>>> Over 130,000 characters are available for adoption (http://unicode.org/consortium/adopt-a-character.html), to help the Unicode Consortium’s work on digitally disadvantaged languages.
>>> 
>>> http://blog.unicode.org/2020/01/new-unicode-working-group-message.html
>>> 
>> 
>> -- 
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
> 

Received on 2020-01-15 16:12:47