Re: [SG16-Unicode] Unicode streams

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Fri, 18 Oct 2019 12:05:31 +0200
On Fri, 18 Oct 2019 at 11:38, Mateusz Pusz <mateusz.pusz_at_[hidden]> wrote:

> Awesome, thanks!
>
> Just please note that this is not a thread about the Physical Units
> library in general. For this, we have one already on the SG6 reflector
> started after the evening session in Cologne. Also, I am bringing a big
> paper about it to Belfast (P1935R0), but due to some technical issues it did
> not land in the initial Belfast mailing. It should be added by Hal soon.
>
> Let's focus on Unicode-related issues here.
>

Yes, fortunately formatting can be handled entirely separately from the
rest, like date formatting can be handled separately from date manipulation
:)


> Best
>
> Mat
>
> On Fri, 18 Oct 2019 at 11:17, Corentin Jabot <corentinjabot_at_[hidden]>
> wrote:
>
>> Also adding Vincent Reverdy who seems to be working in the same area (cf
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1930r0.pdf )
>>
>> On Thu, 17 Oct 2019 at 22:30, Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>>> Adding Victor directly
>>>
>>> On Thu, 17 Oct 2019 at 21:21, Mateusz Pusz <mateusz.pusz_at_[hidden]>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Right now I am in the process of designing and implementing a Physical
>>>> Units library that hopefully will be a start for having such a feature in
>>>> the C++ Standard Library. You can find more info on the library here:
>>>> https://github.com/mpusz/units.
>>>>
>>>> Recently, I started to work on the text output of quantities.
>>>> Quantities consist of a value and a unit symbol. The latter is a perfect
>>>> use case for Unicode. Consider:
>>>>
>>>> 10 us vs 10 μs
>>>> 2 kg*m/s^2 vs 2 kg⋅m/s²
>>>>
>>>> Before C++20 we could get away with a hack by providing Unicode
>>>> characters to `char`-based types and streams, but with the introduction of
>>>> `char8_t` in C++20 it seems it will be a bigger issue from now on. The
>>>> library implementors will have to provide 2 separate implementations:
>>>> 1. For `char`-based types (string_view, ostream) without Unicode signs
>>>> 2. For Unicode char based types
>>>>
>>>
>>> Yes, with the caveat that you can only output UTF-8 to a sink that expects
>>> it, and conversion from Unicode to anything that is not Unicode will lose
>>> information.
>>>
>>>
>>>>
>>>> However, there are a few issues here:
>>>> 1. As of now, we do not have std::u8cout or even std::u8ostream. So
>>>> there is really no easy way to create and use a stream for Unicode
>>>> characters. So even if I implement
>>>>
>>>> template<class CharT, class Traits>
>>>> friend std::basic_ostream<CharT, Traits>&
>>>> operator<<(std::basic_ostream<CharT, Traits>& os, const quantity& q)
>>>>
>>>> correctly, we do not have an easy way to use it.
>>>>
>>>> 2. In order to implement the above, I could imagine such an interface
>>>> for a symbol prefix:
>>>>
>>>> template<typename CharT, typename Traits, typename Prefix, typename Ratio>
>>>> inline constexpr std::basic_string_view<CharT, Traits> prefix_symbol;
>>>>
>>>> and its partial specializations for different prefixes/ratios:
>>>>
>>>> template<typename CharT, typename Traits>
>>>> inline constexpr std::basic_string_view<char, Traits>
>>>> prefix_symbol<char, Traits, si_prefix, std::micro> = "u";
>>>> template<typename CharT, typename Traits>
>>>> inline constexpr std::basic_string_view<CharT, Traits>
>>>> prefix_symbol<CharT, Traits, si_prefix, std::micro> = u8"\u00b5"; // µ
>>>> template<typename CharT, typename Traits>
>>>> inline constexpr std::basic_string_view<CharT, Traits>
>>>> prefix_symbol<CharT, Traits, si_prefix, std::milli> = "m";
>>>>
>>>> The problem is that the above code will not compile. A specialization
>>>> that is generic over `CharT` cannot be initialized with a narrow literal
>>>> like "m". Also, there is no generic mechanism to initialize all
>>>> Unicode-based versions of the type from the same literal, as each of them
>>>> requires a different prefix (u8, u, U). Providing a specialization for
>>>> every character type here is going to be a nightmare for library authors.
>>>>
>>>> To solve the second problem, fmt and chrono defined something called
>>>> STATICALLY-WIDEN (http://wg21.link/time.general), but it seems to be
>>>> more of a specification hack than an implementation technique. I call it
>>>> a hack because it currently addresses only `char` and `wchar_t` and does
>>>> not mention Unicode character types at all.
>>>>
>>>> Dear SG16 members, do you have any best-known methods (BKMs) or
>>>> suggestions on how to write a library that is Unicode-aware and safe in
>>>> an easy and approachable way?
>>>> Should we strive to provide a nice-looking representation of units for
>>>> outputs that support Unicode (console, files, etc) or should we, as ever
>>>> before, just support only `char` and `wchar_t` and ignore the existence of
>>>> Unicode in C++?
>>>>
>>>
>>> I would forgo iostream and provide formatters for std::format instead.
>>> All of that is locale-specific, so the approach you describe above does
>>> not work in the general case: for example, cm2 is τ.εκ. in Greek [1].
>>> That means ICU.
>>> The documentation is sparse [2], but you can play around with some test
>>> code:
>>>
>>> https://github.com/unicode-org/icu/blob/e25796f6e545082af74f0017d55ec2d915c40a3d/icu4c/source/test/intltest/measfmttest.cpp
>>>
>>> macOS provides something similar:
>>> https://developer.apple.com/documentation/foundation/nsmeasurementformatter?language=objc
>>>
>>> It seems easy enough for simple units.
>>> For more complicated things such as compound units, for example grams
>>> per cm2, the formatting might be a bit hairy.
>>>
>>> Ideally at a high level,
>>>
>>> std::format(u8"{}", some_unit, std::locale("el_CY"));
>>>
>>> would do the right thing.
>>>
>>> I am not aware of SG16 discussing measurements yet.
>>>
>>> It's a bigger design space than just providing u8 overloads.
>>> The question is not how to provide a "nice" representation but how to
>>> provide the representation users expect in their preferred locale.
>>> I don't think the committee should be in the business of specifying
>>> notation.
>>>
>>>
>>> [1] https://www.unicode.org/cldr/charts/36/summary/root.html You can
>>> explore the CLDR data to list units
>>> [2]
>>> https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1MeasureFormat.html
>>>
>>>
>>> Sorry to drop a massive curve ball on you
>>>
>>> Regards,
>>>
>>> Corentin
>>>
>>>
>>>>
>>>> Please keep in mind that the library is hoped to target C++23.
>>>>
>>>> Best
>>>>
>>>> Mat
>>>> _______________________________________________
>>>> SG16 Unicode mailing list
>>>> Unicode_at_[hidden]
>>>> http://www.open-std.org/mailman/listinfo/unicode
>>>>
>>>

Received on 2019-10-18 12:05:44