sg16: Re: [SG16-Unicode] Unicode streams

From: Mateusz Pusz <mateusz.pusz_at_[hidden]>
Date: Fri, 18 Oct 2019 11:37:44 +0200

Awesome, thanks!

Just please note that this is not a thread about the Physical Units library
in general. For that, we already have one on the SG6 reflector, started
after the evening session in Cologne. I am also bringing a big paper about
it to Belfast (P1935R0), but due to some technical issues it did not land in
the initial Belfast mailing. It should be added by Hal soon.

Let's focus on Unicode-related issues here.

Best

Mat

On Fri, 18 Oct 2019 at 11:17, Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

> Also adding Vincent Reverdy who seems to be working in the same area (cf
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1930r0.pdf )
>
> On Thu, 17 Oct 2019 at 22:30, Corentin Jabot <corentinjabot_at_[hidden]>
> wrote:
>
>> Adding Victor directly
>>
>> On Thu, 17 Oct 2019 at 21:21, Mateusz Pusz <mateusz.pusz_at_[hidden]>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Right now I am in the process of designing and implementing a Physical
>>> Units library that hopefully will be a start for having such a feature in
>>> the C++ Standard Library. You can find more info on the library here:
>>> https://github.com/mpusz/units.
>>>
>>> Recently, I started to work on the text output of quantities. Quantities
>>> consist of a value and a unit symbol. The latter is a perfect use case for
>>> Unicode. Consider:
>>>
>>> 10 us vs 10 μs
>>> 2 kg*m/s^2 vs 2 kg⋅m/s²
>>>
>>> Before C++20 we could get away with a hack by providing Unicode
>>> characters to `char`-based types and streams, but with the introduction of
>>> `char8_t` in C++20 it seems it will be a bigger issue from now on. The
>>> library implementors will have to provide 2 separate implementations:
>>> 1. For `char`-based types (string_view, ostream) without Unicode symbols
>>> 2. For Unicode character types
>>>
>>
>> Yes, with the caveat that you can only output UTF-8 to a sink that expects
>> it, and conversion from Unicode to anything that is not Unicode will lose
>> information.
>>
>>
>>>
>>> However, there are a few issues here:
>>> 1. As of now, we do not have std::u8cout or even std::u8ostream. So
>>> there is really no easy way to create and use a stream for Unicode
>>> characters. So even if I implement
>>>
>>> template<class CharT, class Traits>
>>> friend std::basic_ostream<CharT, Traits>&
>>> operator<<(std::basic_ostream<CharT, Traits>& os, const quantity& q)
>>>
>>> correctly, we do not have an easy way to use it.
>>>
>>> 2. In order to implement the above, I could imagine such an interface
>>> for a symbol prefix:
>>>
>>> template<typename CharT, typename Traits, typename Prefix, typename Ratio>
>>> inline constexpr std::basic_string_view<CharT, Traits> prefix_symbol;
>>>
>>> and its partial specializations for different prefixes/ratios:
>>>
>>> template<typename Traits>
>>> inline constexpr std::basic_string_view<char, Traits>
>>>     prefix_symbol<char, Traits, si_prefix, std::micro> = "u";
>>> template<typename CharT, typename Traits>
>>> inline constexpr std::basic_string_view<CharT, Traits>
>>>     prefix_symbol<CharT, Traits, si_prefix, std::micro> = u8"\u00b5"; // µ
>>> template<typename CharT, typename Traits>
>>> inline constexpr std::basic_string_view<CharT, Traits>
>>>     prefix_symbol<CharT, Traits, si_prefix, std::milli> = "m";
>>>
>>> The problem is that the above code will not compile. A specialization that
>>> is generic over `CharT` cannot be initialized with a narrow literal like
>>> "m". Also, there is no generic mechanism to initialize all Unicode-based
>>> versions of the type with the same literal, as each of them requires a
>>> different prefix (u8, u, U). Providing a specialization for every character
>>> type here is going to be a nightmare for library authors.
>>>
>>> To solve the second problem, fmt and chrono defined something called
>>> STATICALLY-WIDEN (http://wg21.link/time.general), but it seems to be
>>> more of a specification hack than an implementation technique. I call it
>>> a hack because it currently addresses only `char` and `wchar_t` and does
>>> not mention Unicode character types at all.
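
One way the STATICALLY-WIDEN idea could be turned into an implementation technique that also covers the Unicode character types is sketched below. This is only an illustration under assumptions: `select_literal`, `UNITS_WIDEN`, `micro_symbol`, and `milli_symbol` are hypothetical names that appear neither in the thread nor in any library, and treating `wchar_t` as Unicode-capable is an assumption as well. It needs C++17 or later.

```cpp
#include <string_view>
#include <type_traits>

// char8_t in C++20, plain char in C++17 (where u8 literals are char-based).
using utf8_char = decltype(u8'a');

// Hypothetical helper: returns the one literal whose character type is CharT.
template<typename CharT>
constexpr std::basic_string_view<CharT> select_literal(
    std::basic_string_view<char> narrow,
    std::basic_string_view<wchar_t> wide,
    std::basic_string_view<utf8_char> utf8,
    std::basic_string_view<char16_t> utf16,
    std::basic_string_view<char32_t> utf32)
{
    if constexpr (std::is_same_v<CharT, char>) {
        return narrow;
    } else if constexpr (std::is_same_v<CharT, wchar_t>) {
        return wide;
    } else if constexpr (std::is_same_v<CharT, utf8_char>) {
        return utf8;
    } else if constexpr (std::is_same_v<CharT, char16_t>) {
        return utf16;
    } else {
        static_assert(std::is_same_v<CharT, char32_t>,
                      "unsupported character type");
        return utf32;
    }
}

// Hypothetical macro: spells the same literal once per encoding prefix,
// so the library author writes each symbol only once.
#define UNITS_WIDEN(CharT, lit) \
    select_literal<CharT>(lit, L##lit, u8##lit, u##lit, U##lit)

// "m" is plain ASCII, so one spelling serves every character type.
template<typename CharT>
constexpr std::basic_string_view<CharT> milli_symbol()
{
    return UNITS_WIDEN(CharT, "m");
}

// The micro sign needs an ASCII fallback for char (assumption: every
// other character type, including wchar_t, can carry U+00B5).
template<typename CharT>
constexpr std::basic_string_view<CharT> micro_symbol()
{
    if constexpr (std::is_same_v<CharT, char>) {
        return "u";
    } else {
        return UNITS_WIDEN(CharT, "\u00b5");
    }
}
```

With this, `micro_symbol<char>()` yields the ASCII fallback `u`, while `micro_symbol<char16_t>()` yields `µ`, so only the symbols that differ between ASCII and Unicode need special handling.
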
>>>
>>> Dear SG16 members, do you have any best known methods (BKMs) or
>>> suggestions on how to write a library that is Unicode-aware and safe in
>>> an easy and approachable way? Should we strive to provide a nice-looking
>>> representation of units for outputs that support Unicode (console, files,
>>> etc.), or should we, as before, support only `char` and `wchar_t` and
>>> ignore the existence of Unicode in C++?
>>>
>>
>> I would forgo iostream and provide formatters for std::format.
>> All of that is locale-specific, so the approach you describe above does
>> not work in the general case: for example, cm2 will be τ.εκ. in Greek [1].
>> That means ICU.
>> The documentation is sparse [2], but you can play around with some test
>> code:
>>
>> https://github.com/unicode-org/icu/blob/e25796f6e545082af74f0017d55ec2d915c40a3d/icu4c/source/test/intltest/measfmttest.cpp
>>
>> OSX provides something similar:
>> https://developer.apple.com/documentation/foundation/nsmeasurementformatter?language=objc
>>
>> It seems easy enough for simple units.
>> For more complicated things such as compound units, for example grams per
>> cm2, the formatting might be a bit hairy.
>>
>> Ideally at a high level,
>>
>> std::format(u8"{}", some_unit, std::locale("el_CY"));
>>
>> would do the right thing.
>>
>> I am not aware of SG-16 discussing measurements yet.
>>
>> It's a bigger design space than just providing u8 overloads.
>> The question is not how to provide a "nice" representation but how to
>> provide the representation users expect in their preferred locale.
>> I don't think the committee should be in the business of specifying
>> notation.
>>
>>
>> [1] https://www.unicode.org/cldr/charts/36/summary/root.html You can
>> explore the CLDR data to list units
>> [2]
>> https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1MeasureFormat.html
>>
>>
>> Sorry to drop a massive curve ball on you
>>
>> Regards,
>>
>> Corentin
>>
>>
>>>
>>> Please keep in mind that the library is hoped to target C++23.
>>>
>>> Best
>>>
>>> Mat
>>> _______________________________________________
>>> SG16 Unicode mailing list
>>> Unicode_at_[hidden]
>>> http://www.open-std.org/mailman/listinfo/unicode
>>>
>>

Received on 2019-10-18 11:38:10