Re: [SG16-Unicode] Unicode streams

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Fri, 18 Oct 2019 11:17:27 +0200
Also adding Vincent Reverdy who seems to be working in the same area (cf
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1930r0.pdf )

On Thu, 17 Oct 2019 at 22:30, Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

> Adding Victor directly
>
> On Thu, 17 Oct 2019 at 21:21, Mateusz Pusz <mateusz.pusz_at_[hidden]> wrote:
>
>> Hi everyone,
>>
>> Right now I am in the process of designing and implementing a Physical
>> Units library that hopefully will be a start for having such a feature in
>> the C++ Standard Library. You can find more info on the library here:
>> https://github.com/mpusz/units.
>>
>> Recently, I started to work on the text output of quantities. A quantity
>> consists of a value and a unit symbol. The latter is a perfect use case for
>> Unicode. Consider:
>>
>> 10 us vs 10 μs
>> 2 kg*m/s^2 vs 2 kg⋅m/s²
>>
>> Before C++20 we could get away with a hack by feeding Unicode
>> characters to `char`-based types and streams, but with the introduction of
>> `char8_t` in C++20 it seems this will be a bigger issue from now on.
>> Library implementors will have to provide 2 separate implementations:
>> 1. For `char`-based types (string_view, ostream) without Unicode symbols
>> 2. For Unicode character types
>>
>
> Yes, with the caveat that you can only output UTF-8 to a sink that expects
> it, and conversion from Unicode to anything that is not Unicode will lose
> information.
>
>
>>
>> However, there are a few issues here:
>> 1. As of now, we do not have std::u8cout or even std::u8ostream. So
>> there is really no easy way to create and use a stream for Unicode
>> characters. So even if I implement
>>
>> template<class CharT, class Traits>
>> friend std::basic_ostream<CharT, Traits>&
>> operator<<(std::basic_ostream<CharT, Traits>& os, const quantity& q)
>>
>> correctly, we do not have an easy way to use it.
>>
>> 2. In order to implement the above, I could imagine such an interface for
>> a symbol prefix:
>>
>> template<typename CharT, typename Traits, typename Prefix, typename Ratio>
>> inline constexpr std::basic_string_view<CharT, Traits> prefix_symbol;
>>
>> and its partial specializations for different prefixes/ratios:
>>
>> template<typename Traits>
>> inline constexpr std::basic_string_view<char, Traits> prefix_symbol<char,
>> Traits, si_prefix, std::micro> = "u";
>> template<typename CharT, typename Traits>
>> inline constexpr std::basic_string_view<CharT,
>> Traits> prefix_symbol<CharT, Traits, si_prefix, std::micro> = u8"\u00b5";
>> // µ
>> template<typename CharT, typename Traits>
>> inline constexpr std::basic_string_view<CharT,
>> Traits> prefix_symbol<CharT, Traits, si_prefix, std::milli> = "m";
>>
>> The problem is that the above code will not compile. A specialization that
>> is generic over `CharT` cannot be initialized with a narrow literal like "m".
>> Also, there is no generic mechanism to initialize all Unicode-based
>> versions of the type with the same literal, as each of them requires a
>> different prefix (u8, u, U). Providing a specialization for every character
>> type here is going to be a nightmare for library authors.
>>
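To make the first issue above concrete, here is a minimal sketch of a
character-generic operator<< that can actually be exercised today. It is
restricted to `char` and `wchar_t`, since the thread notes there is no
std::u8ostream to test against. The names `microseconds_qty`, `us_symbol`,
`narrow_text` and `wide_text` are hypothetical and are not taken from the
units library.

```cpp
#include <ios>
#include <ostream>
#include <sstream>
#include <string>
#include <string_view>
#include <type_traits>

// Hypothetical stand-in for a quantity whose unit is fixed to microseconds.
struct microseconds_qty {
    long long count;
};

// Pick the unit symbol for the stream's character type:
// ASCII fallback for char, U+00B5 MICRO SIGN for wchar_t.
template<typename CharT>
constexpr std::basic_string_view<CharT> us_symbol() {
    if constexpr (std::is_same_v<CharT, wchar_t>) {
        return L"\u00b5s";
    } else {
        static_assert(std::is_same_v<CharT, char>,
                      "this sketch only covers char and wchar_t");
        return "us";
    }
}

template<class CharT, class Traits>
std::basic_ostream<CharT, Traits>&
operator<<(std::basic_ostream<CharT, Traits>& os, const microseconds_qty& q) {
    const auto sym = us_symbol<CharT>();
    os << q.count << os.widen(' ');
    // write() avoids requiring the symbol's traits to match the stream's Traits.
    os.write(sym.data(), static_cast<std::streamsize>(sym.size()));
    return os;
}

// Small helpers so the two supported stream types can be compared side by side.
inline std::string narrow_text(microseconds_qty q) {
    std::ostringstream os;
    os << q;
    return os.str();
}

inline std::wstring wide_text(microseconds_qty q) {
    std::wostringstream os;
    os << q;
    return os.str();
}
```

With these definitions, narrow_text(microseconds_qty{10}) yields the ASCII
form and wide_text(microseconds_qty{10}) yields the form with the micro sign.
Extending the same operator to `char8_t` is where the missing stream support
becomes the blocker.
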
>> To solve the second problem, fmt and chrono defined something called
>> STATICALLY-WIDEN (http://wg21.link/time.general), but it seems to be more of
>> a specification hack than an implementation technique. I call it a hack
>> because it currently addresses only `char` and `wchar_t` and does not
>> mention Unicode character types at all.
>>
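One way a STATICALLY-WIDEN-like idea could be turned into an implementation
technique is sketched below: a macro pastes every encoding prefix onto the
same literal, and a function template keeps only the spelling that matches
`CharT`. This is an assumption-laden illustration, not what fmt or chrono
do. `pick_literal`, `SKETCH_SYMBOL` and `micro_symbol` are hypothetical
names, and the choice of an ASCII fallback for `char` follows the "u" versus
"µ" example from the message above.

```cpp
#include <string_view>
#include <type_traits>

namespace sketch {

// Return the literal whose encoding matches CharT. U8 is deduced from the
// u8 literal: char8_t where that type exists (C++20), plain char before.
template<typename CharT, typename U8>
constexpr std::basic_string_view<CharT> pick_literal(
    [[maybe_unused]] const char* ascii,
    [[maybe_unused]] const wchar_t* wide,
    [[maybe_unused]] const U8* utf8,
    [[maybe_unused]] const char16_t* utf16,
    [[maybe_unused]] const char32_t* utf32) {
    if constexpr (std::is_same_v<CharT, char>) {
        return ascii;  // narrow streams get the ASCII-only spelling
    } else if constexpr (std::is_same_v<CharT, wchar_t>) {
        return wide;
    } else if constexpr (std::is_same_v<CharT, char16_t>) {
        return utf16;
    } else if constexpr (std::is_same_v<CharT, char32_t>) {
        return utf32;
    } else {
        static_assert(std::is_same_v<CharT, U8>, "unsupported character type");
        return utf8;
    }
}

}  // namespace sketch

// Paste each encoding prefix onto the Unicode spelling; the ASCII spelling
// is passed through unchanged. Both arguments must be plain string literals.
#define SKETCH_SYMBOL(CharT, ascii, unicode) \
    ::sketch::pick_literal<CharT>(ascii, L##unicode, u8##unicode, u##unicode, U##unicode)

// A single definition now covers every character type.
template<typename CharT>
inline constexpr std::basic_string_view<CharT> micro_symbol =
    SKETCH_SYMBOL(CharT, "u", "\u00b5");

static_assert(micro_symbol<char> == "u");
static_assert(micro_symbol<char32_t> == U"\u00b5");
```

The trade-off is that the selection happens through a macro, and the traits
parameter from the original `prefix_symbol` interface is dropped to keep the
sketch short.
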
>> Dear SG16 members, do you have any best known methods (BKMs) or suggestions
>> on how to write a library that is Unicode-aware and safe in an easy and
>> approachable way? Should we strive to provide a nice-looking representation
>> of units for outputs that support Unicode (console, files, etc.), or should
>> we, as before, support only `char` and `wchar_t` and ignore the existence of
>> Unicode in C++?
>>
>
> I would forgo iostreams and provide formatters for std::format instead.
> All of this is locale-specific (so the approach you describe above does
> not work in the general case; for example, cm2 will be τ.εκ. in Greek [1]),
> which means ICU.
> The documentation is sparse [2], but you can play around with some test
> code:
>
> https://github.com/unicode-org/icu/blob/e25796f6e545082af74f0017d55ec2d915c40a3d/icu4c/source/test/intltest/measfmttest.cpp
>
> OS X provides something similar:
> https://developer.apple.com/documentation/foundation/nsmeasurementformatter?language=objc
>
> It seems easy enough for simple units.
> For more complicated things such as compound units, for example grams per
> cm2, the formatting might be a bit hairy.
>
> Ideally at a high level,
>
> std::format(u8"{}", some_unit, std::locale("el_CY"));
>
> would do the right thing.
>
> I am not aware of SG16 having discussed measurements yet.
>
> It's a bigger design space than just providing u8 overloads.
> The question is not how to provide a "nice" representation but how to provide
> the representation users expect in their preferred locale.
> I don't think the committee should be in the business of specifying
> notation.
>
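To illustrate why a fixed per-character-type symbol table is not enough, here
is a toy sketch of a locale-keyed lookup. It is not how ICU or CLDR expose
this data. The table holds only the Greek spelling quoted in this thread,
every other locale falls back to the ASCII "cm2", and `unit_pattern` and
`square_centimeter_symbol` are hypothetical names.

```cpp
#include <string_view>

// One row of a hypothetical locale -> symbol table.
struct unit_pattern {
    std::string_view language;  // language subtag, e.g. "el"
    std::string_view symbol;    // symbol in the narrow execution encoding
};

// Placeholder data: only the Greek entry comes from the thread
// ("τ.εκ.", spelled with escapes here). A real table would be generated
// from CLDR rather than written by hand.
inline constexpr unit_pattern square_centimeter_symbols[] = {
    {"el", "\u03c4.\u03b5\u03ba."},
};

// Look up the symbol for a locale name such as "el_CY". Only the language
// part before '_' is considered; unknown languages get the ASCII fallback.
constexpr std::string_view square_centimeter_symbol(std::string_view locale) {
    const std::string_view language = locale.substr(0, locale.find('_'));
    for (const unit_pattern& entry : square_centimeter_symbols) {
        if (entry.language == language) {
            return entry.symbol;
        }
    }
    return "cm2";
}
```

Even this toy version shows that the symbol depends on a run-time locale and
not only on the character type, which is the part a `prefix_symbol`-style
table cannot express.
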
>
> [1] https://www.unicode.org/cldr/charts/36/summary/root.html You can
> explore the CLDR data to list units
> [2]
> https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1MeasureFormat.html
>
>
> Sorry to drop a massive curveball on you.
>
> Regards,
>
> Corentin
>
>
>>
>> Please keep in mind that the library is hoped to target C++23.
>>
>> Best
>>
>> Mat
>>
>

Received on 2019-10-18 11:17:40