Date: Fri, 12 Apr 2019 23:27:00 +0000
> I took a vote on narrow/wide vs. narrow_execution/wide_execution, but not
> narrow_character_set/wide_character_set or
> narrow_execution_character_set/wide_execution_character_set. I'm down for
> making these names as ugly and unpalatable and unspell-able as possible,
> because nobody should be using them ever without compelling reason (e.g.,
> interop with old code).
Very strongly agree.
> I agree. The entire unicode library will only work with
> unicode_code_point/scalar_value (char32_t or a strong typedef, whatever
> people decide). However, in order to compensate for the fact that the
> stored text sequences in many places will not be able to use this library,
> we need robust transcoding (encode/decode) support. The default is
> encodings that:
>
> 1. pipe things from code_unit_t -> unicode_code_point_t;
> 2. (do all your work here);
> 3. then, pipe things from unicode_code_point_t -> code_unit_t
I think it should be unicode_scalar_value_t because surrogate code
points are forbidden in well-formed Unicode.
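For illustration, a minimal sketch of such a strong type; the name and
the make() factory are hypothetical, not proposed wording, but the point
is that construction can reject surrogates, so well-formedness holds by
construction:

    #include <optional>

    // Hypothetical strong typedef: a scalar value is any code point
    // except the surrogate range U+D800..U+DFFF.
    class unicode_scalar_value_t {
    public:
        static std::optional<unicode_scalar_value_t> make(char32_t cp) {
            const bool surrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (cp > 0x10FFFF || surrogate)
                return std::nullopt;  // not a scalar value
            return unicode_scalar_value_t{cp};
        }
        char32_t value() const { return v_; }
    private:
        explicit unicode_scalar_value_t(char32_t cp) : v_(cp) {}
        char32_t v_;
    };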
> If you specify the inner bit to not be Unicode, the library should (and
> will) loudly and noisily fail you for not providing Unicode it can use. But
> maybe someone just wants ebcdic -> wide_ebcdic with some strange
> non-unicode intermediary encoding. That's fine too; it just won't work with
> all of the Standard because it is Sufficiently Weird. Your encode/decode
> will work, and your transcode within that boundary, but not transcoding
> outside of it without some way to go from what you have to Unicode.
I had something like that in mind, but it looks like it may be too
complex just to support a very rare use case.
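To make that boundary concrete, here is one way it could be spelled:
each encoding names its code-unit and code-point types, and only
encodings that pivot through the same code-point type transcode
directly. All names here are illustrative, not proposed wording:

    #include <cstdint>
    #include <type_traits>

    struct ebcdic_code_point { std::uint8_t value; };  // not Unicode

    struct ebcdic      { using code_unit = char;          using code_point = ebcdic_code_point; };
    struct wide_ebcdic { using code_unit = std::uint16_t; using code_point = ebcdic_code_point; };
    struct utf8        { using code_unit = char;          using code_point = char32_t; };

    template <class From, class To>
    inline constexpr bool directly_transcodable =
        std::is_same_v<typename From::code_point, typename To::code_point>;

    static_assert(directly_transcodable<ebcdic, wide_ebcdic>);  // same weird pivot: fine
    static_assert(!directly_transcodable<ebcdic, utf8>);        // needs a mapping to Unicode first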
> I agree; the core of the library will be built on Unicode and Unicode
> Algorithms that work on Unicode Code Points/Unicode Scalar Values. However,
> there are one too many text encodings in the wild and serving up production
> data -- including obscene amounts of Financial and Government data -- that
> is *not* in a Unicode Format of any kind. Telling these industries that
> they will not be apart of the new world does not sound like a useful
> business proposition; therefore, they will pay the cost of (lazy, eager)
> transcoding as described above, and then use the Unicode Algorithms once
> they transcode. (They can then optionally translate back down to whatever
> they want; e.g., when they're sending it out of their program.)
Yes. However, I think they may have already written enough code for
their very specific encoding and may want to keep using it instead of
paying the transcoding cost.
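That trade-off is exactly what the "(lazy, eager)" distinction above
buys them. A minimal sketch, using Latin-1 because its byte-to-code-point
mapping is the identity; both function names are hypothetical:

    #include <ranges>
    #include <string>
    #include <string_view>

    // Eager: pay the whole transcoding cost once, up front.
    std::u32string latin1_to_utf32(std::string_view in) {
        std::u32string out;
        out.reserve(in.size());
        for (unsigned char b : in)
            out.push_back(static_cast<char32_t>(b));  // Latin-1 bytes are code points
        return out;
    }

    // Lazy (C++20 ranges): decode on demand, so legacy data only pays
    // for the code points an algorithm actually visits.
    auto latin1_view(std::string_view in) {
        return in | std::views::transform([](char c) {
            return static_cast<char32_t>(static_cast<unsigned char>(c));
        });
    }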
> Note that only the people who do not keep Unicode around will need to pay
> the cost of transcoding. If your data is already Unicode-friendly, then the
> standard and the interfaces we provide will support you fully. This means
> that any hard-coded algorithms that are not templated on encoding /
> decoding must provide a range of Unicode Code Points to work on (or straight
> up take char8_t, char16_t, and char32_t, all of which are assumed by
> compile-time conventions to be valid Unicode).
Well, in my experience you just have a concept of
UnicodeScalarValueSequence and then write all your algorithms in terms
of (at least) that. That way you don't care whether it's UTF-8, UTF-16,
or UTF-32 under the hood.
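A minimal sketch of that, assuming C++20 concepts; the concept name is
from above, but the definition is just one plausible shape:

    #include <concepts>
    #include <cstddef>
    #include <ranges>

    template <class R>
    concept UnicodeScalarValueSequence =
        std::ranges::input_range<R> &&
        std::same_as<std::ranges::range_value_t<R>, char32_t>;  // or a strong typedef

    // An algorithm written against the concept cannot tell whether the
    // underlying storage was UTF-8, UTF-16, or UTF-32: the decoding
    // view hands it scalar values either way.
    template <UnicodeScalarValueSequence R>
    std::size_t count_scalar_values(R&& r) {
        std::size_t n = 0;
        for (auto&& sv : r) { (void)sv; ++n; }
        return n;
    }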
Received on 2019-04-13 01:27:17