
Re: [SG16-Unicode] Feedback on p1629r0

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Sun, 23 Jun 2019 09:35:31 -0400
Thank you for the feedback. I'll try to address the points one at a time.

In std::text::encoding_errc it says:
>
> // sequence can be encoded but resulting
> // code point is invalid (e.g., encodes a lone surrogate)
> invalid_output = 0x05
>
> No, lone surrogates are fully valid code points, but they are invalid
> scalar values.
>

I can clarify the comment. The goal is that invalid_output describes output
that cannot appear in a well-formed sequence, as far as the encoding can
tell. For example, valid UTF-16 will never produce a lone surrogate code
point in the sequence. So I should say "... but resulting code point is
invalid for the encoding".


> I don't think converting to scalar values is "decoding", especially if
> the code uses dumb string types.
>

That's probably closer to a matter of perspective. If your code points are
already all scalar values, then it's a no-op. Otherwise, it's decoding --
just a very light one.


> In 3.2.2.3 it talks about assuming that text is valid, this can be
> enforced by strong types such as scalar_value_sequence.
>

You need a way to get to `scalar_value_sequence` to begin with:
scalar_value_sequence can be the result of a text_encode or text_decode
operation. And this design does not prevent someone from using either
char32_t or unicode_scalar_value as their code point type, or from having
scalar_value_sequence as the "OutputRange" for the encode or decode
operations.

Regardless, you will still have "char*", "char8_t*", "char16_t*" and
similar that may or may not be a sequence of scalar values.
assume_valid_handler lets someone "bless" their storage, saying "yes, it's
already scalar values, I checked and it's fine".
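
Conceptually, the "blessing" just selects an unchecked path. A rough sketch
with made-up names (the proposal's actual handler machinery is richer than
this):

    #include <stdexcept>

    struct assume_valid_t {};
    inline constexpr assume_valid_t assume_valid{};

    // Checked path: verify every element really is a Unicode scalar value.
    template <typename Range>
    void require_scalar_values(const Range& r) {
        for (char32_t c : r)
            if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
                throw std::invalid_argument("not a scalar value");
    }

    // "Blessed" path: the caller vouches for the data, so nothing is checked.
    template <typename Range>
    void require_scalar_values(const Range&, assume_valid_t) noexcept {}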


> In 3.2.3 using char32_t directly may be a bad idea. I think we should
> focus on strong types instead... Oh, it doesn't require Unicode...
>



> If it provides ASCII it then better provide ASCII character type. We
> don't want to continue abusing "char".
>

I have not written out sketches of what these encoding types will look
like, as you've noticed. This is because Tom Honermann and others have
expressed great interest in pursuing what was done with text_view, where
each encoding has its own associated input and output types. The question
of signaling compatibility is still up in the air, with ideas and basic
implementations to date suggesting we check for (implicit) convertibility
between the encoder's character type and the decoder's character type
(e.g., they can both interoperate with each other through a
unicode_scalar_value conversion).
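
That check would look roughly like this (sketch only; the encoding types
and the code_point member are hypothetical placeholders, not the proposed
API):

    #include <type_traits>

    // Hypothetical: each encoding exposes the code point type it
    // produces/consumes.
    struct utf8_encoding  { using code_point = char32_t; };
    struct utf16_encoding { using code_point = char32_t; };

    // Two encodings can be piped into each other if the decoder's code
    // point type implicitly converts to the encoder's code point type.
    template <typename From, typename To>
    inline constexpr bool transcoding_compatible_v =
        std::is_convertible_v<typename From::code_point,
                              typename To::code_point>;

    static_assert(transcoding_compatible_v<utf8_encoding, utf16_encoding>);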

Others have expressed serious concerns about such a system (e.g., Henri
Sivonen's post and the resulting discussion). I am not sure there is a
clear winner here: this being r0 of the proposal, and since I might make
this a school project, I would rather do work in the space and report back
field experience than go all in on a design that results in bad slip-ups
from mixing character sets, or in conversion headaches for users.


> I don't like basic_utf8 providing encode_lone_surrogates parameter.
> That's not UTF8 then at all.
>

As stated, "this is not going to be looked into too deeply for the first
iteration of this proposal." That is, I am not going to actively pursue
such a path, it's just for thinking about. Whether or not we need these
modes or if it just isn't worth implementer time is completely fine. I'm
not bothered about not having this and throwing it out of R1/R2 of the
proposal, because Encoding is already concept-ified and can be swapped in
and out at will. If someone really wants WTF8, they can write their own
WTF8. It will be a small waste of their time having to reimplement _mostly_
what's inside the standard's encoding, but oh well: they'll survive.
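
As a rough illustration of what "concept-ified" buys here (made-up member
names, not the paper's actual concept):

    #include <concepts>
    #include <cstddef>
    #include <span>

    // Anything with the right member types and an encode_one/decode_one
    // pair can be dropped in -- including a user-written wtf8 type that
    // deliberately round-trips lone surrogates.
    template <typename E>
    concept Encoding = requires(E e,
                                std::span<const typename E::code_unit> in,
                                std::span<typename E::code_unit> out,
                                char32_t cp) {
        typename E::code_unit;
        { e.decode_one(in) } -> std::convertible_to<char32_t>;
        { e.encode_one(cp, out) } -> std::convertible_to<std::size_t>;
    };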

>
> For scalar value and grapheme cluster containers we would need iterator
> or range functions. My code uses next_scalar_value and
> previous_scalar_value so I can iterate scalar values inside the code
> unit range.
>

I haven't gotten that far yet. :D

Seriously, after encoding facilities, normalization needs to be addressed.
It's incredibly important because some APIs (like MS's WideCharToMultiByte
and MultiByteToWideChar) perform normalization *for* you, if you ask them
to. There are performance gains to be had for the free functions I plan to
write if we let an implementation both transcode and normalize at the same
time. ***

*** - needs to be proven out with benchmarks, first
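
For reference, a next_scalar_value-style helper over a UTF-8 code unit
range is roughly this shape (a standalone sketch, not the proposed API;
error handling is reduced to returning U+FFFD):

    #include <cstddef>
    #include <string_view>
    #include <utility>

    // Decode one scalar value starting at index i (assumes i < s.size()),
    // returning {scalar value, index of the next code unit}.
    std::pair<char32_t, std::size_t> next_scalar_value(std::u8string_view s,
                                                       std::size_t i) {
        constexpr char32_t replacement = 0xFFFD;
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) return {b0, i + 1}; // 1-byte (ASCII) form

        const std::size_t len = b0 >= 0xF8 ? 0
                              : b0 >= 0xF0 ? 4
                              : b0 >= 0xE0 ? 3
                              : b0 >= 0xC0 ? 2 : 0;
        if (len == 0 || i + len > s.size()) return {replacement, i + 1};

        char32_t cp = b0 & (0x7F >> len); // payload bits of the lead byte
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return {replacement, i + 1}; // bad continuation
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates, and out-of-range values.
        constexpr char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return {replacement, i + 1};

        return {cp, i + len};
    }

previous_scalar_value is the same walk in reverse: step back over trailing
continuation bytes to find the lead byte, then decode forward.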

Again, thank you for the feedback. I'll do my best to clean up what I can
soon.

Sincerely,
JeanHeyd
