Re: Considerations for Unicode algorithms

From: Steve Downey <sdowney_at_[hidden]>
Date: Tue, 31 Jan 2023 23:43:06 -0500
On Tue, Jan 31, 2023 at 5:35 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> *code unit sequences should be validated by default.*
> The only way I know of to do this well (without contracts) is for
> validation to produce a wrapper type that statically indicates that
> validation has been performed. Validation is fast, but fast operations add
> up if repeated many times. I favor specifying preconditions that can be
> specified as contracts in the future.

I think this belongs in the translation from the various UTF forms to
code points, corresponding to the WHATWG Encoding Standard's
( https://encoding.spec.whatwg.org/ ) notion of a decoder. The suggestion,
if I'm understanding correctly, is that the default mode should validate
the stream of bytes and produce char32_t values containing code points.
There are various options for dealing with broken text: halting,
supplying replacement characters, or doing nothing.
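
To make those options concrete, here is a rough sketch of a UTF-8 decoder
with a selectable error policy. All names (on_error, decode_utf8) are
hypothetical and not taken from any proposal. As a simplification, each
ill-formed code unit is handled on its own, so the number of U+FFFD
substitutions may differ from what the WHATWG decoder would emit for the
same input.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical policy names for the three options mentioned above.
enum class on_error { halt, replace, skip };

// Decodes UTF-8 code units into code points.
// Returns std::nullopt only when the policy is halt and the input is ill-formed.
inline std::optional<std::u32string> decode_utf8(std::string_view in,
                                                 on_error policy) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;  // expected sequence length; 0 means bad lead byte
        char32_t cp = 0;      // code point being assembled
        char32_t min = 0;     // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }

        // The sequence must fit in the input and use valid continuation bytes.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF &&
             !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) {
            out.push_back(cp);
            i += len;
            continue;
        }
        if (policy == on_error::halt) return std::nullopt;
        if (policy == on_error::replace) out.push_back(U'\uFFFD');
        ++i;  // simplification: drop one code unit and resynchronize
    }
    return out;
}
```

With this sketch, the input "a", 0xFF, "b" decodes to nothing under halt,
to 'a' U+FFFD 'b' under replace, and to 'a' 'b' under skip. The point is
only that the policy lives in the decoder, not in whatever consumes the
code points.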
This sort of complication is why I'd like to see this aspect factored out
of the algorithms. I've had some success with the segmentation algorithms
producing 'segments' that allow access to the underlying code units for
reconstruction. Nothing ready for publication yet, unfortunately.
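
As a very rough illustration of the idea (not the unpublished work
mentioned above), a segment can simply carry a view of the code units it
covers. The names segment, split_segments and reconstruct are made up for
this sketch, and breaking after each ASCII space is only a stand-in for a
real segmentation algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical segment type: it keeps a view of the original code units,
// so nothing is lost even if the text inside is ill-formed.
struct segment {
    std::string_view code_units;  // refers to the caller's buffer, not owned
};

// Stand-in segmentation: break after every ASCII space.
// A real implementation would apply Unicode segmentation rules instead.
inline std::vector<segment> split_segments(std::string_view text) {
    std::vector<segment> out;
    std::size_t start = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == ' ') {
            out.push_back(segment{text.substr(start, i + 1 - start)});
            start = i + 1;
        }
    }
    if (start < text.size()) out.push_back(segment{text.substr(start)});
    return out;
}

// Concatenating the views reproduces the input code unit for code unit.
inline std::string reconstruct(const std::vector<segment>& segs) {
    std::string result;
    for (const auto& seg : segs) result.append(seg.code_units);
    return result;
}
```

Because every code unit belongs to exactly one segment, reconstruct
returns the original input unchanged, including a stray 0xFF byte that a
validating decoder would have replaced or rejected.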

Received on 2023-02-01 04:43:20