On Mon, Aug 9, 2021 at 4:51 PM Charlie Barto via SG16 <sg16@lists.isocpp.org> wrote:
It's not just performance (although my point about actually having to look at each byte of the input string still stands). As far as I can tell, the "less than one instruction per byte" vectorized approach described in the linked paper does not tell you _where_ the error occurred, so it is not useful if you need to handle valid UTF-8 that follows invalid UTF-8.
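
To make the "where" point concrete, here is a minimal scalar sketch of a validator that reports the offset of the first ill-formed sequence, so a caller can resume after it. The names first_invalid_utf8 and count_utf8_errors are hypothetical, chosen only for illustration. This is not the vectorized algorithm from the paper, and skipping a single byte after an error is just one possible recovery policy.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not a standard or proposed API): returns the offset of
// the first byte at or after `from` that starts an ill-formed sequence, or
// std::string_view::npos if the rest of `s` is well-formed UTF-8.
std::size_t first_invalid_utf8(std::string_view s, std::size_t from = 0) {
    assert(from <= s.size());
    std::size_t i = from;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) { ++i; continue; }          // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;       // allowed range of the second byte
        if (b >= 0xC2 && b <= 0xDF) {
            len = 2;
        } else if (b >= 0xE0 && b <= 0xEF) {
            len = 3;
            if (b == 0xE0) lo = 0xA0;             // rejects overlong encodings
            if (b == 0xED) hi = 0x9F;             // rejects UTF-16 surrogates
        } else if (b >= 0xF0 && b <= 0xF4) {
            len = 4;
            if (b == 0xF0) lo = 0x90;             // rejects overlong encodings
            if (b == 0xF4) hi = 0x8F;             // rejects values above U+10FFFF
        } else {
            return i;                             // stray continuation or invalid lead byte
        }
        if (s.size() - i < len) return i;         // sequence truncated by end of input
        const unsigned char c1 = static_cast<unsigned char>(s[i + 1]);
        if (c1 < lo || c1 > hi) return i;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if (c < 0x80 || c > 0xBF) return i;
        }
        i += len;
    }
    return std::string_view::npos;
}

// Counts the offsets at which validation fails, resuming one byte past each
// error so that well-formed text following a bad byte is still examined.
std::size_t count_utf8_errors(std::string_view s) {
    std::size_t errors = 0;
    std::size_t pos = 0;
    while ((pos = first_invalid_utf8(s, pos)) != std::string_view::npos) {
        ++errors;
        ++pos;
    }
    return errors;
}
```

A validator that only answers yes or no cannot drive a loop like count_utf8_errors; the caller needs the offset to know where to resume.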

In any case, my main issue isn't with performance but that _forcing_ validation at program startup makes some perfectly reasonable programs (such as "cp" on Windows) impossible to write.

If we offer a "std::arguments"-style interface, it's probably a good idea to offer both the UTF-8 and WTF-8 interfaces. This is, in fact, exactly what Rust does, and I think they got this right. For the validated version they also validate each argument individually, so, for example, your program won't die if it's put in a directory that has bogus UTF-8 in its path (argv[0]).
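
As a rough sketch of what validating each argument individually could look like in C++: the Argument type and classify_arguments function below are hypothetical and not part of any proposal. The raw byte view here only stands in for the unvalidated (WTF-8 or OS-string) interface, and the validator is passed in by the caller because the sketch deliberately does not pick one.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical type, for illustration only: one command-line argument.
struct Argument {
    std::string_view raw;                   // bytes exactly as received, never rejected
    std::optional<std::string_view> utf8;   // set only if `raw` passed validation
};

// Classifies each argument on its own, so one ill-formed argument (for
// example a bogus argv[0]) does not prevent the others from being used.
// `is_valid_utf8` is any callable taking std::string_view and returning bool.
template <class Validator>
std::vector<Argument> classify_arguments(int argc, const char* const* argv,
                                         Validator is_valid_utf8) {
    assert(argc >= 0 && (argc == 0 || argv != nullptr));
    std::vector<Argument> out;
    out.reserve(static_cast<std::size_t>(argc));
    for (int i = 0; i < argc; ++i) {
        const std::string_view raw = argv[i];
        Argument arg{raw, std::nullopt};
        if (is_valid_utf8(raw)) arg.utf8 = raw;   // validated view is opt-in per argument
        out.push_back(arg);
    }
    return out;
}
```

A program like "cp" would read only the raw member and never care whether utf8 is set, while a program that wants text can check utf8 for each argument and report exactly which one was ill-formed.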

I think we should really find a way to expose arguments as globals; from there we can build any number of interfaces on top.
I have no idea how that would work with shared libraries, though.


Charlie.

-----Original Message-----
From: Peter Brett <pbrett@cadence.com>
Sent: Monday, August 9, 2021 5:19 AM
To: Charlie Barto <Charles.Barto@microsoft.com>
Cc: sg16@lists.isocpp.org
Subject: RE: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

Hi Charlie,

Now that we have real-world evidence that UTF-8 can be validated in less than one instruction per byte on commodity platforms (and somewhat faster than memcpy()-ing those bytes), when do you think we can stop being so concerned about the performance impact of UTF-8 validation?

    Validating UTF-8 in less than one instruction per byte
    J. Keiser and D. Lemire

Best regards,

                  Peter

> -----Original Message-----
> From: SG16 <sg16-bounces@lists.isocpp.org> On Behalf Of Charlie Barto via SG16
> Sent: 29 July 2021 23:33
> To: Corentin Jabot <corentinjabot@gmail.com>
>
> Yes, and in any case if we wanted to ensure the parameters were
> actually UTF-8, the runtime startup code would have to do that check.
> If users are doing the checking, they can defer or omit validity
> checks in some cases. This can be important: to check that the string
> is actually well formed you need to _actually look_ at every single
> byte and then do a sequence of probably a few dozen instructions to
> decide if it's valid. Sometimes it's OK to just assume it _is_ valid
> if you don't do anything that actually requires the whole thing be
> valid. For example, you may linearly search for delimiters and then
> parse the text between them. As long as you are careful about
> validating the text between the delimiters, it doesn't matter if some
> other part of the string is bogus, and you never have to execute the
> instructions that would check those other bits of the string.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16