Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Mon, 9 Aug 2021 17:11:14 +0200
On Mon, Aug 9, 2021 at 4:51 PM Charlie Barto via SG16 <sg16_at_[hidden]> wrote:

> It's not just performance (although my point about actually having to look
> at each byte of the input string is still the case). As far as I can tell
> the "less than one instruction per byte" vectorized approach described in
> the linked paper does not tell you _where_ the error occurred, so is not
> useful if you need to handle valid UTF-8 that follows invalid UTF-8.
> In any case my main issue isn't with performance but that _forcing_
> validation at program startup makes some perfectly reasonable programs
> (such as "cp" on Windows) impossible to write.
> If we offer a "std::arguments"-type interface it's probably a good idea to
> offer both the UTF-8 and WTF-8 interfaces. This is, in fact, exactly what
> Rust does, and I think they got this right. For the validated version they
> also validate each argument individually, so, for example, your program
> won't die if it's put in a directory that has bogus UTF-8 in its path
> (argv[0]).

I think we should really find a way to expose arguments as globals; from
there we can build any number of interfaces on top.
I have no idea how that would work with shared libraries though.
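
To make the layering concrete, here is a minimal sketch of a validated view
built on top of raw arguments. It assumes byte-oriented arguments as in a
POSIX-style argv (it does not model WTF-8 or the Windows UTF-16 command
line), and the names first_utf8_error, checked_argument and check_arguments
are hypothetical, not a proposed API. The check is a plain scalar loop, not
the vectorized algorithm from the paper; unlike that one it reports where the
first error is, and it runs per argument so one bogus argument does not
poison the rest.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Returns the byte offset of the first ill-formed sequence, or std::nullopt
// if the whole string is well-formed UTF-8. Scalar check, not vectorized.
std::optional<std::size_t> first_utf8_error(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }      // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;     // allowed range for 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;          // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;          // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;          // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;          // reject values above U+10FFFF
        } else {
            return i;                           // stray continuation or bad lead
        }
        if (s.size() - i < len) return i;       // truncated sequence
        const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return i;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return i;
        }
        i += len;
    }
    return std::nullopt;
}

// Hypothetical per-argument result: the raw bytes stay usable either way.
struct checked_argument {
    std::string_view bytes;                 // unvalidated, always available
    std::optional<std::size_t> error_at;    // set only if bytes are ill-formed
};

// Checks each argument on its own instead of failing the whole program.
std::vector<checked_argument> check_arguments(int argc,
                                              const char* const* argv) {
    std::vector<checked_argument> out;
    for (int i = 0; i < argc; ++i) {
        const std::string_view a = argv[i];
        out.push_back({a, first_utf8_error(a)});
    }
    return out;
}
```

A program like "cp" would only ever touch bytes and ignore error_at, while a
program that prints its arguments could check error_at for the ones it
actually displays.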

> Charlie.
> -----Original Message-----
> From: Peter Brett <pbrett_at_[hidden]>
> Sent: Monday, August 9, 2021 5:19 AM
> To: Charlie Barto <Charles.Barto_at_[hidden]>
> Cc: sg16_at_[hidden]
> Subject: RE: [SG16] A UTF-8 environment specification; an alternative to
> assuming UTF-8 based on choice of literal encoding
> Hi Charlie,
> Now that we have real-world evidence that UTF-8 can be validated in less
> than one instruction per byte on commodity platforms (and somewhat faster
> than memcpy()-ing those bytes), when do you think we can stop being so
> concerned about the performance impact of UTF-8 validation?
> Validating UTF-8 in less than one instruction per byte
> J. Keiser and D. Lemire
> Best regards,
> Peter
> > -----Original Message-----
> > From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Charlie Barto via SG16
> > Sent: 29 July 2021 23:33
> > To: Corentin Jabot <corentinjabot_at_[hidden]>
> >
> > yes, and in any case if we wanted to ensure the parameters were actually
> > UTF-8 the runtime startup code would have to do that check. If users are
> > checking, they can defer or omit validity checks in some cases. This can
> > be important: to check that the string is actually well formed you need
> > to _actually look_ at every single byte and then do a sequence of
> > probably a few dozen instructions to decide if it's valid. Sometimes
> > it's OK to just assume it _is_ valid if you don't do anything that
> > actually requires the whole thing be valid. For example you may
> > linearly search for delimiters then parse text between them; as long
> > as you are careful about validating the text between the delimiters it
> > doesn't matter if some other part of the string is bogus, and you
> > never have to execute the instructions that would check those other bits
> > of the string.
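
The delimiter case above can be sketched roughly as follows. This is only an
illustration: the ';' delimiter and the helper names nth_field and
parse_int_field are assumptions, not anything from the thread. The search
never validates the fields it skips, and the one field that gets consumed is
validated by the parse itself.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Returns the n-th (0-based) field of a ';'-delimited byte string.
// Skipped fields are scanned for the delimiter only, never validated.
std::optional<std::string_view> nth_field(std::string_view s, std::size_t n) {
    for (;;) {
        const std::size_t end = s.find(';');
        if (n == 0) return s.substr(0, end);
        if (end == std::string_view::npos) return std::nullopt;
        s.remove_prefix(end + 1);
        --n;
    }
}

// Parses a field as a decimal int. std::from_chars accepts only an optional
// minus sign followed by ASCII digits, so a successful full-length parse is
// also a validity check on exactly the bytes that are used.
std::optional<int> parse_int_field(std::string_view field) {
    int value = 0;
    const char* first = field.data();
    const char* last = first + field.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return value;
}
```

With an input whose first and last fields contain ill-formed bytes, asking
for the middle field still works, and the bogus bytes elsewhere are never
examined beyond the delimiter search.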
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2021-08-09 10:11:28