sg16: Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Charlie Barto <Charles.Barto_at_[hidden]>
Date: Mon, 9 Aug 2021 19:33:52 +0000

I think exposing the global could work fine for shared libraries. On Linux it should be no problem, because the runtime shouldn't have to allocate any memory or anything (i.e. there's no actual initialization code that needs to run for the global). On Windows there could be some DLL-related problems, but it still seems implementable to me.

Also, a correction to my comment on Rust's env::args mechanism: it actually does kill the program (it panics) upon seeing any invalid argument. Their env::args_os mechanism gives you "WTF-8" (on Windows) or a byte string (on Unix), and won't kill the process.

From: Corentin Jabot <corentinjabot_at_[hidden]>
Sent: Monday, August 9, 2021 8:11 AM
To: SG16 <sg16_at_[hidden]>
Cc: Peter Brett <pbrett_at_[hidden]>; Charlie Barto <Charles.Barto_at_[hidden]>
Subject: Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

On Mon, Aug 9, 2021 at 4:51 PM Charlie Barto via SG16 <sg16_at_[hidden]> wrote:
It's not just performance (although my point about actually having to look at each byte of the input string still stands). As far as I can tell, the "less than one instruction per byte" vectorized approach described in the linked paper does not tell you _where_ the error occurred, so it is not useful if you need to handle valid UTF-8 that follows invalid UTF-8.
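
To make the "where" concrete, here is a minimal sketch of a plain scalar validator that reports the byte offset of the first ill-formed sequence. It is not the vectorized algorithm from the paper. The name first_invalid_utf8 is hypothetical, and the byte ranges follow the well-formed UTF-8 table in the Unicode Standard.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the offset of the first ill-formed UTF-8
// sequence in `s`, or std::nullopt if the whole input is well formed.
inline std::optional<std::size_t> first_invalid_utf8(std::string_view s) {
    const auto* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = p[i];
        if (b0 < 0x80) { ++i; continue; }     // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;   // allowed range of the 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;        // reject overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;        // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;        // reject overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;        // reject values above U+10FFFF
        } else {
            return i;                         // stray continuation or bad lead
        }
        if (n - i < len) return i;            // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return i;
        for (std::size_t k = 2; k < len; ++k) {
            if (p[i + k] < 0x80 || p[i + k] > 0xBF) return i;
        }
        i += len;
    }
    assert(i == n);  // we only ever advance over complete sequences
    return std::nullopt;
}
```

A caller that needs to keep going after an error can skip past the reported offset and call the function again on the remainder.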

In any case, my main issue isn't with performance but that _forcing_ validation at program startup makes some perfectly reasonable programs (such as "cp" on Windows) impossible to write.

If we offer a "std::arguments"-style interface, it's probably a good idea to offer both the UTF-8 and WTF-8 interfaces. This is, in fact, exactly what Rust does, and I think they got this right. For the validated version they also validate each argument individually, so, for example, your program won't die if it's put in a directory that has bogus UTF-8 in its path (argv[0]).
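
A rough sketch of the "offer both" idea for byte-string platforms, with each argument validated individually and only on request. Everything here (argument, collect_arguments, is_valid_utf8) is a hypothetical name for illustration, not a proposed std::arguments design, and the Windows UTF-16 to WTF-8 conversion is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: decode-and-check UTF-8 validation.
inline bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const auto b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) { ++i; continue; }
        std::size_t len = 0;
        char32_t cp = 0, min = 0;  // decoded value and smallest legal value
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;
        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, out-of-range values, and surrogates.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}

// Hypothetical per-argument view: the raw bytes are always available,
// and the validated view is computed only when asked for.
struct argument {
    std::string_view raw;
    std::optional<std::string_view> utf8() const {
        if (is_valid_utf8(raw)) return raw;
        return std::nullopt;
    }
};

inline std::vector<argument> collect_arguments(int argc, const char* const* argv) {
    assert(argc >= 0);
    std::vector<argument> out;
    for (int i = 0; i < argc; ++i) out.push_back(argument{std::string_view(argv[i])});
    return out;
}
```

With this shape, a bogus argv[0] only affects code that asks for the validated view of argv[0]; a "cp"-like program can keep working on the raw bytes.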

I think we should really find a way to expose arguments as globals; from there we can build any number of interfaces on top.
I have no idea how that would work with shared libraries, though.

Charlie.

-----Original Message-----
From: Peter Brett <pbrett_at_[hidden]>
Sent: Monday, August 9, 2021 5:19 AM
To: Charlie Barto <Charles.Barto_at_[hidden]>
Cc: sg16_at_[hidden]
Subject: RE: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

Hi Charlie,

Now that we have real-world evidence that UTF-8 can be validated in less than one instruction per byte on commodity platforms (and somewhat faster than memcpy()-ing those bytes), when do you think we can stop being so concerned about the performance impact of UTF-8 validation?

    Validating UTF-8 in less than one instruction per byte
    J. Keiser and D. Lemire

Best regards,

                  Peter

> -----Original Message-----
> From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Charlie Barto via SG16
> Sent: 29 July 2021 23:33
> To: Corentin Jabot <corentinjabot_at_[hidden]>
>
> Yes, and in any case, if we wanted to ensure the parameters were
> actually UTF-8, the runtime startup code would have to do that check.
> If users do the checking themselves, they can defer or omit validity
> checks in some cases. This can be important: to check that the string
> is actually well formed you need to _actually look_ at every single
> byte and then execute a sequence of probably a few dozen instructions
> to decide if it's valid. Sometimes it's OK to just assume it _is_
> valid if you don't do anything that actually requires the whole thing
> to be valid. For example, you may linearly search for delimiters and
> then parse the text between them. As long as you are careful about
> validating the text between the delimiters, it doesn't matter if some
> other part of the string is bogus, and you never have to execute the
> instructions that would check those other bits of the string.
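
The delimiter case quoted above could look roughly like the sketch below. The names nth_field and nth_valid_field are hypothetical, and the validity check is supplied by the caller. It assumes the delimiter is an ASCII byte, which never appears inside a multi-byte sequence in well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the index-th field of `input` split on
// `delim`. Only delimiter bytes are searched for; nothing is validated.
inline std::optional<std::string_view>
nth_field(std::string_view input, char delim, std::size_t index) {
    std::size_t start = 0;
    for (;;) {
        assert(start <= input.size());
        const std::size_t end = input.find(delim, start);
        if (index == 0) {
            const bool last = (end == std::string_view::npos);
            return input.substr(start, last ? end : end - start);
        }
        if (end == std::string_view::npos) return std::nullopt;  // too few fields
        start = end + 1;
        --index;
    }
}

// Validates only the field that is actually consumed; the bytes in every
// other field are never passed to the validity check.
template <class Validator>
std::optional<std::string_view>
nth_valid_field(std::string_view input, char delim, std::size_t index,
                Validator is_valid) {
    const auto field = nth_field(input, delim, index);
    if (!field || !is_valid(*field)) return std::nullopt;
    return field;
}
```

Any UTF-8 check can be plugged in as is_valid. A bogus byte in field 0 does not stop a caller from extracting and validating field 1.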

--
SG16 mailing list
SG16_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2021-08-09 14:33:57