It's not just performance (although my point about actually having to look at each byte of the input string still stands). As far as I can tell, the "less than one instruction per byte" vectorized approach described in the linked paper does not tell you _where_ the error occurred, so it is not useful if you need to handle valid UTF-8 that follows invalid UTF-8.
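
To make the "_where_" point concrete, here is a minimal scalar sketch of a validator that reports the byte offset of the first ill-formed sequence. The name first_invalid_utf8 is made up for illustration and is not taken from the paper or from any library; the byte ranges follow the well-formed UTF-8 table in the Unicode standard.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns the offset of the first ill-formed byte
// sequence in s, or s.size() if all of s is well-formed UTF-8.
std::size_t first_invalid_utf8(std::string_view s) {
    const auto* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        if (b < 0x80) { ++i; continue; }    // ASCII

        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF; // allowed range of the 2nd byte
        if (b >= 0xC2 && b <= 0xDF) {
            len = 2;
        } else if (b >= 0xE0 && b <= 0xEF) {
            len = 3;
            if (b == 0xE0) lo = 0xA0;       // reject overlong forms
            if (b == 0xED) hi = 0x9F;       // reject surrogates
        } else if (b >= 0xF0 && b <= 0xF4) {
            len = 4;
            if (b == 0xF0) lo = 0x90;       // reject overlong forms
            if (b == 0xF4) hi = 0x8F;       // reject values above U+10FFFF
        } else {
            return i;                       // stray continuation, 0xC0, 0xC1, 0xF5..0xFF
        }

        if (n - i < len) return i;          // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return i;
        for (std::size_t k = 2; k < len; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return i;
        i += len;
    }
    return n;
}
```

With an offset in hand, a caller can replace or skip the bad bytes and carry on with whatever valid text follows, which a plain yes/no answer does not allow.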

In any case, my main issue isn't with performance but that _forcing_ validation at program startup makes some perfectly reasonable programs (such as "cp" on Windows) impossible to write.

If we offer a "std::arguments"-style interface, it's probably a good idea to offer both the UTF-8 and WTF-8 interfaces. This is, in fact, exactly what Rust does, and I think they got this right. For the validated version they also validate each argument individually, so, for example, your program won't die if it's put in a directory that has bogus UTF-8 in its path (argv[0]).
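
As a rough sketch of what "both interfaces, validated per argument" could look like in C++: the type Arguments and the functions is_utf8 and utf8 below are hypothetical names for illustration only, not a proposed API and not a transcription of Rust's. The raw bytes are always available, and the checked accessor fails only for the one argument that is malformed.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: true if s is well-formed UTF-8 (no overlong forms,
// no surrogates, nothing above U+10FFFF).
bool is_utf8(std::string_view s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const auto b = static_cast<unsigned char>(s[i]);
        const std::size_t len = b < 0x80                ? 1
                              : (b >= 0xC2 && b <= 0xDF) ? 2
                              : (b >= 0xE0 && b <= 0xEF) ? 3
                              : (b >= 0xF0 && b <= 0xF4) ? 4
                                                         : 0;
        if (len == 0 || n - i < len) return false;
        // Decode the code point so the range checks can be done on it.
        char32_t cp = (len == 1) ? b : (b & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k) {
            const auto c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return false;
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        i += len;
    }
    return true;
}

// Hypothetical argument container offering both views.
struct Arguments {
    std::vector<std::string> raw;  // bytes exactly as received, never rejected

    // Checked view: empty if argument i is missing or not well-formed UTF-8.
    // Other arguments are unaffected by one bad entry.
    std::optional<std::string_view> utf8(std::size_t i) const {
        if (i >= raw.size() || !is_utf8(raw[i])) return std::nullopt;
        return std::string_view(raw[i]);
    }
};
```

Here a bogus argv[0] only makes utf8(0) come back empty; utf8(1) and later still work, and a program like "cp" can use raw throughout.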

I think we should really find a way to expose arguments as globals; from there we can build an infinity of interfaces on top.
I have no idea how that would work with shared libraries though.


Hi Charlie,

Now that we have real-world evidence that UTF-8 can be validated in less than one instruction per byte on commodity platforms (and somewhat faster than memcpy()-ing those bytes), when do you think we can stop being so concerned about the performance impact of UTF-8 validation?

    Validating UTF-8 in less than one instruction per byte
    J. Keiser and D. Lemire

Best regards,


> yes, and in any case if we wanted to ensure the parameters were
> actually utf-8 the runtime startup code would have to do that check.
> If users are checking they can defer or omit validity checks in some
> cases. This can be important, to check that the string is actually
> well formed you need to _actually look_ at every single byte and then
> do a sequence of probably a few dozen instructions to decide if it's
> valid. Sometimes it's OK to just assume it _is_ valid if you don't do
> anything that actually requires the whole thing be valid. For example
> you may linearly search for delimiters then parse text between them,
> as long as you are careful about validating the text between the
> delimiters it doesn't matter if some other part of the string is
> bogus, and you never have to execute the instructions that would
> check those other bits of the string.
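
The delimiter case quoted above can be sketched as follows, assuming ','-separated fields and an integer in the field of interest; nth_field and parse_int_field are hypothetical names. Scanning for an ASCII delimiter byte is safe even in unvalidated input, because in UTF-8 a byte below 0x80 never occurs inside a multi-byte sequence. Only the selected field is validated, here implicitly by the integer parser.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: returns the n-th ','-separated field of s (0-based),
// or nullopt if there are fewer than n + 1 fields. The bytes of the other
// fields are only scanned for the delimiter, never validated.
std::optional<std::string_view> nth_field(std::string_view s, std::size_t n) {
    for (;;) {
        const std::size_t end = s.find(',');
        if (n == 0) return s.substr(0, end);
        if (end == std::string_view::npos) return std::nullopt;
        s.remove_prefix(end + 1);
        --n;
    }
}

// Hypothetical helper: parses field n as a decimal int. The parse is the
// validation: anything that is not entirely an integer is rejected.
std::optional<int> parse_int_field(std::string_view s, std::size_t n) {
    const auto f = nth_field(s, n);
    if (!f) return std::nullopt;
    int value = 0;
    const char* first = f->data();
    const char* last = first + f->size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return value;
}
```

In an input such as "\xFF\xFE,42,ok" the first field is not valid UTF-8, yet field 1 still parses, and no validation work is ever spent on the bogus bytes.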

