sg16: Re: [SG16-Unicode] Performance of C interfaces (was: Re: SG16 meeting summary for August 21st, 2019)

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Sun, 1 Sep 2019 19:00:23 -0400

On Sun, Sep 1, 2019 at 12:07 PM Steve Downey <sdowney_at_[hidden]> wrote:
>
> That was, if I recall correctly, about the C standard library interfaces in the Null-terminated multibyte strings section. Basically that the character at a time interfaces are not amenable to vectorization.
>

     Yes. The C interfaces for converting between multibyte strings
and UTF-16/UTF-32 (mbrtoc16, c16rtomb, etc.) currently encode one
character at a time, with a function that is often hidden behind a
DLL function call or in object code. The former prevents anything
from being done about it; the latter is just a prayer that LTO can
optimize _so well_ that it takes your loop using the one-by-one
code point converting functions and turns the whole thing into a
really, really nice loop which converts things very quickly.
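
     For reference, a minimal sketch of the kind of loop being
described, assuming the current locale's multibyte encoding and a
caller-supplied output buffer. The name convert_one_by_one is
hypothetical; only mbrtoc16 comes from the C standard library. Every
code unit costs one call into the library, which is the call the
optimizer would have to see through.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: converts n bytes of multibyte input to UTF-16,
   one mbrtoc16 call per code unit. Returns the number of char16_t
   written, or (size_t)-1 on invalid/truncated input or a full buffer. */
size_t convert_one_by_one(const char *src, size_t n,
                          char16_t *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */
    size_t written = 0;

    for (;;) {
        char16_t c16;
        size_t rc = mbrtoc16(&c16, src, n, &state);

        if (rc == (size_t)-3) {
            /* Trailing surrogate of the previous character;
               no input was consumed by this call. */
        } else if (n == 0) {
            break;                     /* input exhausted, nothing pending */
        } else if (rc == (size_t)-1 || rc == (size_t)-2) {
            return (size_t)-1;         /* invalid or incomplete sequence */
        } else {
            if (rc == 0)
                rc = 1;                /* assumes a one-byte null character */
            src += rc;
            n -= rc;
        }

        if (written == dst_cap)
            return (size_t)-1;         /* output buffer too small */
        dst[written++] = c16;
    }
    return written;
}
```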

     I have not observed this to ever happen, and I'm working on a
benchmarking suite of various methods of conversion that will help
quantify these results in tangible ways.

     With a pointer + length interface, someone can optimize the
resulting call as much as they like. With null-terminated versions of
the function, I am skeptical the same performance can be achieved
without first calling strlen(), but I have no experience or data to
back up that intuition.
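
     As an illustration only, here is a sketch of what a known
length permits, assuming UTF-8 input and an output buffer with room
for n code units. The names widen_ascii_prefix and
widen_ascii_prefix_z are hypothetical and not part of any standard or
proposed interface. With the length in hand, eight bytes can be
tested at once without reading past the end of the buffer; the
null-terminated wrapper has to find the length first. The sketch
shows the shape of the two interfaces and makes no performance claim.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: widens the leading ASCII run of src[0..n) into
   dst (which must hold at least n code units). Returns the number of
   bytes converted; the caller handles whatever follows. */
size_t widen_ascii_prefix(const char *src, size_t n, char16_t *dst)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;

    /* Because n is known, a whole 8-byte chunk can be checked safely. */
    while (n - i >= 8) {
        uint64_t chunk;
        memcpy(&chunk, src + i, sizeof chunk);
        if (chunk & UINT64_C(0x8080808080808080))
            break;                     /* some byte has its high bit set */
        for (size_t k = 0; k < 8; ++k)
            dst[i + k] = (unsigned char)src[i + k];
        i += 8;
    }

    /* Finish (or stop at the first non-ASCII byte) one byte at a time. */
    while (i < n && (unsigned char)src[i] < 0x80) {
        dst[i] = (unsigned char)src[i];
        ++i;
    }
    return i;
}

/* Null-terminated variant: the length must be computed up front. */
size_t widen_ascii_prefix_z(const char *src, char16_t *dst)
{
    assert(src != NULL);
    return widen_ascii_prefix(src, strlen(src), dst);
}
```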

Sincerely,
JeanHeyd Meneide

Received on 2019-09-02 01:00:36