ISOCPP sg16 List: Re: Performance requirements for Unicode views/types/algorithms

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Mon, 27 Feb 2023 22:29:48 +0100

On 26/02/2023 02.48, Steve Downey via SG16 wrote:
> Just like everything else in the standard library, they should be adequate to good for most uses. None of the standard containers were designed to be the best for almost all cases, not even std::vector. As long as I don't have to tell people to avoid them, they are probably good enough. Best if the obvious use of them is inherently correct.

I agree; for general-purpose stuff offered by the standard library, performance
should be "usually good enough" and it would be a plus if the specified interface
didn't get in the way of a highly efficient implementation. (std::unordered_map
fails the latter test.)

It still means we should get an impression of how much performance we potentially
leave on the floor. If we're looking at 10x, I'm getting sad.

> Much text processing is tied to IO and the performance is mostly secondary.

I am cautious of such statements; none of us has a comprehensive understanding
of how C++ is used out there. In particular since a lot of C++ code is
closed-source code we'll never get to see.

> If we could make accidentally incorrect harder to do that would be a win.
>
> I think range interfaces give the widest general permissions to provide more specialized alternatives, like a forward SIMD decoder sending data to an optimized vector-ish back inserter.

That already highlights an interesting interface pessimization:
A back inserter needs to check for every insert operation whether the
preallocated memory in the std::vector is (still) large enough to hold
the additional data. An eager algorithm could pre-size the vector
accordingly and ensure that checks on the target capacity are not needed
at least for the majority of operations. Fewer branches, more happiness.

> I do think a specialized text type with better support for in the middle inserts and deletes would be useful, but also very much orthogonal to the papers we're looking at now. I have some plans to bring forward a paper proposing 2,3-finger trees as a data structure (rope-like but more recent, and also a persistent data structure) that might be useful for a text manipulation type, but I think the underlying container should probably be generic and retrievable from text.

It seems a Unicode text data type / string type would mostly
serve the purpose of maintaining invariants such as "NFC" in
the face of insertions and deletions. I'm wondering about
the use-cases for such, and whether they're numerous enough
for the standard to care. And, of course, we need a paper.

Jens

Received on 2023-02-27 21:29:53