sg16: [SG16] [boost] [review] [text] Text formal review

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Sun, 21 Jun 2020 10:16:07 -0400

This is a review for the Boost.Text library, submitted a day late (but
hopefully not a dollar short! (U.S. colloquialism, don't mind me!)).

The library has 3 somewhat related but (somewhat?) separable
sub-libraries. In "building block" order, these are:

- A string layer (a new std::string)
- A unicode layer (algorithms and data)
- A text layer (string, but if it gave a single flying crap about Unicode)

There are 4 (3?) types to care about in the string layer:
unencoded_rope, string, segmented_vector (and string_builder...?).
There are not many new types in the unicode layer save for things that
help the algorithms/data do the things well and report findings from
algorithm calls.
There are 2 types to care about in the text layer: text and rope.

We will start with the lowest building block layer. This will likely
not be a typical review: most others have called out in the
documentation and other places things that have failed, so I will
focus primarily on the utility and design of the layers and what they
can bring to the table.

======
Layer 0
======

[[string]]
It gets rid of char_traits (yay!) but then also throws out the
allocator (ewwww!). ... That's it.

This type does not affect my view of the library because it can be
(mostly) safely ignored. It can be nuked from orbit and nothing of
value will be lost. I would actually recommend std::string be used
underneath, because why do this to the ecosystem for the (N+1)th time?

[[unencoded_rope]][[segmented_vector]]
These two data structures are FAR more spicy and incredibly
interesting. They provide different guarantees of insertion and
erasure complexity. Of course, neither has allocators built in so I
can't really customize how this works without hijacking global new and
delete, but Zach has made clear his distaste for the allocator world
and having recently built several standard containers I don't blame
him.

Nevertheless, both of these data structures are being talked about
together because they provide the same type of functionality:
unencoded_rope is just specialized for char storage. Notably,
segmented_vector has an insert for (iterator, value_type) while
unencoded_rope seemed to be missing that and only wanted to deal with
"strings"/ranges, rather than single elements. This made applying my
fun text-wrapper to unencoded_rope mildly annoying because
single-insert was just not present:

phd::text::basic_text<utf8, nfd, boost::text::unencoded_rope> wee;
wee.insert(u8'A'); // kabloosh!

Nevertheless, my "shortcuts" for single insertion are honestly a waste
of space because I can just turn that into a range of size 1 and use
less of the "required" SequenceContainer
(https://en.cppreference.com/w/cpp/named_req/SequenceContainer) bits
'n' bobs anyway. So no real harm, no actual foul!
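
A minimal sketch of the "range of size 1" idea, for illustration only.
`insert_one` is a hypothetical helper (not part of Boost.Text or of the
text-wrapper above), and std::string / std::vector merely stand in for
any container that offers an iterator-range insert:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical helper (NOT a Boost.Text API): insert a single element using
// only the container's iterator-range insert, by treating the element as a
// range of size 1.
template <typename Container, typename Iterator, typename T>
void insert_one(Container& c, Iterator where, T value) {
    // 'value' is a by-value copy, so it cannot alias the container's storage.
    const T* first = std::addressof(value);
    c.insert(where, first, first + 1); // [first, first + 1) holds one element
}
```

Under these assumptions, a call such as `insert_one(s, s.begin(), 'a')`
behaves like a single-element insert without the container having to
provide an (iterator, value_type) overload.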

Running my tests with an encoding slapped on top of the unencoded rope
or the segmented vector worked, which meant I could get a different
storage policy with the nfd normalization form and the encoding of my
choice. (Well, I only tested utf8/16/32, one byte encoding, and then
the current execution character set (which was just utf8 anyways so
that's not really that exhaustive, is it?)).

This layer has immense value. Keep it and ship it; great job, Zach!

[[string_builder]]
I think this is vestigial. So, uh, doesn't really affect the review,
and I don't care for it?

=======
Layer 1
=======

Yes.

... That's it. That's literally it: ship it. Goddamn, ship this layer
like your favorite movie couple. This is what we need. This is what we
crave. It's ICU, except if ICU went to study under Stepanov, Lee and
Plauger instead of Gosling, Sheridan and Naughton. No complaints, no
problems: having this layer makes this library ABSOLUTELY worth it,
210%. There are even special normalize-in-place algorithms for
strings, which can save on performance. You can implement your own
Unicode text-aware layer on top of this stuff, it provides a robust
set of algorithms and normalization forms (hell yeah!) and makes every
second that this library is not in Boost a tragedy.

Passed the necessary tests on my machine despite taking an age, but
that's more because the generated Unicode tests are a doozy.

Speaking of "implement your own Unicode text-aware layer..."

======
Layer 2
======

This is the layer I am -- on a library design level and a personal
philosophy level -- the most opposed to.

But my answer is still to accept it (well, modulo it being based on
the above string type. Please just use std::string).

[[ text ]] [[ rope ]]
While these containers can be evaluated individually, other reviews
have already picked at them a great deal, so I won't bother.
There was some grumbling about how a rope-like data structure is not
interesting enough to be included and I will just quietly wave that
off as "my use case is the only use case that matters and therefore I
don't care about other people's invariants or needs".

There are many implicitly (and explicitly) stated and maintained
opinions in this layer:

- UTF-8 is the way, truth, and life.
- Unicode is the only encoding that matters ever, for all time, in perpetuity.
- Allocators are shit!
- NFC is probably the best we can do here for varying reasons.
- Who needs non-contiguous storage anyways?
- Who needs non-SBO storage, anyways?

These are all opinions, many of which are present in the design of the
text container. And they allow this text container to ship. But that
lack of flexibility -- while okay for Qt or Apple's CoreText or
whatever other platform-specific hoo-ha you want to get involved with
-- does not help. In fact, it cuts them off: more than one person
during Meeting C++ spoke to me of Boost.Text and said it could not
meet their needs because it maintained encoding or normalization
invariants that did not interoperate with their existing system.
Storage is also an issue: while "I use boost::text::string underneath"
is fine and dandy, not many systems (next to none, maybe?) are going to
speak in "text" or its related string type. They will want the
underlying container to speak to. For duck-type purposes, it works.
But for everyone else, it fails.

Since the string layer uses an `int` for its size and capacity, it is
lopsidedly incompatible with existing STL implementations of string,
to the point that a reinterpret_cast -- however evil -- is not
suitable for transporting a reference-without-copy into these APIs.
God bless string_view and its friends, because it allows us to at
least continue to talk to some APIs since the text type guarantees
contiguous storage. This means that at the boundaries of an
application -- or even as a plugin to a wider ecosystem -- I am paying
a (sometimes obscene) cost to interoperate between
std::string/llvm::SmallString/unicode_code_unit_sequence and all the
other things people have developed to sit between them and what they
believe their string needs are. And while it is whack that so many of
these classes exist,

they do.
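
To make the boundary cost concrete, here is a rough sketch under stated
assumptions. `int_sized_string` is a made-up stand-in for a contiguous
string whose size is an `int` (it is NOT boost::text::string), and
`legacy_api`, `view_api`, and `as_view` are hypothetical names. An API
that insists on `const std::string&` forces a copy at the boundary,
while one that takes `std::string_view` only needs a pointer and a
length:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Made-up stand-in for a contiguous string with an int size
// (NOT boost::text::string).
struct int_sized_string {
    std::vector<char> storage;
    const char* data() const { return storage.data(); }
    int size() const { return static_cast<int>(storage.size()); }
};

// Hypothetical API that insists on std::string: callers holding any other
// string type must build a std::string (allocate + copy) to call it.
std::size_t legacy_api(const std::string& s) { return s.size(); }

// Hypothetical API that takes a view: no copy is needed.
std::size_t view_api(std::string_view s) { return s.size(); }

// Bridge: contiguous storage plus a length is all string_view requires.
std::string_view as_view(const int_sized_string& s) {
    return std::string_view(s.data(), static_cast<std::size_t>(s.size()));
}
```

The view shares the original buffer, whereas calling `legacy_api`
requires something like `std::string(as_view(s))`, which is where the
interoperability cost shows up.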

That lack of interoperability -- and once again, the lack of an
allocator template parameter -- hampers this library from COMPLETELY
DOMINATING the string scene. It will always be used as a solution,
maybe even 80% of the time. Those seeking more will have to figure out
how to build their own UTF16 containers, or their own special-encoded
containers, with very little support from the text library (save for
some transcoding functions they can leverage, but only from specific
Unicode encodings).

Onto the good news: the text and rope classes work like I expect them
to. They pass my tests. A+, great job: they keep my text in utf8 and the
prescribed normalization form! Despite the length of my previous
critique that basically amounts to "who died and made you King of my
string layout and memory allocation?", this layer and the library
should still be accepted.

============
Okay, Seriously?
============

Yep.

See, the problem right now with C++ -- and the standard in general --
is that we like to wait for something to bake for an eternity, often
long after it's useful and necessary for the end user. In C++11 we
introduced a "codecvt"-style thing called "std::wstring_convert",
whose sole purpose was transcoding, plus or minus some platform
shenanigans. It was implemented poorly on almost all platforms, its
performance is hot garbage
(https://github.com/ThePhD/sol2/issues/571), and it generally was a
bug-ridden mess.

But it shipped.

What we did when we both deprecated and removed std::wstring_convert
and its related facets is we took a real pain point in the C++
community and decided to make it far worse than it already was. See,
C++ -- and C++11 -- were steaming piles of dogpoo when it came to
Unicode (https://stackoverflow.com/a/17106065). So when
wstring_convert came on the scene, it was a breath of fresh air. Yeah,
the performance is garbage, yes the interface is trash, yes it hasn't
learned anything from Stepanov's fantastic work, but it was there. It
was workable. And it was standard.

And the Committee ripped it out of the user's hands.

Boost.Text, for however many extremely opinionated decisions it makes
that end up excluding certain parts of the C++ ecosystem, provides a
SORELY needed relief for the majority of the C++ community who have
been struggling for the tiniest bit of a text solution. So even if the
storage has a mandated encoding; a strict normalization form is given;
and everything else costs you a pound of flesh to build yourself, the
whole point is that there is a default, and it is a pretty good
default.

This is something that cannot be overstated in the slightest; we have
nothing -- and I mean, N O T H I N G -- that reflects a good C++
library for Unicode. Even if you do not like Zach's decisions, other
people can pick up Zach's container types and run with them for quite
a while. Sure, the 7x performance gains I got in my last job using
solely allocators are impossible with Boost.Text! But, Layer 1 exists:
I can leverage well-done Unicode algorithms to do the job I need to,
even if it is not as convenient and pre-packaged as I would like it to
be. This is not only important for the ecosystem at large, but for the
Boost Community. For a long time people have wondered if Boost will
lead the charge towards a better, brighter future by solving problems
that users face the most, or if it would fade into
compatibility-library obscurity and be repeatedly reviled for its
special build needs and required setup over its standard library
equivalents.

Boost.Text is one of many libraries I *expect* to see land in Boost to
solve critical problems, to be iterated and shipped towards the wider
C++ ecosystem and have an impact that most library developers would
only dream of.

==========
In Conclusion
==========

Just one really big thing for me:

- Use std::string underneath. `int` is not a good size type. People
work with strings larger than 1 GB (INT_MAX / 2, as reported by the
string implementation).
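
The arithmetic behind that number can be sketched as follows. This
assumes a 32-bit `int` (true on mainstream desktop and server
platforms, but an assumption nonetheless); `fits_under_cap` is a
hypothetical helper, and the INT_MAX / 2 cap is taken from the remark
above rather than from the library's source:

```cpp
#include <cassert>
#include <climits>

// Assumption: int is 32 bits wide.
static_assert(sizeof(int) * CHAR_BIT == 32, "sketch assumes a 32-bit int");

constexpr long long int_max      = INT_MAX;      // 2'147'483'647
constexpr long long half_int_max = int_max / 2;  // 1'073'741'823
constexpr long long one_gib      = 1LL << 30;    // 1'073'741'824 bytes

// Hypothetical check: would a string of n chars fit under an INT_MAX / 2 cap?
constexpr bool fits_under_cap(unsigned long long n) {
    return n <= static_cast<unsigned long long>(INT_MAX / 2);
}
```

So the cap lands one byte short of 1 GiB, and a buffer of exactly
1 GiB already does not fit.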

Other people commented on the other fixes I would care about and most
of those have already been noted, thanks! Other than that...

Please accept Boost.Text for inclusion in the next available version
of Boost and continue to work towards the end of our collective 40
year string nightmare.

We can sort out COMPLETE DOMINATION of the design space a little
later, since this design is -- thankfully -- not one that is immune to
source backwards compatible improvements.

Received on 2020-06-21 09:19:32