Date: Wed, 30 May 2018 20:36:16 -0400
I've been bikeshedding and looking at a wide variety of languages and
implementations of text. Many still use code units and provide iterators to
code units as their default level of abstraction. Others provide code points,
and a few newer languages and libraries provide grapheme clusters.
I'm beginning to think `std::text` -- in whatever form it takes -- should
include no defaults and instead just pack itself with member functions such
as `.codepoints()`, `.graphemes(/*options*/)`, `.words(/*options*/)` and
let the user decide at what level they want to be working. I think encoding
and normalization should be part of the type name, because those are the
two most important properties, but after that we should simply be handing out views
and letting people pick whatever abstraction level they want.
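To make the shape of that idea concrete, here is a rough sketch and not a proposed interface. Everything in it is hypothetical: the `basic_text` name, the `utf8` and `nfc` tag types, and the `code_units()` member are assumptions made for illustration. It only decodes well-formed UTF-8, it returns a `std::vector` where a real design would hand out a lazy view, and `.graphemes()` and `.words()` are left out because they need the Unicode segmentation tables.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical tag types: encoding and normalization are part of the type name.
struct utf8 {};
struct nfc {};

// Hypothetical stand-in for std::text; only one specialization is sketched.
template <typename Encoding, typename Normalization>
class basic_text;

template <>
class basic_text<utf8, nfc> {
public:
    // Assumes the caller passes well-formed, already NFC-normalized UTF-8.
    // Nothing is validated or normalized here.
    explicit basic_text(std::string bytes) : storage_(std::move(bytes)) {}

    // Deliberately no begin()/end(): there is no default iteration level.

    // Lowest level: the raw UTF-8 code units.
    std::string_view code_units() const { return storage_; }

    // Code point level. Returns a vector for brevity; a real design would
    // return a lazy view over the storage instead.
    std::vector<char32_t> codepoints() const {
        std::vector<char32_t> out;
        std::size_t i = 0;
        while (i < storage_.size()) {
            const unsigned char lead = static_cast<unsigned char>(storage_[i]);
            // Sequence length from the lead byte (input assumed well-formed).
            const std::size_t len =
                lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            // Keep only the payload bits of the lead byte.
            char32_t cp = (len == 1) ? lead : (lead & (0xFFu >> (len + 1)));
            // Each continuation byte contributes six more bits.
            for (std::size_t k = 1; k < len && i + k < storage_.size(); ++k) {
                cp = (cp << 6) |
                     (static_cast<unsigned char>(storage_[i + k]) & 0x3Fu);
            }
            out.push_back(cp);
            i += len;
        }
        return out;
    }

private:
    std::string storage_;
};
```

With this sketch, a flag emoji constructed as `basic_text<utf8, nfc> flag("\xF0\x9F\x87\xA8\xF0\x9F\x87\xA6")` reports 8 from `flag.code_units().size()` and 2 from `flag.codepoints().size()`, while a user would perceive a single character. Three different answers for one string is exactly why I'd rather make the caller name the level.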
At the very least, it means that we don't arm the wrong gun for the user by
accident / by default. And at some level, the user will have to read
through even the synopsis of what each of these functions will bring to
them. Some are obvious (text segmentation algorithms such as words and
sentences), but others might require some thinking (codepoints and graphemes
in particular).
I can understand that not having defaults could be a frustrating experience
to start with, however. I feel like it's justifiable given that there's no
100% right answer. Maybe we can get 80% but it will only make the 20% more
surprising. I feel like having docs / tables that describe the various
segmentation algorithms, what they bring to the table, and what the user can
get out of them might be more worthwhile.
Is this a viable path?
Received on 2018-05-31 02:36:19