Date: Thu, 31 May 2018 10:41:20 -0500
On Wed, May 30, 2018 at 7:36 PM, ThePhD <phdofthehouse_at_[hidden]> wrote:
> I've been bikeshedding and looking at a wide variety of languages and
> implementations of text. Many still use code units and provide iterators to
> code units as their default level of abstraction. Others provide code points,
> and a few newer languages and libraries provide grapheme clusters.
>
> I'm beginning to think `std::text` -- in whatever form it takes -- should
> include no defaults and instead just pack itself with member functions such
> as `.codepoints()`, `.graphemes(/*options*/)`, `.words(/*options*/)` and
> let the user decide at what level they want to be working. I think encoding
> and normalization should be part of the type name, because those two are
> very important, but after that we should simply be handing out views
> and letting people pick whatever abstraction level they want.
>
There is no compelling reason for those things to be members. Boost.Text
will have all that functionality (and mostly does already), separated out
as free-function algorithms. I find that this works quite well.

Moreover, a text type is a sequence of something. Or rather, I can't get
much use out of it if it isn't. So, what is it a sequence of? I think
that graphemes, code points, and code units are the most essential units of
work that one might want to use. However, a sequence of exactly *one* of
these should be the kind of range that a text type models.

I think most users that work with text will want graphemes as the essential
view of text. That may prove to be untrue.
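
To make the free-function approach concrete, here is a minimal sketch. It is not Boost.Text's actual interface: the `text` struct and the `code_units` and `code_points` functions are hypothetical names invented for illustration, and the decoder assumes well-formed UTF-8 and does no validation. A grapheme view would be exposed the same way, but is left out because it needs the Unicode segmentation data.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical text type: it owns UTF-8 storage and nothing else.
struct text {
    std::string utf8;
};

// Free function: view the text as UTF-8 code units (bytes).
inline std::string_view code_units(const text& t) { return t.utf8; }

// Free function: decode the text into code points.
// Assumption: the input is well-formed UTF-8; malformed input is not detected.
inline std::vector<char32_t> code_points(const text& t) {
    std::vector<char32_t> out;
    const std::string& s = t.utf8;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        // Sequence length is determined by the lead byte.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this shape, a caller writes `code_points(t)` rather than `t.codepoints()`, and new views can be added later without touching the type itself.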
> At the very least, it means that we don't arm the wrong gun for the user
> by accident / by default. And at some level, the user will have to read
> through even the synopsis of what each of these functions will bring to
> them. Some are obvious (words, sentences, etc. text segmentation
> algorithms), but others might require some thinking (codepoints, graphemes
> in particular).
>
Not picking one is a lot worse. One of the many problems with Unicode is
that it is too damn complicated for experienced users, much less new ones.
We need types with the right default so that people can just pick up a new
version of their compiler and get to work without taking a week first to do
research. To the extent possible, it should "just work."
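
A small illustration of why the default level matters, as a sketch under stated assumptions: the helper names `count_code_units` and `count_code_points` are made up for this example, and the input is assumed to be well-formed UTF-8. The string "e" followed by U+0301 (combining acute accent) is one user-perceived character, yet it has three UTF-8 code units and two code points.

```cpp
#include <cstddef>
#include <string_view>

// Number of UTF-8 code units, which is simply the byte count.
inline std::size_t count_code_units(std::string_view s) { return s.size(); }

// Number of code points: every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new code point.
// Assumption: the input is well-formed UTF-8.
inline std::size_t count_code_points(std::string_view s) {
    std::size_t n = 0;
    for (char c : s)
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the bytes `"e\xCC\x81"`, `count_code_units` returns 3 and `count_code_points` returns 2, while a grapheme-level count would be 1. A user who just wants "the length" gets a different answer from each level, which is the argument for the type picking one.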
> I can understand that not having defaults could be a frustrating
> experience to start with, however. I feel like it's justifiable given that
> there's no 100% right answer. Maybe we can get 80% but it will only make
> the 20% more surprising. I feel like having docs / tables that describe the
> various segmentation algorithms, what they bring to the table, and what the
> user can get out of it might be more worthwhile.
>
> Is this a viable path?
>
I don't think it is. Again, the future reactions to Boost.Text will bear
that out (or not!).

Zach
Received on 2018-05-31 17:41:22