There is no compelling reason for those things to be members.  Boost.Text will have all that functionality (and mostly does already), separated out as free-function algorithms.  I find that this works quite well.


Moreover, a text-type is a sequence of something.  Or rather, I can't get much use out of it if it isn't.  So, what is it a sequence of?  I think that graphemes, code points, and code units are the most essential units of work that one might want to use.  However, a sequence of exactly *one* of these should be the kind of range that a text type models.
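
To make the three units concrete, here is a minimal sketch in plain standard C++. It is not Boost.Text's API: `decode_utf8` and `count_graphemes_approx` are hypothetical names, the decoder assumes well-formed UTF-8, and the "grapheme" rule is a deliberately crude stand-in (it only folds U+0300..U+036F combining marks into the preceding code point) rather than the real segmentation rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Decode UTF-8 into code points. Sketch only: assumes well-formed input
// and simply stops at a truncated trailing sequence.
std::vector<char32_t> decode_utf8(std::string_view s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        // Sequence length is determined by the lead byte.
        const std::size_t len =
            lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        if (i + len > s.size()) break;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFFu >> (len + 1));
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3Fu);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Crude approximation (an assumption for illustration, not real grapheme
// segmentation): every code point starts a new cluster unless it lies in
// the Combining Diacritical Marks block, U+0300..U+036F.
std::size_t count_graphemes_approx(std::string_view s) {
    const std::vector<char32_t> cps = decode_utf8(s);
    std::size_t n = 0;
    for (char32_t cp : cps) {
        const bool combining = cp >= 0x0300 && cp <= 0x036F;
        if (!combining) ++n;
    }
    // A cluster count can never exceed the code point count.
    assert(n <= cps.size());
    return n;
}
```

For the string "e\xCC\x81" (an "e" followed by U+0301 COMBINING ACUTE ACCENT), the same text is 3 code units, 2 code points, and 1 grapheme under this approximation. A text type that is a range of exactly one of these has to pick which of those three answers `size()`-like operations give.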

I think most users that work with text will want graphemes as the essential view of text.  That may prove to be untrue.

That's too many assumptions and conditionals for my taste.
If a default does not immediately make sense to an overwhelming majority, I don't think that default can be justified.
People will bring their own set of assumptions and I think it's probably better to teach them the choices they can make and let them make that choice for their use case.

And yeah, research will need to be involved. Is that unreasonable? See Kate Gregory's 'It's complicated' talk.

And grapheme vs. code point is, I think, very use-case dependent.
I don't like the idea of baking-in "a default use case".

Most text/locale issues inherited from the 80s boil down to bad defaults and baked-in assumptions, which I think are at the root of people's misunderstandings.
I know it framed mine for a long time.

If we make something a default, it will then again frame people's understanding of what Unicode is, and whatever we do will be, at best, incomplete.

I actually quite like the idea of text being an opaque thing that you need to feed to a view to operate on.
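
Roughly what I have in mind, as a sketch only. Every name here (`opaque_text`, `code_units`, `code_point_count`) is made up for illustration and is not taken from Boost.Text or any proposal; the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical opaque text type: no begin()/end(), no size(), no
// operator[]. It does not claim to be a sequence of anything.
class opaque_text {
public:
    explicit opaque_text(std::string utf8) : bytes_(std::move(utf8)) {}

private:
    std::string bytes_;
    // The only way in is through an explicitly named accessor.
    friend std::string_view code_units(const opaque_text& t);
};

// View the text as UTF-8 code units. The caller has to spell out the unit.
std::string_view code_units(const opaque_text& t) { return t.bytes_; }

// Free-function algorithm built on top of the view: count code points by
// counting the bytes that are not UTF-8 continuation bytes (10xxxxxx).
std::size_t code_point_count(const opaque_text& t) {
    const std::string_view units = code_units(t);
    std::size_t n = 0;
    for (unsigned char b : units)
        if ((b & 0xC0) != 0x80) ++n;
    // There are never more code points than code units.
    assert(n <= units.size());
    return n;
}
```

With `opaque_text t("e\xCC\x81");`, `code_units(t).size()` is 3 and `code_point_count(t)` is 2. The type itself never has to answer "how long is this text?", so no unit is silently chosen for the user.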

If we definitely need a default, I would rather it be code points; I think it's the least "opinionated".

At the very least, it means that we don't arm the wrong gun for the user by accident / by default. And at some level, the user will have to read through even the synopsis of what each of these functions will bring to them. Some are obvious (words, sentences, etc. text segmentation algorithms), but others might require some thinking (codepoints, graphemes in particular).


Not picking one is a lot worse.  One of the many problems with Unicode is that it is too damn complicated for experienced users, much less new ones.  We need types with the right default so that people can just pick up a new version of their compiler and get to work without taking a week first to do research.  To the extent possible, it should "just work."
I can understand that not having defaults could be a frustrating experience to start with, however. I feel like it's justifiable given that there's no 100% right answer. Maybe we can get 80%, but it will only make the other 20% more surprising. I feel like having docs / tables that describe the various segmentation algorithms, what they bring to the table, and what the user can get out of them might be more worthwhile.

Is this a viable path?

I think it is :)
I usually read the manual when I buy a power tool I'm not familiar with.

I don't think it is.  Again, the future reactions to Boost.Text will bear that out (or not!).


Unicode mailing list