sg16: Re: [SG16-Unicode] Replacement for codecvt

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Thu, 29 Aug 2019 14:29:09 -0400

Thank you for the feedback.

On Thu, Aug 29, 2019 at 10:59 AM Niall Douglas <s_sourceforge_at_[hidden]>
wrote:

> > 1) Submit pull requests to JeanHeyd for features that you need. His
> > target is C++23 of course and I at least don't want to see him get
> > distracted with supporting earlier standards. If we need that to
> > evaluate the design, so be it, but please try and assist where
> > possible. I see this work as SG16's #1 priority for C++23.
>
> If he breaks the requisite parts out into a standalone git repo, I'll be
> happy to submit issue requests to that.
>
> In my opinion, every WG21 library proposal should come with a reference
> implementation in a standalone, highly reusable, submodularisable git
> repo. I know that I don't *quite* achieve that myself, but I think it
> particularly important to avoid bundling reference implementation in
> with lots of other unrelated stuff.
>
> People won't lock in dependencies on bundles, too ripe for breaking
> change. They like self contained, single purpose, orthogonal git repos.
>

The implementation needs a bit more work. It provides a bit more than what
the paper says, but needs more work before it's ready to use. I plan to
spend the next 4 months working on it, if Grant Proposals and University
Class Proposals work out. Otherwise, it will be improved through the usual
free-time work and hopefully be production-ready near midsummer 2020.

That being said, it is in no way going to be shippable with something like
LLFIO in the next month or so. Apologies.

> > 2) Write papers proposing changes to the interface that you would need
> > (e.g., deterministic exceptions). Papers will help us more quickly
> > evaluate and consider changes. Any such papers should include
> > motivation, examples, and proposed changes (preferably wording if
> > applicable as well). But you know all that of course ;)
>
> Alas my plate is overflowing. I have four new papers in progress, such
> is my overload that none of those four will make Belfast.
>

Same.

> >> 3. Anywhere where you might throw an exception, you need to use a
> >> deterministic alternative like error_code, Outcome, etc.
> >
> > I think this needs a paper.
>
> I gotta be blunt here, if text reencoding throws exceptions, it has the
> wrong design.
>

Error handling in the text API is done with your choice of error handler
(the default uses the replacement character to encode errors into the text
itself). You never throw unless you are:
A) working with a dynamic container (e.g., std::basic_string/std::vector), or
B) asking for it by injecting the throw_error_handler directly.
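
To make the handler-as-policy idea concrete, here is a minimal sketch. It is not the library's actual interface: `sanitize`, `is_scalar_value`, `replacement_character` and `replacement_error_handler` are hypothetical names, and only `throw_error_handler` is a name taken from the message above. The function works on a caller-owned buffer, so nothing allocates and the only way to get an exception is to pass the throwing handler.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// All names below are illustrative assumptions, not the library's real API.
constexpr char32_t replacement_character = U'\uFFFD';

// True for Unicode scalar values: at most U+10FFFF and not a surrogate.
constexpr bool is_scalar_value(char32_t c) noexcept {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Default policy: encode the error into the text itself.
struct replacement_error_handler {
    constexpr char32_t operator()(char32_t) const noexcept {
        return replacement_character;
    }
};

// Opt-in policy: throws only because the caller injected it.
struct throw_error_handler {
    char32_t operator()(char32_t) const {
        throw std::invalid_argument("invalid code point");
    }
};

// Rewrites invalid code points in place and returns how many were found.
// The buffer is fixed-size and caller-owned, so no allocation can throw.
template <typename ErrorHandler = replacement_error_handler>
std::size_t sanitize(char32_t* first, std::size_t n,
                     ErrorHandler handler = ErrorHandler{}) {
    assert(first != nullptr || n == 0);
    std::size_t errors = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (!is_scalar_value(first[i])) {
            first[i] = handler(first[i]);
            ++errors;
        }
    }
    return errors;
}
```

Calling `sanitize(buf, n)` uses the replacement policy and never throws; calling `sanitize(buf, n, throw_error_handler{})` is the explicit opt-in to exceptions.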

We can make a `herbception_throw_handler` as a test drive when Herb's
implementation makes progress. The base decoding objects are
constexpr-capable as well. Unfortunately, this means that intrinsic use on
MSVC is out of the question, because they have not implemented
std::is_constant_evaluated, nor have any internal compiler tricks for it.
GCC and clang should be okay here. (In MSVC, that just means I need a
TEXT_LIB_CONSTEXPR macro, which defaults to nothingness in VC++. I will
prefer this approach heavily because WideCharToMultiByte is the easy way to
get fast performance on Windows and that's not constexpr-blessed.)
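
A rough sketch of what such a macro could look like follows. `TEXT_LIB_CONSTEXPR` is the name used above; the compiler-detection condition and the `utf8_length` helper are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>

// TEXT_LIB_CONSTEXPR is the macro named in the message; the detection
// condition is an assumption. On VC++ it expands to nothing, which leaves
// room for non-constexpr platform calls in the function bodies.
#if defined(_MSC_VER) && !defined(__clang__)
#define TEXT_LIB_CONSTEXPR
#else
#define TEXT_LIB_CONSTEXPR constexpr
#endif

// Hypothetical helper: UTF-8 code units needed to encode one code point.
TEXT_LIB_CONSTEXPR std::size_t utf8_length(char32_t c) noexcept {
    assert(c <= 0x10FFFF);
    return c < 0x80 ? 1u : c < 0x800 ? 2u : c < 0x10000 ? 3u : 4u;
}
```

On GCC and clang the helper remains usable in constant expressions; on VC++ it is an ordinary function with the same behavior at run time.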

> >> 5. cmake build system suitable for reuse from a cmake superproject (i.e.
> >> targets only modern cmake, exports the targets consumed by the
> >> superproject)
> >
> > This seems like something that you or others could help out with.
>
> I can give advice.
>

The text implementation is currently buried in my "bag of literally
everything" library (ThePhD/phd) where I put incubating ideas. When they're
ready, they get moved out into standalone repositories (e.g.,
https://github.com/ThePhD/out_ptr, https://github.com/ThePhD/itsy_bitsy).
Text is going to be moved out into its own library soon, with its own CMake
file that can be add_subdirectory'd in with little to no hassle. I haven't
quite gotten INSTALL targets perfectly right on either of those 2
libraries, so I'll need to learn how to make the CMake loop easy.

> >> 6. Any public headers should not include anything "heavy" like Ranges,
> >> Variant etc as some of my user base like to include LLFIO headers in
> >> global headers. String is acceptable, as LLFIO includes <filesystem>.
> >
> > This library is fundamentally ranges based. I don't see any way to
> > avoid dependencies on ranges in the interface; that is something we
> > explicitly desire.
>
> Unfortunately Ranges-based libraries would be a showstopper for me.
>

It is for most people. I use ranges mostly to avoid redefining concept
checks and for ranges::begin/ranges::end and other ADL-based shenanigans.
Personally, lowering the requirement on the API is going to be better than
concept checking and more; I'm going to be manually implementing the ADL
extension points I need and probably including a `text/span` polyfill to
erase the fact that most implementations don't have a span. (Or letting the
user provide their own, but every minute spent trying to do polyfill
garbage is less time spent writing good Unicode abstractions.) The goal
will likely be C++17 to start, then down to C++14 later on if I can do so
without compromising the performance and algorithmic capabilities for most
encodings.

> The problem is their current hefty impact on build times. Most of my
> clients, and consumers of my libraries, therefore ban Ranges in their
> codebases.
>
> Even long term, I can see outright bans on Ranges in header files
> continuing until compilers bake Ranges into the language. So expect a
> substantial subset of C++ users refusing to use *any* Ranges based
> library, including standard libraries, for at least a decade or more to
> come.
>
> Once again, being blunt here, I can't see why text reencoding needs
> anything more than span<T> and basic_string_view<T>. Ranges
> auto-understands both of those. Again, if the design really needs
> Ranges, it has to be the wrong design.
>

     Iterators (and their child, Ranges) have been the backbone of
libraries communicating with one another since Stepanov got it right.
Hardcoding span/string_view as the communication medium is the
wrong choice, because it fundamentally excludes a wide range
of use cases, including Boost.Text's unencoded_rope, libstdc++'s and SGI's
__gnu_cxx::rope<T> storage, gap_buffer implementations, ring_buffer
implementations on the network, and more.

    My implementation of this library will most certainly specialize when
it detects you have handed it contiguous iterators/ranges and do "the fast
thing" when it can, but by no means should pointers be the layer of
abstraction that text encoding is built on. At the least,
RandomAccessIterators are useful, but most of the Unicode algorithms' "base
implementation" can be done with Forward or Bidirectional Iterators. We
need to have an implementation that works for most things, and lean on our
library implementers to do QoI. Not that it will be left entirely to
Standard Library maintainers: a critical part of the paper is allowing
users to specialize encode, decode and transcode calls with their own
(potentially ADL, though I really loathe ADL) customization points to be
far more tailored to their encodings and use cases.
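
The "generic base, specialized fast path" shape can be sketched with plain tag dispatch. Everything here is hypothetical: `is_contiguous_iterator` is a hardcoded stand-in that recognizes only raw pointers, and `count_code_points` is an illustrative algorithm, not something from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <type_traits>

// Hypothetical trait: a hardcoded stand-in for a contiguous-iterator check.
// Only raw pointers are recognized in this sketch.
template <typename It>
struct is_contiguous_iterator : std::is_pointer<It> {};

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
constexpr bool is_continuation(unsigned char b) noexcept {
    return (b & 0xC0) == 0x80;
}

namespace detail {
// Base implementation: needs nothing stronger than forward iteration.
template <typename It>
std::size_t count_code_points(It first, It last, std::false_type) {
    std::size_t n = 0;
    for (; first != last; ++first) {
        if (!is_continuation(static_cast<unsigned char>(*first))) ++n;
    }
    return n;
}

// Contiguous path: pointer plus length, the slot where an implementation
// could substitute a vectorized or platform-specific routine.
template <typename It>
std::size_t count_code_points(It first, It last, std::true_type) {
    assert(first <= last);
    const std::size_t len = static_cast<std::size_t>(last - first);
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (!is_continuation(static_cast<unsigned char>(first[i]))) ++n;
    }
    return n;
}
}  // namespace detail

// Entry point: picks the path at compile time from the iterator type.
template <typename It>
std::size_t count_code_points(It first, It last) {
    return detail::count_code_points(first, last,
                                     is_contiguous_iterator<It>{});
}
```

A rope, gap buffer, or ring buffer iterator falls through to the base implementation, while a `const char*` pair takes the contiguous path; both give the same answer.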

     This does not necessitate std::ranges or range-v3: most of what is
necessary are simple adl_(c)begin, adl_(c)end, and maybe adl_size free
functions. Checks like "is this a contiguous iterator" can just be
backported hardcodings of what is specified as concepts and requirements in
the C++20 Working Draft. It's work-around-able, but every workaround is
another day of mucking around playing "how many features did this compiler
not implement under /std:c++14 | -std=c++14".
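
For illustration, the ADL free functions mentioned above can be written in a few lines without std::ranges or range-v3. This is a sketch under assumptions: the `text_detail` namespace, the `range_length` helper, and the `user::ring_buffer_view` type are hypothetical, and only the `adl_begin`/`adl_end` names come from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>

namespace text_detail {  // hypothetical namespace
// Bring the std overloads into scope so unqualified calls consider both
// std::begin/std::end and any overloads found through ADL.
using std::begin;
using std::end;

template <typename R>
constexpr auto adl_begin(R&& r) -> decltype(begin(std::forward<R>(r))) {
    return begin(std::forward<R>(r));
}

template <typename R>
constexpr auto adl_end(R&& r) -> decltype(end(std::forward<R>(r))) {
    return end(std::forward<R>(r));
}

// Hypothetical consumer built only on the two extension points.
template <typename R>
std::size_t range_length(R&& r) {
    const auto d = std::distance(adl_begin(r), adl_end(r));
    assert(d >= 0);
    return static_cast<std::size_t>(d);
}
}  // namespace text_detail

namespace user {
// Hypothetical user type with no member begin()/end(); it opts in via
// free functions in its own namespace, which ADL finds.
struct ring_buffer_view {
    const char* data;
    std::size_t size;
};
inline const char* begin(const ring_buffer_view& v) { return v.data; }
inline const char* end(const ring_buffer_view& v) { return v.data + v.size; }
}  // namespace user
```

Built-in arrays and standard containers resolve through `std::begin`/`std::end`, while `user::ring_buffer_view` resolves through its own free functions.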

Sincerely,
JeanHeyd

Received on 2019-08-29 20:29:22