sg16: Re: [SG16] iconv-style interface for transcoding functions

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Thu, 28 Jan 2021 14:09:56 -0500

Dear Jens, Corentin, and Peter,

     Yes, apologies. I only spoke of performance in the call (that's
where my brain has been for the past 3 weeks, sorry!) but it has other
utility. Allowing NULL or a pointer to NULL as arguments gives us
these capabilities with the XnstoYns(charX** input, size_t*
input_size, charY** output, size_t* output_size) interface, where X
and Y are one of the prefixes for an encoding (mc, mwc, c8, c16, c32):

     - validate input:
       mcerr_t err = XnstoYns(&input, &input_size, NULL, NULL);

     - count how many code units of input will be consumed (same as above):
       mcerr_t err = XnstoYns(&input, &input_size, NULL, NULL);

     - count how many code units of the target encoding would be there:
       mcerr_t err = XnstoYns(&input, &input_size, NULL, &output_size);

     - perform transcoding from input to output:
       mcerr_t err = XnstoYns(&input, &input_size, &output, &output_size);

     - perform transcoding from input to output, *assume enough input
to fill output* (DANGER):
       mcerr_t err = XnstoYns(&input, NULL, &output, &output_size);

     - perform transcoding from input to output, *assume enough output
to handle all input* (DANGER):
       mcerr_t err = XnstoYns(&input, &input_size, &output, NULL);

That is the first design. It has a single function call but provides
all of these services. The last two variations are dangerous, but save
checking for exhaustion in the loop. In my implementation, we check
for these being null and then dispatch to a templated C++
implementation internally that hoists the question of "check for
exhaustion of input/output" to template booleans. In all modes,
note that input is always required.
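
As a rough sketch of that dispatch (not the actual implementation), here is
a stand-in with the same shape. Everything named demo_* is hypothetical, the
error codes are made up, and Latin-1 to UTF-8 is used only because it keeps
the loop short. Rejecting the case where both sizes are NULL is also an
assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative names only; not the paper's identifiers or error codes.
enum demo_err { DEMO_OK = 0, DEMO_INSUFFICIENT_OUTPUT = 1 };

// Latin-1 -> UTF-8 loop. Whether the sizes are consulted is fixed at compile
// time, so the unchecked variants carry no size bookkeeping.
template <bool CheckInput, bool CheckOutput>
static demo_err demo_core(const unsigned char** input, std::size_t* input_size,
                          unsigned char** output, std::size_t* output_size) {
    const bool write = (output != nullptr && *output != nullptr);
    for (;;) {
        if (CheckInput && *input_size == 0) return DEMO_OK;    // input used up
        if (!CheckInput && *output_size == 0) return DEMO_OK;  // output filled
        const unsigned char c = **input;
        const std::size_t needed = (c < 0x80) ? 1 : 2;
        if (CheckOutput) {
            if (*output_size < needed) return DEMO_INSUFFICIENT_OUTPUT;
            *output_size -= needed;  // counted even when nothing is written
        }
        if (write) {
            if (c < 0x80) {
                *(*output)++ = c;
            } else {
                *(*output)++ = static_cast<unsigned char>(0xC0 | (c >> 6));
                *(*output)++ = static_cast<unsigned char>(0x80 | (c & 0x3F));
            }
        }
        ++*input;
        if (CheckInput) --*input_size;
    }
}

// Public entry point: inspect the NULLs once, then pick an instantiation.
demo_err demo_latin1nstoc8ns(const unsigned char** input, std::size_t* input_size,
                             unsigned char** output, std::size_t* output_size) {
    assert(input != nullptr && *input != nullptr);            // input is always required
    assert(input_size != nullptr || output_size != nullptr);  // something must bound the loop
    if (input_size == nullptr)
        return demo_core<false, true>(input, input_size, output, output_size);
    if (output_size == nullptr)
        return demo_core<true, false>(input, input_size, output, output_size);
    return demo_core<true, true>(input, input_size, output, output_size);
}
```

With this shape, counting is a call with a NULL output and a size budget of
SIZE_MAX; the required number of code units is SIZE_MAX minus whatever is
left in the budget afterwards, and no pointer arithmetic on NULL is involved.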

A second design would likely take this form, similar to how to_chars
is specified: XnstoYns(charX** input, charX* input_end, charY**
output, charY* output_end). C++ has iterator-style interfaces
everywhere; C has them in a handful of str* functions, so both have
existing practice. This interface only has to update 2 pointers, where
the iconv design updates 2 pointers and 2 sizes. We can achieve the
following with this interface:

     - validate input:
       mcerr_t err = XnstoYns(&input, input + input_size, NULL, NULL);

     - count how many code units of input will be consumed:
       mcerr_t err = XnstoYns(&input, input + input_size, NULL, NULL);
       // do math on before/after of input

     - count how many code units of the target encoding would be there:
       mcerr_t err = XnstoYns(&input, input + input_size, &dummy_pointer, NULL);
       // do math on before/after of &dummy_pointer, dummy_pointer
       // points to NULL to tell function not to write

     - perform transcoding from input to output:
       mcerr_t err = XnstoYns(&input, input + input_size, &output,
output + output_size);

     - perform transcoding from input to output, *assume enough input
to fill output* (DANGER):
       mcerr_t err = XnstoYns(&input, NULL, &output, output + output_size);

     - perform transcoding from input to output, *assume enough output
to handle all input* (DANGER):
       mcerr_t err = XnstoYns(&input, input + input_size, &output, NULL);

This interface seems to meet all the requirements, except for the
"count how many code units of the target encoding would be there".
Particularly, this code:

       charY* dummy_pointer = nullptr; // (1)
       charY* before_dummy_pointer = dummy_pointer;
       mcerr_t err = XnstoYns(&input, input + input_size, &dummy_pointer, NULL); // (2)
       size_t amount_of_required_output =
static_cast<size_t>(dummy_pointer - before_dummy_pointer);

     (1) sets us up for undefined behavior in C++, since (2) would
increment a null pointer by a non-zero amount. That is undefined per
[expr.add]/4 (http://eel.is/c++draft/expr.add#4). Recently, Clang
started optimizing based on this rule:
https://reviews.llvm.org/D67122#change-rRt5ip3Mob6r

     Given [expr.add]/4, we only have one choice here with this API
design. We can create a temporary automatic-duration C array of some
fixed size and loop, calling XnstoYns over and over
again. This would eventually get us a proper count, at the cost of
speed. For example, most system libc's are compiled as shared objects
/ dynamically linked libraries: that's (N / array_size) dynamic
library calls which cannot be optimized away due to the nature of
dynamic calls.
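
To make the cost concrete, here is a minimal sketch of that scratch-buffer
loop. The names demo_transcode and demo_count_output, the error codes, and
the 16-unit scratch size are all hypothetical; Latin-1 to UTF-8 stands in
for a real encoding pair, so the only error the loop has to handle is
"output full".

```cpp
#include <cassert>
#include <cstddef>

// Illustrative names only; not the paper's identifiers or error codes.
enum demo_err { DEMO_OK = 0, DEMO_INSUFFICIENT_OUTPUT = 1 };

// Hypothetical [begin, end) transcoder (Latin-1 -> UTF-8). It advances
// *input and *output and never writes at or past output_end.
demo_err demo_transcode(const unsigned char** input, const unsigned char* input_end,
                        unsigned char** output, unsigned char* output_end) {
    assert(input != nullptr && *input != nullptr);
    assert(output != nullptr && *output != nullptr);
    while (*input != input_end) {
        const unsigned char c = **input;
        const std::ptrdiff_t needed = (c < 0x80) ? 1 : 2;
        if (output_end - *output < needed) return DEMO_INSUFFICIENT_OUTPUT;
        if (c < 0x80) {
            *(*output)++ = c;
        } else {
            *(*output)++ = static_cast<unsigned char>(0xC0 | (c >> 6));
            *(*output)++ = static_cast<unsigned char>(0x80 | (c & 0x3F));
        }
        ++*input;
    }
    return DEMO_OK;
}

// Counts output code units by transcoding into a throwaway buffer, one
// chunk per call. No null pointer is ever incremented, but the number of
// calls grows with the input length divided by the scratch size.
std::size_t demo_count_output(const unsigned char* input, std::size_t input_size) {
    assert(input != nullptr);
    const unsigned char* const input_end = input + input_size;
    unsigned char scratch[16];
    std::size_t total = 0;
    for (;;) {
        unsigned char* out = scratch;
        const demo_err err =
            demo_transcode(&input, input_end, &out, scratch + sizeof scratch);
        total += static_cast<std::size_t>(out - scratch);
        if (err == DEMO_OK) return total;
        // DEMO_INSUFFICIENT_OUTPUT: discard the scratch contents and go again.
    }
}
```

The scratch buffer must be at least as large as the longest single encoded
character (2 code units here), otherwise the loop would make no progress.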

All in all, I would like API Design #1 because it enables us to
(safely) get a size, without needing to create temporary buffers or
worry about abstract machine particulars surrounding the use of null
pointers for the purpose of calling. This is also because it is a C
API - in C++, we can just use counted_iterator +
counted_iterator_sentinel and a templated, range-based function call
to do the work. (This is, in fact, exactly what you can do with
P1629's interfaces, but we do provide dedicated counting interfaces as
well so you don't have to.)
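
For the C++ side, a minimal sketch of the idea follows. This is not P1629's
interface and does not use std::counted_iterator; counting_sink and the
demo_* function templates are hypothetical, hand-rolled stand-ins, with
Latin-1 to UTF-8 again as the placeholder conversion.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical output iterator: discards every write and counts it instead.
struct counting_sink {
    std::size_t* count;
    counting_sink& operator*() { return *this; }
    counting_sink& operator++() { return *this; }
    counting_sink operator++(int) { return *this; }
    counting_sink& operator=(unsigned char) {
        assert(count != nullptr);
        ++*count;
        return *this;
    }
};

// Templated transcoder (Latin-1 -> UTF-8 stand-in) over an iterator/sentinel
// pair and any output iterator that accepts unsigned char.
template <typename InputIt, typename Sentinel, typename OutputIt>
OutputIt demo_latin1_to_utf8(InputIt first, Sentinel last, OutputIt out) {
    for (; first != last; ++first) {
        const unsigned char c = static_cast<unsigned char>(*first);
        if (c < 0x80) {
            *out++ = c;
        } else {
            *out++ = static_cast<unsigned char>(0xC0 | (c >> 6));
            *out++ = static_cast<unsigned char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Counting is the same algorithm pointed at a sink that stores nothing.
template <typename InputIt, typename Sentinel>
std::size_t demo_count_utf8(InputIt first, Sentinel last) {
    std::size_t count = 0;
    demo_latin1_to_utf8(first, last, counting_sink{&count});
    return count;
}
```

Because the output side is just a type parameter, the counting case needs
neither a temporary buffer nor a null pointer.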

     A third design would just be adding "validate_XnstoYns" and
"count_XnstoYns" functions, alongside the converting "XnstoYns"
function. This means we don't have to pass NULL anymore to say
"please do some counting". There's less precedent for this in most C
APIs, since a lot of them go with this "swiss army knife" approach to
specification of a single function. But that's likely an artefact of
the days when only the first 6 characters of an external identifier
were guaranteed to be significant (C89). (C99 officially raised that
to 31 and increased some other limits with it, so I think we have
some more room these days to have more function names.)
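
A minimal sketch of what the two extra entry points could look like, here
for UTF-16 to UTF-8. The signatures, the demo_ prefix, and the error codes
are assumptions made for illustration; the converting function is left out
because it would follow the same loop.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative names only; not the paper's identifiers or error codes.
enum demo_err { DEMO_OK = 0, DEMO_INVALID_INPUT = 1 };

// Decodes one code point from UTF-16. Returns the number of code units
// consumed, or 0 if the sequence is ill-formed (unpaired surrogate).
static std::size_t demo_decode16(const char16_t* in, std::size_t n, char32_t* cp) {
    const char16_t u = in[0];
    if (u < 0xD800 || u > 0xDFFF) { *cp = u; return 1; }
    if (u <= 0xDBFF && n >= 2 && in[1] >= 0xDC00 && in[1] <= 0xDFFF) {
        *cp = 0x10000 + ((static_cast<char32_t>(u) - 0xD800) << 10)
                      + (static_cast<char32_t>(in[1]) - 0xDC00);
        return 2;
    }
    return 0;
}

// Number of UTF-8 code units needed for one code point.
static std::size_t demo_utf8_length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Validation only: no output arguments at all.
demo_err demo_validate_c16nstoc8ns(const char16_t* input, std::size_t input_size) {
    assert(input != nullptr);
    while (input_size != 0) {
        char32_t cp;
        const std::size_t used = demo_decode16(input, input_size, &cp);
        if (used == 0) return DEMO_INVALID_INPUT;
        input += used;
        input_size -= used;
    }
    return DEMO_OK;
}

// Counting only: the result is returned through a plain size_t.
demo_err demo_count_c16nstoc8ns(const char16_t* input, std::size_t input_size,
                                std::size_t* count) {
    assert(input != nullptr && count != nullptr);
    *count = 0;
    while (input_size != 0) {
        char32_t cp;
        const std::size_t used = demo_decode16(input, input_size, &cp);
        if (used == 0) return DEMO_INVALID_INPUT;
        *count += demo_utf8_length(cp);
        input += used;
        input_size -= used;
    }
    return DEMO_OK;
}
```

The trade-off is visible in the signatures: each function says what it does
without any NULL conventions, at the price of more names per encoding pair.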

     I will add this analysis to the paper. Whether or not the calls
marked (DANGER) should be allowed is a separate discussion, one that
I don't think I'll be equipped to have properly until I publish
benchmarks.

Thanks,
JeanHeyd

On Thu, Jan 28, 2021 at 1:27 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 28/01/2021 12.27, Peter Brett wrote:
> > Hi Jens,
> >
> > Thank you for sending this suggestion through. It'll be good input for a discussion at the start of our next meeting.
> >
> > One of the features in the current paper that would be lost is described in (3.5):
> >
> > (3.5) - If output_size is not NULL, then *output_size will be
> > decremented the amount of code units that would have
> > been written to *output (even if output was NULL). If
> > the output is exhausted (*output_size will be
> > decremented below zero), the function returns
> > MCHAR_INSUFFICIENT_OUTPUT.
>
> Wait, *output_size is of type size_t, so cannot ever become negative.
>
> > This allows the restartable transcoding functions to be used to measure the amount of space required to store the results of a transcoding operation without pre-allocating a buffer.
>
> So, if I have a large input buffer, but a small-ish output buffer
> (maybe I need to fill one 4k page of memory at a time), I'll always
> pay for processing the entire input buffer, just to do the count?
> I'm strongly opposed to having such an interface.
> (In the scope of this interface design, I'd turn to weakly opposed
> when having this special behavior only if output_ptr == nullptr,
> i.e. no output is produced.)
>
> (Also, iconv doesn't do that.)
>
> > This is valuable. How would it be provided in a hypothetical [begin, end) pointer interface?
>
> Having the result in a negative size_t certainly doesn't work, either.
>
> Hypothetically, you invoke this special behavior with end==nullptr
> and just increment "begin" as far as you need to convey the count.
> Except that this is undefined behavior, because "begin" might now
> point beyond its array.
>
> An alternative would be to have an alternative set of functions
> that just does the counting (no output), but the zoo we have
> for transcoding is already large.
>
> Regardless of the outcome of this discussion, please add the arguments (on both sides)
> to the paper, together with the conclusion.
>
> Jens
>
>
> > Best regards,
> >
> > Peter
> >
> >
> >> -----Original Message-----
> >> From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via SG16
> >> Sent: 27 January 2021 21:32
> >> To: SG16 <sg16_at_[hidden]>
> >> Cc: Jens Maurer <Jens.Maurer_at_[hidden]>
> >> Subject: [SG16] iconv-style interface for transcoding functions
> >>
> >> In today's teleconference, JeanHeyd suggested that the
> >> "pointer + length" interface would allow to pass nullptr
> >> for the output length, allowing the user to assert
> >> "there is enough space", which, in turn, allows to forego
> >> range checking during transcoding.
> >>
> >> At least the version of iconv on my current Ubuntu
> >> system does not offer precedent for these semantics;
> >> passing nullptr for the output length just crashes.
> >> Test case:
> >>
> >> #include <iconv.h>
> >> #include <stdlib.h>
> >>
> >> int main()
> >> {
> >> iconv_t cd = iconv_open("utf-8", "utf-8");
> >> char in[] = "abcd";
> >> char *pin = in;
> >> size_t nin = 4;
> >> char *pout = (char*)malloc(100);
> >> size_t nout = 100;
> >> size_t n = iconv(cd, &pin, &nin, &pout, nullptr);
> >> }
> >>
> >> This behavior is consistent with the description in
> >> the man page, where no mention of special handling
> >> of nullptr length arguments appears.
> >> Plus the POSIX specification agrees:
> >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html
> >>
> >> I'd also like to point out that interfaces that
> >> assume "there will be enough space" are prone to
> >> misuse, admitting buffer overflows. I'd also
> >> like to point out that the perceived run-time
> >> overhead of the extra length check is partially
> >> mitigated by
> >>
> >> - the necessity to check the length pointer for nullptr
> >> and branch to a special implementation
> >>
> >> - in a [begin, end) iterator range implementation,
> >> the ability to determine the available space and omit
> >> some or all length checks if ample space is provided.
> >>
> >> In short, I believe the core interface should treat in
> >> [begin, end) iterator ranges, where "begin" is updated
> >> by the function. If that doesn't materialize (for whatever
> >> reason), I expressly do not want thin decorators to
> >> be standardized, ballooning the number of functions even
> >> more.
> >>
> >> Jens
>

Received on 2021-01-28 13:10:14