sg16: Re: [SG16] iconv-style interface for transcoding functions

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Thu, 28 Jan 2021 14:22:33 -0500

Dear Jens, Peter and Corentin,

     Small addendum: using my brain a little better, I realize that we
could also change the return type of the function to size_t for the
XsntoYsn(charX** input, charX* input_end, charY** output, charY*
output_end) function.

     A return value of 0 or more is a "number of code units output"
count. All errors would be negative integral values cast to size_t.
This is not without precedent: C does this for many of its error
codes, including the transcoding functions today. In practice, it
would only rule out the very LARGEST buffers, by reserving the values
(size_t)-1, (size_t)-2, and (size_t)-3 for the error values. We could
also not have a "success" return code at all, and let users check
input == input_end to know that all input had been successfully
converted, given any return value that is not one of the
(size_t)-cast negative values.
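
A minimal sketch of this return-value scheme, assuming a toy ASCII to UTF-16 converter in place of a real XsntoYsn. The names ascii_to_utf16, XY_INVALID_INPUT, XY_INSUFFICIENT_OUTPUT and is_error are hypothetical and come from no paper; only the idea of reserving (size_t)-cast negative values is taken from the text above.

```cpp
#include <cstddef>

// Hypothetical reserved error values, in the style of mbrtowc's (size_t)-1 and (size_t)-2.
constexpr std::size_t XY_INVALID_INPUT = static_cast<std::size_t>(-1);
constexpr std::size_t XY_INSUFFICIENT_OUTPUT = static_cast<std::size_t>(-2);

// Toy stand-in for XsntoYsn: 7-bit ASCII to UTF-16 code units.
// Returns the number of code units written, or one of the reserved values.
// *input and *output are advanced past everything that was converted.
std::size_t ascii_to_utf16(const char** input, const char* input_end,
                           char16_t** output, char16_t* output_end) {
    std::size_t written = 0;
    while (*input != input_end) {
        const unsigned char c = static_cast<unsigned char>(**input);
        if (c > 0x7F) return XY_INVALID_INPUT;
        if (*output == output_end) return XY_INSUFFICIENT_OUTPUT;
        **output = static_cast<char16_t>(c);
        ++*output;
        ++*input;
        ++written;
    }
    return written;
}

// True when the return value is one of the reserved error values.
bool is_error(std::size_t r) { return r >= XY_INSUFFICIENT_OUTPUT; }
```

When an error is returned the count is lost, so in this sketch the caller recovers the progress made from the updated pointers.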

     Of course, even as I describe possible schemes for this, it
doesn't seem quite so clean, and still has the problem that, for
example, input_end could never be NULL. But, it's disputed whether
that would be necessary, without benchmarks showing it may help.

Sincerely,
JeanHeyd

On Thu, Jan 28, 2021 at 2:09 PM JeanHeyd Meneide
<phdofthehouse_at_[hidden]> wrote:
>
> Dear Jens, Corentin, and Peter,
>
> Yes, apologies. I only spoke of performance in the call (that's
> where my brain has been for the past 3 weeks, sorry!) but it has other
> utility. The use of NULL or POINTER TO NULL as a design results in
> these traits when using the XnstoYns(charX** input, size_t*
> input_size, charY** output, size_t* output_size) interface, where X
> and Y are one of the prefixes for an encoding (mc, mwc, c8, c16, c32):
>
> - validate input:
> mcerr_t err = XnstoYns(&input, &input_size, NULL, NULL);
>
> - count how many code units of input will be consumed (same as above):
> mcerr_t err = XnstoYns(&input, &input_size, NULL, NULL);
>
> - count how many code units of the target encoding would be there:
> mcerr_t err = XnstoYns(&input, &input_size, NULL, &output_size);
>
> - perform transcoding from input to output:
> mcerr_t err = XnstoYns(&input, &input_size, &output, &output_size);
>
> - perform transcoding from input to output, *assume enough input
> to fill output* (DANGER):
> mcerr_t err = XnstoYns(&input, NULL, &output, &output_size);
>
> - perform transcoding from input to output, *assume enough output
> to handle all input* (DANGER):
> mcerr_t err = XnstoYns(&input, &input_size, &output, NULL);
>
> That is the first design. It has a single function call but provides
> these services. The variations of the last one are dangerous, but save
> checking for exhaustion in the loop. In my implementation, we check
> for these being null and then dispatch to a templated C++
> implementation internally that hoists the question of "check for
> exhaustion of input/output" to template booleans. In all modes,
> note that input is always required.
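
A rough sketch of the first design with the NULL checks hoisted into template booleans, assuming C++17 and a toy Latin-1 to UTF-8 converter. The names latin1_to_utf8, convert_impl and MCHAR_OK, and the enumerator values, are hypothetical; this is not the implementation mentioned above. The sketch also leaves out the NULL input_size mode.

```cpp
#include <cstddef>

// mcerr_t and MCHAR_INSUFFICIENT_OUTPUT are named in the thread; the values
// and MCHAR_OK are assumptions made for this sketch.
enum mcerr_t { MCHAR_OK = 0, MCHAR_INSUFFICIENT_OUTPUT = 1 };

// Write: store code units through *output. CheckOutput: track *output_size.
template <bool Write, bool CheckOutput>
mcerr_t convert_impl(const unsigned char** input, std::size_t* input_size,
                     unsigned char** output, std::size_t* output_size) {
    while (*input_size != 0) {
        const unsigned char c = **input;
        const std::size_t needed = (c < 0x80) ? 1 : 2;
        if constexpr (CheckOutput) {
            // Compare before subtracting: a size_t can never go below zero.
            if (*output_size < needed) return MCHAR_INSUFFICIENT_OUTPUT;
            *output_size -= needed;
        }
        if constexpr (Write) {
            if (needed == 1) {
                *(*output)++ = c;
            } else {
                *(*output)++ = static_cast<unsigned char>(0xC0 | (c >> 6));
                *(*output)++ = static_cast<unsigned char>(0x80 | (c & 0x3F));
            }
        }
        ++*input;
        --*input_size;
    }
    return MCHAR_OK;
}

// Single entry point: the NULL checks happen once, outside the loop.
mcerr_t latin1_to_utf8(const unsigned char** input, std::size_t* input_size,
                       unsigned char** output, std::size_t* output_size) {
    const bool write = output != nullptr;
    const bool check = output_size != nullptr;
    if (write && check) return convert_impl<true, true>(input, input_size, output, output_size);
    if (write) return convert_impl<true, false>(input, input_size, output, output_size);
    if (check) return convert_impl<false, true>(input, input_size, output, output_size);
    return convert_impl<false, false>(input, input_size, output, output_size);
}
```

To count, a caller would pass NULL for output and an output_size initialised to a large value such as (size_t)-1; the amount by which it was decremented is the number of code units required.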
>
> A second design would likely take this form, similar to how to_chars
> is specified: XsntoYsn(charX** input, charX* input_end, charY**
> output, charY* end). C++ has iterator-style interfaces everywhere: C
> has them in a handful of str* functions, so both have existing
> practice. This interface saves us updating 2 pointers (the iconv
> design). We can achieve the following with this interface:
>
> - validate input:
> mcerr_t err = XnstoYns(&input, input + input_size, NULL, NULL);
>
> - count how many code units of input will be consumed:
> mcerr_t err = XnstoYns(&input, input + input_size, NULL, NULL);
> // do math on before/after of input
>
> - count how many code units of the target encoding would be there:
> mcerr_t err = XnstoYns(&input, input + input_size, &dummy_pointer, NULL);
> // do math on before/after of &dummy_pointer, dummy_pointer
> // points to NULL to tell function not to write
>
> - perform transcoding from input to output:
> mcerr_t err = XnstoYns(&input, input + input_size, &output,
> output + output_size);
>
> - perform transcoding from input to output, *assume enough input
> to fill output* (DANGER):
> mcerr_t err = XnstoYns(&input, NULL, &output, output + output_size);
>
> - perform transcoding from input to output, *assume enough output
> to handle all input* (DANGER):
> mcerr_t err = XnstoYns(&input, input + input_size, &output, NULL);
>
> This interface seems to meet all the requirements, except for the
> "count how many code units of the target encoding would be there".
> Particularly, this code:
>
> charY* dummy_pointer = nullptr; // (1)
> charY* before_dummy_pointer = dummy_pointer;
> mcerr_t err = XnstoYns(&input, input + input_size, &dummy_pointer,
> NULL); // (2)
> size_t amount_of_required_output =
> static_cast<size_t>(dummy_pointer - before_dummy_pointer);
>
> (1) sets us up to do something illegal in C++, since (2) would
> increment a NULL pointer. This is illegal, thanks to [expr.add]/4
> (http://eel.is/c++draft/expr.add#4). Recently, Clang started
> optimizing against this assumption:
> https://reviews.llvm.org/D67122#change-rRt5ip3Mob6r
>
> Given [expr.add]/4, we only have one choice here with this API
> design. We can create a temporary automatic-duration C array of some
> fixed size, and just loop, calling XnstoYns over and over
> again. This would eventually get us a proper count, at the cost of
> speed. For example, most system libc's are compiled as shared objects
> / dynamically linked libraries: that's (N / array_size) dynamic
> library calls which cannot be optimized away due to the nature of
> dynamic calls.
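
The scratch-buffer loop described above might look roughly like this, assuming a toy range-based Latin-1 to UTF-8 converter in place of XsntoYsn. The names latin1_to_utf8, count_utf8 and MCHAR_OK, the enumerator values, and the 64-unit scratch size are all hypothetical.

```cpp
#include <cstddef>

// mcerr_t and MCHAR_INSUFFICIENT_OUTPUT are named in the thread; the values
// and MCHAR_OK are assumptions made for this sketch.
enum mcerr_t { MCHAR_OK = 0, MCHAR_INSUFFICIENT_OUTPUT = 1 };

// Toy stand-in for the range-based converter: Latin-1 to UTF-8.
mcerr_t latin1_to_utf8(const unsigned char** input, const unsigned char* input_end,
                       unsigned char** output, unsigned char* output_end) {
    while (*input != input_end) {
        const unsigned char c = **input;
        const std::ptrdiff_t needed = (c < 0x80) ? 1 : 2;
        if (output_end - *output < needed) return MCHAR_INSUFFICIENT_OUTPUT;
        if (needed == 1) {
            *(*output)++ = c;
        } else {
            *(*output)++ = static_cast<unsigned char>(0xC0 | (c >> 6));
            *(*output)++ = static_cast<unsigned char>(0x80 | (c & 0x3F));
        }
        ++*input;
    }
    return MCHAR_OK;
}

// Counts the required output by converting into a scratch array and
// discarding the result, one chunk at a time.
std::size_t count_utf8(const unsigned char* input, const unsigned char* input_end) {
    unsigned char scratch[64];
    std::size_t total = 0;
    for (;;) {
        unsigned char* out = scratch;
        const mcerr_t err =
            latin1_to_utf8(&input, input_end, &out, scratch + sizeof scratch);
        total += static_cast<std::size_t>(out - scratch);
        // Any result other than "scratch is full" ends the loop.
        if (err != MCHAR_INSUFFICIENT_OUTPUT) return total;
    }
}
```

Each pass through the loop is one call into the converter, which is where the (N / array_size) call count comes from.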
>
> All in all, I would like API Design #1 because it enables us to
> (safely) get a size, without needing to create temporary buffers or
> worry about abstract machine particulars surrounding the use of null
> pointers for the purpose of calling. This is also because it is a C
> API - in C++, we can just use counted_iterator +
> counted_iterator_sentinel and a templated, range-based function call
> to do the work. (This is, in fact, exactly what you can do with
> P1629's interfaces, but we do provide dedicated counting interfaces as
> well so you don't have to.)
>
> A third design would just be adding "validate_XsntoYsn" and
> "count_XsntoYsn" functions, alongside the converting "XsntoYsn"
> function. This means we don't have to be passing NULL anymore to say
> "please do some counting". There's less precedent for this in most C
> APIs, since a lot of them go with this "swiss army knife" approach to
> specification of a single function. But that's likely an artefact of
> the days where "7 + 1 null terminator is the number of significant
> characters from the beginning of a function name that can be used for
> differentiation in the C abstract machine". (It's been officially
> raised to 32 and some other limits increased with it, so I think we
> have some more room these days to have more function calls.)
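
A minimal sketch of what dedicated validating and counting functions could look like, assuming UTF-16 input and UTF-8 output. The names validate_utf16 and count_utf16_as_utf8 and their signatures are hypothetical stand-ins for "validate_XsntoYsn" and "count_XsntoYsn"; the thread does not specify them.

```cpp
#include <cstddef>

// Hypothetical "validate" function: true when [input, input_end) is
// well-formed UTF-16, meaning every surrogate is part of a proper pair.
bool validate_utf16(const char16_t* input, const char16_t* input_end) {
    while (input != input_end) {
        const char16_t u = *input++;
        if (u >= 0xD800 && u <= 0xDBFF) {          // high surrogate
            if (input == input_end) return false;  // truncated pair
            const char16_t low = *input++;
            if (low < 0xDC00 || low > 0xDFFF) return false;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {   // stray low surrogate
            return false;
        }
    }
    return true;
}

// Hypothetical "count" function: number of UTF-8 code units needed for
// [input, input_end). Assumes the input has already been validated.
std::size_t count_utf16_as_utf8(const char16_t* input, const char16_t* input_end) {
    std::size_t n = 0;
    while (input != input_end) {
        const char16_t u = *input++;
        if (u < 0x80) {
            n += 1;
        } else if (u < 0x800) {
            n += 2;
        } else if (u >= 0xD800 && u <= 0xDBFF) {
            n += 4;                                // surrogate pair
            if (input != input_end) ++input;       // skip the low surrogate
        } else {
            n += 3;
        }
    }
    return n;
}
```

Neither function takes an output parameter, so no NULL needs to be passed to request counting.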
>
> I will add this analysis to the paper. Whether or not the calls
> marked (DANGER) should be allowed is also a separate discussion, that
> I think I won't be equipped to properly have until I publish
> benchmarks.
>
> Thanks,
> JeanHeyd
>
> On Thu, Jan 28, 2021 at 1:27 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
> > On 28/01/2021 12.27, Peter Brett wrote:
> > > Hi Jens,
> > >
> > > Thank you for sending this suggestion through. It'll be good input for a discussion at the start of our next meeting.
> > >
> > > One of the features in the current paper that would be lost is described in (3.5):
> > >
> > > (3.5) - If output_size is not NULL, then *output_size will be
> > > decremented the amount of code units that would have
> > > been written to *output (even if output was NULL). If
> > > the output is exhausted (*output_size will be
> > > decremented below zero), the function returns
> > > MCHAR_INSUFFICIENT_OUTPUT.
> >
> > Wait, *output_size is of type size_t, so cannot ever become negative.
> >
> > > This allows the restartable transcoding functions to be used to measure the amount of space required to store the results of a transcoding operation without pre-allocating a buffer.
> >
> > So, if I have a large input buffer, but a small-ish output buffer
> > (maybe I need to fill one 4k page of memory at a time), I'll always
> > pay for processing the entire input buffer, just to do the count?
> > I'm strongly opposed to having such an interface.
> > (In the scope of this interface design, I'd turn to weakly opposed
> > when having this special behavior only if output_ptr == nullptr,
> > i.e. no output is produced.)
> >
> > (Also, iconv doesn't do that.)
> >
> > > This is valuable. How would it be provided in a hypothetical [begin, end) pointer interface?
> >
> > Having the result in a negative size_t certainly doesn't work, either.
> >
> > Hypothetically, you invoke this special behavior with end==nullptr
> > and just increment "begin" as far as you need to convey the count.
> > Except that this is undefined behavior, because "begin" might now
> > point beyond its array.
> >
> > An alternative would be to have a separate set of functions
> > that just does the counting (no output), but the zoo we have
> > for transcoding is already large.
> >
> > Regardless of the outcome of this discussion, please add the arguments (on both sides)
> > to the paper, together with the conclusion.
> >
> > Jens
> >
> >
> > > Best regards,
> > >
> > > Peter
> > >
> > >
> > >> -----Original Message-----
> > >> From: SG16 <sg16-bounces_at_[hidden]> On Behalf Of Jens Maurer via SG16
> > >> Sent: 27 January 2021 21:32
> > >> To: SG16 <sg16_at_[hidden]>
> > >> Cc: Jens Maurer <Jens.Maurer_at_[hidden]>
> > >> Subject: [SG16] iconv-style interface for transcoding functions
> > >>
> > >> In today's teleconference, JeanHeyd suggested that the
> > >> "pointer + length" interface would allow passing nullptr
> > >> for the output length, letting the user assert
> > >> "there is enough space", which, in turn, allows the function
> > >> to forgo range checking during transcoding.
> > >>
> > >> At least the version of iconv on my current Ubuntu
> > >> system does not offer precedent for these semantics;
> > >> passing nullptr for the output length just crashes.
> > >> Test case:
> > >>
> > >> #include <iconv.h>
> > >> #include <stdlib.h>
> > >>
> > >> int main()
> > >> {
> > >>   iconv_t cd = iconv_open("utf-8", "utf-8");
> > >>   char in[] = "abcd";
> > >>   char *pin = in;
> > >>   size_t nin = 4;
> > >>   char *pout = (char*)malloc(100);
> > >>   size_t nout = 100;  // not passed: nullptr is used below instead
> > >>   size_t n = iconv(cd, &pin, &nin, &pout, nullptr);  // crashes here
> > >> }
> > >>
> > >> This behavior is consistent with the description in
> > >> the man page, where no mention of special handling
> > >> of nullptr length arguments appears.
> > >> Plus the POSIX specification agrees:
> > >> https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html
> > >>
> > >> I'd also like to point out that interfaces that
> > >> assume "there will be enough space" are prone to
> > >> misuse, admitting buffer overflows. I'd also
> > >> like to point out that the perceived run-time
> > >> overhead of the extra length check is partially
> > >> mitigated by
> > >>
> > >> - the necessity to check the length pointer for nullptr
> > >> and branch to a special implementation
> > >>
> > >> - in a [begin, end) iterator range implementation,
> > >> the ability to determine the available space and omit
> > >> some or all length checks if ample space is provided.
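
The second point can be sketched as follows, assuming C++17 and a toy Latin-1 to UTF-8 converter whose worst case is two output code units per input code unit. The names latin1_to_utf8, convert_loop and MCHAR_OK, and the enumerator values, are hypothetical.

```cpp
#include <cstddef>

// mcerr_t and MCHAR_INSUFFICIENT_OUTPUT are named in the thread; the values
// and MCHAR_OK are assumptions made for this sketch.
enum mcerr_t { MCHAR_OK = 0, MCHAR_INSUFFICIENT_OUTPUT = 1 };

// Checked = true keeps the per-character space check; false drops it.
template <bool Checked>
mcerr_t convert_loop(const unsigned char** input, const unsigned char* input_end,
                     unsigned char** output, unsigned char* output_end) {
    while (*input != input_end) {
        const unsigned char c = **input;
        const std::ptrdiff_t needed = (c < 0x80) ? 1 : 2;
        if constexpr (Checked) {
            if (output_end - *output < needed) return MCHAR_INSUFFICIENT_OUTPUT;
        }
        if (needed == 1) {
            *(*output)++ = c;
        } else {
            *(*output)++ = static_cast<unsigned char>(0xC0 | (c >> 6));
            *(*output)++ = static_cast<unsigned char>(0x80 | (c & 0x3F));
        }
        ++*input;
    }
    return MCHAR_OK;
}

// [begin, end) interface: both ranges are known up front, so the function
// itself can decide whether the checks are needed.
mcerr_t latin1_to_utf8(const unsigned char** input, const unsigned char* input_end,
                       unsigned char** output, unsigned char* output_end) {
    const std::size_t in_left = static_cast<std::size_t>(input_end - *input);
    const std::size_t out_left = static_cast<std::size_t>(output_end - *output);
    // Ample space: even the worst case (2 units per input unit) fits.
    if (out_left / 2 >= in_left)
        return convert_loop<false>(input, input_end, output, output_end);
    return convert_loop<true>(input, input_end, output, output_end);
}
```

The caller never asserts that there is enough space; the unchecked loop runs only when the function has established it from the two ranges.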
> > >>
> > >> In short, I believe the core interface should deal in
> > >> [begin, end) iterator ranges, where "begin" is updated
> > >> by the function. If that doesn't materialize (for whatever
> > >> reason), I expressly do not want thin decorators to
> > >> be standardized, ballooning the number of functions even
> > >> more.
> > >>
> > >> Jens
> > >> --
> > >> SG16 mailing list
> > >> SG16_at_[hidden]
> > >> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
> >

Received on 2021-01-28 13:22:46