C++ Logo


Advanced search

Re: D2626R0 charN_t incremental adoption: Casting pointers of UTF character types

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 24 Aug 2022 15:26:41 -0400
On 8/23/22 6:54 PM, Corentin via SG16 wrote:
> On Tue, Aug 23, 2022 at 11:18 PM Tom Honermann <tom_at_[hidden]> wrote:
> Thanks again for the paper, Corentin.
> The following comments are based on the P2626R0
> <https://wg21.link/p2626r0> paper revision submitted for the
> August mailing. I'm sorry for the delay in responding.I have a
> number of reservations about the paper as currently proposed
> though I am strongly in favor of what it is trying to accomplish.
> Thanks for the feedback.
> I certainly won't have time to address everything by tomorrow but I'll try
Thank you, again, sorry for the delayed feedback (and that this response
is coming just before the meeting!). I wanted to send my thoughts last
week, but failed to make it happen.
> I think I failed to convey what this proposal is.
> If you have seen one of Titus's talks with the video of the giant
> chainsaw dangling from a helicopter, this is what this is.
> Maybe for good measure the chainsaw is on fire.
> It's better than the existing alternatives presented in the paper
> (which I got wrong btw, another one of them is also UB, surprise
> surprise), but it is still dangerous and should be used with caution.
Anything we do in this space is going to have sharp edges; there is no
doubt about that. I'm hoping we can dull them slightly.
> Rust can get away with a less dangerous interface because the source
> pointer is borrowed,
> so there is only one thing looking at the memory at a given time, as
> far as I understand.
> In C++ we have this memory region, pointed to by an arbitrary number
> of pointers and we are changing its type, with compilers which assume
> that can't happen.
> The interface is reflective of what i understand the constraints of
> the abstract machine to be, and what we can actually specify and
> implement (and I am not confident about understanding any of it)
That sounds right outside of constant evaluation context. Within
constant evaluation contexts, memory regions don't exist in the same
way; the object model is what governs behavior (this is partially why
placement new is not allowed in constant evaluation).
> My primary concern is whether these interfaces suffice to solve
> the issues as experienced in real world code. The before/after
> presentation in the Tony table section does a nice job of
> illustrating why the existing cast operations do not suffice, but
> does not illustrate the inherent danger in using the proposed
> interfaces. Consider the following (some suspension of disbelief
> is required here since std::print won't accept char8_t-based text,
> but ignore that for now; as the paper says, "charN_t types are
> poorly supported in the standard library. We should notably
> support them in format").
> void rename(const char *from, const char *to);
> const char8_t from[] = ...;
> const char8_t to[] = ...;
> rename(cast_utf_to<char>(from, sizeof(from)),
> cast_utf_to<char>(to, sizeof(to)));
> std::print("Renamed {} to {}\n", from, to); // UB; the
> lifetime of the char8_t objects from and to held were ended.
> To fix this, it is necessary to rebuild the prior objects (it is
> not valid to reference an object of type char via an lvalue of
> type char8_t; that is the aliasing problem). But attempts to do so
> run into the problem that from and to are not actually associated
> with the replacement objects other than through a common region of
> storage; references to the replaced objects are needed to re-bless
> the storage as holding objects of the previous type. However, even
> in that case, I'm not sure that the association between the
> variables and the objects can really be reestablished; I think
> problems similar to those for which std::launder was introduced
> are present here. Access to the new objects must go through a
> pointer/reference provided by the operation that created the
> replacement objects.
> You have redeclarations here so not sure which is which.
I don't see any redeclarations.
> Either way, you are only doing things to the memory, not the pointers
> themselves.
Correct, but that leaves pointers to one type pointing to objects of
another unrelated type. That is ok outside constant evaluation in some
cases, but not within.
> You change the type of the objects, which can then be used
> through pointers of the new type, change back and the old pointers can
> be used again.
> (Although it is not clear how that plays with papers like P2414, the
> original pointers would need to be unzaped when the lifetime is restored).
> In any case, you need to recast in the other direction, if that's not
> clear enough in the paper, I'll try to clarify.
Thanks, discussion of this in the paper would be helpful.
> I think what is most needed, at least for the example above, is an
> operation that temporarily (e.g., via RAII and an object with full
> expression lifetime) mutates the type of an object temporarily and
> then restores it.
> I did think about it.
> Trying to hide the dangers behind a safe looking interface may do more
> harm than good.
> A simple approach with an raii object and an accessor for example
> could very easily lead to a situation where the lifetime of that raii
> object, and therefore the restoration, occurs before the new pointer
> is used.
> We could also have a function that takes a callable, do the cast, make
> the call with the casted pointer, and restore it before returning, but
> * That doesn't work if you are getting a pointer from somewhere -
> for example main
> * It still hides a bunch of dragons like concurrent accesses.
> * The callee could store the new pointer and in that case you don't
> want to restore the old one or ever touch it again either.
> There is a transfer of ownership that happens, and it can be permanent
> or temporary depending on the scenario.
I don't think a permanent transfer of ownership is possible unless the
storage is going to be leaked. Destruction must destroy the correct
object type, at least in constant evaluation.
> The paper lists the ICU character casts as an inspiration for the
> paper. However, the proposed interfaces don't match the semantics
> of the ICU casts. The ICU casts don't invalidate the source
> objects; they coerce the compiler into forgetting what it knows
> about the contents of memory. As such, I'm skeptical that the
> proposed interfaces could be used to replace the ICU casts. If
> they can be, a diff of the changes that ICU would require to adopt
> the new interfaces would be very helpful.
> I do not believe the ICU implementation is aligned with what it is
> possible to specify, and it may possibly be defeated by a sufficiently
> clever compiler.
I agree that may very well be the case.
> But yes, the new functions can be used by icu internally or by the
> user, either way a recast to the original type is necessary if any old
> pointers are to be reused.
This is what I would like to see proved out a bit. How would ICU have to
change to use the new functionality? How would it impact users of ICU?
> > As specified, the operations do not end the lifetime of, nor change
> the type of, the array objects that hold the sequence of elements that
> are being converted. This creates the strange result that, for
> example, an array object of type char16_t could have elements of type
> wchar_t. I'm not sure if that is ok from a core language perspective
> or not, but it seems problematic to me.
> This needs to be seen by core, sg1, etc.
I'm curious why SG1 would be involved. It is up to the SG1 chair whether
they review the paper of course, but I'm wondering if I'm missing
something that suggests SG1 should be involved.
> I would be amazed if that wording was not terrible.
> The proposed operations are not really casts; they behave more
> like a destructive in-place move. Describing them as casts is, I
> think, misleading.
> I'm open to better (i.e. more scary sounding) names.
> But I tried a lot of different names and these are the most
> descriptive things I could find given I want to keep the unicode
> semantic in the name too.
> With a few exceptions, each of the existing cast operators
> supports symmetric conversion between types. If a cast operation
> supports converion from type A to type B, then it also (usually)
> supports conversion from type B to type A. I'm not aware of a
> precedent in the language for a cast in one direction to have a
> different name than for a cast in the other direction. And I don't
> think such a distinction is needed in this case. The fundamental
> operation that is needed is the ability to cast/convert/swap an
> object of one type to a type that is, or shares, an underlying
> type (same kind (integer, float, etc...), size, alignment). Once
> that operation is available, encoding aware interfaces that limit
> conversion direction can be built on top if desired.
> I don't think we want to generalize that dangerous interface to even
> more types.
> The paper explains why there is this choice of separate interfaces -
> both for the names and different preconditions.
> The paper includes a link to godbolt.org
> <https://godbolt.org/z/d6n8b6qKd> containing the following example
> use. The call to in.size() is UB since it follows a call to
> std::move(in). I think this illustrates how difficult these
> interfaces may be to use correctly.
> I don't believe it to be ub (move doesn't move and invalidating the
> data pointer doesn't do anything to the size... but also the
> string_view's invariants don't hold any more so, who knows) but it
> should read as_char.size() regardless.
Ah, fair point. I forgot that cast_utf_to doesn't destory the range
object; just the referenced contents.
> std::string utf8_to_iso8859_7(std::u8string_view in) {
> iconv_t conv = iconv_open("ISO-8859-7", "UTF-8");
> auto as_char = cast_utf_to<char>(std::move(in));
> std::string out(10, 0);
> std::size_t out_size = out.size();
> std::size_t in_size = in.size();
> char* inptr = const_cast<char*>(as_char.data()); // C
> interfaces are fantastic
> char* outptr = out.data();
> int ret = iconv(conv, &inptr, &in_size, &outptr, &out_size);
> return out;
> }
> The basic_string_view and span overloads seem deeply problematic
> to me. Consider:
> std::string s = ...;
> auto u8sv = cast_as_utf_unchecked(s); // the implicitly
> constructed temporary string_view binds to the rvalue reference.
> // The buffer managed by s is of type char, but now holds
> objects of type char8_t.
> // Any use of s, including destruction (at least in constexpr
> context) is UB.
> You get the same problem with
> auto u8sv = cast_as_utf_unchecked(s.data(), s.data()+s.size());
Correct, but that doesn't involve implicit conversions.
> The wording for the basic_string_view and span overloads of
> cast_as_utf_unchecked references a From type that does not appear
> in the declaration.
> I will look into that.
> The TL;DR is I think SG16 should decide whether they like the general
> direction, and send me to SG1 so they can tell me whether this is
> implementable and whether
> better is possible.

I'm going to want more confidence in the core language strategy before
forwarding this paper outside of SG16; EWG and SG1 don't require SG16
approval before looking at it.


> But either way, here be dragons. by design - or rather by virtue of
> the fact the abstract machine isn't to be toyed with.
> It's a higher level start_lifetime_as-like thing, but not that high
> level either.
> It doesn't mean it's not necessary, for example it's an important
> piece of getting format(u8"{}", 42) to work.(fmt would call from chars
> and then cast the buffer - or have a u8 buffer, cast it to char and
> back, depending how format is implemented).
> The only way to get a safe behavior is to make a copy by
> successive application of bit_cast which is prohibitive expensive in
> some cases
> Tom.
> On 7/30/22 2:16 PM, Corentin via SG16 wrote:
>> Early draft, feedback welcome.
>> Thanks,
>> Corentin
>> https://isocpp.org/files/papers/D2626R0.pdf

Received on 2022-08-24 19:26:46