Re: D2626R0 charN_t incremental adoption: Casting pointers of UTF character types

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 23 Aug 2022 17:18:39 -0400

Thanks again for the paper, Corentin.

The following comments are based on the P2626R0
<https://wg21.link/p2626r0> paper revision submitted for the August
mailing. I'm sorry for the delay in responding.

I have a number of reservations about the paper as currently proposed,
though I am strongly in favor of what it is trying to accomplish.

My primary concern is whether these interfaces suffice to solve the
issues as experienced in real-world code. The before/after presentation
in the Tony table section does a nice job of illustrating why the
existing cast operations do not suffice, but does not illustrate the
inherent danger in using the proposed interfaces. Consider the following
(some suspension of disbelief is required here since std::print won't
accept char8_t-based text, but ignore that for now; as the paper says,
"charN_t types are poorly supported in the standard library. We should
notably support them in format").

    void rename(const char *from, const char *to);
    const char8_t from[] = ...;
    const char8_t to[] = ...;
    rename(cast_utf_to<char>(from, sizeof(from)),
           cast_utf_to<char>(to, sizeof(to)));
    // UB: the casts ended the lifetimes of the char8_t objects that
    // from and to held.
    std::print("Renamed {} to {}\n", from, to);

To fix this, it is necessary to rebuild the prior objects (it is not
valid to reference an object of type char via an lvalue of type char8_t;
that is the aliasing problem). But attempts to do so run into the
problem that from and to are not actually associated with the
replacement objects other than through a common region of storage;
references to the replaced objects are needed to re-bless the storage as
holding objects of the previous type. However, even in that case, I'm
not sure that the association between the variables and the objects can
really be reestablished; I think problems similar to those for which
std::launder was introduced are present here. Access to the new objects
must go through a pointer/reference provided by the operation that
created the replacement objects.

I think what is most needed, at least for the example above, is an
operation that temporarily changes the type of an object (e.g., via RAII
and an object with full-expression lifetime) and then restores it.

The paper lists the ICU character casts as an inspiration for the paper.
However, the proposed interfaces don't match the semantics of the ICU
casts. The ICU casts don't invalidate the source objects; they coerce
the compiler into forgetting what it knows about the contents of memory.
As such, I'm skeptical that the proposed interfaces could be used to
replace the ICU casts. If they can be, a diff of the changes that ICU
would require to adopt the new interfaces would be very helpful.

As specified, the operations do not end the lifetime of, nor change the
type of, the array objects that hold the sequence of elements that are
being converted. This creates the strange result that, for example, an
array object of type char16_t could have elements of type wchar_t. I'm
not sure if that is ok from a core language perspective or not, but it
seems problematic to me.

The proposed operations are not really casts; they behave more like a
destructive in-place move. Describing them as casts is, I think, misleading.

With a few exceptions, each of the existing cast operators supports
symmetric conversion between types. If a cast operation supports
conversion from type A to type B, then it also (usually) supports
conversion from type B to type A. I'm not aware of a precedent in the
language for a cast in one direction to have a different name than for a
cast in the other direction. And I don't think such a distinction is
needed in this case. The fundamental operation that is needed is the
ability to cast/convert/swap an object of one type to a type that is, or
shares, an underlying type (same kind (integer, float, etc.), size, and
alignment). Once that operation is available, encoding-aware interfaces
that limit conversion direction can be built on top if desired.

The paper includes a link to godbolt.org
<https://godbolt.org/z/d6n8b6qKd> containing the following example use.
The call to in.size() is UB since it follows a call to std::move(in). I
think this illustrates how difficult these interfaces may be to use
correctly.

    std::string utf8_to_iso8859_7(std::u8string_view in) {
        iconv_t conv = iconv_open("ISO-8859-7", "UTF-8");
        auto as_char = cast_utf_to<char>(std::move(in));
        std::string out(10, 0);
        std::size_t out_size = out.size();
        std::size_t in_size = in.size();
        // C interfaces are fantastic
        char* inptr = const_cast<char*>(as_char.data());
        char* outptr = out.data();
        int ret = iconv(conv, &inptr, &in_size, &outptr, &out_size);
        return out;
    }

The basic_string_view and span overloads seem deeply problematic to me.
Consider:

    std::string s = ...;
    // The implicitly constructed temporary string_view binds to the
    // rvalue reference.
    auto u8sv = cast_as_utf_unchecked(s);
    // The buffer managed by s is of type char, but now holds objects of
    // type char8_t.
    // Any use of s, including destruction (at least in constexpr
    // context), is UB.

The wording for the basic_string_view and span overloads of
cast_as_utf_unchecked references a From type that does not appear in the
declaration.

Tom.

On 7/30/22 2:16 PM, Corentin via SG16 wrote:
> Early draft, feedback welcome.
>
> Thanks,
> Corentin
>
> https://isocpp.org/files/papers/D2626R0.pdf
>

Received on 2022-08-23 21:18:41