Re: QoI for escaped formatting of non-Unicode-encoding strings: deployment overhead versus ideal behaviour

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Thu, 13 Jul 2023 09:48:51 -0400
On Thu, Jul 13, 2023 at 3:59 AM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

> Hey Hubert, I sadly missed the last meeting, which I'm guessing provides
> some necessary context here.
>

This was more of a concern I raised on the side that was not much discussed
in the meeting. I did fail to relay important context though: The
ordinary/wide string literal encodings are properties of the compilation
environment and there is no guarantee that the runtime environment has the
facilities necessary to programmatically decode those encodings (e.g.,
maybe literals in the program are only passed to fputs).
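
To illustrate the point that the literal encodings are fixed by the compilation environment, here is a minimal sketch that records their names at compile time. It assumes the compiler-specific macros `__clang_literal_encoding__` / `__clang_wide_literal_encoding__` (Clang) and `__GNUC_EXECUTION_CHARSET_NAME` / `__GNUC_WIDE_EXECUTION_CHARSET_NAME` (GCC 11 and later); these are extensions, not standard C++, and the variable and function names are hypothetical.

```cpp
#include <string_view>

// Names of the ordinary and wide literal encodings, captured at compile time.
// Assumption: the macros below are compiler extensions; on other compilers
// the sketch falls back to "unknown".
#if defined(__clang_literal_encoding__)
inline constexpr std::string_view ordinary_literal_encoding = __clang_literal_encoding__;
inline constexpr std::string_view wide_literal_encoding = __clang_wide_literal_encoding__;
#elif defined(__GNUC_EXECUTION_CHARSET_NAME)
inline constexpr std::string_view ordinary_literal_encoding = __GNUC_EXECUTION_CHARSET_NAME;
inline constexpr std::string_view wide_literal_encoding = __GNUC_WIDE_EXECUTION_CHARSET_NAME;
#else
inline constexpr std::string_view ordinary_literal_encoding = "unknown";
inline constexpr std::string_view wide_literal_encoding = "unknown";
#endif

// Hypothetical helper: true if this translation unit was compiled with the
// named ordinary literal encoding (exact, case-sensitive comparison).
constexpr bool ordinary_literal_encoding_is(std::string_view name) {
    return ordinary_literal_encoding == name;
}
```

Knowing the name is only half of the problem: nothing here guarantees that the runtime environment can decode an encoding with that name.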

The leakage of literal encoding information into the type system (at some
level or another) was also briefly mentioned. Related unspoken context is
that separately-compiled translation units (e.g., third-party library
packages) will potentially have different literal encodings. Strategies to
contain the ODR-violation fallout should be employed.


> I'll try to answer some of these things in a vacuum anyway; I like to live
> dangerously!
>
> Generally, I'd be rather concerned about the need to reason about literal
> encoding at runtime.
> Without a way to track in the type system whether a string was produced at
> runtime or at compile time (including those who are produced by constant
> evaluation),
>

Treating runtime strings the same way as compile-time strings was (iirc)
part of the motivation for specifying interpretation using the literal
encoding in the first place. We don't need to differentiate between strings
produced at runtime versus compile-time for encoding purposes with the
current specification.


> (which I think we actively do not want since, in addition to the added
> complexity, we would, I think, prefer not to have constant evaluation
> behave differently from runtime),
> it seems generally impossible to do.
>

[format.string.escaped] explicitly refers to the associated character
encoding for `charT`, with a reference to Table 12 (which documents literal
encodings). So reasoning about the literal encoding at runtime is necessary
with the current specification.


>
> In which case do you think the distinction matters?
> I realize that it is a situation that can occur in practice, and which we
> can diagnose... but not necessarily do more about it in the general case.
>
> > 2. The set of characters considered separators or non-printable
> > characters
>
> Properties of characters are agnostic of their encoding, and I'd be rather
> opposed to having that tied to locale or encoding. SPACE is always going to
> be a whitespace whether in UTF-8
> or EUC-KR, so we really only need properties for Unicode codepoints. Which
> is good as other character sets are usually defined in terms of glyphs that
> do not have properties attached to them.
>

With locales, iswprint exists (although it doesn't help when wchar_t is
UTF-16) and is a reasonable way to decide whether to use a numeric escape.
We could defer to Unicode, but then we still need to map the encoded
character to a Unicode codepoint.
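
A minimal sketch of the locale-based approach, assuming only `std::iswprint` from `<cwctype>`; the function names are hypothetical. It makes only the "numeric escape or not" decision and leaves out the named escapes (such as `\n`) and quote handling that [format.string.escaped] also specifies. It classifies one `wchar_t` at a time, which is exactly why it does not help when `wchar_t` is UTF-16 and a character spans two code units.

```cpp
#include <cwctype>
#include <string>
#include <string_view>

// Hypothetical helper: format a value as \x{...} using the shortest
// lowercase hexadecimal representation.
std::wstring numeric_escape(unsigned long value) {
    const wchar_t digits[] = L"0123456789abcdef";
    std::wstring hex;
    do {
        hex.insert(hex.begin(), digits[value % 16]);
        value /= 16;
    } while (value != 0);
    return L"\\x{" + hex + L"}";
}

// Hypothetical helper: keep wide characters that the current C locale
// considers printable and replace every other one with a numeric escape.
std::wstring escape_non_printable(std::wstring_view in) {
    std::wstring out;
    for (wchar_t wc : in) {
        const std::wint_t w = static_cast<std::wint_t>(wc);
        if (std::iswprint(w)) {
            out += wc;
        } else {
            out += numeric_escape(static_cast<unsigned long>(w));
        }
    }
    return out;
}
```

The result depends on the locale in effect at runtime (the "C" locale unless `setlocale` has been called), which need not agree with the wide literal encoding chosen at compile time.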


>
> > What are people's thoughts on POSIX locale, ICU, or iconv dependencies
> > from C++ standard libraries as the way to support non-Unicode encodings?
>
> If we want to support generalized conversions from and to encodings that
> are outside of the current set (narrow and wide execution, utf-N), the use
> of a library by an implementation is going to be unavoidable, but I don't
> think we can rely on a specific one, so we can't mandate a set of supported
> encodings or a specific mapping. I have no opinion on whether we should.
>

The issue in front of us is that the current set *isn't* the narrow and
wide execution encodings and UTF-8/16/32; it is the ordinary and wide
string literal encodings and UTF-8/16/32.


>
> On Thu, Jul 13, 2023 at 1:38 AM Hubert Tong via SG16 <
> sg16_at_[hidden]> wrote:
>
>> Hi SG 16:
>>
>> When escaping strings (an operation likely done at runtime), some
>> information about the literal encoding (a property of the compilation
>> environment) is needed.
>>
>> For "ideal behaviour", it seems to me that the ability to
>> hardcode/capture at compile time/deploy with the runtime is needed for the
>> following:
>> 1. Understanding of the encoding scheme (e.g., valid initial code units,
>> valid continuation code units, etc.)
>> 2. The set of characters considered separators or non-printable characters
>>
>> It seems to me that (1) is going to need some database of encodings
>> already.
>>
>> Additionally, I am not sure that the policy chosen for unassigned
>> codepoints should be the same between Unicode and non-Unicode encodings.
>>
>> Is my analysis reasonable? What are people's thoughts on POSIX locale,
>> ICU, or iconv dependencies from C++ standard libraries as the way to
>> support non-Unicode encodings? Since the specified formatting operation
>> "cannot fail", what is the story when the underlying runtime environment
>> lacks support for the literal encoding (violation of implementation-defined
>> limits due to invalid runtime environment setup)?
>>
>> The alternative (for non-Unicode encodings) seems to be "handle code
>> units that match the encoding of a member of the basic character set,
>> numeric escape everything else".
>>
>> -- HT
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>
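
The fallback described at the end of the quoted message ("handle code units that match the encoding of a member of the basic character set, numeric escape everything else") can be sketched as follows. This is an illustration under stated assumptions, not what any standard library does: the names are hypothetical, the character list is the graphic part of the basic character set as of C++20 plus space, and control characters are simply numeric-escaped rather than given named escapes. Because the list is written as an ordinary string literal, the compiler encodes it in the ordinary literal encoding, so no runtime encoding database is needed.

```cpp
#include <string>
#include <string_view>

// Space plus the graphic members of the basic character set (C++20 list),
// excluding backslash and double quote, which are handled separately.
// Assumption: the compiler encodes this literal in the ordinary literal
// encoding, which is the encoding the escaped strings are interpreted in.
inline constexpr std::string_view basic_passthrough =
    " abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,'";

// Hypothetical helper: pass through code units that encode a basic
// character and numeric-escape every other code unit.
std::string escape_basic_only(std::string_view in) {
    static constexpr char hex[] = "0123456789abcdef";
    std::string out;
    for (char c : in) {
        if (c == '\\') {
            out += "\\\\";
        } else if (c == '"') {
            out += "\\\"";
        } else if (basic_passthrough.find(c) != std::string_view::npos) {
            out += c;
        } else {
            const unsigned char u = static_cast<unsigned char>(c);
            out += "\\x{";
            if (u >= 16) out += hex[u >> 4];  // omit a leading zero digit
            out += hex[u & 0x0F];
            out += '}';
        }
    }
    return out;
}
```

One known weakness of working per code unit: in multibyte encodings where a trailing code unit can have the same value as a basic character, that trailing unit is passed through as if it were that character.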

Received on 2023-07-13 13:49:19