Re: [SG16] More notes on P2286R3 - Formatting Ranges

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 2 Dec 2021 22:02:32 +0100
On 02/12/2021 15.43, Corentin via SG16 wrote:
> Hello,
> I wanted to add some information about the paper we discussed yesterday.
>
> I do not understand the motivation for being able to copy the output of fmt's "debug" back to a string literal and expect a consistent result. I think I'd like to see that explored more in the paper.

That was my question in the call: whether such round-tripping to a
string literal was an intended use case or not. I don't think
I actually got an answer.

(Regardless, it seems a bad idea to invent a new escaping syntax
that is inconsistent with [lex.string].)

> Unicode defines graphic characters as characters of the categories L, M, N, P, S, Zs (Unicode 14, chapter 2.4).
> Note that new Unicode versions can assign code points and make them graphic, so the output isn't stable from one version to the next.

We should really make a list of the places where the behavior of C++ changes
as Unicode versions progress, and maybe document that in a note in the
standard for general awareness.

 - allowable identifier characters
 - escaping for fmt "debug"
 - stale width estimation in [format.string.std]
 - maybe something else I forgot

> Graphic excludes control and formatting code points, but not all graphic characters are visible.
> Go defines an additional property, "Printable", which is like graphic but excludes spaces other than SPACE.
>
> The paper needs to decide what it considers for escaping.

Yes, we need a clear specification here, preferably as a reference
to Unicode character attributes (e.g. graphic but not space).
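
To make the distinction concrete, here is a minimal sketch of the two candidate predicates, "graphic" and a Go-style "printable". It is an illustration only, not wording from the paper: the `GeneralCategory` enum and both function names are hypothetical, and the lookup from code point to category (which would come from tables generated from the Unicode Character Database) is assumed to exist elsewhere.

```cpp
// Hypothetical grouping of Unicode general categories, as coarse as the
// discussion needs. A real implementation would obtain the category of a
// code point from generated Unicode tables; that lookup is not shown.
enum class GeneralCategory { L, M, N, P, S, Zs, Zl, Zp, Cc, Cf, Cs, Co, Cn };

// "Graphic" as quoted above: categories L, M, N, P, S and Zs.
constexpr bool is_graphic(GeneralCategory gc) {
    switch (gc) {
    case GeneralCategory::L:
    case GeneralCategory::M:
    case GeneralCategory::N:
    case GeneralCategory::P:
    case GeneralCategory::S:
    case GeneralCategory::Zs:
        return true;
    default:
        return false;
    }
}

// Go-style "printable": graphic, except that the only space separator
// allowed through is U+0020 SPACE.
constexpr bool is_printable(char32_t cp, GeneralCategory gc) {
    if (gc == GeneralCategory::Zs) {
        return cp == U' ';
    }
    return is_graphic(gc);
}
```

Under this sketch, U+00A0 NO-BREAK SPACE is graphic but not printable, so a "printable" rule would escape it while a "graphic" rule would not.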

> Both of these properties are reasonably compact tables.
> "printable" as defined by golang is probably a good default.
>
> --
> For non-Unicode, I wonder if escaping everything but basic latin1 would be reasonable.
> Other solutions include converting to Unicode first, or pulling in std::isprint, which ties in locale and only works for stateless, single code units. Neither of these works if an encoding is assumed incorrectly.
> But that's the case if the encoding is Unicode too.

Frankly, I need a lot more information to understand how this "debug"
escaping business would work in a wide-EBCDIC (with shift states)
environment. Do we require transcoding to Unicode just to find out
which characters are printable and which ones aren't?
Or do we just escape everything outside the basic character set using "\x" ?
If the former, that seems to require transcoding facilities in the library
that otherwise would not be needed.
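
For comparison, the second option (escape everything outside the basic character set) can be sketched without any transcoding. The code below is an assumption-laden illustration, not the paper's design: `is_basic_unescaped` and `debug_escape` are hypothetical names, the delimited `\x{...}` spelling is chosen here only so that a following hex digit cannot be absorbed into the escape, and the loop treats every code unit independently, so it says nothing useful about encodings with shift states.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper. The characters passed through unchanged are spelled
// as a literal, so the membership test does not depend on the numeric values
// of the execution encoding. This only holds for stateless encodings with
// one code unit per character.
inline bool is_basic_unescaped(char c) {
    constexpr std::string_view basic =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        " _{}[]#()<>%:;.?*+-/^&|~!=,'";
    return basic.find(c) != std::string_view::npos;
}

// Hypothetical "debug" escaping: quote the string, backslash-escape the
// quote and the backslash, and write every other code unit outside the
// list above as \x{hh}. Assumes 8-bit code units.
inline std::string debug_escape(std::string_view s) {
    constexpr char hex[] = "0123456789abcdef";
    std::string out = "\"";
    for (char c : s) {
        if (c == '"' || c == '\\') {
            out += '\\';
            out += c;
        } else if (is_basic_unescaped(c)) {
            out += c;
        } else {
            const auto u = static_cast<unsigned char>(c);
            out += "\\x{";
            out += hex[u >> 4];
            out += hex[u & 0x0F];
            out += '}';
        }
    }
    out += '"';
    return out;
}
```

With this sketch, a string containing a tab between `a` and `b` is rendered as `"a\x{09}b"`. Multi-unit characters come out as a run of `\x{..}` escapes, which is exactly the loss of readability that the transcoding option would avoid.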

> Should the debug specifier have a text/binary mode?

It seems the current design assumes that a string always contains text, otherwise
the whole escaping discussion doesn't make a lot of sense.

Jens

Received on 2021-12-02 15:03:38