Re: [SG16] [isocpp-lib-ext] Review of P2093R2: Formatted output

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Sat, 28 Nov 2020 08:33:35 -0800
To keep this thread manageable, here are my answers to Tom and Jens (thanks
for the feedback!) in one convenient wall of text.

*Answers to Tom:*

> My intent here was to note that, if C++ streams were used as the default
output, that it is unclear to me how a hypothetical std::wprint() would
work. This was intended as another point in favor of using C streams. I
agree that this is subject matter for a different paper.

Great.

> My choice of Windows-1251 for an example scenario was motivated solely by
the use of Russian characters in the example in the paper.

Sure, but it is nevertheless a great choice that clearly demonstrates that ACP
is definitely the wrong thing when files are involved. That said, EBCDIC
and other encodings are still supported via the non-Unicode path.

> I don't think the Notepad example is particularly relevant.

It is relevant for #2 because it shows that when a Russian user creates a
text file on Windows it will most definitely be encoded in UTF-8 and not
"ANSI" encoding (and definitely not the terminal encoding). This is true
for Notepad and other popular editors. Same with files obtained from the
Internet. We should understand the common encoding for text files in order
for our text facilities to be useful and consistent.

> There is no particular expectation that a .txt file was produced by a
program running on the local machine, so the local code page isn't a
particularly good default in any case.

Exactly.

> If I write a version of the Windows 'type' command as you used it above,
call it 'cat', compile it without Microsoft's /utf-8 option, then I would
like it to still do the right thing; not the behavior you illustrated above.

I misunderstood your suggestion. Are you suggesting for the non-Unicode
path (print_nonunicode) to do the transcoding to the encoding determined by
ACP and for the Unicode path (print_unicode) to produce UTF-8? Note that
using ACP won't solve the mojibake problem because the terminal encoding
(CP866) is separate from the ACP encoding, at least for Russian. Confusing
those two is a common misconception and source of problems (see e.g.
https://stackoverflow.com/questions/49259502/windows-console-codepage-866).
Using the terminal encoding would produce completely useless output for
anything but interaction with legacy command-line programs via pipes (and
even there the usefulness of the result of the pipeline is questionable).
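
To make the ACP vs. terminal mismatch concrete, here is a minimal sketch. It
assumes the Russian setup discussed above, with CP1251 as the ACP and CP866 as
the terminal encoding. The names decode_cp1251, decode_cp866 and decode are
hypothetical helpers written for this illustration, and the two decoders cover
only ASCII and the basic Russian letters, not the full code pages.

```cpp
#include <string>
#include <string_view>

// Sketch only: decodes ASCII and the basic Russian letters of CP1251
// (0xC0-0xFF map to U+0410-U+044F). Other bytes become U+FFFD.
constexpr char32_t decode_cp1251(unsigned char b) {
  if (b < 0x80) return b;
  if (b >= 0xC0) return static_cast<char32_t>(0x0410 + (b - 0xC0));
  return 0xFFFD;
}

// Sketch only: decodes ASCII and the basic Russian letters of CP866
// (0x80-0xAF map to U+0410-U+043F, 0xE0-0xEF map to U+0440-U+044F).
// Other bytes (box drawing etc.) become U+FFFD.
constexpr char32_t decode_cp866(unsigned char b) {
  if (b < 0x80) return b;
  if (b <= 0xAF) return static_cast<char32_t>(0x0410 + (b - 0x80));
  if (b >= 0xE0 && b <= 0xEF) return static_cast<char32_t>(0x0440 + (b - 0xE0));
  return 0xFFFD;
}

// Applies a single-byte decoder to every byte of the input.
template <typename Decoder>
std::u32string decode(std::string_view bytes, Decoder decode_byte) {
  std::u32string out;
  for (char c : bytes) out.push_back(decode_byte(static_cast<unsigned char>(c)));
  return out;
}
```

With these helpers, the CP1251 bytes E4 EE EC decode to U+0434 U+043E U+043C
under CP1251 but to U+0444 U+044E U+044C under CP866, i.e. a different and
meaningless word. The bytes are valid in both encodings, so nothing fails
loudly; the text is just wrong.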

> That is true only if the execution/literal encoding and the run-time
encoding do not match

Yes, and if we use ACP they will likely not match.

> I think it would be useful if the paper summarized the encoding behavior
for the surveyed print statements in section 5.

I can add it in the next revision.

> Assuming test.txt is UTF-8 encoded, that is correct; this is a
straightforward case of mojibake.

test.txt is CP1251 encoded. This example illustrates that using ACP doesn't
solve mojibake.

> Perhaps a 'formatter' specialization should be provided for
std::filesystem::path? Proposing something like that is likely subject
matter for a different paper, but I think it would be helpful for this
paper to discuss it.

I think that providing such a specialization would be useful, but it is out of
scope for the current paper since it has nothing to do with I/O integration.

> What happens if the UTF-8 input is ill-formed?

Good question. The current implementation throws an exception on a
transcoding error, but the error handling mechanism is open for discussion.
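
For illustration, a throwing transcoder could look roughly like the sketch
below. This is not the code from the reference implementation; the name
utf8_to_utf16 and the choice of std::runtime_error are assumptions made for
this example. It rejects invalid lead bytes, truncated sequences, bad
continuation bytes, overlong forms, surrogates and values above U+10FFFF.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: transcodes UTF-8 to UTF-16 and throws
// std::runtime_error on ill-formed input.
std::u16string utf8_to_utf16(std::string_view in) {
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i]);
    char32_t cp = 0;   // code point being assembled
    char32_t min = 0;  // smallest code point allowed for this length
    std::size_t len = 0;
    if (lead < 0x80) { cp = lead; len = 1; min = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; min = 0x10000; }
    else throw std::runtime_error("invalid UTF-8 lead byte");

    if (in.size() - i < len)
      throw std::runtime_error("truncated UTF-8 sequence");
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(in[i + k]);
      if ((c & 0xC0) != 0x80)
        throw std::runtime_error("invalid UTF-8 continuation byte");
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      throw std::runtime_error("overlong, surrogate, or out-of-range code point");

    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Encode as a surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    i += len;
  }
  return out;
}
```

An alternative policy would be to substitute U+FFFD for each ill-formed
sequence instead of throwing, which is the kind of choice that is still open.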

*Answers to Jens:*

> Doing std::format without necessarily creating a std::string is useful
functionality, but unrelated to the transcoding issues. Thus, this facility
should be separate.

Such a facility already exists in C++20 (format_to, format_to_n). The
current paper only integrates it with I/O without adding any new
functionality on the formatting level.

> Apparently, there is some OS-dependent magic going on to determine
whether output is to a console and, if so, which encoding the console might
prefer. I'm fine with such magic existing, but it should be a distinct
facility.

Sure, I will extract it into a separate API in the next revision of the
paper.
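
To give an idea of the shape such a separate API could take, here is a rough
sketch. The name is_terminal is an assumption for this example and not the
interface the next revision will propose. It relies on isatty on POSIX
systems and on GetConsoleMode on Windows.

```cpp
#include <stdio.h>
#ifdef _WIN32
#  include <io.h>
#  include <windows.h>
#else
#  include <unistd.h>
#endif

// Hypothetical helper: reports whether the C stream refers to a
// terminal/console rather than a file or a pipe.
bool is_terminal(FILE* f) {
#ifdef _WIN32
  // GetConsoleMode succeeds only for console handles.
  HANDLE h = reinterpret_cast<HANDLE>(_get_osfhandle(_fileno(f)));
  DWORD mode = 0;
  return h != INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode) != 0;
#else
  return isatty(fileno(f)) != 0;
#endif
}
```

A caller would check is_terminal(stdout) and use the native Unicode console
API only when it returns true, writing bytes unchanged otherwise. Redirecting
the output to a file or a pipe makes the function return false.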

> And then there is the facility of converting the C++ literal encoding to
the console encoding, if necessary. Again, this should be a separate
facility, preferably offering a generic transcoding facility that can be
specialized for the console-only use case.

While I agree that such a transcoding facility would be useful, I think it
is out of scope for the current paper. The paper requires only minimal
transcoding, for the Unicode case only, and only on platforms where
dedicated system APIs exist.

Cheers,
Victor


On Fri, Nov 27, 2020 at 11:15 AM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:

> On 23/11/2020 06.33, Tom Honermann via Lib-Ext wrote:
> > SG16 began reviewing P2093R2 <https://wg21.link/p2093r2> in our recent
> telecon <https://github.com/sg16-unicode/sg16-meetings#november-11th-2020>
> and will continue review in our next telecon scheduled for December 9th.
> >
> > The following reflects my personal thoughts on this proposal.
>
> Ditto.
>
> As I've already said in the SG16 review, I'd like to see
> smaller bits and pieces offered, instead of or at least in
> addition to hiding them behind a non-trivial "printf"-style
> wrapper.
>
> - Doing std::format without necessarily creating a std::string
> is useful functionality, but unrelated to the transcoding issues.
> Thus, this facility should be separate.
>
> - Apparently, there is some OS-dependent magic going on to
> determine whether output is to a console and, if so, which
> encoding the console might prefer. I'm fine with such magic
> existing, but it should be a distinct facility.
>
> - And then there is the facility of converting the C++ literal
> encoding to the console encoding, if necessary. Again, this
> should be a separate facility, preferably offering a generic
> transcoding facility that can be specialized for the console-only
> use case. (Only supporting that single transcoding might save
> binary size.)
>
>
> Jens
>
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2020-11-28 10:34:01