sg16: Re: [SG16] [isocpp-lib-ext] Review of P2093R2: Formatted output

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 15 Dec 2020 00:30:37 -0500

On 12/13/20 10:23 AM, Victor Zverovich wrote:
> > The scenario I intended to contrast with was an inherited file
> stream vs one obtained by a call to fopen() or similar.
>
> Could you elaborate how you propose to detect this distinction
> portably? Are you suggesting adding file descriptor checks?

I'm not proposing that the distinction be detected. My preference is
what I stated at the start of this email thread:

1. When writing directly to a terminal/console, exploit native
interfaces as necessary for text to be displayed correctly.
2. Otherwise, write output encoded to match the system/run-time
encoding; the encoding that P1885 indicates via text_encoding::system().

Then, once we have proper transcoding facilities, perhaps via a
combination of Corentin's P1885 <https://wg21.link/p1885> named
encodings and JeanHeyd's P1629 <https://wg21.link/p1629> encoding
objects, provide the ability to override the default behavior for the
2nd case, perhaps by associating an encoding with a C or C++ stream,
and/or specifying an explicit encoding on calls to std::print(). This
approach would enable the program to adapt to the user's environment (by
default, as has been required since the dawn of C and C++) while also
allowing the program to override the environment based on program
requirements or other information.
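
The two-step default above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: output_path, is_terminal(), and choose_output_path() are hypothetical names that appear in neither P2093 nor P1885, and _isatty()/isatty() stand in for whatever terminal check an implementation would really use. Note that _isatty() on Windows reports any character device, so a stricter console check would likely be needed in practice.

```cpp
#include <cassert>
#include <cstdio>
#ifdef _WIN32
#  include <io.h>      // _isatty, _fileno
#else
#  include <unistd.h>  // isatty
#endif

// Hypothetical names, used for illustration only.
enum class output_path {
    native_console,   // case 1: use native console interfaces
    system_encoding   // case 2: write bytes in the system/run-time encoding
};

// Returns true if the stream appears to be connected to a terminal/console.
inline bool is_terminal(std::FILE* stream) {
    assert(stream != nullptr);
#ifdef _WIN32
    return _isatty(_fileno(stream)) != 0;  // true for any character device
#else
    return isatty(fileno(stream)) != 0;
#endif
}

// Selects between the two default behaviors; no override mechanism is modeled.
inline output_path choose_output_path(std::FILE* stream) {
    return is_terminal(stream) ? output_path::native_console
                               : output_path::system_encoding;
}
```

With this sketch, a stream obtained from std::tmpfile() or fopen() takes the system_encoding path, while stdout attached to an interactive console takes the native_console path.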

>
> > The environment the program runs in (in real deployments outside
> the abstract machine) consists of more than just a console.
>
> Right and this is why the console case is handled specially.
Yes, and we agree on the special handling for the console case.
>
> > The environment the program runs in (in real deployments outside the
> abstract machine) consists of more than just a console.
>
> Couldn't agree more.

Good :)

Tom.

>
> Cheers,
> Victor
>
>
> On Fri, Dec 11, 2020 at 9:43 PM Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 12/9/20 9:57 AM, Victor Zverovich wrote:
>> > In the former case, the programmer has wide latitude for
>> choosing an encoding and knows that the content is being written
>> to a file. In the latter case, the programmer doesn't (in
>> general) know whether the output is redirected to a file, pipe,
>> or some other character device.
>> > For #2, it is the user that is making the choice to write the
>> output to a text file, not the programmer.
>>
>> I don't think this is correct. In general, when you have a file
>> stream object you cannot tell whether it was redirected by the
>> user or not in both cases. You can of course distinguish between
>> a file and a pipe, but not whether it was redirected by the user
>> or the programmer. Even if it was possible it would be weird to
>> have lossy transcoding into a legacy codepage in one case and not
>> the other.
>
> The scenario I intended to contrast with was an inherited file
> stream vs one obtained by a call to fopen() or similar. In the
> latter case, the programmer is in control. I agree that, for
> inherited file streams, there is more uncertainty.
>
>>
>> > that choice is historically distinct from the run-time encoding
>> used by the environment the program runs in.
>>
>> This is exactly what we are trying to fix because as we can see
>> it results in mojibake in common scenarios.
>
> The proposal addresses the specific scenario where the output is
> known to be directed to a device that can be independently
> controlled and I am in favor of that. The problem is larger than
> that though. The environment the program runs in (in real
> deployments outside the abstract machine) consists of more than
> just a console. I would like to address the wider problem with
> this facility as well. We aren't at liberty to change the
> environment a program runs in, but we can design for adaptation to
> that environment.
>
>>
>> > I believe it would be a reasonable choice for a z/OS programmer
>> to use UTF-8 as the execution/literal encoding and still run that
>> program in an EBCDIC environment.
>>
>> Is the desired behavior for z/OS to have string literals compiled
>> to UTF-8 in the binary and do runtime transcoding into EBCDIC
>> instead of having string literals compiled to EBCDIC and avoiding
>> runtime transcoding? The latter is already supported by the
>> paper. The former is a somewhat strange and inefficient approach
>> but if it is the desired behavior I'd be happy to tweak the
>> wording to make this possible (suggestions are welcome). In any
>> case I think we should avoid making transcoding lossy when
>> possible or having it controlled at runtime without very good
>> reasons.
>
> Yes. The idea is that it should be possible to write a portable
> program that uses UTF-8 as the execution/literal encoding and have
> it run to the best of its abilities in any environment with the
> acknowledged limitation that, if an environment can't support all
> Unicode characters, then some data loss is inevitable. Note that
> such data limitations exist independently of what encoding is used
> for the execution/literal encoding; there is no character for 🚀
> (U+1F680) in EBCDIC.
>
> I certainly agree with the goal of avoiding data loss. My
> perspective is that avoiding mojibake (as your proposal does for
> the console/terminal) is more important than avoiding introduction
> of substitution characters.
>
>>
>> > My suggestion was that, when writing to a stream known (e.g.,
>> via _isatty()
>> <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/isatty?view=msvc-160>)
>> to be directly connected to the Windows console, the Unicode
>> path be taken regardless of what the execution encoding is, with
>> transcoding to UTF-16 performed as necessary.
>>
>> Sounds reasonable. I suggest polling this in SG16.
> That sounds good. Per discussion at our telecon this week, we can
> do so when we look at the next revision of this paper. I think
> having a better understanding of what other languages do will be
> helpful (and thank you for the research you already started doing
> for that).
>>
>> > I think the relevant question for this paper, given that it
>> does intend to specify encoding conversions in at least some
>> cases, is how output such as filenames that may have content that
>> is not well-formed according to the execution encoding, can be
>> incorporated.
>>
>> OK, I'll look into it and add a report back in the next revision
>> of the paper.
> Excellent, thank you.
>>
>> > I don't find throwing an exception to be acceptable, but
>> attempted conversion with U+FFFD substitution as suggested by
>> Peter seems ok
>>
>> I agree. Again this is a good candidate for a poll.
>
> Sounds good. Other considerations could include some method to
> specify a callback, an alternate substitution character, or
> alternate error handling, but I suspect little motivation for any
> of that for this facility. Perhaps the further research on what is
> done for other languages will provide some additional perspective.
>
> Just to reiterate, thank you for bringing this paper forward. I
> very much want this facility!
>
> Tom.
>
>>
>> Thanks,
>> Victor
>>
>>
>> On Sat, Dec 5, 2020 at 1:16 PM Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>>
>> On 11/28/20 11:33 AM, Victor Zverovich wrote:
>>> To keep this thread manageable here are my answers to Tom
>>> and Jens (thanks for the feedback!) in one convenient wall
>>> of text.
>>>
>>> *Answers to Tom:*
>>>
>>> > My choice of Windows-1251 for an example scenario was
>>> motivated solely by the use of Russian characters in the
>>> example in the paper.
>>>
>>> Sure, but it is nevertheless a great choice that clearly
>>> demonstrates that ACP is definitely the wrong thing when
>>> files are involved. That said, EBCDIC and other encodings
>>> are still supported via the non-Unicode path.
>> I don't feel that same level of clarity. There is a
>> distinction to be made regarding writing to a file vs writing
>> to stdout. In the former case, the programmer has wide
>> latitude for choosing an encoding and knows that the content
>> is being written to a file. In the latter case, the
>> programmer doesn't (in general) know whether the output is
>> redirected to a file, pipe, or some other character device.
>>>
>>> > I don't think the Notepad example is particularly relevant.
>>>
>>> It is relevant for #2 because it shows that when a Russian
>>> user creates a text file on Windows it will most definitely
>>> be encoded in UTF-8 and not "ANSI" encoding (and definitely
>>> not the terminal encoding). This is true for Notepad and
>>> other popular editors. Same with files obtained from the
>>> Internet. We should understand the common encoding for text
>>> files in order for our text facilities to be useful and
>>> consistent.
>>
>> I think this misses the concern to some degree. For #2, it is
>> the user that is making the choice to write the output to a
>> text file, not the programmer. I believe the programmer
>> should have the ability to choose the encoding used
>> (preferably with the ability for the user to influence the
>> choice), but I'm (so far) uncomfortable with the behavior
>> being tied to the execution/literal encoding chosen at
>> compile time; that choice is historically distinct from the
>> run-time encoding used by the environment the program runs in.
>>
>> For example, I believe it would be a reasonable choice for a
>> z/OS programmer to use UTF-8 as the execution/literal
>> encoding and still run that program in an EBCDIC
>> environment. This is how Java works in such environments
>> (using UTF-16 internally of course). This is the Unicode
>> sandwich model.
>>
>>>
>>> > There is no particular expectation that a .txt file was
>>> produced by a program running on the local machine, so the
>>> local code page isn't a particularly good default in any case.
>>>
>>> Exactly.
>>>
>>> > If I write a version of the Windows 'type' command as you
>>> used it above, call it 'cat', compile it without Microsoft's
>>> /utf-8 option, then I would like it to still do the right
>>> thing; not the behavior you illustrated above.
>>>
>>> I misunderstood your suggestion. Are you suggesting for the
>>> non-Unicode path (print_nonunicode) to do the transcoding to
>>> the encoding determined by ACP and for the Unicode path
>>> (print_unicode) to produce UTF-8? Note that using ACP won't
>>> solve the mojibake problem because the terminal encoding
>>> (CP866) is separate from the ACP encoding, at least for
>>> Russian. Confusing those two is a common misconception and
>>> source of problems (see e.g.
>>> https://stackoverflow.com/questions/49259502/windows-console-codepage-866).
>>> Using the terminal encoding would produce completely useless
>>> output for anything but interaction with legacy command-line
>>> programs via pipes (and even there the usefulness of the
>>> result of the pipeline is questionable).
>>
>> No. My suggestion was that, when writing to a stream known
>> (e.g., via _isatty()
>> <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/isatty?view=msvc-160>)
>> to be directly connected to the Windows console, the
>> Unicode path be taken regardless of what the execution
>> encoding is, with transcoding to UTF-16 performed as
>> necessary. My expectation is that writes to the console
>> would be performed using the WriteConsoleW() function; that
>> is where UTF-16 comes in. So, when the execution encoding is
>> UTF-8, the implementation would convert to UTF-16 and then
>> call WriteConsoleW() and, for other execution encodings, the
>> implementation would transcode to UTF-16 and then call
>> WriteConsoleW(). This approach bypasses the console encoding
>> entirely (the console encoding is only relevant for the ANSI
>> implementation of the console APIs and for reads/writes to
>> the console via ReadFile() and WriteFile()).
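
The console path described above (transcode to UTF-16, then call WriteConsoleW()) might look roughly like the following. This is a minimal sketch, not the paper's implementation: encode_utf16() and write_console_utf16() are hypothetical helper names, the input is assumed to have already been decoded from the execution encoding into code points, and the Windows-only part is guarded so the encoder can be built elsewhere.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#ifdef _WIN32
#  include <windows.h>  // WriteConsoleW, HANDLE, DWORD
#endif

// Encodes code points as UTF-16. Values that are not Unicode scalar values
// (surrogates, or anything above U+10FFFF) are replaced with U+FFFD.
inline std::u16string encode_utf16(std::u32string_view code_points) {
    std::u16string out;
    out.reserve(code_points.size());
    for (char32_t cp : code_points) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;
            assert(v <= 0xFFFFF);  // 20 bits, split across the surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}

#ifdef _WIN32
// Writes UTF-16 text to a console handle, bypassing the console code page.
// Returns false if the handle is not a console or the write fails.
inline bool write_console_utf16(HANDLE console, std::u16string_view text) {
    static_assert(sizeof(wchar_t) == sizeof(char16_t),
                  "wchar_t is expected to be a 16-bit type on Windows");
    while (!text.empty()) {
        DWORD written = 0;
        if (!WriteConsoleW(console, text.data(),
                           static_cast<DWORD>(text.size()), &written, nullptr)
            || written == 0)
            return false;
        text.remove_prefix(written);
    }
    return true;
}
#endif
```

On Windows, the handle would typically come from GetStdHandle(STD_OUTPUT_HANDLE); when the stream is redirected to a file or pipe, WriteConsoleW() fails and the caller would fall back to the non-console path.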
>>
>>>
>>> > That is true only if the execution/literal encoding and
>>> the run-time encoding do not match
>>>
>>> Yes and if we use ACP they will likely not match.
>> For Windows, that is true, but also a reality of that
>> environment. For other platforms, the likelihood of a
>> mismatch is far lower (though not 0; the LANG environment
>> variable is still used in POSIX environments and can still be
>> set to select an encoding other than UTF-8).
>>>
>>> > Assuming test.txt is UTF-8 encoded, that is correct; this
>>> is a straightforward case of mojibake.
>>>
>>> test.txt is CP1251 encoded. This example illustrates that
>>> using ACP doesn't solve mojibake.
>>
>> Perhaps we are focused on different instances of mojibake. I
>> think you are pointing out that the output of the findstr
>> command will fail to present properly because the console
>> encoding doesn't match. The mojibake I was alluding to is
>> that findstr will fail to find a match in the file because
>> the encoding of the pattern string (as entered from the
>> console on the command line) doesn't match the encoding of
>> the file (unless findstr consults the wide/UTF-16 variant of
>> its command line).
>>
>>>
>>> > Perhaps a 'formatter' specialization should be provided
>>> for std::filesystem::path? Proposing something like that is
>>> likely subject matter for a different paper, but I think it
>>> would be helpful for this paper to discuss it.
>>>
>>> I think that providing such specialization would be useful
>>> but it is out of scope of the current paper since it has
>>> nothing to do with I/O integration.
>> I think the relevant question for this paper, given that it
>> does intend to specify encoding conversions in at least some
>> cases, is how output such as filenames that may have content
>> that is not well-formed according to the execution encoding,
>> can be incorporated. I think my preference is to have some
>> method to opt-out of implicit conversions; probably via a
>> per-field format flag.
>>>
>>> > What happens if the UTF-8 input is ill-formed?
>>>
>>> Good question. The current implementation throws an
>>> exception on transcoding error but the error handling
>>> mechanism is open for discussion.
>> I don't find throwing an exception to be acceptable, but
>> attempted conversion with U+FFFD substitution as suggested by
>> Peter seems ok (perhaps with an opt-out as suggested above);
>> I prefer a loss of precision over a loss of output.
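
To make the substitution option concrete, here is a minimal sketch of decoding UTF-8 with U+FFFD substitution instead of throwing. The function name decode_utf8_lossy() is hypothetical, and the policy of emitting one U+FFFD per rejected sequence is a simplifying assumption; it does not necessarily match the "maximal subpart" practice described in the Unicode Standard or what any particular implementation does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Decodes UTF-8 into code points. Ill-formed input (invalid lead bytes,
// truncated sequences, overlong forms, surrogates, values above U+10FFFF)
// yields U+FFFD rather than an exception, so output is never lost entirely.
inline std::u32string decode_utf8_lossy(std::string_view in) {
    constexpr char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;     // code point being assembled
        char32_t min = 0;    // smallest value allowed for this length
        std::size_t len = 0; // expected sequence length in bytes
        if (lead < 0x80)                { cp = lead;        len = 1; min = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; min = 0x10000; }
        else { out.push_back(replacement); ++i; continue; }  // invalid lead byte

        // Consume continuation bytes, stopping at the first byte that is not one.
        std::size_t used = 1;
        while (used < len && i + used < in.size()) {
            const auto b = static_cast<unsigned char>(in[i + used]);
            if ((b & 0xC0) != 0x80) break;
            cp = (cp << 6) | (b & 0x3F);
            ++used;
        }
        assert(used >= 1 && used <= len);  // always makes progress

        const bool ok = used == len && cp >= min && cp <= 0x10FFFF
                        && !(cp >= 0xD800 && cp <= 0xDFFF);
        out.push_back(ok ? cp : replacement);
        i += used;  // resynchronize at the first byte not consumed
    }
    return out;
}
```

An opt-out for fields such as filenames, as suggested above, would sit outside this function; the sketch only shows the substitution behavior itself.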
>>>
>>> *Answers to Jens:*
>>>
>>> > Doing std::format without necessarily creating a
>>> std::string is useful functionality, but unrelated to the
>>> transcoding issues. Thus, this facility should be separate.
>>>
>>> Such a facility already exists in C++20 (format_to,
>>> format_to_n). The current paper only integrates it with I/O
>>> without adding any new functionality on the formatting level.
>>>
>>> > Apparently, there is some OS-dependent magic going on to
>>> determine whether output is to a console and, if so, which
>>> encoding the console might prefer. I'm fine with such magic
>>> existing, but it should be a distinct facility.
>>>
>>> Sure, I will extract it into a separate API in the next
>>> revision of the paper.
>>>
>>> > And then there is the facility of converting the C++
>>> literal encoding to the console encoding, if necessary.
>>> Again, this should be a separate facility, preferably
>>> offering a generic transcoding facility that can be
>>> specialized for the console-only use case.
>>>
>>> While I agree that such a transcoding facility would be
>>> useful I think it is out of scope of the current paper. The
>>> latter requires only minimal transcoding facilities for the
>>> Unicode case and only on some platforms where dedicated
>>> system APIs exist.
>> I agree that distinct interfaces should be provided for each
>> of these concerns, but I also think each can be pursued
>> separately and need not hold up the proposed feature. We can
>> always re-specify the proposed behavior in terms of new
>> interfaces via as-if in the future.
>>
>> Also, progress is being made on these; JeanHeyd is continuing
>> to work on general transcoding facilities. See WG14 N2595
>> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2595.pdf>
>> for his most recent work (we'll be discussing this paper in
>> SG16 early next year).
>>
>> Tom.
>>
>>>
>>> Cheers,
>>> Victor
>>>
>>>
>>> On Fri, Nov 27, 2020 at 11:15 AM Jens Maurer via SG16
>>> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>>>
>>> On 23/11/2020 06.33, Tom Honermann via Lib-Ext wrote:
>>> > SG16 began reviewing P2093R2
>>> <https://wg21.link/p2093r2> in our recent telecon
>>> <https://github.com/sg16-unicode/sg16-meetings#november-11th-2020>
>>> and will continue review in our next telecon scheduled
>>> for December 9th.
>>> >
>>> > The following reflects my personal thoughts on this
>>> proposal.
>>>
>>> Ditto.
>>>
>>> As I've already said in the SG16 review, I'd like to see
>>> smaller bits and pieces offered, instead of or at least in
>>> addition to hiding them behind a non-trivial "printf"-style
>>> wrapper.
>>>
>>> - Doing std::format without necessarily creating a std::string
>>> is useful functionality, but unrelated to the transcoding
>>> issues. Thus, this facility should be separate.
>>>
>>> - Apparently, there is some OS-dependent magic going on to
>>> determine whether output is to a console and, if so, which
>>> encoding the console might prefer. I'm fine with such magic
>>> existing, but it should be a distinct facility.
>>>
>>> - And then there is the facility of converting the C++ literal
>>> encoding to the console encoding, if necessary. Again, this
>>> should be a separate facility, preferably offering a generic
>>> transcoding facility that can be specialized for the
>>> console-only use case. (Only supporting that single transcoding
>>> might save binary size.)
>>>
>>>
>>> Jens
>>>
>>>
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>>
>

Received on 2020-12-14 23:30:44