sg16: Re: [SG16] [isocpp-lib-ext] Review of P2093R2: Formatted output

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Sun, 13 Dec 2020 07:23:45 -0800

> The scenario I intended to contrast with was an inherited file stream vs
one obtained by a call to fopen() or similar.

Could you elaborate how you propose to detect this distinction portably?
Are you suggesting adding file descriptor checks?

> The environment the program runs in (in real deployments outside the
abstract machine) consists of more than just a console.

Right, and this is why the console case is handled specially.

> The environment the program runs in (in real deployments outside the
abstract machine) consists of more than just a console.

Couldn't agree more.

Cheers,
Victor

On Fri, Dec 11, 2020 at 9:43 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 12/9/20 9:57 AM, Victor Zverovich wrote:
>
> > In the former case, the programmer has wide latitude for choosing an
> encoding and knows that the content is being written to a file. In the
> latter case, the programmer doesn't (in general) know whether the output is
> redirected to a file, pipe, or some other character device.
> > For #2, it is the user that is making the choice to write the output to
> a text file, not the programmer.
>
> I don't think this is correct. In general, when you have a file stream
> object you cannot tell whether it was redirected by the user or not in both
> cases. You can of course distinguish between a file and a pipe, but not
> whether it was redirected by the user or the programmer. Even if it was
> possible it would be weird to have lossy transcoding into a legacy codepage
> in one case and not the other.
>
> The scenario I intended to contrast with was an inherited file stream vs
> one obtained by a call to fopen() or similar. In the latter case, the
> programmer is in control. I agree that, for inherited file streams, there
> is more uncertainty.
>
>
> > that choice is historically distinct from the run-time encoding used by
> the environment the program runs in.
>
> This is exactly what we are trying to fix because as we can see it results
> in mojibake in common scenarios.
>
> The proposal addresses the specific scenario where the output is known to
> be directed to a device that can be independently controlled and I am in
> favor of that. The problem is larger than that though. The environment
> the program runs in (in real deployments outside the abstract machine)
> consists of more than just a console. I would like to address the wider
> problem with this facility as well. We aren't at liberty to change the
> environment a program runs in, but we can design for adaptation to that
> environment.
>
>
> > I believe it would be a reasonable choice for a z/OS programmer to use
> UTF-8 as the execution/literal encoding and still run that program in an
> EBCDIC environment.
>
> Is the desired behavior for z/OS to have string literals compiled to UTF-8
> in the binary and do runtime transcoding into EBCDIC instead of having
> string literals compiled to EBCDIC and avoiding runtime transcoding? The
> latter is already supported by the paper. The former is a somewhat strange
> and inefficient approach but if it is the desired behavior I'd be happy to
> tweak the wording to make this possible (suggestions are welcome). In any
> case I think we should avoid making transcoding lossy when possible or
> having it controlled at runtime without very good reasons.
>
> Yes. The idea is that it should be possible to write a portable program
> that uses UTF-8 as the execution/literal encoding and have it run to the
> best of its abilities in any environment with the acknowledged limitation
> that, if an environment can't support all Unicode characters, then some
> data loss is inevitable. Note that such data limitations exist
> independently of what encoding is used for the execution/literal encoding;
> there is no character for 🚀 (U+1F680) in EBCDIC.
>
> I certainly agree with the goal of avoiding data loss. My perspective is
> that avoiding mojibake (as your proposal does for the console/terminal) is
> more important than avoiding introduction of substitution characters.
>
>
> > My suggestion was that, when writing to a stream known (e.g., via
> _isatty()
> <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/isatty?view=msvc-160>)
> to be directly connected to the Windows console, that the Unicode path be
> taken regardless of what the execution encoding is, with transcoding to
> UTF-16 performed as necessary.
>
> Sounds reasonable. I suggest polling this in SG16.
>
> That sounds good. Per discussion at our telecon this week, we can do so
> when we look at the next revision of this paper. I think having a better
> understanding of what other languages do will be helpful (and thank you for
> the research you already started doing for that).
>
>
> > I think the relevant question for this paper, given that it does intend
> to specify encoding conversions in at least some cases, is how output such
> as filenames that may have content that is not well-formed according to the
> execution encoding, can be incorporated.
>
> OK, I'll look into it and add a report back in the next revision of the
> paper.
>
> Excellent, thank you.
>
>
> > I don't find throwing an exception to be acceptable, but attempted
> conversion with U+FFFD substitution as suggested by Peter seems ok
>
> I agree. Again this is a good candidate for a poll.
>
> Sounds good. Other considerations could include some method to specify a
> callback, an alternate substitution character, or alternate error handling,
> but I suspect little motivation for any of that for this facility. Perhaps
> the further research on what is done for other languages will provide some
> additional perspective.
>
> Just to reiterate, thank you for bringing this paper forward. I very much
> want this facility!
>
> Tom.
>
>
> Thanks,
> Victor
>
>
> On Sat, Dec 5, 2020 at 1:16 PM Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 11/28/20 11:33 AM, Victor Zverovich wrote:
>>
>> To keep this thread manageable here are my answers to Tom and Jens
>> (thanks for the feedback!) in one convenient wall of text.
>>
>> *Answers to Tom:*
>>
>> > My choice of Windows-1251 for an example scenario was motivated solely
>> by the use of Russian characters in the example in the paper.
>>
>> Sure, but it is nevertheless a great choice that clearly demonstrates that
>> ACP is definitely the wrong thing when files are involved. That said,
>> EBCDIC and other encodings are still supported via the non-Unicode path.
>>
>> I don't feel that same level of clarity. There is a distinction to be
>> made regarding writing to a file vs writing to stdout. In the former case,
>> the programmer has wide latitude for choosing an encoding and knows that
>> the content is being written to a file. In the latter case, the programmer
>> doesn't (in general) know whether the output is redirected to a file, pipe,
>> or some other character device.
>>
>>
>> > I don't think the Notepad example is particularly relevant.
>>
>> It is relevant for #2 because it shows that when a Russian user creates a
>> text file on Windows it will most definitely be encoded in UTF-8 and not
>> "ANSI" encoding (and definitely not the terminal encoding). This is true
>> for Notepad and other popular editors. Same with files obtained from the
>> Internet. We should understand the common encoding for text files in order
>> for our text facilities to be useful and consistent.
>>
>> I think this misses the concern to some degree. For #2, it is the user
>> that is making the choice to write the output to a text file, not the
>> programmer. I believe the programmer should have the ability to choose the
>> encoding used (preferably with the ability for the user to influence the
>> choice), but I'm (so far) uncomfortable with the behavior being tied to the
>> execution/literal encoding chosen at compile time; that choice is
>> historically distinct from the run-time encoding used by the environment
>> the program runs in.
>>
>> For example, I believe it would be a reasonable choice for a z/OS
>> programmer to use UTF-8 as the execution/literal encoding and still run
>> that program in an EBCDIC environment. This is how Java works in such
>> environments (using UTF-16 internally of course). This is the Unicode
>> sandwich model.
>>
>>
>> > There is no particular expectation that a .txt file was produced by a
>> program running on the local machine, so the local code page isn't a
>> particularly good default in any case.
>>
>> Exactly.
>>
>> > If I write a version of the Windows 'type' command as you used it
>> above, call it 'cat', compile it without Microsoft's /utf-8 option, then I
>> would like it to still do the right thing; not the behavior you illustrated
>> above.
>>
>> I misunderstood your suggestion. Are you suggesting for the non-Unicode
>> path (print_nonunicode) to do the transcoding to the encoding determined by
>> ACP and for the Unicode path (print_unicode) to produce UTF-8? Note that
>> using ACP won't solve the mojibake problem because the terminal encoding
>> (CP866) is separate from the ACP encoding, at least for Russian. Confusing
>> those two is a common misconception and source of problems (see e.g.
>> https://stackoverflow.com/questions/49259502/windows-console-codepage-866).
>> Using the terminal encoding would produce completely useless output for
>> anything but interaction with legacy command-line programs via pipes (and
>> even there the usefulness of the result of the pipeline is questionable).
>>
>> No. My suggestion was that, when writing to a stream known (e.g., via
>> _isatty()
>> <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/isatty?view=msvc-160>)
>> to be directly connected to the Windows console, that the Unicode path be
>> taken regardless of what the execution encoding is, with transcoding to
>> UTF-16 performed as necessary. My expectation is that writes to the
>> console would be performed using the WriteConsoleW() function; that is
>> where UTF-16 comes in. So, when the execution encoding is UTF-8, the
>> implementation would convert to UTF-16 and then call WriteConsoleW()
>> and, for other execution encodings, the implementation would transcode to
>> UTF-16 and then call WriteConsoleW(). This approach bypasses the
>> console encoding entirely (the console encoding is only relevant for the
>> ANSI implementation of the console APIs and for reads/writes to the console
>> via ReadFile() and WriteFile()).
>>
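The console path described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the wording or
implementation proposed in P2093: the helper names is_console and write_utf8
are hypothetical, the text is assumed to already be UTF-8 (other execution
encodings would need an extra transcoding step), and the extra
GetConsoleMode() check is an addition that is not mentioned in the thread. On
non-Windows platforms the sketch simply writes the bytes through.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <string_view>

#ifdef _WIN32
#  include <io.h>       // _isatty, _fileno, _get_osfhandle
#  include <windows.h>  // GetConsoleMode, MultiByteToWideChar, WriteConsoleW
#else
#  include <unistd.h>   // isatty (fileno comes from <stdio.h> on POSIX)
#endif

// Hypothetical helper: true if `stream` is attached to a console/terminal.
bool is_console(std::FILE* stream) {
    assert(stream != nullptr);
#ifdef _WIN32
    const int fd = _fileno(stream);
    if (!_isatty(fd)) return false;
    // _isatty() is also true for other character devices, so confirm that
    // the underlying handle really is a console (assumption of this sketch).
    DWORD mode = 0;
    return GetConsoleMode(reinterpret_cast<HANDLE>(_get_osfhandle(fd)), &mode) != 0;
#else
    return isatty(fileno(stream)) != 0;
#endif
}

// Hypothetical helper: writes UTF-8 text. On a Windows console it converts
// to UTF-16 and calls WriteConsoleW(), bypassing the console code page.
// Everywhere else the bytes are written unchanged.
bool write_utf8(std::FILE* stream, std::string_view text) {
#ifdef _WIN32
    if (is_console(stream)) {
        std::fflush(stream);  // keep ordering with earlier buffered output
        if (text.empty()) return true;
        const int in_len = static_cast<int>(text.size());
        const int n = MultiByteToWideChar(CP_UTF8, 0, text.data(), in_len, nullptr, 0);
        if (n <= 0) return false;
        std::wstring wide(static_cast<std::size_t>(n), L'\0');
        MultiByteToWideChar(CP_UTF8, 0, text.data(), in_len, wide.data(), n);
        HANDLE h = reinterpret_cast<HANDLE>(_get_osfhandle(_fileno(stream)));
        DWORD written = 0;
        return WriteConsoleW(h, wide.data(), static_cast<DWORD>(n), &written, nullptr) != 0;
    }
#endif
    return std::fwrite(text.data(), 1, text.size(), stream) == text.size();
}
```

With this shape, redirected output (a file or a pipe) receives the original
bytes, and only a stream that is directly connected to the console goes
through the UTF-16 path.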
>>
>> > That is true only if the execution/literal encoding and the run-time
>> encoding do not match
>>
>> Yes and if we use ACP they will likely not match.
>>
>> For Windows, that is true, but also a reality of that environment. For
>> other platforms, the likelihood of a mismatch is far lower (though not 0;
>> the LANG environment variable is still used in POSIX environments and
>> can still be set to select an encoding other than UTF-8).
>>
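As a rough illustration of how a program on a POSIX system can observe the
run-time encoding selected through LANG (or the LC_* variables), the sketch
below queries the locale's codeset. The function name runtime_codeset is
hypothetical, and the sketch assumes a POSIX environment that provides
<langinfo.h>. Note that it changes the global C locale as a side effect.

```cpp
#include <cassert>
#include <clocale>
#include <string>

#include <langinfo.h>  // POSIX: nl_langinfo, CODESET

// Hypothetical helper: returns the codeset name of the user's locale,
// for example "UTF-8" or "ISO-8859-1", as chosen via LANG/LC_*.
std::string runtime_codeset() {
    // Adopt the locale from the environment; without this call the
    // program stays in the default "C" locale.
    std::setlocale(LC_ALL, "");
    const char* name = nl_langinfo(CODESET);
    assert(name != nullptr);
    return name;
}
```

The value returned here is independent of the execution/literal encoding
chosen at compile time, which is the mismatch discussed above.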
>>
>> > Assuming test.txt is UTF-8 encoded, that is correct; this is a
>> straightforward case of mojibake.
>>
>> test.txt is CP1251 encoded. This example illustrates that using ACP
>> doesn't solve mojibake.
>>
>> Perhaps we are focused on different instances of mojibake. I think you
>> are pointing out that the output of the findstr command will fail to
>> present properly because the console encoding doesn't match. The mojibake
>> I was alluding to is that findstr will fail to find a match in the file
>> because the encoding of the pattern string (as entered from the console on
>> the command line) doesn't match the encoding of the file (unless findstr
>> consults the wide/UTF-16 variant of its command line).
>>
>>
>> > Perhaps a 'formatter' specialization should be provided for
>> std::filesystem::path? Proposing something like that is likely subject
>> matter for a different paper, but I think it would be helpful for this
>> paper to discuss it.
>>
>> I think that providing such specialization would be useful but it is out
>> of scope of the current paper since it has nothing to do with I/O
>> integration.
>>
>> I think the relevant question for this paper, given that it does intend
>> to specify encoding conversions in at least some cases, is how output such
>> as filenames that may have content that is not well-formed according to the
>> execution encoding, can be incorporated. I think my preference is to have
>> some method to opt-out of implicit conversions; probably via a per-field
>> format flag.
>>
>>
>> > What happens if the UTF-8 input is ill-formed?
>>
>> Good question. The current implementation throws an exception on
>> transcoding error but the error handling mechanism is open for discussion.
>>
>> I don't find throwing an exception to be acceptable, but attempted
>> conversion with U+FFFD substitution as suggested by Peter seems ok (perhaps
>> with an opt-out as suggested above); I prefer a loss of precision over a
>> loss of output.
>>
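A minimal sketch of the substitution approach discussed above, assuming the
input is meant to be UTF-8. The function name decode_utf8_lossy is
hypothetical, and the policy chosen here (one U+FFFD per offending byte) is
only one of several possible policies; it is not necessarily what the paper
or any implementation would adopt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decodes UTF-8 into code points. Whenever no
// well-formed sequence starts at the current byte, it emits U+FFFD and
// skips that single byte instead of throwing.
std::u32string decode_utf8_lossy(std::string_view in) {
    constexpr char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;  // expected sequence length, 0 = invalid lead byte
        char32_t cp = 0;      // code point accumulated so far
        char32_t min = 0;     // smallest value allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;  // must be a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(replacement); i += 1; }
        assert(i <= in.size());
    }
    return out;
}
```

With this policy, ill-formed input loses precision but still produces output:
the three bytes 'a', 0xFF, 'z' decode to 'a', U+FFFD, 'z'.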
>>
>> *Answers to Jens:*
>>
>> > Doing std::format without necessarily creating a std::string is useful
>> functionality, but unrelated to the transcoding issues. Thus, this facility
>> should be separate.
>>
>> Such a facility already exists in C++20 (format_to, format_to_n). The
>> current paper only integrates it with I/O without adding any new
>> functionality on the formatting level.
>>
>> > Apparently, there is some OS-dependent magic going on to determine
>> whether output is to a console and, if so, which encoding the console might
>> prefer. I'm fine with such magic existing, but it should be a distinct
>> facility.
>>
>> Sure, I will extract it into a separate API in the next revision of the
>> paper.
>>
>> > And then there is the facility of converting the C++ literal encoding
>> to the console encoding, if necessary. Again, this should be a separate
>> facility, preferably offering a generic transcoding facility that can be
>> specialized for the console-only use case.
>>
>> While I agree that such a transcoding facility would be useful I think it
>> is out of scope of the current paper. The latter requires only minimal
>> transcoding facilities for the Unicode case and only on some platforms
>> where dedicated system APIs exist.
>>
>> I agree that distinct interfaces should be provided for each of these
>> concerns, but I also think each can be pursued separately and need not hold
>> up the proposed feature. We can always re-specify the proposed behavior in
>> terms of new interfaces via as-if in the future.
>>
>> Also, progress is being made on these; JeanHeyd is continuing to work on
>> general transcoding facilities. See WG14 N2595
>> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2595.pdf> for his most
>> recent work (we'll be discussing this paper in SG16 early next year).
>>
>> Tom.
>>
>>
>> Cheers,
>> Victor
>>
>>
>> On Fri, Nov 27, 2020 at 11:15 AM Jens Maurer via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> On 23/11/2020 06.33, Tom Honermann via Lib-Ext wrote:
>>> > SG16 began reviewing P2093R2 <https://wg21.link/p2093r2> in our
>>> recent telecon <
>>> https://github.com/sg16-unicode/sg16-meetings#november-11th-2020> and
>>> will continue review in our next telecon scheduled for December 9th.
>>> >
>>> > The following reflects my personal thoughts on this proposal.
>>>
>>> Ditto.
>>>
>>> As I've already said in the SG16 review, I'd like to see
>>> smaller bits and pieces offered, instead of or at least in
>>> addition to hiding them behind a non-trivial "printf"-style
>>> wrapper.
>>>
>>> - Doing std::format without necessarily creating a std::string
>>> is useful functionality, but unrelated to the transcoding issues.
>>> Thus, this facility should be separate.
>>>
>>> - Apparently, there is some OS-dependent magic going on to
>>> determine whether output is to a console and, if so, which
>>> encoding the console might prefer. I'm fine with such magic
>>> existing, but it should be a distinct facility.
>>>
>>> - And then there is the facility of converting the C++ literal
>>> encoding to the console encoding, if necessary. Again, this
>>> should be a separate facility, preferably offering a generic
>>> transcoding facility that can be specialized for the console-only
>>> use case. (Only supporting that single transcoding might save
>>> binary size.)
>>>
>>>
>>> Jens
>>>
>>>
>>>
>>
>>
>

Received on 2020-12-13 09:24:01