Date: Sat, 5 Dec 2020 16:16:48 -0500
On 11/28/20 11:33 AM, Victor Zverovich wrote:
> To keep this thread manageable here are my answers to Tom and Jens
> (thanks for the feedback!) in one convenient wall of text.
> *Answers to Tom:*
>
> > My choice of Windows-1251 for an example scenario was motivated
> solely by the use of Russian characters in the example in the paper.
>
> Sure, but it is nevertheless a great choice that clearly demonstrates that
> ACP is definitely the wrong thing when files are involved. That said,
> EBCDIC and other encodings are still supported via the non-Unicode path.
I don't feel that same level of clarity. There is a distinction to be
made regarding writing to a file vs writing to stdout. In the former
case, the programmer has wide latitude for choosing an encoding and
knows that the content is being written to a file. In the latter case,
the programmer doesn't (in general) know whether the output is
redirected to a file, pipe, or some other character device.
>
> > I don't think the Notepad example is particularly relevant.
>
> It is relevant for #2 because it shows that when a Russian user
> creates a text file on Windows it will most definitely be encoded in
> UTF-8 and not "ANSI" encoding (and definitely not the terminal
> encoding). This is true for Notepad and other popular editors. Same
> with files obtained from the Internet. We should understand the common
> encoding for text files in order for our text facilities to be useful
> and consistent.
I think this misses the concern to some degree. For #2, it is the user
that is making the choice to write the output to a text file, not the
programmer. I believe the programmer should have the ability to choose
the encoding used (preferably with the ability for the user to influence
the choice), but I'm (so far) uncomfortable with the behavior being tied
to the execution/literal encoding chosen at compile time; that choice is
historically distinct from the run-time encoding used by the environment
the program runs in.
For example, I believe it would be a reasonable choice for a z/OS
programmer to use UTF-8 as the execution/literal encoding and still run
that program in an EBCDIC environment. This is how Java works in such
environments (using UTF-16 internally of course). This is the Unicode
sandwich model.
>
> > There is no particular expectation that a .txt file was produced by
> a program running on the local machine, so the local code page isn't a
> particularly good default in any case.
>
> Exactly.
>
> > If I write a version of the Windows 'type' command as you used it
> above, call it 'cat', compile it without Microsoft's /utf-8 option,
> then I would like it to still do the right thing; not the behavior you
> illustrated above.
>
> I misunderstood your suggestion. Are you suggesting for the
> non-Unicode path (print_nonunicode) to do the transcoding to the
> encoding determined by ACP and for the Unicode path (print_unicode) to
> produce UTF-8? Note that using ACP won't solve the mojibake problem
> because the terminal encoding (CP866) is separate from the ACP
> encoding, at least for Russian. Confusing those two is a common
> misconception and source of problems (see e.g.
> https://stackoverflow.com/questions/49259502/windows-console-codepage-866).
> Using the terminal encoding would produce completely useless output
> for anything but interaction with legacy command-line programs via
> pipes (and even there the usefulness of the result of the pipeline is
> questionable).
No. My suggestion was that, when writing to a stream known (e.g., via
_isatty()
<https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/isatty?view=msvc-160>)
to be directly connected to the Windows console, the Unicode path
be taken regardless of what the execution encoding is, with transcoding
to UTF-16 performed as necessary. My expectation is that writes to the
console would be performed using the WriteConsoleW() function; that is
where UTF-16 comes in. So, when the execution encoding is UTF-8, the
implementation would convert to UTF-16 and then call WriteConsoleW()
and, for other execution encodings, the implementation would transcode
to UTF-16 and then call WriteConsoleW(). This approach bypasses the
console encoding entirely (the console encoding is only relevant for the
ANSI implementation of the console APIs and for reads/writes to the
console via ReadFile() and WriteFile()).
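A rough sketch of that console path follows, for illustration only. It
is not taken from P2093; write_text is a hypothetical name, and the
sketch assumes the text is already UTF-8 (for another execution
encoding, the CP_UTF8 argument would be replaced by the matching code
page). Off Windows, or when the stream is not a console, it falls back
to writing the bytes unchanged.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>
#ifdef _WIN32
#include <io.h>
#include <windows.h>
#endif

// Hypothetical helper: writes UTF-8 text to a stream. When the stream is
// attached to the Windows console, the text is transcoded to UTF-16 and
// written with WriteConsoleW(), bypassing the console code page.
bool write_text(std::FILE* stream, std::string_view utf8) {
#ifdef _WIN32
    const int fd = _fileno(stream);
    // _isatty() is also nonzero for other character devices; if
    // WriteConsoleW() fails for such a device we fall back to fwrite().
    if (!utf8.empty() && _isatty(fd)) {
        std::fflush(stream);  // keep ordering with earlier buffered output
        const int len = static_cast<int>(utf8.size());
        const int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len,
                                          nullptr, 0);
        if (n > 0) {
            std::wstring wide(static_cast<std::size_t>(n), L'\0');
            MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len,
                                wide.data(), n);
            HANDLE h = reinterpret_cast<HANDLE>(_get_osfhandle(fd));
            DWORD written = 0;
            if (WriteConsoleW(h, wide.data(), static_cast<DWORD>(n),
                              &written, nullptr)) {
                return true;
            }
        }
    }
#endif
    // Files, pipes, and non-Windows platforms: pass the bytes through.
    return std::fwrite(utf8.data(), 1, utf8.size(), stream) == utf8.size();
}
```

Note that redirected output (a file or pipe) never reaches the
WriteConsoleW() branch, so the question of which encoding to use for
redirected output is left untouched by this sketch.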
>
> > That is true only if the execution/literal encoding and the run-time
> encoding do not match
>
> Yes and if we use ACP they will likely not match.
For Windows, that is true, but also a reality of that environment. For
other platforms, the likelihood of a mismatch is far lower (though not
0; the LANG environment variable is still used in POSIX environments and
can still be set to select an encoding other than UTF-8).
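To illustrate how a program can observe the run-time encoding, here is
a minimal sketch. runtime_codeset is a hypothetical helper name; the
POSIX branch relies on setlocale() and nl_langinfo(CODESET), and the
Windows branch simply reports the ACP number.

```cpp
#include <clocale>
#include <string>
#ifdef _WIN32
#include <windows.h>
#else
#include <langinfo.h>
#endif

// Hypothetical helper: names the encoding requested by the run-time
// environment, independent of the execution/literal encoding chosen
// at compile time.
std::string runtime_codeset() {
#ifdef _WIN32
    // Active code page (ACP), e.g. "cp1251".
    return "cp" + std::to_string(GetACP());
#else
    // Adopt the user's locale from LANG / LC_* (this changes global
    // state). If the named locale is unavailable, setlocale() fails and
    // the current locale (initially "C") stays in effect.
    std::setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
#endif
}
```

On a POSIX system, running the program with LANG set to a non-UTF-8
locale that is installed should make this report that locale's
encoding, which is the mismatch case described above.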
>
> > Assuming test.txt is UTF-8 encoded, that is correct; this is a
> straightforward case of mojibake.
>
> test.txt is CP1251 encoded. This example illustrates that using ACP
> doesn't solve mojibake.
Perhaps we are focused on different instances of mojibake. I think you
are pointing out that the output of the findstr command will fail to
present properly because the console encoding doesn't match. The
mojibake I was alluding to is that findstr will fail to find a match in
the file because the encoding of the pattern string (as entered from the
console on the command line) doesn't match the encoding of the file
(unless findstr consults the wide/UTF-16 variant of its command line).
>
> > Perhaps a 'formatter' specialization should be provided for
> std::filesystem::path? Proposing something like that is likely
> subject matter for a different paper, but I think it would be helpful
> for this paper to discuss it.
>
> I think that providing such specialization would be useful but it is
> out of scope of the current paper since it has nothing to do with I/O
> integration.
I think the relevant question for this paper, given that it does intend
to specify encoding conversions in at least some cases, is how output,
such as filenames, whose content may not be well-formed according to
the execution encoding can be incorporated. I think my
preference is to have some method to opt-out of implicit conversions;
probably via a per-field format flag.
>
> > What happens if the UTF-8 input is ill-formed?
>
> Good question. The current implementation throws an exception on
> transcoding error but the error handling mechanism is open for discussion.
I don't find throwing an exception to be acceptable, but attempted
conversion with U+FFFD substitution as suggested by Peter seems ok
(perhaps with an opt-out as suggested above); I prefer a loss of
precision over a loss of output.
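A minimal sketch of such a lossy conversion, to make the idea concrete.
This is not the paper's implementation; utf8_to_utf16_lossy is a
hypothetical name, and the substitution policy is a simplifying
assumption: when a sequence is ill-formed, one U+FFFD is emitted for
its lead byte and decoding resumes at the next byte. That can produce
more replacement characters than the "maximal subpart" practice
recommended by the Unicode Standard.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not part of P2093 or the standard library.
// Decodes UTF-8 into UTF-16 code units and never throws on bad input.
std::u16string utf8_to_utf16_lossy(std::string_view in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // code point being assembled
        char32_t min = 0;     // smallest value allowed for this length
        std::size_t len = 0;  // expected sequence length; 0 = bad lead byte
        if (b0 < 0x80)                     { cp = b0;        len = 1; min = 0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { cp = b0 & 0x07; len = 4; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t j = 1; ok && j < len; ++j) {
            const unsigned char b = static_cast<unsigned char>(in[i + j]);
            ok = (b & 0xC0) == 0x80;  // must be a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF &&
             !(cp >= 0xD800 && cp <= 0xDFFF);

        if (!ok) {
            out.push_back(u'\uFFFD');  // substitute and resync at next byte
            ++i;
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

With this policy, a stray 0xFF byte inside otherwise valid text costs
one replacement character and the surrounding text is still printed.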
>
> *Answers to Jens:*
>
> > Doing std::format without necessarily creating a std::string is
> useful functionality, but unrelated to the transcoding issues. Thus,
> this facility should be separate.
>
> Such a facility already exists in C++20 (format_to, format_to_n). The
> current paper only integrates it with I/O without adding any new
> functionality on the formatting level.
>
> > Apparently, there is some OS-dependent magic going on to determine
> whether output is to a console and, if so, which encoding the console
> might prefer. I'm fine with such magic existing, but it should be a
> distinct facility.
>
> Sure, I will extract it into a separate API in the next revision of
> the paper.
>
> > And then there is the facility of converting the C++ literal
> encoding to the console encoding, if necessary. Again, this should be
> a separate facility, preferably offering a generic transcoding
> facility that can be specialized for the console-only use case.
>
> While I agree that such a transcoding facility would be useful I think
> it is out of scope of the current paper. The latter requires only
> minimal transcoding facilities for the Unicode case and only on some
> platforms where dedicated system APIs exist.
I agree that distinct interfaces should be provided for each of these
concerns, but I also think each can be pursued separately and need not
hold up the proposed feature. We can always re-specify the proposed
behavior in terms of new interfaces via as-if in the future.
Also, progress is being made on these; JeanHeyd is continuing to work on
general transcoding facilities. See WG14 N2595
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2595.pdf> for his most
recent work (we'll be discussing this paper in SG16 early next year).
Tom.
>
> Cheers,
> Victor
>
>
> On Fri, Nov 27, 2020 at 11:15 AM Jens Maurer via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
> On 23/11/2020 06.33, Tom Honermann via Lib-Ext wrote:
> > SG16 began reviewing P2093R2 <https://wg21.link/p2093r2> in our
> recent telecon
> <https://github.com/sg16-unicode/sg16-meetings#november-11th-2020>
> and will continue review in our next telecon scheduled for
> December 9th.
> >
> > The following reflects my personal thoughts on this proposal.
>
> Ditto.
>
> As I've already said in the SG16 review, I'd like to see
> smaller bits and pieces offered, instead of or at least in
> addition to hiding them behind a non-trivial "printf"-style
> wrapper.
>
> - Doing std::format without necessarily creating a std::string
> is useful functionality, but unrelated to the transcoding issues.
> Thus, this facility should be separate.
>
> - Apparently, there is some OS-dependent magic going on to
> determine whether output is to a console and, if so, which
> encoding the console might prefer. I'm fine with such magic
> existing, but it should be a distinct facility.
>
> - And then there is the facility of converting the C++ literal
> encoding to the console encoding, if necessary. Again, this
> should be a separate facility, preferably offering a generic
> transcoding facility that can be specialized for the console-only
> use case. (Only supporting that single transcoding might save
> binary size.)
>
>
> Jens
>
>
> --
> SG16 mailing list
> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2020-12-05 15:16:51