On Wed, Jul 28, 2021, 06:15 Tom Honermann <tom@honermann.net> wrote:
On 7/27/21 6:34 PM, Hubert Tong wrote:
On Mon, Jul 26, 2021 at 11:44 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

SG16 approved forwarding a draft of P2295R5 (Support for UTF-8 as a portable source file encoding) and P2362R0 (Make obfuscating wide character literals ill-formed) with minor modifications to EWG during its July 14th telecon.  All requested SG16 changes are present in the published versions of P2295R5 and P2362R1 that appear in the most recent mailing (note that P2362R1 sports a new title).

These papers are now ready for review by EWG, and the GitHub issue tracker has been updated accordingly.  Both papers have wording that has been reviewed by a core expert, and each reflects existing implementation practice.

I will note that P2295's treatment of end-of-line indicators for UTF-8 source files has not yet been implemented (to my knowledge) on platforms where text files traditionally have "out-of-band" line length information. I am not aware of technical limitations that prevent having a convention that works in the manner P2295 indicates, so this comment is for information only.

Thank you for that correction, Hubert.

Is there a de-facto standard convention for how text files that originate on other platforms are translated to such an environment?  For example, are new-line sequences in the original file removed in favor of such out-of-band information?  Or are they typically preserved?  If preserved, I imagine they may not correlate with the out-of-band line information.  Are there multiple new-line sequence forms in practice?

I'm asking because I would like to better understand the impact to programmers.  Given a UTF-8 encoded file on another platform, in practice, are there multiple ways in which such a file might be translated for this environment?  If so, is there a dominant representation?

Do we have a list of platforms currently in use that store C++ source files in that manner (as opposed to program data, for example)?

Regardless, for such platforms, we can imagine there is a phase 0 that presents a unified view of the physical source... data set as a file.

The intent of the paper is that source files can be compiled portably. If the platform can't read files, some process would be necessary to transform the file to a data set long before phase 1, and that process can replace line breaks anyway....

The only requirement is that what is ultimately fed to the compiler is valid UTF-8 - a stream of bytes produced in some fashion.
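
To make the "phase 0" idea concrete, here is a rough sketch, not taken from P2295 or from any real implementation. It assumes a record-oriented data set has already been read into memory as a list of records with no embedded line terminators. The names records_to_stream and is_valid_utf8 are hypothetical. The first function presents the records as a single byte stream with one LF per record, and the second checks that the result is well-formed UTF-8 (rejecting overlong forms, surrogates, and values above U+10FFFF).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical "phase 0": present a record-oriented data set as one byte
// stream. Assumption: each record holds one source line and carries its
// length out of band, so a single LF is appended after every record.
std::string records_to_stream(const std::vector<std::string>& records) {
    std::string out;
    for (const std::string& record : records) {
        out += record;
        out += '\n';
    }
    return out;
}

// Returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) {  // single-byte (ASCII) code unit
            ++i;
            continue;
        }
        std::size_t len = 0;        // total bytes in this sequence
        unsigned long cp = 0;       // decoded code point
        unsigned long min_cp = 0;   // smallest value allowed for this length
        if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = b0 & 0x1F; min_cp = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = b0 & 0x0F; min_cp = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = b0 & 0x07; min_cp = 0x10000;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (n - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        i += len;
    }
    return true;
}
```

Under these assumptions, whatever tool moves the file onto the platform is free to choose how line breaks are stored; the compiler only ever sees the stream that records_to_stream produces.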

By the same token, I find it rather unfortunate that we now have two notes for these platforms using data sets while their use case is already covered by normative wording ("implementation-defined mapping" covers this use case)...


P2295 has also been reviewed by SG22 (C/C++ Liaison) and has not been tagged for review by any other SGs.  P2362 still awaits SG22 review, so I encourage the EWG and SG22 chairs to coordinate to determine if EWG review should await SG22's review.

Thank you to both authors for the time and patience they exhibited throughout the reviews of these papers; particularly with regard to finding wording for P2295.


SG16 mailing list