On Fri, 8 Mar 2019 at 06:19 Tom Honermann <tom@honermann.net> wrote:
On 3/7/19 11:17 AM, Mathias Stearn wrote:
(Forking thread)
was: Dependency information for module-aware build tools

On Thu, Mar 7, 2019 at 12:15 AM Tom Honermann <tom@honermann.net> wrote:

I find myself thinking (as I so often do these days much to the surprise of my past self), how do EBCDIC and z/OS fit in here?  If we stick to JSON and require the dependency file to be UTF-8 encoded, would all file names in these files be raw8 encoded and effectively unreadable (by humans) on z/OS?  Perhaps we could allow more flexibility, but doing so necessarily invites locales into the discussion (for those that are unaware, EBCDIC has code pages too).  For example, we could require that the selected locale match between the producers and consumers of the file (UB if they don't) and permit use of the string representation by transcoding from the locale-interpreted physical file name to UTF-8, but only if reverse-transcoding produces the same physical file name; otherwise the appropriate raw format must be used.
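
The round-trip rule described above could look roughly like the following Python sketch. The function name and the "name" and "raw8" keys are assumptions made for illustration, not part of any agreed format, and the locale's encoding is passed in explicitly rather than queried from the system.

```python
def encode_file_name(raw, locale_encoding):
    """Pick a representation for a physical file name given as bytes.

    Hypothetical helper: the "name" and "raw8" keys are illustrative only.
    """
    try:
        text = raw.decode(locale_encoding)
        # Use the readable string form only if transcoding back
        # reproduces the exact physical file name.
        if text.encode(locale_encoding) == raw:
            return {"name": text}
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass
    # Otherwise fall back to a raw form that preserves every byte.
    return {"raw8": list(raw)}
```

For example, a name stored in the EBCDIC code page cp037 would take the string form, while a byte sequence the chosen encoding cannot decode would take the raw form.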


I thought one of the reasons we are going the TR route rather than TS or IS is to allow recommending 99% solutions that provide the best experience for the vast majority of users while not necessarily being applicable to everyone. Platforms and codebases where the TR recommendations don't make sense are free to alter them for their platform, or just come up with completely different solutions to the problem. To me, this also implies that we are allowed to say that this TR doesn't support files with invalid Unicode names, however that is best expressed on your platform. On Windows, that means that the path must meet the requirements of UTF-16, not just UCS-2. On UTF-8-native platforms that have "bag-o-bytes" file names, it means that we don't support files with invalid UTF-8 in their names. On non-Unicode platforms, that means either transcoding to/from UTF-8 on the way in and out of the JSON format, or coming up with a different format, accepting that it will be specific to your platform.
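
As a rough illustration of what "invalid Unicode names" means on the two kinds of platform mentioned above, here is a Python sketch. Both function names are hypothetical. The UTF-16 check takes the name as a sequence of 16-bit code units and rejects unpaired surrogates, which is the difference between well-formed UTF-16 and plain UCS-2.

```python
def is_well_formed_utf8(name):
    """True if a byte-string file name is valid UTF-8."""
    try:
        name.decode("utf-8")  # strict decoding rejects invalid sequences
        return True
    except UnicodeDecodeError:
        return False


def is_well_formed_utf16(units):
    """True if a sequence of 16-bit code units has no unpaired surrogates."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True
```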

I think platform specific differences are acceptable, but we should strive for general solutions.

I think the committee currently has a UTF-8 bias that doesn't necessarily reflect the global C++ community.  We don't have much representation from Japan or China where, as I understand it, Shift-JIS and GB18030 still have significant usage.  We also have few, if any, z/OS users in the committee outside of IBM representatives.  UTF-8 dominates the web, no one questions that.  But within the C++ ecosystem, I don't think UTF-8 dominates to a similar degree, at least not outside of the US and Europe.  I wish I had data to back that up.


I also thought one of our goals was to describe a subset of what is technically supported by the IS, such that if you stay within these bounds, you will have the least trouble on a majority of platforms.  This means that we may want to recommend more restrictions on file names than just "well formed Unicode", such as:
* Don't have files that differ only by case (broken on case-insensitive filesystems)
* Don't have files that differ only by normalization form (broken on at least OSX)
* Stick to a small set of characters as word separators (maybe any of " .-_", definitely not ':')
* Avoid "poisoned" pathnames like PRN and CON
I think these are good guidelines and agree with recommending them.
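
A build tool or lint step could check some of these guidelines mechanically. The Python sketch below covers only the case, normalization, and reserved-name rules. All names in it are hypothetical, and the reserved list goes beyond the PRN and CON mentioned above by assuming the other classic Windows device names.

```python
import unicodedata
from collections import defaultdict

# Assumption: the usual Windows device names, not just PRN and CON.
RESERVED_STEMS = {"CON", "PRN", "AUX", "NUL"}
RESERVED_STEMS |= {f"COM{i}" for i in range(1, 10)}
RESERVED_STEMS |= {f"LPT{i}" for i in range(1, 10)}


def collision_key(name):
    """Key that is equal for names differing only by case or normalization."""
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFD", folded)


def find_collisions(names):
    """Return groups of distinct names that share a collision key."""
    groups = defaultdict(set)
    for name in names:
        groups[collision_key(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def is_reserved(name):
    """True if the part before the first dot is a reserved device name."""
    return name.split(".")[0].upper() in RESERVED_STEMS
```

The key only approximates what a real case-insensitive or normalizing filesystem does, since each filesystem applies its own folding rules.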


And perhaps we should also make recommendations that are likely to increase sanity, such as:
* Don't use characters that are squashed by the NFKC/NFKD transformation (e.g. the Angstrom character)
* Don't have control characters in file names
* Don't mix scripts within a single path component or module identifier
* Don't start source file names with a dot
* Use one of the "blessed" file extensions for your source code (we can have a big tent of blessed extensions, but naming a C++ source file haha.py is just dumb)
Also good guidelines in my opinion.
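
Several of these could also be checked by a simple linter. The Python sketch below is illustrative only: the function name, the problem labels, and the set of "blessed" extensions are all assumptions. It omits the mixed-script rule because Python's standard unicodedata module does not expose the Unicode Script property.

```python
import os
import unicodedata

# Assumption: an illustrative set of extensions, not an agreed list.
BLESSED_EXTENSIONS = {".cpp", ".cc", ".cxx", ".h", ".hpp", ".cppm", ".ixx"}


def lint_source_name(name):
    """Return a list of problem labels for a single source file name."""
    problems = []
    # Flags any name that NFKC would change, which includes
    # compatibility characters such as U+212B ANGSTROM SIGN.
    if unicodedata.normalize("NFKC", name) != name:
        problems.append("nfkc-unstable")
    # General category Cc covers the C0 and C1 control characters.
    if any(unicodedata.category(ch) == "Cc" for ch in name):
        problems.append("control-char")
    if name.startswith("."):
        problems.append("leading-dot")
    if os.path.splitext(name)[1].lower() not in BLESSED_EXTENSIONS:
        problems.append("extension")
    return problems
```

Note that the NFKC check also flags names stored in a decomposed form, so a stricter or looser rule may be wanted in practice.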


I think we could even recommend plain ASCII (or something that can be mapped to) in file names - even a subset of ASCII.
IMO it doesn't take anything away from users, it avoids a lot of issues, and it is portable, so the code can actually be shared across platforms.

But there may be various levels of specifications:

DON'T  have control characters in file names
PREFER only using characters in the ASCII character set

 

To be clear, I'm not suggesting we go as far as the "pitchfork" proposal in dictating a project layout. More like discouraging obviously bad things that would get you yelled at in code review in basically all non-troll projects.

+1.

Tom.

_______________________________________________
Modules mailing list
Modules@lists.isocpp.org
Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules
Link to this post: http://lists.isocpp.org/modules/2019/03/0216.php