Re: [Tooling] [isocpp-modules] Filename requirements for the SG15 TR

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 8 Mar 2019 00:19:06 -0500
On 3/7/19 11:17 AM, Mathias Stearn wrote:
> (Forking thread)
> was: Dependency information for module-aware build tools
> On Thu, Mar 7, 2019 at 12:15 AM Tom Honermann <tom_at_[hidden]> wrote:
> > I find myself thinking (as I so often do these days much to the
> > surprise of my past self), how do EBCDIC and z/OS fit in here?
> > If we stick to JSON and require the dependency file to be UTF-8
> > encoded, would all file names in these files be raw8 encoded and
> > effectively unreadable (by humans) on z/OS? Perhaps we could
> > allow more flexibility, but doing so necessarily invites locales
> > into the discussion (for those that are unaware, EBCDIC has code
> > pages too). For example, we could require that the selected locale
> > match between the producers and consumers of the file (UB if they
> > don't) and permit use of the string representation by transcoding
> > from the locale interpreted physical file name to UTF-8, but only
> > if reverse-transcoding produces the same physical file name,
> > otherwise the appropriate raw format must be used.
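
The round-trip rule quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON keys "name" and "raw8" and the function encode_file_name are hypothetical, and Python's cp037 codec stands in for whichever EBCDIC code page the locale selects.

```python
def encode_file_name(physical: bytes, codec: str) -> dict:
    """Return a JSON-ready entry for one physical file name.

    `codec` is the locale's encoding (an assumption: both producer and
    consumer of the file agree on it). The keys "name" and "raw8" are
    hypothetical, chosen only for this sketch.
    """
    try:
        text = physical.decode(codec)
        # The string must also be representable as UTF-8 in the JSON file.
        text.encode("utf-8")
        # Use the readable form only if transcoding back is lossless.
        if text.encode(codec) == physical:
            return {"name": text}
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass
    # Fall back to the raw bytes, one integer per byte.
    return {"raw8": list(physical)}


ebcdic_name = "main.cpp".encode("cp037")   # bytes as stored on an EBCDIC system
entry = encode_file_name(ebcdic_name, "cp037")
bad_entry = encode_file_name(b"\xff.cpp", "utf-8")
```

With this shape, a name that survives the round trip stays human-readable in the dependency file, and anything else degrades to the raw form instead of being silently altered.
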
> I thought one of the reasons we are going the TR route rather than TS
> or IS is to allow recommending 99% solutions that provide the best
> experience for the vast majority of users while not necessarily being
> applicable to everyone. Platforms and codebases where the TR
> recommendations don't make sense are free to alter them for their
> platform, or just come up with completely different solutions to the
> problem. To me, this also implies that we are allowed to say that this
> TR doesn't support files with invalid Unicode names, however that is
> best expressed on your platform. On Windows, that means that the path
> must meet the requirements of UTF-16, not just UCS-2. On UTF-8-native
> platforms that have "bag-o-bytes" file names, it means that we don't
> support files with invalid UTF-8 in their names. On non-Unicode
> platforms, that means either transcoding to/from UTF-8 on the way in
> and out of the JSON format, or coming up with a different format,
> accepting that it will be specific to your platform.
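
As a rough illustration of the two well-formedness conditions mentioned above (valid UTF-8 for "bag-o-bytes" names, UTF-16 rather than just UCS-2 on Windows), a check might look like the sketch below. The function names are hypothetical, and the Windows name is modeled simply as a sequence of 16-bit code units.

```python
from typing import Sequence


def is_valid_utf8(name: bytes) -> bool:
    """True if a byte-string file name is well-formed UTF-8."""
    try:
        name.decode("utf-8")  # strict decoding rejects malformed sequences
        return True
    except UnicodeDecodeError:
        return False


def is_well_formed_utf16(units: Sequence[int]) -> bool:
    """True if 16-bit code units contain no unpaired surrogates.

    UCS-2 would accept any sequence of 16-bit values; UTF-16 additionally
    requires every high surrogate to be followed by a low surrogate.
    """
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:            # high surrogate
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:          # stray low surrogate
            return False
        else:
            i += 1
    return True
```
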

I think platform specific differences are acceptable, but we should
strive for general solutions.

I think the committee currently has a UTF-8 bias that doesn't
necessarily reflect the global C++ community. We don't have much
representation from Japan or China where, as I understand it, Shift-JIS
and GB18030 still have significant usage. We also have few, if any,
z/OS users in the committee outside of IBM representatives. UTF-8
dominates the web; no one questions that. But within the C++ ecosystem,
I don't think UTF-8 dominates to a similar degree, at least not outside
of the US and Europe. I wish I had data to back that up.

> I also thought one of our goals was to describe a subset of what is
> technically supported by the IS, that if you stay within these bounds,
> you will have the least trouble on a majority of platforms. This
> means that we may want to recommend more restrictions on file
> names than just "well-formed Unicode", such as:
> * Don't have files that differ only by case (broken on
> case-insensitive filesystems)
> * Don't have files that differ only by normalization form (broken on
> at least OSX)
> * Stick to a small set of characters as word separators (maybe any of
> " .-_", definitely not ':')
> * Avoid "poisoned" pathnames like PRN and CON
I think these are good guidelines and agree with recommending them.
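
For what it's worth, checks along these lines are cheap to automate. The following is a minimal sketch, not a proposal for the TR: all function names are hypothetical, paths are assumed to use "/" as the separator, the collision key (NFC followed by case folding) only approximates what real file systems do, and the reserved list covers the classic Windows device names.

```python
import unicodedata
from collections import defaultdict

# Classic Windows device names; they are reserved with or without an extension.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
ALLOWED_SEPARATORS = " .-_"


def collision_key(path: str) -> str:
    """Approximate key under which case or normalization variants collide."""
    return unicodedata.normalize("NFC", path).casefold()


def find_collisions(paths):
    """Group paths that differ only by case and/or normalization form."""
    groups = defaultdict(list)
    for p in paths:
        groups[collision_key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def reserved_components(path: str):
    """Path components whose base name is a reserved device name."""
    return [c for c in path.split("/") if c.split(".")[0].upper() in RESERVED]


def odd_separators(path: str):
    """Non-alphanumeric characters outside the suggested separator set."""
    return sorted({ch for ch in path
                   if not ch.isalnum() and ch != "/" and ch not in ALLOWED_SEPARATORS})
```

A build tool could run find_collisions over the whole source tree once and the other two checks per path.
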
> And perhaps we should also make recommendations that are likely to
> increase sanity, such as:
> * Don't use characters that are squashed by the NFKC/NFKD
> transformation (e.g. the Angstrom character)
> * Don't have control characters in file names
> * Don't mix scripts within a single path component or module identifier
> * Don't start source file names with a dot
> * Use one of the "blessed" file extensions for your source code (we
> can have a big tent of blessed extensions, but naming a C++ source
> file haha.py is just dumb)
Also good guidelines in my opinion.
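
Most of these are also easy to check mechanically. Here is a small sketch under some assumptions: the set of "blessed" extensions is made up for illustration, sanity_issues is a hypothetical name, and the NFKC test flags any name that is not already stable under NFKC (which is slightly broader than "squashed"). The mixed-script rule is left out because it needs the Unicode Script property, which Python's standard unicodedata module does not expose.

```python
import os
import unicodedata

# Hypothetical "big tent" of source extensions, for illustration only.
BLESSED_EXTENSIONS = {".c", ".cc", ".cpp", ".cxx", ".h", ".hh", ".hpp", ".hxx",
                      ".ixx", ".cppm"}


def sanity_issues(name: str):
    """Return a list of guideline violations for one source file name."""
    issues = []
    if unicodedata.normalize("NFKC", name) != name:
        issues.append("changed by NFKC")
    if any(unicodedata.category(ch) == "Cc" for ch in name):
        issues.append("control character")
    if name.startswith("."):
        issues.append("leading dot")
    if os.path.splitext(name)[1].lower() not in BLESSED_EXTENSIONS:
        issues.append("unrecognized extension")
    return issues
```

For example, sanity_issues("haha.py") reports only the extension, while a name containing U+212B (ANGSTROM SIGN) is reported as changed by NFKC.
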
> To be clear, I'm not suggesting we go as far as the "pitchfork"
> proposal in dictating a project layout. More like discouraging
> obviously bad things that would get you yelled at in code review in
> basically all non-troll projects.



Received on 2019-03-08 06:19:10