sg15: [Tooling] Filename requirements for the SG15 TR

From: Mathias Stearn <redbeard0531+isocpp_at_[hidden]>
Date: Thu, 7 Mar 2019 11:17:47 -0500

(Forking thread)
was: Dependency information for module-aware build tools

On Thu, Mar 7, 2019 at 12:15 AM Tom Honermann <tom_at_[hidden]> wrote:

> I find myself thinking (as I so often do these days much to the surprise
> of my past self), how does EBCDIC and z/OS fit in here? If we stick to
> JSON and require the dependency file to be UTF-8 encoded, would all file
> names in these files be raw8 encoded and effectively unreadable (by humans)
> on z/OS? Perhaps we could allow more flexibility, but doing so necessarily
> invites locales into the discussion (for those that are unaware, EBCDIC has
> code pages too). For example, we could require that the selected locale
> match between the producers and consumers of the file (UB if they don't)
> and permit use of the string representation by transcoding from the locale
> interpreted physical file name to UTF-8, but only if reverse-transcoding
> produces the same physical file name, otherwise the appropriate raw format
> must be used.
>

I thought one of the reasons we are going the TR route rather than TS or IS
is to allow recommending 99% solutions that provide the best experience for
the vast majority of users while not necessarily being applicable to
everyone. Platforms and codebases where the TR recommendations don't make
sense are free to alter them for their platform, or just come up with
completely different solutions to the problem. To me, this also implies
that we are allowed to say that this TR doesn't support files with invalid
Unicode names, however that is best expressed on your platform. On Windows,
that means that the path must be well-formed UTF-16 (no unpaired
surrogates), not just UCS-2. On UTF-8-native platforms that have
"bag-o-bytes" file names, it means that we don't support files with invalid
UTF-8 in their names. On non-Unicode platforms, that means either
transcoding to/from UTF-8 on the way in and out of the JSON format, or
coming up with a different format, accepting that it will be specific to
your platform.
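
As a rough illustration of what "well-formed" would mean on the two kinds
of Unicode platforms, here is a minimal Python sketch. The helper names
is_wellformed_utf8 and is_wellformed_utf16 are made up for this example and
are not part of any proposal; the UTF-16 check takes the raw 16-bit code
units of a name, as a Windows path would supply them.

```python
from typing import Sequence


def is_wellformed_utf8(raw: bytes) -> bool:
    """Return True if the raw file name bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def is_wellformed_utf16(units: Sequence[int]) -> bool:
    """Return True if the 16-bit code units contain no unpaired surrogates.

    Any sequence of 16-bit units is acceptable as UCS-2; UTF-16 additionally
    requires every high surrogate to be followed by a low surrogate.
    """
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True
```

A tool could run a check like this when producing the JSON file and refuse
(or fall back to a platform-specific format) for names that fail it.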

I also thought one of our goals was to describe a subset of what is
technically supported by the IS, that if you stay within these bounds, you
will have the least trouble on a majority of platforms. This means that we
may want to recommend restrictions on file names beyond just "well-formed
Unicode", such as:
* Don't have files that differ only by case (broken on case-insensitive
filesystems)
* Don't have files that differ only by normalization form (broken on at
least macOS)
* Stick to a small set of characters as word separators
(maybe any of " .-_", definitely not ':')
* Avoid "poisoned" pathnames like PRN and CON
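
To make the first, second, and last of these concrete, here is a minimal
Python sketch of how a lint tool might detect them. Everything here is an
assumption for illustration: the function names are invented, the reserved
name list is only a partial list of Windows device names, and NFD plus
casefold() is just one plausible way to approximate what case-insensitive
and normalizing filesystems do.

```python
import unicodedata
from collections import defaultdict

# Partial list of Windows reserved device names (assumption: not exhaustive).
RESERVED_DEVICE_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def collision_key(name):
    """Map names that differ only by case or normalization form to one key."""
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFC", folded)


def find_collisions(names):
    """Return groups of distinct names that share a collision key."""
    groups = defaultdict(set)
    for name in names:
        groups[collision_key(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def is_poisoned(name):
    """Return True if the part before the first dot is a reserved device name."""
    stem = name.split(".", 1)[0].rstrip(" ")
    return stem.upper() in RESERVED_DEVICE_NAMES
```

Running find_collisions over the names in a single directory would report
pairs such as Foo.cpp and foo.cpp, while is_poisoned would reject con.cpp
but accept console.cpp.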

And perhaps we should also make recommendations that are likely to increase
sanity, such as:
* Don't use characters that are squashed by the NFKC/NFKD transformation
(e.g. the Angstrom character)
* Don't have control characters in file names
* Don't mix scripts within a single path component or module identifier
* Don't start source file names with a dot
* Use one of the "blessed" file extensions for your source code (we can
have a big tent of blessed extensions, but naming a C++ source file haha.py
is just dumb)
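
Most of these are mechanical enough to check per file name. Here is a
minimal Python sketch, with the caveat that lint_source_name and the
BLESSED_EXTENSIONS set are hypothetical placeholders and not a proposed
list. The mixed-script rule is left out because the Python standard library
does not expose the Unicode Script property.

```python
import os
import unicodedata

# Hypothetical "big tent" of extensions; the real list would be up to the TR.
BLESSED_EXTENSIONS = {
    ".cpp", ".cc", ".cxx", ".hpp", ".hh", ".hxx", ".h", ".ixx", ".cppm",
}


def lint_source_name(name):
    """Return a list of problems found in a single source file name."""
    problems = []
    # Note: this also flags names that are merely not in NFC.
    if unicodedata.normalize("NFKC", name) != name:
        problems.append("changed by NFKC")
    # General category Cc covers the C0 and C1 control characters.
    if any(unicodedata.category(ch) == "Cc" for ch in name):
        problems.append("control character")
    if name.startswith("."):
        problems.append("leading dot")
    if os.path.splitext(name)[1].lower() not in BLESSED_EXTENSIONS:
        problems.append("unrecognized extension")
    return problems
```

An empty result means the name passes all four checks; a name like haha.py
comes back with "unrecognized extension".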

To be clear, I'm not suggesting we go as far as the "pitchfork" proposal in
dictating a project layout. More like discouraging obviously bad things
that would get you yelled at in code review in basically all non-troll
projects.

Received on 2019-03-07 17:18:01