Re: [Tooling] [isocpp-modules] Filename requirements for the SG15 TR

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 8 Mar 2019 07:35:37 +0100
On Fri, 8 Mar 2019 at 06:19 Tom Honermann <tom_at_[hidden]> wrote:

> On 3/7/19 11:17 AM, Mathias Stearn wrote:
> (Forking thread)
> was: Dependency information for module-aware build tools
> On Thu, Mar 7, 2019 at 12:15 AM Tom Honermann <tom_at_[hidden]> wrote:
>> I find myself thinking (as I so often do these days, much to the surprise
>> of my past self), how do EBCDIC and z/OS fit in here? If we stick to
>> JSON and require the dependency file to be UTF-8 encoded, would all file
>> names in these files be raw8 encoded and effectively unreadable (by humans)
>> on z/OS? Perhaps we could allow more flexibility, but doing so necessarily
>> invites locales into the discussion (for those that are unaware, EBCDIC has
>> code pages too). For example, we could require that the selected locale
>> match between the producers and consumers of the file (UB if they don't)
>> and permit use of the string representation by transcoding from the locale
>> interpreted physical file name to UTF-8, but only if reverse-transcoding
>> produces the same physical file name, otherwise the appropriate raw format
>> must be used.
> I thought one of the reasons we are going the TR route rather than TS or
> IS is to allow recommending 99% solutions that provide the best experience
> for the vast majority of users while not necessarily being applicable to
> everyone. Platforms and codebases where the TR recommendations don't make
> sense are free to alter them for their platform, or just come up with
> completely different solutions to the problem. To me, this also implies
> that we are allowed to say that this TR doesn't support files with invalid
> Unicode names, however that is best expressed on your platform. On Windows,
> that means that the path must meet the requirements of UTF-16, not just
> UCS-2. On UTF-8-native platforms that have "bag-o-bytes" file names, it
> means that we don't support files with invalid UTF-8 in their names. On
> non-Unicode platforms, that means either transcoding to/from UTF-8 on the
> way in and out of the JSON format, or coming up with a different format,
> accepting that it will be specific to your platform.
> I think platform specific differences are acceptable, but we should strive
> for general solutions.
> I think the committee currently has a UTF-8 bias that doesn't necessarily
> reflect the global C++ community. We don't have much representation from
> Japan or China where, as I understand it, Shift-JIS and GB18030 still have
> significant usage. We also have few, if any, z/OS users in the committee
> outside of IBM representatives. UTF-8 dominates the web, no one questions
> that. But within the C++ ecosystem, I don't think UTF-8 dominates to a
> similar degree, at least not outside of the US and Europe. I wish I had
> data to back that up.
> I also thought one of our goals was to describe a subset of what is
> technically supported by the IS, that if you stay within these bounds, you
> will have the least trouble on a majority of platforms. This means that we
> may want to recommend restrictions on file names beyond just "well-formed
> Unicode", such as:
> * Don't have files that differ only by case (broken on case-insensitive
> filesystems)
> * Don't have files that differ only by normalization form (broken on at
> least OSX)
> * Stick to a small set of characters as word separators (maybe any of "
> .-_", definitely not ':')
> * Avoid "poisoned" pathnames like PRN and CON
> I think these are good guidelines and agree with recommending them.
> And perhaps we should also make recommendations that are likely to
> increase sanity, such as:
> * Don't use characters that are squashed by the NFKC/NFKD transformation
> (eg the Angstrom character)
> * Don't have control characters in file names
> * Don't mix scripts within a single path component or module identifier
> * Don't start source file names with a dot
> * Use one of the "blessed" file extensions for your source code (we can
> have a big tent of blessed extensions, but naming a C++ source file haha.py
> is just dumb)
> Also good guidelines in my opinion.
I think we could even recommend plain ASCII (or something that can be
mapped to it) in file names, or even a subset of ASCII.
It doesn't take anything away from users IMO, avoids a lot of issues, and is
portable, so the code can actually be shared across platforms.

But the recommendations could come at different levels of strictness:

DON'T have control characters in file names
PREFER only using characters in the ASCII character set
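
As a rough sketch of how a tool might apply such levels to a single path component: the names `NameCheck` and `check_component` below are hypothetical and not part of any proposal. Treating the "poisoned" names PRN and CON mentioned earlier in the thread as DON'T-level is an assumption made for illustration, and the list of such names is deliberately not complete.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical result levels, mirroring the DON'T / PREFER wording above.
enum class NameCheck { Ok, PreferAscii, Rejected };

// Hypothetical helper: classifies one path component (not a whole path).
// The name is treated as a sequence of bytes; no transcoding is attempted.
NameCheck check_component(std::string_view name) {
    bool non_ascii = false;
    for (unsigned char c : name) {
        // DON'T: control characters (the C0 range and DEL).
        if (c < 0x20 || c == 0x7F) return NameCheck::Rejected;
        // PREFER: ASCII only, so remember any byte outside 0x00-0x7F.
        if (c >= 0x80) non_ascii = true;
    }

    // Assumption: reject "poisoned" device names regardless of extension
    // or case. Only the two names cited in this thread are listed here.
    std::string stem(name.substr(0, name.find('.')));
    std::transform(stem.begin(), stem.end(), stem.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    if (stem == "PRN" || stem == "CON") return NameCheck::Rejected;

    return non_ascii ? NameCheck::PreferAscii : NameCheck::Ok;
}
```

A build tool could treat `Rejected` as an error and `PreferAscii` as a warning, which would keep the stricter ASCII recommendation advisory rather than mandatory.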

> To be clear, I'm not suggesting we go as far as the "pitchfork" proposal
> in dictating a project layout. More like discouraging obviously bad things
> that would get you yelled at in code review in basically all non-troll
> projects.
> +1.
> Tom.
> _______________________________________________
> Modules mailing list
> Modules_at_[hidden]
> Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules
> Link to this post: http://lists.isocpp.org/modules/2019/03/0216.php

Received on 2019-03-08 07:35:51