Re: [Tooling] [isocpp-modules] Dependency information for module-aware build tools

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 8 Mar 2019 00:03:23 -0500
On 3/7/19 1:46 PM, Corentin wrote:
> On Thu, 7 Mar 2019 at 16:59, Tom Honermann <tom_at_[hidden] <mailto:tom_at_[hidden]>> wrote:
> On 3/7/19 10:30 AM, Ben Boeckel wrote:
> > On Thu, Mar 07, 2019 at 00:15:34 -0500, Tom Honermann wrote:
> >> I don't know of any that use 32-bit code units for file names.
> >>
> >> I find myself thinking (as I so often do these days, much to the
> >> surprise of my past self), how do EBCDIC and z/OS fit in here? If we
> >> stick to JSON and require the dependency file to be UTF-8 encoded,
> >> would all file names in these files be raw8 encoded and effectively
> >> unreadable (by humans) on z/OS? Perhaps we could allow more
> >> flexibility, but doing so necessarily invites locales into the
> >> discussion (for those that are unaware, EBCDIC has code pages too).
> >> For example, we could require that the selected locale match between
> >> the producers and consumers of the file (UB if they don't) and
> >> permit use of the string representation by transcoding from the
> >> locale-interpreted physical file name to UTF-8, but only if
> >> reverse-transcoding produces the same physical file name; otherwise
> >> the appropriate raw format must be used.
> > I first tried saying "treat these strings as if they were byte
> > arrays" with allowances for escaping `"` and `\`, but there was
> > pushback on the previous thread about it. This basically makes a new
> > dialect of JSON which is (usually) an error in existing
> > implementations. It would mean that tools are implementing their own
> > JSON parsers (or even writers)…
> This isn't what I was suggesting. Rather, I was suggesting that
> standard UTF-8 encoded JSON be used, but that, for platforms where the
> interpretation of the filename may differ based on locale settings, if
> the file name can be losslessly round-tripped to UTF-8 and back, the
> UTF-8 encoding of it (transcoded from the locale) be used in the JSON
> file as a (well-formed) UTF-8/JSON string, even though that name
> wouldn't reflect the exact code units of the file name.
> For example, consider a file name consisting of the octets { 0x86,
> 0x89, 0x93, 0x85, 0x59 }. In EBCDIC code page 37, this denotes a file
> name "fileß", but in EBCDIC code page 273 it denotes "file~". The
> suggestion then is, when generating the JSON file with the current
> locale setting of CP37, to use the UTF-8 encoded name "fileß" as a
> normal JSON string. Tools consuming the file would then have to
> transcode the UTF-8-provided name back to the original locale to open
> the file.
> Previously, I had suggested that the locales must match for the
> producer and consumer and that it be UB otherwise (effectively leading
> to file-not-found errors). However, I think it would be better to
> store, in the file, the encoding used to interpret the file name at
> generation time (if it isn't UTF-8), to allow tools to accurately
> reverse the UTF-8 encoding. The supported encodings and the spelling
> of their names would, of course, be implementation/platform defined.
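
The code-page ambiguity described above is easy to reproduce with Python's standard EBCDIC codecs (a sketch for illustration, not part of the thread's tooling); the octets are the ones from the example:

```python
# The same byte sequence decodes to different names under EBCDIC code
# pages 37 and 273 (Python ships codecs for both).
raw = bytes([0x86, 0x89, 0x93, 0x85, 0x59])

print(raw.decode("cp037"))  # fileß
print(raw.decode("cp273"))  # file~

# The proposed lossless round-trip check: decode with the locale's
# encoding, re-encode, and compare code units before emitting the
# readable form into the JSON file.
name = raw.decode("cp037")
assert name.encode("cp037") == raw  # safe to emit "fileß" as a JSON string
```

Because every EBCDIC codec maps all 256 byte values, the decode itself never fails here; the round-trip comparison is what guards against lossy mappings on other encodings.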
> >
> > Note that if you'd like to have a readable filename, adding it as a
> > `_readable` key with a human-readable UTF-8 transcoding of the
> > filename would be supported (see my message with the JSON schema
> > bits from yesterday).
> That seems reasonable to me for file names that really can't be
> represented as UTF-8, but seems like noise otherwise. In other words,
> I think we should try to minimize use of raw8, raw16, etc. where
> possible.
> Didn't we realize that we can't know the encoding of a filename, and
> so we cannot reliably decode it (much less in a round-trip-safe way),
> and that, as such, filenames can't be anything but bags of bytes? At
> least on some platforms?
Yes. However, we do routinely present file names to humans, and that
requires interpreting them according to some encoding. The challenge,
of course, is choosing an encoding, and deciding how to present code
unit sequences that are not valid in that encoding.
> The only hack I can think of is: assume an encoding with some
> platform-dependent heuristic (locale, etc.), round-trip the filename
> through UTF-8 and back; if it's not bytewise identical, base64 encode
> it and add a _readable key?

Exactly (whether the fallback is base64, raw8, or raw16 isn't
significant). This is the approach I was trying to describe; you did a
better job of doing so :)
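
The agreed heuristic can be sketched as follows; the key names (`name`, `raw8`, `_readable`) and the base64 choice are illustrative assumptions, not the schema from the thread:

```python
import base64

def json_filename_entry(raw: bytes, locale_encoding: str) -> dict:
    """Sketch of the round-trip heuristic: decode the raw file name
    using the platform/locale encoding and emit a plain JSON string
    only if the result re-encodes to the identical bytes; otherwise
    fall back to an encoded raw form plus a best-effort readable hint.
    (Key names here are hypothetical, not the thread's schema.)"""
    try:
        text = raw.decode(locale_encoding)
        if text.encode(locale_encoding) == raw:  # lossless round trip
            return {"name": text}
    except UnicodeDecodeError:
        pass
    return {
        "raw8": base64.b64encode(raw).decode("ascii"),
        "_readable": raw.decode(locale_encoding, errors="replace"),
    }
```

Under these assumptions, `json_filename_entry(bytes([0x86, 0x89, 0x93, 0x85, 0x59]), "cp037")` yields the readable form `{"name": "fileß"}`, while bytes that fail the round trip fall back to the raw/base64 representation with a `_readable` hint.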


> Tom.
> >
> > --Ben
> _______________________________________________
> Modules mailing list
> Modules_at_[hidden] <mailto:Modules_at_[hidden]>
> Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules
> Link to this post: http://lists.isocpp.org/modules/2019/03/0204.php

Received on 2019-03-08 06:03:27