sg15

Re: [Tooling] [isocpp-modules] Dependency information for module-aware build tools

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 7 Mar 2019 19:46:26 +0100
On Thu, 7 Mar 2019 at 16:59 Tom Honermann <tom_at_[hidden]> wrote:

> On 3/7/19 10:30 AM, Ben Boeckel wrote:
> > On Thu, Mar 07, 2019 at 00:15:34 -0500, Tom Honermann wrote:
> >> I don't know of any that use 32-bit code units for file names.
> >>
> >> I find myself thinking (as I so often do these days much to the surprise
> >> of my past self), how do EBCDIC and z/OS fit in here? If we stick to
> >> JSON and require the dependency file to be UTF-8 encoded, would all file
> >> names in these files be raw8 encoded and effectively unreadable (by
> >> humans) on z/OS? Perhaps we could allow more flexibility, but doing so
> >> necessarily invites locales into the discussion (for those that are
> >> unaware, EBCDIC has code pages too). For example, we could require that
> >> the selected locale match between the producers and consumers of the
> >> file (UB if they don't) and permit use of the string representation by
> >> transcoding from the locale interpreted physical file name to UTF-8, but
> >> only if reverse-transcoding produces the same physical file name,
> >> otherwise the appropriate raw format must be used.
> > I first tried saying "treat these strings as if they were byte arrays"
> > with allowances for escaping `"` and `\`, but there was pushback on the
> > previous thread about it. This basically makes a new dialect of JSON
> > which is (usually) an error in existing implementations. It would mean
> > that tools are implementing their own JSON parsers (or even writers)…
>
> This isn't what I was suggesting. Rather, I was suggesting that
> standard UTF-8 encoded JSON be used. On platforms where the
> interpretation of a file name may differ based on locale settings, if
> the file name can be losslessly round-tripped to UTF-8 and back, then
> its UTF-8 encoding (transcoded from the locale) would be used in the
> JSON file as a (well-formed) UTF-8/JSON string, even though that name
> wouldn't reflect the exact code units of the file name.
>
> For example, consider a file name consisting of the octets { 0x86, 0x89,
> 0x93, 0x85, 0x59 }. In EBCDIC code page 37, this denotes a file name
> "fileß", but in EBCDIC code page 273 denotes "file~". The suggestion
> then is, when generating the JSON file, if the current locale setting is
> CP37, to use the UTF-8 encoded name "fileß" as a normal JSON string.
> Tools consuming the file would then have to transcode the UTF-8 provided
> name back to the original locale to open the file.
>
> Previously, I had suggested that the locales must match for the producer
> and consumer and that it be UB otherwise (effectively leading to file
> not found errors). However, I think it would be better to store the
> encoding used to interpret the file name at generation time (if it isn't
> UTF-8) in the file to allow tools to accurately reverse the UTF-8
> encoding. The supported encodings and the spelling of their names
> would, of course, be implementation/platform defined.
>
> >
> > Note that if you'd like to have a readable filename, adding it as a
> > `_readable` key with a human-readable utf-8 transcoding to the filename
> > would be supported (see my message with the JSON schema bits from
> > yesterday).
>
> That seems reasonable to me for file names that really can't be
> represented as UTF-8, but seems like noise otherwise. In other words, I
> think we should try to minimize use of raw8, raw16, etc... where possible.
>

Didn't we already establish that we can't know the encoding of a file name,
so we can't reliably decode it, let alone in a round-trip-safe way, and that
file names therefore can't be treated as anything but bags of bytes?
At least on some platforms?

The only hack I can think of is: assume an encoding using some
platform-dependent heuristic (locale, etc.), round-trip the file name through
UTF-8 and back, and, if the result is not bytewise identical, base64-encode it
and add a _readable key?
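
A minimal sketch of that hack in Python, to make the round-trip check
concrete. Only the `_readable` key comes from this thread; the function name
`encode_filename` and the keys `name`, `encoding` and `name_base64` are
hypothetical, chosen for illustration, and are not part of any proposed schema.
The assumed encoding is passed in by the caller, standing in for whatever
platform heuristic would pick it.

```python
import base64
import json


def encode_filename(raw: bytes, assumed_encoding: str) -> dict:
    """Build a JSON-ready entry for a file name given as raw bytes.

    If the bytes survive decode -> UTF-8 -> decode -> re-encode unchanged,
    emit a normal string plus the encoding used to interpret it. Otherwise
    fall back to base64 of the raw bytes plus a lossy `_readable` hint.
    """
    try:
        text = raw.decode(assumed_encoding)
        # Pass through UTF-8 (the JSON file's encoding) and back again.
        utf8_bytes = text.encode("utf-8")
        restored = utf8_bytes.decode("utf-8").encode(assumed_encoding)
        if restored == raw:
            return {"name": text, "encoding": assumed_encoding}
    except UnicodeError:
        # Undecodable (or unencodable) under the assumed encoding.
        pass
    return {
        "name_base64": base64.b64encode(raw).decode("ascii"),
        # Best effort only; must not be used to open the file.
        "_readable": raw.decode(assumed_encoding, errors="replace"),
    }


# Tom's example octets, interpreted as EBCDIC code page 37.
ebcdic_name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
print(json.dumps(encode_filename(ebcdic_name, "cp037"), ensure_ascii=False))

# Bytes that are not valid UTF-8 take the base64 fallback.
print(json.dumps(encode_filename(b"\xff\xfe", "utf-8")))
```

Note that a successful round trip only shows the guess was self-consistent, not
that it was right: the same five octets also round-trip under code page 273,
just with a different readable name.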


>
> Tom.
>
> >
> > --Ben
>
>

Received on 2019-03-07 19:46:39