On 3/7/19 1:46 PM, Corentin wrote:


On Thu, 7 Mar 2019 at 16:59 Tom Honermann <tom@honermann.net> wrote:
On 3/7/19 10:30 AM, Ben Boeckel wrote:
> On Thu, Mar 07, 2019 at 00:15:34 -0500, Tom Honermann wrote:
>> I don't know of any that use 32-bit code units for file names.
>>
>> I find myself thinking (as I so often do these days much to the surprise
>> of my past self), how do EBCDIC and z/OS fit in here? If we stick to
>> JSON and require the dependency file to be UTF-8 encoded, would all file
>> names in these files be raw8 encoded and effectively unreadable (by
>> humans) on z/OS?  Perhaps we could allow more flexibility, but doing so
>> necessarily invites locales into the discussion (for those that are
>> unaware, EBCDIC has code pages too).  For example, we could require that
>> the selected locale match between the producers and consumers of the
>> file (UB if they don't) and permit use of the string representation by
>> transcoding from the locale interpreted physical file name to UTF-8, but
>> only if reverse-transcoding produces the same physical file name,
>> otherwise the appropriate raw format must be used.
> I first tried saying "treat these strings as if they were byte arrays"
> with allowances for escaping `"` and `\`, but there was pushback on the
> previous thread about it. This basically makes a new dialect of JSON
> which is (usually) an error in existing implementations. It would mean
> that tools are implementing their own JSON parsers (or even writers)…

This isn't what I was suggesting.  Rather, I was suggesting that
standard UTF-8 encoded JSON be used, but that, on platforms where the
interpretation of the file name may differ based on locale settings, if
the file name can be losslessly round-tripped to UTF-8 and back, the
UTF-8 encoding of it (transcoded from the locale) be used in the JSON
file as a (well-formed) UTF-8/JSON string, even though that name
wouldn't reflect the exact code units of the file name.

For example, consider a file name consisting of the octets { 0x86,
0x89, 0x93, 0x85, 0x59 }.  In EBCDIC code page 37, this denotes the
file name "fileß", but in EBCDIC code page 273 it denotes "file~".
The suggestion, then, is that when generating the JSON file, if the
current locale setting is CP37, the UTF-8 encoded name "fileß" is used
as a normal JSON string.  Tools consuming the file would then have to
transcode the UTF-8 provided name back to the original locale to open
the file.
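The example above can be reproduced as a minimal sketch, assuming Python's bundled cp037/cp273 codecs match the code pages named here:

```python
# Octets of the file name as stored on disk (z/OS, EBCDIC).
raw_name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])

# The same octets denote different names under different code pages.
assert raw_name.decode("cp037") == "fileß"   # EBCDIC code page 37
assert raw_name.decode("cp273") == "file~"   # EBCDIC code page 273

# If the generator's locale says CP37, the UTF-8/JSON string is "fileß";
# a consumer must transcode it back to CP37 to recover the original octets.
json_name = raw_name.decode("cp037")          # a str, i.e. Unicode
assert json_name.encode("cp037") == raw_name  # lossless round trip
```

The round-trip check in the last line is what makes the transcoded spelling safe to store: if re-encoding did not reproduce the original octets, the name would have to fall back to a raw representation instead.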

Previously, I had suggested that the locales must match for the producer
and consumer and that it be UB otherwise (effectively leading to file
not found errors).  However, I think it would be better to store the
encoding used to interpret the file name at generation time (if it isn't
UTF-8) in the file to allow tools to accurately reverse the UTF-8
encoding.  The supported encodings and the spelling of their names
would, of course, be implementation/platform defined.
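A generator following that suggestion might behave roughly like the sketch below. This is illustrative only: the key names ("source-path", "encoding") and the codec spelling are assumptions, not from any agreed schema.

```python
import json

# Hypothetical generator side: the locale's encoding, detected at
# generation time, is recorded alongside the transcoded name.
locale_encoding = "cp037"
raw_name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])

entry = {
    "source-path": raw_name.decode(locale_encoding),  # "fileß" as a JSON string
    "encoding": locale_encoding,   # recorded so consumers can reverse the transcoding
}
doc = json.dumps(entry, ensure_ascii=False)

# Hypothetical consumer side: recover the original octets by reversing
# the UTF-8 encoding through the recorded locale encoding.
loaded = json.loads(doc)
assert loaded["source-path"].encode(loaded["encoding"]) == raw_name
```

Recording the encoding in the file removes the need for the producer's and consumer's locales to match, which is the point of the suggestion.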

>
> Note that if you'd like to have a readable filename, adding it as a
> `_readable` key with a human-readable utf-8 transcoding to the filename
> would be supported (see my message with the JSON schema bits from
> yesterday).

That seems reasonable to me for file names that really can't be
represented as UTF-8, but seems like noise otherwise.  In other words, I
think we should try to minimize use of raw8, raw16, etc., where possible.

Didn't we realize that we can't know the encoding of a filename, and so
cannot reliably decode it, much less in a round-trip-safe way, and that
as such filenames can't be anything but bags of bytes?  At least on some
platforms?

Yes.  However, we do routinely present file names to humans, and that
requires interpreting them according to some encoding.  The challenge,
of course, is choosing an encoding, and deciding how to present code
unit sequences that are not valid in that encoding.

The only hack I can think of is: assume an encoding with some
platform-dependent heuristic (locale, etc.), round trip the filename
through UTF-8 and back, and if it's not bytewise identical, base64
encode it and add a _readable key?

Exactly (whether the fallback is base64, raw8, or raw16 isn't significant).  This is the approach I was trying to describe; you did a better job of doing so :)
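That heuristic could be sketched as follows; the function and key names are illustrative, and base64 stands in for whichever raw fallback is chosen:

```python
import base64

def encode_filename(raw: bytes, assumed_encoding: str) -> dict:
    """Round-trip raw filename bytes through the assumed encoding;
    fall back to a base64 spelling if the trip isn't byte-identical
    (or the bytes don't decode at all)."""
    try:
        decoded = raw.decode(assumed_encoding)
        if decoded.encode(assumed_encoding) == raw:
            return {"path": decoded}       # safe: lossless round trip
    except UnicodeError:
        pass                               # bytes invalid in the assumed encoding
    return {
        "path-base64": base64.b64encode(raw).decode("ascii"),
        "_readable": raw.decode(assumed_encoding, errors="replace"),
    }

# A well-formed name round-trips; a lone 0xFF byte is not valid UTF-8,
# so it falls back to base64 with a best-effort human-readable key.
assert encode_filename(b"file.cpp", "utf-8") == {"path": "file.cpp"}
assert "path-base64" in encode_filename(b"\xff", "utf-8")
```

The round-trip comparison is the important part: a name is only stored as a plain JSON string when re-encoding it reproduces the exact original octets.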

Tom.

 


>
> --Ben


_______________________________________________
Modules mailing list
Modules@lists.isocpp.org
Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules
Link to this post: http://lists.isocpp.org/modules/2019/03/0204.php

_______________________________________________
Modules mailing list
Modules@lists.isocpp.org
Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules
Link to this post: http://lists.isocpp.org/modules/2019/03/0210.php