On 3/7/19 1:46 PM, Corentin wrote:


On Thu, 7 Mar 2019 at 16:59 Tom Honermann <tom@honermann.net> wrote:
On 3/7/19 10:30 AM, Ben Boeckel wrote:
> On Thu, Mar 07, 2019 at 00:15:34 -0500, Tom Honermann wrote:
>> I don't know of any that use 32-bit code units for file names.
>>
>> I find myself thinking (as I so often do these days much to the surprise
>> of my past self), how do EBCDIC and z/OS fit in here? If we stick to
>> JSON and require the dependency file to be UTF-8 encoded, would all file
>> names in these files be raw8 encoded and effectively unreadable (by
>> humans) on z/OS?  Perhaps we could allow more flexibility, but doing so
>> necessarily invites locales into the discussion (for those that are
>> unaware, EBCDIC has code pages too).  For example, we could require that
>> the selected locale match between the producers and consumers of the
>> file (UB if they don't) and permit use of the string representation by
>> transcoding from the locale interpreted physical file name to UTF-8, but
>> only if reverse-transcoding produces the same physical file name,
>> otherwise the appropriate raw format must be used.
> I first tried saying "treat these strings as if they were byte arrays"
> with allowances for escaping `"` and `\`, but there was pushback on the
> previous thread about it. This basically makes a new dialect of JSON
> which is (usually) an error in existing implementations. It would mean
> that tools are implementing their own JSON parsers (or even writers)…

This isn't what I was suggesting.  Rather, I was suggesting that
standard UTF-8 encoded JSON be used, but that, on platforms where the
interpretation of the file name may differ based on locale settings, if
the file name can be losslessly round-tripped to UTF-8 and back, the
UTF-8 encoding of it (transcoded from the locale) be used in the JSON
file as a (well-formed) UTF-8/JSON string, even though that name
wouldn't reflect the exact code units of the file name.

For example, consider a file name consisting of the octets { 0x86,
0x89, 0x93, 0x85, 0x59 }.  In EBCDIC code page 37, this denotes the
file name "fileß", but in EBCDIC code page 273 it denotes "file~".
The suggestion, then, is that when generating the JSON file, if the
current locale setting is CP37, the UTF-8 encoded name "fileß" is used
as a normal JSON string.  Tools consuming the file would then have to
transcode the UTF-8 provided name back to the original locale to open
the file.
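The example above can be reproduced as a minimal sketch, assuming Python's bundled cp037/cp273 codecs match the code pages named here:

```python
# Octets of the file name as stored on disk (z/OS, EBCDIC).
raw_name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])

# The same octets denote different names under different code pages.
assert raw_name.decode("cp037") == "fileß"   # EBCDIC code page 37
assert raw_name.decode("cp273") == "file~"   # EBCDIC code page 273

# If the generator's locale says CP37, the UTF-8/JSON string is "fileß";
# a consumer must transcode it back to CP37 to recover the original octets.
json_name = raw_name.decode("cp037")          # a str, i.e. Unicode
assert json_name.encode("cp037") == raw_name  # lossless round trip
```

The round-trip check in the last line is what makes the transcoded spelling safe to store: if re-encoding did not reproduce the original octets, the name would have to fall back to a raw representation instead.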

Previously, I had suggested that the locales must match for the producer
and consumer and that it be UB otherwise (effectively leading to file
not found errors).  However, I think it would be better to store the
encoding used to interpret the file name at generation time (if it isn't
UTF-8) in the file to allow tools to accurately reverse the UTF-8
encoding.  The supported encodings and the spelling of their names
would, of course, be implementation/platform defined.
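A generator following that suggestion might behave roughly like the sketch below. This is illustrative only: the key names ("source-path", "encoding") and the codec spelling are assumptions, not from any agreed schema.

```python
import json

# Hypothetical generator side: the locale's encoding, detected at
# generation time, is recorded alongside the transcoded name.
locale_encoding = "cp037"
raw_name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])

entry = {
    "source-path": raw_name.decode(locale_encoding),  # "fileß" as a JSON string
    "encoding": locale_encoding,   # recorded so consumers can reverse the transcoding
}
doc = json.dumps(entry, ensure_ascii=False)

# Hypothetical consumer side: recover the original octets by reversing
# the UTF-8 encoding through the recorded locale encoding.
loaded = json.loads(doc)
assert loaded["source-path"].encode(loaded["encoding"]) == raw_name
```

Recording the encoding in the file removes the need for the producer's and consumer's locales to match, which is the point of the suggestion.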

>
> Note that if you'd like to have a readable filename, adding it as a
> `_readable` key with a human-readable utf-8 transcoding to the filename
> would be supported (see my message with the JSON schema bits from
> yesterday).

That seems reasonable to me for file names that really can't be
represented as UTF-8, but seems like noise otherwise.  In other words, I
think we should try to minimize use of raw8, raw16, etc., where possible.

Didn't we realize that we can't know the encoding of a filename, and so
cannot reliably decode it, much less in a round-trip-safe way, and that
as such filenames can't be anything but bags of bytes?  At least on some
platforms?

Yes.  However, we do routinely present file names to humans, and that
requires interpreting them according to some encoding.  The challenge,
of course, is choosing an encoding, and deciding how to present code
unit sequences that are not valid in that encoding.

The only hack I can think of is: assume an encoding with some
platform-dependent heuristic (locale, etc.), round trip the filename
through UTF-8 and back, and if it's not bytewise identical, base64
encode it and add a _readable key?

Exactly (whether the fallback is base64, raw8, or raw16 isn't significant).  This is the approach I was trying to describe; you did a better job of doing so :)
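That heuristic could be sketched as follows; the function and key names are illustrative, and base64 stands in for whichever raw fallback is chosen:

```python
import base64

def encode_filename(raw: bytes, assumed_encoding: str) -> dict:
    """Round-trip raw filename bytes through the assumed encoding;
    fall back to a base64 spelling if the trip isn't byte-identical
    (or the bytes don't decode at all)."""
    try:
        decoded = raw.decode(assumed_encoding)
        if decoded.encode(assumed_encoding) == raw:
            return {"path": decoded}       # safe: lossless round trip
    except UnicodeError:
        pass                               # bytes invalid in the assumed encoding
    return {
        "path-base64": base64.b64encode(raw).decode("ascii"),
        "_readable": raw.decode(assumed_encoding, errors="replace"),
    }

# A well-formed name round-trips; a lone 0xFF byte is not valid UTF-8,
# so it falls back to base64 with a best-effort human-readable key.
assert encode_filename(b"file.cpp", "utf-8") == {"path": "file.cpp"}
assert "path-base64" in encode_filename(b"\xff", "utf-8")
```

The round-trip comparison is the important part: a name is only stored as a plain JSON string when re-encoding it reproduces the exact original octets.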

Tom.

 


>
> --Ben


_______________________________________________
Modules mailing list
Modules@lists.isocpp.org
Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules
Link to this post: http://lists.isocpp.org/modules/2019/03/0204.php

_______________________________________________
Modules mailing list
Modules@lists.isocpp.org
Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules
Link to this post: http://lists.isocpp.org/modules/2019/03/0210.php