Date: Fri, 8 Mar 2019 00:04:17 -0500
On 3/7/19 3:00 PM, Richard Smith via Modules wrote:
> On Thu, Mar 7, 2019 at 10:46 AM Corentin <corentin.jabot_at_[hidden]> wrote:
>
> On Thu, 7 Mar 2019 at 16:59 Tom Honermann <tom_at_[hidden]> wrote:
>
> On 3/7/19 10:30 AM, Ben Boeckel wrote:
> > On Thu, Mar 07, 2019 at 00:15:34 -0500, Tom Honermann wrote:
> >> I don't know of any that use 32-bit code units for file names.
> >>
> >> I find myself thinking (as I so often do these days, much to the
> >> surprise of my past self), how do EBCDIC and z/OS fit in here? If we
> >> stick to JSON and require the dependency file to be UTF-8 encoded,
> >> would all file names in these files be raw8 encoded and effectively
> >> unreadable (by humans) on z/OS? Perhaps we could allow more
> >> flexibility, but doing so necessarily invites locales into the
> >> discussion (for those that are unaware, EBCDIC has code pages too).
> >> For example, we could require that the selected locale match between
> >> the producers and consumers of the file (UB if they don't) and permit
> >> use of the string representation by transcoding from the
> >> locale-interpreted physical file name to UTF-8, but only if
> >> reverse-transcoding produces the same physical file name; otherwise
> >> the appropriate raw format must be used.
> > I first tried saying "treat these strings as if they were byte arrays"
> > with allowances for escaping `"` and `\`, but there was pushback on
> > the previous thread about it. This basically makes a new dialect of
> > JSON which is (usually) an error in existing implementations. It
> > would mean that tools are implementing their own JSON parsers (or
> > even writers)…
>
> This isn't what I was suggesting. Rather, I was suggesting that
> standard UTF-8 encoded JSON be used, but that, for platforms where the
> interpretation of the filename may differ based on locale settings, if
> the file name can be losslessly round-tripped to UTF-8 and back, the
> UTF-8 encoding of it (transcoded from the locale) be used in the JSON
> file as a (well-formed) UTF-8/JSON string, even though that name
> wouldn't reflect the exact code units of the file name.
>
> For example, consider a file name consisting of the octets { 0x86,
> 0x89, 0x93, 0x85, 0x59 }. In EBCDIC code page 37, this denotes a file
> name "fileß", but in EBCDIC code page 273 it denotes "file~". The
> suggestion, then, is, when generating the JSON file, if the current
> locale setting is CP37, to use the UTF-8 encoded name "fileß" as a
> normal JSON string. Tools consuming the file would then have to
> transcode the UTF-8 provided name back to the original locale to open
> the file.
>
> Previously, I had suggested that the locales must match for the
> producer and consumer and that it be UB otherwise (effectively leading
> to file-not-found errors). However, I think it would be better to
> store the encoding used to interpret the file name at generation time
> (if it isn't UTF-8) in the file, to allow tools to accurately reverse
> the UTF-8 encoding. The supported encodings and the spelling of their
> names would, of course, be implementation/platform defined.
>
> >
> > Note that if you'd like to have a readable filename, adding it as a
> > `_readable` key with a human-readable UTF-8 transcoding of the
> > filename would be supported (see my message with the JSON schema
> > bits from yesterday).
>
> That seems reasonable to me for file names that really can't be
> represented as UTF-8, but seems like noise otherwise. In other words,
> I think we should try to minimize use of raw8, raw16, etc. where
> possible.
>
>
> Didn't we realize that we can't know the encoding of a filename, and
> so we cannot reliably decode it (much less in a round-trip-safe way),
> and that as such filenames can't be anything but bags of bytes? At
> least on some platforms?
>
> The only hack I can think of is: assume an encoding with some
> platform-dependent heuristic (locale, etc.), round-trip the filename
> through UTF-8 and back, and if it's not bytewise identical, base64
> encode it and add a _readable key?
>
>
> As far as I'm aware (but someone please correct me if z/OS or similar
> adds another wrinkle), there are exactly three cases we need to deal with:
>
> 1) Platform paths are Unicode, encoded in UTF-8 in a specific
> normalization form. The OS normalizes, possibly case-folds, and
> rejects invalid encodings. (eg, Mac OS)
> 2) Platform paths are arbitrary sequences of 8-bit values, with some
> reserved patterns (eg, no embedded nul bytes), and no guaranteed
> intrinsic meaning or encoding. There may be a platform convention for
> encoding, but it is not enforced. (eg, Linux)
> 3) Platform paths are arbitrary sequences of 16-bit values, with some
> reserved patterns (eg, no embedded nul values, some reserved
> characters), and no guaranteed intrinsic meaning or encoding. There
> may be a platform convention for encoding, but it is not enforced.
> (eg, Windows)
>
> Case 1 is easy: paths are UTF-8, so we can represent them as Unicode
> strings.
> Case 3 is mostly easy: paths are by strong convention UTF-16, so we
> can represent them as Unicode strings when they are valid, and fall
> back to raw16 in the very rare remaining cases.
> Case 2 is trickier: while many are using UTF-8 as their convention for
> file name encoding, it is not a universally-adopted convention.
> Nonetheless, I propose we do the same as in Case 3: if the file name
> happens to be a valid UTF-8 encoding, assume that it is in fact UTF-8
> and present the path as a Unicode string. Otherwise, fall back to raw8.
>
> Does that work well enough in practice? (Remember that these files are
> primarily for communication between tools, not for humans to read.)
I think the above works well for ASCII-based platforms in most cases,
though with some surprising or unfortunate results for Shift-JIS and
GB18030 users.
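
To make that concrete, here is a minimal sketch (in Python; the "raw8"
spelling is illustrative, not whatever the schema ends up defining) of
the case 2 heuristic Richard describes, applied to a Shift-JIS name:

    def json_path_entry(path_bytes):
        try:
            # Bytes that happen to be valid UTF-8 are presented as an
            # ordinary JSON string value...
            return path_bytes.decode("utf-8")
        except UnicodeDecodeError:
            # ...everything else falls back to raw code units.
            return {"raw8": list(path_bytes)}

    # This Shift-JIS name starts with 0x83, which UTF-8 can only read
    # as a stray continuation byte, so it takes the raw8 path and is
    # unreadable in the output:
    sjis_name = "テスト.cpp".encode("shift_jis")
    json_path_entry(sjis_name)
    # -> {'raw8': [131, 101, 131, 88, 131, 103, 46, 99, 112, 112]}
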
The point about this format being more for tools than humans is well
taken. Though I'm bringing up the topic in this thread, I'm more
concerned about it for module map file formats (which we haven't
discussed yet).
The wrinkle with z/OS is that, while it fits the case 2 model, no file
names will be (intended to be) UTF-8 encoded and every file name would
end up represented with the raw8 format. In theory, an EBCDIC-encoded
file name can have code units that form a valid UTF-8 sequence (which
would result in a file name in the JSON file that doesn't look at all
like the intended file name), but since all of the non-accented
alphanumeric characters in EBCDIC have values above 0x7F, the chance of
forming a valid UTF-8 sequence is low.
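
For what it's worth, this is easy to demonstrate with Python, whose
standard codecs happen to include the EBCDIC code pages discussed
above; the example name from earlier in the thread is not valid UTF-8:

    name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
    name.decode("cp037")   # 'fileß'
    name.decode("cp273")   # 'file~'
    name.decode("utf-8")   # raises UnicodeDecodeError: 0x86 can only
                           # be a continuation byte, so this name would
                           # land in raw8
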
We could take the approach of emitting both a display name (with
implementation-dependent QoI, essentially Ben's "_readable" suggestion)
and a raw code unit sequence. But doing that well invites locales back
into the picture, so perhaps just dealing with locales is the better
path forward anyway.
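
For reference, if we did emit both, an entry might look something like
the following sketch (key names hypothetical: "_readable" per Ben's
schema message, "raw8" again illustrative), assuming the generating
tool knows or can guess the locale codec:

    import json

    def dual_entry(path_bytes, codec):
        entry = {"raw8": list(path_bytes)}
        try:
            # Best-effort display form only; consumers would use the
            # raw code units to actually open the file.
            entry["_readable"] = path_bytes.decode(codec)
        except UnicodeDecodeError:
            pass  # no readable form available in this locale
        return json.dumps(entry, ensure_ascii=False)

    dual_entry(bytes([0x86, 0x89, 0x93, 0x85, 0x59]), "cp037")
    # -> {"raw8": [134, 137, 147, 133, 89], "_readable": "fileß"}
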
Tom.
Received on 2019-03-08 06:04:21