Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Lyberta <lyberta_at_[hidden]>
Date: Fri, 06 Sep 2019 15:44:00 +0000
Niall Douglas:
> (I might add that I don't think WTF valid in RFC conforming JSON.
> Strings are in UTF, or they are not JSON strings and need to be byte
> arrays. The only RFC compliant way of storing potentially invalid UTF
> strings is as a byte array, to my best knowledge).

I said an array of numbers. The numbers can be 8-bit for ASCII and UTF-8,
16-bit for UTF-16, WTF-16 and others, and 32-bit for UTF-32.

{
 "encoding":"WTF-16",
 "units":[ 84, 101, 115, 116 ]
}

This is how you would encode the path "Test" on NTFS, as an example.
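
A minimal Python sketch of producing such a record follows. The helper name path_to_units is hypothetical, and the record layout is only the one shown in the example above, not part of any published format. WTF-16 is approximated with Python's "surrogatepass" error handler, which lets unpaired surrogates through as 16-bit units.

```python
import json
import struct


def path_to_units(path: str, encoding: str) -> dict:
    """Hypothetical helper: a path as code units plus encoding metadata."""
    if encoding in ("ASCII", "UTF-8"):
        # 8-bit code units
        units = list(path.encode(encoding.lower()))
    elif encoding in ("UTF-16", "WTF-16"):
        # 16-bit code units; WTF-16 additionally allows lone surrogates
        errors = "surrogatepass" if encoding == "WTF-16" else "strict"
        raw = path.encode("utf-16-le", errors)
        units = list(struct.unpack(f"<{len(raw) // 2}H", raw))
    elif encoding == "UTF-32":
        # 32-bit code units
        raw = path.encode("utf-32-le")
        units = list(struct.unpack(f"<{len(raw) // 4}I", raw))
    else:
        raise ValueError(f"unsupported encoding: {encoding}")
    return {"encoding": encoding, "units": units}


print(json.dumps(path_to_units("Test", "WTF-16")))
```

A name containing an unpaired surrogate, which NTFS permits, would still round-trip this way, whereas it could not be stored as a conforming JSON string.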

If such JSON were opened on z/OS with EBCDIC, the WTF-16 units would
be conceptually converted to this:

{
 "encoding":"EBCDIC",
 "units":[ 227, 133, 162, 163 ]
}
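
The conceptual conversion could be sketched in Python roughly as below. The message does not say which EBCDIC code page is meant, so Python's "cp037" codec is used here purely as an assumption, and convert_units is a hypothetical helper name. Units that do not form well-formed UTF-16 have no EBCDIC equivalent, so the sketch lets the decode step raise in that case.

```python
import struct


def convert_units(record: dict, target_codec: str = "cp037") -> dict:
    """Hypothetical helper: re-encode a 16-bit unit record as EBCDIC units."""
    if record["encoding"] not in ("UTF-16", "WTF-16"):
        raise ValueError("this sketch only handles 16-bit source encodings")
    units = record["units"]
    # Pack the 16-bit units back into little-endian bytes
    raw = struct.pack(f"<{len(units)}H", *units)
    # Strict decode: a lone surrogate raises UnicodeDecodeError here
    text = raw.decode("utf-16-le")
    # Characters missing from the code page raise UnicodeEncodeError
    return {"encoding": "EBCDIC", "units": list(text.encode(target_codec))}


print(convert_units({"encoding": "WTF-16", "units": [84, 101, 115, 116]}))
```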

> That's called URL.
>
> These two files in my filesystem:
>
> $ ls -1ib /tmp/*.c
> 5303210 /tmp/\351.c
> 5303209 /tmp/é.c
>
> Are uniquely identified by these normalised IRIs:
> file:///tmp/%E9.c
> file:///tmp/é.c
>
> According to RFC 3987, é is the same as %C3%A9.

RFC 3987 percent-encodes UTF-8 code units, which means it is only as
useful as UTF-8 itself. My proposal supports EBCDIC and other
non-ASCII-compatible encodings by transferring numbers plus metadata.
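
To illustrate the two file names quoted above, here is a small Python check using the standard urllib.parse module. The helper is_valid_utf8 is a hypothetical name introduced for this sketch. It shows that "é" percent-encodes to its UTF-8 bytes, while the %E9 name is a single byte that is not valid UTF-8 on its own.

```python
from urllib.parse import quote, unquote_to_bytes


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# quote() percent-encodes the UTF-8 bytes of the string by default
print(quote("é"))

# The other file name is the single byte 0xE9
raw = unquote_to_bytes("%E9")
print(raw, is_valid_utf8(raw))
```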

That said, I think using UTF-8-only JSON strings for paths should cover
99% of cases, and since the format in question is versioned, we can
always add non-UTF-8 paths later.


Received on 2019-09-06 17:44:35