Date: Fri, 6 Sep 2019 15:37:21 +0100
On 06/09/2019 14:28, Lyberta wrote:
>> My reading of their paper was that they want to encode non-UTF8
>> sequences into UTF8 paths in a JSON file. I don't think they should take
>> that path, because it loses too much information.
>
> What if the non-UTF8 part were stored in the same way HTML encodings handle it?
> So we would have a UTF-8 string naming the encoding, such as "WTF-16"
> for NTFS, and an array of numbers that are "abstract units of text" (code
> units for UTF, characters for US-ASCII; not sure about other encodings).
This does not handle the example situation I gave where ANSI and Unicode
Win32 programs both work on the same JSON file. Programs are permitted
by the standard to experience a different native filesystem encoding every
time they are executed, including native filesystem encodings with no
valid representation in UTF-8.
(I might add that Windows ANSI APIs actually have two native filesystem
encodings available; one can switch between them for the
current thread at runtime.)
If the JSON file wishes to be as impervious as possible to such
issues, it must store both the native filesystem encoding as it itself
experienced it at the time, and its best attempt at that time at converting
that native filesystem encoding to a common encoding, e.g. UTF-8.
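To make that concrete, here is a minimal sketch of such a record (the key
names are hypothetical, chosen purely for illustration):

  {
    "path": {
      "native_encoding": "CP1252",
      "native_code_units": [67, 58, 92, 116, 109, 112, 92, 255],
      "utf8_best_effort": "C:\\tmp\\\u00ff"
    }
  }

A reader on a system with the same native encoding can reconstruct the
original byte sequence exactly from the code-unit array; everyone else
falls back to the best-effort UTF-8 field, which may be lossy or absent
if conversion failed at write time.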
This won't solve all corner-case interoperation issues, but it'll handle
far more than relying on UTF-8 exclusively.
(I might add that I don't think WTF encodings are valid in RFC-conforming
JSON. Strings are in UTF, or they are not JSON strings and need to be byte
arrays. The only RFC-compliant way of storing potentially invalid UTF
strings is as a byte array, to my best knowledge.)
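For instance, a WTF-16 path containing an unpaired surrogate cannot be
carried as a well-formed JSON string, but its code units survive intact
as a plain number array (again a sketch, with invented field names):

  {
    "encoding": "WTF-16",
    "code_units": [97, 55296, 98]
  }

Here 55296 is 0xD800, an unpaired high surrogate: representable as a
JSON number, but not as valid UTF inside a JSON string.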
Niall