Date: Tue, 30 Apr 2019 11:53:35 -0400
On 4/30/19 11:17 AM, Thiago Macieira wrote:
> On Tuesday, 30 April 2019 07:49:50 PDT Tom Honermann wrote:
>> Can you elaborate on this? What do you mean by the "kernel assuming
>> your userspace is UTF-8"? Do you mean that the filesystem driver will,
>> by default, present file names composed of 16-bit code units
>> transcoded to UTF-8? Given that file names do not have an
>> explicit encoding, this seems reasonable to me and even necessary to
>> avoid name conflicts from otherwise lossy transcoding operations.
> That's exactly what I meant. Both VFAT and NTFS store filenames in UTF-16,
Except that well-formed UTF-16 isn't enforced; lone and reversed
surrogates are permitted, presumably as a holdover from the UCS-2 days.
> so
> the kernel must translate to and from that to some 8-bit encoding chosen at
> mount time so those names can be presented to userspace. Actually, the driver
> must translate because the *kernel* VFS layer requires 8-bit filenames anyway.
Indeed.
> This means filenames on VFAT and NTFS *do* have an encoding. You cannot use
> arbitrary binary file names since those wouldn't convert to UTF-16 and
> couldn't be saved.
This is not quite correct. Windows, at least, does permit creating
files with names that are invalid UTF-16 as mentioned above. This
allows arbitrary binary file names, just with 16-bit code units.
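
To make "invalid UTF-16" concrete, here is a minimal Python sketch that
reports which 16-bit code units in a name are lone or reversed
surrogates. The function name `unpaired_surrogates` and the sample
inputs are my own illustrative assumptions; nothing here is taken from
the VFAT, NTFS, or Windows implementations.

```python
def unpaired_surrogates(units):
    """Return the indices of code units that are unpaired surrogates.

    `units` is a sequence of 16-bit integers, such as the raw code units
    of a file name. An empty result means the name is well-formed UTF-16.
    """
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed by a preceding high one.
            bad.append(i)
        i += 1
    return bad


well_formed = [0x61, 0xD83D, 0xDE00]   # "a" followed by a proper pair
lone = [0x61, 0xD800, 0x62]            # lone high surrogate
reversed_pair = [0xDE00, 0xD83D]       # low surrogate before high

print(unpaired_surrogates(well_formed))
print(unpaired_surrogates(lone))
print(unpaired_surrogates(reversed_pair))
```

A name for which this returns a non-empty list can still be stored as
16-bit code units, but it has no strict UTF-8 equivalent.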
> Quite frankly, you shouldn't choose any iocharset=
> different from UTF-8, since there could be file names on disk that wouldn't
> convert and couldn't be represented.
>
Arguably, WTF-8 [1] is a better choice as it can convert and represent
all VFAT and NTFS file names (though I wouldn't mind if Microsoft were
to start requiring well-formed UTF-16 file names).
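
As a rough illustration of why WTF-8 can represent every such name,
the Python sketch below round-trips 16-bit code units through an 8-bit
form. It leans on Python's `surrogatepass` error handler, which is not
a WTF-8 codec as such: it matches WTF-8 here only because the UTF-16
decoding step combines proper surrogate pairs first, and the reverse
direction does not reject ill-formed WTF-8 input. The helper names are
hypothetical.

```python
import struct


def utf16_units_to_wtf8(units):
    """Transcode 16-bit code units to bytes, keeping unpaired surrogates."""
    raw = struct.pack("<%dH" % len(units), *units)
    # Proper pairs become one code point; unpaired surrogates pass through.
    text = raw.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogatepass")


def wtf8_to_utf16_units(data):
    """Inverse of utf16_units_to_wtf8 for bytes it produced."""
    text = data.decode("utf-8", errors="surrogatepass")
    raw = text.encode("utf-16-le", errors="surrogatepass")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


name = [0x61, 0xD800, 0x62]  # "a", lone high surrogate, "b"

# Strict decoding rejects the name, so strict UTF-8 cannot represent it.
try:
    struct.pack("<3H", *name).decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

encoded = utf16_units_to_wtf8(name)
print(strict_ok, encoded, wtf8_to_utf16_units(encoded) == name)
```

The lone surrogate becomes a three-byte sequence that strict UTF-8
decoders reject, which is the trade-off: the conversion is lossless,
but the result is not valid UTF-8.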
Tom.
[1]: https://simonsapin.github.io/wtf-8/
Received on 2019-04-30 17:53:37