sg16: Re: [SG16-Unicode] It's Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: Thiago Macieira <thiago_at_[hidden]>
Date: Tue, 30 Apr 2019 15:11:57 -0700

On Tuesday, 30 April 2019 08:53:35 PDT Tom Honermann wrote:
> > This means filenames on VFAT and NTFS *do* have an encoding. You cannot
> > use
> > arbitrary binary file names since those wouldn't convert to UTF-16 and
> > couldn't be saved.
>
> This is not quite correct. Windows, at least, does permit creating
> files with names that are invalid UTF-16 as mentioned above. This
> allows arbitrary binary file names, just with 16-bit code units.

Indeed, but we were arguing about the Unix API, especially that in the Linux
implementation, where you have no access to a 16-bit API. So you simply can't
save a file called "\xff" on a VFAT filesystem if it was mounted with the
default (iocharset=utf-8).
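
To illustrate why "\xff" fails: the 8-bit name has to be decoded as UTF-8
before it can be stored as UTF-16, and the byte 0xFF never occurs in
well-formed UTF-8. A minimal sketch in Python, assuming a strict decoder as
a stand-in for the conversion step (representable_as_utf16 is a hypothetical
helper for illustration, not the kernel's actual code):

```python
def representable_as_utf16(name: bytes) -> bool:
    """Return True if an 8-bit file name is well-formed UTF-8.

    Hypothetical helper: it only models the conversion step, i.e. a name
    that fails strict UTF-8 decoding has no UTF-16 form to store on disk.
    """
    try:
        name.decode("utf-8")  # strict decoding, no error handler
    except UnicodeDecodeError:
        return False
    return True


print(representable_as_utf16(b"\xff"))        # False
print(representable_as_utf16(b"readme.txt"))  # True
```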

> > Quite frankly, you shouldn't choose any iocharset=
> > different from UTF-8, since there could be file names on disk that
> > wouldn't
> > convert and couldn't be represented.
>
> Arguably, WTF-8 [1] is a better choice as it can convert and represent
> all VFAT and NTFS file names (though I wouldn't mind if Microsoft were
> to start requiring well-formed UTF-16 file names).

And it might work like that, so that the 8-bit API presented to the VFS layer
and userspace can represent all filenames found on disk, so long as you choose
"iocharset=utf-8". Choosing anything else may mean some files do not get
listed, since they can't be represented in the first place.
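
As a rough sketch of what a WTF-8 style conversion looks like, assuming
Python's "surrogatepass" error handler as a stand-in (the function names are
hypothetical, and this is not a claim about what any kernel driver does):
valid surrogate pairs become one 4-byte sequence, and unpaired surrogates
become 3-byte sequences that strict UTF-8 would reject.

```python
def utf16_name_to_wtf8(units):
    """Convert a sequence of 16-bit code units to WTF-8 style bytes.

    Hypothetical helper for illustration. Decoding first joins valid
    surrogate pairs into one code point; lone surrogates pass through.
    """
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    text = raw.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_utf16_name(data):
    """Inverse conversion: WTF-8 style bytes back to 16-bit code units.

    Not a validator: it does not reject surrogate pairs that were
    wrongly encoded as two separate 3-byte sequences.
    """
    text = data.decode("utf-8", "surrogatepass")
    raw = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# An unpaired high surrogate followed by "a": ill-formed UTF-16.
name = [0xD800, 0x61]
encoded = utf16_name_to_wtf8(name)
print(encoded)                      # b'\xed\xa0\x80a'
print(wtf8_to_utf16_name(encoded))  # [55296, 97]
```

The same name cannot go through strict UTF-8: "\ud800".encode("utf-8")
raises UnicodeEncodeError, which is the case where a file would not be
representable.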

Conclusion: you really need UTF-8 these days.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-05-01 00:12:01