Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Thiago Macieira <thiago_at_[hidden]>
Date: Fri, 06 Sep 2019 07:44:56 -0700
On Friday, 6 September 2019 06:28:00 PDT Lyberta wrote:
> What if non-UTF8 part will be stored in the same way HTML encodings do?
> So we would have UTF-8 string as the name of encoding such as "WTF-16"
> for NTFS and array of numbers that are "abstract units of text" (code
> units for UTF, characters for US-ASCII, not sure about other encodings).

That's called a URL.

These two files in my filesystem:

$ ls -1ib /tmp/*.c
5303210 /tmp/\351.c
5303209 /tmp/é.c

Are uniquely identified by these normalised IRIs:
        file:///tmp/%E9.c
        file:///tmp/é.c

According to RFC 3987, é is the same as %C3%A9.
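
A minimal sketch of that mapping, assuming Python's standard library as the tool (the helper name percent_encode_path is hypothetical and does not come from this thread): percent-encoding the raw filename bytes yields the %E9 form, and doing the same to the UTF-8 name yields %C3%A9, the form that RFC 3987 treats as equivalent to a literal é.

```python
from urllib.parse import quote, unquote_to_bytes


def percent_encode_path(raw: bytes) -> str:
    """Percent-encode each byte outside the unreserved set, keeping '/'.

    Hypothetical helper for illustration only.
    """
    return quote(raw, safe="/")


# The file that `ls -b` prints as \351 (octal 351 == 0xE9, not valid UTF-8).
latin1_name = b"/tmp/\xe9.c"
# The file whose name is valid UTF-8 (bytes 0xC3 0xA9 for é).
utf8_name = "/tmp/é.c".encode("utf-8")

latin1_uri = "file://" + percent_encode_path(latin1_name)
utf8_uri = "file://" + percent_encode_path(utf8_name)

print(latin1_uri)
print(utf8_uri)

# Decoding the escapes recovers the original bytes, so both names stay distinct.
recovered = unquote_to_bytes(percent_encode_path(latin1_name))
```

Because the escapes are applied per byte, the sketch never has to decide which encoding the non-UTF-8 name was written in.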

I didn't suggest using URIs/IRIs for two reasons: first, because I don't know
what happens to unmatched WTF-16 surrogates (probably violates the RFC).
Second, because of the tendency for developers to construct URLs by string
concatenation and forget to properly escape the two characters that MUST NOT
appear unescaped in the path component: # and ?.
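
Both concerns can be illustrated with a short sketch, again assuming Python's standard library; the filename below is made up for the example. It shows how plain string concatenation loses everything after an unescaped #, and how one particular encoder (Python's strict UTF-8 codec) reacts to a lone surrogate. The second part says nothing about what the RFC requires; it only shows one library's behaviour.

```python
from urllib.parse import quote, urlsplit

# Hypothetical filename containing both problem characters.
name = "/tmp/a#b?c.c"

# Naive construction by string concatenation, with no escaping.
naive = "file://" + name
# Same path with '#' and '?' percent-encoded ('/' is kept as-is).
escaped = "file://" + quote(name, safe="/")

naive_parts = urlsplit(naive)
escaped_parts = urlsplit(escaped)

print(naive_parts.path, naive_parts.fragment)   # path is cut short at '#'
print(escaped_parts.path)                       # whole path survives

# A lone (unmatched) surrogate cannot be encoded as strict UTF-8,
# so quote() raises instead of producing an escape sequence.
try:
    quote("/tmp/\ud800.c")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
```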

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2019-09-06 16:44:59