Date: Fri, 06 Sep 2019 15:23:12 -0700
On Friday, 6 September 2019 15:01:39 PDT Tony V E wrote:
> > You're describing case (a), which again implies resolving the problem by
> > declaring the problem cases to be out of scope.
>
> Well, I was imagining that the IDE kept or converted it in whatever format
> it wanted, but it read it in some native format, and has enough info about
> that native format to convert to UTF8. *When* it actually does the
> conversion (ie when reading, or later when writing the SG15 file) doesn't
> matter (I think).
Those are two philosophies of what a file name is, which match the two options
of my OP:
1) file names are text, so I'll store them in my Unicode-capable class
2) file names are binary, so I'll store them in my byte array
The IDEs and text editors divide themselves into those two categories. You've
assumed that only case 1 existed.
The failure modes differ too. In case 2, the IDEs will fail to display the
file name in graphical environments, since all the text shaping frameworks
consume Unicode input. Even so, the IDE can display a placeholder indicating
that the file name can't be shown, and the entries still exist in the
program's memory.
In case 1, you can't even represent said file. The failure happens when
listing the directory or reading from the socket, pipe or file that contained
the encoded form.
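A minimal sketch of the two failure modes, in Python purely for illustration. The file name, the helper names and the placeholder string are all hypothetical, and strict UTF-8 decoding stands in for whatever codec a real tool would use:

```python
# Hypothetical file name as the OS might hand it over: Latin-1 bytes,
# which are not valid UTF-8.
raw_name = b"caf\xe9.txt"

def store_as_text(name: bytes) -> str:
    """Case 1: file names are text, so decode strictly (UTF-8 assumed here)."""
    return name.decode("utf-8")  # raises UnicodeDecodeError on invalid input

def store_as_bytes(name: bytes) -> bytes:
    """Case 2: file names are binary, so keep the bytes untouched."""
    return bytes(name)

def display(name: bytes) -> str:
    """Case 2 display path: fall back to a placeholder if the name can't be shown."""
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        return "<undisplayable file name>"

# Case 1 fails at the point of reading: the file cannot be represented at all.
try:
    store_as_text(raw_name)
    case1_ok = True
except UnicodeDecodeError:
    case1_ok = False

# Case 2 keeps the entry; only the display degrades.
kept = store_as_bytes(raw_name)
```

In this sketch the byte-array entry can still be handed back to the OS unchanged, which is the property case 1 loses.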
> > Why would you save it in UTF-8, knowing that the other tool that is going
> > to read could be under a different assumption of what codec to use?
> >
> > Why not instead save the same bag of bits that you received from the OS,
> > which you know the OS can use to refer back to the same file? The
> > environment has not changed during the run of the current application, so
> > it can perform back and forth translations from the bag of bits to the
> > internal representation, losslessly.
>
> How do I know the environment hasn't changed when the other program (the
> reading one) runs? The SG15 was written by one program, then _later_ read
> by another.
That's not what I meant. I meant that the environment hasn't changed within
the same run of the process (at least, usually). I meant that if the
conversion from "bag of bits" to Unicode text worked once, I can convert back
and forth between them without loss.
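As an illustration of that round trip, here is a small Python sketch. The `roundtrip` helper and the sample names are assumptions for the example; it shows the property for two common codecs and is not a proof for every codec:

```python
def roundtrip(raw: bytes, codec: str) -> bytes:
    """Decode the OS-provided bytes once, then encode back with the same codec."""
    text = raw.decode(codec)   # strict: raises if the bytes are not valid
    return text.encode(codec)

# Same process, same codec: decoding succeeded, so encoding returns the
# original bytes in both of these cases.
latin1_name = b"caf\xe9.txt"
utf8_name = "caf\u00e9.txt".encode("utf-8")

latin1_back = roundtrip(latin1_name, "latin-1")
utf8_back = roundtrip(utf8_name, "utf-8")
```

The point is that the codec is fixed for the lifetime of the process; nothing here says anything about a second process that runs later with a different setting.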
> Are these two programs even on the same OS, or do they just have access to
> the same files?
This is a case of declaring that there is no problem: we excluded networking
from the scope. We probably exclude removable storage media too. If nothing
else, the mount points or drive letters may change.
> > No. This is the failure mode: if the file name was stored in UTF-8 and I
> > don't know what the source used to decode the bag of bits to Unicode, I
> > can't be sure to reproduce the same bag of bits.
>
> If I have the filename in unicode, and the original filename was
> unicode-able, do I need the same bag of bits, or does every OS have an API
> for "find this file, here's the unicode name".
You need the same bag of bits. There's no OS that has "find this file by the
Unicode name" (excepting the case where the bag of bits and the Unicode name
are one and the same, of course).
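A hedged sketch of that failure mode in Python. The Latin-1 writer, the UTF-8 reader and the file name are hypothetical choices made for the example:

```python
# Writer side: the on-disk name is Latin-1 bytes (hypothetical environment).
original = b"caf\xe9.txt"
text = original.decode("latin-1")   # writer's internal Unicode form
saved = text.encode("utf-8")        # what gets stored in the interchange file

# Reader side: it recovers the Unicode name, but must now guess which
# codec turns it back into the bytes the OS expects.
recovered_text = saved.decode("utf-8")
guess_utf8 = recovered_text.encode("utf-8")       # reader assumes UTF-8
guess_latin1 = recovered_text.encode("latin-1")   # reader assumes Latin-1

# Only the guess that matches the writer's codec reproduces the bag of bits.
utf8_guess_matches = guess_utf8 == original
latin1_guess_matches = guess_latin1 == original
```

The Unicode name survives the trip intact, yet the reader that assumes UTF-8 ends up asking the OS for a different byte sequence than the one on disk.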
> > At which step(s) can things go wrong?
> >
> > All of them, starting from the delineation of the problem space.
>
> Yes, I'm wondering if we can make the problem space smaller, since
> developers and tools have lots of control over the filenames they use.
Yes, we can. That's Option 1: there is almost[*] no problem if you set your
system up correctly, so any failures are due to filesystem corruption and/or
incorrect environment setup.
Qt has been doing that for 20 years, since Qt 2.0 introduced the
Unicode-capable QString.
[*] The only remaining issue is the perfectly valid case of setting LC_ALL=C
in the environment for reading other tools' output. I would recommend just
ignoring that.
-- Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org Software Architect - Intel System Software Products
Received on 2019-09-07 00:23:14