Re: [SG16] A UTF-8 environment specification; an alternative to assuming UTF-8 based on choice of literal encoding

From: Thiago Macieira <thiago_at_[hidden]>
Date: Thu, 29 Jul 2021 16:17:58 -0700
On Thursday, 29 July 2021 13:07:04 PDT Charlie Barto wrote:
> Upon reflection I don't think WTF-8 is right for unix, I think for unix the
> (possibly implementation defined) behavior should be "an array of zero
> terminated byte strings, that don't contain 0x0". Not all byte strings are
> valid WTF-8, if you're on unix and need to transcode to UTF-16 with round
> tripping I think you need something like PEP-383, instead (notably the
> system need not know about your pep-383 things).

Thanks for pointing out the existence of PEP-383 and from there the UTF-8b
possibility. It's an interesting solution.

Qt 3 did have a similar solution for a time, though it used a different
character range. It was done when Linux was transitioning to UTF-8, back in
2004. I removed it from Qt 4 a while ago, because it caused other problems
elsewhere.

Even in Python it's not entirely perfect. Unlike the old Qt solution, they
appear to have applied it only to file names (which includes the command-
line):

$ cat test.py
import sys, os
try:
  f = os.open(sys.argv[1], os.O_RDONLY)
except:
  print("Could not open file:", sys.argv[1])

$ python3 test.py $'\341'
Could not open file: Traceback (most recent call last):
  File "test.py", line 3, in <module>
    f = os.open(sys.argv[1], os.O_RDONLY)
FileNotFoundError: [Errno 2] No such file or directory: '\udce1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 5, in <module>
    print("Could not open file:", sys.argv[1])
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce1' in position
0: surrogates not allowed
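
For reference, the mechanism behind that transcript can be reproduced without touching the file system. This is only a minimal sketch: the helper names decode_name, encode_name and printable are made up for illustration, and it assumes the "surrogateescape" and "backslashreplace" error handlers that Python's codecs provide. It shows that the byte round-trips as long as the same handler is used in both directions, and that only the strict encoder (what print() used above) refuses the string.

```python
def decode_name(raw: bytes) -> str:
    # PEP 383: each undecodable byte 0x80..0xFF becomes a lone
    # surrogate in the range U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_name(name: str) -> bytes:
    # The inverse mapping: escaped surrogates turn back into the original bytes.
    return name.encode("utf-8", errors="surrogateescape")


def printable(name: str) -> str:
    # Hypothetical helper for display only: lone surrogates are rendered as
    # \udcXX escapes instead of raising UnicodeEncodeError. This form is lossy
    # in the sense that it is no longer the file name itself.
    return name.encode("utf-8", errors="backslashreplace").decode("utf-8")


if __name__ == "__main__":
    name = decode_name(b"\xe1")          # same byte as $'\341' above
    assert encode_name(name) == b"\xe1"  # round-trips back to the original byte
    print("Could not open file:", printable(name))
```

With something like printable() in the except branch, the second exception in the transcript would not occur, but the text shown to the user is then an escaped rendering rather than the real name.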

So far, the conclusion I've come to is that applications must choose one of
three evils:
1) keep file names in 8-bit, so improperly-encoded UTF-8 can round-trip back
  on Unix, but will have trouble with improper UTF-16 on Windows
2) keep file names in 16-bit, so improperly-encoded UTF-16 can round-trip back
  on Windows, but will have trouble with improper UTF-8 on Unix
3) keep file names in 32-bit and have problems with both

In any of those scenarios, there's data loss the moment the file name is
presented to the user in an editing widget and the user edits.
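
To illustrate options 1 and 2, here is a rough sketch of what happens when a name passes through the "other" encoding. The function names are hypothetical, and the use of errors="replace" is an assumption about how an application might handle bad input; an application could equally reject the name outright, which avoids silent corruption but makes the file unreachable.

```python
def utf8_bytes_via_utf16(raw: bytes) -> bytes:
    # Option 2 on Unix: the application stores names as UTF-16, so the
    # bytes from the OS are converted on the way in and again on the way out.
    text = raw.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
    stored = text.encode("utf-16-le")
    return stored.decode("utf-16-le").encode("utf-8")


def utf16_units_via_utf8(units: bytes) -> bytes:
    # Option 1 on Windows: the application stores names as UTF-8, so the
    # UTF-16 code units from the OS are converted in both directions.
    text = units.decode("utf-16-le", errors="replace")  # lone surrogate becomes U+FFFD
    stored = text.encode("utf-8")
    return stored.decode("utf-8").encode("utf-16-le")


if __name__ == "__main__":
    bad_unix = b"\xe1"             # not valid UTF-8
    bad_windows = b"\x00\xdca\x00"  # lone low surrogate U+DC00, then 'a'

    # Well-formed names survive either conversion unchanged.
    assert utf8_bytes_via_utf16("é".encode("utf-8")) == "é".encode("utf-8")
    assert utf16_units_via_utf8("é".encode("utf-16-le")) == "é".encode("utf-16-le")

    # Ill-formed names come back different, so the original file
    # can no longer be addressed by the converted name.
    assert utf8_bytes_via_utf16(bad_unix) != bad_unix
    assert utf16_units_via_utf8(bad_windows) != bad_windows
```

Escaping schemes such as PEP 383 avoid the loss in one direction only, which is why neither storage width is free of trouble on both platforms.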

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering

Received on 2021-07-29 18:18:02