sg15

Re: [Tooling] [isocpp-modules] Dependency information for module-aware build tools

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 8 Mar 2019 00:00:56 -0500

On 3/7/19 11:13 AM, Ben Boeckel via Modules wrote:
> On Thu, Mar 07, 2019 at 10:59:20 -0500, Tom Honermann wrote:
>> For example, consider a file name consisting of the octets { 0x86, 0x89,
>> 0x93, 0x85, 0x59 }. In EBCDIC code page 37, this denotes a file name
>> "fileß", but in EBCDIC code page 273 denotes "file~". The suggestion
>> then is, when generating the JSON file, if the current locale setting is
>> CP37, to use the UTF-8 encoded name "fileß" as a normal JSON string.
>> Tools consuming the file would then have to transcode the UTF-8 provided
>> name back to the original locale to open the file.
> This would require build tools to do more than "just" sling strings
> around. iconv is not a light dependency… It also means that compilers
> can't do the (trivial) `is_valid_utf8` check and instead have to also do
> transcoding. And know the name of the encoding used. On Linux, you don't
> have that information at all. For example, my locale is all
> `en_US.UTF-8`, but nothing stops me from having a Shift-JIS filename
> anywhere (and I do have a few in archives of mid-2000-era software). How
> is a compiler supposed to know what the encoding of `readdir->d_name` is
> here?

A tool can't know the encoding of `readdir->d_name`. This problem
occurs with any tool that intends to display a file name, even tools
like 'ls'. For example, on Linux, in a directory with a file name "fileß":

# With default locale settings (UTF-8):
$ ls -1
fileß

# With "C" locale:
$ LANG=C ls -1
'file'$'\303\237'

# With Czech locale:
$ LANG=cs_CZ.iso88592 ls -1
'file�'$'\237'

Essentially, interpretation of a file name is always subject to locale
settings.
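
The same effect can be reproduced without 'ls'. The following is a minimal Python sketch, not part of the original exchange, that decodes fixed octet sequences under several encodings. The codec names are Python's own spellings and the variable names are only illustrative.

```python
# Octets of the Linux file name from the 'ls' example: "file" followed by
# the two-byte UTF-8 sequence for U+00DF (ß).
utf8_name = b"file\xc3\x9f"

# Interpreted as UTF-8, the name reads as intended.
as_utf8 = utf8_name.decode("utf-8")

# Interpreted as ISO 8859-2, the same two trailing octets become
# U+0102 and the C1 control character U+009F.
as_latin2 = utf8_name.decode("iso8859_2")

# A strict ASCII interpretation (roughly what the "C" locale offers)
# cannot represent the trailing octets at all.
try:
    utf8_name.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Octets of the EBCDIC example quoted earlier in this thread.
ebcdic_name = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
as_cp037 = ebcdic_name.decode("cp037")
as_cp273 = ebcdic_name.decode("cp273")

for label, text in [
    ("utf-8", as_utf8),
    ("iso8859_2", as_latin2),
    ("cp037", as_cp037),
    ("cp273", as_cp273),
]:
    # ascii() escapes non-ASCII characters so the result is unambiguous
    # regardless of the terminal's own encoding.
    print(f"{label:10} {ascii(text)}")
print("ascii decodable:", ascii_ok)
```

Nothing in the octets themselves says which of these readings is the right one; that information has to come from somewhere else.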

>
>> Previously, I had suggested that the locales must match for the producer
>> and consumer and that it be UB otherwise (effectively leading to file
>> not found errors). However, I think it would be better to store the
>> encoding used to interpret the file name at generation time (if it isn't
>> UTF-8) in the file to allow tools to accurately reverse the UTF-8
>> encoding. The supported encodings and the spelling of their names
>> would, of course, be implementation/platform defined.
> Build tools already have enough things to worry about. Transcoding and
> code pages is not something I want a /dependency file format/ to require
> them to handle.

I can appreciate not wanting additional requirements :)
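
For concreteness, the "store the encoding" idea could look roughly like the following Python sketch. The key names "name" and "encoding" and both helper functions are hypothetical and are not taken from the proposed schema. Python codec names stand in here for whatever implementation-defined spellings a real tool would use.

```python
import codecs
import json


def describe_file_name(raw: bytes, locale_encoding: str) -> dict:
    """Producer side (hypothetical): interpret the raw octets using the
    current locale encoding and record that encoding unless it is UTF-8."""
    entry = {"name": raw.decode(locale_encoding)}
    # codecs.lookup() normalizes aliases such as "UTF-8" and "utf8".
    if codecs.lookup(locale_encoding).name != "utf-8":
        entry["encoding"] = locale_encoding
    return entry


def recover_file_name(entry: dict) -> bytes:
    """Consumer side (hypothetical): transcode the UTF-8 name back to the
    octets needed to open the file."""
    return entry["name"].encode(entry.get("encoding", "utf-8"))


# The EBCDIC code page 37 file name from earlier in this thread.
raw = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
entry = describe_file_name(raw, "cp037")

# ensure_ascii=False keeps the name readable in the serialized JSON text.
text = json.dumps(entry, ensure_ascii=False)
print(text)

# The consumer recovers the original octets without consulting its own locale.
assert recover_file_name(json.loads(text)) == raw
```

The round trip only works if the producer's locale encoding really is the one the file name was written in, which is exactly the piece of information a tool on Linux does not reliably have.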
>
>> On 3/7/19 10:30 AM, Ben Boeckel wrote:
>>> Note that if you'd like to have a readable filename, adding it as a
>>> `_readable` key with a human-readable utf-8 transcoding to the filename
>>> would be supported (see my message with the JSON schema bits from
>>> yesterday).
>> That seems reasonable to me for file names that really can't be
>> represented as UTF-8, but seems like noise otherwise. In other words, I
>> think we should try to minimize use of raw8, raw16, etc... where possible.
> Then we should probably look for a core format that doesn't require
> UTF-8 and instead supports byte arrays natively (effectively making it a
> binary format as far as text editors are concerned).

That was my initial inclination before the JSON approach was suggested.
I think either approach is workable.

Tom.

Received on 2019-03-08 06:01:01