Re: [Tooling] [isocpp-modules] Dependency information for module-aware build tools

From: Ben Boeckel <ben.boeckel_at_[hidden]>
Date: Thu, 7 Mar 2019 11:13:35 -0500

On Thu, Mar 07, 2019 at 10:59:20 -0500, Tom Honermann wrote:
> For example, consider a file name consisting of the octets { 0x86, 0x89,
> 0x93, 0x85, 0x59 }. In EBCDIC code page 37, this denotes a file name
> "fileß", but in EBCDIC code page 273 denotes "file~". The suggestion
> then is, when generating the JSON file, if the current locale setting is
> CP37, to use the UTF-8 encoded name "fileß" as a normal JSON string.
> Tools consuming the file would then have to transcode the UTF-8 provided
> name back to the original locale to open the file.

This would require build tools to do more than "just" sling strings
around. iconv is not a light dependency. It also means that compilers
can't get away with the (trivial) `is_valid_utf8` check; they have to do
transcoding as well, and know the name of the encoding used. On Linux, you
don't have that information at all. For example, my locale is all
`en_US.UTF-8`, but nothing stops me from having a Shift-JIS filename
anywhere (and I do have a few in archives of mid-2000s-era software). How
is a compiler supposed to know what the encoding of `readdir()->d_name` is
here?
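
To make "trivial" concrete, here is a rough sketch of what such a check
could look like. The name `is_valid_utf8` comes from the text above; the
signature and the strictness (rejecting overlong forms, surrogates, and
anything past U+10FFFF) are my assumptions, not something any compiler is
known to ship. Note that it needs no locale and no encoding name, only the
bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: true if `s` is well-formed UTF-8.
// Assumption: strict checking (no overlong forms, no surrogates).
bool is_valid_utf8(std::string_view s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const auto b0 = static_cast<std::uint8_t>(s[i]);
        if (b0 < 0x80) { ++i; continue; }  // ASCII byte

        std::size_t len = 0;   // total bytes in this sequence
        std::uint32_t cp = 0;  // decoded code point
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<std::uint8_t>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong encodings, surrogates, and out-of-range values.
        if (len == 2 && cp < 0x80) return false;
        if (len == 3 && cp < 0x800) return false;
        if (len == 4 && cp < 0x10000) return false;
        if (cp > 0x10FFFF) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;

        i += len;
    }
    return true;
}
```

With this, "fileß" encoded as UTF-8 passes, while the EBCDIC octets from
the example above and a typical Shift-JIS name both fail, since each
starts with a byte that cannot begin a UTF-8 sequence. A failed check
only says "not UTF-8"; it says nothing about which encoding the bytes
are actually in.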

> Previously, I had suggested that the locales must match for the producer
> and consumer and that it be UB otherwise (effectively leading to file
> not found errors). However, I think it would be better to store the
> encoding used to interpret the file name at generation time (if it isn't
> UTF-8) in the file to allow tools to accurately reverse the UTF-8
> encoding. The supported encodings and the spelling of their names
> would, of course, be implementation/platform defined.

Build tools already have enough things to worry about. Transcoding and
code pages are not something I want a /dependency file format/ to require
them to handle.

> On 3/7/19 10:30 AM, Ben Boeckel wrote:
> > Note that if you'd like to have a readable filename, adding it as a
> > `_readable` key with a human-readable utf-8 transcoding to the filename
> > would be supported (see my message with the JSON schema bits from
> > yesterday).
>
> That seems reasonable to me for file names that really can't be
> represented as UTF-8, but seems like noise otherwise. In other words, I
> think we should try to minimize use of raw8, raw16, etc... where possible.

Then we should probably look for a core format that doesn't require
UTF-8 and instead supports byte arrays natively (effectively making it a
binary format as far as text editors are concerned).
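
As a purely illustrative sketch of what "byte arrays natively" could mean,
here is a length-prefixed field: a 4-byte little-endian length followed by
the raw bytes. The framing and the names `encode_bytes` and `decode_bytes`
are hypothetical, made up for this message, and are not a format anyone
has proposed in this thread.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical framing: 4-byte little-endian length, then the raw bytes.
// Assumption: names are shorter than 4 GiB, so the length fits in 32 bits.
std::string encode_bytes(std::string_view raw) {
    const auto n = static_cast<std::uint32_t>(raw.size());
    std::string out;
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<char>((n >> shift) & 0xFF));
    }
    out.append(raw);  // bytes are copied untouched, whatever their encoding
    return out;
}

// Reads one field from the start of `in`; returns false if it is truncated.
bool decode_bytes(std::string_view in, std::string& raw) {
    if (in.size() < 4) return false;
    std::uint32_t n = 0;
    for (int k = 0; k < 4; ++k) {
        n |= static_cast<std::uint32_t>(static_cast<std::uint8_t>(in[k]))
             << (8 * k);
    }
    if (in.size() - 4 < n) return false;
    raw.assign(in.substr(4, n));
    return true;
}
```

The filename round-trips byte for byte with no transcoding and no
knowledge of the locale, which is the attraction. The cost is the one
already mentioned: the length prefix and arbitrary name bytes mean a text
editor will treat the file as binary.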

--Ben

Received on 2019-03-07 17:13:49