Re: [Tooling] [isocpp-modules] Dependency information for module-aware build tools

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 7 Mar 2019 11:16:15 -0500
On 3/7/19 10:59 AM, Tom Honermann wrote:
> On 3/7/19 10:30 AM, Ben Boeckel wrote:
>> On Thu, Mar 07, 2019 at 00:15:34 -0500, Tom Honermann wrote:
>>> I don't know of any that use 32-bit code units for file names.
>>>
>>> I find myself thinking (as I so often do these days much to the surprise
>>> of my past self), how do EBCDIC and z/OS fit in here? If we stick to
>>> JSON and require the dependency file to be UTF-8 encoded, would all file
>>> names in these files be raw8 encoded and effectively unreadable (by
>>> humans) on z/OS? Perhaps we could allow more flexibility, but doing so
>>> necessarily invites locales into the discussion (for those that are
>>> unaware, EBCDIC has code pages too). For example, we could require that
>>> the selected locale match between the producers and consumers of the
>>> file (UB if they don't) and permit use of the string representation by
>>> transcoding from the locale interpreted physical file name to UTF-8, but
>>> only if reverse-transcoding produces the same physical file name,
>>> otherwise the appropriate raw format must be used.
>> I first tried saying "treat these strings as if they were byte arrays"
>> with allowances for escaping `"` and `\`, but there was pushback on the
>> previous thread about it. This basically makes a new dialect of JSON
>> which is (usually) an error in existing implementations. It would mean
>> that tools are implementing their own JSON parsers (or even writers)…
> This isn't what I was suggesting. Rather, I was suggesting that
> standard UTF-8 encoded JSON be used. On platforms where the
> interpretation of a file name may differ based on locale settings, if
> the file name can be losslessly round-tripped to UTF-8 and back, its
> UTF-8 encoding (transcoded from the locale) would be used in the JSON
> file as a (well-formed) UTF-8/JSON string, even though that name
> wouldn't reflect the exact code units of the file name.
>
> For example, consider a file name consisting of the octets { 0x86, 0x89,
> 0x93, 0x85, 0x59 }. In EBCDIC code page 37, this denotes the file name
> "fileß", but in EBCDIC code page 273 it denotes "file~". The suggestion
> then is, when generating the JSON file, if the current locale setting is
> CP37, to use the UTF-8 encoded name "fileß" as a normal JSON string.
> Tools consuming the file would then have to transcode the UTF-8 provided
> name back to the original locale to open the file.
>
> Previously, I had suggested that the locales must match for the producer
> and consumer and that it be UB otherwise (effectively leading to file
> not found errors). However, I think it would be better to store the
> encoding used to interpret the file name at generation time (if it isn't
> UTF-8) in the file to allow tools to accurately reverse the UTF-8
> encoding. The supported encodings and the spelling of their names
> would, of course, be implementation/platform defined.
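
A minimal sketch of the round-trip rule described above, assuming Python's built-in `cp037` and `cp273` codecs stand in for the platform's locale encodings. The function name `utf8_name_or_none` is hypothetical, and what a producer does when it returns `None` (falling back to a raw form) is left out:

```python
from typing import Optional


def utf8_name_or_none(raw_name: bytes, locale_encoding: str) -> Optional[str]:
    """Return the file name as text if it round-trips losslessly, else None.

    Hypothetical helper: a producer would write the returned text as a
    normal JSON string and fall back to a raw form when None is returned.
    """
    try:
        # Strict decoding fails on bytes the encoding cannot interpret.
        text = raw_name.decode(locale_encoding)
        # Only accept the text form if it maps back to the same octets.
        if text.encode(locale_encoding) == raw_name:
            return text
    except UnicodeError:
        pass
    return None


# The octets from the example above, interpreted under two code pages.
raw = bytes([0x86, 0x89, 0x93, 0x85, 0x59])
name_cp037 = utf8_name_or_none(raw, "cp037")
name_cp273 = utf8_name_or_none(raw, "cp273")
```

The same octets yield different text depending on the code page, which is why the encoding used at generation time has to be known to whoever reverses the transcoding.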

Strawman update to the JSON schema to support this:

   {
     ...
     "definitions": {
+      "filename-encoding": {
+        "$id": "#filename-encoding",
+        "type": [
+          "string"
+        ],
+        "description": "The name of the character encoding used to interpret filenames"
+      },
       ...
+      "bikeshed-filename-encoding": {
+        "$id": "#bikeshed-filename-encoding",
+        "title": "filename encoding",
+        "$ref": "#/definitions/filename-encoding"
+      }
     }
   }
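
On the consuming side, a tool would read the recorded encoding and reverse the transcoding before opening the file. A rough sketch in Python, under loud assumptions: the top-level keys `filename-encoding` and `sources` are hypothetical placeholders for wherever the schema ends up putting them, and Python's `cp037` codec stands in for an implementation-defined encoding name.

```python
import json

# Hypothetical dependency document; the key names are illustrative only
# and are not taken from the proposed schema.
doc_text = json.dumps(
    {"filename-encoding": "cp037", "sources": ["fileß"]},
    ensure_ascii=False,
)


def physical_names(document_text: str) -> list:
    """Transcode each UTF-8 JSON name back to its on-disk code units."""
    doc = json.loads(document_text)
    # Assume UTF-8 when no encoding was recorded in the file.
    encoding = doc.get("filename-encoding", "utf-8")
    return [name.encode(encoding) for name in doc["sources"]]


names = physical_names(doc_text)
```

With the encoding stored in the file, the consumer does not need to share the producer's locale setting to recover the original octets.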

Tom.

>
>> Note that if you'd like to have a readable filename, adding it as a
>> `_readable` key with a human-readable UTF-8 transcoding of the filename
>> would be supported (see my message with the JSON schema bits from
>> yesterday).
> That seems reasonable to me for file names that really can't be
> represented as UTF-8, but seems like noise otherwise. In other words, I
> think we should try to minimize use of raw8, raw16, etc. where possible.
>
> Tom.
>
>> --Ben
>
> _______________________________________________
> Tooling mailing list
> Tooling_at_[hidden]
> http://www.open-std.org/mailman/listinfo/tooling

Received on 2019-03-07 17:16:18