Date: Fri, 8 Mar 2019 11:52:30 -0500
On 3/8/19 10:59 AM, Ben Boeckel via Modules wrote:
> On Wed, Mar 06, 2019 at 12:13:39 -0500, Ben Boeckel wrote:
>> On Mon, Mar 04, 2019 at 17:57:53 -0500, Ben Boeckel wrote:
>>> Defined formats (I'm fine with bikeshedding these names once the overall
>>> format has been hammered out):
>>>
>>> - "raw8": interpret `data` as an array of uint8_t bytes to be passed
>>> to platform-specific filesystem APIs as an 8-bit encoding
>>> - "raw16": interpret `data` as an array of uint16_t bytes to be passed
>>> to platform-specific filesystem APIs as a 16-bit encoding
>>>
>>> This basically means "check if it is UTF-8, if it is, escape `\` and `"`
>>> and output that, otherwise indicate the byte size of the data and write
>>> it as an integer array".
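
For concreteness, here is a minimal Python sketch of the rule quoted above for the 8-bit case. The `format` key and the `encode_path` name are assumptions made for illustration; only `data`, `raw8`, and `raw16` come from the proposal. The escaping of `\` and `"` is left to the JSON serializer.

```python
import json


def encode_path(raw: bytes):
    """Return a JSON-ready value for a native 8-bit path.

    Valid UTF-8 becomes a plain string; anything else falls back to a
    "raw8" object holding the bytes as integers. The "format" key name
    is an assumption, not part of the proposal.
    """
    try:
        # Strict decoding raises on any byte sequence that is not UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return {"format": "raw8", "data": list(raw)}


# json.dumps escapes backslash and double quote in the string case.
print(json.dumps(encode_path(b'a"b\\c'), ensure_ascii=False))
print(json.dumps(encode_path(b"caf\xe9.cpp")))
```

A `raw16` variant would do the same with 16-bit code units and a strict UTF-16 check; it is omitted here to keep the sketch short.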
> Idea that came up here to improve support for almost-UTF-8 and
> non-UTF-8 filenames:
>
> - Filenames may contain URL escape sequences (cf RFC 1738) for
> non-UTF-8 bytes. This also means that `%` must be url-encoded.
> - For systems for which UTF-8 is not (generally) a valid filepath, if
> the path can be converted unambiguously and losslessly *from* UTF-8
> to the native API requirements, converting that path to UTF-8 is
> valid (consuming tools would need to know to transcode from UTF-8
> anyways). Invalid characters for this case would force a fallback to
> `raw8` or `raw16` depending on the native API requirements.
> * This helps Windows where there is a useful transformation between
> the system API and UTF-8 except for a handful of cases.
> * Also likely helps macOS where the wrong collation is present in
> the source code, but can be restored for the native API.
> * Additionally helps EBCDIC since most filenames there should be
> able to roundtrip.
> - Specify `readable` as an optional string property in the spec on
> filepath for which consumers MUST NOT (RFC 2119) infer semantic
> meaning (treat it like a comment).
>
> Does this ease concerns about non-UTF-8 paths?
Yes, I think this sounds good. I think we should also specify the
transcoding requirement (though probably not the set of supported
encodings) on a per-platform basis. This is to ensure that different
tools on the same platform implement the same strategy.
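
To make the URL-escape idea quoted above concrete, here is a rough Python sketch, assuming the native path is an 8-bit byte string. The function names `escape_path` and `unescape_path` are hypothetical. Valid UTF-8 passes through unchanged, each byte that is not part of valid UTF-8 becomes `%XX`, and `%` itself becomes `%25` so that decoding stays unambiguous.

```python
def escape_path(raw: bytes) -> str:
    """Percent-escape the non-UTF-8 bytes (and '%') of an 8-bit path."""
    out = []
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN, which
    # lets us find exactly the bytes that were not valid UTF-8.
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("%{:02X}".format(cp - 0xDC00))
        elif ch == "%":
            out.append("%25")
        else:
            out.append(ch)
    return "".join(out)


def unescape_path(text: str) -> bytes:
    """Invert escape_path, recovering the original bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "%":
            # Two hex digits follow every '%' produced by escape_path.
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


raw = b"caf\xe9 100%.cpp"      # Latin-1 byte 0xE9 is not valid UTF-8 here
escaped = escape_path(raw)
print(escaped)                  # caf%E9 100%25.cpp
print(unescape_path(escaped) == raw)
```

The sketch does no error handling for malformed input such as a trailing `%`; a real consumer would need to decide how to reject that.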
Corentin observed last night that Unicode normalization can interfere
with filename round tripping. E.g., if the filename is read as UTF-8,
stored in the JSON file, and then the JSON file is normalized (perhaps
when read), then filenames may no longer match the on disk name. I
think we need to address this somehow, but I don't have any specific
suggestions other than "don't do that".
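
A small Python illustration of the normalization problem, using the standard `unicodedata` module. The filename is made up for the example, and `survives_normalization` is a hypothetical helper: it reports whether a name would still match its on-disk spelling after a tool normalizes the JSON text.

```python
import unicodedata


def survives_normalization(name: str, form: str = "NFC") -> bool:
    """True if normalizing `name` to `form` leaves it unchanged."""
    return unicodedata.normalize(form, name) == name


# 'e' followed by U+0301 COMBINING ACUTE ACCENT, as a decomposed
# on-disk name might be spelled.
on_disk = "cafe\u0301.cpp"
after_nfc = unicodedata.normalize("NFC", on_disk)

print(survives_normalization(on_disk))                         # False
print(on_disk.encode("utf-8") == after_nfc.encode("utf-8"))    # False
```

Both strings render identically, but their UTF-8 bytes differ, so a byte-exact lookup of the normalized name can miss the file on a filesystem that compares names byte for byte.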
Tom.
>
> --Ben
> _______________________________________________
> Modules mailing list
> Modules_at_[hidden]
> Subscription: http://lists.isocpp.org/mailman/listinfo.cgi/modules
> Link to this post: http://lists.isocpp.org/modules/2019/03/0227.php
Received on 2019-03-08 17:52:33