Re: [SG16-Unicode] P1689: Encoding of filenames for interchange

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 5 Sep 2019 11:20:58 -0400
Thank you for writing this up, Thiago!

On 9/5/19 12:12 AM, Thiago Macieira wrote:
> == Transport ==
> P1689 suggests using JSON. I'm comparing that in the context of the three
> options with a binary format (CBOR).
>
> One thing SG16 is completely in agreement of is that if you go with JSON, you
> must obey RFC 8259: there must not be a BOM and the file must be encoded in
> UTF-8.

We haven't polled anything, so saying we're all in agreement is
premature. Additionally, we discussed this further in the SG16 meeting
yesterday and I think we determined that a BOM *may* be present.

RFC 8259 section 8.1 states: (emphasis mine)

    JSON text exchanged between systems *that are not part of a closed
    ecosystem* MUST be encoded using UTF-8 [RFC3629].

    Previous specifications of JSON have not required the use of UTF-8
    when transmitting JSON text. However, the vast majority of
    JSON-based software implementations have chosen to use the UTF-8
    encoding, to the extent that it is the only encoding that achieves
    interoperability.

    Implementations MUST NOT add a byte order mark (U+FEFF) to the
    beginning of a *networked-transmitted JSON text*. In the interests
    of interoperability, implementations that parse JSON texts *MAY
    ignore the presence of a byte order mark* rather than treating it as
    an error.

My reading of this is that RFC 8259 permits use of non-UTF-8 encodings
in some situations. Whether the situation that P1689 is defined for
qualifies is something that could be debated. If we consider the build
system and compiler invocations to form a closed ecosystem, then the
dependency file could be, for example, EBCDIC encoded JSON and still
conform to RFC 8259. I'm not arguing for or against such a position at
this time; but rather noting that, if SG15 requires UTF-8 encoded JSON,
that requirement is arguably more restrictive than what RFC 8259 requires.
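
To make the closed-ecosystem reading concrete, here is a minimal Python
sketch (not part of P1689 or of any build tool) in which a producer and a
consumer agree out of band on an EBCDIC code page. The function names, the
choice of `cp037`, and the sample keys are all assumptions made purely for
illustration.

```python
import json

# Assumed for illustration: both tools agree on this code page out of band.
AGREED_ENCODING = "cp037"  # an EBCDIC code page shipped with Python


def write_dependency_bytes(obj) -> bytes:
    """Serialize obj as JSON text, then encode it in the agreed code page."""
    # json.dumps escapes non-ASCII characters by default, so the resulting
    # text is pure ASCII and every character has a cp037 mapping.
    text = json.dumps(obj)
    return text.encode(AGREED_ENCODING)


def read_dependency_bytes(raw: bytes):
    """Decode with the agreed code page, then parse the JSON text."""
    return json.loads(raw.decode(AGREED_ENCODING))


# Hypothetical payload; the keys are not taken from the P1689 schema.
sample = {"source": "main.cpp", "requires": ["util"]}
ebcdic = write_dependency_bytes(sample)
```

The round trip only works because both sides share the encoding choice; a
consumer that assumed UTF-8 could not read these bytes as JSON, which is the
interoperability cost of relying on the closed-ecosystem allowance.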

My reading of the BOM requirements is that they only apply to UTF-8 data
sent over the network and that use of a BOM in file contents is permitted.
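
As a rough illustration of the "MAY ignore" allowance, the following Python
sketch shows a consumer that accepts UTF-8 JSON with or without a leading
BOM, and a producer that never emits one. The function names are
hypothetical; this is one possible policy, not something P1689 or RFC 8259
mandates for files.

```python
import codecs
import json


def load_json_tolerating_bom(raw: bytes):
    """Parse UTF-8 encoded JSON, skipping a leading BOM if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        # Drop the three BOM bytes (EF BB BF) rather than treating them
        # as an error.
        raw = raw[len(codecs.BOM_UTF8):]
    return json.loads(raw.decode("utf-8"))


def dump_json_without_bom(obj) -> bytes:
    """Serialize obj as UTF-8 encoded JSON with no BOM."""
    return json.dumps(obj, ensure_ascii=False).encode("utf-8")
```

Being strict on output and lenient on input keeps the files usable by
consumers that reject a BOM while still accepting files written by tools
that add one.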

ECMA-404 does not specify any requirements on the encoding of JSON
content, nor on the presence or absence of a BOM.

My conclusions are that, if we adopt either RFC 8259 or ECMA-404 as the
JSON specification we defer to, and if we don't add additional
restrictions:

 1. Implementations could choose whatever encoding they like for the
    JSON file.
 2. Implementations could choose whether to produce and consume a BOM.

Tom.


Received on 2019-09-05 17:21:01