Re: [SG16-Unicode] Comments on D1629R1 Standard Text Encoding

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Mon, 19 Aug 2019 20:34:34 +0300
On Mon, Aug 19, 2019 at 9:33 AM Thiago Macieira <thiago_at_[hidden]> wrote:
> On Sunday, 18 August 2019 12:47:27 PDT Henri Sivonen wrote:
> > On Sun, Aug 18, 2019, 19:07 Thiago Macieira <thiago_at_[hidden]> wrote:
> > > On Saturday, 17 August 2019 12:25:57 PDT Henri Sivonen wrote:
> > > > To the extent other programming languages that have encoding
> > > > conversion in their standard library, such as Java, focus on
> > > > contiguous buffers rather than iteration, it's worthwhile to study if
> > > > application developers really feel that something important is
> > > > missing.
> > >
> > > We were just discussing URLs in the cpplang Slack and that reminded me:
> > > there's exactly one in 10 years case that I've needed to decode a non-
> > > contiguous byte range and that's when parsing a URL.
> >
> > Can you elaborate on this? Per spec, URL parsing doesn't invoke a decoder
> > but an encoder:
> > https://url.spec.whatwg.org/#query-state
> You need to decode before you can encode again. Just try normalising the
> following URL/IRI excerpt:
> %C3%C3%A9%A9

Web-compatible URL parsing does not involve decoding and then
encoding. However, display to users in the URL bar does involve
percent-decoding, checking for UTF-8ness, and then UTF-8 decoding if
the UTF-8ness check passed.
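
As a rough illustration of that display step, here is a minimal C++17
sketch: percent-decode to raw bytes, check that the bytes are well-formed
UTF-8, and show the decoded form only if the check passes. The names
percent_decode, is_valid_utf8 and display_form are hypothetical and not
taken from any browser. The sketch also assumes an ASCII input string and
decodes every escape, whereas real URL bars presumably keep some
characters escaped.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Returns the numeric value of a hex digit, or -1 if c is not one.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Percent-decodes into raw bytes. Malformed escapes are copied through as-is.
std::string percent_decode(std::string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            int hi = hex_value(in[i + 1]);
            int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;
                continue;
            }
        }
        out.push_back(in[i]);
    }
    return out;
}

// Strict well-formedness check: rejects overlong forms, surrogates,
// and anything above U+10FFFF.
bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the 2nd byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;       // no overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;       // no surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;       // no overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;       // nothing above U+10FFFF
        } else {
            return false;                    // stray continuation or bad lead
        }
        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < lo || b > hi) return false;
            lo = 0x80;                       // later bytes use the plain range
            hi = 0xBF;
        }
        i += len;
    }
    return true;
}

// Shows the decoded UTF-8 form only if the whole decoded string is valid;
// otherwise returns the input unchanged, escapes and all.
std::string display_form(std::string_view url) {
    std::string decoded = percent_decode(url);
    return is_valid_utf8(decoded) ? decoded : std::string(url);
}
```

With this sketch, display_form("%C3%A9") yields the two UTF-8 bytes for
U+00E9, while display_form("%C3%C3%A9%A9") returns the input untouched,
because the decoded bytes C3 C3 A9 A9 are not well-formed UTF-8.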

In terms of attacks that could confuse the users, it's probably a bad
idea to try to take a URL and merely _minimize_ its percent encodes
to produce a Unicode string that has some percent encodes left but
replaces all valid UTF-8 percent encode sequences with Unicode, which
seems to be the operation you suggest. In contrast, in Firefox and
Safari, the moment you introduce a non-UTF-8 percent encode to the
URL, the URL bar will only show you ASCII (with percent escapes) for
the whole URL. That is, it's an all or nothing deal. In Chrome, this
is done independently for the query part and for the rest. (This
operation is not exposed to Web content, so the difference between
Firefox and Safari on one hand and Chrome on the other is a mere UI

> WHATWG is not normative.

It's normative for implementations that wish to be consistent with the
behavior of the Web Platform for compatibility.

> Please use RFC 3986 and 3987.

It's probably not productive to relitigate this on this mailing list.

On Saturday, 17 August 2019 12:25:57 PDT Henri Sivonen wrote:
> Presenting a low-level compile-time specializable interface
> but then offering unnecessarily runtime-dispatched encoding through it
> seems like a layering violation.

Oops. s/unnecessarily/a necessarily/

Henri Sivonen

Received on 2019-08-19 19:34:50