Date: Mon, 19 Aug 2019 14:43:06 -0700
On Monday, 19 August 2019 10:34:34 PDT Henri Sivonen wrote:
> > You need to decode before you can encode again. Just try normalising the
> > following URL/IRI excerpt:
> > %C3%C3%A9%A9
>
> Web-compatible URL parsing does not involve decoding and then
> encoding. However, display to users in the URL bar does involve
> percent-decoding, checking for UTF-8ness, and then UTF-8 decoding if
> the UTF-8ness check passed.
>
> In terms of attacks that could confuse the users, it's probably a bad
> idea to try to take a URL and merely _minimize_ its percent encodes
> to produce a Unicode string that has some percent encodes left but
> replaces all valid UTF-8 percent encode sequences with Unicode, which
> seems to be the operation you suggest.
Indeed.
As I said, this is the only operation I've seen in 10 years that parsed or
generated UTF-8 that wasn't stored as a contiguous sequence of UTF-8 code
units. It's also the only case where I needed any control over the error
conditions, though it sufficed to know "this byte does not start a valid
UTF-8 sequence". If it doesn't, then keep that byte percent-encoded (or
encode it if it wasn't) and
then try the next byte. Everywhere else, a failed decoding can either be
stopped completely (it's not UTF-8) or resumed after adding a replacement
character.
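
To make that loop concrete, here is a minimal C++ sketch of the
minimisation described above. It is not anyone's actual implementation:
the helper names are made up, and it assumes the percent-decoding pass
has already produced the raw byte string.

#include <cstdint>
#include <cstdio>
#include <string>

// Returns the length (1..4) of a valid UTF-8 sequence starting at
// bytes[pos], or 0 if bytes[pos] does not start a valid sequence.
static size_t utf8SequenceLength(const std::string &bytes, size_t pos)
{
    auto in = [&](size_t i, uint8_t lo, uint8_t hi) {
        return i < bytes.size() &&
               static_cast<uint8_t>(bytes[i]) >= lo &&
               static_cast<uint8_t>(bytes[i]) <= hi;
    };
    uint8_t b = static_cast<uint8_t>(bytes[pos]);
    if (b <= 0x7F) return 1;                         // ASCII
    if (b >= 0xC2 && b <= 0xDF)                      // 2 bytes, no overlongs
        return in(pos+1, 0x80, 0xBF) ? 2 : 0;
    if (b == 0xE0)                                   // 3 bytes, no overlongs
        return in(pos+1, 0xA0, 0xBF) && in(pos+2, 0x80, 0xBF) ? 3 : 0;
    if ((b >= 0xE1 && b <= 0xEC) || b == 0xEE || b == 0xEF)
        return in(pos+1, 0x80, 0xBF) && in(pos+2, 0x80, 0xBF) ? 3 : 0;
    if (b == 0xED)                                   // no surrogates
        return in(pos+1, 0x80, 0x9F) && in(pos+2, 0x80, 0xBF) ? 3 : 0;
    if (b == 0xF0)                                   // 4 bytes, no overlongs
        return in(pos+1, 0x90, 0xBF) && in(pos+2, 0x80, 0xBF)
            && in(pos+3, 0x80, 0xBF) ? 4 : 0;
    if (b >= 0xF1 && b <= 0xF3)
        return in(pos+1, 0x80, 0xBF) && in(pos+2, 0x80, 0xBF)
            && in(pos+3, 0x80, 0xBF) ? 4 : 0;
    if (b == 0xF4)                                   // <= U+10FFFF
        return in(pos+1, 0x80, 0x8F) && in(pos+2, 0x80, 0xBF)
            && in(pos+3, 0x80, 0xBF) ? 4 : 0;
    return 0;       // 0x80..0xC1 and 0xF5..0xFF never start a sequence
}

// Pass valid UTF-8 through decoded; keep (or re-introduce) %XX for
// every byte that fails validation, then try again at the next byte.
static std::string minimisePercentEncoding(const std::string &bytes)
{
    std::string out;
    for (size_t i = 0; i < bytes.size(); ) {
        if (size_t len = utf8SequenceLength(bytes, i)) {
            out.append(bytes, i, len);    // valid sequence: emit decoded
            i += len;
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X",
                          unsigned(static_cast<unsigned char>(bytes[i])));
            out += buf;                   // invalid: keep percent-encoded
            ++i;                          // then try the next byte
        }
    }
    return out;
}

int main()
{
    // "%C3%C3%A9%A9" percent-decodes to C3 C3 A9 A9: the first C3 is
    // not followed by a continuation byte, C3 A9 is U+00E9, and the
    // trailing A9 is a lone continuation byte.
    std::puts(minimisePercentEncoding("\xC3\xC3\xA9\xA9").c_str());
    // prints: %C3é%A9
}

Note that this keeps the byte-at-a-time recovery: after a failure the
loop advances exactly one byte, so a later valid sequence (the middle
C3 A9 here) still decodes.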
The JSON parser is an example of the first option: no BOM accepted, and any
invalid UTF-8 sequence invalidates the entire document.
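
For contrast, here is a sketch of that strict policy, reusing the
hypothetical utf8SequenceLength() helper from the previous sketch:

// Strict validation: no recovery, no replacement characters.
static bool isValidStrictUtf8Document(const std::string &bytes)
{
    if (bytes.rfind("\xEF\xBB\xBF", 0) == 0)
        return false;                  // no BOM accepted
    for (size_t i = 0; i < bytes.size(); ) {
        size_t len = utf8SequenceLength(bytes, i);
        if (len == 0)
            return false;              // one bad sequence invalidates all
        i += len;
    }
    return true;
}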
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Software Architect - Intel System Software Products