[tech] [eli5] URI = IRI = ASCII = UTF-8 = Unicode

Stephane Bortzmeyer stephane at sources.org
Sun Jan 3 13:55:22 GMT 2021


On Wed, Dec 30, 2020 at 12:25:31AM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 14 lines which said:

> > URI's use UTF-8 encoded octets only by popular convention and not by
> > any hard rule. You can stick any kind of binary data into a URI as
> > long as you percent-encode the non-ASCII bytes.
> 
> Yes, indeed. Any random binary will do, e.g. the query portion could
> contain any weird binary data one sees fit to put there.
> 
> Not so much in other parts of the URI though, UTF-8 rules there. 

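This is not true. As Michael said, URIs are bytes, not characters: any
octet can appear once it is percent-encoded, and nothing in the URI
says which character encoding, if any, those octets are in. The
encoding is anyone's guess.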

Two details:

* Paths (not just queries) can contain any binary garbage, but there
are special rules for hostnames.

* The RFC has provisions for "a new URI scheme", which may apply to
us. We can decide here that URIs of scheme "gemini" MUST be entirely
in UTF-8.
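
To illustrate both points, here is a minimal sketch in Python using
only the standard urllib.parse module. The host example.org, the two
sample URIs and the helper names path_bytes and is_utf8_path are made
up for this example, and the UTF-8 check is only one possible reading
of what a "MUST be UTF-8" rule for "gemini" could look like, not
anything the specification currently says.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def path_bytes(uri: str) -> bytes:
    """Return the raw octets of the path after percent-decoding."""
    return unquote_to_bytes(urlsplit(uri).path)


def is_utf8_path(uri: str) -> bool:
    """Hypothetical check: do the decoded path octets form valid UTF-8?"""
    try:
        path_bytes(uri).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Both URIs are syntactically fine; only the octets behind them differ.
latin1 = "gemini://example.org/caf%E9"    # 0xE9 is 'é' in ISO-8859-1
utf8 = "gemini://example.org/caf%C3%A9"   # 0xC3 0xA9 is 'é' in UTF-8

print(path_bytes(latin1))
print(path_bytes(utf8))
print(is_utf8_path(latin1), is_utf8_path(utf8))
```

Nothing in the first URI tells a client that the path was meant as
ISO-8859-1; a scheme-level rule would simply let a client or server
reject it.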





