[tech] [eli5] URI = IRI = ASCII = UTF-8 = Unicode

Stephane Bortzmeyer stephane at sources.org
Sun Jan 3 13:55:22 GMT 2021

On Wed, Dec 30, 2020 at 12:25:31AM +0100,
 Petite Abeille <petite.abeille at gmail.com> wrote 
 a message of 14 lines which said:

> > URI's use UTF-8 encoded octets only by popular convention and not by
> > any hard rule. You can stick any kind of binary data into a URI as
> > long as you percent-encode the non-ASCII bytes.
> Yes, indeed. Any random binary will do, e.g. the query portion could
> contain any weird binary data one sees fit to put there.
> Not so much in other parts of the URI though, UTF-8 rules there. 

This is not true. As Michael said, URIs are bytes, not characters. The
encoding of those bytes is anyone's guess.
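
To make this concrete, here is a minimal Python sketch, using only the
standard library's urllib.parse. The helper names (path_bytes,
try_decode) and the sample paths are mine, invented for illustration.
It shows that percent-decoding gives you bytes, and that the same
bytes may or may not be valid UTF-8:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def path_bytes(path: str) -> bytes:
    """Percent-decode a URI path into raw bytes (hypothetical helper)."""
    return unquote_to_bytes(path)


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Arbitrary binary data survives percent-encoding; nothing requires UTF-8.
blob = bytes([0xE9, 0xFF, 0x00])
print(quote_from_bytes(blob))

# The same visible name, "café", sent by two clients with different encodings.
latin1_path = "/caf%E9"      # 0xE9 is "é" in ISO 8859-1
utf8_path = "/caf%C3%A9"     # 0xC3 0xA9 is "é" in UTF-8

for path in (latin1_path, utf8_path):
    raw = path_bytes(path)
    print(path, raw, try_decode(raw, "utf-8"), try_decode(raw, "latin-1"))
```

The URI itself does not say which of the two decodings the sender
meant; the receiver has to guess or agree on a convention out of band.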

Two details:

* paths (not just queries) can contain any binary garbage, once
percent-encoded, but there are special rules for hostnames.

* the RFC has provisions for "a new URI scheme" which may apply to
us. We can decide here that URIs of scheme "gemini" MUST be entirely in
