[SPEC] Encouraging HTTP Proxies to support Gemini hosts self-blacklisting
sean at conman.org
Mon Feb 22 01:43:58 GMT 2021
It was thus said that the Great Mansfield once stated:
> I must admit, I'm woefully lacking skill or background with robots.txt. It
> seems like it could be a great answer.
> A few questions to help me educate myself:
> 1. How often should that file be referenced by the proxy? It feels like
> an answer might be, to check that URL before every request, but that goes
> in the direction of some of the negative feedback about the favicon. One
> user action -> one gemini request and more.
I would say once per "visit" would be good enough (say you have 50
requests to make to a site---check before doing all 50). Checking
robots.txt for *every* request is a bit too much.
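A minimal sketch of that once-per-visit approach. The names here (fetch, rules_for, the 300-second TTL standing in for one "visit") are assumptions for illustration, not anything from the spec:

```python
import time

CACHE_TTL = 300  # seconds; stands in for one "visit" (assumed value)

_robots_cache = {}  # host -> (fetched_at, list of Disallow prefixes)

def parse_robots(text):
    """Collect Disallow prefixes listed under 'User-agent: webproxy' or '*'."""
    rules, applies = [], False
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if ':' not in line:
            continue
        field, _, value = line.partition(':')
        field, value = field.strip().lower(), value.strip()
        if field == 'user-agent':
            applies = value in ('webproxy', '*')
        elif field == 'disallow' and applies and value:
            rules.append(value)
    return rules

def rules_for(host, fetch):
    """Return the cached rules for host, refetching at most once per TTL.

    fetch(host) is a hypothetical helper that returns the raw robots.txt
    body for that host; a real proxy would issue a Gemini request here.
    """
    now = time.time()
    entry = _robots_cache.get(host)
    if entry is None or now - entry[0] > CACHE_TTL:
        entry = (now, parse_robots(fetch(host)))
        _robots_cache[host] = entry
    return entry[1]
```

With this, a batch of 50 requests to one host costs a single robots.txt fetch.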
> 2. Is 'webproxy' a standard reference to any proxy, or is that something
> left to us to decide?
The robots.txt companion spec for Gemini says:
Below are definitions of various "virtual user agents", each of
which corresponds to a common category of bot. Gemini bots should
respect directives aimed at any virtual user agent which matches
their activity. Obviously, it is impossible to come up with perfect
definitions for these user agents which allow unambiguous
categorisation of bots. Bot authors are encouraged to err on the
side of caution and attempt to follow the "spirit" of this system,
rather than the "letter". If a bot meets the definition of multiple
virtual user agents and is not able to adapt its behaviour in a fine
grained manner, it should obey the most restrictive set of
directives arising from the combination of all applicable virtual
user agents.
# Web Proxies
Gemini bots which fetch content in order to translate said content
into HTML and publicly serve the result over HTTP(S) (in order to
make Geminispace accessible from within a standard web browser)
should respect robots.txt directives aimed at a User-agent of
"webproxy".
So for example, if you are writing a gopher proxy (user makes a gopher
request to get to a Gemini site), then you might want to check for
"webproxy", even though you are serving a gopher site rather than a
website. This is kind of a judgement call.
> 3. Are there globbing-like syntax rules for the Disallow field?
No. But it's not a complete literal match either.

	User-agent: webproxy
	Disallow:

will allow *all* requests.

	User-agent: webproxy
	Disallow: /

will not allow any requests at all.

	User-agent: webproxy
	Disallow: /foo

will only disallow paths that *start* with the string '/foo', so '/foo',
'/foobar', '/foo/bar/baz/' will all be disallowed.
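Because matching is by string prefix, the check on the proxy side can be as simple as this sketch (function name is mine, not from any spec):

```python
def allowed(path, disallow_prefixes):
    """A path is blocked if it starts with any Disallow value."""
    return not any(path.startswith(p) for p in disallow_prefixes)

# '/foo', '/foobar' and '/foo/bar/baz/' all share the prefix '/foo',
# so a single 'Disallow: /foo' rule blocks every one of them:
blocked = [p for p in ('/foo', '/foobar', '/foo/bar/baz/', '/bar')
           if not allowed(p, ['/foo'])]
```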
> 4. I'm assuming there could be multiple rules that need to be mixed. Is
> there a standard algorithm for that process? E.g.:
> User-agent: webproxy
> Disallow: /a
> Allow: /a/b
> Disallow: /a/b/c
Allow: isn't in the standard per se, but many crawlers do accept it. And
the rules for a user agent are applied in the order they're listed. First
match wins.
> Again - it seems like this could work out really well.
> Thanks for helping me learn a bit more!
More about it can be read here.