[SPEC] Encouraging HTTP Proxies to support Gemini hosts self-blacklisting

Sean Conner sean at conman.org
Mon Feb 22 01:43:58 GMT 2021


It was thus said that the Great Mansfield once stated:
> 
> I must admit, I'm woefully lacking skill or background with robots.txt. It
> seems like it could be a great answer.
> 
> A few questions to help me educate myself:
> 
>  1. How often should that file be referenced by the proxy? It feels like
> an answer might be, to check that URL before every request, but that goes
> in the direction of some of the negative feedback about the favicon. One
> user action -> one gemini request and more.

  I would say once per "visit" would be good enough (say you have 50
requests to make to a site---check before doing all 50).  Checking
robots.txt for *every* request is a bit too much.  
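
  Something like the following rough Python sketch would work: cache the
parsed robots.txt per host and only re-check it once the "visit" is over
(approximated here with a timeout; fetch_robots() is a made-up helper that
fetches gemini://host/robots.txt and returns its text):

    import time
    import urllib.robotparser

    _cache = {}          # host -> (parser, time we last fetched robots.txt)
    _TTL   = 30 * 60     # treat 30 minutes as one "visit"

    def may_fetch(host, path, fetch_robots):
        # fetch_robots(host) is a made-up helper that returns the text of
        # gemini://host/robots.txt (or "" if there isn't one).
        now   = time.time()
        entry = _cache.get(host)
        if entry is None or now - entry[1] > _TTL:
            parser = urllib.robotparser.RobotFileParser()
            parser.parse(fetch_robots(host).splitlines())
            entry = (parser, now)
            _cache[host] = entry
        return entry[0].can_fetch("webproxy", path)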

>  2. Is 'webproxy' a standard reference to any proxy, or is that something
> left to us to decide?

  The guide for Gemini [1] says:

	Below are definitions of various "virtual user agents", each of
	which corresponds to a common category of bot.  Gemini bots should
	respect directives aimed at any virtual user agent which matches
	their activity.  Obviously, it is impossible to come up with perfect
	definitions for these user agents which allow unambiguous
	categorisation of bots.  Bot authors are encouraged to err on the
	side of caution and attempt to follow the "spirit" of this system,
	rather than the "letter".  If a bot meets the definition of multiple
	virtual user agents and is not able to adapt its behaviour in a fine
	grained manner, it should obey the most restrictive set of
	directives arising from the combination of all applicable virtual
	user agents.

	...

	# Web Proxies

	Gemini bots which fetch content in order to translate said content
	into HTML and publicly serve the result over HTTP(S) (in order to
	make Geminispace accessible from within a standard web browser)
	should respect robots.txt directives aimed at a User-agent of
	"webproxy".

  So for example, if you are writing a gopher proxy (the user makes a gopher
request to get to a Gemini site), then you might want to check for
"webproxy", even though you're serving the result over gopher rather than
the web.  This is kind of a judgement call.
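
  If your bot matches more than one of those virtual user agents (say a
proxy that also archives what it fetches), one way to follow the "most
restrictive" advice above is to check every applicable agent and only fetch
a path when all of them allow it.  A rough Python sketch (the agent list is
just an example):

    import urllib.robotparser

    def allowed_for_all(robots_txt, agents, path):
        # Obey the most restrictive combination: fetch only if every
        # applicable virtual user agent is allowed to fetch the path.
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        return all(parser.can_fetch(agent, path) for agent in agents)

    # e.g. allowed_for_all(text, ["webproxy", "archiver"], "/private/")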

>  3. Are there globbing-like syntax rules for the Disallow field?

  No.  But it's not a complete literal match either.  

	Disallow:

will allow *all* requests.

	Disallow: /

will not allow any requests at all.

	Disallow: /foo

will only disallow paths that *start* with the string '/foo', so '/foo',
'/foobar', and '/foo/bar/baz/' will all be disallowed.
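
  In code, that matching is just a prefix test.  A tiny Python sketch of the
semantics above (not a full robots.txt parser):

    def disallowed(path, pattern):
        # An empty pattern blocks nothing; otherwise the request is
        # blocked if the path starts with the pattern.
        return pattern != "" and path.startswith(pattern)

    # disallowed("/foobar", "/foo") -> True
    # disallowed("/bar",    "/foo") -> False
    # disallowed("/bar",    "")     -> False  (Disallow: with no value)
    # disallowed("/bar",    "/")    -> True   (Disallow: / blocks everything)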

>  4. I'm assuming there could be multiple rules that need to be mixed. Is
> there a standard algorithm for that process? E.g.:
> User-agent: webproxy
> Disallow: /a
> Allow: /a/b
> Disallow: /a/b/c

  Allow: isn't in the standard per se, but many crawlers do accept it.  And
the rules for a user agent are applied in the order they're listed.  First
match wins.
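
  Under that first-match-wins reading, evaluating the ordered rules for a
user agent might look like this Python sketch (just one interpretation;
some crawlers prefer the longest matching rule instead):

    def path_allowed(path, rules):
        # rules is an ordered list of ("allow" | "disallow", pattern)
        # pairs for the matching user agent.  The first rule whose
        # pattern is a prefix of the path wins; no match means allowed.
        for kind, pattern in rules:
            if pattern != "" and path.startswith(pattern):
                return kind == "allow"
        return True

    # With the rules from the example above:
    #   rules = [("disallow", "/a"), ("allow", "/a/b"),
    #            ("disallow", "/a/b/c")]
    #   path_allowed("/a/b/x", rules) -> False (first match is Disallow: /a)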

> Again - it seems like this could work out really well.
> 
> Thanks for helping me learn a bit more!

  More about it can be read here [2].

  -spc

[1]	https://portal.mozz.us/gemini/gemini.circumlunar.space/docs/companion/robots.gmi

[2]	http://www.robotstxt.org/robotstxt.html


