Metadata Without A Proposal

nothien at uber.space
Fri Feb 26 10:51:08 GMT 2021


Hi!

I've lost track of the currently raging metadata thread entirely, and so
I've started this as a new post.

Thus far, I think there's general consensus on the following needs for
any metadata proposal:

1. Must degrade gracefully for clients that don't understand metadata.

2. Must not be English-specific.
  
   Although the majority of gemtext/Gemini content is in English at the
   moment, we want more diversity.  "Forcing" (by convention) the usage
   of English upon non-English users is unwanted.
  
   This rules out some of the current proposals which are oriented around
   'tags', e.g. 'author' or 'license'.  Theoretically, you could have a
   list of tags for different languages, but that would grow into a
   horrifically long list, and is generally unsustainable.
  
3. Must be machine-parsable.
  
   Search engines, archivers, and other crawler-style clients need to be
   attended to.  Some of the information they need is: date, author, and
   license.
  
4. Must not affect presentation.
  
   gemtext as a whole is about separating content from presentation.
   Some of the earlier metadata proposals referred to metadata for
   presentation, e.g. to specify a color to view the text in.  This is
   against the spirit of gemtext/Gemini (if not the spec).
  
5. Must be difficult to extend.
  
   Again, this comes from the general Gemini philosophy that anything
   that can be misused will be misused.  This rules out lots of current
   proposals because they specify tags, and the usage of tags can only be
   controlled by convention, which is subject to change.
  
6. Must be accessible.
  
   Some proposals discussed the usage of emojis, and others have opted
   for creating new unofficial line types.  These don't degrade
   gracefully for things like screen readers, until they adopt the
   metadata proposal.  That's not great.

I think that we don't need a "metadata proposal" to solve any of these
problems.  We already have everything we need in pre-existing formats
and specifications.  Only three metadata fields are really necessary:
date, author, and license.  New fields, if completely necessary, need to
be handled on a case-by-case basis.

## Dates

Dating content is mostly relevant to search engines, so that old (or
new) results can be filtered out.  My proposal for dates is to use what
we already have: the gmisub companion spec.  If any content (e.g. an
article) has an associated date, the index page should list the content
page, with that date, in gmisub format.  If content pages don't have any
associated date, simply don't list a date in the index.  Search engines
and crawlers can still choose to include date information based on when
they last crawled the page.

=> gemini://gemini.circumlunar.space/docs/companion/
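
As a rough sketch of what a crawler might do with such an index page: the
snippet below assumes the subscription convention of a link line whose
label begins with a YYYY-MM-DD date.  The function name `dated_links` and
the sample index are made up for illustration, and relative URLs are left
unresolved.

```python
import re
from datetime import date

# A gemtext link line: "=>", optional whitespace, a URL, then an optional label.
LINK_RE = re.compile(r"^=>\s*(\S+)(?:\s+(.*))?$")
# Assumed convention: the label begins with a date in YYYY-MM-DD form.
DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})")


def dated_links(index_text):
    """Map each dated link URL found in an index page to its date."""
    found = {}
    for line in index_text.splitlines():
        link = LINK_RE.match(line)
        if not link:
            continue  # not a link line
        url, label = link.group(1), link.group(2) or ""
        stamp = DATE_RE.match(label)
        if not stamp:
            continue  # undated link: the index carries no date for it
        try:
            found[url] = date.fromisoformat(stamp.group(1))
        except ValueError:
            continue  # looks like a date but isn't one, e.g. 2021-13-40
    return found


index = """# My gemlog
=> 2021-02-26-proposal.gmi 2021-02-26 Metadata Without A Proposal
=> about.gmi About this capsule
"""
print(dated_links(index))
```

Pages that are linked without a date simply don't appear in the result, so
the crawler is free to fall back to its own crawl time for those.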

One question this raises is which index page to use.  I think that the
engine should search upwards through parent directories until it finds
an index page which fits the gmisub format and lists the content page
(it would need to do this anyways in order to crawl the capsule
containing the content page).  If the engine already knows about an
index page which is on the same capsule and lists the content page, it
can use that.
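
The upward search could look something like the sketch below.  It only
generates the candidate directory URLs, nearest first; fetching each one and
checking whether it lists the content page is left to the crawler.  The
function name `candidate_indexes` and the example.org URL are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def candidate_indexes(content_url):
    """Yield the parent directory URLs of content_url, nearest first."""
    parts = urlsplit(content_url)
    directory = posixpath.dirname(parts.path)
    while True:
        # Directory URLs are written with a trailing slash.
        path = directory if directory.endswith("/") else directory + "/"
        yield f"{parts.scheme}://{parts.netloc}{path}"
        if directory in ("", "/"):
            break  # reached the capsule root
        directory = posixpath.dirname(directory)


for url in candidate_indexes("gemini://example.org/posts/2021/a.gmi"):
    print(url)
```

The URL is assembled by hand rather than with `urllib.parse.urljoin`, since
`urljoin` does not resolve relative references for schemes it doesn't know
about, and `gemini` is not registered by default.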

## Licenses

We already have a great convention for licenses: stating the license on
the last line of the document, with the line starting with `--`.  For
example:

```An example of a page with a license line
Hello world!

-- CC-BY-SA nothien
```

All we need to do with this convention is to formalize it as a companion
specification, maybe as `-- [SPDX license identifier] [owner]`.
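
To show that such a line is easy to pick up mechanically, here is a minimal
sketch.  It assumes the `-- [identifier] [owner]` shape suggested above,
treats the owner as optional, and does not check the identifier against the
SPDX list.  The name `license_line` is made up for illustration.

```python
import re

# "--", whitespace, an identifier without spaces, then an optional owner.
LICENSE_RE = re.compile(r"^--\s+(\S+)(?:\s+(.+?))?\s*$")


def license_line(gemtext):
    """Return (license, owner) from the last non-blank line, or None."""
    lines = [line for line in gemtext.splitlines() if line.strip()]
    if not lines:
        return None
    match = LICENSE_RE.match(lines[-1])
    if not match:
        return None  # the page has no license line
    return match.group(1), match.group(2)  # owner is None when absent


page = "Hello world!\n\n-- CC-BY-SA nothien\n"
print(license_line(page))
```

A client that doesn't know the convention just shows the line as ordinary
text, which is the graceful degradation asked for in point 1.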

## Authors

There are two possibilities I see with author metadata: either take it
from the license line, discussed above, or extend the gmisub spec to
also allow for an optional author field.

```A possible extension to the gmisub syntax with an author field
=> URL YYYY-MM-DD (Author) Title
```

We can tweak the format around a bit so that currently existing titles
which start with parenthesized text aren't misinterpreted.  In addition,
one shouldn't have to repeat the author field for every line; we can
have some system like only requiring the author field when it is
different from the immediately previous author.  I prefer the first
option, but I haven't explored when the license owner would differ from
the author (which I think is the case for e.g. news companies).
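
For the second option, a parser might look like the sketch below.  It takes
the `(Author)` syntax above literally and carries the most recent author
forward to lines that omit one.  Both of those are assumptions about a
format that isn't settled, and the sketch makes no attempt to solve the
problem of existing titles that begin with parenthesized text.  The name
`parse_entries` and the sample data are hypothetical.

```python
import re

# "=> URL YYYY-MM-DD", an optional "(Author)", then the title.
ENTRY_RE = re.compile(
    r"^=>\s*(\S+)\s+(\d{4}-\d{2}-\d{2})(?:\s+\(([^)]*)\))?\s*(.*)$"
)


def parse_entries(index_text):
    """Return (url, date, author, title) tuples for each dated link line."""
    entries = []
    author = None  # most recently named author, if any
    for line in index_text.splitlines():
        match = ENTRY_RE.match(line)
        if not match:
            continue  # heading, text, or undated link
        url, day, named, title = match.groups()
        if named:
            author = named  # a new author applies until the next one is named
        entries.append((url, day, author, title))
    return entries


index = """=> a.gmi 2021-02-24 (nothien) First post
=> b.gmi 2021-02-25 Second post
=> c.gmi 2021-02-26 (someone-else) Guest post
"""
for entry in parse_entries(index):
    print(entry)
```

Here the second entry inherits "nothien" from the line before it, so the
author only has to be written when it changes.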

## Other Fields

Clearly, other fields aren't supported by this.  If you want to place
additional metadata in your content, then I suggest writing it in
natural language.  If it is absolutely necessary to have it
machine-parsable (so that it can be specially understood by e.g. search
engines) then we can talk about that here on the ML, but others have
argued against e.g. tags because they allow easily manipulating search
results.  Expect resistance.

## Metadata for Storage

Author and license metadata is stored within the page itself, and so
that's not a problem.  Personally, I store date information in the file
name of the document (e.g. 2021-02-26-proposal.gmi), but I understand
that this doesn't work for everyone: in that case, see below.
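
Pulling the date back out of such a file name takes only a few lines.  This
sketch assumes the `YYYY-MM-DD-` prefix from the example above; the function
name `date_from_filename` is made up.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def date_from_filename(path):
    """Return the date encoded as a YYYY-MM-DD- prefix, or None."""
    name = PurePosixPath(path).name
    match = re.match(r"(\d{4}-\d{2}-\d{2})-", name)
    if not match:
        return None  # no date prefix in the file name
    try:
        return date.fromisoformat(match.group(1))
    except ValueError:
        return None  # right shape, but not a real calendar date


print(date_from_filename("posts/2021-02-26-proposal.gmi"))
```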

There are legitimate uses for additional metadata when storing gemtext,
such as for capsule-local tagging.  These fields should be stored using
any arbitrary convention in the content: after all, these fields are not
meant to be parsed by external client software (i.e. search engines and
crawlers), but are only parsed by capsule-local software (such as to
organize content by tag).

## Conclusion

I don't think we need a 'metadata proposal' to achieve the goals we're
looking for.  The format conventions are already mostly in place; we
just need to formalize them.

~aravk | ~nothien

