Metadata for Grandma

Last update: September 13, 2002

Metadata, or "data about data", seems like it should be straightforward. Intuitively, we can relate to card catalogs, indexes, and tables of contents. So why does advanced web metadata have such a low adoption rate?

I don't know, but I can venture a guess.

Metadata is subject to network effects: the more widely it's used, the more useful it becomes. The converse is also true. For example, a card catalog is hardly worth the effort if only a fraction of the books in the library are represented. On the Web, a look at real-world sites shows that the HTML <meta> tag is widely used, particularly in the pursuit of high search engine rankings, while more advanced technologies such as RDF and Topic Maps are not. This creates a vicious circle: developers aren't learning metadata, so little of it gets deployed, which makes it not worth learning! The key to metadata success, and to a "semantic web" readable by both humans and machines, is widespread deployment of metadata. In that spirit, I will take a stab at defining a metadata syntax that:

Is conceptually similar to HTML <meta>, as currently defined and used, but still suitable as a generic XML technology
Can be easily grasped, even by my Grandma, and put into immediate use
Provides the barest of features needed for metadata to be useful

Metadata basics

Metadata can be defined as a series of statements. Each statement contains 1) a subject, or what the statement is about; 2) a property of the subject; and 3) an object, the value or target of the property. The object of one statement can be the subject of another, creating a tree-like web of statements. The remainder of this document proposes a general-purpose metadata vocabulary suitable for use in XHTML or any XML-based language.
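
The statement model above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: the Statement type, the helper functions, and the example.com identifiers are all hypothetical names chosen for the sketch.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    subject: str          # what the statement is about
    prop: str             # a property of the subject
    obj: Optional[str]    # the value or target of the property


# Hypothetical identifiers, used only for illustration.
DOC = "http://example.com/doc"
EDITOR = "http://example.com/people/editor"

statements = [
    Statement(DOC, "title", "Metadata for Grandma"),
    Statement(DOC, "editor", EDITOR),
    # The object of the previous statement is the subject of this one.
    Statement(EDITOR, "fullName", "Micah Dubinko"),
]


def about(subject, stmts):
    """Return every statement whose subject is `subject`."""
    return [s for s in stmts if s.subject == subject]


def chained(stmts):
    """Return statements whose object is itself the subject of another statement."""
    subjects = {s.subject for s in stmts}
    return [s for s in stmts if s.obj in subjects]
```

Here about(DOC, statements) collects the two statements made about the document, and chained(statements) picks out the "editor" statement, whose object goes on to be described in its own right.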

Element <meta>

The only element in this syntax has the local name "meta". It takes the following attributes:

about - the subject. Three conditions are possible with this attribute:
1. When omitted, the implied subject is the containing document (as in HTML).
2. When the value is an absolute URI, the identified resource is the subject.
3. When the value is an IDREF of another <meta> element, the object of the pointed-to statement is the subject of this statement. See the examples below.
name - the property. This attribute is required. It must be a QName.
content - the object. This attribute may be any string value, or omitted under certain conditions listed below.
id - an optional unique identifier for the element.

This element has empty content.

One special case exists, namely where the object is inconvenient to represent in a machine-addressable manner but still requires additional statements about it. For example, a specification editor is a human being (well, in most cases). As of 2002, people don't yet have IP addresses, DNS names and so on, so the content attribute can be left off entirely, although separate additional statements could still be made. See the examples below.
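
As a rough sketch of how a processor might apply the three conditions for the about attribute, the Python function below classifies a value. The function name, the return convention, and the test used to tell a URI from an IDREF (an absolute URI has a scheme, while an IDREF cannot contain a colon) are assumptions of this sketch, not something the proposal prescribes.

```python
from urllib.parse import urlsplit


def resolve_subject(about, document_uri, known_ids):
    """Classify the value of an `about` attribute.

    Returns a (kind, value) pair, where kind is "document", "resource",
    or "idref". `known_ids` is the set of IDs declared on <meta> elements.
    """
    if about is None:
        # Condition 1: attribute omitted, so the containing document is the subject.
        return ("document", document_uri)
    if urlsplit(about).scheme:
        # Condition 2: a scheme is present, so treat the value as an absolute URI.
        return ("resource", about)
    if about in known_ids:
        # Condition 3: the value names another <meta> element.
        return ("idref", about)
    raise ValueError(f"about={about!r} is neither an absolute URI nor a known ID")
```

For example, resolve_subject("ed", doc_uri, {"ed"}) yields ("idref", "ed"), while a value such as "http://dubinko.info" is classified as a resource.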

Open Issues

Confusion about the subject: is the thing being described the abstract 'resource' that is ultimately being referred to, or the specific representation that comes over the wire?

Should fragments be allowed on the subject URI? If so, what does that mean?

As of August 2002, the W3C TAG is actively debating these issues.

Namespaces

The element <meta>, when used within an arbitrary XML document, must have the namespace URI http://dubinko.info/2002/meta/. The only exception is for compatibility with XHTML, where the element may instead have the namespace of the XHTML version in use, provided the value http://dubinko.info/2002/meta/ is specified on the profile attribute of the enclosing <head>. (The profile attribute is the signal for the processor to interpret the attributes on <meta> specially, as defined here.)
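
The namespace rule could be checked along these lines. This is a minimal sketch using Python's standard xml.etree.ElementTree; the function name is made up, and it assumes the XHTML 1.x namespace URI and that profile holds a space-separated list of URIs, as in HTML 4.

```python
import xml.etree.ElementTree as ET

META_NS = "http://dubinko.info/2002/meta/"
XHTML_NS = "http://www.w3.org/1999/xhtml"  # assumption: XHTML 1.x


def recognized_meta_elements(root):
    """Return the <meta> elements a processor should interpret per this proposal."""
    # General case: <meta> in the proposal's own namespace, anywhere in the tree.
    found = list(root.iter(f"{{{META_NS}}}meta"))
    # XHTML compatibility: <meta> children of a <head> whose profile names META_NS.
    for head in root.iter(f"{{{XHTML_NS}}}head"):
        if META_NS in head.get("profile", "").split():
            found.extend(head.findall(f"{{{XHTML_NS}}}meta"))
    return found
```

An XHTML <meta> inside a <head> without the profile value is left alone, so ordinary HTML metadata is not reinterpreted by accident.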

Worked Examples:

Copyright notice in XML

NS: ...xmlns:dc="http://purl.org/dc/elements/1.1/"
NS: ...xmlns="http://dubinko.info/2002/meta/"
1: <meta name="dc:rights" content="Copyright 2002 Micah Dubinko. All rights reserved."/>

Line 1, lacking the about attribute, refers to the containing document and states that its "rights", as defined by the Dublin Core, are given by the copyright message indicated. Note the use of namespace defaulting on <meta>.

Copyright notice in backwards compatible XHTML

NS: ...xmlns:dc="http://purl.org/dc/elements/1.1/"
NS: ...xmlns="http://www.w3.org/1999/xhtml"
1: <head profile="http://dubinko.info/2002/meta/">...
2:   <meta name="dc:rights" content="Copyright 2002 Micah Dubinko. All rights reserved."/>
3: </head>

Line 1 contains the standard XHTML <head> element, in the XHTML namespace as expected. The value of the little-used attribute profile tells the HTML processor to treat the following <meta> tags specially, as defined here.

Line 2 serves the same function as in the previous example, even though here the <meta> element is in the XHTML namespace.

Line 3 closes the html:head tag. Note that everything in this example is valid XHTML Strict.

A paraphrased example from the RDF specification:

There exists a document (this one) with the title "Metadata for Grandma". This document has an editor, and the editor has a name, "Micah Dubinko", and a home page, "http://dubinko.info". As markup:

NS: ...xmlns:dc="http://purl.org/dc/elements/1.1/"
NS: ...xmlns:ex="http://example.com/mymetadefinitions"
NS: ...xmlns="http://dubinko.info/2002/meta/"
1: <meta name="dc:title" content="Metadata for Grandma"/>
2: <meta name="ex:editor" xml:id="ed" />
<!-- there is no URI resource that directly identifies the _person_ of the editor -->
3: <meta about="ed" name="ex:fullName" content="Micah Dubinko"/>
4: <meta about="ed" name="ex:homepage" content="http://dubinko.info"/>

Line 1 talks about this very document (no 'about' attribute). It says the document has a "title" (as defined by Dublin Core) of "Metadata for Grandma".

Line 2 also talks about this document, stating that this document has an "editor", as defined by the namespace mapped to ex:, and that no further information is available at this point, though later statements referencing the ID may provide additional information about the editor.

Line 3 references the earlier ID to talk about the object of line 2, the person, stating that the person has a "fullName" of "Micah Dubinko".

Line 4 also references the person from line 2, stating that the person has a "homepage" of "http://dubinko.info".
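
To make the reading of these four lines concrete, here is a hedged Python sketch that turns the markup into (subject, property, object) tuples. Several things are assumptions of the sketch rather than of the proposal: the <doc> wrapper element, the document URI passed in, writing properties in "{namespace}local" form, using "_:id" as a placeholder for an object that has no URI, requiring every property name to carry a prefix, and ignoring namespace scoping (prefixes are kept in one flat table).

```python
import io
import xml.etree.ElementTree as ET

META_NS = "http://dubinko.info/2002/meta/"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The example above, minus line labels, inside a hypothetical <doc> wrapper.
SAMPLE = """\
<doc xmlns="http://dubinko.info/2002/meta/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:ex="http://example.com/mymetadefinitions">
  <meta name="dc:title" content="Metadata for Grandma"/>
  <meta name="ex:editor" xml:id="ed"/>
  <meta about="ed" name="ex:fullName" content="Micah Dubinko"/>
  <meta about="ed" name="ex:homepage" content="http://dubinko.info"/>
</doc>
"""


def extract_statements(xml_text, document_uri):
    """Return a list of (subject, property, object) tuples."""
    prefixes = {}     # prefix -> namespace URI (flat table, scoping ignored)
    statements = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            prefixes[prefix] = uri
            continue
        elem = payload
        if elem.tag != f"{{{META_NS}}}meta":
            continue
        # Expand the QName in the name attribute using the declared prefixes.
        prefix, _, local = elem.get("name").partition(":")
        prop = f"{{{prefixes[prefix]}}}{local}"
        about = elem.get("about")
        if about is None:
            subject = document_uri      # implied subject: the document
        elif ":" in about:
            subject = about             # absolute URI
        else:
            subject = f"_:{about}"      # IDREF to an earlier statement's object
        obj = elem.get("content")
        if obj is None:
            # No content: the object is an unnamed thing, reachable by ID only.
            node_id = elem.get(XML_ID) or elem.get("id")
            obj = f"_:{node_id}" if node_id else None
        statements.append((subject, prop, obj))
    return statements
```

Calling extract_statements(SAMPLE, some_uri) gives four tuples: two with the document as subject, and two with "_:ed" as subject, matching the line-by-line reading above.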

Alternate Approaches

Don't Overload 'about'. Having the 'about' attribute take either a URI or an IDREF might not be the ideal approach. It could be split into two attributes:

about = URI reference
aboutref = IDREF

This would raise the question of what to do if both attributes are present. One possibility is simply that statements are being made about two things at the same time. Alternatively, the <meta> tags could nest, with nesting carrying the same meaning as the ID/IDREF connection:

<meta name="ex:editor">
  <meta name="ex:fullName" content="Micah Dubinko"/>
  <meta name="ex:homepage" content="http://dubinko.info"/>
</meta>

Use a better name than 'name'. This proposal uses an attribute named 'name' because that's what the <meta> tag uses in previous (X)HTML versions. A better name could be chosen, perhaps "property".

Use element content. Modern XML browsers will correctly refuse (by default) to render things in the <head>, so a better representation might make use of the element content rather than having every <meta> empty. For instance, the 'content' could be the element content:

<meta property="dc:rights">Copyright 2002 Micah Dubinko. All rights reserved.</meta>

Use QNames in element names, not attribute values. A different approach, which gets rid of the need for a QName in an attribute value, would be to allow a single level of child elements, though this would probably not be readily expressible in a DTD. For example:

<meta>
  <dc:rights>Copyright 2002 Micah Dubinko. All rights reserved.</dc:rights>
</meta>

<head profile="http://dubinko.info/2002/meta/">...
  <meta>
    <dc:rights>Copyright 2002 Micah Dubinko. All rights reserved.</dc:rights>
  </meta>
</head>

<meta>
  <dc:title>Metadata for Grandma</dc:title>
  <ex:editor xml:id="ed"/>
</meta>
<meta aboutref="ed">
  <ex:fullName>Micah Dubinko</ex:fullName>
  <ex:homepage>http://dubinko.info</ex:homepage>
</meta>

Allowing multiple levels of child elements would essentially be RDF "striping" syntax, which quickly gets to complexity levels that Grandma can't figure out.
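
One practical consequence of the child-element form is that an ordinary namespace-aware parser resolves the property names by itself, with no QName handling in application code. The Python sketch below illustrates this under stated assumptions: the function name and document URI are hypothetical, only the implied subject and a URI-valued about are handled, and aboutref is left out.

```python
import xml.etree.ElementTree as ET

META_NS = "http://dubinko.info/2002/meta/"

SAMPLE = """\
<meta xmlns="http://dubinko.info/2002/meta/"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:rights>Copyright 2002 Micah Dubinko. All rights reserved.</dc:rights>
</meta>
"""


def child_element_statements(xml_text, document_uri):
    """Read the single-level child-element form into (subject, property, object)."""
    root = ET.fromstring(xml_text)
    statements = []
    for meta in root.iter(f"{{{META_NS}}}meta"):
        subject = meta.get("about", document_uri)
        for child in meta:
            # child.tag is already "{namespace}local"; the parser did the work.
            statements.append((subject, child.tag, (child.text or "").strip()))
    return statements
```

The property arrives as "{http://purl.org/dc/elements/1.1/}rights" without any prefix table, which is the main attraction of this alternative.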

Conclusion

Several capabilities of full RDF/TM are missing, such as containers, statements about statements, and non-addressable subjects. That's OK. Leaving these features out is a trade-off of expressiveness vs. adoption rate. By keeping the syntax simple and familiar, the chances of 'typical' web developers deciding to learn and use metadata are much better. When more people use it, everyone benefits. And such features could be added back in later.

Still, such a simple and limited proposal could never replace RDF or Topic Maps. Instead, this is a complementary technology that provides an important but incremental improvement over HTML <meta> tags, which have passed the critical test of getting adopted en masse.

This approach is fully compatible with XHTML, especially when making statements about the containing document itself, which doesn't require the non-standard attribute about. This proposal would be suitable for inclusion in a future version of the XHTML specification.

Here's to a better future Web! -m