Elemental XHTML

Examining XHTML 2.0 attribute vs. element decisions

Editor: Micah Dubinko (mdubinko at yahoo dot com)

Last update: November 12, 2002

1. Introduction

“[The] DTD for Internet drafts puts human text in attrs. bummer, that.” -- Dan Connolly

One reason often given for keeping human-readable text out of attribute values is internationalization. The W3C XML Schema Part 2 Recommendation, under the xsd:string datatype, says:

NOTE: Many human languages have writing systems that require child elements for control of aspects such as bidirectional formating or ruby annotation.... Thus, string, as a simple type that can contain only characters but not child elements, is often not suitable for representing text. In such situations, a complex type that allows mixed content should be considered.

In so many words, “don't use attributes for human-readable text”.

The “elements vs. attributes” question has raged since about 5 minutes after elements and attributes both existed. Some of the element-vs-attribute decisions in HTML date back to what is essentially antiquity. This paper examines what XHTML would look like under a few possible constraints on the element-vs-attribute design. The purpose is exploration, not advocacy. The constraints are:

All human-readable text must occur as element content, not attribute values
No 'structured attribute values' (e.g. space-separated list) allowed. Use markup.

2. Attribute types subject to change

The following attribute types would require change:

AttributeType	Used in XHTML2?	Human-readable (HR)	Space-separated (SS) list	Comma-separated (CS) list	Micro-parse (MP) value
CDATA	Yes	X
PCDATA		X
IDREFS	Yes		X
NMTOKENS			X
Charsets			X
Class	Yes		X
ContentTypes				X
Coordinates	Yes			X
Length	Yes				%
LinkTypes	Yes		X
MediaDesc	Yes			X
MultiLength	Yes				*
MultiLengths				X	*
Text	Yes	X
URIs	Yes		X
URI List	Yes			X

(Note that a few of these are defined but never used in XHTML 2.0, and there's some confusion in the spec among CDATA, PCDATA, and Text.)

By looking for attributes with these datatypes, a few immediate consequences can be discerned.

One large-scale change involves the 'title' attribute, which is defined on nearly every element. One way to handle it is to define a new content model, “advisory”, whose content would not be rendered by default. (Of course, stylesheets would remain free to do as they please with any element.) XHTML already has a <title> element, currently defined as a required element in the <head>. Allowing <title> elements throughout the document would seem to be a straightforward progression. Section 3 has several examples.
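
As a rough illustration of that progression, the following Python sketch moves each 'title' attribute into a leading <title> child element. It is not part of any XHTML 2.0 tooling: the function name title_attr_to_element is made up for this example, and placing the new element first in its parent is an assumption that simply mirrors the examples in section 3.

```python
import xml.etree.ElementTree as ET

def title_attr_to_element(root):
    """Move every 'title' attribute into a leading <title> child (hypothetical helper)."""
    # Snapshot the tree first so newly inserted <title> elements are not revisited.
    for elem in list(root.iter()):
        value = elem.attrib.pop("title", None)
        if value is None:
            continue
        child = ET.Element("title")
        child.text = value
        # Preserve mixed content: the parent's leading text now follows the new child.
        child.tail = elem.text
        elem.text = None
        elem.insert(0, child)
    return root

doc = ET.fromstring('<p title="My favorite opener">Call me Ishmael.</p>')
result = ET.tostring(title_attr_to_element(doc), encoding="unicode")
print(result)
```

The only subtle step is the text/tail shuffle, which is what keeps "Call me Ishmael." after the new element rather than losing it.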

Another interesting case is the 'class' attribute, which can take a space-separated list of classes. In theory, it would be better to use markup to delimit the list. Examples of this are shown later.
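
To make the idea concrete, here is a minimal Python sketch that splits a space-separated attribute value into one child element per token. The helper name list_attr_to_elements is hypothetical, and reusing the attribute name as the element name is an assumption taken from the <class> examples in section 3. The same approach would presumably apply to other space-separated attributes such as 'rel', 'rev', and 'headers'.

```python
import xml.etree.ElementTree as ET

def list_attr_to_elements(elem, name):
    """Replace a space-separated attribute with one child element per token."""
    tokens = elem.attrib.pop(name, "").split()
    if not tokens:
        return elem
    children = []
    for token in tokens:
        child = ET.Element(name)  # assumption: element name mirrors the attribute name
        child.text = token
        children.append(child)
    # Keep any leading text of the parent after the last new child.
    children[-1].tail = elem.text
    elem.text = None
    for index, child in enumerate(children):
        elem.insert(index, child)
    return elem

doc = ET.fromstring('<br class="separator compact"/>')
result = ET.tostring(list_attr_to_elements(doc, "class"), encoding="unicode")
print(result)
```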

Other than changes related to 'class' and 'title', the following elements would also need to be adjusted:

Element	Reasons
<a>	SS ('rel' and 'rev'); CS ('coords')
<area>	HR ('alt'); CS ('coords')
<link>	SS ('rel' and 'rev'); CS ('media')
<object>	HR ('standby'); CS ('archive'); MP ('content-length')
<param>	HR ('value')
<style>	CS ('media')
<table>	HR ('summary')
<th>	HR ('abbr'); SS ('headers'); CS ('axis')
<td>	HR ('abbr'); SS ('headers'); CS ('axis')

Nine elements in all; this was fewer than I expected.
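
The table above can be turned into a rough checker. The Python sketch below flags attributes that would violate the two constraints, using the HR, SS, and CS codes from the table. The function name find_violations is made up for this example, and matching on attribute name alone, regardless of element, is a simplifying assumption. The MP case ('content-length') is left out.

```python
import xml.etree.ElementTree as ET

# Attribute names taken from the table above, plus 'title' and 'class'.
HUMAN_READABLE = {"title", "alt", "standby", "value", "summary", "abbr"}
SPACE_SEPARATED = {"class", "rel", "rev", "headers"}
COMMA_SEPARATED = {"coords", "media", "archive", "axis"}

def find_violations(root):
    """Return (element, attribute, reason) triples for attributes needing change."""
    found = []
    for elem in root.iter():
        for name in elem.attrib:
            if name in HUMAN_READABLE:
                reason = "HR"
            elif name in SPACE_SEPARATED:
                reason = "SS"
            elif name in COMMA_SEPARATED:
                reason = "CS"
            else:
                continue  # attribute is unaffected by the constraints
            found.append((elem.tag, name, reason))
    return found

doc = ET.fromstring(
    '<table summary="Coffee consumed">'
    '<tr><td headers="t1" abbr="Name" id="c1">T. Sexton</td></tr>'
    '</table>'
)
violations = find_violations(doc)
for tag, name, reason in violations:
    print(tag, name, reason)
```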

3. Examples of Elemental XHTML

<p><title>My favorite opener</title>Call me Ishmael.</p>

<p><title>title one</title>What does this do?<title>title two</title></p>

<p>I <em><title>emphasis added by us</title>really</em> want to go!</p>

<br>
  <title>Three days passed...</title>
  <class>separator</class>
  <class>compact</class>
</br>

<head><title>My Document</title></head>
<!-- note that currently the head element allows a title attribute.
This becomes redundant -->

<p>Further details <a href='..'><rel>Alternate</rel><rev>Index</rev>
<title>The Wumpus Website</title>here</a></p>

<object classid='http://www.observer.mars/TheEarth.py'>
  <title>The Earth as seen from space</title>
  <standby>Loading applet...</standby>
  <!-- Else, try the MPEG video -->
  <object data='TheEarth.mpeg' type='application/mpeg'>
    <standby>Loading movie...</standby>
    <!-- Else, try the GIF image -->
    <object data='TheEarth.gif' type='image/gif'>
        <!-- Else render the text -->
        The <strong>Earth</strong> as seen from space.
    </object>
  </object>
</object>

<table>
  <summary>This table charts the number of cups
           of coffee consumed by each senator, the type 
           of coffee (decaf or regular), and whether 
           taken with sugar.</summary>
  <caption>Cups of coffee consumed by each senator</caption>
  <title>Last update, Nov 1</title>
  <tbody>
    <tr>
      <th id='t1'><abbr>Name</abbr>Senator Name</th>
      <th id='t2'>Cups</th>
      <th id='t3'><abbr>Type</abbr>Type of Coffee</th>
      <th id='t4'>Sugar?</th>
    </tr>
    <tr>
      <td><headers>t1</headers><title>Dem, ND</title>T. Sexton</td>
      <td><headers>t2</headers>10</td>
      <td><headers>t3</headers>Espresso</td>
      <td><headers>t4</headers>No</td>
    </tr>
   ...
  </tbody>
</table>

4. Conclusions

Having worked through this exercise, my conclusions so far are:

Moving human-readable text out of attributes in XHTML was less painful than I initially thought.
Previously, the <head> section was entirely devoted to the customarily 'hidden' elements. This no longer has to be the case.
Elemental XHTML relies heavily on mixed content, often as <container><advisory/>text</container>.
While it's worth thinking about converting lists into markup, in some cases it's not worth it (particularly with <class>).
More?

5. Possible Interesting Directions

Due to the exploratory nature of this paper, this section is subject to change and open to suggestions.

One possibility is to leverage mixed content to provide a substitute for character entities, using elements instead. For example:

<div>This is copyright <char:copy/> 2002.</div>

Different predefined character sets (HTML, MathML, etc.) can be assigned different namespaces. Additionally, for environments that don't understand character elements, inner text content can be used:

<div>This is copyright <char:copy>(c)</char:copy> 2002.</div>

A processor that understands character elements would suppress the inner content and render the character itself. A processor that doesn't understand them could safely ignore the unknown element and fall back to its inner text.
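
The two behaviors can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the namespace URI http://example.org/char, the one-entry CHAR_TABLE, and the render_text function are all hypothetical, since this paper does not define a namespace or a character table.

```python
import xml.etree.ElementTree as ET

CHAR_NS = "http://example.org/char"  # hypothetical namespace URI
CHAR_TABLE = {"copy": "\u00a9"}      # hypothetical name-to-character table

def render_text(elem, understands_char_elements):
    """Flatten an element to plain text, optionally honoring character elements."""
    prefix = "{" + CHAR_NS + "}"
    if understands_char_elements and elem.tag.startswith(prefix):
        name = elem.tag[len(prefix):]
        if name in CHAR_TABLE:
            # Aware processor: emit the character and suppress the inner content.
            return CHAR_TABLE[name]
    # Unaware processor (or unknown name): fall through to the inner content.
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render_text(child, understands_char_elements))
        parts.append(child.tail or "")
    return "".join(parts)

SOURCE = (
    '<div xmlns:char="http://example.org/char">'
    'This is copyright <char:copy>(c)</char:copy> 2002.</div>'
)
doc = ET.fromstring(SOURCE)
aware = render_text(doc, True)
unaware = render_text(doc, False)
print(aware)
print(unaware)
```

An empty <char:copy/> would render as nothing in the unaware case, which is why the inner fallback text matters for older environments.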

More?