Building a tokenizer for XPath or XQuery

While researching an XPath 2.0 implementation, I ran across this curious document from the W3C. Despite being labeled a Working Draft (as opposed to a Note), it appears to be a one-shot document with no hope of future updates or enhancements.

In short, it outlines several options for the first stage or two of an XPath 2.0 or XQuery implementation. (Despite the title, it covers more than just a tokenizer: it also discusses a parser and a possible intermediate stage.) Tokenizing and parsing XPath are significantly harder than for most other languages, because keywords are not reserved and symbols such as * change meaning with context, so things like this are perfectly legitimate (if useless):

if(if) then then else else- +-++-**-* instance
of element(*)* * * **---++div- div -div
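To make the difficulty concrete, here is a minimal Python sketch of a context-sensitive tokenizer. It is not taken from the W3C document. It borrows the disambiguation rule from the XPath 1.0 recommendation (a * or an operator name counts as an operator only when the preceding token ends an operand) and applies it to a deliberately tiny token set. The names tokenize, OPERATOR_NAMES, and OPERAND_END are my own, and XPath 2.0 keywords such as if, then, and instance of are not handled at all.

```python
import re

# Names that are operators only when they appear in operator position
# (the XPath 1.0 set; XPath 2.0 adds many more keywords).
OPERATOR_NAMES = {"and", "or", "div", "mod"}

# Token kinds after which the next token sits in operator position.
OPERAND_END = {"name", "number", "wildcard", "close"}

# Simplified lexical grammar: names, numbers, and some punctuation.
TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<name>[A-Za-z_][\w.-]*)
      | (?P<number>\d+(?:\.\d+)?)
      | (?P<punct>::|//|!=|<=|>=|[()\[\]@,/|+\-=<>*])
    )
""", re.VERBOSE)


def tokenize(expr):
    """Return a list of (kind, text) pairs for a simplified XPath subset."""
    expr = expr.rstrip()
    tokens = []
    pos = 0
    prev_kind = None  # no preceding token at the start of the expression
    while pos < len(expr):
        m = TOKEN_RE.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        kind = m.lastgroup
        text = m.group(kind)
        operand_before = prev_kind in OPERAND_END
        if kind == "name":
            # "div" after an operand is the operator; otherwise it is a name test.
            # Other names in operator position are left for the parser to reject.
            if operand_before and text in OPERATOR_NAMES:
                kind = "operator"
        elif kind == "punct":
            if text == "*":
                # "*" after an operand multiplies; otherwise it is a wildcard.
                kind = "operator" if operand_before else "wildcard"
            elif text in (")", "]"):
                kind = "close"
            elif text in ("(", "[", "@", "::", ","):
                kind = "open"
            else:
                kind = "operator"
        tokens.append((kind, text))
        prev_kind = kind
        pos = m.end()
    return tokens


if __name__ == "__main__":
    print(tokenize("div div div"))
    print(tokenize("* * *"))
```

With this sketch, "div div div" comes out as name, operator, name, and "* * *" as wildcard, operator, wildcard. The same characters get a different token kind depending only on what came before, which is exactly what a plain regex-per-token lexer cannot express without carrying some state.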

The document tries to standardize some terminology for the various approaches to tokenizing and parsing XPath. Most of the rest sketches out lexical states that would be useful for one particular implementation approach. I guess the vibrant, thriving throngs of XPath 2.0 developers didn’t see the need for this kind of assistance.

All told, I didn’t find it terribly useful. Maybe some readers have, though. Feel free to comment below. Subsequent articles here will describe how I approached the problem. Stay sharp! -m

One Response to “Building a tokenizer for XPath or XQuery”

  1. Dimitre Novatchev http://fxsl.sf.net

    Hi Micah,
    Recently I had fun implementing the parsing stage of an XPath 2.0 implementation in pure XSLT.

    I followed the XPath 2.0 lexical description and syntax as given in Michael Kay’s book, and everything went pretty smoothly.

    I found only a couple of typographical errors in the grammar specification, such as using a chevron instead of a ‘>’, or some improper nesting of quotes.

    For the lexical analysis phase I am using the RegEx support of XSLT 2.0.

    For the parsing phase I am using the same generic LR parsing framework from FXSL, which I already used successfully in parsing JSON and converting it to XML.

    I also found the W3C document to be of little value. I had a question similar to yours and posted it to the xsl-list:

    “Grammars for XPath 2.0: which to use?” In this thread Michael Kay posted a very useful answer.

    See my blog for a description of the JSON to XML parser and the FXSL Generic LR Parsing Framework.

    I would be glad to provide more information on my XPath 2.0 parsing experience. It was great fun.

    Cheers,
    Dimitre Novatchev