Push Button Paradise

Micah Dubinko

Fri, 08 Jul 2005

Applying XML Patterns to other domains

Lately, I've been thinking deeply about patterns. And reading. One thing that strikes me is similarities between SGML-world and XML-world procesing. For example both DSSSL and XSLT are declarative, functional, and side-effect-free. This isn't coincidence, or even an artifact that some of the same people worked on both technologies. There are deep reasons why specific things like transformations work well, and any standardization effort will trend toward certain realities. Patterns.

XML in general has become a sort of universal pattern for tree processing. For example, look at Oleg Paraschenko's Reusing XML Processing Code in non-XML Applications. Even in an application that deals with trees (and what app doesn't?) but without a DOM or angle bracket in sight, it can still make sense to use XML technology, because XML people have thought so much about trees and tree processing.

XPath expresses the pattern of navigating or filtering through structured data. Now, XPath is carefully defined against an abstract "data model" (which is in turn based on the Infoset), but since the whole point of the Infoset is to provide a mapping between syntax and abstract structure, let's look at the lexical implications of XPath.

At the simplest level, an XPath expression can be a simple locator, like p or svg:foreignObject or @xml:id. All simple node tests. If you are a text-head, you can think about what these mean lexically. All of them are selecting a portion of the document based on delimiters. Ignoring for a moment some details about XML syntax, the first example is using delimiters of "<p>" and "</p>"; the second between "<svg:foreignObject>" and "</svg:foreignObject>"; the third between "xml:id='" and "'". Of course, there's room for variation in XML: elements can contain attributes, and attributes can use single or double quotes, etc. The take-home point, though, is that a node test as simple as a single character is really doing two things, delimiting the start and end of a range of the bigger document.

What if the document isn't necessarily XML? The pattern of an XPath-like language still makes sense, however the equivalent of a "node test" gets a little messier.

Enter TextPath.

TextPath fills the same role for Plain Text (which means Normalized Unicode with uniform line endings) as XPath does for XML and other treelike formats. It consists of a series of steps, separated with '/'; each step consists of a match rules for a starting and ending delimiter. In between the two, an operator '...' means roughly 'skip over a bunch of stuff between these delimiters'. In fact, the part that's skipped over this way is usually the interesting bit the user is looking for. Similarly, in XPath, if you select 'p', you're not interested in the '<p>' and '</p>' portions of the document, but rather what's found inside.

A single-step TextPath expression is no more powerful than a regular expression, though somewhat easier to read. Multi-step expressions, however, are truly powerful. Each step can return multiple results, which serve as sources for subsequent steps.

The link above contains a full specification for TextPath. Be warned, though: this is experimental and can change--possibly radically--without notice.

I have implemented TextPath in Python. Here is a sample of one of the unit tests. (I realize that having so many tests crammed together isn't good practice, but I wanted all of the specification examples in one place.)

def testSpecExamples(self):
    doc = ("Telephone List\n"
           "==============\n"
           "\n"
           "Name: John Doe\n"
           "Home: (123) 456-7890\n"
           "\n"
           "Name: Jane Doe\n"
           "Work: +31 253 234 234\n"
           "VOIP: (480) 458-1235\n"
           "\n"
           "Name: John Deere\n"
           "Work: (312) 624-2673\n"
           "Home: (123) 675-3364")

    tp = TextPathEngine()

    exp = tp.compile('(name:\s*) ... eol()')
    res = exp.selectStrings(doc)
    self.assertEqual([], res) # case-sensitive = no match

    exp = tp.compile('i::(name:\s*) ... eol()')
    res = exp.selectStrings(doc)
    self.assertEqual(["John Doe", "Jane Doe", "John Deere"], res)

    exp = tp.compile('(Work:)looka( \+) ... eol()')
    res = exp.selectStrings(doc)
    self.assertEqual([" +31 253 234 234"], res)

    exp = tp.compile("/bol()...eol()")
    res = exp.selectStrings(doc)
    self.assertEqual(["Telephone List"], res)

    # find all names that have a number in the (123) area code
    exp = tp.compile("sep()...sep()/(Name: )...eol()looka(.*(123))")
    res = exp.selectStrings(doc)
    self.assertEqual(["John Doe", "John Deere"], res)

I've already talked to one person who thinks this would be a great tool for processing certain kinds of semistructured medical data. Comments are welcome, either in email or publicly on the Discussion page for TextPath on the wiki.

What should I do with this code? For now, I want to use it to advertise my consulting business. If you are working with semistructured text data and would find this useful, contact me via the link on this page. We can arrange a code license, actual consulting, or any combination. Let me know! -m

P.S. On Monday, look for more patterns-related writing in this space.

Update: What's the difference between TextPath and unadorned regexen?

A regex is conceptually a single thing, and you focus on what you want to match. A TextPath is for situations where you don't know exactly what will match, and is conceptually two things, a start and end delimiter.
TextPath naturally works better with recursion--for finding all of something in a document (2nd example above), or finding matches that occur within other matches (last example).
TextPath is just much better for general-purpose queries on text documents. For example, the last one above is 'find all names that have any number in the (123) area code.' Is that even possible to express in regex? If it is, could mortals comprehend it?
TextPath follows on from the work of Rick Jelliffe's team on XMLWiki, and my port PyXMLWiki, which does have source code available in CVS.
Simply put, TextPath follows the pattern of recursive structured selection in a way that bare regex doesn't. That format had separate specifications for startConsume, endConsume, startLookaheadandendLookahead`, so all the same reasons why a single regex didn't cut it still apply.

TODO: I'd like to make the syntax even more simple. Suggestions are welcome.

posted at: 16:41 | under: 2005-06 | 0 comment(s)

Syndicate:

What am I reading?
Don Quixote
Self-Editing for Fiction Writers
The Complete Joy of Homebrewing
Analog magazine
Compilers
TAOCP

Archives:
Link