Super simple tokenizer in XQuery

0

A lexer might seem like one of the boringest pieces of code to write, but every language brings it’s own little wrinkles to the problem. Elegant solutions are more work, but also more rewarding.

There is, of course, a large body of work on table-driven approaches, several of them listed here (and bigger list), though XQuery seems to have been largely left out of the fun.

In MarkLogic Search API, we implemented a recursive tokenizer. Since a search string can contain quoted pieces which need to be carefully maintained, first we split (in the fn:tokenize-sense, discarding matched delimiters) on the quote character, then iterate through the pieces. Odd-numbered pieces are chunks of tokens outside of any quoting, and even-numbered pieces are a single quoted string, to be preserved as-is. We recurse through the odd chunks, further breaking them down into individual tokens, as well as normalizing whitespace and a few other cleanup operations. This code is aggressively optimized, and it removes any searches for tokens known to not appear in the overall string. It also preserves the character offset positions of each token relative to the starting string, which gets used downstream, so this makes for some of the most complicated code in the Search API. But it’s blazingly fast.

When prototyping, it’s nice to have something simpler and more straightforward. So I came up with an approach using fn:analyze-string. This function, introduced in XSLT 2.0 and later ported to XQuery 3.0, takes a regular expression, and returns all of the target string, neatly divided into match and non-match portions. This is great, but difficult to apply across the entire string. For example, potential matches can have different meaning depending on where they fall (again, quoted strings as an example.) But if every regex starts with ^ which anchors the match to the front of the string, the problem simplifies to peeling off a single token from the front of the string. Keep doing this until there’s no string left.

This is a particularly nice approach when parsing a grammar that’s formally defined in EBNF. You can pretty much take the list of terminal expressions, port them to XQuery-style regexes, add a ^ in front of each, and roll.

Take SPARQL for example. It’s a reasonably rich grammar. The W3C draft spec has 35 productions for terminals. I sketched out some of the terminal rules (note these are simplified):

declare variable $spq:WS     := "^\s+";
declare variable $spq:QNAME  := "^[a-zA-Z][a-zA-Z0-9]*:[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:PREFIX := "^[a-zA-Z][a-zA-Z0-9]*:";
declare variable $spq:NAME   := "^[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:IRI    := "^<[^>]+>";
...

Then going through the input string, and seeing which of these expressions match, and if so, calling analyze-string and adding the matched portion as a token, and recursing on the non-matched portion. Note that we need to go through longer matches first, so the rule for ‘prefix:qname’ comes before the rule for ‘prefix:’ which comes before the rule for ‘string’

declare function spq:tokenize-recurse($in as xs:string, $tl as json:array) {
    if ($in eq "")
    then ()
    else spq:tokenize-recurse(
        switch(true())
        case matches($in, $spq:WS)     return spq:discard-tok($in, $spq:WS)
        case matches($in, $spq:QNAME)  return spq:peel($in, $spq:QNAME, $tl, "qname")
        case matches($in, $spq:PREFIX) return spq:peel($in, $spq:PREFIX, $tl, "prefix", 0, 1)
        case matches($in, $spq:NAME)   return spq:peel($in, $spq:NAME, $tl, "name")
        ...

Here, we’re co-opting a json:array mutable object as a convenient way to store tokens as we peel them off. There’s not actually any JSON involved here. The actual peeling looks like this:

declare function spq:peel(
    $in as xs:string,
    $regex as xs:string,
    $toklist as json:array,
    $type as xs:string,
    $triml, $trimr) {
    let $split := analyze-string($in, $regex)
    let $match := string($split/str:match)
    let $match := if ($triml gt 0) then substring($match, $triml + 1) else $match
    let $match := if ($trimr gt 0) then substring($match, 1, string-length($match) - $trimr) else $match
    let $_ := json:array-push($toklist, <searchdev:tok type="{$type}">{$match}</searchdev:tok>)
    let $result := string($split/str:non-match)
    return $result
};

Some productions, like a <iri> inside angle brackets, contain fixed delimiters which get trimmed off. Some productions, like whitespace, get thrown away. And that’s it. As it stands, it’s pretty close to a table-driven approach. It’s also more flexible than the recursive approach above–even for things like escaped quotes inside a string, if you can write a regex for it, you can lex it.

Performance

But is it fast? Short answer is that I don’t know. A full performance analysis would take some time. But a few quick inspections shows that it’s not terrible, and certainly good enough for prototype work. I have no evidence for this, but I also suspect that it’s amenable to server-side optimization–inside the regular expression matching code, paths that involve start-anchored matches should be easy to identify and in many cases avoid work farther down the string. There’s plenty of room on the XQuery side for optimization as well.

If you’ve experimented with different lexing techniques, or are interested in more details of this approach, drop me a line in the comments. -m

Related Posts

© All Right Reserved
Proudly powered by WordPress | Theme: Shree Clean by Canyon Themes.