Archive for the 'xml' Category

Tuesday, October 29th, 2013

Skunklink a decade later

Alex Milowski asks on Twitter about my thoughts on Skunklink, now a decade old.

Linking has long been thought one of the cornerstones of the web, and thereby a key part of XML and related syntaxes. It’s also been frustratingly difficult to get right. XLink in particular once showed great promise, but when it came down to concrete syntax, didn’t get very far. My thinking at the time is still well-reflected in what is, to my knowledge, the only fiction ever published on the subject, A Hyperlink Offering. That story ends on a hopeful note, and a decade out, I’m still hoping.

For what it purports to do, Skunklink still seems like a good solution to me. It’s easy to explain. The notion of encoding the author’s intent, then letting devices work out the details, possibly with the aid of stylesheets and other such tools, is the right way to tackle this kind of problem. Smaller specifications like Skunklink would be a welcome breath of fresh air.

But a bigger question lurks behind the scenes: that of requirements. Does the world need a vocabulary-independent linking mechanism? The empirical answer is clearly ‘no’ since existing approaches have not gained anything like widespread use, and only a few voices in the wilderness even see this as a problem. In fact, HTML5 has gone in quite the opposite direction, rejecting the notion of even a vocabulary-independent syntax, to say nothing of higher layers like intent. I have to admit this mystifies me.

That said, it seems like the attribute name ‘href’ has done pretty well in representing intended hyperlinks. The name ‘src’ not quite as well. I still consider it best practice to use these names instead of making something else up.

What do you think? -m


Saturday, August 10th, 2013

XForms in 2013

This year’s Balisage conference was preceded by the international symposium on Native XML User Interfaces, which naturally enough centered around XForms.

As someone who’s written multiple articles surveying XForms implementations, I have to say that it’s fantastic to finally see one break out of the pack. Nearly every demo I saw in Montreal used XSLTForms if it used XForms at all. And yet, one participant I conversed with afterwards noted that very little that transpired at the symposium couldn’t have been done ten years ago.

It’s safe to say I have mixed emotions about XForms. On one hand, watching how poorly the browser makers have treated all things XML, I sometimes muse about what it would look like if we started fresh today. If we were starting anew, a namespace-free specification might be a possibility. But with XForms 2.0 around the corner, it’s probably more fruitful to muse about implementations. Even though XSLTForms is awesome, I still want more. :-)

  • A stronger JavaScript interface. It needs to be possible to incrementally retrofit an existing page using POHF (plain old HTML forms) toward using XForms in whole or in part. We need an obvious mapping from XForms internals to HTML form controls.
  • Better default UI. I still see InfoPath as the leader here. Things designed in that software just look fantastic, even if quickly tossed together.
  • Combining the previous two bullets, the UI needs to be more customizable, and more easily so. It needs to be utterly straightforward to make XForms parts of pages fit in with non-XForms parts of pages.
  • Rich text: despite several assertions during the week, XForms can actually handle mixed text, just not very well. One of the first demo apps (in the DENG engine–remember that?) was an HTML editor. The spec is explicitly designed in such a way as to allow new and exciting forms widgets, and a mixed-content widget would be A Big Deal, if done well.
  • Moar debugging tools

During the main conference, Michael Kay demonstrated Saxon-CE, an impressive tour-de-force in routing around the damage that is browser vendors’ attitudes toward XML. And though he didn’t make a big deal of it, it’s now available freely under an open source license. This just might change everything.

Curious about what others think here–I welcome your comments.


Monday, September 17th, 2012

MarkLogic 6 is here

MarkLogic 6 launched today, and it’s full of new and updated goodies. I spent some time designing the new Application Builder including the new Visualization Widgets. If you’ve used Application Builder in the past, you’ll be pleasantly surprised at the changes. It’s leaner and faster under the hood. I’d love to hear what people think of the new architecture, and how they’re using it in new and awesome ways.

If I had to pick out a common theme for the release, it’s all about expanding the appeal of the server to reach new audiences. The Java API makes working with the server feel like a native extension to the language, and the REST API makes it easy to extend the same to other languages.

XQuery support is stronger than ever. I liked Ryan Dew’s take on some of the smaller, but still useful features.

This wouldn’t be complete without thanking my teammates who really made this possible. I had the great pleasure of working with some top-notch front-end people recently, and it’s been a great experience. -m


Thursday, August 23rd, 2012

Super simple tokenizer in XQuery

A lexer might seem like one of the boringest pieces of code to write, but every language brings its own little wrinkles to the problem. Elegant solutions are more work, but also more rewarding.

There is, of course, a large body of work on table-driven approaches, several of them listed here (and a bigger list), though XQuery seems to have been largely left out of the fun.

In the MarkLogic Search API, we implemented a recursive tokenizer. Since a search string can contain quoted pieces which need to be carefully maintained, first we split (in the fn:tokenize sense, discarding matched delimiters) on the quote character, then iterate through the pieces. Odd-numbered pieces are chunks of tokens outside of any quoting, and even-numbered pieces are a single quoted string, to be preserved as-is. We recurse through the odd chunks, further breaking them down into individual tokens, normalizing whitespace, and performing a few other cleanup operations. This code is aggressively optimized, and it avoids searching for tokens known not to appear in the overall string. It also preserves the character offset positions of each token relative to the starting string, which gets used downstream, so this makes for some of the most complicated code in the Search API. But it’s blazingly fast.
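The shape of that first quote-splitting pass, as a toy sketch (not the actual Search API code, and without the offset tracking):

declare function local:split-quotes($s as xs:string) as xs:string* {
  for $piece at $i in fn:tokenize($s, '"')
  return
    if ($i mod 2 eq 0)
    then $piece  (: even pieces sit between quotes: preserve verbatim :)
    else fn:tokenize(fn:normalize-space($piece), " ")  (: odd pieces: break into words :)
};

(: local:split-quotes('find "exact phrase" here') returns ("find", "exact phrase", "here") :)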

When prototyping, it’s nice to have something simpler and more straightforward. So I came up with an approach using fn:analyze-string. This function, introduced in XSLT 2.0 and later ported to XQuery 3.0, takes a regular expression, and returns all of the target string, neatly divided into match and non-match portions. This is great, but difficult to apply across the entire string. For example, potential matches can have different meanings depending on where they fall (again, quoted strings are an example). But if every regex starts with ^, which anchors the match to the front of the string, the problem simplifies to peeling off a single token from the front of the string. Keep doing this until there’s no string left.

This is a particularly nice approach when parsing a grammar that’s formally defined in EBNF. You can pretty much take the list of terminal expressions, port them to XQuery-style regexes, add a ^ in front of each, and roll.

Take SPARQL for example. It’s a reasonably rich grammar. The W3C draft spec has 35 productions for terminals. I sketched out some of the terminal rules (note these are simplified):

declare variable $spq:WS     := "^\s+";
declare variable $spq:QNAME  := "^[a-zA-Z][a-zA-Z0-9]*:[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:PREFIX := "^[a-zA-Z][a-zA-Z0-9]*:";
declare variable $spq:NAME   := "^[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:IRI    := "^<[^>]+>";

Then we go through the input string, seeing which of these expressions match; if one does, we call analyze-string, add the matched portion as a token, and recurse on the non-matched portion. Note that we need to try longer matches first, so the rule for ‘prefix:qname’ comes before the rule for ‘prefix:’, which comes before the rule for ‘name’.

declare function spq:tokenize-recurse($in as xs:string, $tl as json:array) {
    if ($in eq "")
    then ()
    else spq:tokenize-recurse(
        if (matches($in, $spq:WS))          then spq:discard-tok($in, $spq:WS)
        else if (matches($in, $spq:QNAME))  then spq:peel($in, $spq:QNAME, $tl, "qname", 0, 0)
        else if (matches($in, $spq:PREFIX)) then spq:peel($in, $spq:PREFIX, $tl, "prefix", 0, 1)
        else if (matches($in, $spq:NAME))   then spq:peel($in, $spq:NAME, $tl, "name", 0, 0)
        else if (matches($in, $spq:IRI))    then spq:peel($in, $spq:IRI, $tl, "iri", 1, 1)
        else error((), concat("no token rule matches: ", $in)),
        $tl)
};

Here, we’re co-opting a json:array mutable object as a convenient way to store tokens as we peel them off. There’s not actually any JSON involved here. The actual peeling looks like this:

declare function spq:peel(
    $in as xs:string,
    $regex as xs:string,
    $toklist as json:array,
    $type as xs:string,
    $triml as xs:integer,
    $trimr as xs:integer) {
    let $split := analyze-string($in, $regex)
    let $match := string($split/str:match)
    (: trim any fixed delimiters, like the angle brackets of an IRI :)
    let $match := if ($triml gt 0) then substring($match, $triml + 1) else $match
    let $match := if ($trimr gt 0) then substring($match, 1, string-length($match) - $trimr) else $match
    let $_ := json:array-push($toklist, <searchdev:tok type="{$type}">{$match}</searchdev:tok>)
    (: return the remainder, ready for the next peel :)
    let $result := string($split/str:non-match)
    return $result
};

Some productions, like an <iri> in angle brackets, contain fixed delimiters which get trimmed off. Some productions, like whitespace, get thrown away. And that’s it. As it stands, it’s pretty close to a table-driven approach. It’s also more flexible than the split-based approach above–even for things like escaped quotes inside a string, if you can write a regex for it, you can lex it.
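For illustration, a quick driver (the sample string and comments are mine; a real caller would wrap this more conveniently):

let $toks := json:array()
let $_    := spq:tokenize-recurse("PREFIX foaf: <http://xmlns.com/foaf/0.1/>", $toks)
return $toks

(: $toks now holds three searchdev:tok elements in order:
   a "name" (PREFIX), a "prefix" (foaf), and an "iri" (the URL, brackets trimmed) :)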


But is it fast? Short answer is that I don’t know. A full performance analysis would take some time. But a few quick inspections show that it’s not terrible, and certainly good enough for prototype work. I have no evidence for this, but I also suspect that it’s amenable to server-side optimization–inside the regular expression matching code, paths that involve start-anchored matches should be easy to identify and in many cases avoid work farther down the string. There’s plenty of room on the XQuery side for optimization as well.

If you’ve experimented with different lexing techniques, or are interested in more details of this approach, drop me a line in the comments. -m

Tuesday, August 7th, 2012

Balisage Bound

I’m en route to Balisage 2012, though beset by multiple delays. The first leg of my flight was more than two hours delayed, which made the 90-minute transfer window…problematic. My rebooked flight, the next day (today, that is), is also delayed. Then through customs. Maybe all I’ll get out of Tuesday is Demo Jam. But I will make it.

I’m speaking on Thursday about exploring large XML datasets. Looking forward to it!


Monday, June 4th, 2012

Relax NG vs XML Schema: ten year anniversary

Today is the 10-year anniversary of this epic message from James Clark on the relative merits of Relax NG vs. XML Schema, and whether the latter should receive preferential treatment. It’s still relevant today–the discussion continues, although an increasing number of human-readable web specifications have adopted Relax NG in some form. -m

Thursday, December 8th, 2011

Resurgence of MVC in XQuery

There’s been an increasing amount of talk about MVC in XQuery, notably David Cassel’s great discussion and to an extent Kurt Cagle’s platform discussion that touched on forms interfaces. Lots of Smart People are thinking in this area, and that’s a good thing.

A while back I recorded my thoughts on what I called MET, or the Model Endpoint Template organizational pattern, as used in MarkLogic Application Builder. One difference between 2009 and now, though, is that browsers have distanced themselves even farther from XML, which tends to undercut the eliminate-the-impedance-mismatch argument. In particular, the forms model in HTML5 continues to prefer flat data, which to me indicates that models still play an important role in XQuery web apps.

So I envision the app lifecycle like this:

  1. The browser requests a particular page, say the one that lets you configure sorting options in the app you’re building
  2. An HTML page loads.
  3. Client-side script requests the project state from a designated endpoint, the server transforms the XML into a flat list, and delivers it as JSON (as an optimization, the server can package the initial data into the page delivered in the prior step)
  4. Standard form interaction and client-side scripting happens, including manipulation of repeating structures mediated by JavaScript
  5. A standard form submit happens (possibly via script), sending a flat list back to the server, which performs an update to the stored XML.
It’s pretty easy to envision data-mapping tools and libraries that help automate the construction of the transforms mentioned in steps 3 and 5.
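As a rough sketch of what such a transform could look like (the sort-options structure here is invented for illustration):

declare function local:flatten($root as element()) as element(pair)* {
  (: one pair per leaf element, named by its dotted path :)
  for $leaf in $root//*[not(*)]
  return
    <pair name="{string-join($leaf/ancestor-or-self::*/local-name(), '.')}"
          value="{$leaf}"/>
};

local:flatten(
  <sort-options>
    <option><field>title</field><direction>ascending</direction></option>
  </sort-options>
)

(: yields pairs like <pair name="sort-options.option.field" value="title"/>,
   ready to serialize as flat JSON; the inverse mapping applies the pairs
   back to the stored XML :)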

Another thing that’s changed is the emergence of XQuery plugin technology in MarkLogic. There’s a rapidly-growing library of reusable components, initially centered around Information Studio but soon to cover more ground. This is going to have a major impact on XQuery app designs as components of the app (think visualization widgets) can be seamlessly added to apps.

Endpoints still make a ton of sense for XQuery apps, and provide the additional advantage that you now have a testable, concern-separated data layer for your app. Other apps have a clean way to interop, and even command-line operation is possible with off-the-shelf tools like wget.

Lastly, Templates. Even if you use plugins for the functional core of your app, there’s still a lot of boilerplate stuff you’d not want to repeat. Something like Mustache.xq is a good fit for this.

Which is all good–but is it MVC? This organizational pattern (let’s call it MET 2.0) is a lot closer to it. Does MET need a controller? Probably. (MarkLogic now ships a pretty good one called rest:rewrite) Like MVC, MET separates the important essences of your application. XQuery will never be Ruby or Java, and its frameworks will never be Rails or Spring, but rather something uniquely poised to capture the expressive power of the language to build apps on top of unstructured and big data. -m

Thursday, February 3rd, 2011

We’ll always have Prague

Today I exchanged electrons with a major airline, which will ultimately result in them removing a certain amount of abstract currency units from my account.

In other words, see you all at XML Prague 2011. I’ve never been to this conference before, and each year I hear better and better things. Looking forward to it. -m

Thursday, August 5th, 2010

Balisageurs: XML and JSON

At David Lee’s nocturne about XML and JSON round-tripping, several folks were talking about a site that listed several “off-the-shelf” conversion methods, but nobody could remember the site.

Late that night, with 15 minutes of battery remaining, I found it. The operative search term is XSLTJSON. -m

Tuesday, August 3rd, 2010

Heard, overheard, and misheard at Balisage

The opening day of the conference was not Balisage proper, but a separate symposium on “XML for the long haul”.

Some interesting tidbits overheard, in no particular order…

“it is not necessarily clear that this approach would capture the difference between the ridiculous and the merely implausible.”

Complexity — what is the relationship between complexity and long-term data storage?

“Narratives with fancy words in them”

How do you store, say, a video in a format that will be readable in 100 years?

Order of magnitude scale changes produce discontinuities

“The Da Vinci Schema”

Dandelion DNA (Free license)

“Indispensible” — “I don’t think that means what you think it does”

“Keeping electrons alive is really difficult”

“I wondered…with my Topic Map brain damage…”


Sunday, May 30th, 2010

Balisage contest: solving the wikiml problem

I wish I could say I had something to do with the planning of this: part of Balisage 2010 is a contest to “encourage markup experts to review and to research the current state of wiki markup languages and to generate a proposal that serves to de-babelize the current state of affairs for the long haul.”  To enter, you must propose a set of concrete steps (organizational, social, and/or technological) that will enable wiki content interchange, a real WYSIWYG editor, and/or wiki syntax standardization.

This pushes all of my buttons. It’s got structured documents, Web, parser geekery, writing, engineering, and standards. There’s a bunch of open source prior art, including PyXMLWiki, which I adapted from some fantastic earlier work from Rick Jelliffe.

Sadly, MarkLogic employees aren’t eligible to enter. Get your write-up done by July 15 and sent to balisage-2010-contest at marklogic dot com. The winner will be announced at Balisage and will take home some serious prize winnings, and also will be strongly encouraged (but not required) to give a brief summary (~10 minutes) of their winning entry.

Can’t wait to see what comes out of this. -m

Tuesday, May 11th, 2010

XProc is ready

Brief note: The W3C XProc specification, edited by my partner-in-crime Norm Walsh, has advanced to Recommendation status. Now go use it. -m

Sunday, April 18th, 2010

The challenge of an XProc GUI

I’ve been thinking lately about what a sleek UI for creating XProc would look like. There’s plenty of big-picture inspiration to go around, from Yahoo Pipes to Mac OSX Automator, but neither of these are as XML-focused as something working with XProc would be.

XML, or to be really specific, XML Namespaces, comes with its own set of challenges. Making an interface that’s usable is no small task, particularly when your target audience includes the 99.9% of people that don’t completely understand namespaces. Take for example a simple step, like p:delete.

In brief, that step takes an XSLTMatchPattern (following the same rules as @match in XSLT) that selects various nodes from the document; the step then returns a document without any of those nodes. An XSLTMatchPattern has a few limitations, but it is a very general-purpose selection mechanism. In particular, it can reference an arbitrary number of XML Namespace prefix mappings. Behind a short string like a:b lies a much longer namespace URI mapped to each prefix.

What would an intuitive user interface look like to allow entry of these kinds of expressions? How can a user keep track of unbound prefixes and attach them properly? A data-driven approach could help, say offering a menu of existing element, attribute, or namespace names taken from a pool of existing content. But by itself this falls short on 1) richer selectors, like xhtml:p[@class = "invalid"], and 2) the general case, where the nodes you’re manipulating might have come from the pipeline, not your original content. (Imagine one step in the pipeline translates your XML to XHTML, followed by a delete step that cleans out some unwanted nodes.)
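For concreteness, a minimal sketch of such a step in context–the prefix binding on the root element is exactly the bookkeeping a GUI would have to surface:

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            xmlns:xhtml="http://www.w3.org/1999/xhtml"
            version="1.0">
  <!-- the match pattern below only means something because the
       xhtml prefix is bound on the root element above -->
  <p:delete match="xhtml:p[@class = 'invalid']"/>
</p:pipeline>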

So yeah, this seems like a Really Hard Problem, but one that’s worth taking a crack at. If this sounds like the kind of thing you’d enjoy working on, my team is hiring–drop me a note.


Friday, April 2nd, 2010

Recalibrating expectations of XML performance

Working at MarkLogic has forced me to recalibrate my expectations around XML-related performance issues. Not to brag or anything, but it’s screaming fast. Conventional wisdom of avoiding // in paths doesn’t apply, since that’s the sort of thing the indexes are made to do, and that’s just the start. Single milliseconds are now a noteworthy amount of time for something showing up in the profiler.

This is what XML was supposed to be like. Now that XML has fallen off the hype cycle, we’re getting some serious work done. -m

Friday, March 5th, 2010

A Hyperlink Offering revisited

The xml-dev mailing list has been discussing XLink 1.1, which after a long quiet period popped up as a “Proposed Recommendation”, which means that a largely procedural vote is all that stands between the document and full W3C Recommendation status. (The previous two revisions of the document date to 2008 and 2006, respectively.)

In 2005 I called continued development of XLink a “reanimated spectre”. But even earlier, in 2002, I wrote one of the rare fiction pieces on the topic, A Hyperlink Offering, which, using the format of a Carrollian dialog between Tortoise and Achilles, explained a few of the problems with the XLink specification. It ended with this:

What if the W3C pushed for Working Groups to use a future XLink, just not XLink 1.0?

Indeed, this version has minor improvements. In particular, “simple” links are simpler now–you can drop an xlink:href attribute where you please and it’s now legit. The spec used to REQUIRE additional xlink:type="simple" attributes all over the place. But it’s still awkward to use for multi-ended links, and now even farther away from the mainstream hyperlinking aspects of HTML5, which for all of its faults, embodies the grossly predominant description of linking on the web.

So in many ways, my longstanding disappointment with XLink is that it only ever became a tiny sliver of what it could have been. Dashed visions of Xanadu dance through my head. -m

Monday, February 22nd, 2010

Mark Logic User Conference 2010

Are you coming? Link. It starts on May 4 (Star Wars day!) at the InterContinental Hotel in San Francisco. Guest speakers include Chris Anderson, Editor-in-Chief of Wired, and Michelle Manafy, Editor-in-Chief of EContent magazine.

Early bird registration ends Feb 28. -m

Tuesday, February 16th, 2010

There is no honor in namespaces

As heard from my friend and Mark Logic contractor Ryan Grimm. -m

Wednesday, October 7th, 2009

US Federal Register in XML

Fed Thread is a front end for the newly XMLified Federal Register. Why is this a big deal? It’s a daily publication of the goings-on of the US government. It’s a primary source for all kinds of things that normally only get rehashed through news organizations. And it is bulky–nobody can read through it on a regular basis. A yearly subscription (printed) would cost nearly $1000 and fill over 80,000 pages.

Having it in XML enables all kinds of searching, syndication, and annotation via flexible front ends like this one. Yay for transparency. -m

Tuesday, August 18th, 2009

Geek Thoughts: reading XProc code

All the input/output/port stuff in XProc seemed incomprehensible to me until I recognized something simple. Every time you see a <pipe> element, read it as “comes from”. For example:

  <p:output port="result">
    <p:pipe step="validated" port="result"/>
  </p:output>

reads as ‘output to the “result” port comes from the port “result” on step “validated”‘ and

  <p:input port="source">
    <p:pipe step="included" port="result"/>
  </p:input>

reads as ‘input for the “source” port comes from the port “result” on step “included”‘. If you keep this in mind it all makes much more sense.
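Putting both fragments into a minimal (sketched) pipeline, with step names matching the examples above and a hypothetical schema document:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:input port="source"/>
  <p:output port="result">
    <p:pipe step="validated" port="result"/>  <!-- output comes from "validated" -->
  </p:output>

  <p:xinclude name="included"/>

  <p:validate-with-relax-ng name="validated">
    <p:input port="source">
      <p:pipe step="included" port="result"/> <!-- input comes from "included" -->
    </p:input>
    <p:input port="schema">
      <p:document href="schema.rng"/>
    </p:input>
  </p:validate-with-relax-ng>
</p:declare-step>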

More collected Geek Thoughts at

Wednesday, August 5th, 2009

Misunderstanding Markup

Panel 9 of this comic describes XHTML 1.1 conformance as:

the added unrealistic demand that documents must be served with an XML mime-type

I can understand this viewpoint. XHTML 1.1 is a massively misunderstood spec, particularly around the modularization angle. But because of IE, it’s pretty rare to see the XHTML media-type in use on the open web. Later, panel 23 or thereabouts:

If you want, you can even serve your documents as application/xhtml+xml, instantly transforming them from HTML 5 to XHTML 5.

Why the shift in tone? What makes serving the XML media type more realistic in the HTML 5 case? IE? Nope, still doesn’t work. I’ve observed this same shift in perspective from multiple people involved in the HTML5 work, and it baffles me. In XHTML 1.1 it’s a ridiculous demand showing how out of touch the authors were with reality. In HTML5 the exact same requirement is a brilliant solution, wink, wink, nudge, nudge.

As it stands now, the (X)HTML5 situation demotes XHTML to the backwaters of the web. Which is pretty far from “Long Live XHTML…”, as the comic concludes. Remember when X stood for Extensible?


Friday, July 24th, 2009

Java-style namespaces for markup

I’m noodling around with requirements and exploring existing work toward a solution for “decentralized extensibility” on xml-dev, particularly for HTML. The notion of “Java-style” syntax, with reverse-DNS names and all, has come up many times in the context of these kinds of discussions, but AFAICT has never been fully fleshed out. This is ongoing, slowly, in available time–which has been a post or two per week. (In case there is any doubt, this is a spare-time effort not connected with my employer.)

Check it out and add your knowledge to the thread. -m

Saturday, July 11th, 2009

The decline of the DBMS era

Several folks have been pointing to this article which has some choice quotes along the lines of

If we examine the nontrivial-sized DBMS markets, it turns out that current relational DBMSs can be beaten by approximately a factor of 50 in most any market I can think of.

My employer is specifically mentioned:

Even in XML, where the current major vendors have spent a great deal of energy extending their engines, it is claimed that specialized engines, such as Mark Logic or Tamino, run circles around the major vendors

And it’s true, but don’t take my word for it. :-) The DBMS world has lots of inertia, but don’t let that blind you to seeing another way to solve problems. Particularly if that extra 50x matters. -m

Tuesday, July 7th, 2009

Demo Jam at Balisage 2009

Come join me at the Demo Jam at Balisage this year. August 11 at 6:30 pm. There will be lots of cool demos, judged by audience participation. I’d love to see you there. -m

Wednesday, June 3rd, 2009

See you at Balisage

Balisage, formerly Extreme Markup, is the kind of conference I’ve always wanted to attend.

Historically my employers haven’t been quite involved enough in the deep kinds of topics at this conference (or too cash-strapped, but let’s not go there) to justify spending a week on the road. So I’m glad that’s no longer the case: Mark Logic is sponsoring the conference this year. I’m looking forward to the show, and since I’m not speaking, I might be able to relax a little and soak in some of the knowledge.

See you there! -m

Tuesday, March 24th, 2009

XIN: Implicit namespaces

An interesting proposal from Liam Quin, relating to the need for huge rafts of namespace declarations on mixed namespace documents.

In practice, though, almost all elements [in the given example] are going to be unambiguous if you take their ancestors into account, and attributes too.

Amen. I’ve been saying things like this for five years now. Look at any introductory text on XML, and the example used to show the need for namespaces will be embarrassingly contrived. That’s not a dig against authors, it’s a dig against over-engineered solutions to non-problems.


Wednesday, February 25th, 2009

Brian May explains relativity

This is fantastic. Brian May (yes THAT Brian May) not only blogs, but talks about all kinds of challenging subjects. Like how and why space and time are linked. Worth a read. -m

Thursday, December 11th, 2008

XML 2008 liveblog: Introduction to eXist and XQuery

Greg Watson, IT Specialist, Defense Intelligence Agency Missile and Space Intelligence Center (apparently it IS rocket science). I installed eXist last night to follow along with the talk.

“If you have a larger dataset, eXist may not be the best choice.” Recommended reading: XQuery by Priscilla Walmsley, XQuery wikibook.

Download and install. It needs a full JDK (Mac includes this already in /Library/Java/Home); a mere JRE is insufficient. Start up with bin/

eXist-specific useful functions: request:get-parameter() from the URI query string. transform:transform() function invokes XSLT from within XQuery.

Example uses doc() to fetch an external URL of RSS, check individual items with contains(). Every example is a fully-formed, click-on-a-link-to-run program.
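A minimal reconstruction of those idioms (my sketch, not the talk’s actual code; the feed URL and stylesheet path are hypothetical):

let $feed := doc("http://example.org/news.rss")
let $q    := request:get-parameter("q", "")
for $item in $feed//item[contains(title, $q)]
return transform:transform($item, doc("/db/styles/item.xsl"), ())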

XQuery and PHP: really basic integration with simplexml_load_file($myXQueryURL).

Loading scripts into eXist: He uses XML Spy. eXist has a Java Web Start admin client.

Q&A: How big is too big? Maybe 10-20-30 thousand docs. Generate indexes? Yes.


(Production note: somehow, this one didn’t get published live. Now it is.)

Tuesday, December 9th, 2008

XML 2008 liveblog: Introduction to Schematron

Wendell Piez, Mulberry Technologies

Assertion-based schema language. A way to test XML documents. Rule-based validation language. Cool report generator. Good for capturing edge cases.

Same architecture as XSLT. (Schematron specifies, does not perform)

<schema xmlns="">
  <title>Check sections 12/07</title>
  <pattern id="section-check">
    <rule context="section">
      <assert test="title">This section has no title</assert>
      <report test="p">This section has paragraphs</report>

Demo. OxygenXML has support. Assert vs. Report – essentially opposites. Assert means “tell me if this is false”. Report means “tell me if this is true”.

“Almost as if Schematron is a harness for XPath testing.”

More examples:

<rule context="note">
  <report test="ancestor::note">A note appears in a note. OK?</report>
</rule>

Binding: Default is XSLT 1, but flexible enough to allow other query languages via the attribute @queryBinding at the top. Many processors allow mix-and-match between XSLT and Schematron. Examples showing just that.

Some tests can be very useful:

test="every $line in tokenize(., $newline) satisfies string-length($line) le 72"

Q: What if the destination is not a human, but another part of a pipeline? Varies by implementation, but SVRL is standardized as an annex in the ISO spec, part of DSDL.

Use as little or as much as you want, at different times in the document lifecycle. “Schematron is a feather duster that reaches areas other schema languages cannot.” – Rick Jelliffe

As time permits section of the talk:

Other top-level elements: title, pattern, ns, let, p, include, phase, diagnostics.


Tuesday, December 9th, 2008

XML 2008 liveblog: Automating Content Analysis with Trang and Simple XSLT Scripts

Bob DuCharme, Innodata Isogen

Content analysis: why? You’ve “inherited” content. Need to save time or effort.

Handy tool 1: “sort”. As in the Unix command line tool. (Even Windows)

Handy tool 2: “uniq -c”  (flag -c means include counts)

Elsevier contest: interface for reading journals. Download a bunch of articles, and see what’s all in there.

Handy tool 3: Trang. Schema language converter. But can infer a schema from one or more input documents. Concat all sample documents under one root, and infer–this gives a list of all doctypes in use.

trang article.dtd article.rng
trang issueContents.xml issueContents.rng
saxon article.rng compareElsRNG.xsl | sort > compareElsRNG.out

compareElsRNG.xsl has text mode output, ignores input text nodes, and checks whether the RNG has references to each element, outputting “Yes: elementname” or “No: elementname” (which gets sorted in step 3).

Helps ferret out places where the schema says 40 different child elements are possible but in practice only 4 are used.
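A simplified sketch of the idea behind that stylesheet (not the exact code; file names follow the commands above):

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rng="http://relaxng.org/ns/structure/1.0">
  <xsl:output method="text"/>
  <!-- the schema inferred from real documents, for comparison -->
  <xsl:variable name="inferred" select="document('issueContents.rng')"/>
  <!-- ignore input text nodes -->
  <xsl:template match="text()"/>
  <xsl:template match="rng:element[@name]">
    <!-- report whether the inferred schema ever mentions this element -->
    <xsl:value-of select="concat(
        if ($inferred//rng:element[@name = current()/@name])
        then 'Yes: ' else 'No: ', @name, '&#10;')"/>
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>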

Handy tool 4: James Clark’s sx, converts SGML to XML.

Another stylesheet counts elements producing a histogram. [Ed. I would do this in XQuery in CQ.] Again, can help prioritize parts of the XML to use first. Similar logic for parent/child counts; where @id gets used; find all values for a particular attribute.

Another stylesheet goes through multiple converted-to-rng schemas, looking for common substructure. Lists generated this way can be pulled into a stylesheet.

Analyze an SGML DTD? dtd2html -> tidy -> XSLT. Clients like reports (especially spreadsheets). This is more like Lego bricks.


Tuesday, December 9th, 2008

XML 2008 liveblog: Using RDFa for Government Information

Mark Birbeck, Web Backplane.

Problem statement: You shouldn’t have to “scrape” government sites.

Solution: RDFa

<div typeof="arg:Vacancy">
  Job title: <span property="dc:title">Assistant Officer</span>
  Description: <span property="dc:description">To analyse... </span>
</div>

This resolves to two full RDF triples. No separate feeds, uses existing publishing systems. Two of the most ambitious RDFa projects are taking place in the UK. Flexible arrangements possible.
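Spelled out in Turtle (prefix declarations omitted; strictly speaking, the typeof also contributes an rdf:type triple alongside the two property triples):

_:vacancy a arg:Vacancy ;
    dc:title "Assistant Officer" ;
    dc:description "To analyse... " .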

Steps: 1. Create vocabulary. 2. Create demo. 3. Evangelize.

Vocabulary under Google Code: Argot Hub. Reuse terms (dc:title, foaf:name) where possible, developed in public.

Demos: Yahoo! SearchMonkey, (good for helping not-so-technical people to “get it”) then a Drupal hosted one (a little more control).

Next level, a new server that aggregates specific info (like all job openings for Electricians), including geocoding. Ubiquity RDFa helps here.

Evangelizing: Detailed tutorials. Drupal code will go open source. More opportunities with companies currently screen-scraping. More info @

Q&A: Asking about predicate overloading (dc:title). A general SemWeb issue. Context helps. Is RDFa tied to HTML? No, SearchMonkey itself uses RDFa–it’s just attributes.