Archive for the 'xml' Category

Tuesday, December 9th, 2008

XML 2008 liveblog: Exploring the New Features of XSLT 2.0

Priscilla Walmsley, Datypic.

“I feel like crying every time I have to go back to 1.0.” Normally this is a full-day course. Familiarity with XSLT 1.0 assumed here. Venn diagram… Much of what people think of as “XQuery” is actually XPath 2.0.

XPath differences: root node -> “document node”. Namespace nodes and the namespace axis are deprecated. More atomic types, based on XML Schema. Node-set -> sequence. Path steps can be expressions, like product/(if (desc) then desc else name). The last step can return an atomic value, like sum(//item/(@price * @qty)).
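The last-step-returns-atomic-value idea maps neatly onto ordinary sequence operations. Here's a rough Python sketch of what sum(//item/(@price * @qty)) computes, using a made-up order document (the element and attribute names are invented for the example):

```python
from xml.etree import ElementTree as ET

doc = ET.fromstring(
    '<order>'
    '<item price="2.50" qty="4"/>'
    '<item price="1.00" qty="3"/>'
    '</order>'
)

# Mimic sum(//item/(@price * @qty)): map each item node to an
# atomic value, then sum the resulting sequence.
total = sum(float(i.get('price')) * int(i.get('qty'))
            for i in doc.iter('item'))
print(total)  # 13.0
```

In XPath 1.0 you'd need an extension function or a recursive template to do this; in 2.0 the path expression itself produces the sequence of products.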

Comparison operators apply to strings, dates, times. (Backwards compatibility note: comparing strings now is done by Unicode code point, not by conversion to number() as in XPath 1.0). Arithmetic possible on dates, durations. Missing value returns empty sequence rather than NaN.

The (a, b) comma operator concatenates sequences. New operators: idiv, union, intersect, except (the latter three for nodes only).

<xsl:for-each select="1 to $count"> is handy. Operators << and >> test ‘precedes’ and ‘follows’ based on document order. Operator ‘is’ tests node identity.

The if/then/else expression is a more compact xsl:choose. Simplified FLWOR expressions (for/return only; no let, where, or order by).

Useful functions: ends-with(), string-join(), current-date(), distinct-values(), deep-equal().

From XPath to XSLT: <xsl:for-each-group> with current-group() and current-grouping-key(). Useful for turning a flat document (like HTML with h1, h2, etc.) into a nested structure: group-starting-with="html:h1", and so on. The instruction <xsl:function> allows defining a new function, with major benefits in reuse, clarity, and handling recursion. Custom functions can be called from more places, like @select, @group-by, and @match, but have the same expressive power as a named template.
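This isn't the XSLT itself, but the shape of the group-starting-with computation can be sketched in a few lines of Python, with an invented sample list standing in for the flat document:

```python
# Rough analogue of <xsl:for-each-group group-starting-with="h1">:
# partition a flat run of elements into groups, each opened by an h1.
def group_starting_with(items, is_start):
    groups = []
    for item in items:
        if is_start(item) or not groups:
            groups.append([])   # a new current-group() begins here
        groups[-1].append(item)
    return groups

flat = [('h1', 'Intro'), ('p', 'a'), ('h1', 'Body'), ('p', 'b'), ('p', 'c')]
groups = group_starting_with(flat, lambda e: e[0] == 'h1')
# [[('h1', 'Intro'), ('p', 'a')], [('h1', 'Body'), ('p', 'b'), ('p', 'c')]]
```

Each inner list corresponds to one iteration of the for-each-group body, with the h1 tuple playing the role of current-grouping-key().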

Regular expressions: new XPath functions matches(), tokenize(), and replace() (including subexpressions). <xsl:analyze-string> splits a string into matching and non-matching parts, handled separately in <xsl:matching-substring> and <xsl:non-matching-substring> child elements, with regex-group() available inside.

I/O: Instruction <xsl:result-document> allows multiple output files. unparsed-text() allows input of non-XML documents (particularly in conjunction with regex).

Do I have to pay attention to types? “Usually, no.” BUT schemas can help catch errors, improve performance, and open new avenues of processing (like matching a template based on a schema-type).

Odds and ends: tunneling parameters (don’t have to repeat all the params for named templates), multiple modes, @select in more places, @separator attribute on xsl:attribute and xsl:value-of.

Brief Q&A: No test suite available. Probably better for new users to jump straight into 2.0. But going back to 1.0 is still painful. -m

Monday, December 8th, 2008

Overheard and overseen

Overheard at XML 2008: “Wow, it’s a good thing Mark Logic sponsored, otherwise nobody would be here.” (There were only five tables in the expo area.)

Overseen on the XML 2008 schedule: only one mention of XQuery, and that’s in relation to eXist, not the aforementioned sponsor.

This conference does have a different feel to it. Is XML at the ASCII-tipping-point, where it becomes so obvious that conferences aren’t needed? -m

Monday, December 8th, 2008

XML 2008 non-liveblog: Content Authoring Schemas

I was on the panel with Bob DuCharme, Frank Miller, and Evan Lenz discussing content authoring, from DITA to DocBook with some WordML sprinkled in for good measure. It was a good discussion, nothing earth-shaking. This session was laptopless, so I don’t have any significant notes. -m

Monday, December 8th, 2008

XML 2008 liveblog: Accelerated DITA Publishing

Roy Amodeo, Stilo.

Only 4 people in attendance when the talk starts. Quick overview of DITA: transclusion (conref), topic-level maps, specialization, metadata-based filtering. XML and SGML flavors available. The Open Toolkit has been a big part of DITA’s success. Replaceable components (XSLT and FO). Many editing environments and CMSes include it.

Topic-based publishing works best with many small, fairly independent topics. How well does the Open Toolkit work when pushing the boundaries? A DITA stress test: raising file size increases processing time faster than linearly; at an average file size of 300k it crashed. For the overall number of files, roughly linear progression, but it still blows up at large volumes.

Enter the OmniMark DITA Accelerator. Behavior modeled after toolkit, but minus the limits (streaming). Uses referents (placeholders left in place, filled in later; 2-pass algorithm). Base speed improvement 4X. Works well past where the Toolkit runs out of memory. Because DITA is standardized, the accelerated implementation can be easily plugged in.

Usability: XSLT exists somewhat uneasily with DITA. DITA Accelerator augments OmniMark with DITA-specific rules.

Conclusion: Standards are about choice of tools. (But how many OmniMark implementations are there?) Still, this makes me think I should check out the OmniMark language. I remain skeptical on DITA.


Monday, December 8th, 2008

XML 2008 liveblog: Content Modeling with XSD Schema

Delivered by Pradeep Jain, Ictect Inc. He has a handout available: “Intelligent Content Plug-In for Microsoft Word”, though it’s not obvious from the program that Word is involved.

What is content modeling? “Getting inside of” the content and its semantics, and from there to syntax and XML tagging.

Challenges: art vs. science, tacit vs. written documentation, future-proofing, technical vs. business communication, flexibility vs. stability. Getting knowledge workers to participate. Correctness (an emphasis of Ictect).

What is correctness of a model? More than valid XML. Litmus test: SME says “yep, I think you got it!”. But some machine-generated tests are possible.

Shows a Word doc with different kinds of bibliographic references (articles vs. books). Shows Schema code not visible from the back of the room. Word plug-in displays sidebar with a “convert” function, with several possible Schemas available to work against. Automatically detected sections in the document and added <section> elements. Progressively more complex examples of generated markup.

It seems like this is actually a pretty clever application, though it is hard to tell from this talk. -m

Monday, December 8th, 2008

XML 2008 liveblog: Ubiquity XForms

I will talk about one or more sessions from XML 2008 here.

Mark Birbeck of Web Backplane talking about Ubiquity XForms.

Browsers are slow to adopt new standards. Ajax libraries have attempted to work around this. Lots of experimentation, which is both good and bad, but at least it has legitimized extensions to browsers. JavaScript is the assembly language of the web.

Ubiquity XForms is part of a library, which will also include RDFa and SMIL. Initially based on YUI, but in theory it should be adaptable to other libraries like jQuery.

Declarative: tools for creation and validation. Easier to read. Ajax libraries are approaching the level of being their own language anyway, so might as well take advantage of a standard.

Example: setting the “inner value” of a span: <span value="now()"></span>.

Script can do this easily: onclick="this.innerHTML = Date().toLocaleString();" But that crosses the line from semantics to specific behavior. The previous one is exactly how xforms:output works.

Another example: tooltips. They break down to onmouseover and onmouseout event handlers plus show and hide. A jQuery-like approach can search the document for all tooltip elements and add the needed handlers, avoiding explicit behavioral code. This is the essence of Ubiquity XForms (and in fact of XForms itself).

Patterns like these compose under XForms. A button (xf:trigger) or any form control can easily have a tooltip (xf:hint). These are all regular elements, stylable with CSS, accessible via the DOM, and so forth. Specific events (like xforms-hint) fire at well-defined points, and a spreadsheet-like engine can update interdependencies.

Question: Is this client-side? A: Yes, all running within Firefox. The entire presentation is one XForms document.

Demo: a range control with class="geolocation" that displays as a map with Google Maps integration. The Ubiquity XForms library contains many such extensibility points.

Summary: Why? Simple, declarative. Not a programming language. Speeds up development. Validatable. Link:

Q&A: Rich text? Not yet, but not hard (especially with YUI). Formally XForms compliant? Very nearly 1.1 conforming.


Thursday, July 10th, 2008

Easing back into xml-dev

Traffic ain’t what it used to be there. But since I’m at a core xml technology company, it makes sense to participate again. Now, are there any topics left that haven’t been hashed to death? (hint: yes) -m

Wednesday, July 9th, 2008

Google Protocol Buffers: what’s missing from this picture?

Today Google announced Protocol Buffers, described as “think XML, but smaller, faster, and simpler”. Language bindings for C++, Java, and Python. Oddly, not even a whisper about JSON, which is a much more apt comparison. And along with that, no JavaScript implementation. So why the omission?

My guess is that it wouldn’t compare that favorably with JSON. The extra needed compile step is a hassle, and doesn’t give enough of a relative benefit for Ajax applications. But perhaps this will unleash a torrent of people asking for ‘binary JSON’. OK, maybe not… -m

Wednesday, May 21st, 2008

XQuery Annoyances…

If you are used to XSLT 1.0 and XForms, you see { $book/bk:title } and think nothing of it. XSLT 1.0 calls the curly-brace construct an Attribute Value Template, which is pretty descriptive of where it’s used. Always in an attribute, always converted into a string, even if you are actually pointing to an element.

In XQuery, though, the curly-brace construct can be used in many different places. Depending on the context, the above code might well insert a bk:title element into your output. The proper thing to do, of course, is { $book/bk:title/text() }. Many XSLT and XForms authors would omit the extra text() selector as superfluous, but in XQuery it matters.
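The node-vs-string distinction has an analogue in most XML APIs. A small Python sketch (the document and the urn:bk namespace URI are invented for the example) shows the same two results side by side:

```python
from xml.etree import ElementTree as ET

# Hypothetical document; urn:bk stands in for whatever the bk prefix binds to.
book = ET.fromstring(
    '<book xmlns:bk="urn:bk"><bk:title>XForms Essentials</bk:title></book>'
)

node = book.find('{urn:bk}title')   # like { $book/bk:title }: a node
text = node.text                    # like { $book/bk:title/text() }: a string

assert node.tag == '{urn:bk}title'
assert text == 'XForms Essentials'
```

XSLT 1.0's attribute value templates silently take the string value (the second form); XQuery makes you ask for it explicitly.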

What’s worse, depending on your browser, you might not see any output on the page within a <bk:title> element (or a title element of any namespace). Caveat browser! -m


Friday, January 25th, 2008

WebPath: Python XPath 2 engine now up on Sourceforge

I’ve taken this opportunity to ditch CVS on all my existing Sourceforge projects (pyxmlwiki, xfv) while setting up my newest project. Here’s the browsable Subversion source. Have at it.

Where should you start with this code? Step zero, if you haven’t already, is to look through my XML 2007 slides on my site. First thing is to grab a copy of PLY, which is a dependency. Then with all these files in your current directory, run python with no parameters. At the interpreter prompt type import demo then demo.demo1(), demo.demo2(), and so on. This will give you a feel for how the system works. Look at the source of to see how it works at the high level.

To actually get into the code, I suggest opening and scrolling down to the end, where a large series of unit tests begins. Tracing through these will be (I hope!) instructive on how the various details of the engine are put together.

There are many missing pieces (a few intentionally so). So have a look around the code and start thinking about what you could do with it. One thing I would love to have happen soon is getting rid of minidom, replacing it with something more robust.

If you want developer access on Sourceforge, drop me a note with your sf username. -m

Thursday, January 24th, 2008

WebPath wants to be free (BSD licensed, specifically)

WebPath, my experimental XPath 2.0 engine in Python is now an open source project with a liberal BSD license. I originally developed this during a Yahoo! Hack Day, and now I get to announce it during another Hack Day. Seems appropriate.

The focus of WebPath was rapid development and providing an experimental platform. There remains tons of potential work left to do on it…watch this space for continued discussion. I’d like to call out special thanks to the Yahoo! management for supporting me on this, and to Douglas Crockford for turning me on to Top Down Operator Precedence parsers. Have a look at the code. You might be pleasantly surprised at how small and simple a basic XPath 2 engine can be. So, who’s up for some XPath hacking?

Code download. (Coming to SourceForge with CVS, etc., in however many days it takes them to approve a new project) I hope this inspires more developers to work on similar projects, or better yet, on this one! -m

Monday, December 31st, 2007

Should documents self-version?

This blog page at the W3C discusses the TAG finding that a data format specification SHOULD provide for version information, specifically reconsidering that suggestion. As a few data points, XML 1.1 (with explicit version identifiers) is something of a non-starter, while Atom (without explicit version identifiers) is doing OK so far–though a significant revision to the core hasn’t happened and perhaps never will.

In a chat with Dave Orchard at XML 2007, I suggested that the evolution of browser User-Agent strings might be a useful model, since it developed in response to the actual kinds of problems that versioning needs to solve.

Indeed, the idea seemed familiar in my mind. In fact, I posted it here, in Feb 2004. The remainder of this posting republishes it with minor edits for clarity:

‘Standard practice’ of x.y.z versioning, where x is major, y is minor, and z is sub-minor (often build number) is not best practice. If you look at how systems actually evolve over time, a more ‘organic’ approach is needed.

For example, look at how browser user agent strings have evolved. Take this, for example:

Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows 98) Opera 7.02 [en]

Wow, if detection code is looking for a substring of “Mozilla” or “Mozilla/4” or “Mozilla/4.0”, or “MSIE” or “MSIE 6” or “MSIE 6.0” or “Opera” or “Opera 7” or “Opera 7.0” or “Opera 7.0.2” it will hit. If you look at the kind of code to determine what version of Windows is running, or the exact make and model of processor, you will see a similar pattern.

Since this is the way of nature, don’t fight it with artificial, fixed-length major.minor versioning. Embrace organically growing versions.

The first version of anything should be “1.” including the dot. (letters will work in practice too) All sample code, etc. that checks versions must stop at the first dot character; anything beyond that is on a ‘needs-to-know’ basis. A check-this-version API would be extremely useful, though a basic string compare SHOULD work.

Then, whenever revisions come out, the designers need to decide if the revision is compatible or not. A completely incompatible release would then be “2.”. However, a compatible release would be “1.1.”. All version checking code would continue to look only up to the first dot, unless it has a specific reason to need more details. Then it can go up to the 2nd dot, no more.

Now, even code that is expecting version “1.1.” will work fine with “1.1.1.” or “1.1.86.”.
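The check-this-version API suggested above amounts to a prefix test. A minimal sketch (the function name is my own invention, not part of any standard):

```python
def is_compatible(required: str, actual: str) -> bool:
    """True if `actual` comes from a release tree compatible with
    `required`. Both use the trailing-dot convention ("1.", "1.1.", ...),
    so a plain string-prefix test is all that is needed."""
    return actual.startswith(required)

# Code expecting the "1.1." line accepts any deeper compatible revision:
assert is_compatible("1.1.", "1.1.86.")
# ...and anything in the "1." tree:
assert is_compatible("1.", "1.1.86.")
# ...but an incompatible "2." release fails the check:
assert not is_compatible("1.1.", "2.")
```

Note how the caller decides how much compatibility it needs simply by how long a prefix it passes in; no fixed-field parsing is required.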

Every new release needs to decide (and explicitly encode in the version string) how compatible it is with the entire tree of earlier versions.

Now, as long as compatible revisions keep coming out, the version string gets longer and longer. This is the key benefit, and why fixed-field version numbers are so inflexible. (and why you get silly things like Samba reporting itself as “Windows 4.9”).

One possible enhancement, purely to make version numbers look more like what folks are used to, is to allow a superfluous zero at the end. Thus the first version is 1.0, followed by 1.1.0, then (this next one making an incompatible change) 1.2.0, and so on.

So if a document needs to self-version at all, perhaps a scheme like this should be used? -m

Monday, December 31st, 2007

XPath puzzler: solution

Thanks to all the folks who showed interest in this little XPath puzzler published here a few weeks ago. Some asked to see the dataset, but I’m not able to release it at this time (but ask me again in 3 months).

Turns out it was a combination of two bugs, one mine, one somebody else’s. Careful observers noted that I wasn’t using any namespace prefixes in the XPath, and since I did specify that it was XPath 1.0, that technically rules out XHTML as the source language. As with nearly all XML I work with these days, the first thing I did was strip off the namespaces to make the file easier to work with. Bug #1 was that in a few cases, the namespaces didn’t get stripped.

Bug #2 was in the XPath engine itself. Which one? Uh, whatever one ships with the “XPath” plugin for JEdit. It’s hard to tell directly, but I think it might be an older version of Xalan-J. In the case of the expression //meta, it properly located only those elements in no namespace. But in the case of //meta/@property, it was including all the nodes that would have been selected by //*[local-name(.)='meta']/@property. Hence the larger number of returned nodes.
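The mismatch is easy to reproduce outside JEdit. This Python sketch uses a tiny made-up document (one leftover namespaced meta standing in for bug #1) and counts both ways, mimicking the strict and the buggy selection:

```python
from xml.etree import ElementTree as ET

doc = ET.fromstring(
    '<root xmlns:x="urn:x">'
    '<meta property="a"/><meta property="b"/>'
    '<x:meta property="c"/>'  # a meta that escaped namespace stripping
    '</root>'
)

# Strict //meta semantics: only elements in no namespace match.
strict = [e for e in doc.iter() if e.tag == 'meta']

# The buggy engine behaved like //*[local-name(.)='meta']/@property,
# matching on local name regardless of namespace.
loose = [e.get('property') for e in doc.iter()
         if e.tag.split('}')[-1] == 'meta' and e.get('property')]

print(len(strict), len(loose))  # 2 3
```

Scaled up, that same discrepancy produces the 762-vs-764 counts from the puzzler.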

Confusing? You bet!  -m

P.S. WebPath would not have this problem, since in the default mode it matches local-names only to begin with.

Friday, December 21st, 2007

XML 2007 buzz: XForms 1.1

One whole evening of the program was devoted to XForms, focused around the new 1.1 Candidate Recommendation. I admit that some of the early 1.1 drafts gave me pause, but these guys did a good job cleaning up some of the dim corners and adding the right features in the right places. This is worth a careful look. -m

Friday, December 21st, 2007

XML 2007 buzz: Hadoop

OK, the majority of the buzz came from my talk, where I strongly encouraged folks to take a look at Hadoop. This article seems to be saying much the same things. If you’re curious about the future of distributed computation and storage, it’s worth a look. -m

Sunday, December 16th, 2007

Slides from XML 2007: WebPath: Querying the Web as XML

Here’s the slides from my presentation at XML 2007, dealing with an implementation of XPath 2.0 in Python. I hope to have even more news in this area soon.

WebPath (html)

WebPath (OpenDocument, 4.7 megs)

Did you notice that OpenOffice has nice slide export, generating both graphically accurate slides and highly indexable, accessible text versions? -m

Saturday, December 15th, 2007

XPath puzzler

While I’ve got your attention, here’s an XPath (1.0) puzzler. I have an RDFa dataset compiled from various and sundry sources. It’s all wrapped up in a single XML file. I run this XPath to see how many meta elements are present: //meta and it returns a node-set of size 762. Now, I want to see how many property attributes are present, so I run the query: //meta/@property and it returns a node-set of size 764. How is it that the second node-set can be bigger than the first? -m

Saturday, December 15th, 2007

XML spell check

Surely somebody has implemented this in at least one tool.

In a text editor, I come across a misspelled close tag like </xsl:stylsheet>. My editor highlights the line as an error, which it is, not matching the start tag and all. Why can’t it go the extra step and give me the same kind of interface as I get for misspelled words, with an easy option to repair the spelling? This seems like a much simpler problem than all the hairy cases around human-language spell check…
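As a proof of concept, the repair half is nearly free in Python's standard library: difflib can suggest a fix for the misspelled close tag against the stack of currently open tags (the tag list here is invented for the example).

```python
import difflib

# Tags currently open at the point of the error, as an editor would track them.
open_tags = ['xsl:stylesheet', 'xsl:template', 'xsl:for-each']

bad_close = 'xsl:stylsheet'
suggestions = difflib.get_close_matches(bad_close, open_tags, n=1)
print(suggestions)  # ['xsl:stylesheet']
```

Because the candidate set is just the open-tag stack rather than a whole dictionary, this is indeed a much easier problem than human-language spell check.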

So, what tools already do this today? -m

Thursday, November 29th, 2007

XPath 2.0 implementation details

Well, my plans for a series of postings about details of implementing XPath 2.0 fell rather short, so let’s skip straight to the good stuff.

An article by Mike Kay giving the details of the Saxon architecture. On the surface it’s about performance, but it also has an excellent section on internals. Worth a look. This has been quite influential for me, and maybe for you too. -m

Saturday, November 10th, 2007

RDFa question

What is the difference between placing instanceof="prefix:val" vs. rel="prefix:val" on something? How do I decide between the two?

In the example of hEvent data, why is it better/more accurate to use instanceof="cal:Vevent" instead of a blank node via rel="cal:Vevent"?


Monday, November 5th, 2007

A better name for CURIEs (?)

“Compact Clark Notation“. (Inspired by reading this) -m

Monday, October 22nd, 2007

Is there fertile ground between RDFa and GRDDL?

The more I look at RDFa, the more I like it. But still it doesn’t help with the pain-point of namespaces, specifically of unmemorable URLs all over the place and qnames (or CURIEs) in content.

Does GRDDL offer a way out? Could, for instance, the namespace name for Dublin Core metadata be assigned to the prefix “dc:” in an external file, linked via transformation to the document in question? Then it would be simpler, from a producer or consumer viewpoint, to simply use names like “dc:title” with no problems or ambiguity.

This could be especially useful now that discussions are reopening around XML in HTML.

As usual, comments welcome. -m

Saturday, October 20th, 2007

Building a tokenizer for XPath or XQuery

In researching for an XPath 2.0 implementation, I ran across this curious document from the W3C. Despite being labeled a Working Draft (as opposed to a Note), it appears to be a one-shot document with no future hope for updates or enhancements.

In short, it outlines several options for the first stage or two of an XPath 2.0 or XQuery implementation. (Despite the title, it talks about more than just a tokenizer; it also covers a parser and a possible intermediate stage.) Tokenizing and parsing XPath are significantly more difficult than for most other languages, because things like this are perfectly legitimate (if useless):

if(if) then then else else- +-++-**-* instance
of element(*)* * * **---++div- div -div

The document tries to standardize on some terminology for various approaches toward dealing with XPath. The remaining bulk of the document sketches out some lexical states that would be useful for one particular implementation approach. I guess the vibrant, thriving throngs of XPath 2.0 developers didn’t see the need for this kind of assistance.

In short, I didn’t find it terribly useful. Maybe some readers have, though. Feel free to comment below. Subsequent articles here will describe how I approached the problem. Stay sharp! -m

Monday, October 15th, 2007

XForms evening at XML 2007

Depending on who’s asking and who’s answering, W3C technologies take 5 to 10 years to get a strong foothold. Well, we’re now in the home stretch for the 5th anniversary of XForms Essentials, which was published in 2003. In past conferences, XForms coverage has been maybe a low-key tutorial, a few day sessions, and hallway conversation. I’m pleased to see it reach new heights this year.

XForms evening is on Monday, December 3 at the XML 2007 conference, and runs from 7:30 until 9:00, plus however long ERH takes on his keynote. :) The scheduled talks are shorter and punchier, and feature a lot of familiar faces, and a few new ones (at least to me). I’m looking forward to it–see you there! -m

Monday, October 8th, 2007

XML 2007 Schedule

As widely reported by now, the final schedule for XML 2007 this December in Boston is up. All I have to add is the suggestion of careful attention to the Tuesday program at 4:00. :) If you can’t wait, some technical details are forthcoming in this space. That is all. -m

Wednesday, October 3rd, 2007

XML Annoyance: do greater-than signs need to be escaped?

Let’s see how many downstream pieces of software trip over this post…

Do greater-than and less-than signs need to be escaped in XML? Conventional wisdom has it that less-than signs always do, since that character starts a fresh “tag”, but greater-than signs are safe.


There is a particular sequence, namely ]]>, that is not allowed to occur unescaped in XML “for compatibility”–a particular phrase the spec uses to indicate rules that only an SGML-head could love (but still strict requirements nonetheless). Does your software prevent this condition from causing an error? -m
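P.S. A quick check of how Python's standard library, one possible answer to the question above, treats the sequence:

```python
from xml.sax.saxutils import escape
from xml.etree import ElementTree as ET

# escape() handles the "for compatibility" rule: the > in ]]> gets escaped.
print(escape(']]>'))  # ]]&gt;

# Round-trip: the escaped form parses back to the literal string.
elem = ET.fromstring('<a>' + escape(']]>') + '</a>')
assert elem.text == ']]>'

# And what does the parser itself do with a raw, unescaped ]]> in content?
try:
    ET.fromstring('<a>]]></a>')
    verdict = 'accepted'
except ET.ParseError:
    verdict = 'rejected'
print(verdict)
```

I won't spoil the verdict; run it against your own toolchain and see.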

Monday, October 1st, 2007

simple parsing of space-separated attributes in XPath/XSLT

It’s a common need to parse space-separated attribute values in XPath/XSLT 1.0, usually from @class or @rel. One common (but incorrect) technique is a simple equality test, as in {@class="vcard"}. This is wrong, since the attribute can contain other tokens alongside the one you want, like “foo vcard” or “vcard foo” or “ foo vcard bar ”, in which case the equality test fails.

The proper way is to look at individual tokens in the attribute value. On first glance, this might require a call to EXSLT or some complex tokenization routine, but there’s a simpler way. I first discovered this on the microformats wiki, and only cleaned up the technique a tiny bit.

The solution involves three XPath 1.0 functions, contains(), concat() to join together string fragments, and normalize-space() to strip off leading and trailing spaces and convert any other sequences of whitespace into a single space.

In English, you:

  • normalize the class attribute value, then
  • concatenate spaces front and back, then
  • test whether the resulting string contains your searched-for value with spaces concatenated front and back (e.g. “ vcard ”)

Or {contains(concat(' ', normalize-space(@class), ' '), ' vcard ')}. A moment’s thought shows that this works well on all the different examples shown above, and it is perhaps even less involved than resorting to extension functions that return nodes requiring further processing or looping. It would be interesting to compare performance as well…
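The same pad-and-search trick translates directly to other languages. Here's a Python rendering for comparison, with the sample class values taken from the examples above:

```python
def has_token(value: str, token: str) -> bool:
    # normalize-space(): collapse runs of whitespace, trim the ends.
    normalized = ' '.join(value.split())
    # concat() spaces front and back, then look for ' token '.
    return (' ' + token + ' ') in (' ' + normalized + ' ')

assert has_token('vcard', 'vcard')
assert has_token('foo vcard', 'vcard')
assert has_token(' foo  vcard bar ', 'vcard')
assert not has_token('vcard2 foo', 'vcard')   # no false prefix matches
```

The last assertion is the point of the padding: a bare substring test would wrongly match vcard inside vcard2.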

So next time you need to match class or rel values, give it a shot. Let me know how it works for you, or if you have any further improvements. -m

Friday, September 21st, 2007

Come see me at XML 2007

Watch this space for details. I’ll be speaking about something related to Python and XPath 2.0. Watch this blog for tidbits on the subject. :) -m

Saturday, September 8th, 2007

Steven Pemberton and Michael(tm) Smith on (X)HTML, XForms, mobile, etc.

Video from XTech, worth a look. -m

Wednesday, August 8th, 2007

New W3C Validator

Go check it out. It even has a Tidy option to clean up the markup. But they missed an important feature: an option to run Tidy on the markup first, then validate the result. That’s becoming the de facto bar for web page validity anyway… -m