Archive for the 'standards' Category

Thursday, October 30th, 2008

XiX (XForms in XQuery)

I’m pondering implementing the computational parts of the XForms Model in XQuery. Doing so in a largely functional environment poses some challenges, though. Has anybody tackled this before? How about in any functional language, including ML, Haskell, Scheme, XSLT, or careful Python?

I borrowed the book Purely Functional Data Structures from a friend–this looks to be a good start. What else is out there? Comment below. -m

Thursday, October 23rd, 2008

RDFa is a Recommendation

Haven’t mentioned here that RDFa is a W3C Recommendation. I’m thrilled that something that I’ve been thinking about for a while is ready for prime time.

Also, as of this writing the first page of results at Google still prominently links to a terribly outdated draft of the spec. The first page of results at Yahoo! nails it. Just sayin’.


Friday, October 10th, 2008

More mobile XForms goodness

I haven’t tried this, but these guys claim to have a solution where

The form definitions are saved and exchanged as XForms, and the data as XForm[s] models. The data can be exchanged over http (if the phone users can afford GPRS and have a data connection) or over compressed SMS messages.

Sounds like they have the right idea… -m

Thursday, October 2nd, 2008

XForms spambots on the loose

A determined spambot has been submitting the XForms contact form on XForms Institute. OK, so it’s probably more Flash-aware than XForms-aware, but still. -m

Wednesday, September 17th, 2008

The case for native higher-order functions in XQuery

The XQuery Working Group is debating the need for higher-order functions in the language. I’m working on honing my description of why this is an important feature. Does this work? What would work better?

Imagine you are writing a smallish widget app, in an environment without a standard library. When you need to sort your widgets, you’d write a simple function with a signature like sort(sequence-of-widgets). That’s great.

Now imagine you find your app to be steadily growing. An accumulation of smaller one-off solutions won’t work anymore, you need a general solution. What you’ll end up with is something like qsort in C, which takes a pointer to a comparator function. By providing different comparators, you can sort anything any way you like, all through only a single sort function. C and C++ have something like this, as do PHP, Python, Java, JavaScript, and even assembly language. XSLT has it, as proven by Dimitre.

XQuery doesn’t. It should, because people are now using it for more than short queries. People are writing programs in it. -m

P. S. Comment please.

Monday, August 4th, 2008

Implementing RDFa in XQuery

Through the weekend I put most of the final touches on an implementation of RDFa in XQuery. The implementation is based on the functional specification of RDFa, an offshoot of the excellent work coming out of the W3C task force.

The spec contains a procedural description of the parsing algorithm, and several have successfully followed it to arrive at a conforming implementation. But you would have tough times explaining RDFa to someone that way. The functional description sort of fell out of the way I described RDFa to people.

“When you see an element with XXXX, you generate a triple, using SSSS as the subject, PPPP as the predicate, and OOOO as the object.”

Which arguably is the more natural way to express the algorithm for functional languages like XQuery or XSLT. Fill in the right blanks and you pretty much have it. In practice, it’s somewhat more complicated, but not nearly so much as with other W3C specs.

I hope to make the code available soon. You’ll hear about it first here.

I’ll write more when I’m not exhausted. :-) -m

Friday, July 25th, 2008

Complete this sequence…

In C, if you find yourself writing large switch statements (or rafts of if statements), you should consider using pointers to functions instead.

In C++, if you find yourself writing large switch statements (or rafts of if statements), you should consider using objects and polymorphism instead.

In XQuery, If you find yourself writing large typeswitch statements (or rafts of if statements), you should consider using _______________ instead.

Comment here. -m

Wednesday, July 16th, 2008

Starting to wrap my head around XQuery 1.1

Looks like a reasonably-sized revision. The first public working draft seems downright thin, in fact, relative to all the SHOULDs and MAYs in the requirements document. In particular, I’d like to see progress on 2.3.16 Higher order functions. (Then do we get a book XQuery: The Good Parts? …kidding..)


Tuesday, July 15th, 2008

Top Down Operator Precedence in Python

This article made my day. Very similar approach to what I did in WebPath, but even cleaner. Great explanation and performance numbers. -m

P.S. Thanks to Crock for pointing this out.

Thursday, July 10th, 2008

Easing back into xml-dev

Traffic ain’t what it used to be there. But since I’m at a core xml technology company, it makes sense to participate again. Now, are there any topics left that haven’t been hashed to death? (hint: yes) -m

Wednesday, July 9th, 2008

Google Protocol Buffers: what’s missing from this picture?

Today Google announced Protocol Buffers, described as “think XML, but smaller, faster, and simpler“. Language bindings for C++, Java, and Python. Oddly not even a whisper about JSON, which is a much more apt comparison. And along with that, no JavaScript implementation. So why the omission?

My guess is that it wouldn’t compare that favorably with JSON. The extra needed compile step is a hassle, and doesn’t give enough of a relative benefit for Ajax applications. But perhaps this will unleash a torrent of people asking for ‘binary JSON’. OK, maybe not… -m

Friday, June 20th, 2008

RDFa is a Candidate Recommendation

The result of tons of work by lots of smart people. Go forth and implement. And I need to put in a plug for Metadata for Grandma which (indirectly, as it turned out) influenced the spec. RDFa is already a big deal, used in places like SearchMonkey. The subset of RDFa used by SearchMonkey is 100% conforming to the CR.

I’ll have more thoughts and perhaps implementation notes on this later. -m

Thursday, June 5th, 2008

Microformat search done right

From the Yahoo! Developer blog, new search keywords you can use to hone in on indexed microformats.

For example, to see every hAtom-bearing page that mentions ‘dubinko’ use the query [ dubinko]. Works similarly for hCard, hCalendar, hReview, and XFN. I’m sure more are coming soon too. -m

Thursday, May 29th, 2008


Bumped into XRX today. XForms + REST + XQuery. I like the sound of this, and XForms on the client just got a whole bunch easier…

I’m seeing multiple signs that the confluence of XForms and XQuery has legs. (And REST just plain makes sense in any situation). -m

Wednesday, May 28th, 2008

XForms Validator on Google App Engine?

I registered ‘xfv’ on Google App Engine. Too bad there doesn’t appear to be any significant XML libraries supported. I have XPath covered by my pure-python WebPath, but what about Relax NG? Anyone know of anything in pure python? -m

Wednesday, May 28th, 2008

OK already, XQuery has FLWORs, I get it

A very short rant on the state of XQuery tutorial materials on the web (not naming any names or linking any links).

I get it. Thank you for your fanatical emphasis on FLWOR constructs, but there is much more to it than that.

A few introductory sources don’t fall in to this trap, though. Mike Kay’s stuff. Priscilla Walmsley’s O’Reilly book for another. I’m pretty much finishing up reading it so I’ll review it here soon. -m

Thursday, May 22nd, 2008

XForms Ubiquity

I just found out about a nice little XForms engine called Ubiquity. (Having dinner with Mark Birbeck, TV Raman, and Leigh Klotz certainly helps one find out about such things) :-)

It’s a JavaScript implementation done right. Open source under the Apache 2.0 license. Seems like a nice fit with, oh maybe MarkLogic Server? -m

Wednesday, May 21st, 2008

XQuery Annoyances…

If you are used to XSLT 1.0 and XForms, you see { $book/bk:title } and think nothing of it. XSLT 1.0 calls the curly-brace construct an Attribute Value Template, which is pretty descriptive of where it’s used. Always in an attribute, always converted into a string, even if you are actually pointing to an element.

In XQuery, though, the curly-brace construct can be used in many different places. Depending on the context, the above code might well insert a bk:title element into your output. The proper thing to do, of course, is { $book/bk:title/text() }. Many XSLT and XForms authors would omit the extra text() selector as superfluous, but in XQuery it matters.

What’s worse, depending on your browser, you might not see any output on the page within a <bk:title> element (or a title element of any namespace). Caveat browser! -m


Saturday, May 17th, 2008

Are microformats right for your site?

Yeah, more than ever before. See my article on Yahoo! developer net. The stuff I talk about here is currently live in the indexer. -m

Friday, May 9th, 2008

FunctX XQuery library

In the new-to-me department, here’s a library and description of useful XQuery functions from my friend Priscilla Walmsley. XSLT 2, also. -m

P.S. Mark my words, more news is coming…

Thursday, April 3rd, 2008

US Census == paper technology

Never let anyone say that forms are easy. What seems like a boring, tedious topic on the surface is surprisingly deep and challenging. As evidence, the multi-billion-dollar plan to modernize the US census in 2010 has fallen back to paper technology. Sadly their plans didn’t involve XForms.

Highly-critical applications, like say voting, are even more difficult to get right. Possibly the government will get it in shape be 2020 or 2030. -m

Thursday, March 13th, 2008

The (lowercase) semantic web goes mainstream

So today Yahoo! announced a major facet of what I’ve been working on lately: making the web more meaningful. Lots of fantastic coverage, including TechCrunch and ReadWriteWeb (and others, please link in the comments), and supportive responses and blog posts across the board. It’s been a while since I’ve felt this good about being a Yahoo.

So what exactly is it?

A few months ago I went through the pages on this very blog and added hAtom markup. As a result of this change…well, nothing happened. I had a good experience learning about exactly what is involved in retrofitting an existing site with microformats, but I didn’t get any tangible benefit. With the “SearchMonkey” platform, any site using microformats, or RDFa or eRDF, is exposed to developers who can enhance search results. An enhanced result won’t directly make my my site rank higher in search, it it most certainly make it prone to more clicks, and ultimately more readership, more inlinks, and better organic ranking.

How about some questions and answers:

Q: Is this Tim Berners-Lee‘s vision of the Semantic Web finally getting fulfilled?

A: No.

Q: Does this presuppose everybody rushing to change their sites to include microformats, RDF, etc?

A: No. After all, there is a developer platform. Naturally, developers will have an easier time with sites that use official and community standards for structuring data, but there is no obligation for any site to make changes in order to participate and benefit.

Q: Why would a site want to expose all its precious data in an easily-extractable way?

A: Because within a healthy ecosystem it results in a measurable increase in traffic and customer satisfaction. Data on the public web is already extractable, given enough eyeballs. An openness strategy pays off (of which SearchMonkey is an existence proof).

Q: What about metacrap? We can never trust sites to provide honest metadata.

A: The system does have significant spam deterrents built in, of which I won’t say more. But perhaps more importantly, the plugin nature of the platform uses the power of the community to shape itself. A spammy plugin won’t get installed by users. A site that mixes in fraudulent RDFa metadata with real content will get exposed as fraudulent, and users will abandon ship.

Q: Didn’t prove that having a better user interface doesn’t help gain search market share?

A: Perhaps. But this isn’t about user interface–it’s about data (which enables a much better interface.)

Q: Won’t (Google|Microsoft|some startup) just immediately clone this idea and take advantage of all the new metadata out there?

A: I’m sure these guys will have some kind of response, and it’s true that a rising tide lifts all boats. But I don’t see anyone else cloning this exactly. The way it’s implemented has a distinctly Yahoo! appeal to it. Nobody has cloned Yahoo! Answers yet, either. In some ways, this is a return to roots, since Yahoo! started off as a human-guided directory. SearchMonkey is similar, except a much broader group of people can now participate. And there are some specific human, technical and financial reasons why as well, but I suggest inviting me out for beers if you want specifics. :-)

Disclaimer: as always, I’m not speaking for my employer. See the standard disclaimer. -m

Update: more Q and A

Q: How is SearchMonkey related to the recently announced Yahoo! Microsearch?

A: In brief, Microsearch is a research project (and a very cool one) with far-reaching goals, while SearchMonkey is targeted as imminently shipping software. I frequently talk to and compare notes with Peter Mika, the lead researcher for Microsearch.

Monday, March 10th, 2008

Getting what you asked for

Some time ago, Doug Crockford’s excellent blog pointed me to this page on “excessive DTD traffic” at the W3C. Go ahead and follow that link, I’ll wait…

All the standard templates that show how to construct a basic XHTML page include a public identifier of and often a namespace name of As the blog points out, these are not actually hyperlinks, they only play them on TV. Huge quantities of software are requesting these URLs 24×7, putting a load on their servers. Often times this results from unfortunate defaults in off-the-shelf XML components such as parsers.

But what did you expect?

This is the web equivalent of having a front-desk receptionist hand out a stacks of self-addressed, stamped postcards, then complaining about how much mail the company gets from all around the world.

HTTP URLs are great for identifiers on a technical basis: they are based on DNS names and have the important qualities of uniqueness and persistence. But as far as human factors go, they are a terrible choice (though with a great deal of inertia at this point). -m

Thursday, March 6th, 2008

microformat search at Yahoo!

Somehow I missed this posting and the underlying news that a Y Research project has a nice public demo of semantic search, driven by RDF, RDFa, and microformats. Still a rough sketch of a full solution, with multiple-second access times. But I particularly like the query for renaissance faire. -m

Monday, March 3rd, 2008

WebPath and Wikipedia

The WebPath bug reports continue to roll in. For one, queries against *.wikipedia.* don’t seem to work. You get something back, but it has no resemblance to the page you were looking for. The problem comes from the W3C tidy service that I use, specifically that the (understandably overworked and understaffed) admins at the Wikimedia Foundation seem to have blocked it. It seems like more than a simple IP or user-agent-based block. I’ve emailed them about it but haven’t heard back yet.

So, this highlights the limitation of having a single-source converter in the Platonic Web module of WebPath. So I turn to my readers: do you know of any other tidy servers? Or converters of a non-tidy origin? For any of these to work, they need to return clean XML corresponding to the original page (as opposed to, say, returning something with big headers/footers or ampersand-encoded). This seems like an outstanding need for the open source community.

Please comment below with ideas. Thanks! -m

UPDATE: heard back from the Wikipedia admins, and although professional and helpful-as-can-be-expected, they won’t be changing anything on their end. Still looking for more open source options.

Wednesday, February 13th, 2008

WebPath on

It’s been an exhausting past couple of weeks, but life goes on. WebPath made front page at I’m starting to get feedback from developers who are actually using it, filing bugs, suggesting features, and it’s gratifying. The community is still building up. Won’t you join too? -m

Friday, January 25th, 2008

WebPath: Python XPath 2 engine now up on Sourceforge

I’ve taken this opportunity to ditch CVS on all my existing Sourceforge projects (pyxmlwiki, xfv) while setting up my newest project. Here’s the browable subversion source. Have at it.

Where should you start with this code? Step zero, if you haven’t already, is to look through my XML 2007 slides on my site. First thing is to grab a copy of PLY, which is a dependency. Then with all these files in your current directory, run python with no parameters. At the interpreter prompt type import demo then demo.demo1(), demo.demo2(), and so on. This will give you a feel for how the system works. Look at the source of to see how it works at the high level.

To actually get into the code, I suggest opening and scrolling down to the end, where a large series of unit tests begins. Tracing through these will be (I hope!) instructive on how the various details of the engine are put together.

There are many missing pieces (a few intentionally so). So have a look around the code and start thinking about what you could do with it. One thing I would love to have happen soon is getting rid of minidom, replacing it with something more robust.

If you want developer access on Sourceforge, drop me a note with your sf username. -m

Thursday, January 24th, 2008

WebPath wants to be free (BSD licensed, specifically)

WebPath, my experimental XPath 2.0 engine in Python is now an open source project with a liberal BSD license. I originally developed this during a Yahoo! Hack Day, and now I get to announce it during another Hack Day. Seems appropriate.

The focus of WebPath was rapid development and providing an experimental platform. There remains tons of potential work left to do on it…watch this space for continued discussion. I’d like to call out special thanks to the Yahoo! management for supporting me on this, and to Douglas Crockford for turning me on to Top Down Operator Precedence parsers. Have a look at the code. You might be pleasantly surprised at how small and simple a basic XPath 2 engine can be. So, who’s up for some XPath hacking?

Code download. (Coming to SourceForge with CVS, etc., in however many days it takes them to approve a new project) I hope this inspires more developers to work on similar projects, or better yet, on this one! -m

Monday, January 7th, 2008

Yahoo! introduces mobile XForms

Admittedly, their marketing folks wouldn’t describe it that way, but essentially that’s what was announced today. (documentation in PDF format, closely related to what-used-to-be Konfabulator tech; here’s the interesting part in HTML) The press release talks about reaching “billions” of mobile consumers; even if you don’t put too much emphasis on press releases (you shouldn’t) it’s still talking about serious use of and commitment to XForms technology.

Shameless plug: Isn’t it time to refresh your memory, or even find out for the first time about XForms? There is this excellent book available in printed format from Amazon, as well as online for free under an open content license. If you guys express enough interest, good things might even happen, like a refresh to the content. Let’s make it happen.

From a consumer standpoint, this feels like a welcome play against Android, too. Yahoo! looks like it’s placing a bet on working with more devices while making development easier at the same time. I’ll bet an Android port will be available, at least in beta, before the end of the year.

Disclaimer: I have been out of Yahoo! mobile for several months now, and can’t claim any credit for or inside knowledge of these developments. -m

P. S. Don’t forget the book.

Monday, December 31st, 2007

Should documents self-version?

This blog page at the W3C discusses the TAG finding that a data format specification SHOULD provide for version information, specifically reconsidering that suggestion. As a few data points, XML 1.1 (with explicit version identifiers) is something of a non-starter, while Atom (without explicit version identifiers) is doing OK so far–though a significant revision to the core hasn’t happened and perhaps never will.

In a chat with Dave Orchard at XML 2007, I suggested that the evolution of browser User-Agent strings might be a useful model, since it developed in response to the actual kinds of problems that versioning needs to solve.

Indeed, the idea seemed familiar in my mind. In fact, I posted it here, in Feb 2004. The remainder of this posting republishes it with minor edits for clarity:

‘Standard practice’ of x.y.z versioning, where x is major, y is minor, and z is sub-minor (often build number) is not best practice. If you look at how systems actually evolve over time, a more ‘organic’ approach is needed.

For example, look at how browser user agent strings have evolved. Take this, for example:

Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows 98) Opera 7.02 [en]

Wow, if detection code is looking for a substring of “Mozilla” or “Mozilla/4” or “Mozilla/4.0”, or “MSIE” or “MSIE 6” or “MSIE 6.0” or “Opera” or “Opera 7” or “Opera 7.0” or “Opera 7.0.2” it will hit. If you look at the kind of code to determine what version of Windows is running, or the exact make and model of processor, you will see a similar pattern.

Since this is the way of nature, don’t fight it with artificial, fixed-length major.minor versioning. Embrace organically growing versions.

The first version of anything should be “1.” including the dot. (letters will work in practice too) All sample code, etc. that checks versions must stop at the first dot character; anything beyond that is on a ‘needs-to-know’ basis. A check-this-version API would be extremely useful, though a basic string compare SHOULD work.

Then, whenever revisions come out, the designers need to decide if the revision is compatible or not. A completely incompatible release would then be “2.”. However, a compatible release would be “1.1.”. All version checking code would continue to look only up to the first dot, unless it has a specific reason to need more details. Then it can go up to the 2nd dot, no more.

Now, even code that is expecting version “1.1.” will work fine with “1.1.1.” or 1.1.86.” or “”.

Every new release needs to decide (and explicitly encode in the version string) how compatible it is with the entire tree of earlier versions.

Now, as long as compatible revisions keep coming out, the version string gets longer and longer. This is the key benefit, and why fixed-field version numbers are so inflexible. (and why you get silly things like Samba reporting itself as “Windows 4.9”).

One possible enhancement, purely to make version numbers look more like what folks are used to, is to allow a superfluous zero at the end. This the first version is 1.0, followed by 1.1.0,, (this next one made an incompatible change) 1.2.0, and so on.

So if a document needs to self-version at all, perhaps a scheme like this should be used? -m