Friday, May 9th, 2008
In the new-to-me department, here’s a library and description of useful XQuery functions from my friend Priscilla Walmsley. XSLT 2, also. -m
P.S. Mark my words, more news is coming…
Thursday, April 3rd, 2008
Never let anyone say that forms are easy. What seems like a boring, tedious topic on the surface is surprisingly deep and challenging. As evidence, the multi-billion-dollar plan to modernize the US census in 2010 has fallen back to paper technology. Sadly their plans didn’t involve XForms.
Highly-critical applications, like say voting, are even more difficult to get right. Possibly the government will get it in shape by 2020 or 2030. -m
Thursday, March 13th, 2008
So today Yahoo! announced a major facet of what I’ve been working on lately: making the web more meaningful. Lots of fantastic coverage, including TechCrunch and ReadWriteWeb (and others, please link in the comments), and supportive responses and blog posts across the board. It’s been a while since I’ve felt this good about being a Yahoo.
So what exactly is it?
A few months ago I went through the pages on this very blog and added hAtom markup. As a result of this change…well, nothing happened. I had a good experience learning about exactly what is involved in retrofitting an existing site with microformats, but I didn’t get any tangible benefit. With the “SearchMonkey” platform, any site using microformats, or RDFa or eRDF, is exposed to developers who can enhance search results. An enhanced result won’t directly make my site rank higher in search, but it will most certainly make it prone to more clicks, and ultimately more readership, more inlinks, and better organic ranking.
How about some questions and answers:
Q: Is this Tim Berners-Lee’s vision of the Semantic Web finally getting fulfilled?
A: No.
Q: Does this presuppose everybody rushing to change their sites to include microformats, RDF, etc?
A: No. After all, there is a developer platform. Naturally, developers will have an easier time with sites that use official and community standards for structuring data, but there is no obligation for any site to make changes in order to participate and benefit.
Q: Why would a site want to expose all its precious data in an easily-extractable way?
A: Because within a healthy ecosystem it results in a measurable increase in traffic and customer satisfaction. Data on the public web is already extractable, given enough eyeballs. An openness strategy pays off (of which SearchMonkey is an existence proof).
Q: What about metacrap? We can never trust sites to provide honest metadata.
A: The system does have significant spam deterrents built in, of which I won’t say more. But perhaps more importantly, the plugin nature of the platform uses the power of the community to shape itself. A spammy plugin won’t get installed by users. A site that mixes in fraudulent RDFa metadata with real content will get exposed as fraudulent, and users will abandon ship.
Q: Didn’t ask.com prove that having a better user interface doesn’t help gain search market share?
A: Perhaps. But this isn’t about user interface–it’s about data (which enables a much better interface.)
Q: Won’t (Google|Microsoft|some startup) just immediately clone this idea and take advantage of all the new metadata out there?
A: I’m sure these guys will have some kind of response, and it’s true that a rising tide lifts all boats. But I don’t see anyone else cloning this exactly. The way it’s implemented has a distinctly Yahoo! appeal to it. Nobody has cloned Yahoo! Answers yet, either. In some ways, this is a return to roots, since Yahoo! started off as a human-guided directory. SearchMonkey is similar, except a much broader group of people can now participate. And there are some specific human, technical and financial reasons why as well, but I suggest inviting me out for beers if you want specifics. :-)
Disclaimer: as always, I’m not speaking for my employer. See the standard disclaimer. -m
Update: more Q and A
Q: How is SearchMonkey related to the recently announced Yahoo! Microsearch?
A: In brief, Microsearch is a research project (and a very cool one) with far-reaching goals, while SearchMonkey is targeted as imminently shipping software. I frequently talk to and compare notes with Peter Mika, the lead researcher for Microsearch.
Monday, March 10th, 2008
Some time ago, Doug Crockford’s excellent blog pointed me to this page on “excessive DTD traffic” at the W3C. Go ahead and follow that link, I’ll wait…
All the standard templates that show how to construct a basic XHTML page include a system identifier of http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd and often a namespace name of http://www.w3.org/1999/xhtml. As the blog points out, these are not actually hyperlinks, they only play them on TV. Huge quantities of software are requesting these URLs 24×7, putting a load on the W3C’s servers. Oftentimes this results from unfortunate defaults in off-the-shelf XML components such as parsers.
But what did you expect?
This is the web equivalent of having a front-desk receptionist hand out stacks of self-addressed, stamped postcards, then complaining about how much mail the company gets from all around the world.
HTTP URLs are great for identifiers on a technical basis: they are based on DNS names and have the important qualities of uniqueness and persistence. But as far as human factors go, they are a terrible choice (though with a great deal of inertia at this point). -m
Thursday, March 6th, 2008
Somehow I missed this posting and the underlying news that a Y Research project has a nice public demo of semantic search, driven by RDF, RDFa, and microformats. Still a rough sketch of a full solution, with multiple-second access times. But I particularly like the query for renaissance faire. -m
Monday, March 3rd, 2008
The WebPath bug reports continue to roll in. For one, queries against *.wikipedia.* don’t seem to work. You get something back, but it has no resemblance to the page you were looking for. The problem comes from the W3C tidy service that I use, specifically that the (understandably overworked and understaffed) admins at the Wikimedia Foundation seem to have blocked it. It seems like more than a simple IP or user-agent-based block. I’ve emailed them about it but haven’t heard back yet.
So, this highlights the limitation of having a single-source converter in the Platonic Web module of WebPath. So I turn to my readers: do you know of any other tidy servers? Or converters of a non-tidy origin? For any of these to work, they need to return clean XML corresponding to the original page (as opposed to, say, returning something with big headers/footers or ampersand-encoded). This seems like an outstanding need for the open source community.
Please comment below with ideas. Thanks! -m
UPDATE: heard back from the Wikipedia admins, and although professional and helpful-as-can-be-expected, they won’t be changing anything on their end. Still looking for more open source options.
Wednesday, February 13th, 2008
It’s been an exhausting past couple of weeks, but life goes on. WebPath made front page at next.yahoo. I’m starting to get feedback from developers who are actually using it, filing bugs, suggesting features, and it’s gratifying. The community is still building up. Won’t you join too? -m
Friday, January 25th, 2008
I’ve taken this opportunity to ditch CVS on all my existing Sourceforge projects (pyxmlwiki, xfv) while setting up my newest project. Here’s the browsable subversion source. Have at it.
Where should you start with this code? Step zero, if you haven’t already, is to look through my XML 2007 slides on my site. First thing is to grab a copy of PLY, which is a dependency. Then with all these files in your current directory, run python with no parameters. At the interpreter prompt type import demo then demo.demo1(), demo.demo2(), and so on. This will give you a feel for how the system works. Look at the source of demo.py to see how it works at the high level.
To actually get into the code, I suggest opening webpath.py and scrolling down to the end, where a large series of unit tests begins. Tracing through these will be (I hope!) instructive on how the various details of the engine are put together.
There are many missing pieces (a few intentionally so). So have a look around the code and start thinking about what you could do with it. One thing I would love to have happen soon is getting rid of minidom, replacing it with something more robust.
If you want developer access on Sourceforge, drop me a note with your sf username. -m
Thursday, January 24th, 2008

WebPath, my experimental XPath 2.0 engine in Python, is now an open source project with a liberal BSD license. I originally developed this during a Yahoo! Hack Day, and now I get to announce it during another Hack Day. Seems appropriate.
The focus of WebPath was rapid development and providing an experimental platform. There remains tons of potential work left to do on it…watch this space for continued discussion. I’d like to call out special thanks to the Yahoo! management for supporting me on this, and to Douglas Crockford for turning me on to Top Down Operator Precedence parsers. Have a look at the code. You might be pleasantly surprised at how small and simple a basic XPath 2 engine can be. So, who’s up for some XPath hacking?
Code download. (Coming to SourceForge with CVS, etc., in however many days it takes them to approve a new project) I hope this inspires more developers to work on similar projects, or better yet, on this one! -m
Monday, January 7th, 2008
Admittedly, their marketing folks wouldn’t describe it that way, but essentially that’s what was announced today. (documentation in PDF format, closely related to what-used-to-be Konfabulator tech; here’s the interesting part in HTML) The press release talks about reaching “billions” of mobile consumers; even if you don’t put too much emphasis on press releases (you shouldn’t) it’s still talking about serious use of and commitment to XForms technology.
Shameless plug: Isn’t it time to refresh your memory, or even find out for the first time about XForms? There is this excellent book available in printed format from Amazon, as well as online for free under an open content license. If you guys express enough interest, good things might even happen, like a refresh to the content. Let’s make it happen.
From a consumer standpoint, this feels like a welcome play against Android, too. Yahoo! looks like it’s placing a bet on working with more devices while making development easier at the same time. I’ll bet an Android port will be available, at least in beta, before the end of the year.
Disclaimer: I have been out of Yahoo! mobile for several months now, and can’t claim any credit for or inside knowledge of these developments. -m
P. S. Don’t forget the book.
Monday, December 31st, 2007
This blog page at the W3C discusses the TAG finding that a data format specification SHOULD provide for version information, specifically reconsidering that suggestion. As a few data points, XML 1.1 (with explicit version identifiers) is something of a non-starter, while Atom (without explicit version identifiers) is doing OK so far–though a significant revision to the core hasn’t happened and perhaps never will.
In a chat with Dave Orchard at XML 2007, I suggested that the evolution of browser User-Agent strings might be a useful model, since it developed in response to the actual kinds of problems that versioning needs to solve.
Indeed, the idea seemed familiar in my mind. In fact, I posted it here, in Feb 2004. The remainder of this posting republishes it with minor edits for clarity:
‘Standard practice’ of x.y.z versioning, where x is major, y is minor, and z is sub-minor (often build number) is not best practice. If you look at how systems actually evolve over time, a more ‘organic’ approach is needed.
For example, look at how browser user agent strings have evolved. Take this, for example:
Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows 98) Opera 7.02 [en]
Wow, if detection code is looking for a substring of “Mozilla” or “Mozilla/4” or “Mozilla/4.0”, or “MSIE” or “MSIE 6” or “MSIE 6.0” or “Opera” or “Opera 7” or “Opera 7.0” or “Opera 7.02” it will hit. If you look at the kind of code to determine what version of Windows is running, or the exact make and model of processor, you will see a similar pattern.
Since this is the way of nature, don’t fight it with artificial, fixed-length major.minor versioning. Embrace organically growing versions.
The first version of anything should be “1.” including the dot. (letters will work in practice too) All sample code, etc. that checks versions must stop at the first dot character; anything beyond that is on a ‘needs-to-know’ basis. A check-this-version API would be extremely useful, though a basic string compare SHOULD work.
Then, whenever revisions come out, the designers need to decide if the revision is compatible or not. A completely incompatible release would then be “2.”. However, a compatible release would be “1.1.”. All version checking code would continue to look only up to the first dot, unless it has a specific reason to need more details. Then it can go up to the 2nd dot, no more.
Now, even code that is expecting version “1.1.” will work fine with “1.1.1.” or “1.1.86.” or “1.1.2.1.42.1.536.”.
Every new release needs to decide (and explicitly encode in the version string) how compatible it is with the entire tree of earlier versions.
Now, as long as compatible revisions keep coming out, the version string gets longer and longer. This is the key benefit, and why fixed-field version numbers are so inflexible (and why you get silly things like Samba reporting itself as “Windows 4.9”).
One possible enhancement, purely to make version numbers look more like what folks are used to, is to allow a superfluous zero at the end. Thus the first version is 1.0, followed by 1.1.0, 1.1.1.0, (this next one made an incompatible change) 1.2.0, and so on.
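Here is a minimal Python sketch of the check-this-version idea, assuming the trailing-dot convention described above. The function names `is_compatible` and `major` are made up for illustration; the whole scheme reduces to a plain prefix comparison, which is why a basic string compare works.

```python
def is_compatible(expected, actual):
    """Return True if `actual` is `expected` itself or a compatible revision of it.

    Both strings must end with a dot, as in "1." or "1.1.", so that
    "1.1." is never mistaken for a prefix of "1.10.".
    """
    for version in (expected, actual):
        if not version.endswith("."):
            raise ValueError(f"version {version!r} must end with a dot")
    return actual.startswith(expected)


def major(version):
    """Return the part most checking code should look at: up to the first dot."""
    return version.split(".", 1)[0] + "."


if __name__ == "__main__":
    for actual in ("1.1.", "1.1.1.", "1.1.86.", "1.1.2.1.42.1.536.", "1.2.", "2."):
        print(actual, is_compatible("1.1.", actual), major(actual))
```

Code that only cares about the major line calls `is_compatible("1.", actual)`; code with a specific reason to need more detail passes a longer prefix.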
So if a document needs to self-version at all, perhaps a scheme like this should be used? -m
Monday, December 31st, 2007
Thanks to all the folks who showed interest in this little XPath puzzler published here a few weeks ago. Some asked to see the dataset, but I’m not able to release it at this time (but ask me again in 3 months).
Turns out it was a combination of two bugs, one mine, one somebody else’s. Careful observers noted that I wasn’t using any namespace prefixes in the XPath, and since I did specify that it was XPath 1.0, that technically rules out XHTML as the source language. As with nearly all XML I work with these days, the first thing I did was strip off the namespaces to make it easier to work with. Bug #1 was that in a few cases, the namespaces didn’t get stripped.
Bug #2 was in the XPath engine itself. Which one? Uh, whatever one ships with the “XPath” plugin for JEdit. It’s hard to tell directly, but I think it might be an older version of Xalan-J. In the case of the expression //meta, it properly located only those elements part of no namespace. But in the case of //meta/@property, it was including all the nodes that would have been selected by //*[local-name(.)='meta']/@property. Hence, a larger number of returned nodes.
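To make the mismatch concrete, here is a small Python sketch using the standard library’s ElementTree. The sample document and the `local_name` helper are invented for illustration (this is not the real dataset, and not the engine from the plugin); it only shows how matching on the full name and matching on the local name give different counts when a few namespaces are left behind.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two meta elements in no namespace, and one that
# kept its XHTML namespace (the situation described as bug #1).
SAMPLE = """<data>
  <meta property="dc:title" content="One"/>
  <meta property="dc:creator" content="Two"/>
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head><meta property="dc:date" content="Three"/></head>
  </html>
</data>"""


def local_name(tag):
    """Strip the '{namespace}' part ElementTree puts in front of a tag name."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(SAMPLE)

# What //meta should select in XPath 1.0: elements named meta in no namespace.
strict = [el for el in root.iter() if el.tag == "meta"]

# What //*[local-name(.)='meta'] selects: any namespace, local name meta.
loose = [el for el in root.iter() if local_name(el.tag) == "meta"]

strict_properties = [el.get("property") for el in strict if el.get("property")]
loose_properties = [el.get("property") for el in loose if el.get("property")]

print(len(strict), len(loose_properties))
```

An engine that uses the strict rule for the element step but the loose rule for the attribute step reports fewer elements than attributes, which is the shape of the puzzle.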
Confusing? You bet! -m
P.S. WebPath would not have this problem, since in the default mode it matches local-names only to begin with.
Friday, December 21st, 2007
One whole evening of the program was devoted to XForms, focused around the new 1.1 Candidate Recommendation. I admit that some of the early 1.1 drafts gave me pause, but these guys did a good job cleaning up some of the dim corners and adding the right features in the right places. This is worth a careful look. -m
Friday, December 21st, 2007
OK, the majority of the buzz came from my talk, where I strongly encouraged folks to take a look at Hadoop. This article seems to be saying much the same things. If you’re curious about the future of distributed computation and storage, it’s worth a look. -m
Sunday, December 16th, 2007
Here’s the slides from my presentation at XML 2007, dealing with an implementation of XPath 2.0 in Python. I hope to have even more news in this area soon.
WebPath (html)
WebPath (OpenDocument, 4.7 megs)
Did you notice that OpenOffice has a nice slide export that generates both graphically accurate slides and highly indexable and accessible text versions? -m
Saturday, December 15th, 2007
While I’ve got your attention, here’s an XPath (1.0) puzzler. I have an RDFa dataset compiled from various and sundry sources. It’s all wrapped up in a single XML file. I run this XPath to see how many meta elements are present: //meta and it returns a node-set of size 762. Now, I want to see how many property attributes are present, so I run the query: //meta/@property and it returns a node-set of size 764. How is it that the second node-set can be bigger than the first? -m
Saturday, December 15th, 2007
Surely somebody has implemented this in at least one tool.
In a text editor, I come across a misspelled close tag like </xsl:stylsheet>. My editor highlights the line as an error, which it is, not matching the start tag and all. Why can’t it go the extra step and give me the same kind of interface as I get for misspelled words, with an easy option to repair the spelling? This seems like a much simpler problem than all the hairy cases around human-language spell check…
So, what tools already do this today? -m
Thursday, December 6th, 2007
I came away from the XML 2007 conference with lots of new ideas and inspirations. I’ll write some postings about individual technologies in the coming days.
But for now, another RDFa question. If I need to represent a list, what is the best way to do it? Does it differ between ordered and unordered lists? Let’s take some concrete examples, say a shopping list and an (ordered) todo list. How would you do it? -m
P.S. What about multi-level lists?
Thursday, November 29th, 2007
Well, my plans for a series of postings about details of implementing XPath 2.0 fell rather short, so let’s skip straight to the good stuff.
An article by Mike Kay giving the details of the Saxon architecture. On the surface it’s about performance, but it also has an excellent section on internals. Worth a look. This has been quite influential for me, and maybe you too. -m
Tuesday, November 13th, 2007
OK, let me take a step back from specific technologies like RDFa and go through a really simple example.
On a certain web page, I refer to a book. That book has a price of 21.86 US dollars. The page is intended as primarily human-readable, but I want to include machine-readable data too, for a global audience.
What would you do? What specific markup choices would you make? What specific markup would you use? -m
Saturday, November 10th, 2007
What is the difference between placing instanceof=”prefix:val” vs. rel=”prefix:val” on something? How do I decide between the two?
In the example of hEvent data, why is it better/more accurate to use instanceof=”cal:Vevent” instead of a blank node via rel=”cal:Vevent”?
-m
Monday, November 5th, 2007
“Compact Clark Notation”. (Inspired by reading this) -m
Monday, October 29th, 2007
Many things in life are simpler when you only need to be within 5%:
Of course, there are even more things that get more convenient when you have 10% or 20% to work with… -m
Monday, October 22nd, 2007
The more I look at RDFa, the more I like it. But still it doesn’t help with the pain-point of namespaces, specifically of unmemorable URLs all over the place and qnames (or CURIEs) in content.
Does GRDDL offer a way out? Could, for instance, the namespace name for Dublin Core metadata be assigned to the prefix “dc:” in an external file, linked via transformation to the document in question? Then it would be simpler, from a producer or consumer viewpoint, to simply use names like “dc:title” with no problems or ambiguity.
This could be especially useful now that discussions are reopening around XML in HTML.
As usual, comments welcome. -m
Saturday, October 20th, 2007
In researching for an XPath 2.0 implementation, I ran across this curious document from the W3C. Despite being labeled a Working Draft (as opposed to a Note), it appears to be a one-shot document with no future hope for updates or enhancements.
In short, it outlines several options for the first stage or two of an XPath 2.0 or XQuery implementation. (Despite the title, it talks about more than just a tokenizer; additionally a parser and a possible intermediate stage). Tokenizing and parsing XPath are significantly more difficult than other languages, because things like this are perfectly legitimate (if useless):
if(if) then then else else- +-++-**-* instance of element(*)* * * **---++div- div -div
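XPath 1.0 already has a small version of this problem, and its rule is easy to sketch: a `*` or a name like `div` counts as an operator only if there is a preceding token and that token is not `@`, `::`, `(`, `[`, `,` or another operator. The Python below is a toy tokenizer for a tiny subset of the grammar, written only to illustrate that rule. It is not how WebPath or the W3C document handles the full XPath 2.0 language, and the token names are my own.

```python
import re

# Two-character symbols first, then names, numbers, and single symbols.
TOKEN = re.compile(
    r"\s*(//|::|!=|<=|>=|[A-Za-z_][\w.-]*|\d+(?:\.\d+)?|[*+\-/|=<>()\[\]@,])"
)
SYMBOL_OPERATORS = {"/", "//", "|", "+", "-", "=", "!=", "<", "<=", ">", ">="}
OPERATOR_NAMES = {"and", "or", "mod", "div"}
# After these tokens an operand must follow, so a name or * cannot be an operator.
OPERAND_FOLLOWS = {"@", "::", "(", "[", ","}


def tokenize(expr):
    """Split expr into (kind, text) pairs, resolving * and operator names by context."""
    tokens = []
    pos = 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        text = match.group(1)
        pos = match.end()
        operand_expected = (
            not tokens
            or tokens[-1][0] == "operator"
            or tokens[-1][1] in OPERAND_FOLLOWS
        )
        if text == "*":
            kind = "wildcard" if operand_expected else "operator"
        elif text in SYMBOL_OPERATORS:
            kind = "operator"
        elif text[0].isalpha() or text[0] == "_":
            is_operator = text in OPERATOR_NAMES and not operand_expected
            kind = "operator" if is_operator else "name"
        elif text[0].isdigit():
            kind = "number"
        else:
            kind = "punctuation"
        tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    for sample in ("div div div", "* * *", "//*", "@div"):
        print(sample, tokenize(sample))
```

The same word comes out as a name or an operator depending only on what came before it, which is why a context-free lexer is not enough for this language.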
The document tries to standardize on some terminology for various approaches toward dealing with XPath. The remaining bulk of the document sketches out some lexical states that would be useful for one particular implementation approach. I guess the vibrant, thriving throngs of XPath 2.0 developers didn’t see the need for this kind of assistance.
In short, I didn’t find it terribly useful. Maybe some readers have, though. Feel free to comment below. Subsequent articles here will describe how I approached the problem. Stay sharp! -m
Monday, October 15th, 2007
Depending on who’s asking and who’s answering, W3C technologies take 5 to 10 years to get a strong foothold. Well, we’re now in the home stretch for the 5th anniversary of XForms Essentials, which was published in 2003. In past conferences, XForms coverage has been maybe a low-key tutorial, a few day sessions, and hallway conversation. I’m pleased to see it reach new heights this year.
XForms evening is on Monday December 3 at the XML 2007 conference, and runs from 7:30 until 9:00 plus however long ERH takes on his keynote. :) The scheduled talks are shorter and punchier, and feature a lot of familiar faces, and a few new ones (at least to me). I’m looking forward to it–see you there! -m
Monday, October 8th, 2007
As widely reported by now, the final schedule for XML 2007 this December in Boston is up. All I have to add is the suggestion of careful attention to the Tuesday program at 4:00. :) If you can’t wait, some technical details are forthcoming in this space. That is all. -m
Friday, October 5th, 2007
I’ll be doing some experimenting around here over maybe the next week or two. Specifically, setting up hAtom within these pages. Watch for falling debris and report any unusual observations. -m
Wednesday, October 3rd, 2007
Let’s see how many downstream pieces of software trip over this post…
Do greater-than and less-than signs need to be escaped in XML? Conventional wisdom has it that less-than signs always do, since that character starts a fresh “tag”, but greater-than signs are safe.
Wrong.
There is a particular sequence, namely ]]> , not allowed to occur unescaped in XML “for compatibility”–a particular phrase the spec uses to indicate rules that only an SGML-head could love (but still strict requirements nonetheless). Does your software prevent this condition from causing an error? -m
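If you want to check your own pipeline, here is a quick Python sketch. The `escape_text` and `is_well_formed` helpers are hypothetical names of mine, and the sample string is made up; the parser is the expat-based one behind the standard library’s ElementTree, which as far as I know rejects a bare ]]> in element content.

```python
import xml.etree.ElementTree as ET


def escape_text(text):
    """Escape character data for use as XML element content."""
    # Ampersand first, so the entities added below are not escaped again.
    text = text.replace("&", "&amp;").replace("<", "&lt;")
    # A lone ">" is allowed in content, but the sequence "]]>" is not.
    return text.replace("]]>", "]]&gt;")


def is_well_formed(document):
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    raw = "if (a[b[0]]>c) { return 1 > 0; }"
    print(is_well_formed("<code>" + raw + "</code>"))
    print(is_well_formed("<code>" + escape_text(raw) + "</code>"))
```

Escaping every greater-than sign (which is what `xml.sax.saxutils.escape` does by default) is the blunt but safe alternative to special-casing the three-character sequence.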
Monday, October 1st, 2007
It’s a common need to parse space-separated attribute values from XPath/XSLT 1.0, usually @class or @rel. One common (but incorrect) technique is a simple equality test, as in {@class="vcard"}. This is wrong, since the attribute can contain the value you want alongside other values, like “foo vcard” or “vcard foo” or “ foo vcard bar ”.
The proper way is to look at individual tokens in the attribute value. On first glance, this might require a call to EXSLT or some complex tokenization routine, but there’s a simpler way. I first discovered this on the microformats wiki, and only cleaned up the technique a tiny bit.
The solution involves three XPath 1.0 functions: contains() to test for a substring, concat() to join together string fragments, and normalize-space() to strip off leading and trailing spaces and convert any other sequences of whitespace into a single space.
In English, you normalize the whitespace in the attribute value, add a single space to each end, and then test whether the result contains the token you want with a space on either side.
Or {contains(concat(' ',normalize-space(@class),' '),' vcard ')} A moment’s thought shows that this works well on all the different examples shown above, and is perhaps even less involved than resorting to extension functions that return nodes that require further processing/looping. It would be interesting to compare performance as well…
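For anyone who wants to convince themselves without firing up an XSLT processor, here is the same logic mirrored in plain Python. The helper names `normalize_space` and `has_token` are mine, and this is only a sketch of the technique, not an XPath implementation.

```python
import re

XML_WHITESPACE = " \t\r\n"


def normalize_space(value):
    """Mimic XPath normalize-space(): trim, then collapse whitespace runs."""
    return re.sub(r"[ \t\r\n]+", " ", value.strip(XML_WHITESPACE))


def has_token(value, token):
    """Mimic contains(concat(' ', normalize-space(value), ' '), ' token ')."""
    return f" {token} " in f" {normalize_space(value)} "


if __name__ == "__main__":
    for value in ("vcard", "foo vcard", "vcard foo", " foo vcard bar ", "vcards"):
        # Compare the naive equality test with the padded-substring test.
        print(repr(value), value == "vcard", has_token(value, "vcard"))
```

Padding both sides with a space is what keeps a value like “vcards” from matching, which a bare contains(@class, 'vcard') would get wrong.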
So next time you need to match class or rel values, give it a shot. Let me know how it works for you, or if you have any further improvements. -m