Archive for the 'annoyance' Category

Monday, March 3rd, 2008

WebPath and Wikipedia

The WebPath bug reports continue to roll in. For one, queries against *.wikipedia.* don’t seem to work. You get something back, but it has no resemblance to the page you were looking for. The problem comes from the W3C tidy service that I use, specifically that the (understandably overworked and understaffed) admins at the Wikimedia Foundation seem to have blocked it. It seems like more than a simple IP or user-agent-based block. I’ve emailed them about it but haven’t heard back yet.

So, this highlights the limitation of having a single-source converter in the Platonic Web module of WebPath. So I turn to my readers: do you know of any other tidy servers? Or converters of a non-tidy origin? For any of these to work, they need to return clean XML corresponding to the original page (as opposed to, say, returning something with big headers/footers or ampersand-encoded). This seems like an outstanding need for the open source community.

Please comment below with ideas. Thanks! -m

UPDATE: heard back from the Wikipedia admins, and although professional and helpful-as-can-be-expected, they won’t be changing anything on their end. Still looking for more open source options.

Tuesday, January 1st, 2008

My new year resolution

Holding steady at 1280 x 854 but due for an upgrade soon.

Seriously, if you find yourself setting various goals just because something on the calendar changed, you probably don’t have the long-term motivation needed to see it through, which is why so many new years’ resolutions lie in broken heaps by mid February. If you think something is worth doing (like this for example), then forget the calendar and do it.

-m

Monday, December 31st, 2007

Should documents self-version?

This blog page at the W3C discusses the TAG finding that a data format specification SHOULD provide for version information, specifically reconsidering that suggestion. As a few data points, XML 1.1 (with explicit version identifiers) is something of a non-starter, while Atom (without explicit version identifiers) is doing OK so far–though a significant revision to the core hasn’t happened and perhaps never will.

In a chat with Dave Orchard at XML 2007, I suggested that the evolution of browser User-Agent strings might be a useful model, since it developed in response to the actual kinds of problems that versioning needs to solve.

Indeed, the idea seemed familiar in my mind. In fact, I posted it here, in Feb 2004. The remainder of this posting republishes it with minor edits for clarity:

‘Standard practice’ of x.y.z versioning, where x is major, y is minor, and z is sub-minor (often build number) is not best practice. If you look at how systems actually evolve over time, a more ‘organic’ approach is needed.

For example, look at how browser user agent strings have evolved. Take this, for example:

Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows 98) Opera 7.02 [en]

Wow, if detection code is looking for a substring of “Mozilla” or “Mozilla/4″ or “Mozilla/4.0″, or “MSIE” or “MSIE 6″ or “MSIE 6.0″ or “Opera” or “Opera 7″ or “Opera 7.0″ or “Opera 7.0.2″ it will hit. If you look at the kind of code to determine what version of Windows is running, or the exact make and model of processor, you will see a similar pattern.

Since this is the way of nature, don’t fight it with artificial, fixed-length major.minor versioning. Embrace organically growing versions.

The first version of anything should be “1.” including the dot. (letters will work in practice too) All sample code, etc. that checks versions must stop at the first dot character; anything beyond that is on a ‘needs-to-know’ basis. A check-this-version API would be extremely useful, though a basic string compare SHOULD work.

Then, whenever revisions come out, the designers need to decide if the revision is compatible or not. A completely incompatible release would then be “2.”. However, a compatible release would be “1.1.”. All version checking code would continue to look only up to the first dot, unless it has a specific reason to need more details. Then it can go up to the 2nd dot, no more.

Now, even code that is expecting version “1.1.” will work fine with “1.1.1.” or 1.1.86.” or “1.1.2.1.42.1.536.”.

Every new release needs to decide (and explicitly encode in the version string) how compatible it is with the entire tree of earlier versions.

Now, as long as compatible revisions keep coming out, the version string gets longer and longer. This is the key benefit, and why fixed-field version numbers are so inflexible. (and why you get silly things like Samba reporting itself as “Windows 4.9″).

One possible enhancement, purely to make version numbers look more like what folks are used to, is to allow a superfluous zero at the end. This the first version is 1.0, followed by 1.1.0, 1.1.1.0, (this next one made an incompatible change) 1.2.0, and so on.

So if a document needs to self-version at all, perhaps a scheme like this should be used? -m

Monday, December 31st, 2007

XPath puzzler: solution

Thanks to all the folks who showed interest in this little XPath puzzler published here a few weeks ago. Some asked to see the dataset, but I’m not able to release it at this time (but ask me again in 3 months).

Turns out it was a combination of two bugs, one mine, one somebody else’s. Careful observers noted that I wasn’t using any namespace prefixes in the XPath, and since I did specify that it was XPath 1.0, that technically rules out XHTML as the source language. Like nearly all XML I work with these days, the first thing I do is strip off the namespaces to make it easier to work with. Bug #1 was that in a few cases, the namespaces didn’t get stripped.

Bug #2 was in the XPath engine itself. Which one? Uh, whatever one ships with the “XPath” plugin for JEdit. It’s hard to tell directly, but I think it might be an older version of Xalan-J. In the case of the expression //meta, it properly located only those elements part of no namespace. But in the case of //meta/@property, it was including all the nodes that would have been selected by //*[local-name(.)='meta']/@property. Hence, a larger number of returned nodes.

Confusing? You bet!  -m

P.S. WebPath would not have this problem, since in the default mode it matches local-names only to begin with.

Saturday, December 15th, 2007

XML spell check

Surely somebody has implemented this in at least one tool.

In a text editor, I come across a misspelled close tag like </xsl:stylsheet>. My editor highlights the line as an error, which is is, not matching the start tag and all. Why can’t it go the extra step and give me the same kind of interface as I get for misspelled words, which an easy option to repair the spelling? This seems like a much simpler problem than all the hairy cases around human-language spell check…

So, what tools already do this today? -m

Thursday, November 22nd, 2007

Reducing my online profile

Due to some unauthorized activities on my webspace, I’m trimming my online profile, notably the Brain Attic sites. These were my home base for consulting, which I haven’t been doing for 2+ years. Less surface area exposed means less exposure to the bad guys. This site, and XForms Institute are staying up for now, as should be the email address you are currently using. There will be a few broken links that will take some time to eradicate.

If you notice anything amiss, any unseemly references to ‘viagra’ in my pages etc., email me at “mdubinko” in the reversed “com.yahoo” domain. -m

Update: whoops, looks like I cut a little too deep. Turns out that all my @dubinko.info mail was routing through one of the domains I chopped. For several hours overnight email sent to me was bouncing. If you ran into that, please re-sent. -m

Monday, November 19th, 2007

Kindle as a writing tool?

Amazon announced their ebook reader today, Kindle. Some of the earlier hype I’d read about it suggested that it would be not only a reading tool, but a writing tool as well.

Nope.

The obvious thing is the keyboard, an immediate non-starter for typing more than a few words. But if an external keyboard is possible, I could still live with it. For now, it seems like an OLPC would better serve my needs for an ultra-portable writing station. And for the same price, I get one and so does somebody else who need it. -m

Monday, November 5th, 2007

A better name for CURIEs (?)

“Compact Clark Notation“. (Inspired by reading this) -m

Monday, October 22nd, 2007

Is there fertile ground between RDFa and GRDDL?

The more I look at RDFa, the more I like it. But still it doesn’t help with the pain-point of namespaces, specifically of unmemorable URLs all over the place and qnames (or CURIEs) in content.

Does GRDDL offer a way out? Could, for instance, the namespace name for Dublin Core metadata be assigned to the prefix “dc:” in an external file, linked via transformation to the document in question? Then it would be simpler, from a producer or consumer viewpoint, to simply use names like “dc:title” with no problems or ambiguity.

This could be especially useful not that discussions are reopening around XML in HTML.

As usual, comments welcome. -m

Wednesday, October 3rd, 2007

XML Annoyance: do greater-than signs need to be escaped?

Let’s see how many downstream pieces of software trip over this post…

Do greater-than and less-than signs need to be escaped in XML? Conventional wisdom has it that less-than signs always do, since that character starts a fresh “tag”, but greater-than signs are safe.

Wrong.

There is a particular sequence, namely ]]> , not allowed to occur unescaped in XML “for compatibility“–a particular phrase the spec uses to indicate rules that only an SGML-head could love (but still strict requirements nonetheless). Does your software prevent this condition from causing an error? -m