Archive for December, 2008

Wednesday, December 31st, 2008

Geek Thoughts: new year’s resolution

1440×900, up from last year.

More collected Geek Thoughts at

Tuesday, December 30th, 2008

RDFa parser in XQuery now open source

After a delay, the code to my RDFa parser in XQuery is now available under an Apache license. Go get it. This is some of the earliest XQuery code I ever wrote, so go easy on me. It follows the earlier work on a functional definition of RDFa. And feel free to send in patches. -m

Monday, December 29th, 2008

Opera 9.6.3 includes a fantastic JavaScript debugger

Have you checked out Opera lately? You should. Their brilliant strategy is to include a JavaScript debugger so excellent that you’d be willing to test on that browser just to use the tool.

If you’ve been having the same kinds of troubles that I have with Firebug lately (not to demean the thousands who use that tool daily, but I draw the line when the debugger is the source of bugs) check it out. -m

Wednesday, December 24th, 2008

Semi-spam on the rise

With tough times comes a rise in semi-spam. What’s that? There’s a grey area between solicited and unsolicited email. Take a company you’ve done business with once in the past. These guys are dredging up their old databases and really searching for business. Since these are companies I actually like, I don’t have the heart to click the ‘Spam’ button on their emails… -m

Tuesday, December 23rd, 2008

Geek Thoughts: code zen

It is not the size of the codebase which makes it hard to grok, but the poorness of the design.


Monday, December 22nd, 2008

XForms for HTML

I’ve heard not a peep about this before, but here it is: XForms for HTML. Let’s read this together. Feel free to drop any comments or observations below. -m

Friday, December 19th, 2008

XSLTForms looks promising

Implementing client-side forms libraries is, and has been, all the rage. I’ve seen Mozquito Factory do amazing things in Netscape 4, Technical Pursuits TIBET on the perpetual verge of release, UGO, and others. In a more recent time scale, Ubiquity XForms impresses me and many others, and it has the right combination of funding and willing developers.

From a comment on my recent posting about Ubiquity XForms, I was pleased to learn about XSLTForms, a rebirth of AjaxForms, which I thought well of two years ago until its developer mysteriously left the project. But Software Libre lives on, and a new developer has taken over, this time using client-side XSLT instead of server-side Java to do the first pass of processing. Given the strong foundation, the project has come a long way in a short time, and already runs against a wide array of non-trivial examples. Check it out.

I’d like to hear what others think about this project. -m

Wednesday, December 17th, 2008

Geek Thoughts: electricity neutrality

This month’s electric bill is brought to you by GE: “We bring good things to light.” ™ (c)

Usage by GE appliances: $0.18 per kWh totaling $23.48

Usage by Kenmore appliances (see note 1): $0.53 per kWh totaling $50.23

Usage by unlisted appliances: $0.26 per kWh totaling $39.32

Total Due: $113.03

Note 1: PG&E is currently involved in litigation with Sears Kenmore, Inc. Since utility access fees are not being paid during the course of the pending legal action, full uncompensated rates apply.


Friday, December 12th, 2008

Geek Thoughts: beer geekery 101

This article calls itself “Beer 101” but it leaves me pretty flat. I grew up in a place that had “both kinds of beer”, Bud and Bud Light, and thus thought I hated beer. But there’s way more out there. Ask for some of these:

Lagers: Oktoberfest, Bock, Doppelbock, Eisbock.

Hybrids (having characteristics of both ales and lagers): California Common (aka Anchor Steam),  Altbier.

Ales: Hefeweizen, Dunkelweizen, Scottish ale, Scotch ale, Lambic, Dubbel, Tripel, Barleywine.


Thursday, December 11th, 2008

XML 2008 liveblog: Introduction to eXist and XQuery

Greg Watson, IT Specialist, Defense Intelligence Agency Missile and Space Intelligence Center (apparently it IS rocket science). I installed eXist last night to follow along with the talk.

“If you have a larger dataset, eXist may not be the best choice.” Recommended reading: XQuery by Priscilla Walmsley, XQuery wikibook.

Download and install. Needs a full JDK (Mac includes this already in /Library/Java/Home), a mere JRE is insufficient. Start up with bin/

eXist-specific useful functions: request:get-parameter() from the URI query string. transform:transform() function invokes XSLT from within XQuery.

Example uses doc() to fetch an external URL of RSS, check individual items with contains(). Every example is a fully-formed, click-on-a-link-to-run program.
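
The talk's examples were XQuery, but the fetch-and-filter idea carries over to other languages. Here is a minimal Python sketch, assuming an inline feed instead of an external URL; the feed content and the `matching_titles` helper are made up for illustration and are not from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet standing in for the external feed the talk fetched with doc().
RSS = """<rss version="2.0"><channel>
  <item><title>XQuery tips</title></item>
  <item><title>Weather report</title></item>
  <item><title>More XQuery</title></item>
</channel></rss>"""

def matching_titles(rss_text, needle):
    """Return the titles of items whose title contains needle (a contains()-style check)."""
    root = ET.fromstring(rss_text)
    titles = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        if needle in title:
            titles.append(title)
    return titles

print(matching_titles(RSS, "XQuery"))
```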

XQuery and PHP: really basic integration with simplexml_load_file($myXQueryURL).

Loading scripts into eXist: He uses XML Spy. eXist has a Java Web Start admin client.

Q&A: How big is too big? Maybe 10-20-30 thousand docs. Generate indexes? Yes.


(Production note: somehow, this one didn’t get published live. Now it is.)

Tuesday, December 9th, 2008

XML 2008 liveblog: Introduction to Schematron

Wendell Piez, Mulberry Technologies

Assertion-based schema language. A way to test XML documents. Rule-based validation language. Cool report generator. Good for capturing edge cases.

Same architecture as XSLT. (Schematron specifies, does not perform)

<schema xmlns="">
  <title>Check sections 12/07</title>
  <pattern id="section-check">
    <rule context="section">
      <assert test="title">This section has no title</assert>
      <report test="p">This section has paragraphs</report>
    </rule>
  </pattern>
</schema>

Demo. OxygenXML has support. Assert vs. Report – essentially opposites. Assert means “tell me if this is false”. Report means “tell me if this is true”.
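
To make the assert/report distinction concrete, here is a rough Python emulation of the section rule above. It is only a sketch of the semantics, not how a Schematron processor works; the `check_sections` helper and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET

def check_sections(xml_text):
    """Emulate the section rule: assert speaks up when its test is false, report when true."""
    messages = []
    for section in ET.fromstring(xml_text).iter("section"):
        if section.find("title") is None:      # assert test="title" came out false
            messages.append("This section has no title")
        if section.find("p") is not None:      # report test="p" came out true
            messages.append("This section has paragraphs")
    return messages

# Hypothetical input: the first section lacks a title but has a paragraph.
DOC = "<doc><section><p>text</p></section><section><title>T</title></section></doc>"
print(check_sections(DOC))
```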

“Almost as if Schematron is a harness for XPath testing.”

More examples:

<rule context="note">
  <report test="ancestor::note">A note appears in a note. OK?</report>
</rule>

Binding: Default is XSLT 1, but flexible enough to allow other query languages via attribute @queryBinding at the top. Many processors allow mix-and-match between XSLT and Schematron. Examples showing just that.

Some tests can be very useful:

test="every $line in tokenize(., $newline) satisfies string-length($line) le 72"
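
For readers who don't think in XPath 2.0, that test says "every line is at most 72 characters long". A rough Python equivalent, assuming plain "\n" line breaks (the `lines_within_limit` name is invented here):

```python
def lines_within_limit(text, limit=72):
    """True when every newline-separated line is at most `limit` characters long."""
    return all(len(line) <= limit for line in text.split("\n"))

print(lines_within_limit("short line\nanother short line"))
print(lines_within_limit("x" * 73))
```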

Q: What if the destination is not a human, but another part of a pipeline? Varies by implementation, but SVRL is standardized as an annex in the ISO spec, part of DSDL.

Use as little or as much as you want, at different times in the document lifecycle. “Schematron is a feather duster that reaches areas other schema languages cannot.” – Rick Jelliffe

As time permits section of the talk:

Other top-level elements: title, pattern, ns, let, p, include, phase, diagnostics.


Tuesday, December 9th, 2008

XML 2008 liveblog: Automating Content Analysis with Trang and Simple XSLT Scripts

Bob DuCharme, Innodata Isogen

Content analysis: why? You’ve “inherited” content. Need to save time or effort.

Handy tool 1: “sort”. As in the Unix command line tool. (Even Windows)

Handy tool 2: “uniq -c”  (flag -c means include counts)

Elsevier contest: interface for reading journals. Download a bunch of articles, and see what’s all in there.

Handy tool 3: Trang. Schema language converter. But can infer a schema from one or more input documents. Concat all sample documents under one root, and infer–this gives a list of all doctypes in use.

trang article.dtd article.rng
trang issueContents.xml issueContents.rng
saxon article.rng compareElsRNG.xsl | sort > compareElsRNG.out

compareElsRNG.xsl has text mode output, ignores input text nodes, and checks whether the RNG has references to each element, outputting “Yes: elementname” or “No: elementname”. (which gets sorted in step 3)

Helps ferret out places where the schema says 40 different child elements are possible but in practice only 4 are used.

Handy tool 4: James Clark’s sx, converts SGML to XML.

Another stylesheet counts elements producing a histogram. [Ed. I would do this in XQuery in CQ.] Again, can help prioritize parts of the XML to use first. Similar logic for parent/child counts; where @id gets used; find all values for a particular attribute.
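
The talk did the counting with an XSLT stylesheet. As a stand-in, here is a minimal Python sketch of the same idea, in the spirit of "sort | uniq -c"; the sample document and the `element_histogram` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def element_histogram(xml_text):
    """Count how many times each element name occurs in the document."""
    return Counter(el.tag for el in ET.fromstring(xml_text).iter())

# Hypothetical sample content; real input would be the inherited documents.
SAMPLE = "<article><title>A</title><para>x</para><para>y</para></article>"

# Print counts first, most frequent element first, like uniq -c followed by a sort.
for tag, count in element_histogram(SAMPLE).most_common():
    print(count, tag)
```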

Another stylesheet goes through multiple converted-to-rng schemas, looking for common substructure. Lists generated this way can be pulled into a stylesheet.

Analyze an SGML DTD? dtd2html -> tidy -> XSLT. Clients like reports (especially spreadsheets). This is more like Lego bricks.


Tuesday, December 9th, 2008

XML 2008 liveblog: Using RDFa for Government Information

Mark Birbeck, Web Backplane.

Problem statement: You shouldn’t have to “scrape” government sites.

Solution: RDFa

<div typeof="arg:Vacancy">
  Job title: <span property="dc:title">Assistant Officer</span>
  Description: <span property="dc:description">To analyse... </span>
</div>

This resolves to two full RDF triples. No separate feeds, uses existing publishing systems. Two of the most ambitious RDFa projects are taking place in the UK. Flexible arrangements possible.
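
To see roughly how the property attributes turn into triples, here is a deliberately naive Python sketch. It is not a conforming RDFa parser: it ignores typeof, prefix resolution, and nesting, and the "_:job" subject label and the `property_triples` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The markup from the talk, closed off so it parses as well-formed XML.
SNIPPET = """<div typeof="arg:Vacancy">
  Job title: <span property="dc:title">Assistant Officer</span>
  Description: <span property="dc:description">To analyse... </span>
</div>"""

def property_triples(xhtml, subject="_:job"):
    """Collect (subject, predicate, literal) for every element carrying @property."""
    root = ET.fromstring(xhtml)
    return [
        (subject, el.get("property"), (el.text or "").strip())  # whitespace trimmed for display
        for el in root.iter()
        if el.get("property")
    ]

for triple in property_triples(SNIPPET):
    print(triple)
```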

Steps: 1. Create vocabulary. 2. Create demo. 3. Evangelize.

Vocabulary under Google Code: Argot Hub. Reuse terms (dc:title, foaf:name) where possible, developed in public.

Demos: Yahoo! SearchMonkey, (good for helping not-so-technical people to “get it”) then a Drupal hosted one (a little more control).

Next level, a new server that aggregates specific info (like all job openings for Electricians), including geocoding. Ubiquity RDFa helps here.

Evangelizing: Detailed tutorials. Drupal code will go open source. More opportunities with companies currently screen-scraping. More info @

Q&A: Asking about predicate overloading (dc:title). A general SemWeb issue. Context helps. Is RDFa tied to HTML? No, SearchMonkey itself uses RDFa–it’s just attributes.


Tuesday, December 9th, 2008

XML 2008 liveblog: Sentiment Analysis in Open Source Information for the US Government

Ronald Reck, SAP; Kenneth Sall, SAIC

“I wish I knew when people were saying bad things about me.” Sentiment analysis. Kapow used initially. From 800k news articles (from 1996 and 1997), extracted 450M RDF assertions. The 13 Reuters standard metadata elements not used in this case. Used Redland for heavy RDF lifting. Inxight ThingFinder (commercial) for entity extraction, supplemented with enumerated lists (Bush Cabinet, Intelligence Agencies, negative adjectives, positive admire verbs, etc.) End result was RDF/XML.

(Kenneth takes the mic) SPARQL Sentiment Query Web UI. Heavy SPARQL ahead… Redland hasn’t implemented the UNION operator yet, making the examples more convoluted.

PREFIX sap: <>
SELECT ?ent ?type ?name
WHERE {
  ?ent sap:Method "Name Catalog" .
  ?ent sap:Type ?type .
  ?ent sap:Name ?name
}

Difficult learning curve. Need ability to do substring from entity URI -> article URI.

Next steps: current news stories. Leverage existing metadata. RDF at the sentence level. Improve name catalogs. Use rule-based pattern matching engine. Slides.


Tuesday, December 9th, 2008

XML 2008 liveblog: Exploring the New Features of XSLT 2.0

Priscilla Walmsley, Datypic.

“I feel like crying every time I have to go back to 1.0.” Normally this is a full-day course. Familiarity with XSLT 1.0 assumed here. Venn diagram… Much of what people think of as “XQuery” is actually XPath 2.0.

XPath differences: root node -> “document node”. Namespace nodes and the namespace axis are deprecated. More atomic types, based on XML Schema. Node-set -> sequence. Path steps can be expressions, like product/(if (desc) then desc else name). Last step can return an atomic value, like sum(//item/(@price * @qty)).

Comparison operators apply to strings, dates, times. (Backwards compatibility note: comparing strings now is done by Unicode code point, not by conversion to number() as in XPath 1.0). Arithmetic possible on dates, durations. Missing value returns empty sequence rather than NaN.

(a,b) to concat sequences. New operators: idiv, union, intersect, except (latter 3 for nodes only)

<xsl:for-each select="1 to $count"> is handy. Operators << and >> test ‘precedes’ and ‘follows’ based on document order. Operator ‘is’ tests node identity.

Statement if/then/else is a more compact xsl:choose. Simplified FLWOR (only one for, no let or where).

Useful functions: ends-with(), string-join(), current-date(), distinct-values(), deep-equal().

From XPath to XSLT: <xsl:for-each-group> with current-group() and current-grouping-key(). Useful for turning a flat document (like HTML with h1, h2, etc.) into nested structure, using group-starting-with="html:h1", etc. The instruction <xsl:function> allows defining a new function. Major benefits in reuse, clarity, and handling recursion. Custom functions can be called from more places, like @select, @group-by, @match, but have the same expressive power as a named template.
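
The group-starting-with idea is easy to picture outside XSLT. Below is a minimal Python sketch of the grouping logic only, assuming a flat list of strings in place of HTML nodes; the `group_starting_with` function and the "h1:"/"p:" labels are invented for illustration.

```python
def group_starting_with(items, starts_group):
    """Split a flat sequence into groups, opening a new group at each matching item."""
    groups = []
    for item in items:
        # Items before the first match still need a group to land in.
        if starts_group(item) or not groups:
            groups.append([])
        groups[-1].append(item)
    return groups

# Hypothetical flat document: headings followed by their paragraphs.
FLAT = ["h1:Intro", "p:one", "p:two", "h1:Details", "p:three"]
for group in group_starting_with(FLAT, lambda s: s.startswith("h1:")):
    print(group)
```

Each resulting group plays the role of current-group() in the XSLT version, with the heading as its first member.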

Regular expressions: some XPath functions matches(), tokenize(), replace() (including subexpressions). <xsl:analyze-string> splits a string into matching and non-matching parts, handled separately in <xsl:matching-substring> and <xsl:non-matching-substring> child elements and regex-group().
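
The matching/non-matching split that <xsl:analyze-string> performs can be sketched in Python as follows. This is only an analogy: Python's regex dialect differs from the XPath 2.0 one, and the `analyze_string` helper and the sample text are made up for illustration.

```python
import re

def analyze_string(text, pattern):
    """Split text into ("matching", s) and ("non-matching", s) parts, in order."""
    parts = []
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            parts.append(("non-matching", text[pos:m.start()]))
        parts.append(("matching", m.group(0)))
        pos = m.end()
    if pos < len(text):
        parts.append(("non-matching", text[pos:]))
    return parts

# Hypothetical input: pick ISO-style dates out of running text.
for kind, piece in analyze_string("Due 2008-12-09 or 2008-12-11.", r"\d{4}-\d{2}-\d{2}"):
    print(kind, repr(piece))
```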

I/O: Instruction <xsl:result-document> allows multiple output files. unparsed-text() allows input of non-XML documents (particularly in conjunction with regex).

Do I have to pay attention to types? “Usually, no.” BUT schemas can help catch errors, improve performance, and open new avenues of processing (like matching a template based on a schema-type).

Odds and ends: tunneling parameters (don’t have to repeat all the params for named templates), multiple modes, @select in more places, @separator attribute on xsl:attribute and xsl:value-of.

Brief Q&A: No test suite available. Probably better for new users to jump straight into 2.0. But going back to 1.0 is still painful. -m

Monday, December 8th, 2008

Overheard and overseen

Overheard at XML 2008: “Wow, it’s a good thing Mark Logic sponsored, otherwise nobody would be here.” (there were only five tables in the expo area.)

Overseen on the XML 2008 schedule: only one mention of XQuery, and that’s in relation to eXist, not the aforementioned sponsor.

This conference does have a different feel to it. Is XML at the ASCII-tipping-point, where it becomes so obvious that conferences aren’t needed? -m

Monday, December 8th, 2008

XML 2008 non-liveblog: Content Authoring Schemas

I was on the panel with Bob DuCharme, Frank Miller, and Evan Lenz discussing content authoring, from DITA to DocBook with some WordML sprinkled in for good measure. It was a good discussion, nothing earth-shaking. This session was laptopless, so I don’t have any significant notes. -m

Monday, December 8th, 2008

XML 2008 liveblog: Accelerated DITA Publishing

Roy Amodeo, Stilo.

Only 4 people in attendance when the talk starts. Quick overview of DITA. Transclusion (conref), topic-level maps, specialization, metadata-based filtering. XML and SGML flavors available. Open Toolkit has been a big part of DITA’s success. Replaceable components (XSLT and FO). Many editing environments and CMS’s include this.

Topic-based publishing. Works best with many small, fairly independent topics. How well does the Open Toolkit work when pushing the boundaries? DITA stress test. Raising file size increases processing time faster than linear. Average file size 300k crashed. For overall number of files, roughly linear progression, but still blows up at large volumes.

Enter the OmniMark DITA Accelerator. Behavior modeled after toolkit, but minus the limits (streaming). Uses referents (placeholders left in place, filled in later; 2-pass algorithm). Base speed improvement 4X. Works well past where the Toolkit runs out of memory. Because DITA is standardized, the accelerated implementation can be easily plugged in.
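
The referent idea (leave a placeholder now, fill it in once the value is known) can be illustrated with a small Python sketch. This is not OmniMark code and not necessarily how the Accelerator works internally; the event tuples and the `render` function are assumptions invented for illustration.

```python
def render(events):
    """Pass 1 streams events, leaving placeholders; pass 2 fills the placeholders in."""
    output = []   # finished text mixed with ("referent", key) placeholders
    values = {}   # referent values, which may be defined after they are referenced
    for event in events:
        kind = event[0]
        if kind == "text":
            output.append(event[1])
        elif kind == "ref":
            output.append(("referent", event[1]))
        elif kind == "define":
            values[event[1]] = event[2]
    # Pass 2: every value is known by now, so resolve the placeholders.
    return "".join(
        values[part[1]] if isinstance(part, tuple) else part
        for part in output
    )

# Hypothetical stream: the reference to topic t2 arrives before its title does.
EVENTS = [("text", "See "), ("ref", "t2"), ("text", ". "), ("define", "t2", "Topic Two")]
print(render(EVENTS))
```

The point of the design is that the first pass never has to look ahead or hold whole documents in memory just to resolve a forward reference.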

Usability: XSLT exists somewhat uneasily with DITA. DITA Accelerator augments OmniMark with DITA-specific rules.

Conclusion: Standards are about choice of tools. (But how many OmniMark implementations are there?) Still, this makes me think I should check out the OmniMark language. I remain skeptical on DITA.


Monday, December 8th, 2008

XML 2008 liveblog: Content Modeling with XSD Schema

Delivered by Pradeep Jain, Ictect Inc. He has a handout available: “Intelligent Content Plug-In for Microsoft Word”, though it’s not obvious from the program that Word is involved.

What is content modeling? “Getting inside of” content, semantics, from there syntax and XML tagging.

Challenges: art vs. science, tacit vs. written documentation, future-proofing, technical vs. business communication, flexibility vs. stability. Getting knowledge workers to participate. Correctness (an emphasis of Ictect).

What is correctness of a model? More than valid XML. Litmus test: SME says “yep, I think you got it!”. But some machine-generated tests are possible.

Shows a Word doc with different kinds of bibliographic references (articles vs. books). Shows Schema code not visible from the back of the room. Word plug-in displays sidebar with a “convert” function, with several possible Schemas available to work against. Automatically detected sections in the document and added <section> elements. Progressively more complex examples of generated markup.

It seems like this is actually a pretty clever application, though it is hard to tell from this talk. -m

Monday, December 8th, 2008

XML 2008 liveblog: Ubiquity XForms

I will talk about one or more sessions from XML 2008 here.

Mark Birbeck of Web Backplane talking about Ubiquity XForms.

Browsers are slow to adopt new standards. Ajax libraries have attempted to work around this. Lots of experimentation which is both good and bad, but at least has legitimized extensions to browsers. JavaScript is the assembly language of the web.

Ubiquity XForms is part of a library, which will also include RDFa and SMIL. Initially based on YUI, but in theory should be adaptable to other libraries like jQuery.

Declarative: tools for creation and validation. Easier to read. Ajax libraries are approaching the level of being their own language anyway, so might as well take advantage of a standard.

Example: setting the “inner value” of a span: <span value="now()"></span>.

Script can do this easily: onclick="this.innerHTML = Date().toLocaleString();" But crosses the line from semantics to specific behavior. The previous one is exactly how xforms:output works.

Another example: tooltips. Breaks down to onmouseover, onmouseout event handlers, show and hide. A jQuery-like approach can search the document for all tooltip elements and add the needed handlers, avoiding explicit behavioral code. This is the essence of Ubiquity XForms (and in fact XForms itself).

Patterns like these compose under XForms. A button (xf:trigger) or any form control can easily have a tooltip (xf:hint). These are all regular elements, stylable with CSS, accessible via DOM, and so forth. Specific events (like xforms-hint) fire for specific events, and a spreadsheet-like engine can update interdependencies.

Question: Is this client-side? A: Yes, all running within Firefox. The entire presentation is one XForms document.

Demo: a range control with class="geolocation" that displays as a map w/ Google Maps integration. The Ubiquity XForms library contains many such extensibility points.

Summary: Why? Simple, declarative. Not a programming language. Speeds up development. Validatable. Link:

Q&A: Rich text? Not yet, but not hard (especially with YUI). Formally XForms compliant? Very nearly 1.1 conforming.


Saturday, December 6th, 2008

Recruiting at XML 2008

I’m off to XML 2008 in Arlington, VA. One thing I’ll be seeking is a top-tier QA candidate for XML technologies. If you are that person, look me up. :-)


Friday, December 5th, 2008

Mystery Python Theatre 3K

The long-awaited Python 3.0 is out. It fixes almost every annoyance I have with the language, particularly around Unicode handling, which is important in the kinds of projects I work on.

Now, to revisit some of my Open Source projects… -m

Wednesday, December 3rd, 2008

Geek Thoughts: pi gone wild

Pi, an irrational number, cannot be expressed exactly as a fraction of integers (and all real-world length units are ultimately based on integers). So either pi is not a circle’s ratio of circumference to diameter, or circles don’t exist (or both!)


Monday, December 1st, 2008

Where have all the acorns gone?

First the bee colonies start to disappear. Next, acorns. Does anyone have a map of the acorn-devoid areas? -m