Archive for the 'xml' Category

Thursday, August 5th, 2010

Balisageurs: XML and JSON

At David Lee’s nocturne about XML and JSON round-tripping, several folks were talking about a site that listed several “off-the-shelf” conversion methods, but nobody could remember the site.

Late that night, with 15 minutes of battery remaining, I found it. The operative search term is XSLTJSON. -m

Tuesday, August 3rd, 2010

Heard, overheard, and misheard at Balisage

The opening day of the conference was not Balisage proper, but a separate symposium on “XML for the long haul”.

Some interesting tidbits overheard, in no particular order…

“it is not necessarily clear that this approach would capture the difference between the ridiculous and the merely implausible.”

Complexity — what is the relationship between complexity and long-term data storage?

“Narratives with fancy words in them”

How do you store, say, a video in a format that will be readable in 100 years?

Order of magnitude scale changes produce discontinuities

“The Da Vinci Schema”

Dandelion DNA (Free license)

“Indispensible” — “I don’t think that means what you think it does”

“Keeping electrons alive is really difficult”

“I wondered…with my Topic Map brain damage…”

-m

Sunday, May 30th, 2010

Balisage contest: solving the wikiml problem

I wish I could say I had something to do with the planning of this: part of Balisage 2010 is a contest to “encourage markup experts to review and to research the current state of wiki markup languages and to generate a proposal that serves to de-babelize the current state of affairs for the long haul.”  To enter, you must propose a set of concrete steps (organizational, social, and/or technological) that will enable wiki content interchange, a real WYSIWYG editor, and/or wiki syntax standardization.

This pushes all of my buttons. It’s got structured documents, Web, parser geekery, writing, engineering, and standards. There’s a bunch of open source prior art, including PyXMLWiki, which I adapted from some fantastic earlier work from Rick Jelliffe.

Sadly, MarkLogic employees aren’t eligible to enter. Get your write-up done by July 15 and sent to balisage-2010-contest at marklogic dot com. The winner will be announced at Balisage and will take home some serious prize winnings, and also will be strongly encouraged (but not required) to give a brief summary (~10 minutes) of their winning entry.

Can’t wait to see what comes out of this. -m

Tuesday, May 11th, 2010

XProc is ready

Brief note: The W3C XProc specification, edited by my partner-in-crime Norm Walsh, has advanced to Recommendation status. Now go use it. -m

Sunday, April 18th, 2010

The challenge of an XProc GUI

I’ve been thinking lately about what a sleek UI for creating XProc would look like. There’s plenty of big-picture inspiration to go around, from Yahoo Pipes to Mac OSX Automator, but neither of these is as XML-focused as something working with XProc would be.

XML, or to be really specific, XML Namespaces, comes with its own set of challenges. Making an interface that’s usable is no small task, particularly when your target audience includes the 99.9% of people that don’t completely understand namespaces. Take for example a simple step, like p:delete.

In brief, that step takes an XSLTMatchPattern, following the same rules as @match in XSLT, which ends up selecting various nodes from the document, then returns a document without any of those nodes. An XSLTMatchPattern has a few limitations, but it is a very general-purpose selection mechanism. In particular, it could reference an arbitrary number of XML Namespace prefix mappings. Behind a short string like a:b lies a much longer namespace URI mapping to each prefix.

What would an intuitive user interface look like to allow entry of these kinds of expressions? How can a user keep track of unbound prefixes and attach them properly? A data-driven approach could help, say offering a menu of existing element, attribute, or namespace names taken from a pool of existing content. But by itself this falls short in 1) richer selectors, like xhtml:p[@class = "invalid"] and 2) doesn’t help in the general case, when the nodes you’re manipulating might have come from the pipeline, not your original content. (Imagine one step in the pipeline translates your XML to XHTML followed by a delete step that cleans out some unwanted nodes).
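To make the prefix problem concrete, here is a minimal sketch in Python of a delete-by-selector step. It is not XProc: it uses the limited path subset in the standard library's ElementTree as a stand-in for a real XSLTMatchPattern, and the sample document and the delete_matching helper are made up for illustration. The point is that the selector is meaningless until someone supplies the prefix-to-URI mapping alongside it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, standing in for whatever flows down the pipeline.
DOC = """<body xmlns="http://www.w3.org/1999/xhtml">
  <p class="invalid">remove me</p>
  <p class="ok">keep me</p>
  <div><p class="invalid">remove me too</p></div>
</body>"""

# The short prefix in the selector stands in for this much longer URI.
# A GUI would have to collect this mapping from the user somehow.
NAMESPACES = {"xhtml": "http://www.w3.org/1999/xhtml"}


def delete_matching(root, path, namespaces):
    """Remove every element selected by `path`; return how many were removed."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    matches = root.findall(path, namespaces)
    for node in matches:
        parent_of[node].remove(node)
    return len(matches)


root = ET.fromstring(DOC)
removed = delete_matching(root, ".//xhtml:p[@class='invalid']", NAMESPACES)
remaining = [p.get("class") for p in root.findall(".//xhtml:p", NAMESPACES)]
print(removed, remaining)
```

Drop the NAMESPACES argument and the same selector string no longer resolves, which is roughly the situation a user with an unbound prefix would be in.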

So yeah, this seems like a Really Hard Problem, but one that’s worth taking a crack at. If this sounds like the kind of thing you’d enjoy working on, my team is hiring–drop me a note.

-m

Friday, April 2nd, 2010

Recalibrating expectations of XML performance

Working at MarkLogic has forced me to recalibrate my expectations around XML-related performance issues. Not to brag or anything, but it’s screaming fast. Conventional wisdom of avoiding // in paths doesn’t apply, since that’s the sort of thing the indexes are made to do, and that’s just the start. Single milliseconds are now a noteworthy amount of time for something showing up in the profiler.

This is what XML was supposed to be like. Now that XML has fallen off the hype cycle, we’re getting some serious work done. -m

Friday, March 5th, 2010

A Hyperlink Offering revisited

The xml-dev mailing list has been discussing XLink 1.1, which after a long quiet period popped up as a “Proposed Recommendation”, which means that a largely procedural vote is all that stands between the document becoming a full W3C Recommendation. (The previous two revisions of the document date to 2008 and 2006, respectively.)

In 2005 I called continued development of XLink a “reanimated spectre”. But even earlier, in 2002, I wrote one of the rare fiction pieces on xml.com, A Hyperlink Offering, which, using the format of a Carrollian dialog between Tortoise and Achilles, explained a few of the problems with the XLink specification. It ended with this:

What if the W3C pushed for Working Groups to use a future XLink, just not XLink 1.0?

Indeed, this version has minor improvements. In particular, “simple” links are simpler now–you can drop an xlink:href attribute where you please and it’s now legit. The spec used to REQUIRE additional xlink:type="simple" attributes all over the place. But it’s still awkward to use for multi-ended links, and now even farther away from the mainstream hyperlinking aspects of HTML5, which for all of its faults, embodies the grossly predominant description of linking on the web.

So in many ways, my longstanding disappointment with XLink is that it only ever became a tiny sliver of what it could have been. Dashed visions of Xanadu dance through my head. -m

Monday, February 22nd, 2010

Mark Logic User Conference 2010

Are you coming? Link. It starts on May 4 (Star Wars day!) at the InterContinental Hotel in San Francisco. Guest speakers include Chris Anderson, Editor-in-Chief of Wired and Michelle Manafy, Editor-in-Chief of EContent magazine.

Early bird registration ends Feb 28. -m

Tuesday, February 16th, 2010

There is no honor in namespaces

As heard from my friend and Mark Logic contractor Ryan Grimm. -m

Wednesday, October 7th, 2009

US Federal Register in XML

Fed Thread is a front end for the newly XMLified Federal Register. Why is this a big deal? It’s a daily publication of the goings-on of the US government. It’s a primary source for all kinds of things that normally only get rehashed through news organizations. And it is bulky–nobody can read through it on a regular basis. A yearly subscription (printed) would cost nearly $1000 and fill over 80,000 pages.

Having it in XML enables all kinds of searching, syndication, and annotation via flexible front ends like this one. Yay for transparency. -m

Tuesday, August 18th, 2009

Geek Thoughts: reading XProc code

All the input/output/port stuff in XProc seemed incomprehensible to me until I recognized something simple. Every time you see a <pipe> element, read it as “comes from”. For example

  <p:output port="result">
    <p:pipe step="validated" port="result"/>
  </p:output>

reads as ‘output to the “result” port comes from the port “result” on step “validated”‘ and

  <p:input port="source">
    <p:pipe step="included" port="result"/>
  </p:input>

reads as ‘input for the “source” port comes from the port “result” on step “included”‘. If you keep this in mind it all makes much more sense.
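The reading rule is mechanical enough to script. Below is a small Python sketch that parses the two fragments above and prints the “comes from” sentence for each p:pipe. The wrapper element and the read_pipes function are made up for illustration; only the XProc namespace URI and the p:input, p:output, and p:pipe names come from the spec.

```python
import xml.etree.ElementTree as ET

XPROC_NS = "http://www.w3.org/ns/xproc"

# Hypothetical wrapper element so both fragments parse as one document.
FRAGMENT = """<wrapper xmlns:p="http://www.w3.org/ns/xproc">
  <p:output port="result">
    <p:pipe step="validated" port="result"/>
  </p:output>
  <p:input port="source">
    <p:pipe step="included" port="result"/>
  </p:input>
</wrapper>"""


def read_pipes(xml_text):
    """Return one 'comes from' sentence per p:pipe binding."""
    root = ET.fromstring(xml_text)
    sentences = []
    for binding in root:
        kind = binding.tag.split("}")[1]  # "input" or "output"
        preposition = "for" if kind == "input" else "to"
        port = binding.get("port")
        for pipe in binding.findall(f"{{{XPROC_NS}}}pipe"):
            source_port = pipe.get("port")
            source_step = pipe.get("step")
            sentences.append(
                f'{kind} {preposition} the "{port}" port comes from '
                f'the port "{source_port}" on step "{source_step}"'
            )
    return sentences


readings = read_pipes(FRAGMENT)
for sentence in readings:
    print(sentence)
```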

More collected Geek Thoughts at http://geekthoughts.info.

Wednesday, August 5th, 2009

Misunderstanding Markup

Panel 9 of this comic describes XHTML 1.1 conformance as:

the added unrealistic demand that documents must be served with an XML mime-type

I can understand this viewpoint. XHTML 1.1 is a massively misunderstood spec, particularly around the modularization angle. But because of IE, it’s pretty rare to see the XHTML media-type in use on the open web. Later, panel 23 or thereabouts:

If you want, you can even serve your documents as application/xhtml+xml, instantly transforming them from HTML 5 to XHTML 5.

Why the shift in tone? What makes serving the XML media type more realistic in the HTML 5 case? IE? Nope, still doesn’t work. I’ve observed this same shift in perspective from multiple people involved in the HTML5 work, and it baffles me. In XHTML 1.1 it’s a ridiculous demand showing how out of touch the authors were with reality. In HTML5 the exact same requirement is a brilliant solution, wink, wink, nudge, nudge.

As it stands now, the (X)HTML5 situation demotes XHTML to the backwaters of the web. Which is pretty far from “Long Live XHTML…”, as the comic concludes. Remember when X stood for Extensible?

-m

Friday, July 24th, 2009

Java-style namespaces for markup

I’m noodling around with requirements and exploring existing work toward a solution for “decentralized extensibility” on xml-dev, particularly for HTML. The notion of “Java-style” syntax, with reverse DNS names and all, has come up many times in the context of these kinds of discussions, but AFAICT never been fully fleshed out. This is ongoing, slowly, in available time–which has been a post or two per week. (In case there is any doubt, this is a spare-time effort not connected with my employer.)

Check it out and add your knowledge to the thread. -m

Saturday, July 11th, 2009

The decline of the DBMS era

Several folks have been pointing to this article which has some choice quotes along the lines of

If we examine the nontrivial-sized DBMS markets, it turns out that current relational DBMSs can be beaten by approximately a factor of 50 in most any market I can think of.

My employer is specifically mentioned:

Even in XML, where the current major vendors have spent a great deal of energy extending their engines, it is claimed that specialized engines, such as Mark Logic or Tamino, run circles around the major vendors

And it’s true, but don’t take my word for it. :-) The DBMS world has lots of inertia, but don’t let that blind you to seeing another way to solve problems. Particularly if that extra 50x matters. -m

Tuesday, July 7th, 2009

Demo Jam at Balisage 2009

Come join me at the Demo Jam at Balisage this year. August 11 at 6:30 pm. There will be lots of cool demos, judged by audience participation. I’d love to see you there. -m

Wednesday, June 3rd, 2009

See you at Balisage

Balisage, formerly Extreme Markup, is the kind of conference I’ve always wanted to attend.

Historically my employers have been not quite enough involved in the deep kinds of topics at this conference (or too cash-strapped, but let’s not go there) to justify spending a week on the road. So I’m glad that’s no longer the case: Mark Logic is sponsoring the conference this year. I’m looking forward to the show, and since I’m not speaking, I might be able to relax a little and soak in some of the knowledge.

See you there! -m

Tuesday, March 24th, 2009

XIN: Implicit namespaces

An interesting proposal from Liam Quin, relating to the need for huge rafts of namespace declarations on mixed namespace documents.

In practice, though, almost all elements [in the given example] are going to be unambiguous if you take their ancestors into account, and attributes too.

Amen. I’ve been saying things like this for five years now. Look at any introductory text on XML, and the example used to show the need for namespaces will be embarrassingly contrived. That’s not a dig against authors, it’s a dig against over-engineered solutions to non-problems.

-m

Wednesday, February 25th, 2009

Brian May explains relativity

This is fantastic. Brian May (yes THAT Brian May) not only blogs, but talks about all kinds of challenging subjects. Like how and why space and time are linked. Worth a read. -m

Thursday, December 11th, 2008

XML 2008 liveblog: Introduction to eXist and XQuery

Greg Watson, IT Specialist, Defense Intelligence Agency Missile and Space Intelligence Center (apparently it IS rocket science). I installed eXist last night to follow along with the talk.

“If you have a larger dataset, eXist may not be the best choice.” Recommended reading: XQuery by Priscilla Walmsley, XQuery wikibook.

Download and install. Needs a full JDK (Mac includes this already in /Library/Java/Home), a mere JRE is insufficient. Start up with bin/startup.sh.

eXist-specific useful functions: request:get-parameter() from the URI query string. transform:transform() function invokes XSLT from within XQuery.

Example uses doc() to fetch an external URL of RSS, check individual items with contains(). Every example is a fully-formed, click-on-a-link-to-run program.

XQuery and PHP: really basic integration with simplexml_load_file($myXQueryURL).

Loading scripts into eXist: He uses XML Spy. eXist has a Java Web Start admin client.

Q&A: How big is too big? Maybe 10-20-30 thousand docs. Generate indexes? Yes.

-m

(Production note: somehow, this one didn’t get published live. Now it is.)

Tuesday, December 9th, 2008

XML 2008 liveblog: Introduction to Schematron

Wendell Piez, Mulberry Technologies

Assertion-based schema language. A way to test XML documents. Rule-based validation language. Cool report generator. Good for capturing edge cases.

Same architecture as XSLT. (Schematron specifies, does not perform)

<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <title>Check sections 12/07</title>
  <pattern id="section-check">
    <rule context="section">
      <assert test="title">This section has no title</assert>
      <report test="p">This section has paragraphs</report>
      ...

Demo. OxygenXML has support. Assert vs. Report – essentially opposites. Assert means “tell me if this is false”. Report means “tell me if this is true”.
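[Ed. The assert/report distinction is easy to mimic outside Schematron. The following Python sketch hand-codes the two checks from the section-check pattern above against a made-up document; it is not a Schematron processor, and the check_sections function and sample data are hypothetical.]

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one good section, one without a title, one without paragraphs.
DOC = """<doc>
  <section><title>Intro</title><p>Some text.</p></section>
  <section><p>Untitled text.</p></section>
  <section><title>Empty</title></section>
</doc>"""


def check_sections(root):
    """Apply the two checks to every <section>, Schematron-style."""
    messages = []
    for index, section in enumerate(root.iter("section"), start=1):
        # <assert test="title">: the message fires when the test is FALSE.
        if section.find("title") is None:
            messages.append((index, "assert", "This section has no title"))
        # <report test="p">: the message fires when the test is TRUE.
        if section.find("p") is not None:
            messages.append((index, "report", "This section has paragraphs"))
    return messages


messages = check_sections(ET.fromstring(DOC))
for index, kind, text in messages:
    print(f"section {index}: [{kind}] {text}")
```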

“Almost as if Schematron is a harness for XPath testing.”

More examples:

<rule context="note">
  <report test="ancestor::note">A note appears in a note. OK?</report>
</rule>

Binding: Default is XSLT 1, but flexible enough to allow other query languages via attribute @queryBinding at the top. Many processors allow mix-and-match between XSLT and Schematron. Examples showing just that.

Some tests can be very useful:

test="every $line in tokenize(., $newline) satisfies string-length($line) le 72"
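[Ed. For readers who don't speak XPath 2.0, the same test reads roughly as follows in Python. This is a loose equivalent under stated assumptions: the function name is made up, and the split is on a literal newline string rather than a regular expression as tokenize() would do.]

```python
def lines_within_limit(text, limit=72, newline="\n"):
    """True when every line of `text` is at most `limit` characters long."""
    # Mirrors: every $line in tokenize(., $newline)
    #          satisfies string-length($line) le 72
    return all(len(line) <= limit for line in text.split(newline))


print(lines_within_limit("short\nlines"))
print(lines_within_limit("x" * 73))
```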

Q: What if the destination is not a human, but another part of a pipeline? Varies by implementation, but SVRL is standardized as an annex in the ISO spec, part of DSDL.

Use as little or as much as you want, at different times in the document lifecycle. “Schematron is a feather duster that reaches areas other schema languages cannot.” – Rick Jelliffe

As time permits section of the talk:

Other top-level elements: title, pattern, ns, let, p, include, phase, diagnostics.

-m

Tuesday, December 9th, 2008

XML 2008 liveblog: Automating Content Analysis with Trang and Simple XSLT Scripts

Bob DuCharme, Innodata Isogen

Content analysis: why? You’ve “inherited” content. Need to save time or effort.

Handy tool 1: “sort”. As in the Unix command line tool. (Even Windows)

Handy tool 2: “uniq -c”  (flag -c means include counts)

Elsevier contest: interface for reading journals. Download a bunch of articles, and see what’s all in there.

Handy tool 3: Trang. Schema language converter. But can infer a schema from one or more input documents. Concat all sample documents under one root, and infer–this gives a list of all doctypes in use.

trang article.dtd article.rng
trang issueContents.xml issueContents.rng
saxon article.rng compareElsRNG.xsl | sort > compareElsRNG.out

compareElsRNG.xsl has text mode output, ignores input text nodes, and checks whether the RNG has references to each element, outputting “Yes: elementname” or “No: elementname”. (which gets sorted in step 3)

Helps ferret out places where the schema says 40 different child elements are possible but in practice only 4 are used.

Handy tool 4: James Clark’s sx, converts SGML to XML.

Another stylesheet counts elements producing a histogram. [Ed. I would do this in XQuery in CQ.] Again, can help prioritize parts of the XML to use first. Similar logic for parent/child counts; where @id gets used; find all values for a particular attribute.
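[Ed. The talk did this in XSLT. As a rough illustration of the same idea, here is a Python sketch that counts element names and prints them in a “uniq -c”-like layout. The sample document and the element_histogram function are made up.]

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one of the downloaded articles.
SAMPLE = (
    "<article><title>T</title>"
    "<sec><p>a</p><p>b</p></sec>"
    "<sec><p>c</p></sec></article>"
)


def element_histogram(xml_text):
    """Count how many times each element name occurs in the document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


counts = element_histogram(SAMPLE)
# Sorted by name, count first, in the spirit of `sort | uniq -c`.
for name, count in sorted(counts.items()):
    print(f"{count:7d} {name}")
```

Comparing the keys of such a histogram against the element names a schema declares is one way to spot declared-but-unused elements.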

Another stylesheet goes through multiple converted-to-rng schemas, looking for common substructure. Lists generated this way can be pulled into a stylesheet.

Analyze an SGML DTD? dtd2html -> tidy -> XSLT. Clients like reports (especially spreadsheets). This is more like Lego bricks.

-m

Tuesday, December 9th, 2008

XML 2008 liveblog: Using RDFa for Government Information

Mark Birbeck, Web Backplane.

Problem statement: You shouldn’t have to “scrape” government sites.

Solution: RDFa

<div typeof="arg:Vacancy">
  Job title: <span property="dc:title">Assistant Officer</span>
  Description: <span property="dc:description">To analyse... </span>
</div>

This resolves to two full RDF triples. No separate feeds, uses existing publishing systems. Two of the most ambitious RDFa projects are taking place in the UK. Flexible arrangements possible.
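[Ed. To show that the data really is sitting in the attributes, here is a deliberately simplified Python sketch that pulls the two property statements out of the snippet above. It is not a conforming RDFa processor: prefixes are left unresolved, typeof is ignored, and the blank-node label and the property_triples function are made up.]

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div typeof="arg:Vacancy">
  Job title: <span property="dc:title">Assistant Officer</span>
  Description: <span property="dc:description">To analyse... </span>
</div>"""


def property_triples(xml_text, subject="_:vacancy"):
    """Collect (subject, predicate, literal) for each @property element."""
    root = ET.fromstring(xml_text)
    triples = []
    for element in root.iter():
        predicate = element.get("property")
        if predicate is not None:
            literal = "".join(element.itertext()).strip()
            triples.append((subject, predicate, literal))
    return triples


triples = property_triples(SNIPPET)
for triple in triples:
    print(triple)
```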

Steps: 1. Create vocabulary. 2. Create demo. 3. Evangelize.

Vocabulary under Google Code: Argot Hub. Reuse terms (dc:title, foaf:name) where possible, developed in public.

Demos: Yahoo! SearchMonkey, (good for helping not-so-technical people to “get it”) then a Drupal hosted one (a little more control).

Next level, a new server that aggregates specific info (like all job openings for Electricians), including geocoding. Ubiquity RDFa helps here.

Evangelizing: Detailed tutorials. Drupal code will go open source. More opportunities with companies currently screen-scraping. More info @ rdfa.info.

Q&A: Asking about predicate overloading (dc:title). A general SemWeb issue. Context helps. Is RDFa tied to HTML? No, SearchMonkey itself uses RDFa–it’s just attributes.

-m

Tuesday, December 9th, 2008

XML 2008 liveblog: Exploring the New Features of XSLT 2.0

Priscilla Walmsley, Datypic.

“I feel like crying every time I have to go back to 1.0.” Normally this is a full-day course. Familiarity with XSLT 1.0 assumed here. Venn diagram… Much of what people think of as “XQuery” is actually XPath 2.0.

XPath differences: root node -> “document node”. Namespace nodes, axis are deprecated. More atomic types, based on XML Schema. Node-set -> sequence. Path steps can be expressions, like product/(if (desc) then desc else name). Last step can return an atomic value, like sum(//item/(@price * @qty)).
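[Ed. For comparison, the last expression, sum(//item/(@price * @qty)), spelled out in Python. The sample order and the order_total function are made up, and Decimal is my choice here; XPath would treat the untyped attributes as doubles.]

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical order with items at different depths, as // would find them.
ORDER = """<order>
  <item price="2.50" qty="4"/>
  <group><item price="10" qty="1"/></group>
</order>"""


def order_total(xml_text):
    """Sum price * qty over every <item>, wherever it appears."""
    root = ET.fromstring(xml_text)
    return sum(
        (Decimal(item.get("price")) * Decimal(item.get("qty"))
         for item in root.iter("item")),
        Decimal(0),
    )


total = order_total(ORDER)
print(total)
```

The XPath version is one line; the point of the new path-step syntax is that the per-item arithmetic happens inside the path rather than in a separate loop.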

Comparison operators apply to strings, dates, times. (Backwards compatibility note: comparing strings now is done by Unicode code point, not by conversion to number() as in XPath 1.0). Arithmetic possible on dates, durations. Missing value returns empty sequence rather than NaN.

(a,b) to concat sequences. New operators: idiv, union, intersect, except (latter 3 for nodes only)

<xsl:for-each select="1 to $count"> is handy. Operators << and >> test ‘precedes’ and ‘follows’ based on document order. Operator ‘is’ tests node identity.

Statement if/then/else is a more compact xsl:choose. Simplified FLWOR (only one for, no let or where).

Useful functions: ends-with(), string-join(), current-date(), distinct-values(), deep-equal().

From XPath to XSLT: <xsl:for-each-group> with current-group() and current-grouping-key(). Useful for turning a flat document (like HTML with h1, h2, etc.) into nested structure. group-starting-with="html:h1", etc. The instruction <xsl:function> allows defining a new function. Major benefits in reuse, clarity, and handling recursion. Custom functions can be called from more places, like @select, @group-by, @match, but have the same expressive power of a named template.
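[Ed. The group-starting-with idea, sketched in Python for anyone who hasn't seen it: walk a flat sequence and open a new group each time a “starter” item appears. The tuples and the group_starting_with function are made up for illustration; the real thing operates on nodes and patterns.]

```python
def group_starting_with(items, starts_group):
    """Split a flat sequence into groups, each beginning at a starter item."""
    groups = []
    for item in items:
        # Items before the first starter still form a group of their own.
        if starts_group(item) or not groups:
            groups.append([item])
        else:
            groups[-1].append(item)
    return groups


# Hypothetical flat document: (element name, text) pairs.
FLAT = [
    ("h1", "Intro"), ("p", "First"), ("p", "Second"),
    ("h1", "Details"), ("p", "Third"),
]

sections = group_starting_with(FLAT, lambda item: item[0] == "h1")
for section in sections:
    heading, *body = section
    print(heading[1], [text for _, text in body])
```

Nesting deeper (h2 inside h1) is the same function applied again to each group's body.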

Regular expressions: some XPath functions matches(), tokenize(), replace() (including subexpressions). <xsl:analyze-string> splits a string into matching and non-matching parts, handled separately in <xsl:matching-substring> and <xsl:non-matching-substring> child elements and regex-group().
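[Ed. A rough Python analogue of what <xsl:analyze-string> does, for orientation only. The analyze_string function and the sample text are made up; the captured groups play the part of regex-group().]

```python
import re


def analyze_string(text, pattern):
    """Split `text` into ('matching' | 'non-matching', substring, groups) parts."""
    parts = []
    position = 0
    for match in re.finditer(pattern, text):
        if match.start() > position:
            parts.append(("non-matching", text[position:match.start()], ()))
        # match.groups() holds the captured subexpressions.
        parts.append(("matching", match.group(0), match.groups()))
        position = match.end()
    if position < len(text):
        parts.append(("non-matching", text[position:], ()))
    return parts


parts = analyze_string(
    "Due 2008-12-09, shipped 2008-12-11.",
    r"(\d{4})-(\d{2})-(\d{2})",
)
for kind, substring, groups in parts:
    print(kind, repr(substring), groups)
```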

I/O: Instruction <xsl:result-document> allows multiple output files. unparsed-text() allows input of non-XML documents (particularly in conjunction with regex).

Do I have to pay attention to types? “Usually, no.” BUT schemas can help catch errors, improve performance, and open new avenues of processing (like matching a template based on a schema-type).

Odds and ends: tunneling parameters (don’t have to repeat all the params for named templates), multiple modes, @select in more places, @separator attribute on xsl:attribute and xsl:value-of.

Brief Q&A: No test suite available. Probably better for new users to jump straight into 2.0. But going back to 1.0 is still painful. -m

Monday, December 8th, 2008

Overheard and overseen

Overheard at XML 2008: “Wow, it’s a good thing Mark Logic sponsored, otherwise nobody would be here.” (there were only five tables in the expo area.)

Overseen on the XML 2008 schedule: only one mention of XQuery, and that’s in relation to eXist, not the aforementioned sponsor.

This conference does have a different feel to it. Is XML at the ASCII-tipping-point, where it becomes so obvious that conferences aren’t needed? -m

Monday, December 8th, 2008

XML 2008 non-liveblog: Content Authoring Schemas

I was on the panel with Bob DuCharme, Frank Miller, and Evan Lenz discussing content authoring, from DITA to DocBook with some WordML sprinkled in for good measure. It was a good discussion, nothing earth-shaking. This session was laptopless, so I don’t have any significant notes. -m

Monday, December 8th, 2008

XML 2008 liveblog: Accelerated DITA Publishing

Roy Amodeo, Stilo.

Only 4 people in attendance when the talk starts. Quick overview of DITA. Transclusion (conref), topic-level maps, specialization, metadata-based filtering. XML and SGML flavors available. Open Toolkit has been a big part of DITA’s success. Replaceable components (XSLT and FO). Many editing environments and CMS’s include this.

Topic-based publishing. Works best with many small, fairly independent topics. How well does the Open Toolkit work when pushing the boundaries? DITA stress test. Raising file size increases processing time faster than linear. Average file size 300k crashed. For overall number of files, roughly linear progression, but still blows up at large volumes.

Enter the OmniMark DITA Accelerator. Behavior modeled after toolkit, but minus the limits (streaming). Uses referents (placeholders left in place, filled in later; 2-pass algorithm). Base speed improvement 4X. Works well past where the Toolkit runs out of memory. Because DITA is standardized, the accelerated implementation can be easily plugged in.

Usability: XSLT exists somewhat uneasily with DITA. DITA Accelerator augments OmniMark with DITA-specific rules.

Conclusion: Standards are about choice of tools. (But how many OmniMark implementations are there?) Still, this makes me think I should check out the OmniMark language. I remain skeptical on DITA.

-m

Monday, December 8th, 2008

XML 2008 liveblog: Content Modeling with XSD Schema

Delivered by Pradeep Jain, Ictect Inc. He has a handout available: “Intelligent Content Plug-In for Microsoft Word”, though it’s not obvious from the program that Word is involved.

What is content modeling? “Getting inside of” content, semantics, from there syntax and XML tagging.

Challenges: art vs. science, tacit vs. written documentation, future-proofing, technical vs. business communication, flexibility vs. stability. Getting knowledge workers to participate. Correctness (an emphasis of Ictect).

What is correctness of a model? More than valid XML. Litmus test: SME says “yep, I think you got it!”. But some machine-generated tests are possible.

Shows a Word doc with different kinds of bibliographic references (articles vs. books). Shows Schema code not visible from the back of the room. Word plug-in displays sidebar with a “convert” function, with several possible Schemas available to work against. Automatically detected sections in the document and added <section> elements. Progressively more complex examples of generated markup.

It seems like this is actually a pretty clever application, though it is hard to tell from this talk. -m

Monday, December 8th, 2008

XML 2008 liveblog: Ubiquity XForms

I will talk about one or more sessions from XML 2008 here.

Mark Birbeck of Web Backplane talking about Ubiquity XForms.

Browsers are slow to adopt new standards. Ajax libraries have attempted to work around this. Lots of experimentation which is both good and bad, but at least has legitimized extensions to browsers. JavaScript is the assembly language of the web.

Ubiquity XForms is part of a library, which will also include RDFa and SMIL. Initially based on YUI, but in theory should be adaptable to other libraries like jQuery.

Declarative: tools for creation and validation. Easier to read. Ajax libraries are approaching the level of being their own language anyway, so might as well take advantage of a standard.

Example: setting the “inner value” of a span: <span value="now()"></span>.

Script can do this easily: onclick="this.innerHTML = Date().toLocaleString();" But crosses the line from semantics to specific behavior. The previous one is exactly how xforms:output works.

Another example: tooltips. Breaks down to onmouseover, onmouseout event handlers, show and hide. A jQuery-like approach can search the document for all tooltip elements and add the needed handlers, avoiding explicit behavioral code. This is the essence of Ubiquity XForms (and in fact XForms itself).

Patterns like these compose under XForms. A button (xf:trigger) or any form control can easily have a tooltip (xf:hint). These are all regular elements, stylable with CSS, accessible via DOM, and so forth. Specific events (like xforms-hint) fire for specific events, and a spreadsheet-like engine can update interdependencies.

Question: Is this client-side? A: Yes, all running within Firefox. The entire presentation is one XForms document.

Demo: a range control with class="geolocation" that displays as a map w/ Google Maps integration. The Ubiquity XForms library contains many such extensibility points.

Summary: Why? Simple, declarative. Not a programming language. Speeds up development. Validatable. Link: ubiquity.googlecode.com.

Q&A: Rich text? Not yet, but not hard (especially with YUI). Formally XForms compliant? Very nearly 1.1 conforming.

-m

Thursday, July 10th, 2008

Easing back into xml-dev

Traffic ain’t what it used to be there. But since I’m at a core xml technology company, it makes sense to participate again. Now, are there any topics left that haven’t been hashed to death? (hint: yes) -m

Wednesday, July 9th, 2008

Google Protocol Buffers: what’s missing from this picture?

Today Google announced Protocol Buffers, described as “think XML, but smaller, faster, and simpler”. Language bindings for C++, Java, and Python. Oddly not even a whisper about JSON, which is a much more apt comparison. And along with that, no JavaScript implementation. So why the omission?

My guess is that it wouldn’t compare that favorably with JSON. The extra needed compile step is a hassle, and doesn’t give enough of a relative benefit for Ajax applications. But perhaps this will unleash a torrent of people asking for ‘binary JSON’. OK, maybe not… -m