Archive for the 'languages' Category

Thursday, May 5th, 2016

Unsafe Java

Pop quiz. Why is the following Java 8 code unsafe? UPDATE: this code is fine, see comments. Still good to think about, though.

Entity e = new Entity();
e.setName("my new entity");
persistanceLayer.put(e);

To provide some context, Entity is a POJO representing something we want to store in a database. And persistanceLayer is an instance of a Data Access Object responsible for storing the object in the database. Don’t see it yet?

Here’s the function signature of that put method

@Async
<E extends AbstractEntity> CompletableFuture<E> put(E newEntity);

Ah, yes, Spring, which because of the @Async annotation will proxy this method and cause it to execute in a different thread. The object getting passed in is clearly mutable, and by my reading of the Java Memory Model, there’s no guarantee that the update made in the setName call will be visible to the other thread. In other words, there’s no guarantee the two threads’ ideas of “before-ness” will be the same; who knows what might have been reordered or cached?

I’m having trouble coming to terms with what seems like a gaping hole. Do you agree with this analysis? If so, how do you deal with it? Immutable object classes seem like an obvious choice, but will introduce lots of overhead in other parts of the code that need to manipulate the entities. A builder pattern gets awkward and verbose quickly, as it hoists all that complexity on to the caller.
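For what it’s worth, the likely reason the update above deems this safe: if the @Async proxy hands the task to a java.util.concurrent executor (an assumption about Spring’s internals), the submission itself establishes a happens-before edge. A minimal sketch of that guarantee, with Entity as a hypothetical stand-in for the POJO:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HandOff {
    // Hypothetical stand-in for the POJO in the post.
    static class Entity {
        private String name;
        void setName(String n) { this.name = n; }
        String getName() { return name; }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Entity e = new Entity();
        e.setName("my new entity"); // write happens in the caller thread
        // Per java.util.concurrent's memory-consistency guarantees, actions in a
        // thread prior to submitting a task happen-before the task's execution.
        pool.submit(() -> System.out.println("storing " + e.getName()));
        pool.shutdown();
    }
}

If the dispatch path doesn’t go through such an executor, the original worry stands.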

How do you handle situations like this? I’d love to hear it.

If you produce libraries, please include documentation on what is and isn’t guaranteed w.r.t. async calls. -m

Sunday, July 27th, 2014

Prime Number sieve in Scala

There are a number of sieve algorithms that can be used to list prime numbers up to a certain value. I came up with this implementation in Scala. I rather like it, as it uses no division or modulus, and only one (explicit) multiplication.

Despite being in Scala, it’s not in a functional style. It uses the awesome mutable BitSet data structure, which is very efficient in space and time. It is intrinsically ordered and provides an iterator, which makes jumping to the next known prime easy. Constructing a BitSet from a for comprehension is also easy with breakOut.

The basic approach is to start with a large BitSet filled with all odd numbers (and 2), then iterate through the BitSet, constructing a new BitSet of numbers to be crossed off, which are removed with the &~= (and-not) reassignment method. Since this is a logical bitwise operation, it’s blazing fast. This code takes longer to compile than to run on my oldish MacBook Air.

import scala.collection.mutable.BitSet
import scala.collection.breakOut
println(new java.util.Date())
val top = 200000
val sieve = BitSet(2)
sieve |= (3 to top by 2).map(identity)(breakOut)
val iter = sieve.iteratorFrom(3)
while(iter.hasNext) {
 val n = iter.next
 sieve &~= (2*n to top by n).map(identity)(breakOut)
}
println(sieve.toIndexedSeq(10000)) // 0-based
println(new java.util.Date())

As written here, it’s a solution to Euler #7, but it could be made even faster for more general use.

For example

  • I used a hard-coded top value (which is fine when you need to locate all primes up to n). For finding the nth prime, though, the top limit could be calculated.
  • I could stop iterating at sqrt(top).
  • I could construct the removal BitSet starting at n*n rather than n*2. (The second and third tweaks are sketched below.)
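
Here’s a minimal sketch of those two tweaks, same structure as the code above (untested for timings, and it assumes top is large enough that the iterator always holds another prime beyond the sqrt bound):

import scala.collection.mutable.BitSet
import scala.collection.breakOut

val top = 200000
val sieve = BitSet(2)
sieve |= (3 to top by 2).map(identity)(breakOut)
// Only primes up to sqrt(top) can cross anything off.
val limit = math.sqrt(top).toInt
val iter = sieve.iteratorFrom(3)
var n = iter.next
while (n <= limit) {
  // Each removal run starts at n*n, since smaller multiples of n
  // were already removed by smaller primes.
  sieve &~= (n * n to top by n).map(identity)(breakOut)
  n = iter.next
}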

I suspect that spending some time in the profiler could make this even faster. So take this as an example of the power of Scala, and a reminder that sometimes a non-FP solution can be valid too. Does anyone have a FP equivalent to this that doesn’t make my head hurt? :-)

-m

Thursday, August 23rd, 2012

Super simple tokenizer in XQuery

A lexer might seem like one of the boringest pieces of code to write, but every language brings its own little wrinkles to the problem. Elegant solutions are more work, but also more rewarding.

There is, of course, a large body of work on table-driven approaches, several of them listed here (and a bigger list), though XQuery seems to have been largely left out of the fun.

In the MarkLogic Search API, we implemented a recursive tokenizer. Since a search string can contain quoted pieces which need to be carefully maintained, first we split (in the fn:tokenize sense, discarding matched delimiters) on the quote character, then iterate through the pieces. Odd-numbered pieces are chunks of tokens outside of any quoting, and even-numbered pieces are single quoted strings, to be preserved as-is. We recurse through the odd chunks, further breaking them down into individual tokens, as well as normalizing whitespace and doing a few other cleanup operations. This code is aggressively optimized; it skips searches for tokens known not to appear in the overall string. It also preserves the character offset position of each token relative to the starting string, which gets used downstream, so this makes for some of the most complicated code in the Search API. But it’s blazingly fast.

When prototyping, it’s nice to have something simpler and more straightforward. So I came up with an approach using fn:analyze-string. This function, introduced in XSLT 2.0 and later ported to XQuery 3.0, takes a regular expression and returns all of the target string, neatly divided into match and non-match portions. This is great, but difficult to apply across the entire string. For example, potential matches can have different meanings depending on where they fall (again, quoted strings are an example). But if every regex starts with ^, which anchors the match to the front of the string, the problem simplifies to peeling off a single token from the front of the string. Keep doing this until there’s no string left.

This is a particularly nice approach when parsing a grammar that’s formally defined in EBNF. You can pretty much take the list of terminal expressions, port them to XQuery-style regexes, add a ^ in front of each, and roll.

Take SPARQL for example. It’s a reasonably rich grammar. The W3C draft spec has 35 productions for terminals. I sketched out some of the terminal rules (note these are simplified):

declare variable $spq:WS     := "^\s+";
declare variable $spq:QNAME  := "^[a-zA-Z][a-zA-Z0-9]*:[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:PREFIX := "^[a-zA-Z][a-zA-Z0-9]*:";
declare variable $spq:NAME   := "^[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:IRI    := "^<[^>]+>";
...

Then we go through the input string, see which of these expressions matches, and if one does, call analyze-string, add the matched portion as a token, and recurse on the non-matched portion. Note that we need to try longer matches first, so the rule for ‘prefix:qname’ comes before the rule for ‘prefix:’, which comes before the rule for ‘name’.

declare function spq:tokenize-recurse($in as xs:string, $tl as json:array) {
    if ($in eq "")
    then ()
    else spq:tokenize-recurse(
        switch(true())
        case matches($in, $spq:WS)     return spq:discard-tok($in, $spq:WS)
        case matches($in, $spq:QNAME)  return spq:peel($in, $spq:QNAME, $tl, "qname")
        case matches($in, $spq:PREFIX) return spq:peel($in, $spq:PREFIX, $tl, "prefix", 0, 1)
        case matches($in, $spq:NAME)   return spq:peel($in, $spq:NAME, $tl, "name")
        ...

Here, we’re co-opting a json:array mutable object as a convenient way to store tokens as we peel them off. There’s not actually any JSON involved here. The actual peeling looks like this:

declare function spq:peel(
    $in as xs:string,
    $regex as xs:string,
    $toklist as json:array,
    $type as xs:string,
    $triml, $trimr) {
    let $split := analyze-string($in, $regex)
    let $match := string($split/str:match)
    let $match := if ($triml gt 0) then substring($match, $triml + 1) else $match
    let $match := if ($trimr gt 0) then substring($match, 1, string-length($match) - $trimr) else $match
    let $_ := json:array-push($toklist, <searchdev:tok type="{$type}">{$match}</searchdev:tok>)
    let $result := string($split/str:non-match)
    return $result
};

Some productions, like an <iri> inside angle brackets, contain fixed delimiters which get trimmed off. Some productions, like whitespace, get thrown away. And that’s it. As it stands, it’s pretty close to a table-driven approach. It’s also more flexible than the recursive approach above: even for things like escaped quotes inside a string, if you can write a regex for it, you can lex it.
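
As a usage sketch, here’s a hypothetical entry point that kicks off the recursion and pulls the accumulated tokens back out (assuming the json:array constructor and accessor from the MarkLogic json library):

declare function spq:tokenize($in as xs:string) as element()* {
    let $toklist := json:array()
    let $_ := spq:tokenize-recurse($in, $toklist)
    return json:array-values($toklist)
};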

Performance

But is it fast? Short answer: I don’t know. A full performance analysis would take some time. But a few quick inspections show that it’s not terrible, and certainly good enough for prototype work. I have no evidence for this, but I also suspect that it’s amenable to server-side optimization: inside the regular expression matching code, paths that involve start-anchored matches should be easy to identify, and in many cases they can avoid work farther down the string. There’s plenty of room on the XQuery side for optimization as well.

If you’ve experimented with different lexing techniques, or are interested in more details of this approach, drop me a line in the comments. -m

Saturday, January 14th, 2012

Call a Spade a Spade

A cautionary tale of language from Ted Nelson:

We might call a common or garden spade–

  • A personalized earth-moving equipment module
  • A mineralogical mini-transport
  • A personalized strategic tellurian command and control module
  • An air-to-ground interface contour adjustment probe
  • A leveraged tactile-feedback geomass delivery system
  • A man-machine energy-to-structure converter
  • A one-to-one individualized geophysical restructurizer
  • A portable unitized earth-work synthesis system
  • An entrenching tool
  • A zero-sum dirt level adjuster
  • A feedback-oriented contour management probe and digging system
  • A gradient disequilibrator
  • A mass distribution negentroprizer
  • (hey!) a dig-it-all system
  • An extra terrestrial transport mechanism

Spades, not words, should be used for shoveling. But words should help us unearth the truth.

–Computer Lib (1974), Theodor Nelson, p44

Thursday, September 2nd, 2010

Is XForms really MVC?

This epic posting on MVC helped me better understand the pattern, and all the variants that have flowed outward from the original design. One interesting observation is that the earlier designs used Views primarily as output-only, and Controllers primarily as input-only, and as a consequence the Controller was the one true path for getting data into the Model.

But with browser forms, input and output are tightly intermingled. The View takes care of input and output. Something else has primary responsibility for mediating the data flow to and from the model–and that something has been called a Presenter. This yields the MVP pattern.

The terminology gets confusing quickly, but roughly

XForms Instance == MVP Model

XForms Model == MVP Presenter

XForms User Interface == MVP View

It’s not wrong to associate XForms with MVC–the term has become so blurry that it’s easy to lump variants like MVP into the same bucket. But to the extent that it makes sense to talk about more specific patterns, maybe we should be calling the XForms design pattern MVP instead of MVC. Comments? Criticism? Fire away below. -m

Sunday, May 30th, 2010

Balisage contest: solving the wikiml problem

I wish I could say I had something to do with the planning of this: part of Balisage 2010 is a contest to “encourage markup experts to review and to research the current state of wiki markup languages and to generate a proposal that serves to de-babelize the current state of affairs for the long haul.”  To enter, you must propose a set of concrete steps (organizational, social, and/or technological) that will enable wiki content interchange, a real WYSIWYG editor, and/or wiki syntax standardization.

This pushes all of my buttons. It’s got structured documents, Web, parser geekery, writing, engineering, and standards. There’s a bunch of open source prior art, including PyXMLWiki, which I adapted from some fantastic earlier work from Rick Jelliffe.

Sadly, MarkLogic employees aren’t eligible to enter. Get your write-up done by July 15 and sent to balisage-2010-contest at marklogic dot com. The winner will be announced at Balisage and will take home some serious prize winnings, and also will be strongly encouraged (but not required) to give a brief summary (~10 minutes) of their winning entry.

Can’t wait to see what comes out of this. -m

Friday, May 15th, 2009

A nugget from _A Canticle for Leibowitz_

This brilliant bit is almost a throwaway paragraph on page 304, near the end.

[Two men in a satirical dialog] managed only to demonstrate that the mathematical limit of an infinite sequence of “doubting the certainty with which something doubted is known to be unknowable when the ‘something doubted’ is still a preceding statement ‘unknowability’ of something doubted,” that the limit of this process at infinity can only be equivalent to a statement of absolute certainty, even though phrased as an infinite series of negations of certainty.

It’s not like the whole book is like this…far from it. But it is chock full of little gems.

-m

Tuesday, May 12th, 2009

Google Rich Snippets powered by RDFa

The new feature called rich snippets shows that SearchMonkey has caught the eye of the 800 pound gorilla. Many of the same microformats and RDF vocabularies are supported. It seems increasingly inevitable that RDFa will catch on, no matter what the HTML5 group thinks. -m

Wednesday, March 11th, 2009

Geek Thoughts: omito

Omito:

(Spanish) First-person singular (yo) present indicative form of omitir (to omit).

(Proto-English) Shortened word form of an error of omission, e.g. in written.

More collected Geek Thoughts at http://geekthoughts.info.

Sunday, March 8th, 2009

Wolfram Alpha

The remarkable (and prolific) Stephen Wolfram has an idea called Wolfram Alpha. People used to assume the “Star Trek” model of computers:

that one would be able to ask a computer any factual question, and have it compute the answer.

Which has proved to be quite distant from reality. Instead

But armed with Mathematica and NKS [A New Kind of Science] I realized there’s another way: explicitly implement methods and models, as algorithms, and explicitly curate all data so that it is immediately computable.

It’s not easy to do this. Every different kind of method and model—and data—has its own special features and character. But with a mixture of Mathematica and NKS automation, and a lot of human experts, I’m happy to say that we’ve gotten a very long way.

I’m still a SearchMonkey guy at heart, so I wonder how familiar Wolfram’s team is with existing Semantic Web research and practice, because at a high level this seems very much like RDF with suitable queries thereupon. If that’s a good characterization, that’s A Good Thing, since practical application has been one of SemWeb’s weak spots.

-m

Monday, February 16th, 2009

Crane Softwrights adds XQuery training

From the company home page: renowned XSLT trainer and friend G. Ken Holman has expanded his offerings to include XQuery training. The first such session is March 16-20, alongside XML Prague.

I’ve always thought there is great power in having both XSLT and XQuery tools at one’s disposal. I’ve seen people tend to polarize into one camp or the other, but in truth there is a lot of common ground, as well as cases where the right technology makes for a much more elegant solution. So learning both is easier than it seems, and more useful than it seems.

If you will be around the conference, take a look at the syllabus. I’m curious to see others’ reactions toward the combined XSLT + XQuery toolset. -m

Wednesday, January 7th, 2009

On porting WebPath to Python 3k

I’ve started looking into porting the WebPath code (and eventually XForms Validator) over to Python 3. The first step is external libraries, of which there is only one. WebPath uses the lex.py module from PLY. I had got it into my head that Python 2.x and 3.x were thoroughly incompatible, but leave it to the remarkable David Beazley to blow that assumption out of the water: the latest version of lex.py from SVN works in both 2.x and 3.x.

From there the included 2to3 tool was easy enough to run. (Relatively more difficult was getting 2.6 and 3.0 versions of Python frameworks installed on Mac, but even that wasn’t too bad.) The tool made some moderate changes, and I can run the unit tests, and a few even pass!

The primary remaining problem stems from code where the documentation is a little unclear, and my inexperience is severe. The part of the code in platonicweb.py that reads nasty, grotty HTML via Tidy and produces a clean DOM throws an exception every time. It seems to be a mismatch between str and bytes (encoded string) types, but it manifests as a failed XML parse. Sans exception handling, the code looks like:

    import urllib.request
    import xml.dom.minidom

    page = urllib.request.urlopen(fullurl)
    markup = page.read()  # bytes in Python 3, not str
    dom = xml.dom.minidom.parseString(markup)

urlopen() returns a file-like object, but the docs didn’t seem clear on whether it’s like a file opened in byte or string mode. In any case, I’m almost certainly doing it wrong. Suggestions?
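
For the record, urlopen() in Python 3 hands back bytes, so the decode has to happen explicitly somewhere. A sketch of the fix at that layer (the utf-8 fallback is a guess, and this sidesteps the Tidy step the real code goes through):

    import urllib.request
    import xml.dom.minidom

    page = urllib.request.urlopen(fullurl)
    raw = page.read()  # bytes
    charset = page.info().get_content_charset() or "utf-8"
    markup = raw.decode(charset)  # now str
    dom = xml.dom.minidom.parseString(markup)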

-m

Tuesday, December 9th, 2008

XML 2008 liveblog: Introduction to Schematron

Wendell Piez, Mulberry Technologies

Assertion-based schema language. A way to test XML documents. Rule-based validation language. Cool report generator. Good for capturing edge cases.

Same architecture as XSLT. (Schematron specifies, does not perform)

<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <title>Check sections 12/07</title>
  <pattern id="section-check">
    <rule context="section">
      <assert test="title">This section has no title</assert>
      <report test="p">This section has paragraphs</report>
      ...

Demo. OxygenXML has support. Assert vs. Report – essentially opposites. Assert means “tell me if this is false”. Report means “tell me if this is true”.

“Almost as if Schematron is a harness for XPath testing.”

More examples:

<rule context="note">
  <report test="ancestor::note">A note appears in a note. OK?</report>
</rule>

Binding: Default is XSLT 1, but flexible enough to allow other query languages via the @queryBinding attribute at the top. Many processors allow mix-and-match between XSLT and Schematron. Examples showing just that.
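
For instance, a minimal sketch (xslt2 being one of the bindings the ISO spec recognizes):

<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- patterns using XPath 2.0 expressions go here -->
</schema>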

Some tests can be very useful:

test="every $line in tokenize(., $newline) satisfies string-length($line) le 72"

Q: What if the destination is not a human, but another part of a pipeline? Varies by implementation, but SVRL is standardized as an annex in the ISO spec, part of DSDL.

Use as little or as much as you want, at different times in the document lifecycle. “Schematron is a feather duster that reaches areas other schema languages cannot.” – Rick Jelliffe

As time permits section of the talk:

Other top-level elements: title, pattern, ns, let, p, include, phase, diagnostics.

-m

Friday, November 28th, 2008

Fun with xdmp:value()

Lately I’ve been playing with some more advanced XQuery. One thing nearly every XQuery engine supports is some kind of eval() function. MarkLogic has several, but my favorite is xdmp:value. It’s lightweight because it reuses the entire calling context, so for instance you can write let $v := 5 return xdmp:value("$v"). Not too useful, but if the expression passed in comes from a variable, it gets interesting.

Now, quite a few standards based on XPath depend on the context node being set to some particular node. This turns out to be easy too, using the path operator: $context/xdmp:value($expr). According to the definition of the XPath path operator, the expression on the right is evaluated once for each node produced by the expression on the left, with that node as the context node.

OK, how about setting the context size and position? More difficult, but one could use a sequence on the left-hand side of the path operator, with the desired $context node somewhere in the middle. Then last() will return the length of the sequence, and position() will return, well, the position of $context in the sequence. But it’s kind of hacky to manufacture a bunch of temporary nodes, only to throw them away in the next step of the path.
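
Something like this hypothetical sketch (element names are arbitrary; note the expression gets evaluated for every node in the padded sequence, which is exactly the wasteful part):

let $context := <real/>
let $padded := (<pad/>, <pad/>, $context, <pad/>, <pad/>)
return $padded/xdmp:value("concat(local-name(), ': ', position(), ' of ', last())")
(: returns "pad: 1 of 5", "pad: 2 of 5", "real: 3 of 5", and so on :)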

I’m curious if anyone else has done something similar. Comments? -m

Friday, October 24th, 2008

Online etymology database

I’ve been playing lately with this site, and it’s a fantastic resource. The word carboy probably comes from Persian qarabah “large flagon.” Who knew? -m

Wednesday, September 17th, 2008

The case for native higher-order functions in XQuery

The XQuery Working Group is debating the need for higher-order functions in the language. I’m working on honing my description of why this is an important feature. Does this work? What would work better?

Imagine you are writing a smallish widget app, in an environment without a standard library. When you need to sort your widgets, you’d write a simple function with a signature like sort(sequence-of-widgets). That’s great.

Now imagine you find your app to be steadily growing. An accumulation of smaller one-off solutions won’t work anymore, you need a general solution. What you’ll end up with is something like qsort in C, which takes a pointer to a comparator function. By providing different comparators, you can sort anything any way you like, all through only a single sort function. C and C++ have something like this, as do PHP, Python, Java, JavaScript, and even assembly language. XSLT has it, as proven by Dimitre.
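
Here’s a sketch of what a comparator-driven sort could look like in a hypothetical XQuery with first-class functions (no such syntax exists in the language today; the names are mine, and $widgets stands for the sequence from the widget-app example):

declare function local:qsort($seq as item()*,
                             $lt as function(item(), item()) as xs:boolean) as item()* {
  if (empty($seq)) then ()
  else
    let $pivot := $seq[1]
    let $rest := $seq[position() gt 1]
    return (
      local:qsort($rest[$lt(., $pivot)], $lt),
      $pivot,
      local:qsort($rest[not($lt(., $pivot))], $lt)
    )
};

(: one sort function, any ordering :)
local:qsort($widgets, function($a, $b) { number($a/@weight) lt number($b/@weight) })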

XQuery doesn’t. It should, because people are now using it for more than short queries. People are writing programs in it. -m

P. S. Comment please.

Wednesday, September 10th, 2008

Geek Thoughts: English is funny, part 1

Has there ever been a case of mitigated gall?

More collected Geek Thoughts at http://geekthoughts.info/.

Friday, August 8th, 2008

It would be awesome if somebody…

It would be awesome if someone made a site that catalogued all the common mis-encodings. Even in 2008, I see these things all over the web–mangled quotation marks, apostrophes, em-dashes. I’d love to see a pictorial guide.

curly apostrophe looks like ?’ – original encoding=_________ mislabeled as __________ .
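
For instance, here’s a quick Python sketch of the usual culprit, UTF-8 bytes decoded as windows-1252:

# U+2019 (curly apostrophe) encoded as UTF-8, then misread as windows-1252
s = "\u2019"
print(s.encode("utf-8").decode("windows-1252"))  # prints: â€™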

That sort of thing. Surely somebody has done this already, right? -m

Wednesday, May 28th, 2008

XForms Validator on Google App Engine?

I registered ‘xfv’ on Google App Engine. Too bad there doesn’t appear to be any significant XML libraries supported. I have XPath covered by my pure-python WebPath, but what about Relax NG? Anyone know of anything in pure python? -m

Tuesday, May 20th, 2008

The two-line CV

On my about page, I’ve written my CV in two lines. Why don’t you try it, then link back to here?

I’ve been known to use this as an interview question, and it’s quite a bit harder than it looks. A clever candidate will turn the paper sideways, giving themselves more room to write “two lines”, but that’s not the point. This exercise forces you to really think about your qualifications, skills, and experience; your “unique selling proposition”.

Writing short, as opposed to rambling on, is notoriously difficult. Someone who can do that with their own CV is off to a good start in my book. -m

P. S. Mark Logic is looking for some high-caliber XML and web folks. Contact me offline if you know anyone looking…

Thursday, May 8th, 2008

14 ways…

When making hash browns xkcd style, there are at least 14 ways it could go badly.

  1. That’s not a potato, it’s a misshapen rock.
  2. Unexpectedly flammable tennis racket.
  3. Sparks landing on gas can.
  4. Food poisoning via undercooked hash browns due to limited flame contact time.
  5. Broken plate fragments.
  6. Dripping, flaming gasoline.
  7. Swing and a miss; balance lost.
  8. Flaming potato fragments in the eye socket.
  9. Diving catch ends badly.
  10. Spontaneous combustion.
  11. Tennis elbow.
  12. Repetitive stress injury.
  13. Fork misfire.
  14. Heat death of the universe.

(17 if that fork is a dangerous crossbreed) -m

Monday, March 10th, 2008

Dear readers…

You are awesome. Just sayin’. -m

Friday, November 30th, 2007

4 things I’ve learned writing (mostly) 4 novels

If you want to get anything done, give it to a busy person…

In my life, I’ve started four novels, completed my goals on three, gotten to “The End” on two, and completely flamed out on one.

The first was in 2001. I hadn’t written much since high school. Something clicked in my head that made me realize that writing wasn’t some kind of black art (as one particular teacher had drilled into his credulous students). It was doable. You take pencil and paper and write one word after another. Voilà. I was so taken with this simple idea that every single thing I had ever learned about writing went out the window. I had Swifties, danglers, tell-vs-show, you name it. There’s enough material in there for several Bulwer-Lytton contests. By the time I had 70 hand-written pages, the thing collapsed under its own weight and the story reached an abrupt, borderline-surrealistic “ending,” to abuse the term. I have evidence that I even typed it all in and pressed on for a 2nd draft.

By 2003 my non-fiction book was published–my writing career was under way! Part of the elaborate book proposal dance involved me writing some online articles, including one piece of fiction that was well-received in the tiny circle that was its intended audience. At this stage I adopted electronic writing, and ditched my crashy Windows laptop for a Mac, a vast improvement.

In 2005 I discovered NaNoWriMo, and though I thought it would be a lost cause, I signed up. No way it could be as bad as the previous attempt. I had a new job, and was able to skip a few lunches to write, not to mention intense evenings and weekends. The end goal is 50,000 words during the 30 days of November; that’s 1,666 and two-thirds words per day. All of the prior month I spent outlining, making maps, creating my universe. I used the simplest of tools: my text editor and one file per chapter. I learned that the command wc *.txt could easily give me a combined word count. To my surprise, it worked. I reemerged into daylight having completed a full story arc loosely based on the earlier story, and ended up with just over 50,000 words. The text itself was very rough, but I read the whole thing out loud in a podcast to edit it. In terms of improvement, it was huge, but still far from publishable.

2006 and another NaNoWriMo rolled around, and I took off on a more ambitious storyline with far fewer notes going into it. The story itself involved the same general characters of the previous two episodes, but with a deeper, more mature feeling to it. In short, I finally wrote a piece of fiction to be proud of afterwards, though when I hit 50,000 words I felt really burned out; I hit “save” and left the story arc unfinished.

The pull to dig in to an intensive 2nd draft of the story was immense, but just too many things were going on, including a new arrival in the family and a new set of job responsibilities. I never got more than a few dozen pages into the rewrite. When NaNoWriMo 2007 came upon me, I had a tough choice…do I write something fresh, or try to rework the previous novel? Fresh. A completely new story line, new characters, new setting, new everything. As of a few days ago, I finished the draft, compressing parts of the story as needed to meet both the 50 kiloword goal and the complete story arc. In preparation, I read a number of books, but as far as written outlines, maps, etc. go, almost nothing happened before November 1. I saved enough of the “fun stuff” that a second revision of this story will be a joy. Overall, another improvement year-over-year.

There’s only one kink to the “if you want to get something done…” idea: my slides for the XML Conference talk I have in a few days are still unfinished… -m

Monday, November 19th, 2007

Kindle my disappointment

Where’s Project Gutenberg? One difficulty in launching an ebook platform is the lack of available titles. I keep hearing about 80,000+ titles, but expressed as a percentage of Amazon’s book catalog, it’s minuscule. There should be all kinds of public domain titles ready to go on day one. And where are the Creative Commons books?

There are some public domain books to be found, but none are free. Take, for example, A Connecticut Yankee in King Arthur’s Court, a book (in paper form) sitting just out of arm’s reach as I write this, waiting to be read. If I had it on a device, particularly one with a good screen, I’d be more inclined to keep it, and dozens of others, on hand in my backpack and be ready to read at a moment’s notice. But no.

The problem is the “we take care of the wireless delivery” part, called Whispernet(tm). It’s not really free, nor bundled in the service price. It’s bundled into the cost of every media access. Is it fair to pay $9.99 for a New York Times bestseller? Sure. But it sucks to pay $1 for an A-list blog that’s free everywhere else, or to get literally nickeled and dimed for the privilege of “converting” and delivering your own content to your own device.

By the way, who gets the money paid for accessing, say, a CreativeCommons non-commercial licensed blog via the Kindle? Somebody should look into that.

I applaud Amazon for pushing to innovate in a space that badly needs it, but the financial model behind the wireless access encourages the wrong kind of things. Exceptions, like unlimited Wikipedia access (be still my heart!), still need to be hand-approved by the gatekeeper. Information wants to be free; it doesn’t want to be a service, though that’s hard to see when the dollar signs get in your eyes.

Many folks are comparing this to the original iPod launch–remember, the huge clunky one with a tiny capacity, black and white screen, and a mechanical click-wheel? There are some strong points of similarity, but stronger differences. For one, anyone with an iPod can easily rip their existing CDs, not to mention obtain MP3s by other methods (so I hear). There’s nothing like that yet for books.

Where’s the documentation for the new, proprietary ebook format? I don’t care about the DRM crap. I care about being able to create new content, or repackage existing content for which I have the rights, and for that, I’m having trouble coming up with a rationale for an entirely new format. I would love to do some cool things with this platform. Perhaps I will some day, though my enthusiasm is somewhat lessened by the difficulties I would face getting anything cool onto the devices. -m

Sunday, November 18th, 2007

Gettysburg Address PowerPoint

As one who, in the all-too-near future, will be hammering out the visuals to go with my talk at XML 2007, this made my day. (be sure to check out the deeper pages too) -m

Saturday, November 10th, 2007

RDFa question

What is the difference between placing instanceof="prefix:val" vs. rel="prefix:val" on something? How do I decide between the two?

In the example of hEvent data, why is it better/more accurate to use instanceof="cal:Vevent" instead of a blank node via rel="cal:Vevent"?
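
For concreteness, a sketch of the two markups I mean (cal: bound to the iCalendar vocabulary, as in the RDFa primer examples):

<div instanceof="cal:Vevent">
  <span property="cal:summary">Lunch with the W3C</span>
</div>

<div rel="cal:Vevent">
  <span property="cal:summary">Lunch with the W3C</span>
</div>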

-m

Monday, October 22nd, 2007

Is there fertile ground between RDFa and GRDDL?

The more I look at RDFa, the more I like it. But still it doesn’t help with the pain-point of namespaces, specifically of unmemorable URLs all over the place and qnames (or CURIEs) in content.

Does GRDDL offer a way out? Could, for instance, the namespace name for Dublin Core metadata be assigned to the prefix “dc:” in an external file, linked via transformation to the document in question? Then it would be simpler, from a producer or consumer viewpoint, to simply use names like “dc:title” with no problems or ambiguity.

This could be especially useful now that discussions are reopening around XML in HTML.

As usual, comments welcome. -m

Monday, October 1st, 2007

simple parsing of space-separated attributes in XPath/XSLT

It’s a common need to parse space-separated attribute values from XPath/XSLT 1.0, usually @class or @rel. One common (but incorrect) technique is a simple equality test, as in {@class="vcard"}. This is wrong, since the attribute can match the token while still containing other literal values, like "foo vcard" or "vcard foo" or " foo vcard bar ".

The proper way is to look at individual tokens in the attribute value. On first glance, this might require a call to EXSLT or some complex tokenization routine, but there’s a simpler way. I first discovered this on the microformats wiki, and only cleaned up the technique a tiny bit.

The solution involves three XPath 1.0 functions: contains(), concat() to join together string fragments, and normalize-space() to strip off leading and trailing spaces and convert any other sequences of whitespace into a single space.

In English, you:

  • normalize the class attribute value, then
  • concatenate spaces front and back, then
  • test whether the resulting string contains your searched-for value with spaces concatenated front and back (e.g. " vcard ").

Or {contains(concat(' ', normalize-space(@class), ' '), ' vcard ')}. A moment’s thought shows that this works well on all the different examples shown above, and is perhaps even less involved than resorting to extension functions that return nodes requiring further processing/looping. It would be interesting to compare performance as well…
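
In template form, the same test looks like this (a minimal XSLT 1.0 sketch; the message is just a placeholder):

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- matches any element carrying "vcard" among its class tokens -->
  <xsl:template match="*[contains(concat(' ', normalize-space(@class), ' '), ' vcard ')]">
    <xsl:message>found a vcard element</xsl:message>
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>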

So next time you need to match class or rel values, give it a shot. Let me know how it works for you, or if you have any further improvements. -m

Sunday, September 16th, 2007

Evaluating fiction vs. evaluating libation

My Copious Free Time(tm) has been filled lately by two different evaluation projects. One is the 2nd Annual Writing Show Best First Chapter of a Novel Contest, for which the first round of judging is just winding up. The main benefit for contest entrants is that every submission gets a professional critique of at least 750 words. But additionally, each submission gets a score on a 50-point scale, based on:

  • 10 points for Story. Is it a compelling read with a great hook? Are we engaged?
  • 10 points for Style. Is the writing smooth and tight, without awkward constructions, extraneous verbiage, and redundancies?
  • 10 points for Dialog. Is the dialog natural and does it move the story along?
  • 10 points for Character. Are the characters interesting? Do we care about them?
  • 10 points for Mechanics. Are grammar, spelling, and punctuation correct?

I’m also attending some classes aiming toward becoming a Certified Beer Judge (details on Meadblog). This isn’t as fun as it sounds. (Well, OK, maybe it is…). The idea is to build up better sensory perception so that my personal brewing and cooking projects can benefit. But the upcoming test is 70% written essay questions like “Identify three distinctly different top-fermenting beer styles with a starting gravity of 1.070 or higher, and describe the similarities and differences between the styles”. 30% of the test is based on actual tasting and filling out a tasting sheet. Interestingly, the scoring here is also based on a 50-point scale:

  • 12 points for Aroma.
  • 3 points for Appearance.
  • 20 points for Taste.
  • 5 points for Mouthfeel.
  • 10 points for Overall Impression.

The interesting part is that there are similarities between the two tasks. For both, I need to work off of physical paper, not in my head or on a computer screen. For both, I first “skim”, building an overall impression, then dig down into individual categories to assign a score for each one. Then I step back and look at my numbers, and check whether everything makes sense and accurately records my impressions. When I’m satisfied, I add everything up and am done.

Most day-to-day problems aren’t so well structured or normalized, but nonetheless, I find myself tackling all kinds of problems with a similar approach. There you have it. Writing and drinking beer make you a better person. :) -m

Wednesday, August 22nd, 2007

What I’m reading

Yeah, they’re related. -m