Newest Post

March 14th, 2014

I am not a robot spammer

Based on the huge number of mail bounces I’ve been getting today, it looks like an unscrupulous somebody forged my return address on a bunch of mail. Perhaps you even sought out this blog based on the distinctive domain name.

Some subject lines in use:

it’s so nice to write to u

maybe your lady

Hi:))

It is me!

It wasn’t me. It’s all too easy to claim an email is from somebody. And putting an unsuspecting schmo through all this apparently makes the message 0.003% more likely to get through filters deliberately trying to block abusive behavior.

And don’t worry: I haven’t (yet) heard from any really irate people. It’s mostly automated bounces from when an email address on the spammer’s list no longer exists. Carry on. -m

February 18th, 2014

Fitness

I can’t blog about secret projects I’m working on, so how about something completely different?

I’ve improved my fitness level substantially over the last five years. (On index cards, I have my daily weight and body fat percentage, according to the bathroom scale, going back to November 2009.) Here are some things I’ve learned:

  • Moving counts. A lot. The difference between being completely sedentary and moving a bit (easy walks, standing desk, etc.) is the biggest leap. Everything after that is incremental.
  • Spending $99 on a Fitbit is the best health investment I’ve made, dollar-for-dollar, ever.
  • Expensive shoes don’t help much. My current main shoes were $40 online, and they’re just as good, if not better, than the $120 shoes from Roadrunner.
  • Pilates looks easy if you’ve never tried it.
  • Once you reach a certain level, you will plateau there unless you challenge yourself further.
  • Strength training is helpful for just about everything, even improving your running times.
  • Foam rollers are super useful for managing sore muscles and tendons. Highly recommended.
  • Boosting your VO2Max is painful–interval training is the gasping-for-air kind of torture many people think of when they hear the word ‘exercise’–but it’s also important if you want to improve your run times.
  • But you shouldn’t try to improve your run times or anything else unless you have specific bigger-picture goals in mind.
  • Seriously–sitting is terrible for you. Get a standing desk.

Invest in yourself. -m

October 29th, 2013

Skunklink a decade later

Alex Milowski asks on Twitter about my thoughts on Skunklink, now a decade old.

Linking has long been thought one of the cornerstones of the web, and thereby a key part of XML and related syntaxes. It’s also been frustratingly difficult to get right. XLink in particular once showed great promise, but when it came down to concrete syntax, didn’t get very far. My thinking at the time is still well-reflected in what is, to my knowledge, the only fiction ever published on XML.com: A Hyperlink Offering. That story ends on a hopeful note, and a decade out, I’m still hoping.

For what it purports to do, Skunklink still seems like a good solution to me. It’s easy to explain. The notion of encoding the author’s intent, then letting devices work out the details, possibly with the aid of stylesheets and other such tools, is the right way to tackle this kind of a problem. Smaller specifications like Skunklink would be a welcome breath of fresh air.

But a bigger question lurks behind the scenes: that of requirements. Does the world need a vocabulary-independent linking mechanism? The empirical answer is clearly ‘no’ since existing approaches have not gained anything like widespread use, and only a few voices in the wilderness even see this as a problem. In fact, HTML5 has gone in quite the opposite direction, rejecting the notion of even a vocabulary-independent syntax, to say nothing of higher layers like intent. I have to admit this mystifies me.

That said, it seems like the attribute name ‘href’ has done pretty well in representing intended hyperlinks. The name ‘src’ not quite as well. I still consider it best practice to use these names instead of making something else up.

What do you think? -m

 

October 14th, 2013

ASLbot

If you’ve come here because of something you noticed in your HTTP access logs, read on.

Who is doing this? This is a personal project of Micah Dubinko. It is completely separate from anything related to any employer.

What is ASLbot? In the immediate future, ASLbot is no more than a personal research project. It consists of a web crawler, like Google’s, with an emphasis on sites centered around American Sign Language, and in particular reference materials relating to particular signs. At the moment, there is no publicly available search site, but I would like to set one up as time allows. My long-term goal is to promote ASL as an effective means of communication while at the same time making it easier to research and learn about.

Will this affect my site? No. I have the crawl settings turned down very low, so that sites crawled have no discernible impact on performance. I also crawl very infrequently, as ASL dictionaries don’t tend to change terribly often. Once a search site is operating, you may notice an increase in traffic as more people are able to find and visit your site.

What do you intend to do with the crawled data? First off, this is a technology experiment. I’ve noticed that Google/Bing/Yahoo do only an “OK” job on queries like “asl sign for awesome” and think a dedicated site can do better. Once the basics are up, I’d like to do a lot more, but this will necessarily take a long time, as this is not my full-time work. For example, I would like to (possibly with manual input, especially from native signers) categorize signs by handshape, position, and movement in a manner similar to William Stokoe’s groundbreaking research on ASL linguistics. Keep in mind that this, if it happens at all, is far in the future—imagine someone searching for “M handshape shoulder” and getting a list of hits that link to existing ASL dictionaries.

Do you plan to charge money to access the site? Never.

Do you automatically download videos? No. Only web pages.

How do I make it stop? Think of it this way: Does your site appear in Google? If so, people will be searching and finding particular signs anyway, but without the aid of an ASL-positive web tool. But if you really want to, put an entry for “ASLbot” in your robots.txt file, which this crawler fully honors.
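
For example, these two lines in robots.txt (standard robots exclusion syntax, using the “ASLbot” user-agent token mentioned above) would keep this crawler out entirely:

User-agent: ASLbot
Disallow: /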

This is awesome, how do I help? Or, I still have questions: Feel free to email me using the contact information listed on this site, or ( <my first name> @ <this domain.info> )

August 10th, 2013

XForms in 2013

This year’s Balisage conference was preceded by the international symposium on Native XML User Interfaces, which naturally enough centered around XForms.

As someone who’s written multiple articles surveying XForms implementations, I have to say that it’s fantastic to finally see one break out of the pack. Nearly every demo I saw in Montreal used XSLTForms if it used XForms at all. And yet, one participant I conversed with afterwards noted that very little that transpired at the symposium couldn’t have been done ten years ago.

It’s safe to say I have mixed emotions about XForms. On one hand, watching how poorly the browser makers have treated all things XML, I sometimes muse about what it would look like if we started fresh today. If we were starting anew, a namespace-free specification might be a possibility. But with XForms 2.0 around the corner, it’s probably more fruitful to muse about implementations. Even though XSLTForms is awesome, I still want more. :-)

  • A stronger JavaScript interface. It needs to be possible to incrementally retrofit an existing page using POHF (plain old HTML forms) toward using XForms in whole or in part. We need an obvious mapping from XForms internals to HTML form controls.
  • Better default UI. I still see InfoPath as the leader here. Things designed in that software just look fantastic, even if quickly tossed together.
  • Combining the previous two bullets, the UI needs to be more customizable, and more easily so. It needs to be utterly straightforward to make XForms parts of pages fit in with non-XForms parts of pages.
  • Rich text: despite several assertions during the week, XForms can actually handle mixed text, just not very well. One of the first demo apps (in the DENG engine–remember that?) was an HTML editor. The spec is explicitly designed in such a way as to allow new and exciting forms widgets, and a mixed-content widget would be A Big Deal, if done well.
  • Moar debugging tools

During the main conference, Michael Kay demonstrated Saxon-CE, an impressive tour-de-force in routing around the damage that is browser vendors’ attitudes toward XML. And though he didn’t make a big deal of it, it’s now available freely under an open source license. This just might change everything.

Curious about what others think here–I welcome your comments.

-m

May 20th, 2013

Five years at MarkLogic

This past weekend marked my five-year anniversary at MarkLogic. It’s been a fun ride, and I’m proud of how much I’ve accomplished.

It was the technology that originally caught my interest: I saw the MarkMail demo at an XML conference, and one thing led to another. The company was looking to expand the product beyond the core database–they had plans for something called a “utility layer”, though in reality it was neither a utility nor a separate layer. It started with Search API, though the very first piece of code I wrote was an RDFa parser.

But what’s really held my interest for these years is a truly unmatched set of peers. This place is brimming with brilliant minds, and that keeps me smiling every day on my way in to work.

Which leads my thoughts back to semantics again. This push in a new direction has a lot of echoes with the events that originally brought me on board. This is going to be huge, and will move the company in a new direction. Stay tuned. -m

April 13th, 2013

Semantics!

This week marked the MarkLogic World conference and with it some exciting news. Without formally “announcing” a new release, the company showed off a great deal of semantic technology in-progress. Part of that came from me, on stage during the Wednesday technical keynote. I’ve been at MarkLogic five years next month, and the first piece of code I wrote there was an RDFa parser. This has been a long time coming.

It was an amazing experience. I was responsible for sifting through the huge amounts of public data–both in RDF formats and on public web pages–and writing the semantic code to pull everything together, culminating in those ten minutes on stage.

Picture this: just behind the big stage and the projected screens was a hive of impressive activity. I counted 8 A/V people backstage, plus 4 more at the back of the auditorium. The conference has reached a level of production values that wouldn’t be vastly different if it were a stadium affair. So in back there’s a curtained-off “green room” with some higher-grade snacks (think PowerBars and Red Bull) and a flatscreen that shows the stage. From back there you can’t see the projected slides or demos, but if you step just outside, you’re at the reverse side of the screen, larger-than-life. The narrow walkway leads to the “chute”, right up the steps onto the main stage. As David Gorbet went through the opening moments of his talk in fine form, I did some stretches and did everything I could think of to prepare myself.

Then he called me up and the music blasted out from the speakers. I had been playing through my mind all the nightmare scenarios–tripping on the stairs and falling on my face as I come onstage (etc.)–but none of that happened. I’ve done public speaking many times before so I had an idea what to expect, though on a stage like that the lights are so bright that it’s hard to see beyond about the third row. So despite the 300-400 people in the room, it didn’t even feel much different than addressing an intimate group of peers. It was fun. On with the demos:

The first showed our internal MarkMail cluster with a simple ‘infobox’ of the sort that all the search engines are doing these days. This was an icebreaker to talk about semantics and how it works–in this case locate the concept of Hadoop in the database, and from there find all the related labels, abstracts, people, projects, releases, and so on. During the construction of the demo, we uncovered some real world facts about the author of the top-ranked message for the query, including a book he wrote. The net effect was that these additional facts made the results a lot more useful by providing a broader context for them.

The second demo showed improved recall–that is, finding things that would otherwise slip under the radar. The existing [from:IBM] query in MarkMail does a good job finding people that happen to have the letters i-b-m in their email address. The semantic query [affiliation:IBM], in contrast, knows about the concept of IBM, the concept of people, and the relationship of is-affiliated-with (technically foaf:affiliation), and runs a query that more closely models how a person would ask the question: “people that work for IBM” as opposed to “people that have i-b-m in their email address”. Thus the results included folks posting from gmail accounts and other personal addresses, and the result set jumped from about 277k messages to 280k messages.

At this point, a pause to talk about the architecture underlying the technology. It turns out that a system that already supports shared-nothing scale out, full ACID transactions, multiple HA/DR options, and a robust security model is a good starting point for building semantic capabilities. (I got so excited at this point that I forgot to use the clicker for a few beats and had to quickly catch up the slides.) SPARQL code on the screen.
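
I can’t reproduce the exact code from the keynote here, but a query in the spirit of the [affiliation:IBM] demo would look something like the sketch below. The foaf names are the standard vocabulary; the sem:sparql entry point and the IBM resource IRI are illustrative stand-ins, not necessarily the shipping API.

xquery version "1.0-ml";
(: sketch only: find people affiliated with IBM via foaf:affiliation :)
sem:sparql('
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?person
  WHERE { ?person foaf:affiliation <http://dbpedia.org/resource/IBM> . }
')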

Then the third demo, a classic semantic app with a twist. Pulling together triples from several different public vocabularies, we answered the question of “find a Hadoop expert” with each row of the results representing not a document, as in MarkMail results, but an actual person. We showed location data (which was actually randomized to avoid privacy concerns) and aggregate cost-of-living data for each city. When we added in a search term, we drew histograms of MarkMail message traffic over time and skipped over the result that had no messages. The audience was entranced.

This is exciting work. I had several folks come up to me afterwards with words to the effect that they hadn’t realized it before, but boy do they ever need semantics. I can’t think of a better barometer for a technical keynote. So back to work I go. There’s a lot to do.

Thanking by name is dangerous, because inevitably people get left out, but I would like to shout out to David Gorbet who ran the keynote, John Snelson who’s a co-conspirator in the development effort, Eric Bloch who helped with the MarkMail code more than anyone will ever know, Denis Shehan who was instrumental in wrangling the cloud and data, and Stephen Buxton who patiently and repeatedly offered feedback that helped sharpen the message.

I’ll post a pointer to the video when it’s available. -m

March 31st, 2013

Introducing node-node:node.node

Naming is hard to do well, almost as hard as designing good software in the first place. Take for instance the term ‘node’ which depending on the context can mean

  1. A fundamental unit of the DOM (Document Object Model) used in creating rich HTML5 applications.
  2. A basic unit of the Semantic Web–a thing you can say stuff about. Some nodes are even unlabeled, and hence ‘blank nodes’.
  3. In operations, a node means, roughly, a machine on the network. E.g. “sixteen-node cluster”
  4. A software library for event-driven, asynchronous development with JavaScript.

I find myself at the forefront of a growing chorus of software architects and API designers that are fed up with this overloading of a perfectly good term. So I’m happy today to announce node-node:node.node.

The system is still in pre-alpha, but it solves all of the most pressing problems that software developers routinely run in to. In this framework, every node represents a node, for the ultimate in scalable distributed document storage. In addition, every node additionally serves as a node, which provides just enough context to make open-world assumption metadata assertions at node-node-level granularity. Using the power of Node, every node modeled as a node has instant access to other node-node:nodes. The network really is the computer. You may never write a program the old way again. Follow my progress on Sourceforge, the latest and most cutting-edge social code-sharing site. -m

March 1st, 2013

WFH

The valley is buzzing about Marissa’s edict putting the kibosh on Yahoos working from home. I don’t have any first-hand information, but apparently this applies somewhat even to one-day-a-week telecommuters. Some are saying Marissa’s making a mistake, but I don’t think so. She’s too smart for that. There’s no better way to get extra hours of work out of a motivated A-lister than letting them skip the commute, and I work regularly with several full-time telecommuters. It works out just fine.

This is a sign that Y is still infested with slackers. From what I’ve seen, a B-or-C-lister will ruthlessly take advantage of a WFH policy. If that dries up, they’ll move on.

If I’m right, the policy will indeed go into effect at Yahoo starting this summer, and after a respectable amount of time has passed (and the slackers leave) it will loosen up again. And Yahoo will be much stronger for it. Agree? -m

February 18th, 2013

Nerve-wracking

So I did it.

I stood up on a platform in front of a room of native signers, and delivered a (pre-prepared) five minute presentation without making a sound. In front of cameras, with my ugly face beamed out to multiple large screens.

That was stressful, though less so than many other public speaking engagements I’ve participated in. It was a different kind of stress. I’m sure I made all kinds of mistakes of which I wasn’t even aware. ASL books, videos, and web sites tend to focus on particular signs, and vocabulary is one important part of learning the language–but not the only part. A huge amount of the communication comes through facial expression, body shifting and language, and other “non-manual markers.” I’m learning, if slowly.

It’s also helping me in everyday situations, among hearing folks. I’m better able to express myself, and I’ve picked up some new gestures (like non-dominant-hand indexing…more on that later). I also tend to think, even if in the back of my mind, about how you’d express such-and-such an idea in ASL, and having thought it through more, I can better express it in writing or speech.

It’s also helping to finally tame my inner-introvert. When a fundamental part of communication involves displaying play-by-play emotions on your face (and indeed, entire body) it changes you. Better than acting lessons.

What have you done lately to push yourself out of your comfort zone? -m

December 31st, 2012

New Year’s Resolution

Holding steady at 1440 x 900.

Relevant. -m

December 25th, 2012

Fluency

My journey into ASL continues. I’ve been reading Oliver Sacks’s _Seeing Voices_ and Harlan Lane, Robert Hoffmeister, and Ben Bahan’s _A Journey into the DEAF-WORLD_. In short, learning a language in your thirties is a whole different ballgame than learning as a toddler. There are a few brain-plasticity cliffs you drop off of, especially around age 6 and again around age 12.

And I’m completely OK with this. I don’t expect to ever get confused for a native signer, which is fine with me. I do expect, however, to become a better communicator–to develop sufficient skill to be clearly understood in ASL. I prefer to think of it like someone with a suave British accent in America. You’d never mistake them for a native, and yet they are a joy to converse with. In the right circumstances, they can even grab your attention more so than someone with a native accent.

This can only do good things for my spoken communication skills as well. It’s a lot like acting classes in some respects, which is a marked departure from my normally taciturn personality. This is encouraging me to quit holding everything inside quite so much, with encouraging results. If you see me walking a little taller, speaking a bit more emphatically, or better conveying emotion to get my point across, now you know what’s behind that. -m

December 8th, 2012

Mistakes

I’ve been learning a new language lately: American Sign Language aka ASL. Along with the language, I’ve picked up lots of new friends as part of a thriving culture. A big part of learning is through mistakes, and a big part of said culture is helpful bluntness. The combination of these can be a little rough on your ego sometimes.

Sometimes I notice that, when I’m corrected–say I make a sign incorrectly and my conversational partner demonstrates the correct way to do it–I often can’t tell any difference between what I was supposed to do and what my hands actually did. This kind of fundamental error in cognition seems to happen all the time with me. My helpful friends tell me that’s a good sign. (no pun intended)

A less-bruising kind of error is the “oops” kind–the instant you commit the error, you know you messed up. This, however, can sometimes throw you off if you get self-conscious about it. A third kind of error is when you know exactly what to do, but your physiology holds you back–for instance, the ASL sign for either ‘6’ or ‘W’ (made the way most hearing people show a ‘3’ on their fingers: thumb holding down the pinky) is difficult for me to make without slowing way down. And to think, only 13 years ago I was playing keyboards in a little garage band. Guess I need some stretches. It’s good to loosen up.

In ASL, though, there’s a weird kind of middle ground. Sometimes people who don’t know Spanish kind of ‘fake it’ — “Yo no speako español” and the like, which has always come across to me as vaguely offensive. Being overly terrified of making a mistake is itself a fourth kind of mistake. ASL is remarkably flexible; even though it’s a complete language, it has aspects based on pantomime and sometimes “classifiers”, where your hands and fingers can stand in for people, vehicles, or many other things of particular shapes and sizes. I watch some very well-made ASL productions that have equally well-made English paragraphs alongside, and the ASL version uses all of these techniques and more. No word-for-word correspondence here: every time, I’m surprised by the versatility of the language. My theory is that for an earnest student, it’d be a lot harder to accidentally come across as offensive or mocking in ASL than in spoken languages. And thus, I’m probably committing the fourth kind of error too much.

It’s good to loosen up. -m

November 20th, 2012

Hedgehogs and Foxes

In Nate Silver’s new book, he mentions a classification system for experts, originally from Berkeley professor Philip Tetlock, along a spectrum of Fox <—> Hedgehog. (The nomenclature comes from an essay about Tolstoy.)

Hedgehogs are type A personalities who believe in Big Ideas. They are ideologues and go “all-in” on whatever they’re espousing. A great many pundits fall into this category.

Foxes are scrappy creatures who believe in a plethora of little ideas and in taking different approaches toward a problem, and are more tolerant of nuance, uncertainty, complexity, and dissent.

There are a lot of social situations (broadly construed) where hedgehogs seem to have the upper hand. Talking heads on TV are a huge example, but so are many fixtures in the tech world, Malcolm Gladwell, say. Most of the places I’ve worked at have at least a subtle hedgehog-bias toward hiring, promotions, and career development.

To some degree, I think this stems from a lack of self-awareness. Brash pundits come across better on the big screen; they grab your attention and take a bold stand for something–who wouldn’t like that? But if you pause and think about what they’re saying, or (horror) go back and measure their predictions after-the-fact, they don’t look nearly so good. Foxes are better at getting things right.

It seems like we’ve just been through a phase of more-obnoxious-than-usual punditry, and I found this spectrum a useful way to look at things. How about you? Are you paying more attention to hedgehogs when you probably should be listening to the foxes?

-m

September 28th, 2012

Virgil Matheson: mentor

I’ve mentioned Virgil Matheson in these pages a few times, but never made a full accounting. When I had my O’Reilly book published, I submitted a simple dedication in the manuscript:

for Virgil

But for whatever reason, it didn’t make it into the printed edition. This post is a small step toward letting the world know about someone important to me.

We first met in 1985 or thereabouts. One day while riding my bike through a back-alley, I stopped to look at an equipment rack set outside a spare garage. Virgil came out to give a get-off-my-lawn kind of speech, and somehow we ended up talking about electronics.  This led to discussions about crystal radios, and in a subsequent visit, we built one, he explaining the principles of operation. Virgil, it turns out, was a retired teacher at the North Dakota State School of Science, where he taught AC theory and thermodynamics. I was going through some rough times, and Virgil ended up being a much-needed role model.

Around that time, I had attempted to build a Heathkit radio set, but couldn’t quite get it working. I brought it to Virgil, and we traced through the schematic diagrams, eventually getting it working. Along the way, Virgil introduced me to all kinds of electronic test equipment, including oscilloscopes and galvanometers that he had hand-wound in his younger days.

The next year, I needed a science project, and I had become fixated on Tesla Coils. Virgil had worked at Westinghouse (but not in overlap with the good N. Tesla) and found this project right up his alley. We used his wood lathe to turn a base for the coil, and a standard lathe to wind a primary and two perfectly-spaced secondary coils on PVC pipe, after which we sprayed them down with insulating paint. We built a high-voltage power supply out of a car battery, ignition coil, and relay-type regulator from the junkyard. The thing would turn out serious spark on the primary side, and at one point, I accidentally made contact with it, knocking me clear off the metal bench I was sitting on. We used a spark gap and high-voltage capacitors from old equipment to make a resonator, and got the coil working. It could light a fluorescent tube from my full arm-span away. It was a smash hit at the science fair, too.

For one so knowledgeable about the foundations of technology, he was awfully curmudgeonly about it. He bemoaned the day students started showing up in his class with hand-calculators instead of slide rules. He would never answer the phone (but would speak on it, if you could get his brother to pick up).

We kept meeting on and off, and we would have epic discussions/debates about technology, thermodynamics, perpetual motion machines, higher mathematics, theology, building test equipment, and logic puzzles. He taught me, in short, how to think.

A non-exhaustive list of things he taught me:

  • How to build a crystal radio set
  • How to troubleshoot a conventional radio (hint–check for signal at the volume control–that will narrow down the problem to either the front-end or back-end)
  • How to compute resonant LC circuits
  • How to use a slide rule
  • How to pick locks
  • How to compute power factor and plot phasor diagrams for AC circuits
  • The value of good tools and how to care for them
  • How to build a Tesla Coil
  • How to debate
  • Respect for high voltage
  • The joy of back-issues of Scientific American
  • The trouble with Pascal’s wager
  • How to debunk perpetual motion claims
  • How (and why) to use a planimeter

On a recent vacation, I went to see Virgil again–now in his 90s. He’s still vigorous and feisty, though his memory is starting to slip a little. It was difficult to come to terms with the possibility that, given the frequency with which I make it to that part of the country, it may be the last time I see him. Since this is posted online, he’ll probably never see it. But if he could speak to each one of you, I think he’d offer advice something like this:

Cherish the people in your life. Treat every meeting as if it might be the one that sets you on a new course–one that you’ll look back at years later in wonder. Don’t worry what others think of you, and never stop learning.

Thank you, Virgil, for all you’ve given me. -m

September 17th, 2012

MarkLogic 6 is here

MarkLogic 6 launched today, and it’s full of new and updated goodies. I spent some time designing the new Application Builder including the new Visualization Widgets. If you’ve used Application Builder in the past, you’ll be pleasantly surprised at the changes. It’s leaner and faster under the hood. I’d love to hear what people think of the new architecture, and how they’re using it in new and awesome ways.

If I had to pick out a common theme for the release, it’s all about expanding the appeal of the server to reach new audiences. The Java API makes working with the server feel like a native extension to the language, and the REST API makes it easy to extend the same to other languages.

XQuery support is stronger than ever. I liked Ryan Dew’s take on some of the smaller, but still useful features.

This wouldn’t be complete without thanking my teammates, who really made this possible. I had the pleasure of working with some top-notch front-end people recently, and it’s been a great experience. -m

 

August 23rd, 2012

Super simple tokenizer in XQuery

A lexer might seem like one of the boringest pieces of code to write, but every language brings its own little wrinkles to the problem. Elegant solutions are more work, but also more rewarding.

There is, of course, a large body of work on table-driven approaches, several of them listed here (and bigger list), though XQuery seems to have been largely left out of the fun.

In MarkLogic Search API, we implemented a recursive tokenizer. Since a search string can contain quoted pieces which need to be carefully maintained, first we split (in the fn:tokenize sense, discarding matched delimiters) on the quote character, then iterate through the pieces. Odd-numbered pieces are chunks of tokens outside of any quoting, and even-numbered pieces are a single quoted string, to be preserved as-is. We recurse through the odd chunks, further breaking them down into individual tokens, as well as normalizing whitespace and doing a few other cleanup operations. This code is aggressively optimized, and it skips any searches for tokens known not to appear in the overall string. It also preserves the character offset positions of each token relative to the starting string, which gets used downstream, so this makes for some of the most complicated code in the Search API. But it’s blazingly fast.
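
A minimal sketch of that quote-splitting idea (greatly simplified from the production code, with hypothetical local names):

declare function local:split-quotes($input as xs:string) as xs:string* {
    (: fn:tokenize discards the quote delimiters; odd-numbered pieces
       fall outside any quoting, even-numbered pieces were quoted :)
    for $piece at $pos in fn:tokenize($input, '"')
    return
        if ($pos mod 2 eq 0)
        then $piece                  (: quoted: preserve as-is :)
        else normalize-space($piece) (: unquoted: break into tokens here :)
};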

When prototyping, it’s nice to have something simpler and more straightforward. So I came up with an approach using fn:analyze-string. This function, introduced in XSLT 2.0 and later ported to XQuery 3.0, takes a regular expression, and returns all of the target string, neatly divided into match and non-match portions. This is great, but difficult to apply across the entire string. For example, potential matches can have different meaning depending on where they fall (again, quoted strings as an example.) But if every regex starts with ^ which anchors the match to the front of the string, the problem simplifies to peeling off a single token from the front of the string. Keep doing this until there’s no string left.

This is a particularly nice approach when parsing a grammar that’s formally defined in EBNF. You can pretty much take the list of terminal expressions, port them to XQuery-style regexes, add a ^ in front of each, and roll.

Take SPARQL for example. It’s a reasonably rich grammar. The W3C draft spec has 35 productions for terminals. I sketched out some of the terminal rules (note these are simplified):

declare variable $spq:WS     := "^\s+";
declare variable $spq:QNAME  := "^[a-zA-Z][a-zA-Z0-9]*:[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:PREFIX := "^[a-zA-Z][a-zA-Z0-9]*:";
declare variable $spq:NAME   := "^[a-zA-Z][a-zA-Z0-9]*";
declare variable $spq:IRI    := "^<[^>]+>";
...

Then we go through the input string, seeing which of these expressions match, and if so, calling analyze-string, adding the matched portion as a token, and recursing on the non-matched portion. Note that we need to try longer matches first, so the rule for ‘prefix:qname’ comes before the rule for ‘prefix:’, which comes before the rule for plain ‘name’.

declare function spq:tokenize-recurse($in as xs:string, $tl as json:array) {
    if ($in eq "")
    then ()
    else spq:tokenize-recurse(
        switch(true())
        case matches($in, $spq:WS)     return spq:discard-tok($in, $spq:WS)
        case matches($in, $spq:QNAME)  return spq:peel($in, $spq:QNAME, $tl, "qname")
        case matches($in, $spq:PREFIX) return spq:peel($in, $spq:PREFIX, $tl, "prefix", 0, 1)
        case matches($in, $spq:NAME)   return spq:peel($in, $spq:NAME, $tl, "name")
        ...

Here, we’re co-opting a json:array mutable object as a convenient way to store tokens as we peel them off. There’s not actually any JSON involved here. The actual peeling looks like this:

declare function spq:peel(
    $in as xs:string,
    $regex as xs:string,
    $toklist as json:array,
    $type as xs:string,
    $triml as xs:integer,
    $trimr as xs:integer) as xs:string {
    let $split := analyze-string($in, $regex)
    let $match := string($split/str:match)
    (: trim fixed delimiters, e.g. the angle brackets around an IRI :)
    let $match := if ($triml gt 0) then substring($match, $triml + 1) else $match
    let $match := if ($trimr gt 0) then substring($match, 1, string-length($match) - $trimr) else $match
    (: push the token onto the mutable list :)
    let $_ := json:array-push($toklist, <searchdev:tok type="{$type}">{$match}</searchdev:tok>)
    (: the non-matched remainder goes back around the recursion :)
    return string($split/str:non-match)
};

Some productions, like an IRI inside angle brackets, contain fixed delimiters which get trimmed off. Some productions, like whitespace, get thrown away. And that’s it. As it stands, it’s pretty close to a table-driven approach. It’s also more flexible than the recursive approach above–even for things like escaped quotes inside a string, if you can write a regex for it, you can lex it.
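
To round it out, a top-level entry point (hypothetical, but following the same naming pattern) just seeds the token list and returns its contents:

declare function spq:tokenize($in as xs:string) as element(searchdev:tok)* {
    let $toklist := json:array()
    let $_ := spq:tokenize-recurse($in, $toklist)
    return json:array-values($toklist)
};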

Performance

But is it fast? Short answer is that I don’t know. A full performance analysis would take some time. But a few quick inspections show that it’s not terrible, and certainly good enough for prototype work. I have no evidence for this, but I also suspect that it’s amenable to server-side optimization–inside the regular expression matching code, paths that involve start-anchored matches should be easy to identify and can in many cases avoid work farther down the string. There’s plenty of room on the XQuery side for optimization as well.

If you’ve experimented with different lexing techniques, or are interested in more details of this approach, drop me a line in the comments. -m

August 7th, 2012

Balisage Bound

I’m en route to Balisage 2012, though beset by multiple delays. The first leg of my flight was more than two hours delayed, which made the 90 minute transfer window…problematic. My rebooked flight, the next day (today, that is) is also delayed. Then through customs. Maybe all I’ll get out of Tuesday is Demo Jam. But I will make it.

I’m speaking on Thursday about exploring large XML datasets. Looking forward to it!

-m

June 4th, 2012

Relax NG vs XML Schema: ten year anniversary

Today is the 10-year anniversary of this epic message from James Clark on the relative merits of Relax NG vs. XML Schema, and whether the latter should receive preferential treatment. Still relevant today–the discussion is still going, although an increasing number of human-readable web specifications have adopted Relax NG in some form. -m

April 26th, 2012

MarkLogic World 2012

I’m getting ready to leave for MarkLogic World, May 1-3 in Washington, DC, and it’s shaping up to be one fabulous conference. I’ve always enjoyed the vibe at these events–it has a, well, cool-in-a-data-geeky-way thing going on (like the XML conference in the early 2000s where I got to have lunch with James Clark, but that’s a different story). Lots of people with big data problems will be here, and I always enjoy talking to these kinds of people.

I’m speaking on Wednesday at 3:30 with Product Manager extraordinaire Justin Makeig about big data visualization. If you’ll be at the conference, come look me up. And if you won’t, well, forgive me if I need a few extra days to get back to any email you send this way.

Follow me on Twitter and look for the #MLW12 tag for live coverage.

-m

April 15th, 2012

Actually using big data

I’ve been thinking a lot about big data, and two recent items nicely capture a slice of the discussion.

1) Alex Milowski recounting working with Big Weather Data. He concludes that ‘naive’ (as-is) data loading is a “doomed” approach. Even small amounts of friction add up at scale, so you should plan on doing some in-situ cleanup. He came up with a slick solution in MarkLogic–go read his post for details.

2) Chris Dixon on Making Large Datasets Useful. Typical approaches like machine learning only solve 80-90% of the problem. So you need to either live with errorful data, or invoke manual clean-up processes.

Both worth a read. There’s more to say, but I’m not ready to tip my hand on a paper I’m working on…

-m

February 1st, 2012

Googlebot submitting Flash forms

I’m sure this is old news by now, but here’s one more data point.

As it turns out, XForms Institute uses an old skool XForms engine written in Flash, dating approximately back to the era when Flash was necessary to do XForms-ey things in the browser. The feedback form for the site is, quite naturally, implemented in XForms. Submissions there ultimately make it into my inbox. Here’s what I see:

Tue Jan 31 12:19:22 2012 66.249.68.249 Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

An iPhone running Flash? I doubt it. That’s quite an agent string! Organic versioning in the wild. -m

January 15th, 2012

The ultimate breakfast smoothie

I’ve used this same recipe for three things: weight loss, after-exercise protein, and sore-teeth liquid diet. It’s great.

1 cup 2% milk

1 cup Dannon Fit & Light vanilla yogurt

1 scoop Syntha-6 protein powder (banana is great)

Mix.

This yields 450 calories with a whopping 39g of protein, 48g of carbs (only 30g of that from simple sugars), 11g of fat, and 5g of fiber.

You could live off 3 or 4 of these a day. (and I have)

January 15th, 2012

Five iOS keyboard tips you probably didn’t know

Check out these tips. The article talks about iPad, but they work on iPhone too, even an old 3G.

On one hand, it shows the intense amount of careful thought Apple puts into the user experience. But on the other hand, it highlights the discovery problem. I know people who have been using iOS since before it was called iOS, and still didn’t know about these. How do you put these kinds of finishing touches into a product and make sure the target audience can find out about them? -m

January 14th, 2012

Call a Spade a Spade

A cautionary tale of language from Ted Nelson:

We might call a common or garden spade–

  • A personalized earth-moving equipment module
  • A mineralogical mini-transport
  • A personalized strategic tellurian command and control module
  • An air-to-ground interface contour adjustment probe
  • A leveraged tactile-feedback geomass delivery system
  • A man-machine energy-to-structure converter
  • A one-to-one individualized geophysical restructurizer
  • A portable unitized earth-work synthesis system
  • An entrenching tool
  • A zero-sum dirt level adjuster
  • A feedback-oriented contour management probe and digging system
  • A gradient disequilibrator
  • A mass distribution negentroprizer
  • (hey!) a dig-it-all system
  • An extra terrestrial transport mechanism

Spades, not words, should be used for shoveling. But words should help us unearth the truth.

–Computer Lib (1974), Theodor Nelson, p44

December 8th, 2011

Resurgence of MVC in XQuery

There’s been an increasing amount of talk about MVC in XQuery, notably David Cassel’s great discussion and to an extent Kurt Cagle’s platform discussion that touched on forms interfaces. Lots of Smart People are thinking in this area, and that’s a good thing.

A while back I recorded my thoughts on what I called MET, or the Model Endpoint Template organizational pattern, as used in MarkLogic Application Builder. One difference between 2009 and now, though, is that browsers have distanced themselves even farther from XML, which tends to undercut the eliminate-the-impedance-mismatch argument. In particular, the forms model in HTML5 continues to prefer flat data, which to me indicates that models still play an important role in XQuery web apps.

So I envision the app lifecycle like this:

  1. The browser requests a particular page, say the one that lets you configure sorting options in the app you’re building
  2. An HTML page loads.
  3. Client-side script requests the project state from a designated endpoint, the server transforms the XML into a flat list, and delivers it as JSON (as an optimization, the server can package the initial data into the page delivered in the prior step)
  4. Standard form interaction and client-side scripting happens, including manipulation of repeating structures mediated by JavaScript
  5. A standard form submit happens (possibly via script), sending a flat list back to the server, which performs an update to the stored XML.

It’s pretty easy to envision data-mapping tools and libraries that help automate the construction of the transforms mentioned in steps 3 and 5.
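
As a sketch of what the step-3 transform might look like (element names here are made up for illustration), going from stored XML state to a flat list ready for JSON serialization:

declare function local:flatten($state as element()) as element(field)* {
    (: one flat name/value pair per leaf element; a JSON
       serializer takes it from here :)
    for $leaf in $state//*[not(*)]
    return <field name="{local-name($leaf)}">{string($leaf)}</field>
};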

Another thing that’s changed is the emergence of XQuery plugin technology in MarkLogic. There’s a rapidly-growing library of reusable components, initially centered around Information Studio but soon to cover more ground. This is going to have a major impact on XQuery app designs as components of the app (think visualization widgets) can be seamlessly added to apps.

Endpoints still make a ton of sense for XQuery apps, and provide the additional advantage that you now have a testable, concern-separated data layer for your app. Other apps have a clean way to interop, and even command-line operation is possible with off-the-shelf tools like wget.

Lastly, Templates. Even if you use plugins for the functional core of your app, there’s still a lot of boilerplate stuff you’d not want to repeat. Something like Mustache.xq is a good fit for this.

Which is all good–but is it MVC? This organizational pattern (let’s call it MET 2.0) is a lot closer to it. Does MET need a controller? Probably. (MarkLogic now ships a pretty good one called rest:rewrite) Like MVC, MET separates the important essences of your application. XQuery will never be Ruby or Java, and its frameworks will never be Rails or Spring, but rather something uniquely poised to capture the expressive power of the language to build apps on top of unstructured and big data. -m

November 1st, 2011

5 things to know about MarkLogic 5

MarkLogic 5 is out today. Here’s five things beyond the official announcement that developers should know about it:

  1. If you found the CQ sample useful, you’ll love Query Console, which does everything CQ does and more (syntax highlighting!)
  2. Better Search API support for metadata: MarkLogic has always had support for storing metadata separately from documents. With new Search API support, it’s easy to set up, and it works great with databases of binary documents.
  3. The Hadoop connector, while not officially supported in this configuration, works on Mac. I know a lot of developers use Mac hardware. Once you get Hadoop itself set up (following rules like these), everything works great in my experience.
  4. “Fields” have gotten more general and more powerful. If you haven’t set aside named portions of your documents or metadata for special indexing and access, you should look in to this feature–it will rock your world.
  5. To better understand what your system is doing at any point in time, you can now use the built-in Monitoring Dashboard, which runs in-browser.

And let’s not leave out the Express license, which makes it easier to get started. Check it out.

-m

September 29th, 2011

facebook Challenge results

Andromeda took the facebook Challenge, and found 52 separate requests in 24 hours that would have gone to the facebook mothership. Watch her blog for more updates. How about you?

If you look through these logs, pay particular attention to the referer field. This tells you on which site you were browsing when the data set out on its voyage toward facebook.

September 27th, 2011

Take the facebook Challenge

Worried about how much data facebook is collecting on you, even on 3rd party sites, even if you’re signed out? Try this for 24 hours:

  1. Find a file named ‘hosts’ on your computer. On Mac/Linux systems, it’s under /etc/. On Windows, it used to be under System32 somewhere, but who knows now. Stash a backup copy somewhere.
  2. Add the following on a new line: 127.0.0.1 www.facebook.com
  3. Configure a web server running on your local machine.

This will forcibly redirect all calls to facebook to your local machine. At the end of 24 hours, take a look at your web server’s access log. Every line in there is something that would have gone to facebook. Every ‘like’ button, every little banner, all those things track your movements across the web, whether you are signed in to facebook or not. You’ll marvel at how many blank rectangles appear on sites you visit.

Bonus points: at the end of the 24 hours, don’t restore your hosts file.

Please post your facebook-free experiences here.

-m

July 5th, 2011

Geek Thoughts: how I take my tea

Having been recently accused of “vile” habits in regard to tea-drinking, I feel that I need to clear the air. :)

I’ve never been officially tested, but I am almost certainly a supertaster. (This explains, among other things, my aversion to most vegetables and my status as a nationally ranked beer judge.) I did, however, go through the BBC test and some rough taste-bud-counting with blue dye and a mirror.

So I do not generally follow accepted wisdom with tea. To prepare tea, I get a nice glass of cold water and plunk in a tea bag. Same goes for other tea-like substances, such as yerba mate. The result is a much slower steeping process, where subtle flavors shift throughout the day and with different refills. Does it get bitter? While tannins are part of the tea flavor, you don’t get the intense, mouth-puckering astringency you would from hot-steeping tea too long. It’s more gradual and interesting.

Different kinds of tea have different spectrums of flavor, as revealed over the course of a day. Earl Grey and green tea are particularly nice. Some interesting combinations are possible too, by combining two teas which reach their flavor peaks at different times.

I say keep an open mind, and don’t knock it if you haven’t tried it. :) -m