Semantics!

This week marked the MarkLogic World conference and with it some exciting news. Without formally “announcing” a new release, the company showed off a great deal of semantic technology in-progress. Part of that came from me, on stage during the Wednesday technical keynote. I’ve been at MarkLogic five years next month, and the first piece of code I wrote there was an RDFa parser. This has been a long time coming.

It was an amazing experience. I was responsible for sifting through the huge amounts of public data–both in RDF formats and on public web pages–and writing the semantic code to pull everything together, culminating in those ten minutes on stage.

Picture this: just behind the big stage and the projected screens was a hive of impressive activity. I counted 8 A/V people backstage, plus 4 more at the back of the auditorium. The conference has reached Â a level of production values that wouldn’t be vastly different if it was aÂ stadiumÂ affair. So in back there’s a curtained-off “green room” with some higher-grade snacks (think PowerBars and Red Bull) with a flatscreen that shows the stage. From back there you can’t see the projected slides or demos, but if you step just outside, you’re at the reverse side of the screen, larger-than-life. The narrow walkway leads to the “chute”, right up the steps onto the main stage. As David Gorbet went through the opening moments of his talk in fine form, I did some stretches and did everything I could think of to prepare myself.

Then he called me up and the music blasted out from the speakers. I had been playing through my mind all the nightmare scenarios–tripping on the stairs and falling on my face as I come onstage (etc.)–but none of that happened. I’ve done public speaking many times before so I had an idea what to expect, though on a stage like that the lights are so bright that it’s hard to see beyond about the third row. So despite the 300-400 people in the room, it didn’t even feel much different than addressing anÂ intimateÂ group of peers. It was fun. On with the demos:

The first showed our internal MarkMail cluster with a simple ‘infobox’ of the sort that all the search engines are doing these days. This was an icebreaker to talk about semantics and how it works–in this case locate the concept of Hadoop in the database, and from there find all the related labels, abstracts, people, projects, releases, and so on. During the construction of the demo, we uncovered some real world facts about the author of the top-ranked message for the query, including a book he wrote. The net effect was that these additional facts made the results a lot more useful by providing a broader context for them.

The second demo showed improved recall–that is finding things that would otherwise slip under the radar. The existing [from:IBM] query in MarkMail does a good job finding people that happen to have the letters i-b-m in their email address. The semantic query [affiliation:IBM] in contrast knows about the concept of IBM, the concept of people, and the relationship of is-affiliated-with (technically foaf:affiliation) to run a query that more closely models how a person would ask the question: “people that work for IBM” as opposed to “people that have i-b-m in their email address”. This the results included folks posting from gmail accounts and other personal addresses, and the result set jumped from about 277k messages to 280k messages.

At this point, a pause to talk about the architecture underlying the technology. It turns out that a system that already supports shared-nothing scale out, full ACID transactions, multiple HA/DR options, and a robust security model is a good starting point for building semantic capabilities. Â (I got so excited at this point that I forgot to use the clicker for a few beats and had to quickly catch-up the slides.) SPARQL code on the screen.

Then the third demo, a classic semantic app with a twist. Pulling together triples from several different publicÂ vocabularies, we answered the question of “find a Hadoop expert” with each row of the results representing not a document, as in MarkMail results, but an actual person. We showed location data (which was actually randomized to avoid privacy concerns) and aggregate cost-of-living data for each city. When we added in a search term, we drew histograms of MarkMail message traffic over time and skipped over the result that had no messages. The audience was entranced.

This is exciting work. I had several folks come up to me afterwards with words to the effect of they hadn’t realized it before, but boy do they ever need semantics. I can’t think of a better barometer for a technical keynote. So back to work I go. There’s a lot to do.

Thanking by name is dangerous, because inevitably people get left out, but I would like to shout out to David Gorbet who ran the keynote, John Snelson who’s a co-conspirator in the development effort, Eric Bloch who helped with the MarkMail code more than anyone will ever know, Denis Shehan who was instrumental inÂ wranglingÂ the cloud and data, and Stephen Buxton who patiently and repeatedly offered feedback that helped sharpen the message.

I’ll post a pointer to the video when it’s available. -m

Related Posts

Open to work: What I’m Looking For

Cleaning Data by David Mertz now available